Document

APPROX

**Published in:** LIPIcs, Volume 145, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019)

In the time-decay model for data streams, elements of an underlying data set arrive sequentially with the recently arrived elements being more important. A common approach for handling large data sets is to maintain a coreset, a succinct summary of the processed data that allows approximate recovery of a predetermined query. We provide a general framework that takes any offline-coreset and gives a time-decay coreset for polynomial time decay functions.
We also consider the exponential time decay model for k-median clustering, where we provide a constant factor approximation algorithm that utilizes the online facility location algorithm. Our algorithm stores O(k log(h Delta)+h) points where h is the half-life of the decay function and Delta is the aspect ratio of the dataset. Our techniques extend to k-means clustering and M-estimators as well.
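As a rough illustration of the two decay models named in the abstract (not taken from the paper), the weight of an element can be written as a function of its age. The function names and the exact form of the polynomial decay, `(age + 1) ** (-exponent)`, are assumptions made for this sketch; the only property taken from the abstract is that h is the half-life of the exponential decay.

```python
def exponential_decay_weight(age, half_life):
    """Weight of an element that arrived `age` steps ago.

    The weight halves every `half_life` steps (hypothetical helper).
    """
    return 2.0 ** (-age / half_life)


def polynomial_decay_weight(age, exponent):
    """Assumed polynomial decay: (age + 1)^(-exponent).

    The +1 shift keeps the newest element (age 0) at weight 1; the paper
    may normalize differently.
    """
    return (age + 1) ** (-exponent)


if __name__ == "__main__":
    # Weights of the five most recent elements under each model.
    for age in range(5):
        print(age,
              exponential_decay_weight(age, half_life=2),
              polynomial_decay_weight(age, exponent=2))
```

Under exponential decay an element loses a constant fraction of its weight per step, while under polynomial decay old elements fade more slowly relative to their age.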

Vladimir Braverman, Harry Lang, Enayat Ullah, and Samson Zhou. Improved Algorithms for Time Decay Streams. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 145, pp. 27:1-27:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

@InProceedings{braverman_et_al:LIPIcs.APPROX-RANDOM.2019.27,
  author = {Braverman, Vladimir and Lang, Harry and Ullah, Enayat and Zhou, Samson},
  title = {{Improved Algorithms for Time Decay Streams}},
  booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019)},
  pages = {27:1--27:17},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-125-2},
  ISSN = {1868-8969},
  year = {2019},
  volume = {145},
  editor = {Achlioptas, Dimitris and V\'{e}gh, L\'{a}szl\'{o} A.},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX-RANDOM.2019.27},
  URN = {urn:nbn:de:0030-drops-112429},
  doi = {10.4230/LIPIcs.APPROX-RANDOM.2019.27},
  annote = {Keywords: Streaming algorithms, approximation algorithms, facility location and clustering}
}

Document

RANDOM

**Published in:** LIPIcs, Volume 145, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019)

We introduce a new method of maintaining a (k,epsilon)-coreset for clustering M-estimators over insertion-only streams. Let (P,w) be a weighted set (where w : P -> [0,infty) is the weight function) of points in a rho-metric space (meaning a set X equipped with a positive-semidefinite symmetric function D such that D(x,z) <= rho(D(x,y) + D(y,z)) for all x,y,z in X). For any set of points C, we define COST(P,w,C) = sum_{p in P} w(p) min_{c in C} D(p,c). A (k,epsilon)-coreset for (P,w) is a weighted set (Q,v) such that for every set C of k points, (1-epsilon)COST(P,w,C) <= COST(Q,v,C) <= (1+epsilon)COST(P,w,C). Essentially, the coreset (Q,v) can be used in place of (P,w) for all operations concerning the COST function. Coresets, as a method of data reduction, are used to solve fundamental problems in machine learning of streaming and distributed data.
M-estimators are functions D(x,y) that can be written as psi(d(x,y)) where (X,d) is a true metric (i.e. 1-metric) space. Special cases of M-estimators include the well-known k-median (psi(x) = x) and k-means (psi(x) = x^2) functions. Our technique takes an existing offline construction for an M-estimator coreset and converts it into the streaming setting, where n data points arrive sequentially. To our knowledge, this is the first streaming construction for any M-estimator that does not rely on the merge-and-reduce tree. For example, our coreset for streaming metric k-means uses O(epsilon^{-2} k log k log n) points of storage. The previous state-of-the-art required storing at least O(epsilon^{-2} k log k log^{4} n) points.
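The COST definition and the coreset inequality above can be sketched in a few lines of Python. This is only an illustration, not the construction from the paper: the names `cost`, `within_coreset_bounds`, `k_median`, and `k_means` are hypothetical, points are assumed to be tuples in Euclidean space, and the check covers a single candidate set C, whereas the definition requires the bound for every set C of k points.

```python
import math


def cost(points, weights, centers, dist):
    """COST(P, w, C) = sum over p in P of w(p) * min over c in C of D(p, c)."""
    return sum(w * min(dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def within_coreset_bounds(P, w, Q, v, centers, dist, eps):
    """Check the (1 +/- eps) inequality for ONE candidate set of centers.

    A true (k, eps)-coreset must pass this check for every set of k centers.
    """
    full = cost(P, w, centers, dist)
    summary = cost(Q, v, centers, dist)
    return (1 - eps) * full <= summary <= (1 + eps) * full


def k_median(x, y):
    """M-estimator with psi(x) = x over the Euclidean metric (assumed here)."""
    return math.dist(x, y)


def k_means(x, y):
    """M-estimator with psi(x) = x^2 over the Euclidean metric (assumed here)."""
    return math.dist(x, y) ** 2


if __name__ == "__main__":
    # Two identical points are merged into one point of weight 2, so the
    # weighted summary reproduces the cost exactly for any choice of centers.
    P, w = [(0.0, 0.0), (0.0, 0.0), (4.0, 0.0)], [1.0, 1.0, 1.0]
    Q, v = [(0.0, 0.0), (4.0, 0.0)], [2.0, 1.0]
    C = [(1.0, 0.0)]
    print(cost(P, w, C, k_median), cost(Q, v, C, k_median))
    print(within_coreset_bounds(P, w, Q, v, C, k_means, eps=0.1))
```

Merging duplicate points is the trivial case of data reduction; the interest of a coreset lies in summaries that are much smaller than P while keeping the inequality for all C.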

Vladimir Braverman, Dan Feldman, Harry Lang, and Daniela Rus. Streaming Coreset Constructions for M-Estimators. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 145, pp. 62:1-62:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

@InProceedings{braverman_et_al:LIPIcs.APPROX-RANDOM.2019.62,
  author = {Braverman, Vladimir and Feldman, Dan and Lang, Harry and Rus, Daniela},
  title = {{Streaming Coreset Constructions for M-Estimators}},
  booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019)},
  pages = {62:1--62:15},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-125-2},
  ISSN = {1868-8969},
  year = {2019},
  volume = {145},
  editor = {Achlioptas, Dimitris and V\'{e}gh, L\'{a}szl\'{o} A.},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX-RANDOM.2019.62},
  URN = {urn:nbn:de:0030-drops-112778},
  doi = {10.4230/LIPIcs.APPROX-RANDOM.2019.62},
  annote = {Keywords: Streaming, Clustering, Coresets}
}

Document

**Published in:** LIPIcs, Volume 116, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2018)

We study the distinct elements and l_p-heavy hitters problems in the sliding window model, where only the most recent n elements in the data stream form the underlying set. We first introduce the composable histogram, a simple twist on the exponential (Datar et al., SODA 2002) and smooth histograms (Braverman and Ostrovsky, FOCS 2007) that may be of independent interest. We then show that the composable histogram along with a careful combination of existing techniques to track either the identity or frequency of a few specific items suffices to obtain algorithms for both distinct elements and l_p-heavy hitters that are nearly optimal in both n and epsilon.
Applying our new composable histogram framework, we provide an algorithm that outputs a (1+epsilon)-approximation to the number of distinct elements in the sliding window model and uses O((1/epsilon^2) log n log(1/epsilon) log log n + (1/epsilon) log^2 n) bits of space. For l_p-heavy hitters, we provide an algorithm using space O((1/epsilon^p) log^2 n (log^2 log n + log(1/epsilon))) for 0 < p <= 2, improving upon the best-known algorithm for l_2-heavy hitters (Braverman et al., COCOON 2014), which has space complexity O((1/epsilon^4) log^3 n). We also show complementing nearly optimal lower bounds of Omega((1/epsilon) log^2 n + (1/epsilon^2) log n) for distinct elements and Omega((1/epsilon^p) log^2 n) for l_p-heavy hitters, both tight up to O(log log n) and O(log(1/epsilon)) factors.

Vladimir Braverman, Elena Grigorescu, Harry Lang, David P. Woodruff, and Samson Zhou. Nearly Optimal Distinct Elements and Heavy Hitters on Sliding Windows. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 116, pp. 7:1-7:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

@InProceedings{braverman_et_al:LIPIcs.APPROX-RANDOM.2018.7,
  author = {Braverman, Vladimir and Grigorescu, Elena and Lang, Harry and Woodruff, David P. and Zhou, Samson},
  title = {{Nearly Optimal Distinct Elements and Heavy Hitters on Sliding Windows}},
  booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2018)},
  pages = {7:1--7:22},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-085-9},
  ISSN = {1868-8969},
  year = {2018},
  volume = {116},
  editor = {Blais, Eric and Jansen, Klaus and D. P. Rolim, Jos\'{e} and Steurer, David},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX-RANDOM.2018.7},
  URN = {urn:nbn:de:0030-drops-94118},
  doi = {10.4230/LIPIcs.APPROX-RANDOM.2018.7},
  annote = {Keywords: Streaming algorithms, sliding windows, heavy hitters, distinct elements}
}

Document

**Published in:** LIPIcs, Volume 107, 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018)

Given a finite set of points P subseteq R^d, we would like to find a small subset S subseteq P such that the convex hull of S approximately contains P. More formally, every point in P is within distance epsilon from the convex hull of S. Such a subset S is called an epsilon-hull. Computing an epsilon-hull is an important problem in computational geometry, machine learning, and approximation algorithms.
In many applications, the set P is too large to fit in memory. We consider the streaming model where the algorithm receives the points of P sequentially and strives to use a minimal amount of memory. Existing streaming algorithms for computing an epsilon-hull require O(epsilon^{(1-d)/2}) space, which is optimal for a worst-case input. However, this ignores the structure of the data. The minimal size of an epsilon-hull of P, which we denote by OPT, can be much smaller. A natural question is whether a streaming algorithm can compute an epsilon-hull using only O(OPT) space.
We begin with lower bounds that show, under a reasonable streaming model, that it is not possible to have a single-pass streaming algorithm that computes an epsilon-hull with O(OPT) space. We instead propose three relaxations of the problem for which we can compute epsilon-hulls using space near-linear in the optimal size. Our first algorithm, for points in R^2 that arrive in random order, uses O(log n * OPT) space. Our second algorithm, for points in R^2, makes O(log(epsilon^{-1})) passes before outputting the epsilon-hull and requires O(OPT) space. Our third algorithm, for points in R^d for any fixed dimension d, outputs, with high probability, an epsilon-hull for all but a delta-fraction of directions and requires O(OPT * log OPT) space.

Avrim Blum, Vladimir Braverman, Ananya Kumar, Harry Lang, and Lin F. Yang. Approximate Convex Hull of Data Streams. In 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 107, pp. 21:1-21:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

@InProceedings{blum_et_al:LIPIcs.ICALP.2018.21,
  author = {Blum, Avrim and Braverman, Vladimir and Kumar, Ananya and Lang, Harry and Yang, Lin F.},
  title = {{Approximate Convex Hull of Data Streams}},
  booktitle = {45th International Colloquium on Automata, Languages, and Programming (ICALP 2018)},
  pages = {21:1--21:13},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-076-7},
  ISSN = {1868-8969},
  year = {2018},
  volume = {107},
  editor = {Chatzigiannakis, Ioannis and Kaklamanis, Christos and Marx, D\'{a}niel and Sannella, Donald},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2018.21},
  URN = {urn:nbn:de:0030-drops-90254},
  doi = {10.4230/LIPIcs.ICALP.2018.21},
  annote = {Keywords: Convex Hulls, Streaming Algorithms, Epsilon Kernels, Sparse Coding}
}

Document

**Published in:** LIPIcs, Volume 45, 35th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2015)

In PODS 2003, Babcock, Datar, Motwani and O'Callaghan gave the first streaming solution for the k-median problem on sliding windows using O((k/tau^4) W^{2 tau} log^2 W) space, with a O(2^{O(1/tau)}) approximation factor, where W is the window size and tau in (0,1/2) is a user-specified parameter. They left as an open question whether it is possible to improve this to polylogarithmic space. Despite much progress on clustering and sliding windows, this question has remained open for more than a decade.
In this paper, we partially answer the main open question posed by Babcock, Datar, Motwani and O'Callaghan. We present an algorithm yielding an exponential improvement in space compared to the previous result given in Babcock et al. In particular, we give the first polylogarithmic space (alpha,beta)-approximation for metric k-median clustering in the sliding window model, where alpha and beta are constants, under the assumption, also made by Babcock et al., that the optimal k-median cost on any given window is bounded by a polynomial in the window size. We justify this assumption by showing that when the cost is exponential in the window size, no sublinear space approximation is possible. Our main technical contribution is a simple but elegant extension of smooth functions as introduced by Braverman and Ostrovsky, which allows us to apply well-known techniques for solving problems in the sliding window model to functions that are not smooth, such as the k-median cost.

Vladimir Braverman, Harry Lang, Keith Levin, and Morteza Monemizadeh. Clustering on Sliding Windows in Polylogarithmic Space. In 35th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 45, pp. 350-364, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015)

@InProceedings{braverman_et_al:LIPIcs.FSTTCS.2015.350,
  author = {Braverman, Vladimir and Lang, Harry and Levin, Keith and Monemizadeh, Morteza},
  title = {{Clustering on Sliding Windows in Polylogarithmic Space}},
  booktitle = {35th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2015)},
  pages = {350--364},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-939897-97-2},
  ISSN = {1868-8969},
  year = {2015},
  volume = {45},
  editor = {Harsha, Prahladh and Ramalingam, G.},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.FSTTCS.2015.350},
  URN = {urn:nbn:de:0030-drops-56549},
  doi = {10.4230/LIPIcs.FSTTCS.2015.350},
  annote = {Keywords: Streaming, Clustering, Sliding windows}
}