# Streaming Coreset Constructions for M-Estimators

## File

LIPIcs.APPROX-RANDOM.2019.62.pdf
• Filesize: 0.52 MB
• 15 pages

## Cite As

Vladimir Braverman, Dan Feldman, Harry Lang, and Daniela Rus. Streaming Coreset Constructions for M-Estimators. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 145, pp. 62:1-62:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.APPROX-RANDOM.2019.62

## Abstract

We introduce a new method of maintaining a $(k,\epsilon)$-coreset for clustering M-estimators over insertion-only streams. Let $(P,w)$ be a weighted set (where $w : P \to [0,\infty)$ is the weight function) of points in a $\rho$-metric space (meaning a set $X$ equipped with a positive-semidefinite symmetric function $D$ such that $D(x,z) \le \rho(D(x,y) + D(y,z))$ for all $x,y,z \in X$). For any set of points $C$, we define $\mathrm{COST}(P,w,C) = \sum_{p \in P} w(p) \min_{c \in C} D(p,c)$. A $(k,\epsilon)$-coreset for $(P,w)$ is a weighted set $(Q,v)$ such that for every set $C$ of $k$ points, $(1-\epsilon)\,\mathrm{COST}(P,w,C) \le \mathrm{COST}(Q,v,C) \le (1+\epsilon)\,\mathrm{COST}(P,w,C)$. Essentially, the coreset $(Q,v)$ can be used in place of $(P,w)$ for all operations concerning the COST function. Coresets, as a method of data reduction, are used to solve fundamental problems in machine learning of streaming and distributed data. M-estimators are functions $D(x,y)$ that can be written as $\psi(d(x,y))$ where $(X,d)$ is a true metric (i.e. 1-metric) space. Special cases of M-estimators include the well-known k-median ($\psi(x) = x$) and k-means ($\psi(x) = x^2$) functions. Our technique takes an existing offline construction for an M-estimator coreset and converts it into the streaming setting, where $n$ data points arrive sequentially. To our knowledge, this is the first streaming construction for any M-estimator that does not rely on the merge-and-reduce tree. For example, our coreset for streaming metric k-means uses $O(\epsilon^{-2} k \log k \log n)$ points of storage. The previous state-of-the-art required storing at least $O(\epsilon^{-2} k \log k \log^{4} n)$ points.
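
To make the definitions in the abstract concrete, the following is a minimal Python sketch of the COST function and of the two-sided coreset inequality. It is not the paper's streaming construction. The function names (`cost`, `within_coreset_bounds`), the choice of Euclidean distance for the metric d, and the toy data are all assumptions made for illustration. The check only tests a finite list of candidate center sets, whereas the definition quantifies over every set C of k points, so a `True` result is evidence rather than proof.

```python
import math

def euclidean(p, q):
    """Assumed underlying true metric d(x, y); any metric could be substituted."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_median(x):
    """psi(x) = x, the k-median M-estimator from the abstract."""
    return x

def k_means(x):
    """psi(x) = x^2, the k-means M-estimator from the abstract."""
    return x * x

def cost(points, weights, centers, psi, d=euclidean):
    """COST(P, w, C) = sum over p of w(p) * min over c of psi(d(p, c))."""
    return sum(
        w * min(psi(d(p, c)) for c in centers)
        for p, w in zip(points, weights)
    )

def within_coreset_bounds(P, w, Q, v, center_sets, eps, psi):
    """Check (1-eps)*COST(P,w,C) <= COST(Q,v,C) <= (1+eps)*COST(P,w,C)
    for each C in a finite, hypothetical list of candidate center sets."""
    for C in center_sets:
        full = cost(P, w, C, psi)
        small = cost(Q, v, C, psi)
        if not ((1 - eps) * full <= small <= (1 + eps) * full):
            return False
    return True

# Toy data (illustrative only): duplicate points merged into weighted points.
P = [(0.0, 0.0), (0.0, 0.0), (4.0, 0.0), (4.0, 0.0)]
w = [1.0, 1.0, 1.0, 1.0]
Q = [(0.0, 0.0), (4.0, 0.0)]
v = [2.0, 2.0]

# A deliberately poor summary: all weight placed on a single point.
Q_bad = [(0.0, 0.0)]
v_bad = [4.0]

candidate_center_sets = [
    [(1.0, 0.0)],
    [(0.0, 0.0), (4.0, 0.0)],
    [(2.0, 3.0)],
]

good = within_coreset_bounds(P, w, Q, v, candidate_center_sets, 0.1, k_means)
bad = within_coreset_bounds(P, w, Q_bad, v_bad, candidate_center_sets, 0.1, k_means)
```

Here `(Q, v)` passes the check on the listed center sets because merging identical points into one weighted point preserves the cost up to rounding, while `(Q_bad, v_bad)` fails already for the single center `(1.0, 0.0)`. Passing `k_median` instead of `k_means` swaps the M-estimator without changing anything else.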

## Subject Classification

##### ACM Subject Classification
• Theory of computation → Streaming models
• Theory of computation → Facility location and clustering
• Information systems → Query optimization
##### Keywords
• Streaming
• Clustering
• Coresets
