Polylogarithmic Sketches for Clustering

Authors: Moses Charikar, Erik Waingarten



Author Details

Moses Charikar
  • Stanford University, CA, USA
Erik Waingarten
  • Stanford University, CA, USA

Cite As

Moses Charikar and Erik Waingarten. Polylogarithmic Sketches for Clustering. In 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 229, pp. 38:1-38:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.ICALP.2022.38

Abstract

Given n points in 𝓁_p^d, we consider the problem of partitioning points into k clusters with associated centers. The cost of a clustering is the sum of p-th powers of distances of points to their cluster centers. For p ∈ [1,2], we design sketches of size poly(log(nd),k,1/ε) such that the cost of the optimal clustering can be estimated to within factor 1+ε, despite the fact that the compressed representation does not contain enough information to recover the cluster centers or the partition into clusters. This leads to a streaming algorithm for estimating the clustering cost with space poly(log(nd),k,1/ε). We also obtain a distributed memory algorithm, where the n points are arbitrarily partitioned amongst m machines, each of which sends information to a central party who then computes an approximation of the clustering cost. Prior to this work, no such streaming or distributed-memory algorithm was known with sublinear dependence on d for p ∈ [1,2).
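To make the objective in the abstract concrete, the following minimal sketch computes the clustering cost exactly for a given set of centers: the sum, over all points, of the p-th power of the 𝓁_p distance to the nearest center. It illustrates only the quantity being estimated, not the paper's sketching algorithm. The function names `lp_dist_pow` and `clustering_cost` are hypothetical, and the code assumes each point is assigned to its closest center, which is the cheapest partition once the centers are fixed.

```python
from typing import Sequence

Point = Sequence[float]


def lp_dist_pow(x: Point, c: Point, p: float) -> float:
    """Return ||x - c||_p^p, the p-th power of the l_p distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, c))


def clustering_cost(points: Sequence[Point], centers: Sequence[Point], p: float) -> float:
    """Sum over points of the p-th power of the l_p distance to the nearest center.

    Assumption (not from the paper's text): every point is assigned to its
    closest center, which minimizes the cost for a fixed set of centers.
    """
    return sum(min(lp_dist_pow(x, c, p) for c in centers) for x in points)


# Toy instance: n = 4 points in dimension d = 2, with k = 2 centers.
points = [(0, 0), (0, 1), (10, 10), (10, 12)]
centers = [(0, 0), (10, 10)]

cost_p1 = clustering_cost(points, centers, 1)  # k-median-style cost in l_1
cost_p2 = clustering_cost(points, centers, 2)  # k-means-style cost in l_2
```

This exact computation reads all n points in all d coordinates. The abstract's claim is that the optimal cost (minimized over all choices of k centers) can be approximated to within a factor 1+ε from a summary of size only poly(log(nd),k,1/ε).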

Subject Classification

ACM Subject Classification
  • Theory of computation → Sketching and sampling
Keywords
  • sketching
  • clustering
