Space Complexity of Euclidean Clustering

Authors: Xiaoyi Zhu, Yuxiang Tian, Lingxiao Huang, Zengfeng Huang



File: LIPIcs.SoCG.2024.82.pdf (0.83 MB, 16 pages)

Document Identifiers
  • DOI: 10.4230/LIPIcs.SoCG.2024.82

Author Details

Xiaoyi Zhu
  • School of Data Science, Fudan University, Shanghai, China
Yuxiang Tian
  • School of Data Science, Fudan University, Shanghai, China
Lingxiao Huang
  • State Key Laboratory of Novel Software Technology, Nanjing University, China
Zengfeng Huang
  • School of Data Science, Fudan University, Shanghai, China

Cite As

Xiaoyi Zhu, Yuxiang Tian, Lingxiao Huang, and Zengfeng Huang. Space Complexity of Euclidean Clustering. In 40th International Symposium on Computational Geometry (SoCG 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 293, pp. 82:1-82:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.SoCG.2024.82

Abstract

The (k, z)-Clustering problem in Euclidean space ℝ^d has been extensively studied. Given the scale of the data involved, compression techniques for Euclidean (k, z)-Clustering, such as coresets and dimension reduction, have received significant attention in the literature. However, the space complexity of the problem, namely the number of bits required to compress the cost function to within a multiplicative error of ε, has remained unclear. This paper initiates the study of the space complexity of Euclidean (k, z)-Clustering and offers both upper and lower bounds. Our space bounds are nearly tight when k is constant, indicating that storing a coreset, a well-known data compression approach, serves as an optimal compression scheme. Furthermore, our lower bound for (k, z)-Clustering yields a tight space bound of Θ(nd) for terminal embeddings, where n is the size of the dataset. Our techniques rely on new geometric insights about principal angles and on discrepancy methods, which may be of independent interest.
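
For readers less familiar with the objective, here is a minimal sketch of the (k, z)-clustering cost that the paper studies: the sum, over all data points, of the Euclidean distance to the nearest of k centers, raised to the power z (z = 1 gives k-median, z = 2 gives k-means). The code below follows this standard definition; the function and variable names are illustrative and not taken from the paper.

    import numpy as np

    def kz_cost(points, centers, z=2):
        """(k, z)-clustering cost: the sum over all points of the Euclidean
        distance to the nearest center, raised to the power z."""
        # Pairwise distances between every point and every center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # For each point, keep only the distance to its nearest center.
        return float(np.sum(dists.min(axis=1) ** z))

    # Toy example: four points in R^2 and k = 2 centers.
    P = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
    C = np.array([[0.0, 0.5], [5.0, 5.5]])
    print(kz_cost(P, C, z=2))  # each point is 0.5 from its center: 4 * 0.5^2 = 1.0

A space bound then asks how few bits suffice to store a summary of P from which kz_cost(P, C) can be recovered to within a factor 1 ± ε for every candidate center set C. For context, a terminal embedding for a point set X ⊆ ℝ^d (the notion in the paper's lower bound) is, in the standard sense, a map f: ℝ^d → ℝ^m that preserves, up to a factor 1 ± ε, the distance from every point of X to every point of ℝ^d, not only the distances within X.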

Subject Classification

ACM Subject Classification
  • Theory of computation → Computational geometry
  • Theory of computation → Facility location and clustering
  • Theory of computation → Data compression
Keywords
  • Space complexity
  • Euclidean clustering
  • coreset
  • terminal embedding
