Space Complexity of Euclidean Clustering

Authors: Xiaoyi Zhu, Yuxiang Tian, Lingxiao Huang, Zengfeng Huang



File: LIPIcs.SoCG.2024.82.pdf (0.83 MB, 16 pages)

Document Identifiers
  • DOI: 10.4230/LIPIcs.SoCG.2024.82

Author Details

Xiaoyi Zhu
  • School of Data Science, Fudan University, Shanghai, China
Yuxiang Tian
  • School of Data Science, Fudan University, Shanghai, China
Lingxiao Huang
  • State Key Laboratory of Novel Software Technology, Nanjing University, China
Zengfeng Huang
  • School of Data Science, Fudan University, Shanghai, China

Cite As

Xiaoyi Zhu, Yuxiang Tian, Lingxiao Huang, and Zengfeng Huang. Space Complexity of Euclidean Clustering. In 40th International Symposium on Computational Geometry (SoCG 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 293, pp. 82:1-82:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.SoCG.2024.82

Abstract

The (k, z)-Clustering problem in Euclidean space ℝ^d has been extensively studied. Given the scale of the data involved, compression techniques for Euclidean (k, z)-Clustering, such as coresets and dimension reduction, have received significant attention in the literature. However, the space complexity of the problem, namely the number of bits required to compress the cost function to within a multiplicative error of ε, has remained unclear. This paper initiates the study of the space complexity of Euclidean (k, z)-Clustering and offers both upper and lower bounds. Our space bounds are nearly tight when k is constant, indicating that storing a coreset, a well-known data compression approach, serves as an optimal compression scheme. Furthermore, our lower bound for (k, z)-Clustering yields a tight space bound of Θ(nd) for terminal embeddings, where n is the size of the dataset. Our techniques rely on new geometric insights about principal angles and on discrepancy methods, which may be of independent interest.
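
For readers less familiar with the objective, here is a minimal sketch of the (k, z)-clustering cost that the paper studies: the sum, over all data points, of the Euclidean distance to the nearest of k centers, raised to the power z (z = 1 gives k-median, z = 2 gives k-means). The code below follows this standard definition; the function and variable names are illustrative and not taken from the paper.

    import numpy as np

    def kz_cost(points, centers, z=2):
        """(k, z)-clustering cost: the sum over all points of the Euclidean
        distance to the nearest center, raised to the power z."""
        # Pairwise distances between every point and every center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # For each point, keep only the distance to its nearest center.
        return float(np.sum(dists.min(axis=1) ** z))

    # Toy example: four points in R^2 and k = 2 centers.
    P = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
    C = np.array([[0.0, 0.5], [5.0, 5.5]])
    print(kz_cost(P, C, z=2))  # each point is 0.5 from its center: 4 * 0.5^2 = 1.0

A space bound then asks how few bits suffice to store a summary of P from which kz_cost(P, C) can be recovered to within a factor 1 ± ε for every candidate center set C. For context, a terminal embedding for a point set X ⊆ ℝ^d (the notion in the paper's lower bound) is, in the standard sense, a map f: ℝ^d → ℝ^m that preserves, up to a factor 1 ± ε, the distance from every point of X to every point of ℝ^d, not only the distances within X.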

Subject Classification

ACM Subject Classification
  • Theory of computation → Computational geometry
  • Theory of computation → Facility location and clustering
  • Theory of computation → Data compression
Keywords
  • Space complexity
  • Euclidean clustering
  • coreset
  • terminal embedding
