Noisy k-Means++ Revisited

Grunau, Christoph; Özüdoğru, Ahmet Alper; Rozhoň, Václav

doi:10.4230/LIPIcs.ESA.2023.55

Abstract

The k-means++ algorithm by Arthur and Vassilvitskii [SODA 2007] is a classical and time-tested algorithm for the k-means problem. While being very practical, the algorithm also has good theoretical guarantees: its solution is O(log k)-approximate, in expectation. 
In a recent work, Bhattacharya, Eube, Roglin, and Schmidt [ESA 2020] considered the following question: does the algorithm retain its guarantees if we allow for a slight adversarial noise in the sampling probability distributions used by the algorithm? This is motivated e.g. by the fact that computations with real numbers in k-means++ implementations are inexact. Surprisingly, the analysis under this scenario gets substantially more difficult and the authors were able to prove only a weaker approximation guarantee of O(log² k). In this paper, we close the gap by providing a tight, O(log k)-approximate guarantee for the k-means++ algorithm with noise.

Cite As Get BibTex

Christoph Grunau, Ahmet Alper Özüdoğru, and Václav Rozhoň. Noisy k-Means++ Revisited. In 31st Annual European Symposium on Algorithms (ESA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 274, pp. 55:1-55:7, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.ESA.2023.55

Author Details

Christoph Grunau

ETH Zürich, Switzerland

Ahmet Alper Özüdoğru

ETH Zürich, Switzerland

Václav Rozhoň

ETH Zürich, Switzerland

Funding

Grunau, Christoph: Supported by the European Research Council (ERC) under the European Unions Horizon 2020 research and innovation programme (grant agreement No. 853109).
Rozhoň, Václav: Supported by the European Research Council (ERC) under the European Unions Horizon 2020 research and innovation programme (grant agreement No. 853109).

Acknowledgements

We would like to thank Mohsen Ghaffari for many helpful comments.

References

Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 15-28. Springer, 2009.
Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. Np-hardness of euclidean sum-of-squares clustering. Machine learning, 75(2):245-248, 2009.
David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027-1035. Society for Industrial and Applied Mathematics, 2007.
Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The hardness of approximation of euclidean k-means. arXiv preprint, 2015. URL: https://arxiv.org/abs/1502.03316.
Olivier Bachem, Mario Lucic, Hamed Hassani, and Andreas Krause. Fast and provably good seedings for k-means. In Advances in neural information processing systems, pages 55-63, 2016.
Olivier Bachem, Mario Lucic, S Hamed Hassani, and Andreas Krause. Approximate k-means++ in sublinear time. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
Olivier Bachem, Mario Lucic, and Andreas Krause. Distributed and provably good seedings for k-means in constant rounds. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 292-300. JMLR. org, 2017.
Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. Scalable k-means++. Proceedings of the VLDB Endowment, 5(7):622-633, 2012.
Aditya Bhaskara, Sharvaree Vadgama, and Hong Xu. Greedy sampling for approximate clustering in the presence of outliers. Advances in Neural Information Processing Systems, 32, 2019.
Anup Bhattacharya, Jan Eube, Heiko Röglin, and Melanie Schmidt. Noisy, greedy and not so greedy k-means++. In 28th Annual European Symposium on Algorithms (ESA 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
Davin Choo, Christoph Grunau, Julian Portmann, and Václav Rozhon. k-means++: few more steps yield constant approximation. In International Conference on Machine Learning, pages 1909-1917. PMLR, 2020.
Vincent Cohen-Addad, Hossein Esfandiari, Vahab Mirrokni, and Shyam Narayanan. Improved approximations for euclidean k-means and k-median, via nested quasi-independent sets, 2022. URL: https://doi.org/10.48550/ARXIV.2204.04828.
Sanjoy Dasgupta. Lecture 3 – algorithms for k-means clustering, 2013, accessed May 8th, 2019.
Christoph Grunau, Ahmet Alper Özüdoğru, Václav Rozhoň, and Jakub Tětek. A nearly tight analysis of greedy k-means++. arXiv preprint, 2022. URL: https://arxiv.org/abs/2207.07949.
Christoph Grunau and Václav Rozhoň. Adapting k-means algorithms for outliers, 2020. URL: https://doi.org/10.48550/arXiv.2007.01118.
Silvio Lattanzi and Christian Sohler. A better k-means++ algorithm via local search. In International Conference on Machine Learning, pages 3662-3671, 2019.
Konstantin Makarychev, Aravind Reddy, and Liren Shan. Improved guarantees for k-means++ and k-means++ parallel. Advances in Neural Information Processing Systems, 33:16142-16152, 2020.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
Václav Rozhoň. Simple and sharp analysis of k-means||. In International Conference on Machine Learning, pages 8266-8275. PMLR, 2020.
Dennis Wei. A constant-factor bi-criteria approximation guarantee for k-means++. In Advances in Neural Information Processing Systems, pages 604-612, 2016.

Noisy k-Means++ Revisited

Authors Christoph Grunau , Ahmet Alper Özüdoğru, Václav Rozhoň

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Acknowledgements

References

Thanks for your feedback!

Could not send message