Noisy k-Means++ Revisited

Authors: Christoph Grunau, Ahmet Alper Özüdoğru, Václav Rozhoň




File

LIPIcs.ESA.2023.55.pdf
  • Filesize: 0.64 MB
  • 7 pages

Author Details

Christoph Grunau
  • ETH Zürich, Switzerland
Ahmet Alper Özüdoğru
  • ETH Zürich, Switzerland
Václav Rozhoň
  • ETH Zürich, Switzerland

Acknowledgements

We would like to thank Mohsen Ghaffari for many helpful comments.

Cite As

Christoph Grunau, Ahmet Alper Özüdoğru, and Václav Rozhoň. Noisy k-Means++ Revisited. In 31st Annual European Symposium on Algorithms (ESA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 274, pp. 55:1-55:7, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ESA.2023.55

Abstract

The k-means++ algorithm by Arthur and Vassilvitskii [SODA 2007] is a classical and time-tested algorithm for the k-means problem. While being very practical, the algorithm also has good theoretical guarantees: its solution is O(log k)-approximate in expectation. In recent work, Bhattacharya, Eube, Röglin, and Schmidt [ESA 2020] considered the following question: does the algorithm retain its guarantees if we allow slight adversarial noise in the sampling probability distributions it uses? This is motivated, for example, by the fact that computations with real numbers in k-means++ implementations are inexact. Surprisingly, the analysis in this setting becomes substantially more difficult, and the authors were able to prove only a weaker approximation guarantee of O(log² k). In this paper, we close the gap by providing a tight O(log k)-approximation guarantee for the k-means++ algorithm with noise.
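To make the setting concrete, here is a minimal Python sketch of k-means++ seeding with perturbed sampling weights. It is an illustration under stated assumptions, not the construction analyzed in the paper: the noise model (each D² weight scaled by a factor in [1 - eps, 1 + eps]) is an assumed formalization of "slight noise", the perturbation is drawn at random rather than chosen by an adversary, and the names `noisy_kmeanspp_seeding`, `kmeans_cost`, and `eps` are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def noisy_kmeanspp_seeding(points, k, eps=0.0, rng=None):
    """Pick k centers by D^2 sampling with perturbed weights.

    Assumption: each exact weight is multiplied by a factor in
    [1 - eps, 1 + eps], with 0 <= eps < 1. A random factor stands in
    for the adversary here. With eps = 0 this is plain k-means++ seeding.
    """
    rng = rng or random.Random(0)
    # The first center is chosen uniformly at random.
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its closest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Exact D^2 weights, each perturbed by a bounded multiplicative factor.
        weights = [d * (1.0 + rng.uniform(-eps, eps)) for d in d2]
        if sum(weights) == 0:
            # Every point already coincides with a center; fall back to uniform.
            idx = rng.randrange(len(points))
        else:
            idx = rng.choices(range(len(points)), weights=weights, k=1)[0]
        centers.append(points[idx])
        # Update each point's distance to its closest center.
        d2 = [min(d, sq_dist(p, points[idx])) for p, d in zip(points, d2)]
    return centers


def kmeans_cost(points, centers):
    """Sum of squared distances from each point to its closest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (200, 0), (201, 0)]
    seeds = noisy_kmeanspp_seeding(data, k=3, eps=0.1)
    print(seeds, kmeans_cost(data, seeds))
```

A point that is already a center has weight zero both before and after the perturbation, so this sketch never selects the same point twice as long as some point remains at positive distance from all chosen centers.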

Subject Classification

ACM Subject Classification
  • Theory of computation → Approximation algorithms analysis
  • Theory of computation → Unsupervised learning and clustering
Keywords
  • clustering
  • k-means
  • k-means++
  • adversarial noise

References

  1. Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 15-28. Springer, 2009.
  2. Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine learning, 75(2):245-248, 2009.
  3. David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027-1035. Society for Industrial and Applied Mathematics, 2007.
  4. Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The hardness of approximation of Euclidean k-means. arXiv preprint, 2015. URL: https://arxiv.org/abs/1502.03316.
  5. Olivier Bachem, Mario Lucic, Hamed Hassani, and Andreas Krause. Fast and provably good seedings for k-means. In Advances in neural information processing systems, pages 55-63, 2016.
  6. Olivier Bachem, Mario Lucic, S Hamed Hassani, and Andreas Krause. Approximate k-means++ in sublinear time. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  7. Olivier Bachem, Mario Lucic, and Andreas Krause. Distributed and provably good seedings for k-means in constant rounds. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 292-300. JMLR.org, 2017.
  8. Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. Scalable k-means++. Proceedings of the VLDB Endowment, 5(7):622-633, 2012.
  9. Aditya Bhaskara, Sharvaree Vadgama, and Hong Xu. Greedy sampling for approximate clustering in the presence of outliers. Advances in Neural Information Processing Systems, 32, 2019.
  10. Anup Bhattacharya, Jan Eube, Heiko Röglin, and Melanie Schmidt. Noisy, greedy and not so greedy k-means++. In 28th Annual European Symposium on Algorithms (ESA 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
  11. Davin Choo, Christoph Grunau, Julian Portmann, and Václav Rozhoň. k-means++: few more steps yield constant approximation. In International Conference on Machine Learning, pages 1909-1917. PMLR, 2020.
  12. Vincent Cohen-Addad, Hossein Esfandiari, Vahab Mirrokni, and Shyam Narayanan. Improved approximations for Euclidean k-means and k-median, via nested quasi-independent sets, 2022. URL: https://doi.org/10.48550/ARXIV.2204.04828.
  13. Sanjoy Dasgupta. Lecture 3 – algorithms for k-means clustering, 2013, accessed May 8th, 2019.
  14. Christoph Grunau, Ahmet Alper Özüdoğru, Václav Rozhoň, and Jakub Tětek. A nearly tight analysis of greedy k-means++. arXiv preprint, 2022. URL: https://arxiv.org/abs/2207.07949.
  15. Christoph Grunau and Václav Rozhoň. Adapting k-means algorithms for outliers, 2020. URL: https://doi.org/10.48550/arXiv.2007.01118.
  16. Silvio Lattanzi and Christian Sohler. A better k-means++ algorithm via local search. In International Conference on Machine Learning, pages 3662-3671, 2019.
  17. Konstantin Makarychev, Aravind Reddy, and Liren Shan. Improved guarantees for k-means++ and k-means++ parallel. Advances in Neural Information Processing Systems, 33:16142-16152, 2020.
  18. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
  19. Václav Rozhoň. Simple and sharp analysis of k-means||. In International Conference on Machine Learning, pages 8266-8275. PMLR, 2020.
  20. Dennis Wei. A constant-factor bi-criteria approximation guarantee for k-means++. In Advances in Neural Information Processing Systems, pages 604-612, 2016.