# Noisy, Greedy and Not so Greedy k-Means++

## File

LIPIcs.ESA.2020.18.pdf
• Filesize: 0.51 MB
• 21 pages

## Acknowledgements

We thank the reviewers for their detailed comments.

## Cite As

Anup Bhattacharya, Jan Eube, Heiko Röglin, and Melanie Schmidt. Noisy, Greedy and Not so Greedy k-Means++. In 28th Annual European Symposium on Algorithms (ESA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 173, pp. 18:1-18:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.ESA.2020.18

## Abstract

The k-means++ algorithm due to Arthur and Vassilvitskii [David Arthur and Sergei Vassilvitskii, 2007] has become the most popular seeding method for Lloyd’s algorithm. It samples the first center uniformly at random from the data set and the other k-1 centers iteratively according to D²-sampling, i.e., the probability that a data point becomes the next center is proportional to its squared distance to the closest center chosen so far. k-means++ is known to achieve an approximation factor of 𝒪(log k) in expectation.

Already in the original paper on k-means++, Arthur and Vassilvitskii suggested a variation called the greedy k-means++ algorithm, in which in each iteration multiple possible centers are sampled according to D²-sampling and only the one that decreases the objective the most is chosen as the center for that iteration. They stated it as an open question whether this variation also leads to an 𝒪(log k)-approximation (or even better). We show that this is not the case by presenting a family of instances on which greedy k-means++ yields only an Ω(𝓁⋅log k)-approximation in expectation, where 𝓁 is the number of possible centers that are sampled in each iteration.

Inspired by this negative result, we study a variation of greedy k-means++ which we call the noisy k-means++ algorithm. In this variation, only one center is sampled in every iteration, but not exactly by D²-sampling. Instead, in each iteration an adversary is allowed to change the probabilities arising from D²-sampling individually for each point by a factor between 1-ε₁ and 1+ε₂ for parameters ε₁ ∈ [0,1) and ε₂ ≥ 0. We prove that noisy k-means++ computes an 𝒪(log² k)-approximation in expectation. We also use the analysis of noisy k-means++ to design a moderately greedy k-means++ algorithm.
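
To make the sampling schemes in the abstract concrete, the following is a minimal Python sketch of D²-sampling together with the greedy and noisy variants. This is not the authors' code: the function names are illustrative, and the adversary of noisy k-means++ is stood in for by i.i.d. random perturbation factors.

```python
import numpy as np

def squared_dists(X, centers):
    """Squared distance from each point to its closest chosen center."""
    C = np.asarray(centers)
    return np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)

def cost(X, centers):
    """k-means objective: sum of squared distances to the closest center."""
    return squared_dists(X, centers).sum()

def greedy_kmeanspp(X, k, ell, rng=None):
    """Greedy k-means++: each round draws ell candidates by D^2-sampling
    and keeps the one that decreases the objective the most."""
    rng = rng if rng is not None else np.random.default_rng()
    centers = [X[rng.integers(len(X))]]            # first center: uniform
    for _ in range(k - 1):
        d = squared_dists(X, centers)
        cand = rng.choice(len(X), size=ell, p=d / d.sum())
        best = min(cand, key=lambda i: cost(X, centers + [X[i]]))
        centers.append(X[best])
    return np.array(centers)

def noisy_kmeanspp(X, k, eps1, eps2, rng=None):
    """Noisy k-means++: one center per round, but each point's D^2-weight
    is perturbed by a factor in [1 - eps1, 1 + eps2]. The paper allows an
    adversary to pick the factors; random factors are used here only for
    illustration."""
    rng = rng if rng is not None else np.random.default_rng()
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        w = squared_dists(X, centers) * rng.uniform(1 - eps1, 1 + eps2, len(X))
        centers.append(X[rng.choice(len(X), p=w / w.sum())])
    return np.array(centers)
```

Plain k-means++ is recovered as the special case 𝓁 = 1 (equivalently ε₁ = ε₂ = 0), so either function can serve as a baseline when comparing the variants.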

## Subject Classification

##### ACM Subject Classification
• Theory of computation → Approximation algorithms analysis
• Theory of computation → Facility location and clustering
• Theory of computation → Design and analysis of algorithms

##### Keywords
• k-means++
• greedy


## References

1. Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In Proceedings of the 12th International Workshop APPROX and 13th International Workshop RANDOM (APPROX-RANDOM), pages 15-28, 2009.
2. Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. In Proceedings of the 58th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 61-72, 2017.
3. Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245-248, 2009.
4. David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027-1035, 2007.
5. Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The hardness of approximation of Euclidean k-means. In Proceedings of the 31st International Symposium on Computational Geometry (SoCG), pages 754-767, 2015.
6. Olivier Bachem, Mario Lucic, S. Hamed Hassani, and Andreas Krause. Approximate k-means++ in sublinear time. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 1459-1467, 2016. URL: http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12147.
7. Anup Bhattacharya, Ragesh Jaiswal, and Nir Ailon. Tight lower bound instances for k-means++ in two dimensions. Theoretical Computer Science, 634:55-66, 2016.
8. Tobias Brunsch and Heiko Röglin. A bad instance for k-means++. Theoretical Computer Science, 505:19-26, 2013.
9. M. Emre Celebi, Hassan A. Kingravi, and Patricio A. Vela. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications, 40(1):200-210, 2013.
10. Davin Choo, Christoph Grunau, Julian Portmann, and Václav Rozhon. k-means++: few more steps yield constant approximation. CoRR, abs/2002.07784, 2020. URL: http://arxiv.org/abs/2002.07784.
11. Vincent Cohen-Addad, Philip N. Klein, and Claire Mathieu. Local search yields approximation schemes for k-means and k-median in Euclidean and minor-free metrics. In Proceedings of the 57th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 353-364, 2016.
12. Sanjoy Dasgupta. Lecture 3 - Algorithms for k-means clustering, 2013. Accessed May 8th, 2019. URL: http://cseweb.ucsd.edu/~dasgupta/291-geom/kmeans.pdf.
13. Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC), pages 569-578, 2011.
14. Zachary Friggstad, Mohsen Rezapour, and Mohammad R. Salavatipour. Local search yields a PTAS for k-means in doubling metrics. SIAM Journal on Computing, 48(2):452-480, 2019.
15. Daniel J. Hsu and Matus Telgarsky. Greedy bi-criteria approximations for k-medians and k-means. CoRR, abs/1607.06203, 2016. URL: http://arxiv.org/abs/1607.06203.
16. Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2-3):89-112, 2004.
17. Silvio Lattanzi and Christian Sohler. A better k-means++ algorithm via local search. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 3662-3671, 2019.
18. Euiwoong Lee, Melanie Schmidt, and John Wright. Improved and simplified inapproximability for k-means. Information Processing Letters, 120:40-43, 2017.
19. Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982. Originally published as a Bell Laboratories Technical Memorandum in 1957.
20. Meena Mahajan, Prajakta Nimbhorkar, and Kasturi R. Varadarajan. The Planar k-means Problem is NP-Hard. In Proceedings of the 3rd Workshop on Algorithms and Computation (WALCOM), pages 274-285, 2009.
21. Benedikt Pago. Upper and lower bounds for the approximation ratios of incremental and hierarchical clustering algorithms. Master’s thesis, University of Bonn, 2018.
22. Sergei Vassilvitskii. k-means: Algorithms, Analyses, Experiments. PhD thesis, Stanford University, 2007.
23. Dennis Wei. A constant-factor bi-criteria approximation guarantee for k-means++. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), pages 604-612, 2016.
24. Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus F. M. Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1-37, 2008.