Noisy, Greedy and Not so Greedy k-Means++

Authors: Anup Bhattacharya, Jan Eube, Heiko Röglin, Melanie Schmidt

Author Details

Anup Bhattacharya
  • Indian Statistical Institute, Kolkata, India
Jan Eube
  • University of Bonn, Germany
Heiko Röglin
  • University of Bonn, Germany
Melanie Schmidt
  • University of Cologne, Germany


Acknowledgements

We thank the reviewers for their detailed comments.

Cite As

Anup Bhattacharya, Jan Eube, Heiko Röglin, and Melanie Schmidt. Noisy, Greedy and Not so Greedy k-Means++. In 28th Annual European Symposium on Algorithms (ESA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 173, pp. 18:1-18:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


Abstract

The k-means++ algorithm due to Arthur and Vassilvitskii [David Arthur and Sergei Vassilvitskii, 2007] has become the most popular seeding method for Lloyd’s algorithm. It samples the first center uniformly at random from the data set and the other k-1 centers iteratively according to D²-sampling, i.e., the probability that a data point becomes the next center is proportional to its squared distance to the closest center chosen so far. k-means++ is known to achieve an approximation factor of 𝒪(log k) in expectation.

Already in the original paper on k-means++, Arthur and Vassilvitskii suggested a variation, called the greedy k-means++ algorithm, in which in each iteration multiple possible centers are sampled according to D²-sampling and only the one that decreases the objective the most is chosen as the center for that iteration. They stated it as an open question whether this also leads to an 𝒪(log k)-approximation (or even better). We show that this is not the case by presenting a family of instances on which greedy k-means++ yields only an Ω(𝓁⋅log k)-approximation in expectation, where 𝓁 is the number of possible centers sampled in each iteration.

Inspired by these negative results, we study a variation of greedy k-means++ which we call the noisy k-means++ algorithm. In this variation only one center is sampled in every iteration, but not exactly by D²-sampling: in each iteration an adversary is allowed to change the probabilities arising from D²-sampling individually for each point by a factor between 1-ε₁ and 1+ε₂, for parameters ε₁ ∈ [0,1) and ε₂ ≥ 0. We prove that noisy k-means++ computes an 𝒪(log² k)-approximation in expectation. We use the analysis of noisy k-means++ to design a moderately greedy k-means++ algorithm.
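As an illustrative sketch (not the authors' implementation), the seeding procedure described in the abstract can be written as follows. The function names and parameter choices are our own; with 𝓁 = 1 the greedy step vanishes and this reduces to plain k-means++.

```python
import random

def sq_dist(a, b):
    # squared Euclidean distance between two points given as tuples
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cost(points, centers):
    # k-means objective: sum of squared distances to the closest center
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def greedy_kmeanspp(points, k, ell, rng=None):
    """Sketch of greedy k-means++ seeding: in each iteration, draw `ell`
    candidates by D^2-sampling and keep the one that decreases the
    objective the most. `ell = 1` recovers ordinary k-means++."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]  # first center: uniform at random
    for _ in range(k - 1):
        # D^2-sampling weights: squared distance to the closest chosen center
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        candidates = rng.choices(points, weights=weights, k=ell)
        # greedy step: keep the candidate minimizing the resulting cost
        best = min(candidates, key=lambda q: cost(points, centers + [q]))
        centers.append(best)
    return centers
```

The noisy variant studied in the paper would instead perturb each entry of `weights` by an adversarial factor in [1-ε₁, 1+ε₂] before sampling a single candidate.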

Subject Classification

ACM Subject Classification
  • Theory of computation → Approximation algorithms analysis
  • Theory of computation → Facility location and clustering
  • Theory of computation → Design and analysis of algorithms
Keywords
  • k-means++
  • greedy
  • adaptive sampling



References

  1. Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In Proceedings of the 12th and 13th APPROX-RANDOM, pages 15-28, 2009. Google Scholar
  2. Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. In Proceedings of the 58th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 61-72, 2017. Google Scholar
  3. Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245-248, 2009. Google Scholar
  4. David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027-1035, 2007. Google Scholar
  5. Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The hardness of approximation of Euclidean k-means. In Proceedings of the 31st International Symposium on Computational Geometry (SoCG), pages 754-767, 2015. Google Scholar
  6. Olivier Bachem, Mario Lucic, S. Hamed Hassani, and Andreas Krause. Approximate k-means++ in sublinear time. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 1459-1467, 2016.
  7. Anup Bhattacharya, Ragesh Jaiswal, and Nir Ailon. Tight lower bound instances for k-means++ in two dimensions. Theoretical Computer Science, 634:55-66, 2016. Google Scholar
  8. Tobias Brunsch and Heiko Röglin. A bad instance for k-means++. Theoretical Computer Science, 505:19-26, 2013. Google Scholar
  9. M. Emre Celebi, Hassan A. Kingravi, and Patricio A. Vela. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications, 40(1):200-210, 2013. Google Scholar
  10. Davin Choo, Christoph Grunau, Julian Portmann, and Václav Rozhon. k-means++: few more steps yield constant approximation. CoRR, abs/2002.07784, 2020.
  11. Vincent Cohen-Addad, Philip N. Klein, and Claire Mathieu. Local search yields approximation schemes for k-means and k-median in Euclidean and minor-free metrics. In Proceedings of the 57th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 353-364, 2016. Google Scholar
  12. Sanjoy Dasgupta. Lecture 3 - Algorithms for k-means clustering, 2013. Accessed May 8th, 2019.
  13. Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC), pages 569-578, 2011. Google Scholar
  14. Zachary Friggstad, Mohsen Rezapour, and Mohammad R. Salavatipour. Local search yields a PTAS for k-means in doubling metrics. SIAM Journal on Computing, 48(2):452-480, 2019. Google Scholar
  15. Daniel J. Hsu and Matus Telgarsky. Greedy bi-criteria approximations for k-medians and k-means. CoRR, abs/1607.06203, 2016.
  16. Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2-3):89-112, 2004. Google Scholar
  17. Silvio Lattanzi and Christian Sohler. A better k-means++ algorithm via local search. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 3662-3671, 2019. Google Scholar
  18. Euiwoong Lee, Melanie Schmidt, and John Wright. Improved and simplified inapproximability for k-means. Information Processing Letters, 120:40-43, 2017. Google Scholar
  19. Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982. originally published as Bell Laboratories Technical Memorandum in 1957. Google Scholar
  20. Meena Mahajan, Prajakta Nimbhorkar, and Kasturi R. Varadarajan. The Planar k-means Problem is NP-Hard. In Proceedings of the 3rd Workshop on Algorithms and Computation (WALCOM), pages 274-285, 2009. Google Scholar
  21. Benedikt Pago. Upper and lower bounds for the approximation ratios of incremental and hierarchical clustering algorithms. Master’s thesis, University of Bonn, 2018. Google Scholar
  22. Sergei Vassilvitskii. k-means: Algorithms, Analyses, Experiments. PhD thesis, Stanford University, 2007. Google Scholar
  23. Dennis Wei. A constant-factor bi-criteria approximation guarantee for k-means++. In Proceedings of the Annual Conference on Neural Information Processing Systems 2016 (NIPS), pages 604-612, 2016. Google Scholar
  24. Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus F. M. Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1-37, 2008. Google Scholar