Noisy, Greedy and Not so Greedy k-Means++

Authors: Anup Bhattacharya, Jan Eube, Heiko Röglin, and Melanie Schmidt



File
  • LIPIcs.ESA.2020.18.pdf (21 pages, 0.51 MB)

Document Identifiers
  • DOI: 10.4230/LIPIcs.ESA.2020.18

Author Details

Anup Bhattacharya
  • Indian Statistical Institute, Kolkata, India
Jan Eube
  • University of Bonn, Germany
Heiko Röglin
  • University of Bonn, Germany
Melanie Schmidt
  • University of Cologne, Germany

Acknowledgements

We thank the reviewers for their detailed comments.

Cite As

Anup Bhattacharya, Jan Eube, Heiko Röglin, and Melanie Schmidt. Noisy, Greedy and Not so Greedy k-Means++. In 28th Annual European Symposium on Algorithms (ESA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 173, pp. 18:1-18:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.ESA.2020.18

Abstract

The k-means++ algorithm due to Arthur and Vassilvitskii [David Arthur and Sergei Vassilvitskii, 2007] has become the most popular seeding method for Lloyd’s algorithm. It samples the first center uniformly at random from the data set and the remaining k-1 centers iteratively according to D²-sampling, i.e., the probability that a data point becomes the next center is proportional to its squared distance to the closest center chosen so far. k-means++ is known to achieve an approximation factor of 𝒪(log k) in expectation. Already in the original paper on k-means++, Arthur and Vassilvitskii suggested a variation called the greedy k-means++ algorithm, in which in each iteration multiple candidate centers are sampled according to D²-sampling and only the one that decreases the objective the most is chosen as the center for that iteration. They left it as an open question whether this also leads to an 𝒪(log k)-approximation (or even better). We show that this is not the case by presenting a family of instances on which greedy k-means++ yields only an Ω(𝓁⋅log k)-approximation in expectation, where 𝓁 is the number of candidate centers sampled in each iteration. Inspired by this negative result, we study a variation of greedy k-means++ which we call the noisy k-means++ algorithm. In this variation only one center is sampled in every iteration, but not exactly by D²-sampling. Instead, in each iteration an adversary is allowed to change the probabilities arising from D²-sampling individually for each point by a factor between 1-ε₁ and 1+ε₂, for parameters ε₁ ∈ [0,1) and ε₂ ≥ 0. We prove that noisy k-means++ computes an 𝒪(log² k)-approximation in expectation. We also use the analysis of noisy k-means++ to design a moderately greedy k-means++ algorithm.
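To make the seeding procedures described in the abstract concrete, the following is a minimal Python/NumPy sketch, not the authors' implementation: the function names, the brute-force cost computation, and the example noise_fn are illustrative assumptions. It contrasts standard D²-sampling, the greedy variant that draws 𝓁 candidates per iteration and keeps the one with the smallest resulting cost, and the noisy variant in which each point's sampling weight may be rescaled by a factor in [1-ε₁, 1+ε₂].

```python
# Minimal sketch (not the authors' code) of the three seeding procedures
# discussed in the abstract.
import numpy as np


def squared_distances(X, centers):
    """For each point, squared distance to its closest chosen center."""
    diffs = X[:, None, :] - np.asarray(centers)[None, :, :]
    return np.min(np.sum(diffs ** 2, axis=2), axis=1)


def kmeanspp_seeding(X, k, rng):
    """Standard k-means++: first center uniform, remaining k-1 by D^2-sampling."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = squared_distances(X, centers)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)


def greedy_kmeanspp_seeding(X, k, l, rng):
    """Greedy variant: sample l candidates by D^2-sampling per iteration and
    keep the one whose addition decreases the k-means cost the most."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = squared_distances(X, centers)
        candidates = X[rng.choice(len(X), size=l, p=d2 / d2.sum())]
        best = min(candidates,
                   key=lambda c: squared_distances(X, centers + [c]).sum())
        centers.append(best)
    return np.array(centers)


def noisy_kmeanspp_seeding(X, k, eps1, eps2, noise_fn, rng):
    """Noisy variant: one center per iteration, but each point's D^2 weight is
    rescaled by a factor in [1 - eps1, 1 + eps2] supplied by noise_fn (the adversary)."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = squared_distances(X, centers)
        # Clip to enforce the model's bounds on the per-point perturbation.
        factors = np.clip(noise_fn(d2), 1 - eps1, 1 + eps2)
        weights = d2 * factors
        centers.append(X[rng.choice(len(X), p=weights / weights.sum())])
    return np.array(centers)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    c_plain = kmeanspp_seeding(X, k=5, rng=rng)
    c_greedy = greedy_kmeanspp_seeding(X, k=5, l=3, rng=rng)
    c_noisy = noisy_kmeanspp_seeding(
        X, k=5, eps1=0.1, eps2=0.1,
        noise_fn=lambda d2: rng.uniform(0.9, 1.1, len(d2)), rng=rng)
```

Passing the perturbation as a callback keeps the adversary abstract: any point-wise rescaling within the allowed factor range can be plugged in, which is exactly the freedom the noisy model grants.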

Subject Classification

ACM Subject Classification
  • Theory of computation → Approximation algorithms analysis
  • Theory of computation → Facility location and clustering
  • Theory of computation → Design and analysis of algorithms
Keywords
  • k-means++
  • greedy
  • adaptive sampling


References

  1. Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In Proceedings of the 12th and 13th APPROX-RANDOM, pages 15-28, 2009.
  2. Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. In Proceedings of the 58th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 61-72, 2017.
  3. Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245-248, 2009.
  4. David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027-1035, 2007.
  5. Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The hardness of approximation of Euclidean k-means. In Proceedings of the 31st International Symposium on Computational Geometry (SoCG), pages 754-767, 2015.
  6. Olivier Bachem, Mario Lucic, S. Hamed Hassani, and Andreas Krause. Approximate k-means++ in sublinear time. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 1459-1467, 2016. URL: http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12147.
  7. Anup Bhattacharya, Ragesh Jaiswal, and Nir Ailon. Tight lower bound instances for k-means++ in two dimensions. Theoretical Computer Science, 634:55-66, 2016.
  8. Tobias Brunsch and Heiko Röglin. A bad instance for k-means++. Theoretical Computer Science, 505:19-26, 2013.
  9. M. Emre Celebi, Hassan A. Kingravi, and Patricio A. Vela. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications, 40(1):200-210, 2013.
  10. Davin Choo, Christoph Grunau, Julian Portmann, and Václav Rozhoň. k-means++: few more steps yield constant approximation. CoRR, abs/2002.07784, 2020. URL: http://arxiv.org/abs/2002.07784.
  11. Vincent Cohen-Addad, Philip N. Klein, and Claire Mathieu. Local search yields approximation schemes for k-means and k-median in Euclidean and minor-free metrics. In Proceedings of the 57th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 353-364, 2016.
  12. Sanjoy Dasgupta. Lecture 3 - Algorithms for k-means clustering, 2013. Accessed May 8, 2019. URL: http://cseweb.ucsd.edu/~dasgupta/291-geom/kmeans.pdf.
  13. Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC), pages 569-578, 2011.
  14. Zachary Friggstad, Mohsen Rezapour, and Mohammad R. Salavatipour. Local search yields a PTAS for k-means in doubling metrics. SIAM Journal on Computing, 48(2):452-480, 2019.
  15. Daniel J. Hsu and Matus Telgarsky. Greedy bi-criteria approximations for k-medians and k-means. CoRR, abs/1607.06203, 2016. URL: http://arxiv.org/abs/1607.06203.
  16. Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2-3):89-112, 2004.
  17. Silvio Lattanzi and Christian Sohler. A better k-means++ algorithm via local search. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 3662-3671, 2019.
  18. Euiwoong Lee, Melanie Schmidt, and John Wright. Improved and simplified inapproximability for k-means. Information Processing Letters, 120:40-43, 2017.
  19. Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982. Originally published as a Bell Laboratories Technical Memorandum in 1957.
  20. Meena Mahajan, Prajakta Nimbhorkar, and Kasturi R. Varadarajan. The planar k-means problem is NP-hard. In Proceedings of the 3rd Workshop on Algorithms and Computation (WALCOM), pages 274-285, 2009.
  21. Benedikt Pago. Upper and lower bounds for the approximation ratios of incremental and hierarchical clustering algorithms. Master’s thesis, University of Bonn, 2018.
  22. Sergei Vassilvitskii. k-means: Algorithms, Analyses, Experiments. PhD thesis, Stanford University, 2007.
  23. Dennis Wei. A constant-factor bi-criteria approximation guarantee for k-means++. In Proceedings of the Annual Conference on Neural Information Processing Systems 2016 (NIPS), pages 604-612, 2016.
  24. Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus F. M. Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1-37, 2008.