Noisy, Greedy and Not so Greedy k-Means++

Authors: Anup Bhattacharya, Jan Eube, Heiko Röglin, and Melanie Schmidt



File
  • LIPIcs.ESA.2020.18.pdf (21 pages, 0.51 MB)

Document Identifiers
  • DOI: 10.4230/LIPIcs.ESA.2020.18

Author Details

Anup Bhattacharya
  • Indian Statistical Institute, Kolkata, India
Jan Eube
  • University of Bonn, Germany
Heiko Röglin
  • University of Bonn, Germany
Melanie Schmidt
  • University of Cologne, Germany

Acknowledgements

We thank the reviewers for their detailed comments.

Cite As

Anup Bhattacharya, Jan Eube, Heiko Röglin, and Melanie Schmidt. Noisy, Greedy and Not so Greedy k-Means++. In 28th Annual European Symposium on Algorithms (ESA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 173, pp. 18:1-18:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.ESA.2020.18

Abstract

The k-means++ algorithm due to Arthur and Vassilvitskii [David Arthur and Sergei Vassilvitskii, 2007] has become the most popular seeding method for Lloyd’s algorithm. It samples the first center uniformly at random from the data set and the remaining k-1 centers iteratively according to D²-sampling, i.e., the probability that a data point becomes the next center is proportional to its squared distance to the closest center chosen so far. k-means++ is known to achieve an approximation factor of 𝒪(log k) in expectation. Already in the original paper on k-means++, Arthur and Vassilvitskii suggested a variation called the greedy k-means++ algorithm, in which in each iteration multiple candidate centers are sampled according to D²-sampling and only the one that decreases the objective the most is chosen as the center for that iteration. They left it as an open question whether this also leads to an 𝒪(log k)-approximation (or even better). We show that this is not the case by presenting a family of instances on which greedy k-means++ yields only an Ω(𝓁⋅log k)-approximation in expectation, where 𝓁 is the number of candidate centers sampled in each iteration. Inspired by this negative result, we study a variation of greedy k-means++ which we call the noisy k-means++ algorithm. In this variation only one center is sampled in every iteration, but not exactly by D²-sampling. Instead, in each iteration an adversary is allowed to change the probabilities arising from D²-sampling individually for each point by a factor between 1-ε₁ and 1+ε₂, for parameters ε₁ ∈ [0,1) and ε₂ ≥ 0. We prove that noisy k-means++ computes an 𝒪(log² k)-approximation in expectation. We also use the analysis of noisy k-means++ to design a moderately greedy k-means++ algorithm.
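To make the seeding procedures described in the abstract concrete, the following is a minimal Python/NumPy sketch, not the authors' implementation: the function names, the brute-force cost computation, and the example noise_fn are illustrative assumptions. It contrasts standard D²-sampling, the greedy variant that draws 𝓁 candidates per iteration and keeps the one with the smallest resulting cost, and the noisy variant in which each point's sampling weight may be rescaled by a factor in [1-ε₁, 1+ε₂].

```python
# Minimal sketch (not the authors' code) of the three seeding procedures
# discussed in the abstract.
import numpy as np


def squared_distances(X, centers):
    """For each point, squared distance to its closest chosen center."""
    diffs = X[:, None, :] - np.asarray(centers)[None, :, :]
    return np.min(np.sum(diffs ** 2, axis=2), axis=1)


def kmeanspp_seeding(X, k, rng):
    """Standard k-means++: first center uniform, remaining k-1 by D^2-sampling."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = squared_distances(X, centers)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)


def greedy_kmeanspp_seeding(X, k, l, rng):
    """Greedy variant: sample l candidates by D^2-sampling per iteration and
    keep the one whose addition decreases the k-means cost the most."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = squared_distances(X, centers)
        candidates = X[rng.choice(len(X), size=l, p=d2 / d2.sum())]
        best = min(candidates,
                   key=lambda c: squared_distances(X, centers + [c]).sum())
        centers.append(best)
    return np.array(centers)


def noisy_kmeanspp_seeding(X, k, eps1, eps2, noise_fn, rng):
    """Noisy variant: one center per iteration, but each point's D^2 weight is
    rescaled by a factor in [1 - eps1, 1 + eps2] supplied by noise_fn (the adversary)."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = squared_distances(X, centers)
        # Clip to enforce the model's bounds on the per-point perturbation.
        factors = np.clip(noise_fn(d2), 1 - eps1, 1 + eps2)
        weights = d2 * factors
        centers.append(X[rng.choice(len(X), p=weights / weights.sum())])
    return np.array(centers)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    c_plain = kmeanspp_seeding(X, k=5, rng=rng)
    c_greedy = greedy_kmeanspp_seeding(X, k=5, l=3, rng=rng)
    c_noisy = noisy_kmeanspp_seeding(
        X, k=5, eps1=0.1, eps2=0.1,
        noise_fn=lambda d2: rng.uniform(0.9, 1.1, len(d2)), rng=rng)
```

Passing the perturbation as a callback keeps the adversary abstract: any point-wise rescaling within the allowed factor range can be plugged in, which is exactly the freedom the noisy model grants.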

Subject Classification

ACM Subject Classification
  • Theory of computation → Approximation algorithms analysis
  • Theory of computation → Facility location and clustering
  • Theory of computation → Design and analysis of algorithms
Keywords
  • k-means++
  • greedy
  • adaptive sampling


References

  1. Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In Proceedings of the 12th and 13th APPROX-RANDOM, pages 15-28, 2009.
  2. Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. In Proceedings of the 58th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 61-72, 2017.
  3. Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245-248, 2009.
  4. David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027-1035, 2007.
  5. Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The hardness of approximation of Euclidean k-means. In Proceedings of the 31st International Symposium on Computational Geometry (SoCG), pages 754-767, 2015.
  6. Olivier Bachem, Mario Lucic, S. Hamed Hassani, and Andreas Krause. Approximate k-means++ in sublinear time. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 1459-1467, 2016. URL: http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12147.
  7. Anup Bhattacharya, Ragesh Jaiswal, and Nir Ailon. Tight lower bound instances for k-means++ in two dimensions. Theoretical Computer Science, 634:55-66, 2016.
  8. Tobias Brunsch and Heiko Röglin. A bad instance for k-means++. Theoretical Computer Science, 505:19-26, 2013.
  9. M. Emre Celebi, Hassan A. Kingravi, and Patricio A. Vela. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications, 40(1):200-210, 2013.
  10. Davin Choo, Christoph Grunau, Julian Portmann, and Václav Rozhoň. k-means++: few more steps yield constant approximation. CoRR, abs/2002.07784, 2020. URL: http://arxiv.org/abs/2002.07784.
  11. Vincent Cohen-Addad, Philip N. Klein, and Claire Mathieu. Local search yields approximation schemes for k-means and k-median in Euclidean and minor-free metrics. In Proceedings of the 57th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 353-364, 2016.
  12. Sanjoy Dasgupta. Lecture 3 - Algorithms for k-means clustering, 2013. Accessed May 8, 2019. URL: http://cseweb.ucsd.edu/~dasgupta/291-geom/kmeans.pdf.
  13. Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC), pages 569-578, 2011.
  14. Zachary Friggstad, Mohsen Rezapour, and Mohammad R. Salavatipour. Local search yields a PTAS for k-means in doubling metrics. SIAM Journal on Computing, 48(2):452-480, 2019.
  15. Daniel J. Hsu and Matus Telgarsky. Greedy bi-criteria approximations for k-medians and k-means. CoRR, abs/1607.06203, 2016. URL: http://arxiv.org/abs/1607.06203.
  16. Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2-3):89-112, 2004.
  17. Silvio Lattanzi and Christian Sohler. A better k-means++ algorithm via local search. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 3662-3671, 2019.
  18. Euiwoong Lee, Melanie Schmidt, and John Wright. Improved and simplified inapproximability for k-means. Information Processing Letters, 120:40-43, 2017.
  19. Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982. Originally published as a Bell Laboratories Technical Memorandum in 1957.
  20. Meena Mahajan, Prajakta Nimbhorkar, and Kasturi R. Varadarajan. The planar k-means problem is NP-hard. In Proceedings of the 3rd Workshop on Algorithms and Computation (WALCOM), pages 274-285, 2009.
  21. Benedikt Pago. Upper and lower bounds for the approximation ratios of incremental and hierarchical clustering algorithms. Master’s thesis, University of Bonn, 2018.
  22. Sergei Vassilvitskii. k-means: Algorithms, Analyses, Experiments. PhD thesis, Stanford University, 2007.
  23. Dennis Wei. A constant-factor bi-criteria approximation guarantee for k-means++. In Proceedings of the Annual Conference on Neural Information Processing Systems 2016 (NIPS), pages 604-612, 2016.
  24. Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus F. M. Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1-37, 2008.