Worst-Case and Smoothed Analysis of the Hartigan-Wong Method for k-Means Clustering

Authors: Bodo Manthey, Jesse van Rhijn



Author Details

Bodo Manthey
  • Faculty of Electrical Engineering, Mathematics, and Computer Science, University of Twente, Enschede, The Netherlands
Jesse van Rhijn
  • Faculty of Electrical Engineering, Mathematics, and Computer Science, University of Twente, Enschede, The Netherlands

Cite As

Bodo Manthey and Jesse van Rhijn. Worst-Case and Smoothed Analysis of the Hartigan-Wong Method for k-Means Clustering. In 41st International Symposium on Theoretical Aspects of Computer Science (STACS 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 289, pp. 52:1-52:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.STACS.2024.52

Abstract

We analyze the running time of the Hartigan-Wong method, an old algorithm for the k-means clustering problem. First, we construct an instance on the line on which the method can take 2^{Ω(n)} steps to converge, demonstrating that the Hartigan-Wong method has exponential worst-case running time even when k-means is easy to solve. As this is in contrast to the empirical performance of the algorithm, we also analyze the running time in the framework of smoothed analysis. In particular, given an instance of n points in d dimensions, we prove that the expected number of iterations needed for the Hartigan-Wong method to terminate is bounded by k^{12kd} ⋅ poly(n, k, d, 1/σ) when the points in the instance are perturbed by independent d-dimensional Gaussian random variables of mean 0 and standard deviation σ.
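
For readers unfamiliar with the method, the following is a minimal Python sketch of a Hartigan-Wong style local search, together with the Gaussian perturbation used in the smoothed model. It is an illustration under stated assumptions, not the authors' implementation: the function names (`hartigan_wong`, `perturb`, `potential`), the pivot rule (scan points in order, move each to its best improving cluster), the tolerance `1e-12`, and the choice never to empty a cluster are all assumptions made here. The move rule relies on the standard identity that moving a point x from cluster A to cluster B changes the k-means objective by |B|/(|B|+1)·‖x − μ_B‖² − |A|/(|A|−1)·‖x − μ_A‖².

```python
import random


def centroid(points):
    """Mean of a non-empty list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def potential(points, labels, k):
    """k-means objective: sum of squared distances to the cluster means."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if members:
            mu = centroid(members)
            total += sum(sq_dist(p, mu) for p in members)
    return total


def perturb(points, sigma, seed=0):
    """Add independent N(0, sigma^2) noise to every coordinate (smoothed model)."""
    rng = random.Random(seed)
    return [tuple(x + rng.gauss(0.0, sigma) for x in p) for p in points]


def hartigan_wong(points, labels, k, max_moves=10_000):
    """Move single points between clusters while the objective decreases.

    Returns the final labels and the number of moves (iterations) performed.
    The pivot rule and the non-empty-cluster rule are assumptions of this sketch.
    """
    labels = list(labels)
    moves = 0
    improved = True
    while improved and moves < max_moves:
        improved = False
        for i, x in enumerate(points):
            a = labels[i]
            members_a = [p for p, lab in zip(points, labels) if lab == a]
            size_a = len(members_a)
            if size_a <= 1:
                continue  # assumption: never empty a cluster
            # Decrease in the objective from removing x from its cluster.
            removal_gain = size_a / (size_a - 1) * sq_dist(x, centroid(members_a))
            best_b, best_gain = None, 1e-12  # tolerance against float noise
            for b in range(k):
                if b == a:
                    continue
                members_b = [p for p, lab in zip(points, labels) if lab == b]
                size_b = len(members_b)
                # Increase in the objective from inserting x into cluster b.
                insertion_cost = 0.0
                if size_b > 0:
                    insertion_cost = (
                        size_b / (size_b + 1) * sq_dist(x, centroid(members_b))
                    )
                gain = removal_gain - insertion_cost
                if gain > best_gain:
                    best_b, best_gain = b, gain
            if best_b is not None:
                labels[i] = best_b
                moves += 1
                improved = True
    return labels, moves
```

To experiment with the smoothed setting, one could call `hartigan_wong(perturb(points, sigma), labels, k)` for several seeds and values of `sigma` and record the returned move count. Such an experiment only illustrates the model and says nothing about the proven bound.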

Subject Classification

ACM Subject Classification
  • Theory of computation → Randomness, geometry and discrete structures
  • Theory of computation → Approximation algorithms analysis
  • Theory of computation → Discrete optimization
Keywords
  • k-means clustering
  • smoothed analysis
  • probabilistic analysis
  • local search
  • heuristics
