ACM Other Conferences

10.1145/acmotherconferences

0000000

10.5555/0000000

Proceedings of the 22nd International Symposium on Experimental Algorithms (SEA 2024)

SEA 2024

10.4230/LIPIcs.SEA.2024.7

10002950.10003624.10003625.10003628

Mathematics of computing~Combinatorial algorithms

500

10003752.10003809.10003636.10003812

Theory of computation~Facility location and clustering

500

10002951.10003227.10003351.10003444

Information systems~Clustering

500

Local Search k-means++ with Foresight

Conrads

Theo

Department of Computer Science, University of Cologne, Germany Author

https://orcid.org/0000-0001-9395-6711

Drexler

Lukas

Faculty of Mathematics and Natural Sciences, Department of Computer Science, Heinrich Heine University Düsseldorf, Germany lukas.drexler@hhu.de Author

https://orcid.org/0000-0003-4245-4812

Könen

Joshua

Institute of Computer Science, University of Bonn, Germany s6jjkoen@uni-bonn.de Author

https://orcid.org/0000-0001-7381-912X

Schmidt

Daniel R.

Faculty of Mathematics and Natural Sciences, Department of Computer Science, Heinrich Heine University Düsseldorf, Germany dschmidt@hhu.de Author

https://orcid.org/0000-0003-4856-3905

Schmidt

Melanie

Faculty of Mathematics and Natural Sciences, Department of Computer Science, Heinrich Heine University Düsseldorf, Germany mschmidt@hhu.de Author

11 07 2024

7:1 7:20

Since its introduction in 1957, Lloyd’s algorithm for k-means clustering has been extensively studied and has undergone several improvements. While in its original form it does not guarantee any approximation factor at all, Arthur and Vassilvitskii (SODA 2007) proposed k-means++, which enhances Lloyd’s algorithm with a seeding method that guarantees an 𝒪(log k)-approximation in expectation. More recently, Lattanzi and Sohler (ICML 2019) proposed LS++, which further improves the solution quality of k-means++ by local search techniques to obtain an 𝒪(1)-approximation. On the practical side, the greedy variant of k-means++ is often used, although its worst-case behaviour is provably worse than that of the standard k-means++ variant.

We investigate how to improve LS++ further in practice. We study two options for improving the practical performance: (a) combining LS++ with greedy k-means++ instead of k-means++, and (b) improving LS++ by better entangling it with Lloyd’s algorithm. Option (a) worsens the theoretical guarantees of k-means++ but improves the solution quality in practice, also in combination with LS++, as we confirm in our experiments. Option (b) is our new algorithm, Foresight LS++ (FLS++). We experimentally show that FLS++ improves upon the solution quality of LS++ while retaining the asymptotic runtime and worst-case approximation bounds of LS++.

k-means clustering k-means++ greedy local search

Available at https://www.kdd.org/kdd-cup/view/kdd-cup-2004/data.

Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. SIAM J. Comput., 49(4), 2020.

Daniel Aloise, Pierre Hansen, and Leo Liberti. An improved column generation algorithm for minimum sum-of-squares clustering. Mathematical Programming, 131:195-220, 2012.

David Arthur and Sergei Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the 18th SODA, pages 1027-1035, USA, 2007.

Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The hardness of approximation of Euclidean k-means. In Lars Arge and János Pach, editors, Proc. of the 31st SoCG, volume 34 of LIPIcs, pages 754-767, 2015.

Anup Bhattacharya, Jan Eube, Heiko Röglin, and Melanie Schmidt. Noisy, greedy and not so greedy k-means++. In Proc. of the 28th ESA, 2020.

M. Emre Celebi, Hassan A. Kingravi, and Patricio A. Vela. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst. Appl., 40(1):200-210, 2013.

Davin Choo, Christoph Grunau, Julian Portmann, and Václav Rozhoň. k-means++: few more steps yield constant approximation. In International Conference on Machine Learning, pages 1909-1917. PMLR, 2020.

Theo Conrads. Lokale Such- und Samplingmethoden für das k-Means- und k-Median-Problem. Master’s thesis, Universität zu Köln, 2021.

Sanjoy Dasgupta. The hardness of k-means clustering, 2008. Technical report.

Lukas Drexler, Joshua Könen, Daniel R. Schmidt, Melanie Schmidt, and Giulia Baldini. algo-hhu/FLSpp. Software, URL: https://github.com/algo-hhu/FLSpp (visited on 2024-06-27). doi:10.4230/artifacts.22470

Charles Elkan. Using the triangle inequality to accelerate k-means. In Tom Fawcett and Nina Mishra, editors, Proc. of the 20th ICML, pages 147-153, 2003.

Gereon Frahling and Christian Sohler. A fast k-means implementation using coresets. In Proc. of the 22nd SoCG, pages 135-143, 2006.

Pasi Fränti and Sami Sieranoja. K-means properties on six clustering benchmark datasets. Appl. Intell., 48(12):4743-4759, 2018.

Pasi Fränti and Sami Sieranoja. How much can k-means be improved by using better initialization and repeats? Pattern Recognition, 93:95-112, 2019.

Bernd Fritzke. The k-means-u* algorithm: non-local jumps and greedy retries improve k-means++ clustering. CoRR, abs/1706.09059, 2017.

Christoph Grunau, Ahmet Alper Özüdoğru, Václav Rozhoň, and Jakub Tětek. A nearly tight analysis of greedy k-means++. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1012-1070. SIAM, 2023.

Greg Hamerly. Making k-means even faster. In SDM, pages 130-140. SIAM, 2010.

Grete Heinz, Louis J. Peterson, Roger W. Johnson, and Carter J. Kerk. Exploring relationships in body dimensions. Journal of Statistics Education, 11(2), 2003.

Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2-3):89-112, 2004.

Silvio Lattanzi and Christian Sohler. A better k-means++ algorithm via local search. In Proc. of the 36th ICML, volume 97 of Proceedings of Machine Learning Research, pages 3662-3671. PMLR, 09-15 June 2019.

Euiwoong Lee, Melanie Schmidt, and John Wright. Improved and simplified inapproximability for k-means. Inf. Process. Lett., 120:40-43, 2017.

Meena Mahajan, Prajakta Nimbhorkar, and Kasturi R. Varadarajan. The planar k-means problem is NP-hard. Theor. Comput. Sci., 442:13-21, 2012.

Manfred Padberg and Giovanni Rinaldi. A branch-and-cut algorithm for the resolution of large-scale symmetric traveling salesman problems. SIAM Review, 33(1):60-100, 1991.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

J. M. Peña, J. A. Lozano, and P. Larrañaga. An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Letters, 20(10):1027-1040, 1999.

Dennis Wei. A constant-factor bi-criteria approximation guarantee for k-means++. In Advances in Neural Information Processing Systems, volume 29, 2016.

I.-C. Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797-1808, 1998.

<book-part-wrapper xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="2.0" xml:lang="en" content-type="research-article">

<collection-meta collection-type="book-series">

<collection-id collection-id-type="doi">10.1145/acmotherconferences</collection-id>

<title-group>

<title>ACM Other Conferences</title>

</title-group>

</collection-meta>

<book-meta>

<book-id book-id-type="acm-id">0000000</book-id>

<book-id book-id-type="doi">10.5555/0000000</book-id>

<book-title-group>

<book-title>Proceedings of the 22nd International Symposium on Experimental Algorithms (SEA 2024)</book-title>

<alt-title alt-title-type="acronym">SEA 2024</alt-title>

</book-title-group>

</book-meta>

<book-part book-part-type="chapter" xml:lang="en">

<book-part-meta>

<book-part-id book-part-id-type="doi">10.4230/LIPIcs.SEA.2024.7</book-part-id>

<book-part-id book-part-id-type="article-no">7</book-part-id>

<subj-group subj-group-type="ccs2012">

<compound-subject>

<compound-subject-part content-type="code">10002950.10003624.10003625.10003628</compound-subject-part>

<compound-subject-part content-type="text">Mathematics of computing~Combinatorial algorithms</compound-subject-part>

<compound-subject-part content-type="weight">500</compound-subject-part>

</compound-subject>

<compound-subject>

<compound-subject-part content-type="code">10003752.10003809.10003636.10003812</compound-subject-part>

<compound-subject-part content-type="text">Theory of computation~Facility location and clustering</compound-subject-part>

<compound-subject-part content-type="weight">500</compound-subject-part>

</compound-subject>

<compound-subject>

<compound-subject-part content-type="code">10002951.10003227.10003351.10003444</compound-subject-part>

<compound-subject-part content-type="text">Information systems~Clustering</compound-subject-part>

<compound-subject-part content-type="weight">500</compound-subject-part>

</compound-subject>

</subj-group>

<title-group>

<title>Local Search k-means++ with Foresight</title>

</title-group>

<contrib-group>

<name>

<surname>Conrads</surname>

<given-names>Theo</given-names>

</name>

<aff>Department of Computer Science, University of Cologne, Germany</aff>

<role>Author</role>

</contrib>

<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0001-9395-6711</contrib-id>

<name>

<surname>Drexler</surname>

<given-names>Lukas</given-names>

</name>

<aff>Faculty of Mathematics and Natural Sciences, Department of Computer Science, Heinrich Heine University Düsseldorf, Germany</aff>

<email>lukas.drexler@hhu.de</email>

<role>Author</role>

</contrib>

<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-4245-4812</contrib-id>

<name>

<surname>Könen</surname>

<given-names>Joshua</given-names>

</name>

<aff>Institute of Computer Science, University of Bonn, Germany</aff>

<email>s6jjkoen@uni-bonn.de</email>

<role>Author</role>

</contrib>

<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0001-7381-912X</contrib-id>

<name>

<surname>Schmidt</surname>

<given-names>Daniel R.</given-names>

</name>

<aff>Faculty of Mathematics and Natural Sciences, Department of Computer Science, Heinrich Heine University Düsseldorf, Germany</aff>

<email>dschmidt@hhu.de</email>

<role>Author</role>

</contrib>

<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-4856-3905</contrib-id>

<name>

<surname>Schmidt</surname>

<given-names>Melanie</given-names>

</name>

<aff>Faculty of Mathematics and Natural Sciences, Department of Computer Science, Heinrich Heine University Düsseldorf, Germany</aff>

<email>mschmidt@hhu.de</email>

<role>Author</role>

</contrib>

</contrib-group>

<pub-date date-type="publication">

</pub-date>

<abstract>

<p>Since its introduction in 1957, Lloyd’s algorithm for k-means clustering has been extensively studied and has undergone several improvements. While in its original form it does not guarantee any approximation factor at all, Arthur and Vassilvitskii (SODA 2007) proposed k-means++, which enhances Lloyd’s algorithm with a seeding method that guarantees an 𝒪(log k)-approximation in expectation. More recently, Lattanzi and Sohler (ICML 2019) proposed LS++, which further improves the solution quality of k-means++ by local search techniques to obtain an 𝒪(1)-approximation. On the practical side, the greedy variant of k-means++ is often used, although its worst-case behaviour is provably worse than that of the standard k-means++ variant.</p>

<p>We investigate how to improve LS++ further in practice. We study two options for improving the practical performance: (a) combining LS++ with greedy k-means++ instead of k-means++, and (b) improving LS++ by better entangling it with Lloyd’s algorithm. Option (a) worsens the theoretical guarantees of k-means++ but improves the solution quality in practice, also in combination with LS++, as we confirm in our experiments. Option (b) is our new algorithm, Foresight LS++ (FLS++). We experimentally show that FLS++ improves upon the solution quality of LS++ while retaining the asymptotic runtime and worst-case approximation bounds of LS++.</p>

</abstract>

<kwd-group>

<kwd>k-means clustering</kwd>

<kwd>k-means++</kwd>

<kwd>greedy</kwd>

<kwd>local search</kwd>

</kwd-group>

</book-part-meta>

<back>

<ref-list specific-use="unparsed">

<mixed-citation>Available at https://www.kdd.org/kdd-cup/view/kdd-cup-2004/data.</mixed-citation>

</ref>

<mixed-citation>Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. SIAM J. Comput., 49(4), 2020.</mixed-citation>

</ref>

<mixed-citation>Daniel Aloise, Pierre Hansen, and Leo Liberti. An improved column generation algorithm for minimum sum-of-squares clustering. Mathematical Programming, 131:195-220, 2012.</mixed-citation>

</ref>

<mixed-citation>David Arthur and Sergei Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the 18th SODA, pages 1027-1035, USA, 2007.</mixed-citation>

</ref>

<mixed-citation>Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The hardness of approximation of Euclidean k-means. In Lars Arge and János Pach, editors, Proc. of the 31st SoCG, volume 34 of LIPIcs, pages 754-767, 2015.</mixed-citation>

</ref>

<mixed-citation>Anup Bhattacharya, Jan Eube, Heiko Röglin, and Melanie Schmidt. Noisy, greedy and not so greedy k-means++. In Proc. of the 28th ESA, 2020.</mixed-citation>

</ref>

<mixed-citation>M. Emre Celebi, Hassan A. Kingravi, and Patricio A. Vela. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst. Appl., 40(1):200-210, 2013.</mixed-citation>

</ref>

<mixed-citation>Davin Choo, Christoph Grunau, Julian Portmann, and Václav Rozhoň. k-means++: few more steps yield constant approximation. In International Conference on Machine Learning, pages 1909-1917. PMLR, 2020.</mixed-citation>

</ref>

<mixed-citation>Theo Conrads. Lokale Such- und Samplingmethoden für das k-Means- und k-Median-Problem. Master’s thesis, Universität zu Köln, 2021.</mixed-citation>

</ref>

<mixed-citation>Sanjoy Dasgupta. The hardness of k-means clustering, 2008. Technical report.</mixed-citation>

</ref>

<mixed-citation>

Lukas Drexler, Joshua Könen, Daniel R. Schmidt, Melanie Schmidt, and Giulia Baldini. algo-hhu/FLSpp. Software, URL: https://github.com/algo-hhu/FLSpp (visited on 2024-06-27).

<pub-id pub-id-type="doi" xlink:href="10.4230/artifacts.22470">10.4230/artifacts.22470</pub-id>

</mixed-citation>

</ref>

<mixed-citation>Charles Elkan. Using the triangle inequality to accelerate k-means. In Tom Fawcett and Nina Mishra, editors, Proc. of the 20th ICML, pages 147-153, 2003.</mixed-citation>

</ref>

<mixed-citation>Gereon Frahling and Christian Sohler. A fast k-means implementation using coresets. In Proc. of the 22nd SoCG, pages 135-143, 2006.</mixed-citation>

</ref>

<mixed-citation>Pasi Fränti and Sami Sieranoja. K-means properties on six clustering benchmark datasets. Appl. Intell., 48(12):4743-4759, 2018.</mixed-citation>

</ref>

<mixed-citation>Pasi Fränti and Sami Sieranoja. How much can k-means be improved by using better initialization and repeats? Pattern Recognition, 93:95-112, 2019.</mixed-citation>

</ref>

<mixed-citation>Bernd Fritzke. The k-means-u* algorithm: non-local jumps and greedy retries improve k-means++ clustering. CoRR, abs/1706.09059, 2017.</mixed-citation>

</ref>

<mixed-citation>Christoph Grunau, Ahmet Alper Özüdoğru, Václav Rozhoň, and Jakub Tětek. A nearly tight analysis of greedy k-means++. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1012-1070. SIAM, 2023.</mixed-citation>

</ref>

<mixed-citation>Greg Hamerly. Making k-means even faster. In SDM, pages 130-140. SIAM, 2010.</mixed-citation>

</ref>

<mixed-citation>Grete Heinz, Louis J. Peterson, Roger W. Johnson, and Carter J. Kerk. Exploring relationships in body dimensions. Journal of Statistics Education, 11(2), 2003.</mixed-citation>

</ref>

<mixed-citation>Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2-3):89-112, 2004.</mixed-citation>

</ref>

<mixed-citation>Silvio Lattanzi and Christian Sohler. A better k-means++ algorithm via local search. In Proc. of the 36th ICML, volume 97 of Proceedings of Machine Learning Research, pages 3662-3671. PMLR, 09-15 June 2019.</mixed-citation>

</ref>

<mixed-citation>Euiwoong Lee, Melanie Schmidt, and John Wright. Improved and simplified inapproximability for k-means. Inf. Process. Lett., 120:40-43, 2017.</mixed-citation>

</ref>

<mixed-citation>Meena Mahajan, Prajakta Nimbhorkar, and Kasturi R. Varadarajan. The planar k-means problem is NP-hard. Theor. Comput. Sci., 442:13-21, 2012.</mixed-citation>

</ref>

<mixed-citation>Manfred Padberg and Giovanni Rinaldi. A branch-and-cut algorithm for the resolution of large-scale symmetric traveling salesman problems. SIAM Review, 33(1):60-100, 1991.</mixed-citation>

</ref>

<mixed-citation>F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.</mixed-citation>

</ref>

<mixed-citation>J. M. Peña, J. A. Lozano, and P. Larrañaga. An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Letters, 20(10):1027-1040, 1999.</mixed-citation>

</ref>

<mixed-citation>Dennis Wei. A constant-factor bi-criteria approximation guarantee for k-means++. In Advances in Neural Information Processing Systems, volume 29, 2016.</mixed-citation>

</ref>

<mixed-citation>I.-C. Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797-1808, 1998.</mixed-citation>

</ref>

</ref-list>

</back>

</book-part>

</book-part-wrapper>