On Variants of k-means Clustering

Authors: Sayan Bandyapadhyay, Kasturi Varadarajan





Cite As

Sayan Bandyapadhyay and Kasturi Varadarajan. On Variants of k-means Clustering. In 32nd International Symposium on Computational Geometry (SoCG 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 51, pp. 14:1-14:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016) https://doi.org/10.4230/LIPIcs.SoCG.2016.14

Abstract

Clustering problems often arise in fields like data mining and machine learning. Clustering usually refers to the task of partitioning a collection of objects into groups of similar elements with respect to a similarity (or dissimilarity) measure. Among clustering problems, k-means clustering in particular has received much attention from researchers. Although k-means is a well-studied problem, its approximability in the plane is still open: it is unknown whether it admits a polynomial-time approximation scheme (PTAS) in the plane. The best known approximation bound achievable in polynomial time is $9+\epsilon$.

In this paper, we consider the following variant of k-means. Given a set $C$ of points in $\mathbb{R}^d$ and a real $f > 0$, find a finite set $F$ of points in $\mathbb{R}^d$ that minimizes the quantity $f \cdot |F| + \sum_{p \in C} \min_{q \in F} \|p-q\|^2$. For any fixed dimension $d$, we design a PTAS for this problem that is based on local search. We also give a "bi-criterion" local search algorithm for k-means which uses $(1+\epsilon)k$ centers and yields a solution whose cost is at most $(1+\epsilon)$ times the cost of an optimal k-means solution. The algorithm runs in polynomial time for any fixed dimension.
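To make the objective concrete, the following Python sketch evaluates the cost of a candidate center set (the per-center charge f times the number of centers, plus the sum of squared distances from each point to its nearest center) and runs a simple improvement loop. It is an illustration only and not the algorithm analyzed in the paper: it assumes centers are drawn from a finite candidate set (by default the input points) and changes one center per step. The function names `sq_dist`, `cost`, and `local_search` are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cost(C, F, f):
    """Objective from the abstract: f * |F| + sum over p in C of the
    squared distance from p to its nearest center in F."""
    if not F:
        return float("inf")  # every point needs at least one center
    return f * len(F) + sum(min(sq_dist(p, q) for q in F) for p in C)


def local_search(C, f, candidates=None):
    """Single add/drop/swap local search.

    Assumption (not from the paper): centers are restricted to a finite
    candidate set, and each step changes at most one center.
    """
    candidates = list(C if candidates is None else candidates)
    F = [candidates[0]]
    best = cost(C, F, f)
    improved = True
    while improved:
        improved = False
        unused = [c for c in candidates if c not in F]
        neighbours = [F + [c] for c in unused]                        # add
        neighbours += [F[:i] + F[i + 1:] for i in range(len(F))]      # drop
        neighbours += [F[:i] + [c] + F[i + 1:]
                       for i in range(len(F)) for c in unused]        # swap
        for G in neighbours:
            g = cost(C, G, f)
            if g < best - 1e-12:  # accept only strict improvements
                F, best, improved = G, g, True
                break
    return F, best


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (10, 0), (10, 1)]
    centers, value = local_search(points, f=1)
    print(centers, value)
```

Because every accepted move strictly lowers the cost and the candidate set is finite, the loop terminates. Restricting centers to input points can be suboptimal: on the four points above, the two cluster midpoints (0, 0.5) and (10, 0.5) give a cost of 3, while no subset of the input points does better than 4.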

The contribution of this paper is twofold. On the one hand, we are able to handle the square of distances in an elegant manner, obtaining a near-optimal approximation bound. This leads us towards a better understanding of the k-means problem. On the other hand, our analysis of local search might also be useful for other geometric problems. This matters because little is known about the local search method for geometric approximation.

Keywords
  • k-means
  • Facility location
  • Local search
  • Geometric approximation
