On Sampling Based Algorithms for k-Means

Authors Anup Bhattacharya, Dishant Goyal, Ragesh Jaiswal, Amit Kumar



PDF
Thumbnail PDF

File

LIPIcs.FSTTCS.2020.13.pdf
  • Filesize: 0.63 MB
  • 17 pages

Document Identifiers

Author Details

Anup Bhattacharya
  • Indian Statistical Institute Kolkata, India
Dishant Goyal
  • Indian Institute of Technology Delhi, India
Ragesh Jaiswal
  • Indian Institute of Technology Delhi, India
Amit Kumar
  • Indian Institute of Technology Delhi, India

Acknowledgements

The authors would like to thank Sanjeev Khanna and Sepehr Assadi for allowing us to use their impossibility argument for the chromatic k-means problem. Anup Bhattacharya would like to thank SERB-National Post Doctoral Fellowship, India. Dishant Goyal would like to thank TCS Research Scholar Program.

Cite AsGet BibTex

Anup Bhattacharya, Dishant Goyal, Ragesh Jaiswal, and Amit Kumar. On Sampling Based Algorithms for k-Means. In 40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 182, pp. 13:1-13:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.FSTTCS.2020.13

Abstract

We generalise the results of Bhattacharya et al. [Bhattacharya et al., 2018] for the list-k-means problem defined as - for a (unknown) partition X₁, ..., X_k of the dataset X ⊆ ℝ^d, find a list of k-center-sets (each element in the list is a set of k centers) such that at least one of k-center-sets {c₁, ..., c_k} in the list gives an (1+ε)-approximation with respect to the cost function min_{permutation π} [∑_{i = 1}^{k} ∑_{x ∈ X_i} ||x - c_{π(i)}||²]. The list-k-means problem is important for the constrained k-means problem since algorithms for the former can be converted to {PTAS} for various versions of the latter. The algorithm for the list-k-means problem by Bhattacharya et al. is a D²-sampling based algorithm that runs in k iterations. Making use of a constant factor solution for the (classical or unconstrained) k-means problem, we generalise the algorithm of Bhattacharya et al. in two ways - (i) for any fixed set X_{j₁}, ..., X_{j_t} of t ≤ k clusters, the algorithm produces a list of (k/(ε))^{O(t/(ε))} t-center sets such that (w.h.p.) at least one of them is good for X_{j₁}, ..., X_{j_t}, and (ii) the algorithm runs in a single iteration. Following are the consequences of our generalisations: 1) Faster PTAS under stability and a parameterised reduction: Property (i) of our generalisation is useful in scenarios where finding good centers becomes easier once good centers for a few "bad" clusters have been chosen. One such case is clustering under stability of Awasthi et al. [Awasthi et al., 2010] where the number of such bad clusters is a constant. Using property (i), we significantly improve the running time of their algorithm from O(dn³) (k log{n})^{poly(1/(β), 1/(ε))} to O (dn³ (k/(ε)) ^{O(1/βε²)}). Another application is a parameterised reduction from the outlier version of k-means to the classical one where the bad clusters are the outliers. 2) Streaming algorithms: The sampling algorithm running in a single iteration (i.e., property (ii)) allows us to design a constant-pass, logspace streaming algorithm for the list-k-means problem. This can be converted to a constant-pass, logspace streaming PTAS for various constrained versions of the k-means problem. In particular, this gives a 3-pass, polylog-space streaming PTAS for the constrained binary k-means problem which in turn gives a 4-pass, polylog-space streaming PTAS for the generalised binary 𝓁₀-rank-r approximation problem. This is the first constant pass, polylog-space streaming algorithm for either of the two problems. Coreset based techniques, which is another approach for designing streaming algorithms in general, is not known to work for the constrained binary k-means problem to the best of our knowledge.

Subject Classification

ACM Subject Classification
  • Theory of computation → Streaming, sublinear and near linear time algorithms
  • Theory of computation → Facility location and clustering
Keywords
  • k-means
  • low rank approximation

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In Proceedings of the 12th International Workshop and 13th International Workshop on Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX '09 / RANDOM '09, pages 15-28, Berlin, Heidelberg, 2009. Springer-Verlag. URL: https://doi.org/10.1007/978-3-642-03685-9_2.
  2. S. Ahmadian, A. Norouzi-Fard, O. Svensson, and J. Ward. Better guarantees for k-means and euclidean k-median by primal-dual algorithms. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 61-72, October 2017. URL: https://doi.org/10.1109/FOCS.2017.15.
  3. Nir Ailon, Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Approximate Clustering with Same-Cluster Queries. In Anna R. Karlin, editor, 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), volume 94 of Leibniz International Proceedings in Informatics (LIPIcs), pages 40:1-40:21, Dagstuhl, Germany, 2018. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. URL: https://doi.org/10.4230/LIPIcs.ITCS.2018.40.
  4. David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, SODA '07, pages 1027-1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=1283383.1283494.
  5. Pranjal Awasthi, Avrim Blum, and Or Sheffet. Stability yields a PTAS for k-median and k-means clustering. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS '10, pages 309-318, Washington, DC, USA, 2010. IEEE Computer Society. URL: https://doi.org/10.1109/FOCS.2010.36.
  6. Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The Hardness of Approximation of Euclidean k-Means. In Lars Arge and János Pach, editors, 31st International Symposium on Computational Geometry (SoCG 2015), volume 34 of Leibniz International Proceedings in Informatics (LIPIcs), pages 754-767, Dagstuhl, Germany, 2015. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. URL: https://doi.org/10.4230/LIPIcs.SOCG.2015.754.
  7. Frank Ban, Vijay Bhattiprolu, Karl Bringmann, Pavel Kolev, Euiwoong Lee, and David P. Woodruff. A PTAS for 𝓁_p-low rank approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '19, pages 747-766, Philadelphia, PA, USA, 2019. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=3310435.3310482.
  8. Aditya Bhaskara, Sharvaree Vadgama, and Hong Xu. Greedy sampling for approximate clustering in the presence of outliers. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d' Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 11146-11155. Curran Associates, Inc., 2019. URL: http://papers.nips.cc/paper/9294-greedy-sampling-for-approximate-clustering-in-the-presence-of-outliers.pdf.
  9. Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Faster algorithms for the constrained k-means problem. Theory of Computing Systems, 62(1):93-115, January 2018. Google Scholar
  10. Vladimir Braverman, Adam Meyerson, Rafail Ostrovsky, Alan Roytman, Michael Shindler, and Brian Tagiku. Streaming k-means on well-clusterable data. In Proceedings of the Twenty-second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '11, pages 26-40, Philadelphia, PA, USA, 2011. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=2133036.2133039.
  11. Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’01, page 642–651, USA, 2001. Society for Industrial and Applied Mathematics. Google Scholar
  12. Sanjay Chawla and Aristides Gionis. k-means: A unified approach to clustering and outlier detection, pages 189-197. Society for Industrial and Applied Mathematics, 2013. URL: https://doi.org/10.1137/1.9781611972832.21.
  13. Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’08, page 826–835, USA, 2008. Society for Industrial and Applied Mathematics. Google Scholar
  14. V. Cohen-Addad and Karthik C.S. Inapproximability of clustering in lp metrics. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 519-539, November 2019. URL: https://doi.org/10.1109/FOCS.2019.00040.
  15. Vincent Cohen-Addad, Philip N. Klein, and Claire Mathieu. Local search yields approximation schemes for k-means and k-median in euclidean and minor-free metrics. 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), 00:353-364, 2016. URL: https://doi.org/10.1109/FOCS.2016.46.
  16. Vincent Cohen-Addad and Chris Schwiegelshohn. On the local structure of stable clustering instances. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 49-60, 2017. URL: https://doi.org/10.1109/FOCS.2017.14.
  17. Sanjoy Dasgupta. The hardness of k-means clustering. Technical Report CS2008-0916, Department of Computer Science and Engineering, University of California San Diego, 2008. Google Scholar
  18. Hu Ding and Jinhui Xu. A unified framework for clustering constrained data without locality property. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '15, pages 1471-1490, 2015. URL: https://doi.org/10.1137/1.9781611973730.97.
  19. Dan Feldman, Morteza Monemizadeh, and Christian Sohler. A PTAS for k-means clustering based on weak coresets. In Proceedings of the twenty-third annual symposium on Computational geometry, SCG '07, pages 11-18, New York, NY, USA, 2007. ACM. URL: https://doi.org/10.1145/1247069.1247072.
  20. Hendrik Fichtenberger, Marc Gillé, Melanie Schmidt, Chris Schwiegelshohn, and Christian Sohler. Bico: Birch meets coresets for k-means clustering. In Hans L. Bodlaender and Giuseppe F. Italiano, editors, Algorithms - ESA 2013, pages 481-492, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. Google Scholar
  21. Fedor V. Fomin, Petr A. Golovach, Daniel Lokshtanov, Fahad Panolan, and Saket Saurabh. Approximation schemes for low-rank binary matrix approximation problems. CoRR, abs/1807.07156, 2018. URL: http://arxiv.org/abs/1807.07156.
  22. Zachary Friggstad, Kamyar Khodamoradi, Mohsen Rezapour, and Mohammad R. Salavatipour. Approximation schemes for clustering with outliers. ACM Trans. Algorithms, 15(2), February 2019. URL: https://doi.org/10.1145/3301446.
  23. Zachary Friggstad, Mohsen Rezapour, and Mohammad R. Salavatipour. Local search yields a PTAS for k-means in doubling metrics. 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), 00:365-374, 2016. URL: https://doi.org/10.1109/FOCS.2016.47.
  24. Buddhima Gamlath, Sangxia Huang, and Ola Svensson. Semi-Supervised Algorithms for Approximately Optimal and Accurate Clustering. In Ioannis Chatzigiannakis, Christos Kaklamanis, Dániel Marx, and Donald Sannella, editors, 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018), volume 107 of Leibniz International Proceedings in Informatics (LIPIcs), pages 57:1-57:14, Dagstuhl, Germany, 2018. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. URL: https://doi.org/10.4230/LIPIcs.ICALP.2018.57.
  25. Shalmoli Gupta, Ravi Kumar, Kefu Lu, Benjamin Moseley, and Sergei Vassilvitskii. Local search methods for k-means with outliers. Proc. VLDB Endow., 10(7):757?768, March 2017. URL: https://doi.org/10.14778/3067421.3067425.
  26. Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, STOC '04, pages 291-300, New York, NY, USA, 2004. ACM. URL: https://doi.org/10.1145/1007352.1007400.
  27. Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering: (extended abstract). In Proceedings of the tenth annual symposium on Computational geometry, SCG '94, pages 332-339, New York, NY, USA, 1994. ACM. URL: https://doi.org/10.1145/177424.178042.
  28. Ragesh Jaiswal, Amit Kumar, and Sandeep Sen. A simple D²-sampling based PTAS for k-means and other clustering problems. Algorithmica, 70(1):22-46, 2014. URL: https://doi.org/10.1007/s00453-013-9833-9.
  29. Ragesh Jaiswal, Mehul Kumar, and Pulkit Yadav. Improved analysis of D²-sampling based PTAS for k-means and other clustering problems. Information Processing Letters, 115(2):100-103, 2015. URL: https://doi.org/10.1016/j.ipl.2014.07.009.
  30. Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. In Proc. 18th Annual Symposium on Computational Geometry, pages 10-18, 2002. URL: https://doi.org/10.1145/513400.513402.
  31. Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant approximation for k-median and k-means with outliers via iterative rounding. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, page 646–659, New York, NY, USA, 2018. Association for Computing Machinery. URL: https://doi.org/10.1145/3188745.3188882.
  32. Amit Kumar, Yogish Sabharwal, and Sandeep Sen. Linear-time approximation schemes for clustering problems in any dimensions. J. ACM, 57(2):5:1-5:32, February 2010. URL: https://doi.org/10.1145/1667053.1667054.
  33. Ravi Kumar, Rina Panigrahy, Ali Rahimi, and David Woodruff. Faster algorithms for binary matrix factorization. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3551-3559, Long Beach, California, USA, 09-15 June 2019. PMLR. URL: http://proceedings.mlr.press/v97/kumar19a.html.
  34. Nathan Linial, Eran London, and Yuri Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15(2):215-245, June 1995. URL: https://doi.org/10.1007/BF01200757.
  35. Mario Lucic, Matthew Faulkner, Andreas Krause, and Dan Feldman. Training gaussian mixture models at scale via coresets. J. Mach. Learn. Res., 18(1):5885-5909, January 2017. URL: http://dl.acm.org/citation.cfm?id=3122009.3242017.
  36. Meena Mahajan, Prajakta Nimbhorkar, and Kasturi Varadarajan. The planar k-means problem is NP-hard. Theor. Comput. Sci., 442:13-21, July 2012. URL: https://doi.org/10.1016/j.tcs.2010.05.034.
  37. Konstantin Makarychev, Yury Makarychev, and Ilya P. Razenshteyn. Performance of johnson-lindenstrauss transform for k-means and k-medians clustering. CoRR, abs/1811.03195, 2018. URL: http://arxiv.org/abs/1811.03195.
  38. Melanie Schmidt, Chris Schwiegelshohn, and Christian Sohler. Fair coresets and streaming algorithms for fair k-means. In Evripidis Bampis and Nicole Megow, editors, Approximation and Online Algorithms, pages 232-251, Cham, 2020. Springer International Publishing. Google Scholar
  39. Andrea Vattani. The hardness of k-means clustering in the plane. Technical report, Department of Computer Science and Engineering, University of California San Diego, 2009. Google Scholar
  40. J S Vitter. Random sampling with a reservoir. ACM Trans. Math. Software, 11(1):37-57, 1985. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail