Document Open Access Logo

Min-Sum Clustering (With Outliers)

Authors Sandip Banerjee, Rafail Ostrovsky, Yuval Rabani



PDF
Thumbnail PDF

File

LIPIcs.APPROX-RANDOM.2021.16.pdf
  • Filesize: 0.69 MB
  • 16 pages

Document Identifiers

Author Details

Sandip Banerjee
  • The Hebrew University of Jerusalem, Israel
Rafail Ostrovsky
  • University of California, Los Angeles, CA, USA
Yuval Rabani
  • The Hebrew University of Jerusalem, Israel

Cite AsGet BibTex

Sandip Banerjee, Rafail Ostrovsky, and Yuval Rabani. Min-Sum Clustering (With Outliers). In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 207, pp. 16:1-16:16, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.APPROX/RANDOM.2021.16

Abstract

We give a constant factor polynomial time pseudo-approximation algorithm for min-sum clustering with or without outliers. The algorithm is allowed to exclude an arbitrarily small constant fraction of the points. For instance, we show how to compute a solution that clusters 98% of the input data points and pays no more than a constant factor times the optimal solution that clusters 99% of the input data points. More generally, we give the following bicriteria approximation: For any ε > 0, for any instance with n input points and for any positive integer n' ≤ n, we compute in polynomial time a clustering of at least (1-ε) n' points of cost at most a constant factor greater than the optimal cost of clustering n' points. The approximation guarantee grows with 1/(ε). Our results apply to instances of points in real space endowed with squared Euclidean distance, as well as to points in a metric space, where the number of clusters, and also the dimension if relevant, is arbitrary (part of the input, not an absolute constant).

Subject Classification

ACM Subject Classification
  • Theory of computation → Facility location and clustering
Keywords
  • Clustering
  • approximation algorithms
  • primal-dual

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. S. Ahmadian, A. Norouzi-Fard, O. Svensson, and J. Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. Proc. of the 58th Ann. IEEE Symp. on Foundations of Computer Science, pages 61-72, 2017. Google Scholar
  2. S. Ahmadian and C. Swamy. Approximation algorithms for clustering problems with lower bounds and outliers. In Proc. of the 43rd Int'l Colloq. on Automata, Languages, and Programming, pages 69:1-69:15, 2016. Google Scholar
  3. N. Ailon, R. Jaiswal, and C. Monteleoni. Streaming k-means approximation. In Proc. of the 23rd Ann. Conf. on Neural Information Processing Systems, pages 10-18, 2009. Google Scholar
  4. D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245-248, May 2009. Google Scholar
  5. D. Arthur, B. Manthey, and H. Röglin. Smoothed analysis of the k-means method. J. ACM, 58(5):19:1-19:31, 2011. Google Scholar
  6. D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proc. of the 18th Ann. ACM-SIAM Symp. on Discrete Algorithms, pages 1027-1035, 2007. Google Scholar
  7. P. Awasthi, A. Blum, and O. Sheffet. Stability yields a PTAS for k-median and k-means clustering. In Proc. of the 51st Ann. IEEE Symp. on Foundations of Computer Science, pages 309-318, 2010. Google Scholar
  8. M.-F. Balcan, A. Blum, and A. Gupta. Approximate clustering without the approximation. In Proc. of the 20th Ann. ACM-SIAM Symp. on Discrete Algorithms, pages 1068-1077, 2009. Google Scholar
  9. M.-F. Balcan and M. Braverman. Finding low error clusterings. In COLT 2009 - The 22nd Conference on Learning Theory, 2009. Google Scholar
  10. Y. Bartal. Probabilistic approximation of metric spaces and its algorithmic applications. In Proc. of the 37th Ann. IEEE Symp. on Foundations of Computer Science, page 184, 1996. Google Scholar
  11. Y. Bartal. On approximating arbitrary metrices by tree metrics. In Proc. of the 30th Ann. ACM Symp. on Theory of Computing, pages 161-168, 1998. Google Scholar
  12. Y. Bartal, M. Charikar, and D. Raz. Approximating min-sum k-clustering in metric spaces. In Proc. of the 33rd Ann. ACM Symp. on Theory of Computing, pages 11-20, 2001. Google Scholar
  13. B. Behsaz, Z. Friggstad, M. R. Salavatipour, and R. Sivakumar. Approximation algorithms for min-sum k-clustering and balanced k-median. Algorithmica, 81(3):1006-1030, 2019. Google Scholar
  14. V. Braverman, D. Feldman, and H. Lang. New frameworks for offline and streaming coreset constructions. CoRR, abs/1612.00889, 2016. Google Scholar
  15. V. Braverman, A. Meyerson, R. Ostrovsky, A. Roytman, M. Shindler, and B. Tagiku. Streaming k-means on well-clusterable data. In Proc. of the 22nd Ann. ACM-SIAM Symp. on Discrete Algorithms, pages 26-40, 2011. Google Scholar
  16. P. Bunn and R. Ostrovsky. Secure two-party k-means clustering. In Proc. of the 14th Ann. ACM Conf. on Computer and Communications Security, pages 486-497, 2007. Google Scholar
  17. K. Chen. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM J. Comput., 39:923-947, 2009. Google Scholar
  18. J. Chuzhoy and Y. Rabani. Approximating k-median with non-uniform capacities. In Proc. of the 16th Ann. ACM-SIAM Symp. on Discrete Algorithms, page 952–958, 2005. Google Scholar
  19. V. Cohen-Addad, Karthik C. S., and E. Lee. On approximability of k-means, k-median, and k-minsum clustering, 2019. Google Scholar
  20. V. Cohen-Addad and Karthik C.S. Inapproximability of clustering in l_p metrics. In Proc. of the 60th Ann. IEEE Symp. on Foundations of Computer Science, pages 519-539, 2019. Google Scholar
  21. V. Cohen-Addad, P. N. Klein, and C. Mathieu. Local search yields approximation schemes for k-means and k-median in Euclidean and minor-free metrics. Proc. of the 57th Ann. IEEE Symp. on Foundations of Computer Science, pages 353-364, 2016. Google Scholar
  22. Vincent Cohen-Addad and Chris Schwiegelshohn. On the local structure of stable clustering instances. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 49-60. IEEE Computer Society, 2017. URL: https://doi.org/10.1109/FOCS.2017.14.
  23. A. Czumaj and C. Sohler. Small space representations for metric min-sum k-clustering and their applications. In Proc. of the 24th Ann. Conf. on Theoretical Aspects of Computer Science, pages 536-548, 2007. Google Scholar
  24. H. G. Demirci and S. Li. Constant approximation for capacitated k-median with (1 + ε)-capacity violation. ArXiv, abs/1603.02324, 2016. Google Scholar
  25. J. Fakcharoenphol, S. Rao, and K. Talwar. A tight bound on approximating arbitrary metrics by tree metrics. In Proc. of the 35th Ann. ACM Symp. on Theory of Computing, pages 448-455, 2003. Google Scholar
  26. D. Feldman and M. Langberg. A unified framework for approximating and clustering data. In Proc. of the 43rd Ann. ACM Symp. on Theory of Computing, pages 569-578, 2011. Google Scholar
  27. D. Feldman, M. Monemizadeh, and C. Sohler. A PTAS for k-means clustering based on weak coresets. In Proc. of the 23rd Ann. Symp. on Computational Geometry, pages 11-18, 2007. Google Scholar
  28. W. Fernandez de la Vega, M. Karpinski, C. Kenyon, and Y. Rabani. Approximation schemes for clustering problems. In Proc. of the 35th Ann. ACM Symp. on Theory of Computing, pages 50-58, 2003. Google Scholar
  29. W. Fernandez de la Vega and C. Kenyon. A randomized approximation scheme for metric MAX-CUT. Journal of Computer and System Sciences, 63(4):531-541, 2001. Google Scholar
  30. Z. Friggstad, M. Rezapour, and M. R. Salavatipour. Local search yields a PTAS for k-means in doubling metrics. Proc. of the 57th Ann. IEEE Symp. on Foundations of Computer Science, pages 365-374, 2016. Google Scholar
  31. V. Guruswami and P. Indyk. Embeddings and non-approximability of geometric problems. In Proc. of the 14th Ann. ACM-SIAM Symp. on Discrete Algorithms, pages 537-538, 2003. Google Scholar
  32. N. Guttmann-Beck and R. Hassin. Approximation algorithms for min-sum p-clustering. Discret. Appl. Math., 89(1-3):125-142, 1998. Google Scholar
  33. S. Har-Peled and A. Kushal. Smaller coresets for k-median and k-means clustering. Discrete Comput. Geom., 37(1):3-19, 2007. Google Scholar
  34. R. Hassin and E. Or. Min sum clustering with penalties. European Journal of Operational Research, 206(3):547-554, 2010. Google Scholar
  35. M. Inaba, N. Katoh, and H. Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In Proc. of the 10th Ann. Symp. on Computational Geometry, pages 332-339, 1994. Google Scholar
  36. P. Indyk. A sublinear time approximation scheme for clustering in metric spaces. In Proc. of th 40th Ann. IEEE Symp. on Foundations of Computer Science, pages 154-159, 1999. Google Scholar
  37. K. Jain and V. V. Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. J. ACM, 48(2):274-296, 2001. Google Scholar
  38. R. Jaiswal and N. Garg. Analysis of k-means++ for separable data. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 591-602, 2012. Google Scholar
  39. J. Kleinberg. An impossibility theorem for clustering. In Proc. of the 15th Int'l Conf. on Neural Information Processing Systems, pages 463-470, 2002. Google Scholar
  40. A. Kumar and R. Kannan. Clustering with spectral norm and the k-means algorithm. In Proc. of the 51st Ann. IEEE Symp. on Foundations of Computer Science, pages 299-308, 2010. Google Scholar
  41. A. Kumar, Y. Sabharwal, and S. Sen. Linear time algorithms for clustering problems in any dimensions. In Proc. of the 32nd Int'l Conf. on Automata, Languages and Programming, pages 1374-1385, 2005. Google Scholar
  42. M. Mahajan, P. Nimbhorkar, and K. Varadarajan. The planar k-means problem is NP-hard. In Proc. of the 3rd Int'l Workshop on Algorithms and Computation, pages 274-285, 2009. Google Scholar
  43. J. Matoušek. On approximate geometric k-clustering. Discrete & Computational Geometry, 24(1):61-84, January 2000. Google Scholar
  44. R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In Proc. of the 47th Ann. IEEE Symp. on Foundations of Computer Science, pages 165-176, 2006. Google Scholar
  45. L. J. Schulman. Clustering for edge-cost minimization. In Proc. of the 32nd Ann. ACM Symp. on Theory of Computing, pages 547-555, 2000. Google Scholar
  46. K. Voevodski, M.-F. Balcan, H. Röglin, S.-H. Teng, and Y. Xia. Min-sum clustering of protein sequences with limited distance information. In Proc. of the 1st Int'l Conf. on Similarity-Based Pattern Recognition, pages 192-206, 2011. Google Scholar
  47. R. B. Zadeh and S. Ben-David. A uniqueness theorem for clustering. In Proc. of the 25th Ann. Conf. on Uncertainty in Artificial Intelligence, pages 639-646, 2009. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail