Min-Sum Clustering (With Outliers)

Banerjee, Sandip; Ostrovsky, Rafail; Rabani, Yuval

doi:10.4230/LIPIcs.APPROX/RANDOM.2021.16

Abstract

We give a constant-factor, polynomial-time pseudo-approximation algorithm for min-sum clustering with or without outliers. The algorithm is allowed to exclude an arbitrarily small constant fraction of the points. For instance, we show how to compute a solution that clusters 98% of the input data points and pays no more than a constant factor times the cost of the optimal solution that clusters 99% of the input data points. More generally, we give the following bicriteria approximation: for any ε > 0, any instance with n input points, and any positive integer n' ≤ n, we compute in polynomial time a clustering of at least (1-ε) n' points whose cost is at most a constant factor greater than the optimal cost of clustering n' points. The approximation guarantee grows with 1/ε. Our results apply to instances of points in real space endowed with squared Euclidean distance, as well as to points in a metric space, where the number of clusters, and also the dimension if relevant, is arbitrary (part of the input, not an absolute constant).
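To make the objective concrete, the following is a minimal sketch of how the cost of a min-sum clustering with outliers could be evaluated. It is not the algorithm of the paper. The function names, the use of `None` as an outlier label, and the convention of counting each unordered intra-cluster pair once are assumptions made for illustration only.

```python
from itertools import combinations


def sq_euclidean(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def min_sum_cost(points, labels, dist=sq_euclidean):
    """Sum of pairwise distances inside each cluster.

    Assumption: a label of None marks an outlier, which is excluded
    from the cost; each unordered pair is counted once.
    """
    clusters = {}
    for point, label in zip(points, labels):
        if label is None:
            continue  # outliers do not contribute to the cost
        clusters.setdefault(label, []).append(point)
    return sum(
        dist(p, q)
        for members in clusters.values()
        for p, q in combinations(members, 2)
    )


def clustered_fraction(labels):
    """Fraction of input points that are assigned to some cluster."""
    return sum(label is not None for label in labels) / len(labels)


# Hypothetical toy instance: two tight groups and one far-away point.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (50, 50)]
with_outlier = [0, 0, 0, 1, 1, None]   # far point excluded
without_outlier = [0, 0, 0, 1, 1, 1]   # far point forced into a cluster

print(min_sum_cost(points, with_outlier), clustered_fraction(with_outlier))
print(min_sum_cost(points, without_outlier), clustered_fraction(without_outlier))
```

On this toy instance, excluding the single distant point lowers the cost sharply, which is the trade-off the bicriteria guarantee quantifies: the computed solution may cluster slightly fewer points, at least (1-ε) n', and its cost is compared with the optimal cost of clustering n' points.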

