Faster Algorithms for the Constrained k-Means Problem

Authors: Anup Bhattacharya, Ragesh Jaiswal, Amit Kumar



Cite As

Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Faster Algorithms for the Constrained k-Means Problem. In 33rd Symposium on Theoretical Aspects of Computer Science (STACS 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 47, pp. 16:1-16:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)
https://doi.org/10.4230/LIPIcs.STACS.2016.16

Abstract

Classical center-based clustering problems such as k-means/median/center assume that the optimal clusters satisfy a locality property: points in the same cluster are close to each other. A number of clustering problems arising in machine learning have optimal clusters that do not follow such a locality property. For instance, consider the r-gather clustering problem, where each cluster is additionally required to contain at least r points, or the capacitated clustering problem, where there is an upper bound on the cluster sizes. Consider a variant of the k-means problem that may be regarded as a general version of such problems. Here, the optimal clusters O_1, ..., O_k form an arbitrary partition of the dataset, and the goal is to output k centers c_1, ..., c_k minimizing the objective function sum_{i=1}^{k} sum_{x in O_i} ||x - c_i||^2. It is not difficult to argue that any algorithm (without knowing the optimal clusters) that outputs a single set of k centers will not behave well with respect to this objective function. However, this does not rule out the existence of algorithms that output a list of k-center sets such that at least one set in the list behaves well. Given an error parameter epsilon > 0, let l denote the size of the smallest list of k-center sets such that at least one of them gives a (1+epsilon)-approximation with respect to the objective function above. In this paper, we show an upper bound on l by giving a randomized algorithm that outputs a list of 2^{~O(k/epsilon)} k-center sets. We also give a closely matching lower bound of 2^{~Omega(k/sqrt{epsilon})}. Moreover, our algorithm runs in time O(n * d * 2^{~O(k/epsilon)}). This is a significant improvement over the previous result of Ding and Xu, whose algorithm runs in time O(n * d * (log n)^{k} * 2^{poly(k/epsilon)}) and outputs a list of size O((log n)^{k} * 2^{poly(k/epsilon)}). Our techniques generalize to the k-median problem and to many other settings involving non-Euclidean distance measures.
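
To make the list-of-candidates guarantee concrete, here is a minimal Python sketch (illustrative code, not the paper's algorithm; the function names, the NumPy dependency, and the toy data are all assumptions) that evaluates the constrained objective sum_{i} sum_{x in O_i} ||x - c_i||^2 for a fixed partition against each candidate k-center set in a list and keeps the cheapest one. Each cluster is charged to whichever candidate center serves it best, since the indexing of the centers in the objective can be chosen freely.

```python
# Minimal sketch (assumed helper names, NumPy-based; not the paper's algorithm):
# evaluate the constrained k-means objective for a fixed partition O_1, ..., O_k
# against each candidate k-center set in a list, and keep the cheapest candidate.
import numpy as np


def constrained_cost(partition, centers):
    """sum_i min_{c in centers} sum_{x in O_i} ||x - c||^2.

    partition: list of k arrays; partition[i] has shape (n_i, d) and holds O_i.
    centers:   array of shape (k, d), one candidate center per row.
    Each cluster is charged to its best candidate center, since the pairing of
    clusters with centers in the objective can be chosen freely.
    """
    total = 0.0
    for O_i in partition:
        # Squared distance from every point of O_i to every center, summed per center.
        per_center = ((O_i[:, None, :] - centers[None, :, :]) ** 2).sum(axis=(0, 2))
        total += per_center.min()
    return total


def best_centers_from_list(partition, candidate_center_sets):
    """Return the candidate k-center set with the smallest constrained cost."""
    return min(candidate_center_sets, key=lambda C: constrained_cost(partition, C))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy partition into k = 2 clusters in d = 2 dimensions.
    partition = [rng.normal(0.0, 1.0, size=(20, 2)),
                 rng.normal(5.0, 1.0, size=(30, 2))]
    # Hypothetical list of candidate 2-center sets; the paper's algorithm would
    # generate 2^{~O(k/epsilon)} such candidates via sampling.
    candidates = [rng.normal(2.5, 3.0, size=(2, 2)) for _ in range(16)]
    best = best_centers_from_list(partition, candidates)
    print("best constrained cost:", constrained_cost(partition, best))
```

If at least one candidate in the list is a (1+epsilon)-approximation for the fixed partition, the selected set is as well; the contribution of the paper is to produce such a list of size 2^{~O(k/epsilon)} in time O(n * d * 2^{~O(k/epsilon)}).
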
Keywords
  • k-means
  • k-median
  • approximation algorithm
  • sampling

References

  1. Marcel R. Ackermann, Johannes Blömer, and Christian Sohler. Clustering for metric and nonmetric distance measures. ACM Trans. Algorithms, 6:59:1-59:26, September 2010. URL: http://doi.acm.org/10.1145/1824777.1824779.
  2. Mihai Bădoiu, Sariel Har-Peled, and Piotr Indyk. Approximate clustering via core-sets. In Proc. of the 34th Annual ACM Symp. on Theory of Computing, STOC'02, pages 250-257, New York, NY, USA, 2002. ACM. URL: http://dx.doi.org/10.1145/509907.509947.
  3. Ke Chen. On k-median clustering in high dimensions. In Proc. of the 17th Annual ACM-SIAM Symp. on Discrete Algorithm, SODA'06, pages 1177-1185, New York, NY, USA, 2006. ACM. URL: http://dx.doi.org/10.1145/1109557.1109687.
  4. W. Fernandez de la Vega, Marek Karpinski, Claire Kenyon, and Yuval Rabani. Approximation schemes for clustering problems. In Proc. of the 35th Annual ACM Symp. on Theory of Computing, STOC'03, pages 50-58, New York, NY, USA, 2003. ACM. URL: http://dx.doi.org/10.1145/780542.780550.
  5. Hu Ding and Jinhui Xu. A unified framework for clustering constrained data without locality property. In Proc. of the 26th Annual ACM-SIAM Symp. on Discrete Algorithms, SODA'15, pages 1471-1490, 2015. URL: http://dx.doi.org/10.1137/1.9781611973730.97.
  6. Dan Feldman, Morteza Monemizadeh, and Christian Sohler. A PTAS for k-means clustering based on weak coresets. In Proc. of the 23rd Annual Symp. on Computational Geometry, SoCG'07, pages 11-18, New York, NY, USA, 2007. ACM. URL: http://dx.doi.org/10.1145/1247069.1247072.
  7. Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In Proc. of the 36th Annual ACM Symp. on Theory of Computing, STOC'04, pages 291-300, New York, NY, USA, 2004. ACM. URL: http://dx.doi.org/10.1145/1007352.1007400.
  8. Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering: (extended abstract). In Proc. of the 10th Annual Symp. on Computational Geometry, SoCG'94, pages 332-339, New York, NY, USA, 1994. ACM. URL: http://dx.doi.org/10.1145/177424.178042.
  9. Ragesh Jaiswal, Amit Kumar, and Sandeep Sen. A simple D²-sampling based PTAS for k-means and other clustering problems. Algorithmica, 70(1):22-46, 2014. URL: http://dx.doi.org/10.1007/s00453-013-9833-9.
  10. Ragesh Jaiswal, Mehul Kumar, and Pulkit Yadav. Improved analysis of D²-sampling based PTAS for k-means and other clustering problems. Information Processing Letters, 115(2):100-103, 2015. URL: http://dx.doi.org/10.1016/j.ipl.2014.07.009.
  11. Amit Kumar, Yogish Sabharwal, and Sandeep Sen. Linear-time approximation schemes for clustering problems in any dimensions. J. ACM, 57(2):5:1-5:32, February 2010. URL: http://dx.doi.org/10.1145/1667053.1667054.
  12. J. Matoušek. On approximate geometric k-clustering. Discrete and Computational Geometry, 24(1):61-84, 2000. URL: http://dx.doi.org/10.1007/s004540010019.