Accurate MapReduce Algorithms for k-Median and k-Means in General Metric Spaces
Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular k-median and k-means variants which, given a set P of points from a metric space and a parameter k<|P|, require to identify a set S of k centers minimizing, respectively, the sum of the distances and of the squared distances of all points in P from their closest centers. Our specific focus is on general metric spaces, for which it is reasonable to require that the centers belong to the input set (i.e., S subseteq P). We present coreset-based 3-round distributed approximation algorithms for the above problems using the MapReduce computational model. The algorithms are rather simple and obliviously adapt to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space. Remarkably, the algorithms attain approximation ratios that can be made arbitrarily close to those achievable by the best known polynomial-time sequential approximations, and they are very space efficient for small D, requiring local memory sizes substantially sublinear in the input size. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance guarantees in general metric spaces.
Clustering
k-median
k-means
MapReduce
Coreset
Theory of computation~Approximation algorithms analysis
Theory of computation~Facility location and clustering
Theory of computation~MapReduce algorithms
34:1-34:16
Regular Paper
This work was supported, in part, by the University of Padova under grant SID2017 and by MIUR, the Italian Ministry of Education, University and Research, under grant PRIN AHeAD: efficient Algorithms for HArnessing networked Data and grant L. 232 "Dipartimenti di Eccellenza’’.
Alessio
Mazzetto
Alessio Mazzetto
Department of Computer Science, Brown University, Providence, USA
Andrea
Pietracaprina
Andrea Pietracaprina
Department of Information Engineering, University of Padova, Padova, Italy
Geppino
Pucci
Geppino Pucci
Department of Information Engineering, University of Padova, Padova, Italy
10.4230/LIPIcs.ISAAC.2019.34
D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In Proc. 18th ACM-SIAM SODA, pages 1027-1035, 2007.
V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit. Local Search Heuristics for k-Median and Facility Location Problems. SIAM J. Comput., 33(3):544-562, 2004.
P. Awasthi and M.F. Balcan. Center based clustering: A foundational perspective. In Handbook of cluster analysis. CRC Press, 2015.
O. Bachem, M. Lucic, and A. Krause. Scalable k -Means Clustering via Lightweight Coresets. In Proc. 24th ACM KDD, pages 1119-1127, 2018.
B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. Scalable K-Means++. PVLDB, 5(7):622-633, 2012.
M.F. Balcan, S. Ehrlich, and Y. Liang. Distributed k-means and k-median clustering on general communication topologies. In Proc. 27th NIPS, pages 1995-2003, 2013.
M. Ceccarello, A. Pietracaprina, and G. Pucci. Solving k-center Clustering (with Outliers) in MapReduce and Streaming, almost as Accurately as Sequentially. PVLDB, 12(7), 2019.
E. Cohen, S. Chechik, and H. Kaplan. Clustering Small Samples With Quality Guarantees: Adaptivity With One2all PPS. In Proc. 32nd AAAI, pages 2884-2891, 2018.
J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, 2008.
A. Ene, S. Im, and B. Moseley. Fast Clustering Using MapReduce. In Proc. 17th ACM KDD, pages 681-689, 2011.
D. Feldman and M. Langberg. A Unified Framework for Approximating and Clustering Data. In Proc. 43rd ACM STOC, pages 569-578, 2011.
A. Gupta and K. Tangwongsan. Simpler Analyses of Local Search Algorithms for Facility Location. CoRR, abs/0809.2554, 2008. URL: http://arxiv.org/abs/0809.2554.
http://arxiv.org/abs/0809.2554
S. Har-Peled and A. Kushal. Smaller Coresets for K-median and K-means Clustering. In Proc. 21st SCG, pages 126-134, 2005.
S. Har-Peled and S. Mazumdar. On Coresets for K-means and K-median Clustering. In Proc. 36th ACM STOC, pages 291-300, 2004.
C. Hennig, M. Meila, F. Murtagh, and R. Rocci. Handbook of cluster analysis. CRC Press, 2015.
L. Huang, S. Jiang, J. Li, and X. Wu. Epsilon-Coresets for Clustering (with Outliers) in Doubling Metrics. In Proc. 59th IEEE FOCS, pages 814-825, 2018.
P. Indyk, S. Mahabadi, M. Mahdian, and V.S. Mirrokni. Composable Core-sets for Diversity and Coverage Maximization. In Proc. 33rd ACM PODS, pages 100-108, 2014.
T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A Local Search Approximation Algorithm for K-means Clustering. In Proc. 18th SCG, pages 10-18, 2002.
L. Kaufmann and P. Rousseeuw. Clustering by Means of Medoids. Data Analysis based on the L1-Norm and Related Methods, pages 405-416, 1987.
J. Leskovec, A. Rajaraman, and J.D. Ullman. Mining of Massive Datasets, 2nd Ed. Cambridge University Press, 2014.
J. M. Phillips. Coresets and Sketches. Handbook of Discrete and Computational Geometry, 3rd Ed, 2016.
A. Pietracaprina, G. Pucci, M. Riondato, F. Silvestri, and E. Upfal. Space-Round Tradeoffs for MapReduce Computations. In Proc. 26th ACM ICS, pages 235-244, 2012.
C. Sohler and D. P. Woodruff. Strong Coresets for k-Median and Subspace Approximation: Goodbye Dimension. In Proc. 59th IEEE FOCS, pages 802-813, 2018.
H. Song, J.G. Lee, and W.S. Han. PAMAE: parallel k-medoids clustering with high accuracy and efficiency. In Proc. 23rd ACM KDD, pages 1087-1096, 2017.
D. Wei. A Constant-Factor Bi-Criteria Approximation Guarantee for k-means++. In Proc. 30th NIPS, pages 604-612, 2016.
Alessio Mazzetto, Andrea Pietracaprina, and Geppino Pucci
Creative Commons Attribution 3.0 Unported license
https://creativecommons.org/licenses/by/3.0/legalcode