Accurate MapReduce Algorithms for k-Median and k-Means in General Metric Spaces

Authors Alessio Mazzetto, Andrea Pietracaprina, Geppino Pucci



PDF
Thumbnail PDF

File

LIPIcs.ISAAC.2019.34.pdf
  • Filesize: 0.53 MB
  • 16 pages

Document Identifiers

Author Details

Alessio Mazzetto
  • Department of Computer Science, Brown University, Providence, USA
Andrea Pietracaprina
  • Department of Information Engineering, University of Padova, Padova, Italy
Geppino Pucci
  • Department of Information Engineering, University of Padova, Padova, Italy

Cite AsGet BibTex

Alessio Mazzetto, Andrea Pietracaprina, and Geppino Pucci. Accurate MapReduce Algorithms for k-Median and k-Means in General Metric Spaces. In 30th International Symposium on Algorithms and Computation (ISAAC 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 149, pp. 34:1-34:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.ISAAC.2019.34

Abstract

Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular k-median and k-means variants which, given a set P of points from a metric space and a parameter k<|P|, require to identify a set S of k centers minimizing, respectively, the sum of the distances and of the squared distances of all points in P from their closest centers. Our specific focus is on general metric spaces, for which it is reasonable to require that the centers belong to the input set (i.e., S subseteq P). We present coreset-based 3-round distributed approximation algorithms for the above problems using the MapReduce computational model. The algorithms are rather simple and obliviously adapt to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space. Remarkably, the algorithms attain approximation ratios that can be made arbitrarily close to those achievable by the best known polynomial-time sequential approximations, and they are very space efficient for small D, requiring local memory sizes substantially sublinear in the input size. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance guarantees in general metric spaces.

Subject Classification

ACM Subject Classification
  • Theory of computation → Approximation algorithms analysis
  • Theory of computation → Facility location and clustering
  • Theory of computation → MapReduce algorithms
Keywords
  • Clustering
  • k-median
  • k-means
  • MapReduce
  • Coreset

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In Proc. 18th ACM-SIAM SODA, pages 1027-1035, 2007. Google Scholar
  2. V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit. Local Search Heuristics for k-Median and Facility Location Problems. SIAM J. Comput., 33(3):544-562, 2004. Google Scholar
  3. P. Awasthi and M.F. Balcan. Center based clustering: A foundational perspective. In Handbook of cluster analysis. CRC Press, 2015. Google Scholar
  4. O. Bachem, M. Lucic, and A. Krause. Scalable k -Means Clustering via Lightweight Coresets. In Proc. 24th ACM KDD, pages 1119-1127, 2018. Google Scholar
  5. B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. Scalable K-Means++. PVLDB, 5(7):622-633, 2012. Google Scholar
  6. M.F. Balcan, S. Ehrlich, and Y. Liang. Distributed k-means and k-median clustering on general communication topologies. In Proc. 27th NIPS, pages 1995-2003, 2013. Google Scholar
  7. M. Ceccarello, A. Pietracaprina, and G. Pucci. Solving k-center Clustering (with Outliers) in MapReduce and Streaming, almost as Accurately as Sequentially. PVLDB, 12(7), 2019. Google Scholar
  8. E. Cohen, S. Chechik, and H. Kaplan. Clustering Small Samples With Quality Guarantees: Adaptivity With One2all PPS. In Proc. 32nd AAAI, pages 2884-2891, 2018. Google Scholar
  9. J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, 2008. Google Scholar
  10. A. Ene, S. Im, and B. Moseley. Fast Clustering Using MapReduce. In Proc. 17th ACM KDD, pages 681-689, 2011. Google Scholar
  11. D. Feldman and M. Langberg. A Unified Framework for Approximating and Clustering Data. In Proc. 43rd ACM STOC, pages 569-578, 2011. Google Scholar
  12. A. Gupta and K. Tangwongsan. Simpler Analyses of Local Search Algorithms for Facility Location. CoRR, abs/0809.2554, 2008. URL: http://arxiv.org/abs/0809.2554.
  13. S. Har-Peled and A. Kushal. Smaller Coresets for K-median and K-means Clustering. In Proc. 21st SCG, pages 126-134, 2005. Google Scholar
  14. S. Har-Peled and S. Mazumdar. On Coresets for K-means and K-median Clustering. In Proc. 36th ACM STOC, pages 291-300, 2004. Google Scholar
  15. C. Hennig, M. Meila, F. Murtagh, and R. Rocci. Handbook of cluster analysis. CRC Press, 2015. Google Scholar
  16. L. Huang, S. Jiang, J. Li, and X. Wu. Epsilon-Coresets for Clustering (with Outliers) in Doubling Metrics. In Proc. 59th IEEE FOCS, pages 814-825, 2018. Google Scholar
  17. P. Indyk, S. Mahabadi, M. Mahdian, and V.S. Mirrokni. Composable Core-sets for Diversity and Coverage Maximization. In Proc. 33rd ACM PODS, pages 100-108, 2014. Google Scholar
  18. T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A Local Search Approximation Algorithm for K-means Clustering. In Proc. 18th SCG, pages 10-18, 2002. Google Scholar
  19. L. Kaufmann and P. Rousseeuw. Clustering by Means of Medoids. Data Analysis based on the L1-Norm and Related Methods, pages 405-416, 1987. Google Scholar
  20. J. Leskovec, A. Rajaraman, and J.D. Ullman. Mining of Massive Datasets, 2nd Ed. Cambridge University Press, 2014. Google Scholar
  21. J. M. Phillips. Coresets and Sketches. Handbook of Discrete and Computational Geometry, 3rd Ed, 2016. Google Scholar
  22. A. Pietracaprina, G. Pucci, M. Riondato, F. Silvestri, and E. Upfal. Space-Round Tradeoffs for MapReduce Computations. In Proc. 26th ACM ICS, pages 235-244, 2012. Google Scholar
  23. C. Sohler and D. P. Woodruff. Strong Coresets for k-Median and Subspace Approximation: Goodbye Dimension. In Proc. 59th IEEE FOCS, pages 802-813, 2018. Google Scholar
  24. H. Song, J.G. Lee, and W.S. Han. PAMAE: parallel k-medoids clustering with high accuracy and efficiency. In Proc. 23rd ACM KDD, pages 1087-1096, 2017. Google Scholar
  25. D. Wei. A Constant-Factor Bi-Criteria Approximation Guarantee for k-means++. In Proc. 30th NIPS, pages 604-612, 2016. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail