Fully-Scalable MPC Algorithms for Clustering in High Dimension

Authors: Artur Czumaj, Guichen Gao, Shaofeng H.-C. Jiang, Robert Krauthgamer, Pavel Veselý

File

LIPIcs.ICALP.2024.50.pdf
  • Filesize: 0.83 MB
  • 20 pages

Author Details

Artur Czumaj
  • Department of Computer Science, University of Warwick, Coventry, UK
Guichen Gao
  • School of Computer Science, Peking University, Beijing, China
Shaofeng H.-C. Jiang
  • School of Computer Science, Peking University, Beijing, China
Robert Krauthgamer
  • Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel
Pavel Veselý
  • Computer Science Institute of Charles University, Prague, Czech Republic

Cite As

Artur Czumaj, Guichen Gao, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Pavel Veselý. Fully-Scalable MPC Algorithms for Clustering in High Dimension. In 51st International Colloquium on Automata, Languages, and Programming (ICALP 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 297, pp. 50:1-50:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.ICALP.2024.50

Abstract

We design new parallel algorithms for clustering in high-dimensional Euclidean spaces. These algorithms run in the Massively Parallel Computation (MPC) model and are fully scalable, meaning that the local memory in each machine may be n^σ for an arbitrarily small fixed σ > 0. Importantly, the local memory may be substantially smaller than the number of clusters k, yet all our algorithms are fast, i.e., run in O(1) rounds. We first devise a fast MPC algorithm for O(1)-approximation of uniform Facility Location. This is the first fully-scalable MPC algorithm that achieves O(1)-approximation for any clustering problem in a general geometric setting; previous algorithms only provide poly(log n)-approximation or apply to restricted inputs, such as low dimension or a small number of clusters k; see, e.g., [Bhaskara and Wijewardena, ICML'18; Cohen-Addad et al., NeurIPS'21; Cohen-Addad et al., ICML'22]. We then build on this Facility Location result and devise a fast MPC algorithm that achieves O(1)-bicriteria approximation for k-Median and for k-Means, namely, it computes (1+ε)k clusters of cost within an O(1/ε²)-factor of the optimum for k clusters. A primary technical tool that we introduce, and that may be of independent interest, is a new MPC primitive for geometric aggregation, namely, computing for every data point a statistic of its approximate neighborhood, for statistics like range counting and nearest-neighbor search. Our implementation of this primitive works in high dimension and is based on consistent hashing (aka sparse partition), a technique that was recently used for streaming algorithms [Czumaj et al., FOCS'22].
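
To make the quantities named in the abstract concrete, the following is a minimal sequential sketch in Python, not the MPC algorithm of the paper. It computes exact per-point range counts and nearest neighbors by brute force, together with the standard uniform Facility Location objective (opening cost times the number of facilities, plus the sum of distances to the nearest facility). The function names `range_counts`, `nearest_in`, and `uniform_fl_cost` are hypothetical and chosen for illustration; the paper's primitive computes such statistics only over an approximate neighborhood, and its precise guarantee is not reproduced here.

```python
import math


def range_counts(points, r):
    """For each point p, count the input points within distance r of p (p included).

    Brute force over all pairs; a sequential reference, not an MPC procedure.
    """
    return [sum(1 for q in points if math.dist(p, q) <= r) for p in points]


def nearest_in(points, candidates):
    """For each point p, return the candidate closest to p in Euclidean distance."""
    return [min(candidates, key=lambda c: math.dist(p, c)) for p in points]


def uniform_fl_cost(points, facilities, opening_cost):
    """Uniform Facility Location objective: opening cost plus connection cost."""
    connection = sum(min(math.dist(p, f) for f in facilities) for p in points)
    return opening_cost * len(facilities) + connection


if __name__ == "__main__":
    # Illustrative data only: three points on a line.
    pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
    print(range_counts(pts, 1.0))
    print(uniform_fl_cost(pts, [(0.0, 0.0), (10.0, 0.0)], 2.0))
```

The brute-force version touches all pairs of points, which is exactly what a machine with local memory n^σ cannot do on its own; this is the gap that the geometric aggregation primitive described above is meant to close.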

Subject Classification

ACM Subject Classification
  • Theory of computation → Massively parallel algorithms
  • Theory of computation → Facility location and clustering
  • Theory of computation → Randomness, geometry and discrete structures
Keywords
  • Massively parallel computing
  • high dimension
  • facility location
  • k-median
  • k-means

References

  1. Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In Proceedings of the 12th International Workshop on Approximation Algorithms for Combinatorial Optimization, and of the 13th International Workshop on Randomization and Approximation Techniques in Computer Science (APPROX/RANDOM), pages 15-28, 2009.
  2. AmirMohsen Ahanchi, Alexandr Andoni, MohammadTaghi Hajiaghayi, Marina Knittel, and Peilin Zhong. Massively parallel tree embeddings for high dimensional spaces. In Proceedings of the 35th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 77-88, 2023. URL: https://doi.org/10.1145/3558481.3591096.
  3. Hyung-Chan An, Ashkan Norouzi-Fard, and Ola Svensson. Dynamic facility location via exponential clocks. In Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 708-721, 2015.
  4. Alexandr Andoni, Aleksandar Nikolov, Krzysztof Onak, and Grigory Yaroslavtsev. Parallel algorithms for geometric graph problems. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing (STOC), pages 574-583, 2014. URL: https://doi.org/10.1145/2591796.2591805.
  5. Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristics for k-median and facility location problems. SIAM Journal on Computing, 33(3):544-562, 2004.
  6. Olivier Bachem, Mario Lucic, and Andreas Krause. Scalable k-means clustering via lightweight coresets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 1119-1127, 2018.
  7. Mihai Bădoiu, Artur Czumaj, Piotr Indyk, and Christian Sohler. Facility location in sublinear time. In Proceedings of the 32nd International Colloquium on Automata, Languages, and Programming (ICALP), pages 866-877, 2005. URL: https://doi.org/10.1007/11523468_70.
  8. Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. Scalable K-Means++. Proc. VLDB Endow., 5(7):622-633, 2012.
  9. Maria-Florina Balcan, Steven Ehrlich, and Yingyu Liang. Distributed k-means and k-median clustering on general communication topologies. In NIPS, pages 1995-2003, 2013.
  10. MohammadHossein Bateni, Hossein Esfandiari, Manuela Fischer, and Vahab S. Mirrokni. Extreme k-center clustering. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, pages 3941-3949, 2021. URL: https://doi.org/10.1609/aaai.v35i5.16513.
  11. Paul Beame, Paraschos Koutris, and Dan Suciu. Communication steps for parallel query processing. Journal of the ACM, 64(6):40:1-40:58, 2017. URL: https://doi.org/10.1145/3125644.
  12. Soheil Behnezhad, Moses Charikar, Weiyun Ma, and Li-Yang Tan. Almost 3-approximate correlation clustering in constant rounds. In Proceedings of the 63rd IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 720-731, 2022. URL: https://doi.org/10.1109/FOCS54457.2022.00074.
  13. Aditya Bhaskara and Maheshakya Wijewardena. Distributed clustering via LSH based data partitioning. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 569-578. PMLR, 2018. URL: https://proceedings.mlr.press/v80/bhaskara18a.html.
  14. Yonatan Bilu and Nathan Linial. Are stable instances easy? Combinatorics, Probability & Computing, 21(5):643-660, 2012. URL: https://doi.org/10.1017/S0963548312000193.
  15. Guy E. Blelloch, Anupam Gupta, and Kanat Tangwongsan. Parallel probabilistic tree embeddings, k-median, and buy-at-bulk network design. In Proceedings of the 24th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 205-213, 2012.
  16. Guy E. Blelloch and Kanat Tangwongsan. Parallel approximation algorithms for facility-location problems. In Proceedings of the 22nd Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 315-324, 2010. URL: https://doi.org/10.1145/1810479.1810535.
  17. Vladimir Braverman, Vincent Cohen-Addad, Shaofeng H.-C. Jiang, Robert Krauthgamer, Chris Schwiegelshohn, Mads Bech Toftrup, and Xuan Wu. The power of uniform sampling for coresets. In Proceedings of the 63rd IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 462-473, 2022. URL: https://doi.org/10.1109/FOCS54457.2022.00051.
  18. Vladimir Braverman, Gereon Frahling, Harry Lang, Christian Sohler, and Lin F. Yang. Clustering high dimensional dynamic data streams. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 576-585. PMLR, 2017.
  19. Jaroslaw Byrka and Karen I. Aardal. An optimal bifactor approximation algorithm for the metric uncapacitated facility location problem. SIAM Journal on Computing, 39(6):2212-2231, 2010.
  20. Mélanie Cambus, Fabian Kuhn, Shreyas Pai, and Jara Uitto. Time and space optimal massively parallel algorithm for the 2-ruling set problem. In Proceedings of the 37th International Symposium on Distributed Computing (DISC), pages 11:1-11:12, 2023. URL: https://doi.org/10.4230/LIPICS.DISC.2023.11.
  21. Matteo Ceccarello, Andrea Pietracaprina, and Geppino Pucci. Solving k-center clustering (with outliers) in MapReduce and streaming, almost as accurately as sequentially. Proc. VLDB Endow., 12(7):766-778, 2019.
  22. Yi-Jun Chang and Da Wei Zheng. Fully scalable massively parallel algorithms for embedded planar graphs. In Proceedings of the 35th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 4410-4450, 2024. URL: https://doi.org/10.1137/1.9781611977912.155.
  23. Jiecao Chen, He Sun, David P. Woodruff, and Qin Zhang. Communication-optimal distributed clustering. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS), pages 3720-3728, 2016. URL: https://proceedings.neurips.cc/paper/2016/hash/7503cfacd12053d309b6bed5c89de212-Abstract.html.
  24. Xi Chen, Vincent Cohen-Addad, Rajesh Jayaram, Amit Levi, and Erik Waingarten. Streaming Euclidean MST to a Constant Factor. In Proceedings of the 55th Annual Symposium on Theory of Computing (STOC), pages 156-169, 2023. URL: https://doi.org/10.1145/3564246.3585168.
  25. Xi Chen, Rajesh Jayaram, Amit Levi, and Erik Waingarten. New streaming algorithms for high dimensional EMD and MST. In Proceedings of the 54th Annual Symposium on Theory of Computing (STOC), pages 222-233, 2022. URL: https://doi.org/10.1145/3519935.3519979.
  26. Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, and Chris Schwiegelshohn. Towards optimal lower bounds for k-median and k-means coresets. In Proceedings of the 54th Annual Symposium on Theory of Computing (STOC), pages 1038-1051, 2022. URL: https://doi.org/10.1145/3519935.3519946.
  27. Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, and Ola Svensson. Parallel and efficient hierarchical k-median clustering. In NeurIPS, pages 20333-20345, 2021. URL: https://proceedings.neurips.cc/paper/2021/hash/aa495e18c7e3a21a4e48923b92048a61-Abstract.html.
  28. Vincent Cohen-Addad, Vahab S. Mirrokni, and Peilin Zhong. Massively parallel k-means clustering for perturbation resilient instances. In ICML, volume 162 of Proceedings of Machine Learning Research, pages 4180-4201. PMLR, 2022. URL: https://proceedings.mlr.press/v162/cohen-addad22b.html.
  29. Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. A new coreset framework for clustering. In Proceedings of the 53rd Annual Symposium on Theory of Computing (STOC), pages 169-182, 2021.
  30. Sam Coy, Artur Czumaj, and Gopinath Mishra. On parallel k-center clustering. In Proceedings of the 35th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 65-75, 2023. URL: https://doi.org/10.1145/3558481.3591075.
  31. Artur Czumaj, Peter Davies, and Merav Parter. Component stability in low-space massively parallel computation. In Proceedings of the 40th ACM Symposium on Principles of Distributed Computing (PODC), pages 481-491, 2021. URL: https://doi.org/10.1145/3465084.3467903.
  32. Artur Czumaj, Guichen Gao, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Pavel Veselý. Fully scalable MPC algorithms for clustering in high dimension. CoRR, abs/2307.07848, 2023. URL: https://doi.org/10.48550/arXiv.2307.07848.
  33. Artur Czumaj, Shaofeng H.-C. Jiang, Robert Krauthgamer, Pavel Veselý, and Mingwei Yang. Streaming facility location in high dimension via geometric hashing. In Proceedings of the 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 450-461, 2022. The latest version can be found at https://arxiv.org/abs/2204.02095 and it has additional results. URL: https://doi.org/10.1109/FOCS54457.2022.00050.
  34. Mark de Berg, Leyla Biabani, and Morteza Monemizadeh. k-center clustering with outliers in the MPC and streaming model. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, (IPDPS), pages 853-863, 2023. URL: https://doi.org/10.1109/IPDPS54959.2023.00090.
  35. Alina Ene, Sungjin Im, and Benjamin Moseley. Fast clustering using MapReduce. In 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 681-689. ACM, 2011. URL: https://doi.org/10.1145/2020408.2020515.
  36. Arnold Filtser. Scattering and sparse partitions, and their applications. In 47th International Colloquium on Automata, Languages, and Programming (ICALP), pages 47:1-47:20, 2020. URL: https://doi.org/10.4230/LIPIcs.ICALP.2020.47.
  37. Naveen Garg. A 3-approximation for the minimum tree spanning k vertices. In Proceedings of the 37th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 302-309, 1996.
  38. Joachim Gehweiler, Christiane Lammersen, and Christian Sohler. A distributed O(1)-approximation algorithm for the uniform facility location problem. Algorithmica, 68(3):643-670, 2014. URL: https://doi.org/10.1007/s00453-012-9690-y.
  39. Mohsen Ghaffari. Massively parallel algorithms, 2019. Lecture Notes from ETH Zurich. URL: http://people.csail.mit.edu/ghaffari/MPA19/Notes/MPA.pdf.
  40. Mohsen Ghaffari, Fabian Kuhn, and Jara Uitto. Conditional hardness results for massively parallel computation from distributed lower bounds. In Proceedings of the 60th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 1650-1663, 2019. URL: https://doi.org/10.1109/FOCS.2019.00097.
  41. Michael T. Goodrich, Nodari Sitchinava, and Qin Zhang. Sorting, searching, and simulation in the MapReduce framework. In Proceedings of the 22nd International Symposium on Algorithms and Computation (ISAAC), pages 374-383, 2011. URL: https://doi.org/10.1007/978-3-642-25591-5_39.
  42. Sudipto Guha, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams. In Proceedings of the 41st IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 359-366, 2000.
  43. Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual Symposium on Theory of Computing (STOC), pages 291-300, 2004. URL: https://doi.org/10.1145/1007352.1007400.
  44. James W. Hegeman and Sriram V. Pemmaraju. Sub-logarithmic distributed algorithms for metric facility location. Distributed Computing, 28(5):351-374, 2015. URL: https://doi.org/10.1007/s00446-015-0243-x.
  45. James W. Hegeman, Sriram V. Pemmaraju, and Vivek Sardeshmukh. Near-constant-time distributed algorithms on a congested clique. In Proceedings of the 28th International Symposium on Distributed Computing (DISC), pages 514-530, 2014. URL: https://doi.org/10.1007/978-3-662-45174-8_35.
  46. Daniel J. Hsu and Matus Telgarsky. Greedy bi-criteria approximations for k-medians and k-means. CoRR, abs/1607.06203, 2016. URL: https://arxiv.org/abs/1607.06203.
  47. Sungjin Im, Ravi Kumar, Silvio Lattanzi, Benjamin Moseley, and Sergei Vassilvitskii. Massively parallel computation: Algorithms and applications. Foundations and Trends in Optimization, 5(4):340-417, 2023.
  48. Piotr Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proceedings of the 41st IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 189-197, 2000. URL: https://doi.org/10.1109/SFCS.2000.892082.
  49. Piotr Indyk. Algorithms for dynamic geometric problems over data streams. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), pages 373-380, 2004. URL: https://doi.org/10.1145/1007352.1007413.
  50. Kamal Jain, Mohammad Mahdian, Evangelos Markakis, Amin Saberi, and Vijay V. Vazirani. Greedy facility location algorithms analyzed using dual fitting with factor-revealing LP. Journal of the ACM, 50(6):795-824, 2003. URL: https://doi.org/10.1145/950620.950621.
  51. Kamal Jain and Vijay V. Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. Journal of the ACM, 48(2):274-296, 2001. URL: https://doi.org/10.1145/375827.375845.
  52. Rajesh Jayaram, Vahab Mirrokni, Shyam Narayanan, and Peilin Zhong. Massively parallel algorithms for high-dimensional Euclidean minimum spanning tree. In Proceedings of the 35th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 3960-3996, 2024. URL: https://doi.org/10.1137/1.9781611977912.139.
  53. Lujun Jia, Guolong Lin, Guevara Noubir, Rajmohan Rajaraman, and Ravi Sundaram. Universal approximations for TSP, Steiner tree, and set cover. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pages 386-395, 2005. URL: https://doi.org/10.1145/1060590.1060649.
  54. W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Proceedings of the Conference in Modern Analysis and Probability (New Haven, Connecticut, 1982), pages 189-206. American Mathematical Society, 1984.
  55. Howard J. Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for MapReduce. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 938-948, 2010. URL: https://doi.org/10.1137/1.9781611973075.76.
  56. Shi Li. A 1.488 approximation algorithm for the uncapacitated facility location problem. Inf. Comput., 222:45-58, 2013.
  57. Jyh-Han Lin and Jeffrey Scott Vitter. Approximation algorithms for geometric median problems. Information Processing Letters, 44(5):245-249, 1992.
  58. Jyh-Han Lin and Jeffrey Scott Vitter. ε-approximations with minimum packing constraint violation. In Proceedings of the 24th Annual ACM Symposium on Theory of Computing (STOC), pages 771-782, 1992. URL: https://doi.org/10.1145/129712.129787.
  59. Konstantin Makarychev, Yury Makarychev, Maxim Sviridenko, and Justin Ward. A bi-criteria approximation algorithm for k-means. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM), pages 14:1-14:20, 2016. URL: https://doi.org/10.4230/LIPICS.APPROX-RANDOM.2016.14.
  60. Gustavo Malkomes, Matt J. Kusner, Wenlin Chen, Kilian Q. Weinberger, and Benjamin Moseley. Fast distributed k-center clustering with outliers on massive data. In NIPS, pages 1063-1071, 2015.
  61. Ramgopal R. Mettu and C. Greg Plaxton. The online median problem. SIAM Journal on Computing, 32(3):816-832, 2003. URL: https://doi.org/10.1137/S0097539701383443.
  62. Tim Roughgarden, Sergei Vassilvitskii, and Joshua R. Wang. Shuffles and circuits (On lower bounds for modern parallel computation). Journal of the ACM, 65(6):41:1-41:24, 2018. URL: https://doi.org/10.1145/3232536.
  63. Zhao Song, Lin F. Yang, and Peilin Zhong. Sensitivity sampling over dynamic geometric data streams with applications to k-clustering. CoRR, abs/1802.00459, 2018. URL: https://arxiv.org/abs/1802.00459.
  64. Dennis Wei. A constant-factor bi-criteria approximation guarantee for k-means++. In NIPS, volume 29, pages 604-612, 2016. URL: https://proceedings.neurips.cc/paper/2016/hash/357a6fdf7642bf815a88822c447d9dc4-Abstract.html.
  65. Grigory Yaroslavtsev and Adithya Vadapalli. Massively parallel algorithms and hardness for single-linkage clustering under ℓ_p distances. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 5596-5605. PMLR, 2018. URL: http://proceedings.mlr.press/v80/yaroslavtsev18a.html.