Massively Parallel Algorithms for the Stochastic Block Model

Authors: Zelin Li, Pan Peng, Xianbin Zhu




File

LIPIcs.ESA.2023.78.pdf
  • Filesize: 0.96 MB
  • 17 pages


Author Details

Zelin Li
  • School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
Pan Peng
  • School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
Xianbin Zhu
  • Department of Computer Science, City University of Hong Kong, Hong Kong, China

Acknowledgements

We would like to thank the anonymous reviewers for their detailed comments.

Cite As

Zelin Li, Pan Peng, and Xianbin Zhu. Massively Parallel Algorithms for the Stochastic Block Model. In 31st Annual European Symposium on Algorithms (ESA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 274, pp. 78:1-78:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ESA.2023.78

Abstract

Learning the community structure of a large-scale graph is a fundamental problem in machine learning, computer science and statistics. Among others, the Stochastic Block Model (SBM) serves as a canonical model for community detection and clustering, and the Massively Parallel Computation (MPC) model is a mathematical abstraction of real-world parallel computing systems, which provides a powerful computational framework for handling large-scale datasets. We study the problem of exactly recovering the communities in a graph generated from the SBM in the MPC model. Specifically, given kn vertices that are partitioned into k equal-sized clusters (i.e., each has size n), a graph on these kn vertices is randomly generated such that each pair of vertices is connected with probability p if they are in the same cluster and with probability q if not, where p > q > 0. We give MPC algorithms for the SBM in the (very general) s-space MPC model, where each machine is guaranteed to have memory s = Ω(log n). Under the condition that (p-q)/√p ≥ Ω̃(k^{1/2} n^{-1/2+1/(2(r-1))}) for any integer r ∈ [3, O(log n)], our first algorithm exactly recovers all the k clusters in O(kr log_s n) rounds using Õ(m) total space, or in O(r log_s n) rounds using Õ(km) total space. If (p-q)/√p ≥ Ω̃(k^{3/4} n^{-1/4}), our second algorithm achieves O(log_s n) rounds and Õ(m) total space complexity. Both algorithms significantly improve upon a recent result of Cohen-Addad et al. [PODC'22], who gave algorithms that only work in the sublinear space MPC model, where each machine has local memory s = O(n^δ) for some constant δ > 0, with a much stronger condition on p, q, k. Our algorithms are based on collecting the r-step neighborhood of each vertex and comparing the difference of some statistical information generated from the local neighborhoods for each pair of vertices.
To implement the clustering algorithms in parallel, we present efficient approaches for implementing some basic graph operations in the s-space MPC model.
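
To make the generative model in the abstract concrete, the following is a minimal sequential sketch, not the paper's MPC algorithm. It samples a graph on kn vertices with intra-cluster edge probability p and inter-cluster probability q, then computes a common-neighbor count for a pair of vertices. That count is only a simplified stand-in for the neighborhood statistics the abstract mentions; the actual statistic and its parallel implementation are given in the paper. The function names (sample_sbm, adjacency, common_neighbor_count) and the convention that vertex v belongs to cluster v // n are assumptions made for illustration.

```python
import random
from itertools import combinations


def sample_sbm(k, n, p, q, seed=None):
    """Sample a graph on k*n vertices from the SBM described in the abstract.

    Assumption for illustration: vertex v belongs to cluster v // n.
    The abstract requires p > q > 0; the degenerate cases q == 0 and
    p == q are allowed here only to make testing easy.
    Returns (labels, edges), where edges is a set of pairs (u, v) with u < v.
    """
    if not (0.0 <= q <= p <= 1.0):
        raise ValueError("expected 0 <= q <= p <= 1")
    rng = random.Random(seed)
    labels = [v // n for v in range(k * n)]
    edges = set()
    for u, v in combinations(range(k * n), 2):
        prob = p if labels[u] == labels[v] else q
        if rng.random() < prob:
            edges.add((u, v))
    return labels, edges


def adjacency(num_vertices, edges):
    """Build an adjacency list (one neighbor set per vertex) from an edge set."""
    adj = [set() for _ in range(num_vertices)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def common_neighbor_count(adj, u, v):
    """Number of common neighbors of u and v.

    A toy pairwise statistic: in the SBM with p > q, two vertices in the
    same cluster have more common neighbors in expectation than two
    vertices in different clusters.
    """
    return len(adj[u] & adj[v])


if __name__ == "__main__":
    k, n = 3, 50
    labels, edges = sample_sbm(k, n, p=0.6, q=0.1, seed=1)
    adj = adjacency(k * n, edges)
    # Vertices 0 and 1 share a cluster; vertices 0 and n do not.
    print("same cluster:", common_neighbor_count(adj, 0, 1))
    print("different clusters:", common_neighbor_count(adj, 0, n))
```

The sampler enumerates all pairs and therefore takes time quadratic in kn, which is fine for small experiments but is not how a large instance would be generated or processed in the MPC setting.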

Subject Classification

ACM Subject Classification
  • Theory of computation → Massively parallel algorithms
Keywords
  • Massively Parallel Computation
  • Stochastic Block Model
  • Graph Algorithms


References

  1. Emmanuel Abbe. Community detection and stochastic block models. Found. Trends Commun. Inf. Theory, 14(1-2):1-162, 2018.
  2. Emmanuel Abbe, Afonso S. Bandeira, and Georgina Hall. Exact recovery in the stochastic block model. IEEE Trans. Inf. Theory, 62(1):471-487, 2016.
  3. Sepehr Assadi and Chen Wang. Sublinear time and space algorithms for correlation clustering via sparse-dense decompositions. In Mark Braverman, editor, 13th Innovations in Theoretical Computer Science Conference, ITCS 2022, January 31 - February 3, 2022, Berkeley, CA, USA, volume 215 of LIPIcs, pages 10:1-10:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022.
  4. MohammadHossein Bateni, Hossein Esfandiari, Manuela Fischer, and Vahab S. Mirrokni. Extreme k-center clustering. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 3941-3949. AAAI Press, 2021.
  5. Paul Beame, Paraschos Koutris, and Dan Suciu. Communication steps for parallel query processing. In Richard Hull and Wenfei Fan, editors, Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2013, New York, NY, USA - June 22 - 27, 2013, pages 273-284. ACM, 2013.
  6. Soheil Behnezhad, Sebastian Brandt, Mahsa Derakhshan, Manuela Fischer, MohammadTaghi Hajiaghayi, Richard M Karp, and Jara Uitto. Massively parallel computation of matching and MIS in sparse graphs. In Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing, pages 481-490, 2019.
  7. Soheil Behnezhad, Moses Charikar, Weiyun Ma, and Li-Yang Tan. Almost 3-approximate correlation clustering in constant rounds. In 63rd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2022, Denver, CO, USA, October 31 - November 3, 2022, pages 720-731. IEEE, 2022.
  8. Aditya Bhaskara and Maheshakya Wijewardena. Distributed clustering via LSH based data partitioning. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 569-578. PMLR, 2018.
  9. Guy E. Blelloch, Jeremy T. Fineman, and Julian Shun. Greedy sequential maximal independent set and matching are parallel on average. In Guy E. Blelloch and Maurice Herlihy, editors, 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '12, Pittsburgh, PA, USA, June 25-27, 2012, pages 308-317. ACM, 2012.
  10. Mélanie Cambus, Davin Choo, Havu Miikonen, and Jara Uitto. Massively parallel correlation clustering in bounded arboricity graphs. In Seth Gilbert, editor, 35th International Symposium on Distributed Computing, DISC 2021, October 4-8, 2021, Freiburg, Germany (Virtual Conference), volume 209 of LIPIcs, pages 15:1-15:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
  11. Flavio Chierichetti, Nilesh N. Dalvi, and Ravi Kumar. Correlation clustering in MapReduce. In Proceedings of the 20th International Conference on Knowledge Discovery and Data Mining, pages 641-650. ACM, 2014.
  12. Vincent Cohen-Addad, Adrian Kosowski, Frederik Mallmann-Trenn, and David Saulpic. On the power of Louvain in the stochastic block model. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  13. Vincent Cohen-Addad, Silvio Lattanzi, Slobodan Mitrovic, Ashkan Norouzi-Fard, Nikos Parotsidis, and Jakub Tarnawski. Correlation clustering in constant many parallel rounds. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 2069-2078. PMLR, 2021.
  14. Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, and Ola Svensson. Parallel and efficient hierarchical k-median clustering. In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 20333-20345, 2021.
  15. Vincent Cohen-Addad, Frederik Mallmann-Trenn, and David Saulpic. Community recovery in the degree-heterogeneous stochastic block model. In Po-Ling Loh and Maxim Raginsky, editors, Conference on Learning Theory, 2-5 July 2022, London, UK, volume 178 of Proceedings of Machine Learning Research, pages 1662-1692. PMLR, 2022.
  16. Vincent Cohen-Addad, Frederik Mallmann-Trenn, and David Saulpic. A massively parallel modularity-maximizing algorithm with provable guarantees. In Alessia Milani and Philipp Woelfel, editors, PODC '22: ACM Symposium on Principles of Distributed Computing, Salerno, Italy, July 25 - 29, 2022, pages 356-365. ACM, 2022.
  17. Vincent Cohen-Addad, Vahab S. Mirrokni, and Peilin Zhong. Massively parallel k-means clustering for perturbation resilient instances. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 4180-4201. PMLR, 2022.
  18. Sam Coy, Artur Czumaj, and Gopinath Mishra. On parallel k-center clustering. arXiv preprint arXiv:2304.05883, 2023.
  19. Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113, 2008.
  20. Alessandro Epasto, Vahab S. Mirrokni, and Morteza Zadimoghaddam. Scalable diversity maximization via small-size composable core-sets (brief announcement). In Christian Scheideler and Petra Berenbrink, editors, The 31st ACM on Symposium on Parallelism in Algorithms and Architectures, SPAA 2019, Phoenix, AZ, USA, June 22-24, 2019, pages 41-42. ACM, 2019.
  21. Manuela Fischer and Andreas Noever. Tight analysis of parallel randomized greedy MIS. ACM Trans. Algorithms, 16(1):6:1-6:13, 2020.
  22. Michael T. Goodrich, Nodari Sitchinava, and Qin Zhang. Sorting, searching, and simulation in the MapReduce framework. In Takao Asano, Shin-Ichi Nakano, Yoshio Okamoto, and Osamu Watanabe, editors, Algorithms and Computation - 22nd International Symposium, ISAAC 2011, Yokohama, Japan, December 5-8, 2011. Proceedings, volume 7074 of Lecture Notes in Computer Science, pages 374-383. Springer, 2011.
  23. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In Paulo Ferreira, Thomas R. Gross, and Luís Veiga, editors, Proceedings of the 2007 EuroSys Conference, Lisbon, Portugal, March 21-23, 2007, pages 59-72. ACM, 2007.
  24. Howard J. Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for MapReduce. In Moses Charikar, editor, Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 938-948. SIAM, 2010.
  25. Chandra Sekhar Mukherjee and Jiapeng Zhang. Detecting hidden communities by power iterations with connections to vanilla spectral algorithms, 2022. URL: https://doi.org/10.48550/ARXIV.2211.03939.
  26. Xinghao Pan, Dimitris S. Papailiopoulos, Samet Oymak, Benjamin Recht, Kannan Ramchandran, and Michael I. Jordan. Parallel correlation clustering on big graphs. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 82-90, 2015.
  27. Peng Wang, Zirui Zhou, and Anthony Man-Cho So. A nearly-linear time algorithm for exact community recovery in stochastic block model. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 10126-10135. PMLR, 2020.
  28. Tom White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 1st edition, 2009.
  29. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In Erich M. Nahum and Dongyan Xu, editors, 2nd USENIX Workshop on Hot Topics in Cloud Computing, HotCloud'10, Boston, MA, USA, June 22, 2010. USENIX Association, 2010.