Massively Parallel Correlation Clustering in Bounded Arboricity Graphs

Authors Mélanie Cambus , Davin Choo , Havu Miikonen , Jara Uitto



PDF
Thumbnail PDF

File

LIPIcs.DISC.2021.15.pdf
  • Filesize: 0.8 MB
  • 18 pages

Document Identifiers

Author Details

Mélanie Cambus
  • Aalto University, Finland
Davin Choo
  • National University of Singapore, Singapore
Havu Miikonen
  • Aalto University, Finland
Jara Uitto
  • Aalto University, Finland

Cite AsGet BibTex

Mélanie Cambus, Davin Choo, Havu Miikonen, and Jara Uitto. Massively Parallel Correlation Clustering in Bounded Arboricity Graphs. In 35th International Symposium on Distributed Computing (DISC 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 209, pp. 15:1-15:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.DISC.2021.15

Abstract

Identifying clusters of similar elements in a set is a common task in data analysis. With the immense growth of data and physical limitations on single processor speed, it is necessary to find efficient parallel algorithms for clustering tasks. In this paper, we study the problem of correlation clustering in bounded arboricity graphs with respect to the Massively Parallel Computation (MPC) model. More specifically, we are given a complete graph where the edges are either positive or negative, indicating whether pairs of vertices are similar or dissimilar. The task is to partition the vertices into clusters with as few disagreements as possible. That is, we want to minimize the number of positive inter-cluster edges and negative intra-cluster edges. Consider an input graph G on n vertices such that the positive edges induce a λ-arboric graph. Our main result is a 3-approximation (in expectation) algorithm to correlation clustering that runs in 𝒪(log λ ⋅ poly(log log n)) MPC rounds in the strongly sublinear memory regime. This is obtained by combining structural properties of correlation clustering on bounded arboricity graphs with the insights of Fischer and Noever (SODA '18) on randomized greedy MIS and the PIVOT algorithm of Ailon, Charikar, and Newman (STOC '05). Combined with known graph matching algorithms, our structural property also implies an exact algorithm and algorithms with worst case (1+ε)-approximation guarantees in the special case of forests, where λ = 1.

Subject Classification

ACM Subject Classification
  • Theory of computation → MapReduce algorithms
  • Theory of computation → Unsupervised learning and clustering
Keywords
  • MPC Algorithm
  • Correlation Clustering
  • Bounded Arboricity

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Kook Jin Ahn, Graham Cormode, Sudipto Guha, Andrew McGregor, and Anthony Wirth. Correlation Clustering in Data Streams. In Proceedings of the 32nd International Conference on International Conference on Machine Learning (ICML), pages 2237-2246, 2015. Google Scholar
  2. Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating Inconsistent Information: Ranking and Clustering. Journal of the ACM (JACM), 55(5):1-27, 2008. Google Scholar
  3. Alexandr Andoni, Zhao Song, Clifford Stein, Zhengyu Wang, and Peilin Zhong. Parallel graph connectivity in log diameter rounds. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 674-685. IEEE, 2018. Google Scholar
  4. Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation Clustering. Machine learning, 56(1-3):89-113, 2004. Google Scholar
  5. Reuven Bar-Yehuda, Keren Censor-Hillel, Mohsen Ghaffari, and Gregory Schwartzman. Distributed Approximation of Maximum Independent Set and Maximum Matching. In Proceedings of the ACM Symposium on Principles of Distributed Computing (PODC), pages 165-174, 2017. Google Scholar
  6. Leonid Barenboim, Michael Elkin, Seth Pettie, and Johannes Schneider. The Locality of Distributed Symmetry Breaking. Journal of the ACM (JACM), 63(3):1-45, 2016. Google Scholar
  7. MohammadHossein Bateni, Soheil Behnezhad, Mahsa Derakhshan, MohammadTaghi Hajiaghayi, and Vahab Mirrokni. Brief Announcement: MapReduce Algorithms for Massive Trees. In 45th International Colloquium on Automata, Languages, and Programming (ICALP), pages 162:1-162:4, 2018. Google Scholar
  8. Paul Beame, Paraschos Koutris, and Dan Suciu. Communication Steps for Parallel Query Processing. Journal of the ACM (JACM), 64(6):1-58, 2017. Google Scholar
  9. Soheil Behnezhad, Sebastian Brandt, Mahsa Derakhshan, Manuela Fischer, MohammadTaghi Hajiaghayi, Richard M. Karp, and Jara Uitto. Massively Parallel Computation of Matching and MIS in Sparse Graphs. In Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing (PODC), pages 481-490, 2019. Google Scholar
  10. Guy Blelloch, Jeremy Fineman, and Julian Shun. Greedy Sequential Maximal Independent Set and Matching Are Parallel on Average. In Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 308–-317, 2012. Google Scholar
  11. Nicolò Cesa-Bianchi, Claudio Gentile, Fabio Vitale, and Giovanni Zappella. A Correlation Clustering Approach to Link Classification in Signed Networks. Journal of Machine Learning Research (JMLR), 23:34.1-34.20, 2013. Google Scholar
  12. Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with Qualitative Information. Journal of Computer and System Sciences, 71(3):360-383, 2005. Google Scholar
  13. Shuchi Chawla, Konstantin Makarychev, Tselil Schramm, and Grigory Yaroslavtsev. Near Optimal LP Rounding Algorithm for Correlation Clustering on Complete and Complete k-partite graphs. In Proceedings of the forty-seventh annual ACM symposium on Theory of Computing (STOC), pages 219-228, 2015. Google Scholar
  14. Yudong Chen, Sujay Sanghavi, and Huan Xu. Clustering Sparse Graphs. In Advances in Neural Information Processing Systems (NeurIPS), pages 2204-2212, 2012. Google Scholar
  15. Flavio Chierichetti, Nilesh Dalvi, and Ravi Kumar. Correlation Clustering in MapReduce. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge Discovery and Data mining (KDD), pages 641-650, 2014. Google Scholar
  16. Artur Czumaj, Peter Davies, and Merav Parter. Graph sparsification for derandomizing massively parallel computation with low space. In Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 175-185, 2020. Google Scholar
  17. Artur Czumaj, Jakub Łacki, Aleksander Madry, Slobodan Mitrovic, Krzysztof Onak, and Piotr Sankowski. Round Compression for Parallel Matching Algorithms. SIAM Journal on Computing, 49(5):STOC18-1, 2019. Google Scholar
  18. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, 2008. Google Scholar
  19. Erik D Demaine, Dotan Emanuel, Amos Fiat, and Nicole Immorlica. Correlation Clustering in General Weighted Graphs. Theoretical Computer Science, 361(2-3):172-187, 2006. Google Scholar
  20. Guy Even, Moti Medina, and Dana Ron. Distributed Maximum Matching in Bounded Degree Graphs. In Proceedings of International Conference on Distributed Computing and Networking (ICDCN), 2015. Google Scholar
  21. Manuela Fischer and Andreas Noever. Tight Analysis of Parallel Randomized Greedy MIS. In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2152-2160, 2018. Google Scholar
  22. Mohsen Ghaffari. Massively Parallel Algorithms, 2019. Available at: URL: https://people.inf.ethz.ch/gmohsen/MPA19/Notes/MPA.pdf.
  23. Mohsen Ghaffari, Themis Gouleakis, Christian Konrad, Slobodan Mitrović, and Ronitt Rubinfeld. Improved Massively Parallel Computation Algorithms for MIS, Matching, and Vertex Cover. In Proceedings of the ACM Symposium on Principles of Distributed Computing (PODC), pages 129-138, 2018. Google Scholar
  24. Mohsen Ghaffari, Christoph Grunau, and Ce Jin. Improved MPC Algorithms for MIS, Matching, and Coloring on Trees and Beyond. In 34th International Symposium on Distributed Computing (DISC), pages 34:1-34:18, 2020. Google Scholar
  25. Mohsen Ghaffari and Krzysztof Nowicki. Massively Parallel Algorithms for Minimum Cut. In Proceedings of the 39th Symposium on Principles of Distributed Computing (PODC), pages 119-128, 2020. Google Scholar
  26. Mohsen Ghaffari and Jara Uitto. Sparsifying Distributed Algorithms with Ramifications in Massively Parallel Computation and Centralized Local Computation. In Proceedings of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1636-1653, 2019. Google Scholar
  27. Michael T Goodrich, Nodari Sitchinava, and Qin Zhang. Sorting, Searching, and Simulation in the MapReduce Framework. In International Symposium on Algorithms and Computation (ISAAC), pages 374-383, 2011. Google Scholar
  28. John E. Hopcroft and Richard M. Karp. An n^5/2 Algorithm for Maximum Matchings in Bipartite Graphs. SIAM Journal on Computing, 2(4):225-231, 1973. Google Scholar
  29. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, pages 59-72, 2007. Google Scholar
  30. Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii. A Model of Computation for MapReduce. In Proceedings of the 21st annual ACM-SIAM symposium on Discrete Algorithms (SODA), pages 938-948, 2010. Google Scholar
  31. Christoph Lenzen and Roger Wattenhofer. Brief announcement: Exponential Speed-up of Local Algorithms Using Non-local Communication. In Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of Distributed Computing (PODC), pages 295-296, 2010. Google Scholar
  32. Nathan Linial. Locality in distributed graph algorithms. SIAM Journal on computing, 21(1):193-201, 1992. Google Scholar
  33. Xinghao Pan, Dimitris Papailiopoulos, Samet Oymak, Benjamin Recht, Kannan Ramchandran, and Michael I Jordan. Parallel Correlation Clustering on Big Graphs. In Advances in Neural Information Processing Systems (NeurIPS), pages 82-90, 2015. Google Scholar
  34. David Peleg. Distributed Computing: A Locality-Sensitive Approach. Society for Industrial and Applied Mathematics, 2000. Google Scholar
  35. Chaitanya Swamy. Correlation Clustering: Maximizing Agreements via Semidefinite Programming. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), volume 4, pages 526-527, 2004. Google Scholar
  36. Tom White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012. Google Scholar
  37. Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, Ion Stoica, et al. Spark: Cluster Computing with Working Sets. HotCloud, 10(10-10):95, 2010. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail