Massively Parallel Correlation Clustering in Bounded Arboricity Graphs

Cambus, Mélanie; Choo, Davin; Miikonen, Havu; Uitto, Jara

doi:10.4230/LIPIcs.DISC.2021.15

Abstract

Identifying clusters of similar elements in a set is a common task in data analysis. With the immense growth of data and physical limitations on single processor speed, it is necessary to find efficient parallel algorithms for clustering tasks. In this paper, we study the problem of correlation clustering in bounded arboricity graphs with respect to the Massively Parallel Computation (MPC) model. More specifically, we are given a complete graph where the edges are either positive or negative, indicating whether pairs of vertices are similar or dissimilar. The task is to partition the vertices into clusters with as few disagreements as possible. That is, we want to minimize the number of positive inter-cluster edges and negative intra-cluster edges.
Consider an input graph G on n vertices such that the positive edges induce a λ-arboric graph. Our main result is a 3-approximation (in expectation) algorithm to correlation clustering that runs in 𝒪(log λ ⋅ poly(log log n)) MPC rounds in the strongly sublinear memory regime. This is obtained by combining structural properties of correlation clustering on bounded arboricity graphs with the insights of Fischer and Noever (SODA '18) on randomized greedy MIS and the PIVOT algorithm of Ailon, Charikar, and Newman (STOC '05). Combined with known graph matching algorithms, our structural property also implies an exact algorithm and algorithms with worst case (1+ε)-approximation guarantees in the special case of forests, where λ = 1.

Kook Jin Ahn, Graham Cormode, Sudipto Guha, Andrew McGregor, and Anthony Wirth. Correlation Clustering in Data Streams. In Proceedings of the 32nd International Conference on International Conference on Machine Learning (ICML), pages 2237-2246, 2015.
Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating Inconsistent Information: Ranking and Clustering. Journal of the ACM (JACM), 55(5):1-27, 2008.
Alexandr Andoni, Zhao Song, Clifford Stein, Zhengyu Wang, and Peilin Zhong. Parallel graph connectivity in log diameter rounds. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 674-685. IEEE, 2018.
Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation Clustering. Machine learning, 56(1-3):89-113, 2004.
Reuven Bar-Yehuda, Keren Censor-Hillel, Mohsen Ghaffari, and Gregory Schwartzman. Distributed Approximation of Maximum Independent Set and Maximum Matching. In Proceedings of the ACM Symposium on Principles of Distributed Computing (PODC), pages 165-174, 2017.
Leonid Barenboim, Michael Elkin, Seth Pettie, and Johannes Schneider. The Locality of Distributed Symmetry Breaking. Journal of the ACM (JACM), 63(3):1-45, 2016.
MohammadHossein Bateni, Soheil Behnezhad, Mahsa Derakhshan, MohammadTaghi Hajiaghayi, and Vahab Mirrokni. Brief Announcement: MapReduce Algorithms for Massive Trees. In 45th International Colloquium on Automata, Languages, and Programming (ICALP), pages 162:1-162:4, 2018.
Paul Beame, Paraschos Koutris, and Dan Suciu. Communication Steps for Parallel Query Processing. Journal of the ACM (JACM), 64(6):1-58, 2017.
Soheil Behnezhad, Sebastian Brandt, Mahsa Derakhshan, Manuela Fischer, MohammadTaghi Hajiaghayi, Richard M. Karp, and Jara Uitto. Massively Parallel Computation of Matching and MIS in Sparse Graphs. In Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing (PODC), pages 481-490, 2019.
Guy Blelloch, Jeremy Fineman, and Julian Shun. Greedy Sequential Maximal Independent Set and Matching Are Parallel on Average. In Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 308–-317, 2012.
Nicolò Cesa-Bianchi, Claudio Gentile, Fabio Vitale, and Giovanni Zappella. A Correlation Clustering Approach to Link Classification in Signed Networks. Journal of Machine Learning Research (JMLR), 23:34.1-34.20, 2013.
Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with Qualitative Information. Journal of Computer and System Sciences, 71(3):360-383, 2005.
Shuchi Chawla, Konstantin Makarychev, Tselil Schramm, and Grigory Yaroslavtsev. Near Optimal LP Rounding Algorithm for Correlation Clustering on Complete and Complete k-partite graphs. In Proceedings of the forty-seventh annual ACM symposium on Theory of Computing (STOC), pages 219-228, 2015.
Yudong Chen, Sujay Sanghavi, and Huan Xu. Clustering Sparse Graphs. In Advances in Neural Information Processing Systems (NeurIPS), pages 2204-2212, 2012.
Flavio Chierichetti, Nilesh Dalvi, and Ravi Kumar. Correlation Clustering in MapReduce. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge Discovery and Data mining (KDD), pages 641-650, 2014.
Artur Czumaj, Peter Davies, and Merav Parter. Graph sparsification for derandomizing massively parallel computation with low space. In Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 175-185, 2020.
Artur Czumaj, Jakub Łacki, Aleksander Madry, Slobodan Mitrovic, Krzysztof Onak, and Piotr Sankowski. Round Compression for Parallel Matching Algorithms. SIAM Journal on Computing, 49(5):STOC18-1, 2019.
Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, 2008.
Erik D Demaine, Dotan Emanuel, Amos Fiat, and Nicole Immorlica. Correlation Clustering in General Weighted Graphs. Theoretical Computer Science, 361(2-3):172-187, 2006.
Guy Even, Moti Medina, and Dana Ron. Distributed Maximum Matching in Bounded Degree Graphs. In Proceedings of International Conference on Distributed Computing and Networking (ICDCN), 2015.
Manuela Fischer and Andreas Noever. Tight Analysis of Parallel Randomized Greedy MIS. In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2152-2160, 2018.
Mohsen Ghaffari. Massively Parallel Algorithms, 2019. Available at: URL: https://people.inf.ethz.ch/gmohsen/MPA19/Notes/MPA.pdf.
Mohsen Ghaffari, Themis Gouleakis, Christian Konrad, Slobodan Mitrović, and Ronitt Rubinfeld. Improved Massively Parallel Computation Algorithms for MIS, Matching, and Vertex Cover. In Proceedings of the ACM Symposium on Principles of Distributed Computing (PODC), pages 129-138, 2018.
Mohsen Ghaffari, Christoph Grunau, and Ce Jin. Improved MPC Algorithms for MIS, Matching, and Coloring on Trees and Beyond. In 34th International Symposium on Distributed Computing (DISC), pages 34:1-34:18, 2020.
Mohsen Ghaffari and Krzysztof Nowicki. Massively Parallel Algorithms for Minimum Cut. In Proceedings of the 39th Symposium on Principles of Distributed Computing (PODC), pages 119-128, 2020.
Mohsen Ghaffari and Jara Uitto. Sparsifying Distributed Algorithms with Ramifications in Massively Parallel Computation and Centralized Local Computation. In Proceedings of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1636-1653, 2019.
Michael T Goodrich, Nodari Sitchinava, and Qin Zhang. Sorting, Searching, and Simulation in the MapReduce Framework. In International Symposium on Algorithms and Computation (ISAAC), pages 374-383, 2011.
John E. Hopcroft and Richard M. Karp. An n^5/2 Algorithm for Maximum Matchings in Bipartite Graphs. SIAM Journal on Computing, 2(4):225-231, 1973.
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, pages 59-72, 2007.
Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii. A Model of Computation for MapReduce. In Proceedings of the 21st annual ACM-SIAM symposium on Discrete Algorithms (SODA), pages 938-948, 2010.
Christoph Lenzen and Roger Wattenhofer. Brief announcement: Exponential Speed-up of Local Algorithms Using Non-local Communication. In Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of Distributed Computing (PODC), pages 295-296, 2010.
Nathan Linial. Locality in distributed graph algorithms. SIAM Journal on computing, 21(1):193-201, 1992.
Xinghao Pan, Dimitris Papailiopoulos, Samet Oymak, Benjamin Recht, Kannan Ramchandran, and Michael I Jordan. Parallel Correlation Clustering on Big Graphs. In Advances in Neural Information Processing Systems (NeurIPS), pages 82-90, 2015.
David Peleg. Distributed Computing: A Locality-Sensitive Approach. Society for Industrial and Applied Mathematics, 2000.
Chaitanya Swamy. Correlation Clustering: Maximizing Agreements via Semidefinite Programming. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), volume 4, pages 526-527, 2004.
Tom White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, Ion Stoica, et al. Spark: Cluster Computing with Working Sets. HotCloud, 10(10-10):95, 2010.

Massively Parallel Correlation Clustering in Bounded Arboricity Graphs

Authors Mélanie Cambus , Davin Choo , Havu Miikonen , Jara Uitto

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message

Massively Parallel Correlation Clustering in Bounded Arboricity Graphs

Authors Mélanie Cambus , Davin Choo , Havu Miikonen , Jara Uitto

File

Document Identifiers

Related Versions

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

References

Thanks for your feedback!

Could not send message