Simultaneously Approximating All Norms for Massively Parallel Correlation Clustering
Abstract
We revisit the simultaneous approximation model for the correlation clustering problem introduced by Davies, Moseley, and Newman [21]. The objective is to find a single clustering that minimizes a given norm of the disagreement vector over all vertices.
We present an efficient algorithm that produces a clustering that is simultaneously a 63.3-approximation for all monotone symmetric norms. This significantly improves upon the previous approximation ratio due to Davies, Moseley, and Newman [21], which moreover applies only to ℓ_p-norms.
To achieve this result, we first reduce the problem to approximating all top-k norms simultaneously, using the connection between monotone symmetric norms and top-k norms established by Chakrabarty and Swamy [11]. We then develop a novel procedure that constructs a fractional clustering that is simultaneously an O(1)-approximation for all top-k norms. Our 63.3-approximation ratio is obtained by combining this with the 5-approximate rounding algorithm of Kalhan, Makarychev, and Zhou [26].
We then demonstrate that, with a small additional loss in the approximation ratio, the algorithm can be adapted to run in nearly linear time and in the MPC (massively parallel computation) model with a poly-logarithmic number of rounds.
By allowing a further trade-off in the approximation ratio, the number of MPC rounds can be reduced to a constant.
Keywords and phrases: Correlation Clustering, All-Norms, Approximation Algorithm, Massively Parallel Algorithm
Category: Track A: Algorithms, Complexity and Games
Funding: Nairen Cao: Supported by NSF grant CCF-2008422.
2012 ACM Subject Classification: Theory of computation → Approximation algorithms analysis; Theory of computation → Massively parallel algorithms
Editors: Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis

1 Introduction
Clustering is a classic problem in unsupervised machine learning. It aims to classify a given set of data elements based on their similarities, with the goal of maximizing the similarity between elements within the same class and minimizing the similarity between elements in different classes. Among the various graph clustering problems, correlation clustering stands out as a classic model. Initially proposed by Bansal, Blum, and Chawla [4], the model has numerous practical applications, including automated labelling [10, 1], community detection and mining [15, 31, 30], and disambiguation tasks [25], among others.
The input of the standard correlation clustering problem is a complete graph over a set of vertices, whose edges are partitioned into a set of "+" edges and a set of "−" edges. The output of the problem is a clustering (or a partition) of the vertex set that minimizes the number of edges in disagreement: an edge is in disagreement if it is a "+" edge whose endpoints are in different clusters, or a "−" edge whose endpoints are in the same cluster. Throughout the paper, we shall use a graph to denote a correlation clustering instance, with the edges of the graph being the "+" edges and the non-adjacent pairs being the "−" edges.
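To make the objective concrete, the following small sketch (our illustration, not code from the paper) computes the disagreement vector of a clustering when the instance is given by its "+" edges.

```python
from itertools import combinations

def disagreement_vector(n, plus_edges, cluster_of):
    """Disagreement vector of a clustering.

    n           -- number of vertices, labeled 0..n-1
    plus_edges  -- set of frozensets {u, v} that are "+" edges;
                   every other pair is implicitly a "-" edge
    cluster_of  -- cluster_of[v] is the cluster id of vertex v
    """
    dis = [0] * n
    for u, v in combinations(range(n), 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        is_plus = frozenset((u, v)) in plus_edges
        # a "+" edge disagrees if it is cut; a "-" edge disagrees if it is not cut
        if is_plus != same_cluster:
            dis[u] += 1
            dis[v] += 1
    return dis

# toy example: a "+" triangle on {0,1,2} plus vertex 3, all placed in one cluster
plus = {frozenset(e) for e in [(0, 1), (1, 2), (0, 2)]}
print(disagreement_vector(4, plus, [0, 0, 0, 0]))  # [1, 1, 1, 3]
```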
This problem is known to be APX-hard [13]. There has been a long line of constant-factor approximation algorithms for the problem [4, 13, 2, 14, 19, 18, 8], with the current best approximation ratio due to [8]. The same paper also presented an improved hardness of 24/23 for the problem, making the hardness constant explicit.
Besides the standard setting, other objectives have been studied recently, with the goal of minimizing some norm of the disagreement vector of the clustering over the vertices. For a clustering of the vertex set, the disagreement vector records, for every vertex, the number of edges incident to that vertex that are in disagreement with respect to the clustering. Given some norm (that is, a function that is absolutely homogeneous and satisfies the triangle inequality), the goal of the problem is to minimize the norm of the disagreement vector. Notice that the standard correlation clustering problem corresponds to the case where the norm is the ℓ1 norm.
Puleo and Milenkovic [29] initiated the study of correlation clustering with the goal of minimizing the ℓ_p norm of the disagreement vector. They proved that the problem is NP-hard for the ℓ∞-norm objective, and for any fixed p they gave a constant-factor approximation algorithm for the ℓ_p-norm objective. The approximation ratio for the ℓ∞-norm was subsequently improved by Charikar, Gupta, and Schwartz [12], and Kalhan, Makarychev, and Zhou [26] gave a 5-approximation for the ℓ_p-norm for any fixed p. Very recently, Heidrich, Irmai, and Andres [24] improved the approximation ratio to 4 for the ℓ∞-norm.
Davies, Moseley, and Newman [21] introduced the concept of simultaneous approximation for all ℓ_p-norms. They developed an efficient algorithm that outputs a single clustering which is simultaneously an O(1)-approximation for the ℓ_p norm for every p ≥ 1. This is rather surprising, as it was not known a priori whether such a clustering even exists. To achieve the goal, they first construct a fractional clustering that is simultaneously an O(1)-approximation for all ℓ_p norms, and then use the 5-approximate rounding algorithm of Kalhan, Makarychev, and Zhou [26] to round it into an integral clustering. Crucially, the algorithm of [26] guarantees a per-vertex 5-approximation, meaning that the number of disagreements at each vertex in the output clustering is at most 5 times the fractional number of edges in disagreement incident to that vertex. This strong property is used to obtain the final simultaneous O(1)-approximation in [21].
In light of the ever-growing size of networks, it is imperative to develop efficient parallel algorithms. This urgency is particularly pronounced in machine learning and data mining applications, where timely and efficient processing is essential for extracting meaningful insights from vast datasets. Many works in the literature aim to design efficient parallel algorithms [5, 16, 28, 23, 6, 17, 3, 7, 9]. The MPC model, as a theoretical abstraction of several real-world parallel frameworks such as MapReduce [22], is a prevalent methodology employed in these works.
1.1 Our results
In this paper, we revisit and generalize the simultaneous approximation model for correlation clustering that was introduced by [21]. Instead of considering only ℓ_p norms, we consider all monotone symmetric norms. A norm is monotone if its value does not decrease when any coordinate of the (non-negative) input vector increases, and it is symmetric if its value is invariant under permutations of the coordinates. Such norms were considered in [11] in the context of load balancing and clustering. Our first result is that a simultaneous 63.3-approximation for all monotone symmetric norms exists for correlation clustering, and it can be constructed in polynomial time.
Definition 1.
Given a correlation clustering instance and α ≥ 1, we say a clustering of the vertex set is simultaneously α-approximate, or a simultaneous α-approximation, for a family of norms if, for every norm in the family, the norm of its disagreement vector is at most α times the norm of the disagreement vector of an optimum clustering for that norm.
Theorem 2.
Given a correlation clustering instance, in polynomial time we can construct a simultaneous 63.3-approximate clustering for the family of monotone symmetric norms.
Next, we are concerned with the running time of the algorithm and its implementation under the MPC model. To state the result, we need a formal description of the MPC model.
The MPC model.
In the MPC model, data is distributed across a set of machines, and computation proceeds in synchronous rounds. During each round, each machine first receives messages from other machines, then performs computations based on this information and its own allocated memory, and finally sends messages to other machines to be received at the start of the next round. Each machine has limited local memory, restricting the total number of messages it can receive or send in a round. The efficiency of the algorithm is measured by the number of rounds, the memory used by each machine, the total memory used by all machines, and the running time over all machines, also known as the total work.
In this paper, we consider the MPC model in the strictly sublinear regime: each machine has local memory O(N^δ), where N is the input size and δ > 0 is a constant that can be made arbitrarily small. Under this model, we assume the part of the input received by each machine has size O(N^δ).
We then describe the correlation clustering problem under the MPC model in the strictly sublinear regime. We use n and m to denote the number of vertices and edges, respectively, in the input graph. The edges are distributed across the machines, where each machine has O(n^δ) memory for a constant δ > 0 which can be made arbitrarily small. At the end of the algorithm, each machine needs to store in its local memory the IDs of the clusters of all vertices incident to its assigned edges.
Our main result regarding MPC algorithms is as follows.
Theorem 3.
Let ε > 0 be a constant. There exists a randomized MPC algorithm in the strictly sublinear regime that, given a correlation clustering instance, outputs in a poly-logarithmic number of rounds a clustering that is a simultaneous O(1)-approximation for all monotone symmetric norms. This algorithm succeeds with high probability. It uses Õ(m) total memory and Õ(m) total work. (As usual, we use the Õ(·) notation to hide a poly-logarithmic factor in the input size.)
In particular, the algorithm can be converted into a nearly linear time algorithm that, with high probability, outputs a simultaneous O(1)-approximation for all monotone symmetric norms.
Along the way, we develop an MPC rounding algorithm with a per-vertex approximation guarantee, based on the sequential algorithm due to [26]. Given its potential independent interest, we state it here for future reference.
Theorem 4.
Let ε > 0 be a constant. Suppose we are given a graph and a set of LP values satisfying the approximate triangle inequality, and let the LP disagreement of a node be its fractional number of disagreements under these values. There exists an MPC algorithm that computes a clustering in which, for every node, the number of disagreements is bounded in terms of its LP disagreement.
This algorithm always succeeds, but it terminates within the stated number of rounds only with high probability, and it requires strictly sublinear memory per machine. Moreover, letting E' be the set of "+" edges together with the "−" edges whose LP value is less than 1, the total memory and total work of the algorithm are near-linear in |E'|.
The round complexity in the above theorem might not be desirable for many applications. Our next result shows that we can reduce the number of rounds to a constant, albeit with a worse approximation ratio:
Theorem 5.
Let ε > 0 be a constant. There exists a randomized MPC algorithm in the strictly sublinear regime that, given a correlation clustering instance, outputs in a constant number of rounds a clustering that is simultaneously an O(1)-approximation for all monotone symmetric norms. This algorithm succeeds with high probability, and it uses near-linear total memory and total work.
Overall, relative to [21], our algorithms demonstrate the following improvements.
-
1.
We generalize the family of norms for the simultaneous approximation from ℓ_p norms to all monotone symmetric norms.
-
2.
We obtain a simpler construction, which leads to a much smaller approximation ratio. Using a result from [11], to simultaneously approximate all monotone symmetric norms, it suffices to approximate all top-k norms: the top-k norm of a non-negative vector is the sum of its k largest coordinates. Though more general mathematically, top-k norms are more convenient to work with than ℓ_p norms.
-
3.
We can make our algorithm run in nearly linear time. This is the first nearly-linear time simultaneous O(1)-approximation algorithm for the problem, even when restricted to ℓ_p norms. In contrast, the algorithm of [21] runs in nearly linear time only when the graph has small maximum degree.
-
4.
We can make our algorithm run in the MPC model with a poly-logarithmic number of rounds. Our work is the first to consider the problem in the MPC model.
1.2 Overview of Techniques
We then discuss our techniques for each of our main results.
Polynomial-Time Construction of a Simultaneous 63.3-Approximation for All Symmetric Norms.
By [11], we can reduce the problem of approximating all monotone symmetric norms to that of approximating all top-k norms. We then construct a fractional solution, which is a metric over the vertex set with values in [0, 1], such that its fractional disagreement vector has top-k norm at most a constant times the cost of the optimum clustering under the top-k norm, for every k. Then we can use the 5-approximate rounding algorithm of KMZ [26] to obtain a simultaneous 63.3-approximation for all top-k norms. The KMZ rounding algorithm has two crucial properties that we need: it does not depend on k, and it achieves a per-vertex guarantee.
We elaborate more on how to construct the metric. A natural idea for assigning the LP values, used by [21], is to set the value of a pair based on the intersection of the neighborhoods of its endpoints. Intuitively, the more common neighbors two nodes share, the closer they should be. A straightforward approach to implementing this idea is to set the LP value of a pair according to the fraction of neighbors its endpoints share, normalized by the larger of the two degrees; it is convenient to assume that every vertex is a neighbor of itself. This approach works for the top-1 norm (i.e., the ℓ∞ norm) as discussed in [20], but fails for the top-n norm (i.e., the ℓ1 norm): on a star graph, it produces an LP cost that is much larger than the optimal top-n cost. [21] addressed the issue by rounding the LP values of some edges up to 1 whenever a node's total LP disagreement over its edges is larger than its number of edges. After this transformation, the triangle inequalities are only satisfied approximately, but this can be handled with a constant-factor loss in the approximation ratio.
We address this issue using a different approach, inspired by the pre-clustering technique in [17]. We first preprocess the graph by removing edges whose endpoints share too few neighbors relative to their degrees. We then set the LP value of each surviving edge based on the common neighborhood of its endpoints in the pruned graph, and set the LP value of every other pair to 1. We show that this solution is simultaneously an O(1)-approximation for all top-k norms. Compared to [21], in addition to the improved approximation ratio, we obtain a considerably simpler analysis.
Implementation of the Algorithm in Nearly Linear Time and in the MPC Model.
We now discuss our techniques for improving the running time of the algorithm to nearly linear. The algorithm contains two parts: the construction of the fractional solution and the rounding procedure. We discuss the two procedures separately.
Constructing the fractional solution in nearly linear time poses several challenges. First, the construction of the pruned subgraph requires us to identify the edges whose endpoints share few neighbors. Second, we cannot explicitly assign values to all pairs of vertices. Finally, computing each LP value requires computing the size of a common neighborhood.
The first and third challenges can be addressed through sampling, with a poly-logarithmic factor loss in the running time. To avoid considering too many pairs, we only consider pairs whose LP value is bounded away from 1; consequently, we only need to consider pairs whose endpoints share a constant fraction of their neighbors. Given that each neighbor of a vertex in the pruned graph has degree similar to that vertex, we show that only a small number of pairs per vertex, and a near-linear number of pairs overall, need to be assigned explicit values. Moreover, the nearly-linear time algorithm for constructing the fractional solution can be naturally implemented in the MPC model in a small number of rounds.
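The following sketch illustrates the sampling idea on a single pair (the sample size, and the exact quantity being estimated, are our assumptions rather than the paper's); by a Chernoff bound, a logarithmic number of samples suffices to estimate overlaps that are not too small up to a constant factor.

```python
import random

def estimate_overlap(adj, u, v, samples=200, seed=0):
    """Estimate |N(u) & N(v)| / deg(u) by sampling random neighbors of u.

    adj -- dict: vertex -> set of "+" neighbors (each vertex included in its own set)
    """
    rng = random.Random(seed)
    neighbors_u = list(adj[u])
    hits = sum(1 for _ in range(samples) if rng.choice(neighbors_u) in adj[v])
    return hits / samples
```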
We then proceed to the rounding algorithm. We are explicitly given the LP values of the "+" edges and of a nearly-linear number of additional pairs; all other pairs have LP value 1. The KMZ algorithm works as follows: in each round, the algorithm selects a node as the cluster center and includes a ball of some radius around it, meaning that all nodes whose LP distance to the center is within the radius are placed in the cluster; it then removes the clustered nodes and repeats the process on the remaining nodes. The rounding algorithm can be easily implemented in nearly-linear time using a priority queue. This leads to a nearly-linear time simultaneous O(1)-approximation for correlation clustering for all monotone symmetric norms.
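The following simplified sequential sketch shows the ball-carving structure; the center-selection rule here (an arbitrary unclustered node) and the radius r are placeholders of ours, whereas [26] chooses centers more carefully to obtain its per-vertex guarantee.

```python
def ball_carving(vertices, x, r=0.4):
    """Simplified KMZ-style rounding: repeatedly pick a center and carve out
    the ball of LP-distance radius r around it.

    x -- dict mapping frozenset({u, v}) to an LP value in [0, 1];
         pairs missing from x are treated as having value 1.
    """
    def dist(u, v):
        return 0.0 if u == v else x.get(frozenset((u, v)), 1.0)

    unclustered = set(vertices)
    clusters = []
    while unclustered:
        center = next(iter(unclustered))   # placeholder rule; [26] is more careful
        ball = {v for v in unclustered if dist(center, v) < r}
        clusters.append(ball)
        unclustered -= ball
    return clusters
```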
The main challenge in implementing the algorithm in the MPC model is its sequential nature. [26] observes that if, in each round, we select as cluster center a node maximizing a suitable quantity, then each node's algorithmic cost can be bounded. However, choosing a node that maximizes some target inherently makes the process sequential. Our key observation is that, instead of selecting the node that exactly maximizes this quantity, we can allow some approximation. This still yields a reasonable approximation ratio, with a small additional overhead, while allowing multiple nodes to be selected as cluster centers, thereby parallelizing the rounding process. In each round, several candidate cluster centers may conflict with each other. To resolve these conflicts, we employ the classical Luby's algorithm [27, 16] to find a maximal independent set, ensuring that no two chosen cluster centers conflict.
Organization.
We give some preliminary remarks in Section 2. In Section 3, we describe our simultaneous approximation algorithm for correlation clustering for all top-k norms. The reduction from any monotone symmetric norm to top-k norms is deferred to Appendix A. Combining the results leads to a simultaneous 63.3-approximation algorithm for all monotone symmetric norms. Then in Appendices B and C, we show how to run the algorithm in the MPC model with nearly linear work. In particular, Appendices B and C discuss how to solve the LP and how to round the LP solution in the MPC model, respectively. The constant-round MPC algorithm is described in Appendix D. Theorems 2, 3, 4, and 5 are proved in Section 3, Appendix C, Appendix C, and Appendix D, respectively.
2 Preliminaries
The input to correlation clustering is a complete graph whose edges are partitioned into "+" edges and "−" edges. We shall use the graph G = (V, E) of "+" edges to denote an instance. Let n = |V| and m = |E|. For simplicity, we assume E contains all the self-loops. So, E is the set of "+" edges, and the set of non-adjacent pairs is the set of "−" edges. The graph G is fixed in most parts of the paper.
For any graph H and any vertex v, let N_H(v) denote the set of neighbors of v in H. For any vertex v and any subset S of vertices, we define deg_H(v, S) as the number of edges between v and S, and we simply write deg_H(v) for deg_H(v, V). When the graph is the input graph G, we omit the subscript. Notice that v ∈ N(v) and deg(v) = |N(v)| for every vertex v, due to the self-loops. For the input graph and any two vertices u and v, we define d_{uv} as the maximum of the degrees of u and v; we introduce this shorthand as the notion will be used frequently. For any two sets A and B, we denote their symmetric difference by A Δ B. Algorithms are parameterized by constants that will be determined later.
A norm on n-dimensional non-negative vectors is a function that is absolutely homogeneous and satisfies the triangle inequality. We say a norm is monotone if its value does not decrease when any coordinate of the vector increases, and symmetric if its value is invariant under permutations of the coordinates. The top-k norm, for an integer k, maps a vector to the sum of its k largest coordinates. Chakrabarty and Swamy [11] showed that any monotone and symmetric norm can be written as the maximum of a (possibly large) family of ordered norms. This leads to the following lemma, which reduces monotone symmetric norms to top-k norms. For completeness, we defer its proof to Appendix A.
Lemma 6.
If an algorithm returns a single clustering that is simultaneously an α-approximation for the top-k norm objective for every integer k, then that clustering is an α-approximation for any monotone and symmetric norm.
For a fixed clustering, we have already defined its disagreement vector, whose coordinate at a vertex is the number of edges incident to that vertex that are in disagreement with respect to the clustering. Given an integer k and a clustering, its top-k value is the top-k norm of its disagreement vector. Similarly, any fractional solution induces a fractional disagreement at every vertex, and its top-k value is the top-k norm of the fractional disagreement vector.
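For concreteness, a top-k norm can be computed as follows (an illustrative helper of ours):

```python
def top_k_norm(vec, k):
    """Top-k norm: the sum of the k largest coordinates of a non-negative vector."""
    return sum(sorted(vec, reverse=True)[:k])

dis = [3, 0, 5, 1, 2]
print(top_k_norm(dis, 1))          # 5  -- the top-1 norm is the l_infinity norm
print(top_k_norm(dis, len(dis)))   # 11 -- the top-n norm is the l_1 norm
```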
We will use the following theorem from [26]:
Theorem 7.
Let G be a correlation clustering instance, and let x be a metric over the vertex set with values in [0, 1]. There is a polynomial time algorithm that, given G and x, outputs a clustering in which the number of disagreements at every vertex is at most 5 times its fractional disagreement with respect to x.
We will use the following well-known concentration inequalities.
Theorem 8 (Chernoff Bound).
Let X_1, ..., X_t be independent random variables taking values in [0, 1], and let X be the sum of these random variables. Then the following inequalities hold:
-
(a)
For any , if , then .
-
(b)
For any , if , then .
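The two inequalities above are presumably the standard multiplicative Chernoff bounds; for reference, one common form (our reconstruction, with μ an upper, respectively lower, bound on E[X]) reads:

```latex
\text{(a)}\quad \Pr[X \ge (1+\delta)\mu] \le \exp\!\Big(-\frac{\delta^2 \mu}{2+\delta}\Big)
  \quad\text{for any } \delta > 0 \text{ and } \mu \ge \mathbb{E}[X];
\qquad
\text{(b)}\quad \Pr[X \le (1-\delta)\mu] \le \exp\!\Big(-\frac{\delta^2 \mu}{2}\Big)
  \quad\text{for any } \delta \in (0,1) \text{ and } \mu \le \mathbb{E}[X].
```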
3 Simultaneous 63.3-Approximation for Top-k Norms
In this section, we describe our simultaneous 63.3-approximation for correlation clustering for all top-k norms. The algorithm described in this section runs in polynomial time. It first constructs an LP solution to the top-k linear program using a combinatorial procedure. Crucially, the construction does not depend on the value of k. We show that the solution has cost at most a constant times the optimum cost under the top-k norm, for any integer k. Then we use the rounding algorithm of [26] to round the LP solution into an integral one. As the rounding gives a vertex-by-vertex 5-approximation guarantee, this leads to a 63.3-approximation for the top-k norm for any k.
The LP for minimizing the top-k norm of the clustering is given in LP (1): the objective (1a) minimizes the top-k norm of the fractional disagreement vector, subject to the metric constraints (1b), (1c), and (1d) explained below.
In the corresponding integer program, a variable for each pair of vertices indicates whether the pair is separated or not. We view each pair as unordered, so the two orderings of a pair refer to the same variable. The variables thus form a metric with distances in {0, 1}, which is relaxed to [0, 1] in the linear program; this is captured by constraints (1b), (1c), and (1d). Notice that the fractional disagreement of any vertex is a linear function of the variables. The top-k norm of the fractional clustering is defined as the top-k norm of the fractional disagreement vector. This could be captured by introducing an auxiliary variable, adding one constraint for every vertex subset of size k, and setting the auxiliary variable as the objective to minimize; for simplicity, we use the form as described. Despite having an exponential number of constraints, the LP can be solved efficiently, as there is a simple separation oracle. Moreover, we use a combinatorial algorithm to construct a solution directly, so the algorithm never actually solves the LP.
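A plausible reconstruction of LP (1) in standard notation (with x_{uv} the separation variable of the pair uv and y_u the fractional disagreement of vertex u; the exact presentation in the original may differ) is:

```latex
\begin{align}
\min\; & \max_{S \subseteq V,\ |S| = k}\ \sum_{u \in S} y_u,
  \qquad y_u \;=\; \sum_{v:\, uv \in E^+} x_{uv} \;+\; \sum_{v:\, uv \in E^-} (1 - x_{uv}) \tag{1a}\\
\text{s.t.}\; & x_{uw} \;\le\; x_{uv} + x_{vw} && \forall u, v, w \in V \tag{1b}\\
& x_{uu} \;=\; 0 && \forall u \in V \tag{1c}\\
& x_{uv} \;\in\; [0, 1] && \forall u, v \in V \tag{1d}
\end{align}
```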
3.1 Algorithm
The algorithm for constructing the LP solution is given in Algorithm 1. It depends on a parameter whose value will be specified later. During the process, we construct a subgraph by removing any edge whose endpoints have significantly different neighborhoods. We then set the LP value of each surviving edge based on the common neighborhood of its endpoints in this subgraph, and set the LP value of every other pair to 1. Recall that d_{uv} denotes the larger of the degrees of u and v in the input graph. Intuitively, we treat "+" edges as indicators of whether two nodes belong to the same cluster. The first step removes edges whose endpoints should not be in the same cluster. The second step ensures that the more common neighbors two nodes have, the closer their distance is.
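The exact thresholds and formulas are the parameters of Algorithm 1, which we do not reproduce here; the sketch below illustrates one natural instantiation of the two steps (prune dissimilar edges, then derive distances from common neighborhoods), with the threshold `beta` and the precise formula being our assumptions.

```python
def build_lp_values(adj, beta=0.2):
    """Sketch of the two-step construction: prune, then assign distances.

    adj  -- dict: vertex -> set of "+" neighbors (each vertex contained in its
            own set, mirroring the self-loop convention of the paper)
    beta -- pruning threshold (illustrative; the paper tunes its own constants)

    Returns a dict mapping frozenset({u, v}) to an LP value; pairs that are
    absent are implicitly at distance 1.
    """
    def d_uv(u, v):
        return max(len(adj[u]), len(adj[v]))

    # Step 1: keep only edges whose endpoints have similar neighborhoods.
    pruned = {
        frozenset((u, v))
        for u in adj for v in adj[u]
        if u != v and len(adj[u] ^ adj[v]) <= beta * d_uv(u, v)
    }
    pruned_adj = {u: {v for v in adj[u] if u == v or frozenset((u, v)) in pruned}
                  for u in adj}

    # Step 2: the more common neighbors two endpoints have in the pruned graph,
    # the smaller their distance.
    x = {}
    for e in pruned:
        u, v = tuple(e)
        x[e] = 1.0 - len(pruned_adj[u] & pruned_adj[v]) / d_uv(u, v)
    return x
```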
In the remaining part of this section, we will show
Lemma 9.
Proof of Theorem 2.
Once we obtain Lemma 9, [26] provides a rounding algorithm for any feasible solution. Let the final clustering be C. [26] ensures that, for every node, its number of disagreements in C is at most 5 times its fractional disagreement in the LP solution. We apply the KMZ rounding algorithm to the solution output by Lemma 9. For any integer k, let S be the set of k vertices with the largest disagreement values with respect to C. Then we have:
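The omitted chain of inequalities presumably combines the per-vertex guarantee with Lemma 9. Writing y for the fractional disagreement vector of the LP solution, α for the top-k approximation factor guaranteed by Lemma 9, and OPT_k for the optimum top-k cost, it would read:

```latex
\sum_{u \in S} \mathrm{dis}_C(u)
\;\le\; 5 \sum_{u \in S} y_u
\;\le\; 5 \cdot \max_{S' \subseteq V,\ |S'| = k}\ \sum_{u \in S'} y_u
\;\le\; 5\alpha \cdot \mathrm{OPT}_k .
```

Since the left-hand side is exactly the top-k norm of the disagreement vector of C, this gives the 63.3-approximation for every top-k norm.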
By Lemma 6, we know that C is a simultaneous 63.3-approximate clustering for all monotone symmetric norms.
3.2 The Validity of the Solution for LP (1)
To show that the constructed solution is valid for LP (1), it suffices to prove that it is a metric over the vertex set with values in [0, 1]. Constraints (1c) and (1d) hold trivially, so it remains to show that the triangle inequality (i.e., constraint (1b)) is satisfied:
Lemma 10.
For any three vertices, the LP values satisfy the triangle inequality.
Proof.
We can assume are distinct since otherwise the inequality holds trivially. We assume that wlog.
The first inequality used that , the second one follows from , and the third one used and .
3.3 Bounding the Top-k Norm Cost of the Solution
In this section, we compare the top-k norm cost of the constructed solution to the optimum top-k cost.
Notations.
We fix the integer . Let be the clustering that minimizes the top- norm of disagreement vector, but our analysis works for any clustering. For every , let be the cluster in that contains .
For every , let and respectively be the number of edges, edges and edges incident to that are in disagreement in the clustering . Recall is the top- norm cost of the clustering , thus we have .
Let be the set of vertices with the largest values. In order to provide a clear demonstration, we divide all of the edges in into five parts. First, we separate out the parts that are easily constrained by . Let be the set of edges that are cut in , and be the set of edges that are not cut in . For the remaining edges in that are not cut in , it is necessary to utilize the properties of edges in . To this end, let be the set of edges in that are not cut in , and be the set of edges in that are not cut in . Finally, we define as the set of edges that are cut in .
For every and , we let be the set of pairs in incident to . Notice that contains all the self-loops. We use to denote the end-vertices of the edges in other than ; so . We let denote the cost of edges in in the solution . For every , we define . Therefore, the top- norm cost of is .
With the notation defined, we can proceed to the analysis. Before that, we present several propositions which will prove useful in the following analysis. We start with a property of the relevant edges.
Lemma 11.
For every , we have .
Proof.
Since , there are at least disagreements incident to and . For every , we have
Next, we show that the endpoints of such edges have similar degrees.
Lemma 12.
For every , we have .
Proof.
For every , we have . Therefore,
which gives us since .
As we are about to analyze the top-k norm objective, we can bound the cost of the solution using per-vertex coefficients. This key observation is formalized as follows:
Claim 13.
Let be a vector and satisfying that and . Then we have
Proof.
We have and . It is well known that such a vector is a convex combination of 0/1-vectors, each of which has at most k ones. For each 0/1-vector in the combination, the corresponding inequality holds. Taking the convex combination of these inequalities gives the desired bound, which is equivalent to the inequality in the claim.
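In symbols, the inequality being established is presumably of the following form (our reconstruction of the elided statement):

```latex
\text{If } z \in [0,1]^{V} \text{ and } \sum_{u \in V} z_u \le k, \text{ then }
\sum_{u \in V} a_u z_u \;\le\; \text{(top-}k\text{ norm of } a\text{)}
\quad\text{for every } a \in \mathbb{R}_{\ge 0}^{V}.
```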
From their definitions, the cost of the first two edge sets can be bounded directly:
Claim 14.
.
Proof.
To bound the cost of the remaining edges, we separately analyze the cost coefficients of vertices in and .
Lemma 15.
There exists a vector with the following properties:
-
(a)
.
-
(b)
, for every .
-
(c)
.
Proof.
To show Lemma 15(b), we bound the coefficients for and respectively. If , the coefficient for is ; if , the coefficient is . The coefficient for is . Therefore, .
To bound , we can replace and with 1.
Following the analysis of the cost coefficients of , we analyze the cost coefficients of in edge set .
Lemma 16.
There exists a vector with the following properties:
-
(a)
.
-
(b)
, for every .
-
(c)
.
Proof.
Fix some with . We let , and be the other vertex in . Notice that . Then we have
(2)
(3)
We prove the first inequality in sequence (3). Notice . Therefore,
Above, we used that . Also (3) holds trivially when ; this holds for every .
Therefore,
Again, we used Lemma 11 twice to prove the last inequality. Therefore, Lemma 16(a) holds.
To prove Lemma 16(b), we consider the coefficients for and respectively. For a , the coefficient for is
The coefficient for is
The coefficient for is at most
Therefore, for every , we have .
Lemma 17.
.
Proof.
We define for every . Then,
where the first inequality holds because Lemma 15(b) and Lemma 16(b), the second inequality holds because and , the last inequality holds because .
Similarly, we have
where the first inequality holds because Lemma 15(c) and Lemma 16(c), the second inequality holds because and for any there is , the last inequality holds because for any there is .
The lemma follows from Claim 13.
For the cost of the remaining edges, we again analyze the cost coefficient of each vertex.
Lemma 18.
There exists a vector with the following properties:
-
(a)
.
-
(b)
, for every .
-
(c)
.
Proof.
We have
Given that indicates that , we distinguish between two cases: and . For simplicity, in the summations below, we automatically impose the constraints and when the vertices involved in constraints are defined. Notice that we have from Lemma 12.
(4)
(5)
Adding (4) and (5), and using that for every , we get
To prove Lemma 18(b), we consider the coefficients for and respectively. The coefficient for is . The coefficient for is
Therefore, for any , we have .
With Lemma 18, we can then bound .
Lemma 19.
.
Proof.
References
- [1] Rakesh Agrawal, Alan Halverson, Krishnaram Kenthapadi, Nina Mishra, and Panayiotis Tsaparas. Generating labels from clicks. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 172–181, 2009. doi:10.1145/1498759.1498824.
- [2] Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: Ranking and clustering. Journal of the ACM, 55(5):1–27, 2008. doi:10.1145/1411509.1411513.
- [3] Sepehr Assadi and Chen Wang. Sublinear time and space algorithms for correlation clustering via sparse-dense decompositions. In Proceedings of the 13th Conference on Innovations in Theoretical Computer Science (ITCS), volume 215 of LIPIcs, pages 10:1–10:20, 2022. doi:10.4230/LIPICS.ITCS.2022.10.
- [4] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Machine learning, 56(1):89–113, 2004. doi:10.1023/B:MACH.0000033116.57574.95.
- [5] Guy E Blelloch, Jeremy T Fineman, and Julian Shun. Greedy sequential maximal independent set and matching are parallel on average. In Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures, pages 308–317, 2012. doi:10.1145/2312005.2312058.
- [6] Mélanie Cambus, Davin Choo, Havu Miikonen, and Jara Uitto. Massively parallel correlation clustering in bounded arboricity graphs. In 35th International Symposium on Distributed Computing (DISC), volume 209 of LIPIcs, pages 15:1–15:18, 2021. doi:10.4230/LIPICS.DISC.2021.15.
- [7] Mélanie Cambus, Fabian Kuhn, Etna Lindy, Shreyas Pai, and Jara Uitto. A (3+)-approximate correlation clustering algorithm in dynamic streams. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2861–2880. SIAM, 2024. doi:10.1137/1.9781611977912.101.
- [8] Nairen Cao, Vincent Cohen-Addad, Euiwoong Lee, Shi Li, Alantha Newman, and Lukas Vogl. Understanding the cluster linear program for correlation clustering. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 1605–1616, 2024. doi:10.1145/3618260.3649749.
- [9] Nairen Cao, Shang-En Huang, and Hsin-Hao Su. Breaking 3-factor approximation for correlation clustering in polylogarithmic rounds. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 4124–4154. SIAM, 2024. doi:10.1137/1.9781611977912.143.
- [10] Deepayan Chakrabarti, Ravi Kumar, and Kunal Punera. A graph-theoretic approach to webpage segmentation. In Proceedings of the 17th International conference on World Wide Web (WWW), pages 377–386, 2008. doi:10.1145/1367497.1367549.
- [11] Deeparnab Chakrabarty and Chaitanya Swamy. Approximation algorithms for minimum norm and ordered optimization problems. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 126–137, 2019. doi:10.1145/3313276.3316322.
- [12] Moses Charikar, Neha Gupta, and Roy Schwartz. Local guarantees in graph cuts and clustering. In International Conference on Integer Programming and Combinatorial Optimization (IPCO), pages 136–147, 2017. doi:10.1007/978-3-319-59250-3_12.
- [13] Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with qualitative information. Journal of Computer and System Sciences, 71(3):360–383, 2005. doi:10.1016/J.JCSS.2004.10.012.
- [14] Shuchi Chawla, Konstantin Makarychev, Tselil Schramm, and Grigory Yaroslavtsev. Near optimal LP rounding algorithm for correlation clustering on complete and complete k-partite graphs. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing (STOC), pages 219–228, 2015. doi:10.1145/2746539.2746604.
- [15] Yudong Chen, Sujay Sanghavi, and Huan Xu. Clustering sparse graphs. In Advances in Neural Information Processing Systems (Neurips), pages 2204–2212, 2012.
- [16] Flavio Chierichetti, Nilesh Dalvi, and Ravi Kumar. Correlation clustering in MapReduce. In Proceedings of the 20th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 641–650, 2014. doi:10.1145/2623330.2623743.
- [17] Vincent Cohen-Addad, Silvio Lattanzi, Slobodan Mitrović, Ashkan Norouzi-Fard, Nikos Parotsidis, and Jakub Tarnawski. Correlation clustering in constant many parallel rounds. In International Conference on Machine Learning, pages 2069–2078. PMLR, 2021. URL: http://proceedings.mlr.press/v139/cohen-addad21b.html.
- [18] Vincent Cohen-Addad, Euiwoong Lee, Shi Li, and Alantha Newman. Handling correlated rounding error via preclustering: A 1.73-approximation for correlation clustering. In Proceedings of the 64th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2023.
- [19] Vincent Cohen-Addad, Euiwoong Lee, and Alantha Newman. Correlation clustering with Sherali-Adams. In Proceedings of 63rd Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 651–661, 2022. doi:10.1109/FOCS54457.2022.00068.
- [20] Sami Davies, Benjamin Moseley, and Heather Newman. Fast combinatorial algorithms for min max correlation clustering. arXiv preprint arXiv:2301.13079, 2023. doi:10.48550/arXiv.2301.13079.
- [21] Sami Davies, Benjamin Moseley, and Heather Newman. Simultaneously approximating all lp-norms in correlation clustering. In 51st International Colloquium on Automata, Languages, and Programming, ICALP 2024, 2024.
- [22] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008. doi:10.1145/1327452.1327492.
- [23] Manuela Fischer and Andreas Noever. Tight analysis of parallel randomized greedy mis. ACM Transactions on Algorithms (TALG), 16(1):1–13, 2019. doi:10.1145/3326165.
- [24] Holger SG Heidrich, Jannik Irmai, and Bjoern Andres. A 4-approximation algorithm for min max correlation clustering. In AISTATS, pages 1945–1953, 2024.
- [25] Dmitri V. Kalashnikov, Zhaoqi Chen, Sharad Mehrotra, and Rabia Nuray-Turan. Web people search via connection analysis. IEEE Transactions on Knowledge and Data Engineering, 20(11):1550–1565, 2008. doi:10.1109/TKDE.2008.78.
- [26] Sanchit Kalhan, Konstantin Makarychev, and Timothy Zhou. Correlation clustering with local objectives. Advances in Neural Information Processing Systems, 32, 2019.
- [27] Michael Luby. A simple parallel algorithm for the maximal independent set problem. In Proceedings of the seventeenth annual ACM symposium on Theory of computing, pages 1–10, 1985. doi:10.1145/22145.22146.
- [28] Xinghao Pan, Dimitris S. Papailiopoulos, Samet Oymak, Benjamin Recht, Kannan Ramchandran, and Michael I. Jordan. Parallel correlation clustering on big graphs. In Advances in Neural Information Processing Systems (Neurips), pages 82–90, 2015. URL: https://proceedings.neurips.cc/paper/2015/hash/b53b3a3d6ab90ce0268229151c9bde11-Abstract.html.
- [29] Gregory Puleo and Olgica Milenkovic. Correlation clustering and biclustering with locally bounded errors. In International Conference on Machine Learning, pages 869–877. PMLR, 2016. URL: http://proceedings.mlr.press/v48/puleo16.html.
- [30] Jessica Shi, Laxman Dhulipala, David Eisenstat, Jakub Łącki, and Vahab Mirrokni. Scalable community detection via parallel correlation clustering. arXiv preprint arXiv:2108.01731, 2021.
- [31] Nate Veldt, David F. Gleich, and Anthony Wirth. A correlation clustering framework for community detection. In Proceedings of the 2018 ACM World Wide Web Conference (WWW), pages 439–448, 2018. doi:10.1145/3178876.3186110.
Appendix A Reduction from All Monotone Symmetric Norms to Top-k Norms
Definition 20 (Ordered Norms).
For any vector, consider its coordinates sorted in non-increasing order. Given a weight vector with non-negative, non-increasing coordinates, the corresponding ordered norm of a vector is the inner product of the weight vector with the sorted vector.
Lemma 21 (Lemma 5.2 of [11]).
For any monotone and symmetric norm , define the set , and . Then we have for every .
See Lemma 6.
Proof of Lemma 6.
For any such that , if we set as
Let denote the top- norm of . Then we have .
Let and . We construct a new set . By Lemma 21, we have .
Consider the disagreement vector of the given clustering. For any symmetric monotone norm, consider also the disagreement vector of the optimal clustering under that norm. By the assumption that the given clustering is a simultaneous α-approximation for every top-k norm, its top-k value is at most α times that of the optimal clustering under the top-k objective, for every k. Now we bound the norm of the disagreement vector of the given clustering in terms of the optimum for any monotone symmetric norm:
Appendix B Implementing Algorithm 1 in Nearly Linear Time
In this section, we show how to run Algorithm 1 approximately in nearly linear time. Indeed, the algorithm can be implemented in the MPC model with a small number of rounds. More precisely, we will show the following theorem:
Theorem 22.
Let and be small enough constants. Given a graph , there exists an MPC algorithm that computes a solution such that
-
1.
For any integer , we have .
-
2.
For any , we have .
The algorithm succeeds with probability at least . Moreover, the algorithm runs in rounds, has a total work of , requires memory per machine and a total memory.
We give the nearly linear time implementation of Algorithm 1 in Algorithm 2. Line 2 constructs the pruned graph efficiently. Lines 3-6 find the set of edges to which we want to assign explicit LP values; for any such edge, we simply set its LP value accordingly. Finally, Lines 7 to 14 set up the estimates satisfying the conditions in Theorem 22. The full version of this paper will contain a complete analysis of Algorithm 2 and the proof of Theorem 22.
Appendix C Rounding
We present a nearly linear time rounding algorithm. Furthermore, our algorithm takes only a poly-logarithmic number of rounds in the MPC model. The purpose of this section is to show
See Theorem 4.
We emphasize that even if the LP values satisfy the exact triangle inequality, rather than an approximate triangle inequality, the extra terms in the approximation ratio will still be present. These terms arise from two sources: the approximate triangle inequality itself and the inherent characteristics of our MPC algorithm.
Proof of Theorem 3.
We first run Algorithm 2, whose output serves as the input to Algorithm 3. By Theorem 22, for any k, the top-k cost of this solution is within a constant factor of the cost of the optimal correlation clustering solution under the top-k norm objective. By Theorem 4, the final clustering satisfies the per-vertex rounding guarantee, so its top-k cost is within a constant factor of the optimum for every k. By Lemma 6, the output is therefore a simultaneous approximation for all monotone symmetric norms.
For the number of rounds, Algorithm 1 takes rounds, and Algorithm 3 takes rounds, resulting in a total of rounds.
Algorithm 1 requires a total memory of and a total work of in the MPC model. By analyzing Algorithm 2, we know that . Therefore, Algorithm 3 requires a total memory of and a total work of in the MPC model, as established by Theorem 4. In total, it requires a total memory of and a total work of in the MPC model.
Rounding Algorithm
Assume that we are given an instance graph and an LP solution satisfying the approximate triangle inequality. Given a subgraph, a node, and a radius, define the ball centered at the node with the given radius as the set of nodes in the subgraph whose LP distance to the center is within the radius. We also associate a quantity with each ball; note that it is positive, since the center itself belongs to the ball.
Algorithm Description.
Our algorithm works as follows: in each round, we choose a set of cluster centers and carve out the ball of a fixed radius around each center as a cluster. More precisely, at step t, consider the set of vertices that are still unclustered. We first compute, for each unclustered node, a sufficiently accurate estimate of the quantity used to select centers. Any node for which this quantity is large enough is added to the set of candidate cluster centers (Lines 6-10). Then we compute the cluster centers by adding each candidate to the chosen set with some probability, where the more candidates are in conflict with each other, the smaller the probability (Lines 11-14). After that, to avoid conflicts, we remove some chosen centers that are too close to each other and derive the final set of cluster centers (Line 15). Each node clustered at this step is added to the cluster of the final center with minimum ID among those covering it. We then remove the clustered nodes and repeat the above process until all vertices are clustered.
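The following sketch illustrates the shape of one such round; the threshold tau, radius r, sampling probability, and conflict rule are placeholders of ours, not the exact quantities of Algorithm 3.

```python
import random

def one_parallel_round(unclustered, dist, score, tau, r=0.4, seed=None):
    """One round of parallelized ball carving (illustrative sketch only).

    unclustered -- set of still-unclustered vertices
    dist(u, v)  -- LP distance between u and v
    score(v)    -- the quantity used to rank candidate centers
    tau         -- threshold for becoming a candidate center
    """
    rng = random.Random(seed)

    # Step 1: candidate centers are nodes whose score is large enough.
    candidates = {v for v in unclustered if score(v) >= tau}

    # Step 2: sample candidates, with probability shrinking in the number of
    # nearby (conflicting) candidates.
    def conflicts(v):
        return sum(1 for u in candidates if u != v and dist(u, v) < 2 * r)
    sampled = {v for v in candidates if rng.random() < 1.0 / (1 + conflicts(v))}

    # Step 3: discard sampled centers that are still too close to each other.
    # (The paper resolves such conflicts in parallel via Luby's maximal
    # independent set; a simple greedy pass stands in for it here.)
    centers = set()
    for v in sorted(sampled):
        if all(dist(v, c) >= 2 * r for c in centers):
            centers.add(v)

    # Step 4: every covered node joins the ball of the minimum-ID center covering it.
    assignment = {}
    for v in unclustered:
        covering = [c for c in centers if dist(c, v) < r]
        if covering:
            assignment[v] = min(covering)
    return centers, assignment
```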
Appendix D A Constant Round MPC Algorithm
Algorithm
Theorem 3 gives us an MPC algorithm with a poly-logarithmic number of rounds. The bottleneck is the rounding procedure. To achieve a constant-round MPC algorithm, instead of setting up the LP and rounding it, we use the pre-clustering algorithm from [17], which has been very useful for correlation clustering under norm objectives. We show that the pre-clustering algorithm can also provide an O(1)-approximation ratio for all monotone symmetric norms simultaneously.
Algorithm Description.
The algorithm from [17] is parameterized by two constants. It has three steps:
- 1.
- 2.
-
3.
The last step is to output the connected components of the final graph.
A detailed analysis of Algorithm 4 along with the formal proof of Theorem 5 is deferred to the full version.