Query Lower Bounds for Correlation Clustering Under Memory Constraints
Abstract
This work initiates the study of memory–query tradeoffs for graph problems, with a focus on correlation clustering. Correlation clustering asks for a partition of the vertices that minimizes disagreements: non-edges inside clusters plus edges across clusters. Our first result is a tight query lower bound: to output a partition whose cost approximates the optimum up to an additive error of , any algorithm requires adjacency-matrix queries. Under memory constraints, we show that even for the seemingly easier task of approximating the optimal clustering cost (without producing a partition), any algorithm in the random query model must make adjacency-matrix queries. Finally, we prove the first general graph model query lower bound for correlation clustering, where algorithms are allowed adjacency-matrix, neighbor, and degree queries. The latter two bounds are not yet tight, leaving room for sharper results.
Keywords and phrases:
correlation clustering, query-space complexity, information theoryFunding:
Songhua He: The author was partially supported by Karthik C. S. through the National Science Foundation under Grant CCF-2443697.Copyright and License:
2012 ACM Subject Classification:
Theory of computation Lower bounds and information complexityAcknowledgements:
The authors would like to thank anonymous reviewers for their valuable suggestions. Songhua He would also like to thank Karthik C. S. for generously providing partial research fellowship.Editor:
Shubhangi SarafSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
How much information must an algorithm gather from a massive graph to solve a fundamental task like clustering, and how does this cost increase when the algorithm is constrained by limited memory? This work investigates these questions for the problem of Correlation Clustering (CC) – a cornerstone of machine learning and network analysis. We establish the precise information-theoretic requirements of solving CC in sublinear time (i.e., in terms of query complexity) and, importantly, initiate the study of query-space tradeoffs for graph problems.
Given an input graph for the correlation clustering problem, we interpret edges as pairs of “similar” (+) items and non-edges as pairs of “dissimilar” () items. The goal is to partition the vertices into clusters to minimize the total number of “disagreements”: the number of similar pairs that are cut apart plus the number of dissimilar pairs placed in the same cluster. This problem is closely related to other classic cut problems, such as Max-Cut and Minimum Bisection, and our first two hardness results also apply directly to these.
Correlation clustering has been extensively studied in the literature in various models and settings. The problem is known to be APX-hard [16]. After many years of effort, an exciting line of works has improved the approximation factor significantly from to [6, 16, 1, 17, 22, 21, 23, 15, 14, 13]. For sublinear-time algorithms, the best-known approximation ratio is improved to the same as the best ratio of polynomial-time algorithms [20, 5, 23, 13]. Efficient algorithms and lower bounds for this problem have also been studied in the Massively Parallel Computation (MPC) setting [7, 15, 23], dynamic setting [9], streaming model [10, 20, 5, 4, 8, 35, 12], and the query model [11] with unbounded time. In this work, we study feasibility of approximate correlation clustering (with additive error) in various query models and under memory constraints.
Our work makes three primary contributions towards understanding the query complexity of correlation clustering. Throughout, an additive error of means the algorithm’s solution cost may be at most worse than the true optimum. We remark that our main conceptual and technical contribution is the memory-query tradeoff for approximating the optimal clustering cost (in the random query model). However, to put things in proper context, we will present this as our second result in the introduction.
A tight query bound in the standard model
First, we consider the classic setting. An algorithm with unlimited memory that can adaptively query any pair of vertices to see if an edge exists. This is the adjacency-matrix query model. We show the following, asymptotically tight, query lower bound for finding a nearly optimal clustering.
Theorem 1 (Informal)111We do not include its formal statement and full proof here due to space constraints; a full version with complete details will be posted online.
Every randomized algorithm that finds a clustering within an additive error of must make adaptive adjacency-matrix queries.
This result improves a previous lower bound given in [11] and perfectly matches the known upper bound in the same work [11], settling the query complexity of CC in the adjacency-matrix query model. Next, we show that for the seemingly easier problem of approximating the optimal clustering cost, the query complexity increases dramatically under memory constraints, albeit in the random query model.
A query-space tradeoff for approximating the optimal clustering cost
What happens when memory is scarce? To understand the additional query cost of limited memory algorithms, we turn to a model where the algorithm receives a stream of uniformly random vertex pairs and indicators of whether these pairs connect or not in the underlying graph. This is the random-query model. For computing boolean functions, time-space tradeoffs in the random query model have been introduced and studied in [36, 24]. In our second result, the task is to estimate the value of the optimal clustering, a seemingly easier goal than finding the partition.
Theorem 2 (Informal – restated as Theorem 1)
In the random-query model, an algorithm using only bits of memory to estimate the optimal CC cost, within additive error of , needs queries where
To the best of our knowledge, this is the first query-space tradeoff for any approximation problem as well as a graph problem in the random query model. It demonstrates that memory is not free: with sub-polynomial space (when ), the query complexity blows past the baseline, hitting a barrier of for even modest approximation goals. Our lower bound for the case completes the picture and shows that the query complexity remains high even when the memory is large. Memory and queries are provably and strictly intertwined.222Here is a word of caution on interpretation. Relative to Theorem 1, in Theorem 2 we change three parameters of the setting. (i) We impose a memory cap. (ii) We switch to the random–query model [36]. (iii) We study the value (cost) problem rather than the search problem. Our tradeoff should be read and contrasted with the other results in light of all three changes. The proof of the above theorem faces various technical challenges, which we elaborate on in Section 1.1.
Compared to Theorem 1, Theorem 2 weakens the setting in two ways: random queries instead of arbitrary queries, but turning to the cost variant of the problem. We now explain the reason for doing so. First, if instead of the random-query model we allowed arbitrary queries, then the same lower bound would hold for Turing machines using space , for a constant . Finding a natural problem for which such a bound holds is conceptually difficult. Thus, restricting how we access the input circumvents this difficulty and makes the problem amenable to analysis. Second, we should explain why we use the cost/value formulation. There are two reasons. First, clarity. Second, which is crucial here, under space, the search formulation is vacuous. Merely writing down a near-optimal partition already needs bits (even for two clusters). Intuitively, any lower bound would then be dominated by output size rather than information. Focusing on value estimation removes this artifact and isolates the genuine memory–query tradeoff. This seems to be the “correct” problem variant to study here.
A lower bound in the general graph model
Finally, we consider a stronger query model, the general graph model. In this model, an algorithm can make pair, degree, and neighbor queries. This model was introduced and studied in [33, 2]. This model is significantly stronger, as neighbor queries can reveal global graph structure and create complex statistical dependencies that are challenging to analyze. Specifically, it is known from [33] that testing bipartiteness in the general graph model is strictly stronger than algorithms in the adjacency list or (exclusively) adjacency matrix models alone. However, proving lower bounds in the general model is conceptually difficult.
Theorem 3 (Informal)
Even in the general graph model, any algorithm that finds a clustering with an additive error of requires queries.
While this bound is not as tight as Theorem 1, it establishes for the first time that CC remains information-theoretically hard even when algorithms are equipped with the strongest local query arsenal. In our proof, we deal with the challenges of this model by analyzing query interactions on carefully constructed regular graphs (graphs where symmetry in some sense turns neighbor queries useless).333The formal statement and full proof are omitted for brevity; a full version including details will be available separately.
Our results set forth the first query–space tradeoffs for graph problems. Beyond settling the adjacency–matrix complexity and showing hardness in the general model, Theorem 2 in some sense isolates the “price of memory”. Its proof provides a Boolean–Fourier and information-theoretic toolkit for random–query lower bounds that could potentially be of independent interest. We offer Correlation Clustering as the canonical testbed and this paper as the beginning of a broader research direction on query–space tradeoffs across graph primitives.
1.1 Our techniques
Due to space constraints, we only sketch the proof ideas of our main result here–the memory-query tradeoffs for correlation clustering (Theorem 1); the full version will include the proof techniques for the other two results.
To illustrate the proof techniques, we fix and consider -memory algorithms that approximate the clustering cost within an additive error of . Given a graph on vertices, in the random query model, at each time step , the algorithm receives a uniformly random vertex pair along with the indicator bit specifying whether is an edge of . Using a VC-dimension argument [11], it follows that random queries suffice to produce a partition whose cost is within an additive error of from optimal. Although representing such a partition requires bits of memory, it is natural to expect that just approximating the optimal clustering cost might be possible with significantly less memory444Indeed, low-space streaming algorithms are known for certain approximation parameters [10, 4].. Before our work, it was not known whether even -memory algorithms could achieve an additive -approximation using only random queries555While [4] established -space hardness for approximating the clustering cost within an additive error of in the worst-case streaming model, proving hardness in the random query model introduces new technical challenges, even relative to the random-order streaming setting.. We show that any -memory algorithm achieving this guarantee must in fact use random queries.
The random-query model was introduced in [36] to study memory-query tradeoffs for computing -bit Boolean functions, when the algorithm receives a random input bit at each time step. This and the subsequent work of [24] established non-trivial tradeoffs for functions with high sensitivity and total influence, respectively. However, for promise or approximation problems, no small set of edges has a strong influence, and therefore, these prior techniques do not apply to our setting. Instead, our result builds on the approach of [31], which established the hardness of approximating MAX-CUT (and the clustering cost) within a multiplicative factor of over sparse graphs, in the one-pass streaming model. Specifically, the paper leveraged tight lower bounds for a two-player one-way communication problem – called the Distributional Boolean Hidden Partition (D-BHP) problem – to show that any (random-order) streaming algorithm that distinguishes between a bipartite graph and a random graph must use bits of memory.
There are two main challenges in extending this approach to prove an random-query lower bound for additive approximation of the optimal clustering cost (under memory constraints). First, [31] establishes lower bounds only for sparse graphs with edges; for such sparse graphs, an additive approximation of to the clustering cost is trivial. Second, in the random-query model, after queries, edge repetitions occur with high probability, unlike in the random-order one-pass streaming model, where each edge is guaranteed to appear exactly once. The ability to go over the same edge twice significantly increases the technical difficulties in proving hardness results666Only recently, the breakthrough work of [25] showed that any constant-pass streaming algorithm for distinguishing between a sparse bipartite graph and a random graph requires bits of space. Since multi-pass streaming algorithms read the stream in the same order, this model is incomparable to the random query model..
To overcome the first challenge, we introduce and study a noisy variant of D-BHP, which we call the Perturbed Distributional Boolean Hidden Partition (PD-BHP) problem. This formulation allows us to work with dense graphs that satisfy the bipartite partition “approximately”. Starting with the Boolean Hidden Matching (BHM) problem introduced by [26], BHM, D-BHP, and their variants have been extensively used to prove streaming and sketching lower bounds (see, e.g., [37, 30, 31, 34, 28, 32, 29, 27, 18, 3]). The tight bounds we establish for PD-BHP may therefore be of independent interest. Similar to [31], we use a hybrid argument to lift the two-player one-way communication lower bound to a multi-player one-way communication bound, where each player receives an independent graph obeying the hidden partition. In the random-query model, however, there is a single underlying graph dictating the inputs to all players; in particular, repeated queries to the same edge must be consistent with previous observations. This brings in the second challenge. To overcome it, we show that any algorithm capable of distinguishing between the case where all edge queries are consistent and the case where each queried edge is independently resampled must use either bits of memory or random queries. We refer to this as the same vector problem, and we establish tight memory–query tradeoffs for it.
Tight communication complexity for PD-BHP problem
Consider the following two-player communication problem. Alice receives a uniformly random -bit vector . Bob receives a graph on vertices with a uniformly random set of edges, and a vector . In the NO case, each is an independent uniformly random bit. In the YES case, for the th edge , we have
In contrast, in the D-BHP problem studied by [31], the YES case is noiseless, that is . The goal is for Bob to distinguish between the two cases with as little communication from Alice as possible. We show that when Bob receives edges (for some ), and Alice sends at most bits (for some ), Bob’s distinguishing advantage is at most . This bound is tight: if Alice simply sends the first coordinates of to Bob, then with high probability, the graph will contain at least edges whose endpoints lie entirely within these coordinates, enabling Bob to distinguish the two cases with constant advantage. To lift the two-player communication bound to the multi-player setting, we parametrize when , and when . Our proof builds on the Fourier-analytic approach of [31] for the D-BHP problem. A key challenge in extending their proof is that, since Bob now receives edges, the graph necessarily contains many cycles (of length at least ). In contrast, the argument of [31] relies crucially on the fact that when Bob receives edges, the graph is cycle-free. By carefully balancing the number of cycles against the distinguishing advantage they provide, we are able to extend the Fourier-analytic proof to handle the noisy case on denser graphs.
Same vector problem
Let . Consider the following distinguishing problem, where the goal is to decide whether the random queries are “consistent”. At each time step , the algorithm receives a random index and a bit . In the NO case, is uniformly random, whereas in the YES case, , where is fixed at the start. We show that any algorithm that solves the distinguishing problem, in time-steps, must use at least bits of memory.777Recall that by the Birthday Paradox, the probability of sampling the same index twice is negligible when . Our result covers the entire non-trivial range of where collisions are possible but not guaranteed. This bound is tight (up to log factors): by storing the first queries, an algorithm will, with high probability, encounter a repeated index within time steps, and thus distinguish with constant advantage. To obtain the tight result, we use an information-theoretic potential function that measures the maximum progress any -memory algorithm can make towards distinguishing at a time-step. For our main theorem, we generalize the same vector problem to non-uniform bits.
2 Preliminaries
We always denote as the set . For every binary string , we use to denote its Hamming weight. For every two binary strings of the same length, we use to denote their Hamming distance.
Basics of information theory
Given a random variable , we use to denote the Shannon entropy of , i.e., . For a scalar , denotes the entropy of the Bernoulli random variable with probability to be . We use to denote the mutual information between and conditioned on the random variable . , where . Next, we describe some of the properties of mutual information used in the paper.
-
1.
(Chain Rule) .
-
2.
For discrete random variables, , .
-
3.
If , then .
-
4.
If , then .
Graph notations and problem definitions
Throughout this paper, we use to denote the input graph. We denote as the number of vertices. We use to denote the set of unordered pairs of vertices. Given two disjoint subsets of a graph , we use to denote the induced bipartite subgraph where and are the two parts. Given a vertex and a number , we use to denote the -th neighbor of vertex .
We study the following three problems in this work in both versions where the output is an approximately optimal cost or a partition that yields an approximately optimal cost: the correlation clustering problem, the max cut problem, and the minimum bisection problem. The formal definition and statements to the latter two problems are deferred to the full version of the paper.
Correlation clustering
Given an undirected and unweighted graph . A clustering of the graph is a partition of its vertex set where is unfixed. A clustering also defines an equivalence between vertices: if and belong to the same cluster, and otherwise. The cost of a clustering is defined as the total number of missing edges inside each cluster plus the total number of edges crossing different clusters. Formally, we define it as
Fix a parameter , the correlation clustering cost problem asks for an approximate value such that given as the input. The correlation clustering partition problem asks for a clustering such that .
A popular and equivalent definition of correlation clustering is based on a signed graph, where each edge is associated with a positive () or a negative () sign. The goal is to find a clustering that minimizes the number of disagreements, which corresponds to our cost function . The problem can also be framed as maximizing the number of agreements. In the additive error setting, these two goals are equivalent, as the sum of agreements and disagreements is the fixed total number of pairs of vertices, . In our paper, we model positive edges as existing edges and negative edges as non-edges, and we focus on the case of a complete signed input graph. Our lower bounds for this complete graph model also imply a lower bound for general graph cases.
3 Memory-query tradeoffs in the random query model
We study the memory-query tradeoffs of the correlation clustering cost problem in the random query model. This model was introduced in [36] and subsequently studied in [24]. Our main result is the first memory-query tradeoff lower bound for graph problems in the random query model. Different from the previous works, we adapt the proof structure from [31], which is based on Fourier analysis.
The random query model
In the random query model, an algorithm can make random queries to the input graph . Each query returns an independent and identically distributed (i.i.d.) uniformly random pair of vertices from and an indicator of whether or not . We denote the return of the -th query as , where .
Theorem 1.
Let be an undirected simple graph with vertices. Let be any randomized algorithm that, in the random query model, approximates the correlation clustering cost of to within an additive error of with probability at least . For this algorithm, if the worst-case query complexity is and the space used is at most bits, then the following lower bound holds:
for parameters and .
We note that our lower bound, which holds for deterministic algorithms, also applies to randomized algorithms by a standard application of Yao’s minimax principle. In addition, the same lower bound applies to the max cut and minimum bisection problems. The three problems share the same hard distribution. We defer the lower bound statement and its proof to the latter two problems to the full version of this paper, and focus on the correlation clustering problem in this section.
Our proof extends the streaming lower bound for max cut in [31]. They showed an space lower bound for approximating max cut with -multiplicative error for any constant . We adapt their proof to establish lower bounds for a wide range of additive errors.
The core of our proof strategy is a reduction to two auxiliary problems. The first is a noisy variant of the distributional Boolean Hidden Partition (D-BHP) problem introduced in [26] and extended in [37, 31], and the second is what we call the same vector problem. The desired query-space lower bound follows from establishing lower bounds for these two problems. We will define these auxiliary problems formally in the following subsections.
3.1 Input distribution
We use an input distribution for the correlation clustering that differs from the one in [31]. While they studied the indistinguishability between sparse random graphs, we focus on the indistinguishability between two types of dense graphs: a random dense Erdős-Rényi graph and a random dense Erdős-Rényi graph with edges perturbed in expectation.
Let denote the distribution of Erdős-Rényi graphs with parameters and . In a random graph from this distribution, each of the possible edges is included with an independent probability .
We define two input graph distributions, and .
-
The YES distribution (): A random graph from is sampled as follows. We first uniformly generate a partition . For every pair of vertices , we independently include the edge with probability , where . For a fixed partition , we denote this random graph distribution by .
-
The NO distribution (): This is simply the standard Erdős-Rényi graph distribution .
We will show later that the correlation clustering costs of graphs from these two distributions are -far from each other with high probability. Therefore, the lower bound reduces to the indistinguishability of graphs from the two distributions and .
We will also use the graph distribution to define the perturbed distributional boolean hidden partition (PD-BHP) problem: we uniformly sample edges over all pairs of vertices, allowing for repetition; the graph is obtained by taking the union of the sampled edges, which is a graph of edges.
Distribution of Queries
We describe the distributions to the random queries the algorithm makes and their query answers. We define the distributions over query streams. Let a query stream be a sequence of triples , where each pair is sampled uniformly from .
-
: The distribution of a stream of random queries where the graph is sampled from , and each is set to .
-
: The distribution of a stream of random queries where the graph is sampled from .
Our lower bound applies to the mixed distribution .
The indistinguishability of the two distributions is established on the indistinguishability between the two distributions and intermediate distributions. Specifically, fix a number where we assume divides . For every , we let to denote the following distribution of query streams: (i) uniformly sample a partition from the possible partitions of the graph; (ii) let , and let independently; (iii) for each and every , we let . Intuitively, we divide the input stream into phases, where the samples of each phase is from an independently sampled graph from the YES or NO distribution. Our proof idea is as follows
| (1): Indistinguishable by lower bounding the same-vector problem |
| (2): Indistinguishable by lower bounding the PD-BHP problem |
Given the distributions, we show that the correlation clustering cost of graphs in and are -far apart from each other with high probability.
Lemma 2.
Let . Then with probability a random graph drawn from has correlation clustering cost at most ; and a random graph drawn from has correlation clustering cost at least .
Proof.
We first lower bound the optimal clustering cost of a graph from . Observe that for every clustering, every pair of vertices contribute cost with independent probability . Let denote the cost of an arbitrary clustering. By the standard Chernoff bound,
Since there are at most clusterings, by union bound, a random graph drawn from has correlation clustering cost at least with probability.
For graphs drawn from . By our construction to , each pair of vertices contribute 1 cost to with independent probability . Let denote . Then . By Chernoff bound,
3.2 The boolean hidden partition problem
We analyze the 2-party one-way communication complexity of the perturbed distributional Boolean hidden partition (PD-BHP) problem, a variant of the distributional Boolean hidden partition problem introduced in [31]. In this variant, the input vector is noisy.
Perturbed Distributional Boolean Hidden Partition Problem (PD-BHP)
In this problem, Alice receives a vector , and Bob receives a graph and a vector , where . Let be the edge-vertex incidence matrix of . The problem is to distinguish between two cases based on the relationship between , , and :
-
1.
YES case: , where is a random noise vector with each entry independently set to 1 with probability .
-
2.
NO case: The vector is uniformly random in , independent of . This is equivalent to , where is a uniformly random vector.
Alice sends a message to Bob, who must distinguish between the two cases.
We analyze this problem under a specific distribution where Alice’s input is uniformly random in , and Bob’s graph is sampled from , a graph distribution defined by taking a union of edges sampled over all pairs of vertices, allowing for repetition. The final answer is YES or NO with probability each. As a result, the number of edges in the resulting simple graph is exactly the number of distinct edges from the samples, and there do not exist two identical rows in the edge-vertex incidence matrix .
Lemma 3.
Fix three parameters , and . Consider an instance of the PD-BHP problem where Alice receives a uniformly random string , and Bob receives a graph together with the corresponding vector . Any deterministic protocol for the PD-BHP problem that uses at most bits of communication can achieve an advantage of at most over random guessing.
This lower bound is tight when both and are constants. In this scenario, Alice can send the first bits of her input to Bob. In expectation, Bob can then learn entries of the vector , and thus entries of the noise vector . Since the noise vector in the YES case and the NO case correspond to Bernoulli distributions that are -far apart, independent samples are sufficient to distinguish them with constant advantage.
This subsection focuses on proving the above lemma. Our proof follows the same framework as [26, 37, 31]. Compared to previous works, our problem involves a key trade-off: while the noise vector makes distinguishing the distributions harder, the denser graphs make it easier. We establish a trade-off between the graph’s density, the noise probability, and the communication complexity.
Specifically, Alice’s message induces a partition of , where is the number of bits sent by Alice. Let the protocol’s advantage over random guessing be at least , where . Notice that at least a fraction of all input strings are contained within sets in the partition that have a size of at least . We can therefore focus our analysis on one such “typical” set, which we denote by . Define a parameter , the size of is .
The central idea is to show that if Alice’s input is drawn uniformly from such a typical set , the resulting vector is statistically close to uniform. If this holds, Bob is unable to distinguish the YES case (where ) from the NO case (where is uniformly random).
Our main technical contribution is an extension of Fourier analysis techniques from prior work [26, 37, 31]. While previous work dealt with sparse, cycle-free graphs [31], our analysis is adapted to handle the presence of cycles.
We use the following normalization of the Fourier transform:
We use the following bounds on the Fourier mass of contributed by coefficients of various weight:
Lemma 4 (Lemma 6 in [26]).
Let of size at least and its indicator function. Then for every
Recall that is the graph Bob receives as input. The number of edges in is where . denotes the incidence matrix of where iff is an endpoint of . We are interested in the distribution of where is uniformly randomly selected from and is a random noise vector where each entry is with probability . For , let
Note that is a function of the message . We will prove that is close to uniform by bounding the sum of squares of its Fourier coefficients for non-zero weight vectors. Specifically, let denote the uniform distribution over -bit binary strings, then
By expanding the Fourier coefficients of , we have
where the last equation is derived from the fact that the summation is exactly the expansion of the polynomial
Therefore, we can rewrite the TV distance as
| (1) | ||||
where the first inequality is by Cauchy-Schwartz, the second is Parseval’s equality. We will show that is well-bounded for every . Combined with Lemma 4, we get our desired bound.
If we regard as an indicator of a subgraph of , where each edge is selected if , the summation represents the total weight of different vector such that the parity of degrees of vertices in the subgraph induced by is exactly the vector .
Lemma 5.
Let be the edge incidence matrix of a graph drawn from . For any vector with Hamming weight , the following bound on the expected value holds for parameters and :
The above expectation bound holds because the weight is small enough, and because is upper bounded by .
Proof.
We will first bound the expectation for every fixed . Notice that the graph distribution given the number of edges fixed is uniform over the set of all -edge simple graphs.
Now fix an arbitrary . We bound the expectation of the summation by counting the number of matrices such that for every fixed vector . Assume to be an even number as otherwise the summation is always . Denote by the graph induced by and . Then the set of all edges in can be decomposed into walks, each walk either starts and finishes at the same (arbitrary) vertex, or starts and finishes at vertices where . There are exactly walks of the second type, which forms the vector .
We first bound the probability that random edges form a walk. Either the walk has determined start/end points, or it is a closed walk. For closed walks, this probability that selected (and distinct) edges from of a fixed order form a closed walk is at most
The factor of denotes the number of start vertices of the walk. At each step of the walk, with probability at most the next sampled edge shares an endpoint with the previous edge. The last step of the walk must go back to the start point, hence with probability one samples this edge.
Analogously, for walks with fixed start and end vertices, the probability is at most
Therefore, for every fixed , we have
where the last inequality is due to the fact that . The factor of accounts for ordering the edges arbitrarily ( ways) and independently partitioning it to multiple walks ( ways).
Notice that for , one needs to select at least distinct edges to form since . The number of edges cannot be because two edges do not form a cycle in a simple graph. For , one needs at least edges. Note that the desired expectation is a weighted sum of the expectations for every fixed . We have
The sum is dominated by small because it is a decreasing geometric sum when .
With Lemma 5, we are able to bound the statistical distance between and the uniform distribution.
Lemma 6.
Fix three parameters , and . Let and . Let of size at least , and let be the indicator of . Let be a random graph sampled from . Then
Proof.
Recall that by inequality (1),
We break this summation into two parts: the part and the part . For the first part, by Lemma 4 and Lemma 5
where and .
For the part , we have
where we used the fact that
Together with the part , we have
We will need the following lemma, which is the Lemma 5.6 from [31].
Lemma 7 (Lemma 5.6 of [31]).
Let be random variables taking values on finite sample space . For any let , denote the conditional distribution of given the event . Then
We now prove Lemma 3.
Proof of Lemma 3.
Let be Alice’s message function and let be Bob’s function, where is the edge incidence matrix of the input graph, is input vector of Bob, and is Alice’s message. Since we are analyzing the protocol under a fixed input distribution, we can assume and are deterministic.
Let denote the distribution of under YES instances, and the same under NO instances. We aim to show that
which implies that the protocol’s distinguishing advantage is at most on PD-BHP.
The function partitions into sets , where is the number of bits communicated. Since there are such sets, at least a fraction of is contained in those of size at least , where we define . We call any message with a typical message. Then the probability that is not typical on a uniformly random input is at most .
Let us write , where is the conditional distribution of given and under the YES distribution, and write , where is the uniform distribution over . For each and , let and denote the respective conditional distributions of .
Applying Lemma 7, we get:
Now fix any typical message . Then, by definition, where . Lemma 6 then implies
Applying Jensen’s inequality yields
Combining this with the earlier bound, we conclude that
as claimed.
3.3 Same vector problem
We prove a query-space lower bound for the following problem in the random query model.
Same vector problem
Let , and . Let divides . Let be binary vectors drawn from independently, i.e., they are length- vectors with each bit being with probability . An algorithm for this problem has phases. At each phase , the algorithm makes random queries. The return of each random query is a pair where independently. For , (i) in the YES case, ; (ii) in the NO case, at phase . The algorithm must distinguish between the two cases above.
When , a trivial upper bound is to memorize queries from the first queries, and check if there is any collision with the latter samples. We show that the upper bound is tight using information theory. Formally, we obtain the following lower bound.
Lemma 8.
Fix and . Fix to be a function of . For every algorithm that solves the same vector problem correctly with probability , the algorithm must use bits of memory. Specifically, this is true when the algorithm outputs correctly w.p. on an input distribution where the input is in the YES case and the NO case uniformly.
We will need to use Fano’s inequality and Shearer’s lemma in the proof.
Proposition 9 (Fano’s inequality).
Let be a Markov chain, and let . Let denote the binary entropy function. Let be the support of . Then
Proposition 10 (Shearer’s lemma [19]).
Let be random variables, and let be subsets such that each belongs toat least sets. Then
Proof to Lemma 8.
We will prove the lower bound to any algorithm over the distribution where the input is in the YES case and the NO case uniformly.
Let denote the length of the memory. For every , let denote the memory after receiving the -th random query. Specifically, denotes the initial memory.
Without loss of generality assume that is the output of the algorithm. Let be an indicator such that at the YES case, and at the NO case. Let be the Markov chain. By Fano’s inequality and , we get
| (2) |
We will decompose the mutual information into the sum of increases of mutual information in each phase. Formally, we denote by the list of random variables in phase , and we use the same notation for other variables.
| (3) | ||||
by the data processing inequality.
We will upper bound each by . For every and , let . We can rewrite
where the inequality follows from the fact that conditioning reduces entropy.
And we can rewrite the above as the summation of :
| (4) | ||||
We are able to replace the conditional entropy of by regardless of whether the answer is YES or NO. When (the YES case), by our definition. When (the NO case), is independent from previous samples, and follows the same (conditional) distribution as .
Let denote a random subset to be the range of a random vector from . We have
(5)
Let . We have . Then, by Shearer’s Lemma (Proposition 10),
And,
Substituting in Equation (5), we get
and we get a space lower bound .
3.4 Proof to the main theorem
In this section, we prove the main theorem, which is a query-space lower bound to approximating the max cut in the random query model. We repeat the theorem here.
Theorem 1. [Restated, see original statement.]
Let be an undirected simple graph with vertices. Let be any randomized algorithm that, in the random query model, approximates the correlation clustering cost of to within an additive error of with probability at least . For this algorithm, if the worst-case query complexity is and the space used is at most bits, then the following lower bound holds:
for parameters and .
The lower bound is obtained through a reduction from both the same-vector problem and the PD-BHP problem to the max cut problem. Our reduction follows a similar structure to [31] despite adding a reduction to the same vector problem. This reduction is necessary in our case. We start from introducing the reduction used in [31]. Then we outline how we adapt this reduction to obtain our result. After presenting a full map, we will show the formal reduction.
In [31], the main result is a space lower bound for -approximating the max cut given a stream of edges of the input graph. To that end, they show that every algorithm with limited space cannot distinguish a bipartite graph from an Erdös-Rényi graph with the same expected number of edges. The lower bound is also achieved by a reduction to the D-BHP problem, which is similar to the PD-BHP problem we study in this work but (i) the noise vector is always ; (ii) the number of edges given to Bob is where is the number of phases. Besides, their total number of edges over all phases in the stream is only .
Their reduction has two steps. First, they show that a stream of independently sampled random graphs of edges is close in total variation to a stream of the edges of a random graph of edge. This holds both when the graph is sampled from an Erdös-Rényi distribution or when the graph is a random bipartite graph given a fixed random partition. It is true because the total number of edges is and it is unlikely an edge will be sampled twice in two different phases. Given the first step, we only need to show that algorithms cannot distinguish between a stream of phases of random sparse bipartite graphs given and a stream of phases of random sparse Erdos-Renyi graphs.
The next step is the standard hybrid argument. For every , let denote the input distribution where the first phases are random sparse bipartite graphs given a fixed random partition, and the last phases are sparse Erdös-Rényi graphs. By the hybrid argument, if there is an algorithm that can distinguish the two cases with constant advantage, there also exists an index such that the algorithm can distinguish the two distributions and with advantage. Such an algorithm can be used to device an algorithm for the D-BHP problem: Alice, given a partition as input, samples sparse graphs of edges and runs on it locally. Then, Alice sends the content of the space of to Bob. Bob sets the graph of the -th phase to be its input graph, and the last phases to be Erdös-Rényi graphs. Bob runs on the given space and the last phases, and outputs the output of . Finally, by the lower bound to D-BHP, the streaming algorithm also requires space.
In our case, however, it is impossible to show that a stream of random samples of pairs of vertices to the graph is statistically indistinguishable from a stream of phases of independent random graphs of size. This is because each phase has pairs of vertices in our case. Therefore, almost certainly many pairs of vertices will be sampled twice and cause a conflict (i.e., a pair of vertices is connected in one phase, but disconnected in another). To that end, we use our lower bound established above for the same-vector problem to show that every algorithm in the random query model with bounded time and space cannot distinguish a stream of consistent random queries of a fixed random graph from a stream of random queries that consists of phases of independent random graphs. Given that, we still use the hybrid argument to reduce the streaming lower bound to the communication lower bound for the PD-BHP problem studied in Section 3.2. We refer readers to Figure 1 for an illustration to our reductions.
Proof to Theorem 1.
Suppose for the sake of contradiction that there is a randomized algorithm in the random query model that makes queries to the input graph, uses bits of memory, and approximates the correlation clustering cost to within additive error with probability at least . Here we let
where is a small enough constant. By Yao’s minimax principle, there also exists a deterministic algorithm with the same query complexity and space complexity such that approximates the correlation clustering cost given a random input from with probability. Let be the algorithm that outputs if and only if the output of is less than . By Lemma 2, is a distinguisher for input distributions and with high advantage. Formally, we have
| (6) |
and
| (7) |
where denotes the input stream.
Let . Without loss of generality, we assume divides . We obtain that
Otherwise, by (6), solves the same-vector problem correctly with probability on the distribution , which contradicts Lemma 8 since . Analogously, by (7) and Lemma 8, we have
By the standard hybrid argument, there exists a number such that distinguishes between and . Formally, if we let for every , we have
Given such an algorithm and index , we can obtain an algorithm for the PD-BHP problem with advantage in the communication model.
Formally, the algorithm works as follows: when Alice receives a partition as input, she independently samples a sequence of graphs drawn from locally. Alice then runs the algorithm on random queries to these graphs. After that, she sends the content of the space of to Bob. Bob receives the space of , the input graph and the input vector . He samples a sequence of triples on the marginal distribution that a random graph from is exactly , where each is the the corresponding bit in . After this, Bob samples a sequence of graphs from and correspondingly random queries. Bob then simulates on these random queries with its initial content of the space set to the message of Alice. Finally, Bob outputs the output of .
To see its correctness, notice that the input distribution of Bob is exactly the distribution of subgraphs induced by the -th phase of the random queries. On the YES case, Alice and Bob together simulates on the input distribution , whereas on the NO case, they simulates on . Hence the above algorithm distinguishes YES cases from NO cases with advantage . However, when , by applying Lemma 3 and setting , the advantage should be , which is smaller than when is small enough. When , by setting , the advantage should be . Here is a constant that is decreasing when tends to . Therefore, we get a contradiction.
References
- [1] Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: ranking and clustering. Journal of the ACM (JACM), 55(5):1–27, 2008. doi:10.1145/1411509.1411513.
- [2] Noga Alon, Tali Kaufman, Michael Krivelevich, and Dana Ron. Testing triangle-freeness in general graphs. SIAM Journal on Discrete Mathematics, 22(2):786–819, 2008. doi:10.1137/07067917X.
- [3] Sepehr Assadi, Vaggos Chatziafratis, Jakub Lacki, Vahab Mirrokni, and Chen Wang. Hierarchical clustering in graph streams: Single-pass algorithms and space lower bounds. In Po-Ling Loh and Maxim Raginsky, editors, Conference on Learning Theory, 2-5 July 2022, London, UK, volume 178 of Proceedings of Machine Learning Research, pages 4643–4702. PMLR, PMLR, 2022. URL: https://proceedings.mlr.press/v178/assadi22a.html.
- [4] Sepehr Assadi, Vihan Shah, and Chen Wang. Streaming algorithms and lower bounds for estimating correlation clustering cost. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, volume 36, pages 75201–75213, 2023. URL: http://papers.nips.cc/paper_files/paper/2023/hash/ee1a1ecc92f35702b5c29dad3dc909ea-Abstract-Conference.html.
- [5] Sepehr Assadi and Chen Wang. Sublinear time and space algorithms for correlation clustering via sparse-dense decompositions. In Mark Braverman, editor, 13th Innovations in Theoretical Computer Science Conference, ITCS 2022, Berkeley, CA, USA, January 31 - February 3, 2022, volume 215 of LIPIcs, pages 10:1–10:20. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPIcs.ITCS.2022.10.
- [6] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Machine Learning, 56:89–113, 2004. doi:10.1023/B:MACH.0000033116.57574.95.
- [7] Soheil Behnezhad, Moses Charikar, Weiyun Ma, and Li-Yang Tan. Almost 3-approximate correlation clustering in constant rounds. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 720–731. IEEE, 2022. doi:10.1109/FOCS54457.2022.00074.
- [8] Soheil Behnezhad, Moses Charikar, Weiyun Ma, and Li-Yang Tan. Single-pass streaming algorithms for correlation clustering. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 819–849. SIAM, 2023. doi:10.1137/1.9781611977554.ch33.
- [9] Soheil Behnezhad, Mahsa Derakhshan, MohammadTaghi Hajiaghayi, Cliff Stein, and Madhu Sudan. Fully dynamic maximal independent set with polylogarithmic update time. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 382–405. IEEE, 2019. doi:10.1109/FOCS.2019.00032.
- [10] Aditya Bhaskara, Samira Daruki, and Suresh Venkatasubramanian. Sublinear algorithms for MAXCUT and correlation clustering. In Ioannis Chatzigiannakis, Christos Kaklamanis, Dániel Marx, and Donald Sannella, editors, 45th International Colloquium on Automata, Languages, and Programming, ICALP 2018, Prague, Czech Republic, July 9-13, 2018, volume 107 of LIPIcs, pages 16:1–16:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2018. doi:10.4230/LIPIcs.ICALP.2018.16.
- [11] Marco Bressan, Nicolò Cesa-Bianchi, Andrea Paudice, and Fabio Vitale. Correlation clustering with adaptive similarity queries. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, volume 32, pages 12510–12519, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/b0ba5c44aaf65f6ca34cf116e6d82ebf-Abstract.html.
- [12] Mélanie Cambus, Fabian Kuhn, Etna Lindy, Shreyas Pai, and Jara Uitto. A -approximate correlation clustering algorithm in dynamic streams. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2861–2880. SIAM, 2024. doi:10.1137/1.9781611977912.101.
- [13] Nairen Cao, Vincent Cohen-Addad, Euiwoong Lee, Shi Li, David Rasmussen Lolck, Alantha Newman, Mikkel Thorup, Lukas Vogl, Shuyi Yan, and Hanwen Zhang. Solving the correlation cluster lp in sublinear time. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing (STOC), pages 1154–1165, 2025. doi:10.1145/3717823.3718181.
- [14] Nairen Cao, Vincent Cohen-Addad, Euiwoong Lee, Shi Li, Alantha Newman, and Lukas Vogl. Understanding the cluster linear program for correlation clustering. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing (STOC), pages 1605–1616, 2024. doi:10.1145/3618260.3649749.
- [15] Nairen Cao, Shang-En Huang, and Hsin-Hao Su. Breaking 3-factor approximation for correlation clustering in polylogarithmic rounds. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 4124–4154. SIAM, 2024. doi:10.1137/1.9781611977912.143.
- [16] Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with qualitative information. Journal of Computer and System Sciences, 71(3):360–383, 2005. doi:10.1016/j.jcss.2004.10.012.
- [17] Shuchi Chawla, Konstantin Makarychev, Tselil Schramm, and Grigory Yaroslavtsev. Near optimal lp rounding algorithm for correlationclustering on complete and complete k-partite graphs. In Proceedings of the 47th annual ACM Symposium on Theory of Computing (STOC), pages 219–228, 2015. doi:10.1145/2746539.2746604.
- [18] Chi-Ning Chou, Alexander Golovnev, and Santhoshini Velusamy. Optimal streaming approximations for all boolean max-2csps and max-ksat. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 330–341. IEEE, 2020. doi:10.1109/FOCS46700.2020.00039.
- [19] Fan RK Chung, Ronald L Graham, Peter Frankl, and James B Shearer. Some intersection theorems for ordered sets and graphs. Journal of Combinatorial Theory, Series A, 43(1):23–37, 1986. doi:10.1016/0097-3165(86)90019-1.
- [20] Vincent Cohen-Addad, Silvio Lattanzi, Slobodan Mitrović, Ashkan Norouzi-Fard, Nikos Parotsidis, and Jakub Tarnawski. Correlation clustering in constant many parallel rounds. In International Conference on Machine Learning (ICML), pages 2069–2078. PMLR, 2021. URL: http://proceedings.mlr.press/v139/cohen-addad21b.html.
- [21] Vincent Cohen-Addad, Euiwoong Lee, Shi Li, and Alantha Newman. Handling correlated rounding error via preclustering: A 1.73-approximation for correlation clustering. In IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS), pages 1082–1104. IEEE, 2023. doi:10.1109/FOCS57990.2023.00065.
- [22] Vincent Cohen-Addad, Euiwoong Lee, and Alantha Newman. Correlation clustering with sherali-adams. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 651–661. IEEE, 2022. doi:10.1109/FOCS54457.2022.00068.
- [23] Vincent Cohen-Addad, David Rasmussen Lolck, Marcin Pilipczuk, Mikkel Thorup, Shuyi Yan, and Hanwen Zhang. Combinatorial correlation clustering. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing (STOC), pages 1617–1628, 2024. doi:10.1145/3618260.3649712.
- [24] Itai Dinur. Time-space lower bounds for bounded-error computation in the random-query model. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2900–2915. SIAM, 2024. doi:10.1137/1.9781611977912.103.
- [25] Yumou Fei, Dor Minzer, and Shuo Wang. Multi-pass streaming lower bounds for approximating max-cut. 66th IEEE Symposium on Foundations of Computer Science (FOCS) 2025, 2025.
- [26] Dmitry Gavinsky, Julia Kempe, Iordanis Kerenidis, Ran Raz, and Ronald De Wolf. Exponential separations for one-way quantum communication complexity, with applications to cryptography. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing (STOC), pages 516–525, 2007. doi:10.1145/1250790.1250866.
- [27] Venkatesan Guruswami and Runzhou Tao. Streaming hardness of unique games. In Dimitris Achlioptas and László A. Végh, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2019, Massachusetts Institute of Technology, Cambridge, MA, USA, September 20-22, 2019, volume 145 of LIPIcs, pages 5:1–5:12. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPIcs.APPROX-RANDOM.2019.5.
- [28] Venkatesan Guruswami, Ameya Velingker, and Santhoshini Velusamy. Streaming complexity of approximating max 2csp and max acyclic subgraph. In Klaus Jansen, José D. P. Rolim, David Williamson, and Santosh S. Vempala, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2017, Berkeley, CA, USA, August 16-18, 2017, volume 81 of LIPIcs, pages 8:1–8:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPIcs.APPROX-RANDOM.2017.8.
- [29] John Kallaugher, Michael Kapralov, and Eric Price. The sketching complexity of graph and hypergraph counting. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 556–567. IEEE, 2018. doi:10.1109/FOCS.2018.00059.
- [30] Michael Kapralov, Sanjeev Khanna, and Madhu Sudan. Approximating matching size from random streams. In Proceedings of the 25th annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 734–751. SIAM, 2014. doi:10.1137/1.9781611973402.55.
- [31] Michael Kapralov, Sanjeev Khanna, and Madhu Sudan. Streaming lower bounds for approximating MAX-CUT. In Piotr Indyk, editor, Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, pages 1263–1282. SIAM, SIAM, 2015. doi:10.1137/1.9781611973730.84.
- [32] Michael Kapralov, Sanjeev Khanna, Madhu Sudan, and Ameya Velingker. (1+ (1))-approximation to max-cut requires linear space. In Philip N. Klein, editor, Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, January 16-19, pages 1703–1722. SIAM, SIAM, 2017. doi:10.1137/1.9781611974782.112.
- [33] Tali Kaufman, Michael Krivelevich, and Dana Ron. Tight bounds for testing bipartiteness in general graphs. SIAM Journal on Computing, 33(6):1441–1483, 2004. doi:10.1137/S0097539703436424.
- [34] Dmitry Kogan and Robert Krauthgamer. Sketching cuts in graphs and hypergraphs. In 6th Conference on Innovations in Theoretical Computer Science (ITCS), pages 367–376, 2015. doi:10.1145/2688073.2688093.
- [35] Konstantin Makarychev and Sayak Chakrabarty. Single-pass pivot algorithm for correlation clustering. keep it simple! In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, volume 36, pages 6412–6421, 2023. URL: http://papers.nips.cc/paper_files/paper/2023/hash/149ad6e32c08b73a3ecc3d11977fcc47-Abstract-Conference.html.
- [36] Ran Raz and Wei Zhan. The random-query model and the memory-bounded coupon collector. In Thomas Vidick, editor, 11th Innovations in Theoretical Computer Science Conference, ITCS 2020, Seattle, Washington, USA, January 12-14, 2020, volume 151 of LIPIcs, pages 20:1–20:11. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.ITCS.2020.20.
- [37] Elad Verbin and Wei Yu. The streaming complexity of cycle counting, sorting by reversals, and other problems. In Proceedings of the 22nd annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 11–25. SIAM, 2011. doi:10.1137/1.9781611973082.2.
