A Lower Bound for Sampling Disjoint Sets

Suppose Alice and Bob each start with private randomness and no other input, and they wish to engage in a protocol in which Alice ends up with a set x ⊆ [n] and Bob ends up with a set y ⊆ [n], such that (x, y) is uniformly distributed over all pairs of disjoint sets. We prove that for some constant β < 1, this requires Ω(n) communication even to get within statistical distance 1 − β^n of the target distribution. Previously, Ambainis, Schulman, Ta-Shma, Vazirani, and Wigderson (FOCS 1998) proved that Ω(√n) communication is required to get within some constant statistical distance ε > 0 of the uniform distribution over all pairs of disjoint sets of size √n.


INTRODUCTION
In most traditional computational problems, the goal is to take an input and produce the "correct" output, or one of a set of acceptable outputs. In a sampling problem, however, the goal is to generate a random sample from a specified probability distribution D, or at least from a distribution that is close to D. There has been a surge of interest in studying sampling problems from a complexity theory perspective [1, 7, 13, 15, 32, 36, 49, 61, 75–82]. Unlike more traditional computational problems, sampling problems need not have any real input, besides the uniformly random bits fed into a sampling algorithm.
One commonly studied type of target distribution is "input-output pairs" of a function f, i.e., (D, f(D)), where D is perhaps the uniform distribution over inputs to f. This means an outcome should be (x, z) where x is distributed according to D, and z = f(x). Using an algorithm for computing f, one can sample (D, f(D)) by first sampling from D and then evaluating f on that input. However, for some functions f, generating an input jointly with the corresponding output may be computationally easier than evaluating f on an adversarially chosen input. Thus, in general, sampling lower bounds tend to be more challenging to prove than lower bounds for functions. Many of the above-cited works focus on concrete computational models such as low-depth circuits. We consider the model of two-party communication complexity, for which comparatively less is known about sampling. Which problem should we study? Well, the single most important function in communication complexity is Set-Disjointness, in which Alice gets a set x ⊆ [n], Bob gets a set y ⊆ [n], and the goal is to determine whether x ∩ y = ∅. Identifying the sets with their characteristic bit strings, this can be viewed as Disj : {0,1}^n × {0,1}^n → {0,1}, where Disj(x, y) = 1 iff x ∧ y = 0^n. The applications of communication bounds for Set-Disjointness are far too numerous to list, but they span areas such as streaming, circuit complexity, proof complexity, data structures, property testing, combinatorial optimization, fine-grained complexity, cryptography, and game theory. Because of its central role, Set-Disjointness has become the de facto testbed for proving new types of communication bounds.
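Concretely, identifying sets with bit masks, Disj is a one-liner. The following is a small illustrative sketch (the helper name is ours, not the paper's):

```python
# Disj(x, y) = 1 iff the sets x and y, encoded as characteristic bit strings
# packed into Python integers, are disjoint (their bitwise AND is all-zeros).

def disj(x: int, y: int) -> int:
    """Return 1 iff x and y, viewed as subsets of [n], share no element."""
    return 1 if (x & y) == 0 else 0

# {1, 3} vs {2, 4} are disjoint; {1, 3} vs {3} are not.
assert disj(0b0101, 0b1010) == 1
assert disj(0b0101, 0b0100) == 0
```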
This function has been studied in the contexts of randomized [9,10,17,51,65] and quantum [2,25,44,66,69,73] protocols; multi-party number-in-hand [6,10,18,22,27,42,50] and number-on-forehead [11,12,28,41,59,63,64,69,71,72,74] models; Merlin-Arthur and related models [3,4,29,35,38,39,52,67]; with a bounded number of rounds of interaction [19,23,48,54,83]; with bounds on the sizes of the sets [26,31,43,45,58,62,68]; very precise relationships between communication and error probability [20,21,30,33,39]; when the goal is to find the intersection [8,24,34,82]; in space-bounded, online, and streaming models [5,16,55]; and direct product theorems [12,14,47,53,56,70–72]. We contribute one more result to this thorough assault on Set-Disjointness.
Here is the definition of our two-party sampling model: Let D be a probability distribution over {0,1}^n × {0,1}^n; we also think of D as a matrix with rows and columns both indexed by {0,1}^n, where D_{x,y} is the probability of outcome (x, y). We define Samp(D) as the minimum communication cost of any protocol where Alice and Bob each start with private randomness and no other input, and at the end Alice outputs some x ∈ {0,1}^n and Bob outputs some y ∈ {0,1}^n such that (x, y) is distributed according to D. Note that Samp(D) = 0 iff D is a product distribution (x and y are independent), and Samp(D) ≤ n for all D (since Alice can privately sample (x, y) and send y to Bob). Allowing public randomness would not make sense, since Alice and Bob could read a properly distributed (x, y) off of the randomness without communicating. We define Samp_ε(D) as the minimum of Samp(D′) over all distributions D′ with Δ(D, D′) ≤ ε, where Δ denotes statistical (total variation) distance, defined as Δ(D, D′) ≔ (1/2)·Σ_{x,y} |D_{x,y} − D′_{x,y}|.
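To make the definitions concrete, here is a small brute-force sketch (our own illustrative code) of the target distribution, the statistical distance Δ, and the trivial n-bit protocol in which Alice samples the whole pair and sends Bob his half:

```python
import itertools
import random

def tv_distance(p: dict, q: dict) -> float:
    """Statistical (total variation) distance: half the l1 distance between
    two distributions given as {outcome: probability} dictionaries."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def uniform_disjoint(n: int) -> dict:
    """The target distribution U: uniform over pairs of disjoint subsets of [n],
    built by brute force (there are 3^n disjoint pairs)."""
    pairs = [(x, y)
             for x in itertools.product((0, 1), repeat=n)
             for y in itertools.product((0, 1), repeat=n)
             if all(a & b == 0 for a, b in zip(x, y))]
    return {pair: 1.0 / len(pairs) for pair in pairs}

def trivial_protocol(n: int, rng: random.Random):
    """The n-bit upper bound: Alice privately samples (x, y) ~ U and sends y
    to Bob; per coordinate, x_i y_i is uniform over {00, 01, 10}."""
    coords = [rng.choice([(0, 0), (0, 1), (1, 0)]) for _ in range(n)]
    x = tuple(a for a, _ in coords)
    y = tuple(b for _, b in coords)  # this is the n-bit message Alice sends
    return x, y
```

For example, `uniform_disjoint(2)` has 3² = 9 equally likely outcomes, and every output of `trivial_protocol` is a disjoint pair.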

A Story
Our story begins with Reference [7], which proved that Samp_ε(D, Disj(D)) ≥ Ω(√n) for some constant ε > 0, where D is uniform over the set of all pairs of sets of size √n (note that this D is a product distribution and is approximately balanced between 0-inputs and 1-inputs of Disj); here it does not matter which party is responsible for outputting the bit Disj(D). The main tool in the proof was a lemma that was originally employed in Reference [9] to prove an Ω(√n) bound on the randomized communication complexity of computing Disj. The latter bound was improved to Ω(n) via several different proofs [10, 51, 65], which leads to a natural question: Can we improve the sampling bound of Reference [7] to Ω(n) by using the techniques of References [10, 51, 65] instead of Reference [9]?
For starters, the answer is "no" for the particular D considered in Reference [7]: there is a trivial exact protocol with O(√n log n) communication, since it only takes that many bits to specify a set of size √n. What about other interesting distributions D? The following observation illuminates the situation.

Observation 1. For any distribution D and constants 0 < δ < ε < 1, we have Samp_ε(D, Disj(D)) ≤ Samp_δ(D) + O(√n).

Proof. First, note that for any sampling protocol, if we condition on a particular transcript, then the output distribution becomes product (Alice and Bob are independent after they stop communicating). Second, Reference [17] proved that for every product distribution and every constant γ > 0, there exists a deterministic protocol that uses O(√n) bits of communication and computes Disj with error probability ≤ γ on a random input from the distribution. Now to ε-sample (D, Disj(D)), Alice and Bob can δ-sample D to obtain (x, y), and then, conditioned on that sampler's transcript, they can run the average-case protocol from Reference [17] for the corresponding product distribution with error parameter ε − δ. A simple calculation shows this indeed gives statistical distance ≤ ε.
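The "simple calculation" can be spelled out as follows (our own sketch of the hybrid argument; here D′ denotes the δ-close distribution that the sampler actually produces, (x, y) ∼ D′, and b is the bit output by the average-case protocol run with error parameter ε − δ):

```latex
\begin{align*}
\Delta\bigl((x,y,b),\, (D,\mathrm{Disj}(D))\bigr)
  &\le \Delta\bigl((x,y,b),\, (x,y,\mathrm{Disj}(x,y))\bigr)
     + \Delta\bigl((D',\mathrm{Disj}(D')),\, (D,\mathrm{Disj}(D))\bigr) \\
  &\le \Pr[\,b \ne \mathrm{Disj}(x,y)\,] + \Delta(D',D)
   \;\le\; (\varepsilon-\delta) + \delta \;=\; \varepsilon,
\end{align*}
```

where the last inequality on the second summand uses that appending the deterministic value Disj(·) to a sample cannot increase statistical distance.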
The upshot is that any improved bound for sampling (D, Disj(D)) would have to come entirely from the hardness of just sampling D. Thus, such a result would not really be "about" the Set-Disjointness function; it would be about the distribution on inputs. Instead of abandoning this line of inquiry, we realize that if D itself is somehow defined in terms of Disj, then a bound for sampling D would still be saying something about the complexity of Set-Disjointness. In fact, the proof in Reference [7] actually shows something stronger than the previously stated result: If D is instead defined as the uniform distribution over pairs of disjoint sets of size √n (which are 1-inputs of Disj), then Samp_ε(D) ≥ Ω(√n). After this pivot, we are now facing a direction in which we can hope for an improvement. We prove that by removing the restriction on the sizes of the sets, the sampling problem becomes maximally hard:

Theorem 1. Let U be the uniform distribution over all pairs (x, y) of disjoint sets x, y ⊆ [n]. There is a constant β < 1 such that Samp_ε(U) ≥ Ω(n) even for ε = 1 − β^n.

Our result holds for error ε < 1 that is exponentially close to 1, but the result is already new and interesting for constant ε > 0.
The proof from Reference [7] was a relatively short application of the technique from Reference [9], but for Theorem 1, harnessing known techniques for proving linear communication lower bounds turns out to be more involved.
For calibration, the uniform distribution over all (x, y) achieves statistical distance 1 − 0.75^n from U, since there are 4^n inputs and 3^n disjoint inputs (for a disjoint input, each coordinate i ∈ [n] has 3 possibilities x_i y_i ∈ {00, 01, 10}). We can do a little better: Suppose for each coordinate independently, Alice picks 0 with probability √(1/3) and picks 1 with probability 1 − √(1/3), and Bob does the same. This again involves no communication, and it achieves statistical distance 1 − α^n from U, where α = 2/√3 − 1/3 ≈ 0.8213 (in each coordinate, the probability of avoiding x_i y_i = 11 is 1 − (1 − √(1/3))² = α, and the resulting mass on each disjoint pair is at most 3^{−n}). Theorem 1 shows that the constant 0.82 cannot be improved arbitrarily close to 1 without a lot of communication. (In the setting of lower bounds for circuit samplers, significant effort has gone into handling statistical distances exponentially close to the maximum possible [13, 32, 79].)
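Both no-communication benchmarks can be checked exactly by brute force for small n. This is an illustrative sketch; the function name and the parameter p0 (the per-coordinate probability of picking 0) are ours:

```python
import itertools
import math

def tv_to_uniform_disjoint(n: int, p0: float) -> float:
    """Exact statistical distance between (a) the product distribution in which
    each party independently sets each coordinate to 0 with probability p0, and
    (b) the uniform distribution over pairs of disjoint subsets of [n]."""
    u = 1.0 / 3 ** n  # uniform mass on each of the 3^n disjoint pairs
    total = 0.0
    for x in itertools.product((0, 1), repeat=n):
        px = math.prod(p0 if a == 0 else 1 - p0 for a in x)
        for y in itertools.product((0, 1), repeat=n):
            py = math.prod(p0 if b == 0 else 1 - p0 for b in y)
            target = u if all(a & b == 0 for a, b in zip(x, y)) else 0.0
            total += max(0.0, target - px * py)  # one-sided formula for TV distance
    return total

n = 6
# p0 = 1/2 gives the uniform distribution over all pairs: distance 1 - 0.75^n.
assert abs(tv_to_uniform_disjoint(n, 0.5) - (1 - 0.75 ** n)) < 1e-9
# p0 = sqrt(1/3) gives distance 1 - alpha^n with alpha = 2/sqrt(3) - 1/3.
alpha = 2 / math.sqrt(3) - 1 / 3
assert abs(tv_to_uniform_disjoint(n, math.sqrt(1 / 3)) - (1 - alpha ** n)) < 1e-9
```

The one-sided sum equals the statistical distance because, for both choices of p0, the product mass on every disjoint pair is at most the uniform mass 3^{−n}.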

Interpreting the Result
As an important step in the proof of Theorem 1, we first observe that our sampling model is equivalent to two other models. One of these we call (for lack of a better word) "synthesizing" the distribution D: Alice and Bob get inputs x, y ∈ {0,1}^n, respectively, in addition to their private randomness, and their goal is to accept with probability exactly D_{x,y}. We let Synth(D) denote the minimum communication cost of any synthesizing protocol for D, and Synth_ε(D) denote the minimum of Synth(D′) over all D′ with Δ(D, D′) ≤ ε. The other model is the nonnegative rank of a matrix: rank₊(D) is defined as the minimum k for which D (viewed as a 2^n × 2^n matrix) can be written as a sum of k many nonnegative rank-1 matrices.
Observation 2. For every distribution D, the following are all within ±O(1) of each other: Samp(D), Synth(D), and log rank₊(D).

Proof. Synth(D) ≤ Samp(D) + 2, since a synthesizing protocol can just run a sampling protocol and accept iff the result equals the given input (x, y). (Only this part of Observation 2 is needed in the proof of Theorem 1.)

log rank₊(D) ≤ Synth(D), since for each transcript of a synthesizing protocol, the matrix that records the probability of getting that transcript on each particular input has rank 1 (since Alice's private randomness being consistent with the transcript, and Bob's private randomness being consistent with the transcript, are independent events); summing these matrices over all accepting transcripts yields a nonnegative rank decomposition of D.
To see that Samp(D) ≤ log rank₊(D) + O(1), write D = Σ_{i=1}^k p_i · u^(i) ⊗ v^(i), where k ≔ rank₊(D), p is a probability vector, and each u^(i) and v^(i) is a probability distribution over {0,1}^n (a nonnegative rank decomposition can always be normalized into this form). To sample from D, Alice can privately sample i ∼ p and send it to Bob using log k bits; then Alice can sample x ∼ u^(i) and Bob can independently sample y ∼ v^(i) with no further communication.
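The direction Samp(D) ≤ log rank₊(D) + O(1) translates directly into a two-stage sampler. The following is a minimal sketch, assuming the decomposition is already given in the normalized form above (names are ours):

```python
import random

def sample_from_decomposition(p, U, V, rng: random.Random):
    """Given D = sum_i p[i] * outer(U[i], V[i]), where p is a probability vector
    and each U[i], V[i] is a probability vector over outcomes, sample
    (x, y) ~ D using about log2(len(p)) bits of communication."""
    i = rng.choices(range(len(p)), weights=p)[0]        # Alice sends i to Bob
    x = rng.choices(range(len(U[i])), weights=U[i])[0]  # Alice, privately
    y = rng.choices(range(len(V[i])), weights=V[i])[0]  # Bob, privately
    return x, y

# A rank-1 example: p has a single term, so conditioned on i the parties are
# independent, and here the output is always (x, y) = (0, 1).
rng = random.Random(0)
assert sample_from_decomposition([1.0], [[1.0, 0.0]], [[0.0, 1.0]], rng) == (0, 1)
```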
By this characterization, Theorem 1 can be viewed as a lower bound on the approximate nonnegative rank of the Disj matrix, where the approximation is in ℓ₁ (which has an average-case flavor). In the recent literature, "approximate nonnegative rank" generally refers to approximation in ℓ∞ (which is a worst-case requirement), and this model is equivalent to the so-called smooth rectangle bound and WAPP communication complexity [37, 46, 57].
Observation 2 combined with a result of Reference [60] shows that the deterministic communication complexity of any total two-party Boolean function f is quadratically related to the communication complexity of exactly sampling the uniform distribution over f^{−1}(1).

PROOF

Overview
Our proof of Theorem 1 is by a black-box reduction to the well-known corruption lemma for Set-Disjointness due to Razborov [65]. We start with a high-level overview.
For notation: Let |z| denote the Hamming weight of a string z ∈ {0,1}^n. For ℓ ∈ ℕ, let U_ℓ be the uniform distribution over all (x, y) ∈ {0,1}^n × {0,1}^n with |x ∧ y| = ℓ. For a randomized protocol Π, let acc_Π(x, y) denote the probability that Π accepts (x, y).
Step I: Uniform Corruption. The corruption lemma states that if a rectangle R ⊆ {0,1}^n × {0,1}^n contains a noticeable fraction of disjoint pairs, then it must contain about as large a fraction of uniquely intersecting pairs. More quantitatively, there exist a constant C > 0 and two distributions D_ℓ, ℓ = 0, 1, defined over disjoint (ℓ = 0) and uniquely intersecting pairs (ℓ = 1), such that for every rectangle R, D₁(R) ≥ (1/C)·D₀(R) − 2^{−n/C}. Reference [65] defined D_ℓ as the uniform distribution over all pairs (x, y) with fixed sizes |x| = |y| = n/4 and |x ∧ y| = ℓ. For our purpose, we need the corruption lemma to hold relative to the aforementioned distributions U_ℓ, ℓ = 0, 1, which have no restrictions on set sizes. We derive in Section 2.2 a corruption lemma for U_ℓ from the original lemma for D_ℓ. To do this, we exhibit a reduction that uses public randomness and no communication to transform a sample from D_ℓ into a sample from a distribution that is close to U_ℓ in a suitable sense, for ℓ = 0, 1.
Step II: Truncate and Scale. For simplicity, let us think about proving Theorem 1 for a small error ε > 0. Assume for contradiction there is some distribution D with Δ(U, D) ≤ ε such that Synth(D) ≤ o(n), as witnessed by a private-randomness synthesizing protocol Π with acc_Π(x, y) = D_{x,y}. Note that the total acceptance probability over disjoint inputs is close to 1: Σ_{x,y : |x∧y|=0} acc_Π(x, y) ≥ 1 − ε, and thus E_{(x,y)∼U₀}[acc_Π(x, y)] ≥ (1 − ε)3^{−n}. Our eventual goal (in Step III) is to apply our corruption lemma to the transcript rectangles, but the above threshold (1 − ε)3^{−n} is too low for this. To raise the threshold to 2^{−o(n)} as needed for corruption, we would like to scale up all the acceptance probabilities accordingly. To "make room" for the scaling, we first carry out a certain truncation step. Specifically, in Section 2.3, we transform Π into a public-randomness protocol Π″: (1) First, we truncate (using a truncation lemma [37]) the values acc_Π(x, y), which has the effect of decreasing some of them, but any acc_Π(x, y) that is under 3^{−n} remains approximately the same. This results in an intermediate protocol Π′ that still satisfies E_{(x,y)∼U₀}[acc_{Π′}(x, y)] ≥ Ω((1 − ε)3^{−n}) (using the assumption that Δ(U, D) ≤ ε). (2) Second, we scale (using the low cost of Π′) the truncated probabilities up by a large factor 3^n 2^{−o(n)}. This results in a protocol Π″ with large typical acceptance probabilities:

E_{(x,y)∼U₀}[acc_{Π″}(x, y)] ≥ 2^{−o(n)}.  (1)

Step III: Iterate Corruption. Because Π″ has such large acceptance probabilities (Equation (1)), our corruption lemma can be applied: there is some constant C > 0 such that

E_{(x,y)∼U₁}[acc_{Π″}(x, y)] ≥ (1/C)·E_{(x,y)∼U₀}[acc_{Π″}(x, y)] − 2^{−n/C}.  (2)

Since Π″ is a truncated-and-scaled version of Π′, this allows us to infer that E_{(x,y)∼U₁}[acc_Π(x, y)] ≥ Ω((1 − ε)3^{−n}), and thus Σ_{x,y : |x∧y|=1} acc_Π(x, y) ≥ Ω((1 − ε)n), using the fact that |supp(U₁)| = n3^{n−1} = (n/3)·|supp(U₀)|. Thus, for ε = 1 − ω(1/n), this means Π must have placed a total probability mass > 1 on uniquely intersecting inputs, which is the sought contradiction.
To prove Theorem 1 for very large error ε = 1 − β^n, in Section 2.4, we iterate the above argument for U_ℓ over 0 ≤ ℓ ≤ o(n). Namely, analogously to Equation (2), we show that the average acceptance probability of Π″ over U_{ℓ+1} is at least a constant times the average over U_ℓ. Meanwhile, the support sizes increase as |supp(U_{ℓ+1})| ≥ ω(1) · |supp(U_ℓ)| for ℓ ≤ o(n). These facts together imply a large constant factor increase in the total probability mass that Π″ places on supp(U_{ℓ+1}) as compared to supp(U_ℓ). Starting with even a tiny probability mass over supp(U₀), this iteration will eventually lead to a contradiction.
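The support sizes used above have the closed form |supp(U_ℓ)| = C(n, ℓ)·3^{n−ℓ} (choose the ℓ coordinates with x_i y_i = 11; each remaining coordinate has 3 options), which matches |supp(U₁)| = n3^{n−1} and gives the growth ratio (n − ℓ)/(3(ℓ + 1)) = ω(1) for ℓ = o(n). A brute-force sanity check (our own sketch):

```python
import itertools
import math

def supp_size(n: int, ell: int) -> int:
    """Brute-force count of pairs (x, y) in {0,1}^n x {0,1}^n with |x AND y| = ell."""
    return sum(1
               for x in itertools.product((0, 1), repeat=n)
               for y in itertools.product((0, 1), repeat=n)
               if sum(a & b for a, b in zip(x, y)) == ell)

n = 6
for ell in range(n + 1):
    # closed form: choose the ell intersecting coordinates, 3 options elsewhere
    assert supp_size(n, ell) == math.comb(n, ell) * 3 ** (n - ell)
assert supp_size(n, 1) == n * 3 ** (n - 1)  # the value used in Step III
```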

Step I: Uniform Corruption
The goal of this step is to derive Lemma 2 from Lemma 1.

Lemma 1 (Corruption [65]). For every rectangle R ⊆ {0,1}^n × {0,1}^n, we have D₁(R) ≥ (1/45)·D₀(R) − 2^{−0.017n}.

Lemma 2 (Uniform Corruption). For every rectangle R ⊆ {0,1}^n × {0,1}^n, we have U₁(R) ≥ (1/765)·U₀(R) − 2^{−0.008n}.

Proof. Assume for convenience that n/2 has the form 4k − 1 (otherwise use the nearest such number instead of n/2 throughout). We prove that Lemma 1 for n/2 implies Lemma 2 for n by the contrapositive. Thus, D₀ and D₁ are distributions over {0,1}^{n/2} × {0,1}^{n/2} while U₀ and U₁ are distributions over {0,1}^n × {0,1}^n. Assume there exists a rectangle R ⊆ {0,1}^n × {0,1}^n such that U₁(R) < (1/765)·U₀(R) − 2^{−0.008n}. It suffices to exhibit a rectangle Q ⊆ {0,1}^{n/2} × {0,1}^{n/2} with D₁(Q) < (1/45)·D₀(Q) − 2^{−0.017·n/2}; by linearity of expectation, it is enough to find a distribution over such rectangles Q for which the inequality holds in expectation. To this end, we define a distribution F over functions f : {0,1}^{n/2} × {0,1}^{n/2} → {0,1}^n × {0,1}^n as follows:

(1) Sample sizes (v, w) according to the joint distribution of (|x|, |y|) for (x, y) ∼ U₀, conditioned on v ≥ k, w ≥ k, and v + w ≤ 2k + n/2.
(2) Sample a uniformly random permutation π of [n].
(3) Given an input (x, y), define (x̃, ỹ) ∈ {0,1}^n × {0,1}^n by setting the pair of bits x̃_i ỹ_i to: x_i y_i for the first n/2 coordinates i; 10 for the next v − k coordinates i; 01 for the next w − k coordinates i; 00 for the remaining coordinates i.
(4) Let f(x, y) ≔ (π(x̃), π(ỹ)) (i.e., permute the coordinates according to π).
For ℓ ∈ {0, 1}, let F(D_ℓ) denote the distribution obtained by sampling (x, y) ∼ D_ℓ and f ∼ F and outputting f(x, y), and note that F(D_ℓ)(R) = E_F[D_ℓ(Q_F)], where Q_F ≔ f^{−1}(R) is a rectangle in {0,1}^{n/2} × {0,1}^{n/2}. Now, we claim that F(D_ℓ) and U_ℓ are close, in the following senses:

(1) F(D₀)(E) ≥ U₀(E) − 2^{−0.01n} for every event E;
(2) F(D₁)(E) ≤ 17·U₁(E) for every event E.

Using R as the event E, these combine with the assumption on R to give E_F[D₁(Q_F)] < (1/45)·E_F[D₀(Q_F)] − 2^{−0.017·n/2}, as desired.

To see (1), note that F(D₀) is precisely U₀ conditioned on v ≥ k, w ≥ k, and v + w ≤ 2k + n/2, and this conditioning event has probability ≥ 1 − 2^{−0.01n} by Chernoff bounds. Thus, letting C be the complement of the conditioning event, we have F(D₀)(E) ≥ U₀(E) − U₀(C) ≥ U₀(E) − 2^{−0.01n}.

To see (2), consider any outcome (x, y) ∈ {0,1}^n × {0,1}^n with |x ∧ y| = 1. We have U₁_{x,y} = 1/(n3^{n−1}). Abbreviating a ≔ |x| and b ≔ |y|, assume a ≥ k, b ≥ k, and a + b ≤ 2k + n/2, since otherwise F(D₁)_{x,y} = 0, and there would be nothing to prove. Henceforth, consider the probability space with the randomness of D₁ and of F. Let I be the event that the intersecting coordinate of the sample from F(D₁) is the same as that of (x, y). We have

For the three terms on the right-hand side, we have

ACM Transactions on Computation Theory, Vol. 12, No. 3, Article 20. Publication date: July 2020.
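The padding-and-permuting map f used in this step (pad Alice's set from size k up to v, pad Bob's disjointly up to w, then apply a shared random permutation) can be sketched as follows. This is an illustration of the embedding only; the exact conditional distribution of (v, w) is abstracted away, and the function name is ours:

```python
import random

def embed(x, y, n: int, k: int, v: int, w: int, rng: random.Random):
    """Given (x, y) over n/2 coordinates with |x| = |y| = k, output
    f(x, y) = (pi(x~), pi(y~)) over n coordinates with |pi(x~)| = v and
    |pi(y~)| = w, following the padding and permutation steps of the text."""
    half = n // 2
    rest = n - half - (v - k) - (w - k)  # coordinates that get the pair 00
    xt = list(x) + [1] * (v - k) + [0] * (w - k) + [0] * rest
    yt = list(y) + [0] * (v - k) + [1] * (w - k) + [0] * rest
    pi = list(range(n))
    rng.shuffle(pi)  # public randomness: a shared uniform permutation of [n]
    fx, fy = [0] * n, [0] * n
    for i in range(n):
        fx[pi[i]] = xt[i]
        fy[pi[i]] = yt[i]
    return fx, fy

# Disjoint inputs stay disjoint, and the output sizes are exactly (v, w).
rng = random.Random(0)
fx, fy = embed((1, 0, 0, 0), (0, 1, 0, 0), n=8, k=1, v=2, w=3, rng=rng)
assert sum(fx) == 2 and sum(fy) == 3
assert all(a & b == 0 for a, b in zip(fx, fy))
```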

Step II: Truncate and Scale
The goal of this step is to construct a truncated-and-scaled protocol Π″ from any given low-cost Π that synthesizes a distribution close to U. For a nonnegative matrix M, we define its truncation M̄ to be the same matrix but where each entry > 1 is replaced with 1. We let a ± b denote the real interval [a − b, a + b].
By Observation 2, Synth(D) ≤ δn + 2, so consider a synthesizing protocol Π for D with communication cost ≤ δn + 2. Let A be the set of all accepting transcripts of Π. For each τ ∈ A, let N^τ be the nonnegative rank-1 matrix such that N^τ_{x,y} is the probability that Π generates τ on input (x, y); thus, D_{x,y} = Σ_{τ∈A} N^τ_{x,y}. Let Π^τ be the public-randomness protocol from Lemma 3 applied to M^τ ≔ 3^n N^τ and d ≔ 15δn. Let Π″ be the public-randomness protocol that picks a uniformly random τ ∈ A and then runs Π^τ. The communication cost of Π″ is ≤ c·(d + log n) ≤ 0.001n.
From this, it follows that acc_{Π″}(x, y) ∈ (1/|A|)·Σ_{τ∈A} M̄^τ_{x,y} ± 2^{−15δn} for every input (x, y). We can now formally state the large typical acceptance probability property (Equation (1) from the overview): writing U_ℓ(Π″) ≔ E_{(x,y)∼U_ℓ}[acc_{Π″}(x, y)] (and similarly for other input distributions), we have

U₀(Π″) ≥ (1/|A|)·E_{(x,y)∼U₀}[Σ_{τ∈A} M̄^τ_{x,y}] − 2^{−15δn} ≥ (1/|A|)·2^{−δn} − 2^{−15δn} ≥ (1/|A|)·2^{−δn−1},  (3)

where the last line follows because |A| ≤ 2^{δn+2}, so (1/|A|)·2^{−δn} ≥ 2^{−2δn−2}, which is at least twice 2^{−15δn}.

Step III: Iterate Corruption
Here, we derive the final contradiction: Π″ places an acceptance probability mass exceeding 1 on supp(U_{δn}). This is achieved by iterating our corruption lemma, starting with Equation (3) as the base case. For z ∈ {0,1}^n, let U_z be the uniform distribution over all (x, y) ∈ {0,1}^n × {0,1}^n with x ∧ y = z (so U_ℓ is the uniform mixture of all U_z with |z| = ℓ; in particular, U₀ = U_{0^n}), and if |z| < n, then let U′_z be the uniform mixture of U_{z′} over all z′ that can be obtained from z by flipping a single 0 to 1 (so U_{ℓ+1} is the uniform mixture of all U′_z with |z| = ℓ; in particular, U₁ = U′_{0^n}).

Claim 2. For every z ∈ {0,1}^n with |z| ≤ n/2, we have U′_z(Π″) ≥ (1/765)·U_z(Π″) − 2^{−0.003n}.

Proof. Since all relevant inputs (x, y) have x_i y_i = 11 for all i such that z_i = 1, we can ignore those coordinates and think of U′_z and U_z as U₁ and U₀, respectively, but defined on the remaining n − |z| ≥ n/2 coordinates (instead of on all n coordinates). Thus, by Lemma 2, for every outcome of the public randomness of Π″ and every accepting transcript, say corresponding to rectangle R, we have U′_z(R) ≥ (1/765)·U_z(R) − 2^{−0.008·n/2}. Summing over all the (at most 2^{0.001n} many) accepting transcripts, and then taking the expectation over the public randomness, yields the claim, since 2^{0.001n} · 2^{−0.008·n/2} ≤ 2^{−0.003n}.

Claim 3. For every ℓ = 0, . . . , δn, we have U_ℓ(Π″) ≥ (1/|A|)·2^{−δn−1−11ℓ}.

ACKNOWLEDGMENTS
We thank anonymous reviewers for helpful comments. A preliminary version of this article was published as Reference [40].