Distribution Testing Lower Bounds via Reductions from Communication Complexity

We present a new methodology for proving distribution testing lower bounds, establishing a connection between distribution testing and the simultaneous message passing (SMP) communication model. Extending the framework of Blais, Brody, and Matulef [15], we show a simple way to reduce (private-coin) SMP problems to distribution testing problems. This method allows us to prove new distribution testing lower bounds, as well as to provide simple proofs of known lower bounds. Our main result is concerned with testing identity to a specific distribution, p, given as a parameter. In a recent and influential work, Valiant and Valiant [55] showed that the sample complexity of the aforementioned problem is closely related to the ℓ2/3-quasinorm of p. We obtain alternative bounds on the complexity of this problem in terms of an arguably more intuitive measure and using simpler proofs. More specifically, we prove that the sample complexity is essentially determined by a fundamental operator in the theory of interpolation of Banach spaces, known as Peetre’s K-functional. We show that this quantity is closely related to the size of the effective support of p (loosely speaking, the number of supported elements that constitute the vast majority of the mass of p). This result, in turn, stems from an unexpected connection to functional analysis and refined concentration of measure inequalities, which arise naturally in our reduction.


INTRODUCTION
Distribution testing, as first explicitly introduced in Reference [9], is a branch of property testing [32,50] concerned with the study of sublinear algorithms for making approximate decisions regarding probability distributions over massive domains. These algorithms are granted access to independent samples from an unknown distribution and are required to test whether this distribution has a certain global property. That is, a tester for property Π of distributions over domain Ω receives a proximity parameter ε > 0 and is asked to determine whether a distribution p over Ω (denoted p ∈ Δ(Ω)) has the property Π or is ε-far (say, in ℓ1-distance) from any distribution that has Π, using a small number of independent samples from p. The sample complexity of Π is then the minimal number of samples needed to test it. Throughout the Introduction, we fix ε to be a small constant and refer to a tester with respect to proximity parameter ε as an ε-tester.
In recent years, distribution testing has been studied extensively. In a significant body of work spanning more than a decade [2,7,8,13,14,23,26,37,39,43,44,55,57], a myriad of properties has been investigated under this lens. Starting with References [8,10,33], this includes the testing of symmetric properties [47,53,54,56], of structured families [3,4,11,17,19,20,36], as well as testing under some assumption on the unknown instance [24,27,28,49]. Tight upper and lower bounds on the sample complexity have been obtained for properties such as uniformity, identity to a specified distribution, monotonicity, and many more (see references above, or References [18,48] for surveys). However, while by now numerous techniques and approaches are available to design distribution testers, our arsenal of tools for proving lower bounds on the sample complexity of distribution testing is significantly more limited. There are only a handful of standard techniques to prove lower bounds; and indeed the vast majority of the lower bounds in the literature are shown via Le Cam's two-point method (also known as the "easy direction" of Yao's minimax principle) [46,58]. In this method, one first defines two distributions Y and N over distributions that are, respectively, yes-instances (having the property) and no-instances (far from having the property). Then it remains to show that with high probability over the choice of the instance, every tester that can distinguish between p_yes ∼ Y and p_no ∼ N must use at least a certain number of samples. In view of this scarcity, there has been in recent years a trend toward trying to obtain more, or simpler-to-use, techniques [26,56]; however, this state of affairs largely remains the same.
In this work, we reveal a connection between distribution testing and the simultaneous message passing (SMP) communication model, which in turn leads to a new methodology for proving distribution testing lower bounds. Recall that in a private-coin SMP protocol, Alice and Bob are given strings x, y ∈ {0,1}^k (respectively), and each of the players is allowed to send a message to a referee (which depends on the player's input and private randomness) who is then required to decide whether f(x, y) = 1 by only looking at the players' messages and flipping coins.
Extending the framework of Blais, Brody, and Matulef [15], we show a simple way of reducing (private-coin) SMP problems to distribution testing problems. The foregoing methodology allows us to prove new distribution testing lower bounds, as well as to provide simpler proofs of known lower bounds for problems such as testing uniformity, monotonicity, and k-modality (see Section 8).
Our main result is a characterization of the sample complexity of the distribution identity testing problem in terms of a key operator in the study of interpolation spaces, which arises naturally from our reduction and for which we are able to provide an intuitive interpretation. Recall that in this problem, the goal is to determine whether a distribution q over domain Ω (denoted q ∈ Δ(Ω)) is identical to a fixed distribution p; that is, given a full description of p ∈ Δ(Ω), we ask how many independent samples from q are needed to decide whether q = p, or whether q is ε-far in ℓ1-distance from p. In a recent and influential work, Valiant and Valiant [55] showed that the sample complexity of the foregoing question is closely related to the ℓ2/3-quasinorm of p, defined as ‖p‖_{2/3} = (∑_{ω∈Ω} |p(ω)|^{2/3})^{3/2}. That is, viewing a distribution p ∈ Δ(Ω) as an |Ω|-dimensional vector of probabilities, let p^{−max}_{−ε} denote the vector obtained from p by removing its largest entry, as well as a set of smallest entries summing to ε (note that p^{−max}_{−ε} is no longer a probability distribution). Valiant and Valiant gave an ε-tester for testing identity to p with sample complexity O(‖p^{−max}_{−cε}‖_{2/3}), where c > 0 is a universal constant, and complemented this result with a lower bound of Ω(‖p^{−max}_{−ε}‖_{2/3}).3,4

In this work, using our new methodology, we show alternative and similarly tight bounds on the complexity of identity testing, in terms of a more intuitive measure (as we discuss below) and using simpler arguments. Specifically, we prove that the sample complexity is essentially determined by a fundamental quantity in the theory of interpolation of Banach spaces, known as Peetre's K-functional. Formally, for a distribution p ∈ Δ(Ω), the K-functional between the ℓ1 and ℓ2 spaces is the operator defined for t > 0 by

κ_p(t) = inf_{p = p′ + p′′} ( ‖p′‖_1 + t · ‖p′′‖_2 ).

This operator can be thought of as an interpolation norm between the ℓ1 and ℓ2 norms of the distribution p (controlled by the parameter t), naturally inducing a partition of p into two sub-distributions: p′, which consists of "heavy hitters" in ℓ1-norm, and p′′, which has a bounded ℓ2-norm. Indeed, the approach of isolating elements with large mass and testing in ℓ2-norm seems inherent to the problem of identity testing, and is the core component of both early works [8,33] and more recent ones [26,28,31]. As a further connection to the identity testing question, we provide an easily interpretable proxy for this measure κ_p, showing that the K-functional between the ℓ1 and ℓ2 norms of the distribution p is closely related to the size of the effective support of p, which is the number of supported elements that constitute the vast majority of the mass of p; that is, we say that p has ε-effective support of size T if 1 − O(ε) of the mass of p is concentrated on T elements (see Section 2.4 for details).
Having defined the K-functional, we can proceed to state the lower bound we derive for the problem.5

Theorem 1.1 (Informally Stated). Any ε-tester of identity to p ∈ Δ(Ω) must have sample complexity Ω(κ_p^{−1}(1 − 2ε)).

3 We remark that for certain p's, the asymptotic behavior of O(‖p^{−max}_{−cε}‖_{2/3}) strongly depends on the constant c, and so it cannot be omitted from the expression. We further remark that this result was referred to by Valiant and Valiant as "instance-optimal identity testing," as the resulting bounds are phrased as a function of the distribution p itself, instead of the standard parameter, which is the domain size n.

4 For the problem of identity testing to a generic distribution p, Diakonikolas et al. [25] show a sample-optimal upper bound of O((1/ε²)(√(n log(1/δ)) + log(1/δ))), where ε denotes the proximity parameter and δ the soundness error. This improves on the previous upper bound of O((1/ε²) √n log(1/δ)) and establishes the optimal dependence on δ in all parameter regimes.

5 As stated, this result is a slight strengthening of our communication complexity reduction, which yields a lower bound of Ω(κ_p^{−1}(1 − 2ε)/log n). This strengthening is described in Section 7.3. In particular, straightforward calculations show that for the uniform distribution, we obtain a tight lower bound of Ω(√n), and for the Binomial distribution, we obtain a tight lower bound of Ω(n^{1/4}).
To show the tightness of the lower bound above, we complement it with a nearly matching upper bound, also expressed in terms of the K-functional.

Theorem 1.2 (Informally Stated). There exist an absolute constant c > 0 and an ε-tester of identity to p ∈ Δ(Ω) that uses O(κ_p^{−1}(1 − cε)) samples.

We remark that for some distributions the bounds in Theorems 1.1 and 1.2 are tighter than the bounds in Reference [55], whereas for other distributions it is the other way around (see discussion in Section 6).
In the following section, we provide an overview of our new methodology as well as the proofs for the above theorems. We also further discuss the interpretability of the K-functional and show its close connection to the effective support size. We conclude this section by outlining a couple of extensions of our methodology.

Dealing with Sub-constant Values of the Proximity Parameter.
Similarly to the communication complexity methodology for proving property testing lower bounds [15], our method inherently excels in the regime of constant values of the proximity parameter ε. Therefore, in this work, we indeed focus on the constant proximity regime. However, in Section 5.1, we demonstrate how to obtain lower bounds that asymptotically increase as ε tends to zero, via an extension of our general reduction.
Extending the Methodology to Testing with Conditional Samples. Testers with sample access are by far the most commonly studied algorithms for distribution testing. However, many scenarios that arise both in theory and practice are not fully captured by this model. In a recent line of works [1,21,22,29,30], testers with access to conditional samples were considered, addressing situations in which one can control the samples that are obtained, by requesting samples conditioned on membership in subsets of the domain. In Section 9, we give an example showing that it is possible to extend our methodology to obtain lower bounds in the conditional sampling model.

Organization
We first give a technical overview in Section 2, demonstrating the new methodology and presenting our bounds on identity testing. Section 3 then provides the required preliminaries for the main technical sections. In Section 4, we formally state and analyze the SMP reduction methodology for proving distribution testing lower bounds. In Section 5, we instantiate the basic reduction, obtaining a lower bound on uniformity testing, and in Section 5.1, we show how to extend the methodology to deal with sub-constant values of the proximity parameter. (We stress that Section 5.1 is not a prerequisite for the rest of the sections and can be skipped at the reader's convenience.) In Section 6, we provide an exposition of the K-functional and generalize inequalities that we shall need in the following sections. Section 7 then contains the proofs of both lower and upper bounds on the problem of identity testing, in terms of the K-functional. In Section 8, we demonstrate how to easily obtain lower bounds for other distribution testing problems. Finally, in Section 9, we discuss extensions to our methodology; specifically, we explain how to obtain lower bounds in various metrics, and show a reduction from communication complexity to distribution testing in the conditional sampling model.

TECHNICAL OVERVIEW
In this section, we provide an overview of the proof of our main result, which consists of new lower and upper bounds on the sample complexity of testing identity to a given distribution, expressed in terms of an intuitive, easily interpretable measure. To do so, we first introduce the key component of this proof, the methodology for proving lower bounds on distribution testing problems via reductions from SMP communication complexity. We then explain how the relation to the theory of interpolation spaces and the so-called K-functional naturally arises when applying this methodology to the identity testing problem.
For the sake of simplicity, throughout the overview, we fix the domain Ω = [n] and fix the proximity parameter ε to be a small constant. We begin in Section 2.1 by describing a simple "vanilla" reduction for showing an Ω̃(√n) lower bound on the complexity of testing that a distribution is uniform. Then, in Section 2.2, we extend the foregoing approach to obtain a new lower bound on the problem of testing identity to a fixed distribution. This lower bound depends on the best rate obtainable by a special type of error-correcting codes, which we call p-weighted codes. In Section 2.3, we show how to relate the construction of such codes to concentration of measure inequalities for weighted sums of Rademacher random variables; furthermore, we discuss how the use of the K-functional, an interpolation norm between ℓ1 and ℓ2 spaces, leads to stronger concentration inequalities than the ones derived by Chernoff bounds or the central limit theorem. Finally, in Section 2.4, we establish nearly matching upper bounds for testing distribution identity in terms of this K-functional, using a proxy known as the Q-norm. We then infer that the sample complexity of testing identity to a distribution p is roughly determined by the size of the effective support of p (which is, loosely speaking, the number of supported elements that together account for the vast majority of the mass of p).

Warmup: Uniformity Testing
Consider the problem of testing whether a distribution q ∈ Δ([n]) is the uniform distribution; that is, how many independent samples from q are needed to decide whether q is the uniform distribution over [n], or whether q is ε-far in ℓ1-distance from it. We reduce the SMP communication complexity problem of equality to the distribution testing problem of uniformity testing.
Recall that in a private-coin SMP protocol for equality, Alice and Bob are given strings x, y ∈ {0,1}^k (respectively), and each of the players is allowed to send a message to a referee (which depends on the player's input and private randomness) who is then required to decide whether x = y by only looking at the players' messages and flipping coins.
The reduction is as follows. Assume there exists a uniformity tester with sample complexity s. Each of the players encodes its input string via a balanced asymptotically good code C (that is, C : {0,1}^k → {0,1}^n is an error-correcting code with constant rate and relative distance δ = Ω(1), which satisfies the property that each codeword of C contains the same number of 0's and 1's). Denote by A ⊂ [n] the locations in which C(x) takes the value 1 (i.e., A = {i ∈ [n] : C(x)_i = 1}), and denote by B ⊂ [n] the locations in which C(y) takes the value 0 (i.e., B = {i ∈ [n] : C(y)_i = 0}). Alice and Bob each send O(s) uniformly distributed samples from A and B, respectively. Finally, the referee invokes the uniformity tester with respect to the distribution q = (U_A + U_B)/2, emulating each draw from q by tossing a random coin and deciding accordingly whether to use a sample by Alice or Bob. See Figure 1.
The idea is that if x = y, then C(x) = C(y), and so A and B form a partition of the set [n]. Furthermore, since |C(x)| = |C(y)| = n/2, this is an equipartition. Now, since Alice and Bob send uniform samples from an equipartition of [n], the distribution q that the referee emulates is, in fact, the uniform distribution over [n], and so the uniformity tester will accept. However, if x ≠ y, then C(x) and C(y) disagree on a constant fraction of the domain. Thus, A and B intersect on a (δ/2)-fraction of the domain, and likewise fail to cover a (δ/2)-fraction of it. Therefore, q is uniform on a (1 − δ)-fraction of the domain, unsupported on a (δ/2)-fraction of the domain, and has "double" weight 2/n on the remaining (δ/2)-fraction. In particular, since δ = Ω(1), the emulated distribution q is Ω(1)-far (in ℓ1-distance) from uniform, and it will be rejected by the uniformity tester.

Fig. 1. The reduction from equality in the SMP model to uniformity testing of distributions. In (A), we see that the uniform distribution is obtained when x = y, whereas in (B), we see that when x ≠ y, we obtain a distribution that is "far" from uniform.
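To make the case analysis above concrete, the following Python sketch constructs the emulated distribution q = (U_A + U_B)/2 exactly (rational arithmetic stands in for the referee's sample-based emulation). The "codewords" here are hypothetical hand-picked balanced strings, not outputs of an actual asymptotically good code:

```python
from fractions import Fraction

def emulated_pmf(cx, cy):
    """pmf of q = (U_A + U_B)/2, where A = {i : cx_i = 1} and B = {i : cy_i = 0}."""
    n = len(cx)
    A = [i for i in range(n) if cx[i] == 1]
    B = [i for i in range(n) if cy[i] == 0]
    q = [Fraction(0)] * n
    for i in A:
        q[i] += Fraction(1, 2 * len(A))  # Alice's half of the mixture
    for i in B:
        q[i] += Fraction(1, 2 * len(B))  # Bob's half of the mixture
    return q

def l1_to_uniform(q):
    """l1-distance of the pmf q from the uniform distribution over its domain."""
    n = len(q)
    return float(sum(abs(qi - Fraction(1, n)) for qi in q))

n = 16
cx = [1] * 8 + [0] * 8                 # balanced: 8 ones, 8 zeros
cy_equal = list(cx)
cy_far = [0] * 4 + [1] * 8 + [0] * 4   # balanced, disagrees with cx on 8 of 16 positions

print(l1_to_uniform(emulated_pmf(cx, cy_equal)))  # 0.0: q is exactly uniform
print(l1_to_uniform(emulated_pmf(cx, cy_far)))    # 0.5: equals dist(cx, cy_far)
```

When the strings agree, A and B form an equipartition and q is exactly uniform; when they disagree on half the positions, the distance of q from uniform equals the relative Hamming distance, matching the δ-dependence in the analysis.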
As each sample sent by either Alice or Bob was encoded with O(log n) bits, the above constitutes an SMP protocol for equality with communication complexity O(s log(n)). Yet it is well known [42] that the players must communicate Ω(√k) bits to solve this problem (see Section 4), and so we deduce that s = Ω(√k/log n); since the code has constant rate, k = Ω(n), and hence s = Ω(√n/log n).

Testing Identity to a Fixed Distribution

Consider any fixed p ∈ Δ([n]). As a first idea, it is tempting to reduce equality in the SMP model to testing identity to p by following the uniformity reduction described in Section 2.1, only instead of having Alice and Bob send uniform samples from A and B, respectively, we have them send samples from p conditioned on membership in A and B, respectively. That is, as before, Alice and Bob encode their inputs x and y via a balanced, asymptotically good code C to obtain the sets A = {i ∈ [n] : C(x)_i = 1} and B = {i ∈ [n] : C(y)_i = 0}, which partition [n] if x = y, and intersect on Ω(n) elements (as well as fail to cover Ω(n) elements of [n]) if x ≠ y. Only now, Alice sends samples independently drawn from p|_A, i.e., p conditioned on the samples belonging to A, and Bob sends samples independently drawn from p|_B, i.e., p conditioned on the samples belonging to B; and the referee emulates the distribution q = (p|_A + p|_B)/2.
However, two problems arise in the foregoing approach. The first is that while indeed when x = y the reduction induces an equipartition A, B of the domain, the resulting weights p(A) and p(B) in the mixture may still be dramatically different, in which case the referee will need many more samples from one of the parties to emulate p. The second is a bit more subtle and has to do with the fact that the properties of this partitioning are with respect to the size of the symmetric difference AΔB, whereas we are really concerned with its mass under the emulated distribution q (and although the two are proportional in the case of the uniform distribution, for a general p, we have no such guarantee). Namely, when x ≠ y, the domain elements that are responsible for the distance from p (that is, the elements that are covered by both parties (A ∩ B) and by neither of the parties ([n] \ (A ∪ B))) may only have a small mass according to p, and thus the emulated distribution q will not be sufficiently far from p. A natural attempt to address these two problems would be to preprocess p by discarding its light elements, focusing only on the part of the domain where p puts enough mass pointwise; yet this approach can also be shown to fail, as in this case the reduction may still not generate enough distance.7

Instead, we take a different route. The key idea is to consider a new type of codes, which we call p-weighted codes, which will allow us to circumvent the second obstacle. These are codes whose distance guarantee is weighted according to the distribution p; that is, instead of requiring that every two codewords c, c′ in a code C satisfy dist(c, c′) ≥ δ, we require that they satisfy

dist_p(c, c′) := ∑_{i=1}^n p(i) · 1{c_i ≠ c′_i} ≥ δ.

Furthermore, to handle the first issue, we adapt the "balance" property accordingly, requiring that each codeword be balanced according to p; that is, every c ∈ C_p satisfies ∑_{i=1}^n p(i) · c_i = 1/2.
It is straightforward to see that if we invoke the above reduction while letting the parties encode their inputs via a balanced p-weighted code C_p, then both of the aforementioned problems are resolved; that is, by the p-balance property the weights p(A) and p(B) are equal, and by the p-distance of C_p, we obtain that for x ≠ y the distribution q = (p|_A + p|_B)/2 is Ω(1)-far from p. Hence, we obtain a lower bound of Ω(√k/log(n)) on the sample complexity of testing identity to p. To complete the argument, it remains to construct such codes, and to determine the best rate k/n that can be obtained by p-weighted codes.

7 In more detail, this approach would consider the distribution p′ obtained by iteratively removing the lightest elements of p until a total of ε probability mass was removed. This way, every element i in the support of p′ is guaranteed to have mass p′_i ≥ ε/n: this implies that the weights p′(A) and p′(B) are proportional, and that each element that is either covered by both parties or not covered at all contributes ε/n to the distance from p′. However, the total distance of q from p′ would only be Ω(|supp(p′)| · ε/n); and this only suffices if p′ and p have comparable support size, i.e., if |supp(p′)| = Ω(n).
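Both defining properties of a balanced p-weighted code are mechanical to check. A minimal sketch, on a hypothetical toy distribution and a pair of hand-picked words (not produced by an actual code construction):

```python
def dist_p(p, c1, c2):
    """p-weighted distance: total p-mass of the positions where c1 and c2 differ."""
    return sum(p[i] for i in range(len(p)) if c1[i] != c2[i])

def is_p_balanced(p, c, tol=1e-9):
    """A codeword is p-balanced if the p-mass of its 1-positions equals 1/2."""
    return abs(sum(p[i] for i in range(len(p)) if c[i] == 1) - 0.5) < tol

# Hypothetical skewed distribution and two p-balanced words.
p = [0.4, 0.1, 0.1, 0.1, 0.1, 0.1, 0.05, 0.05]
c1 = [1, 0, 0, 0, 1, 0, 0, 0]   # mass on ones: 0.4 + 0.1 = 0.5
c2 = [0, 1, 1, 1, 0, 1, 1, 1]   # mass on ones: 4*0.1 + 2*0.05 = 0.5

print(is_p_balanced(p, c1), is_p_balanced(p, c2))  # True True
print(dist_p(p, c1, c2))  # ~1.0: the words differ everywhere, and p sums to 1
```

Note that under the uniform distribution, dist_p coincides with the usual relative Hamming distance and p-balance coincides with the ordinary balance property, so this is a strict generalization.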

Detour: p-weighted Codes, Peetre's K-functional, and Beating the CLT
The discussion in the previous section left us with the task of constructing high-rate p-weighted codes. Note that unlike standard (uniformly weighted) codes, for which we can easily obtain constant rate, there exist some p's for which high rate is impossible (for example, if p ∈ Δ([n]) is only supported on one element, we can only obtain rate 1/n). In particular, by the sphere-packing bound, every p-weighted code C : {0,1}^k → {0,1}^n with distance δ must satisfy

2^k · Vol_{F_2^n, dist_p}(δ/2) ≤ 2^n,

where Vol_{F_2^n, dist_p}(r) is the volume of the p-ball of radius r in the n-dimensional hypercube, given by

Vol_{F_2^n, dist_p}(r) = |{z ∈ {0,1}^n : ∑_{i=1}^n p(i) · z_i ≤ r}|.

Hence, we must have k ≤ n − log Vol_{F_2^n, dist_p}(δ/2). In Section 7.1, we show that there exist (roughly) balanced p-weighted codes with nearly optimal rate,8 and so it remains to determine the volume of the p-ball of radius ε in the n-dimensional hypercube, where recall that ε is the proximity parameter of the test. To this end, it will be convenient to represent this quantity as a tail bound on a weighted sum of Rademacher random variables X_1, ..., X_n (uniform over {−1, 1}): substituting z_i = (1 − X_i)/2 and using ∑_i p(i) = 1, we have

Vol_{F_2^n, dist_p}(ε) = 2^n · Pr[ ∑_{i=1}^n p(i) · X_i ≥ 1 − 2ε ].    (1)

Applying standard tail bounds derived from the central limit theorem (CLT), we have

Pr[ ∑_{i=1}^n p(i) · X_i ≥ 1 − 2ε ] ≤ exp(−(1 − 2ε)²/(2‖p‖_2²)),    (2)

and so we can obtain a p-weighted code C_p : {0,1}^k → {0,1}^n with k = Ω((1 − 2ε)²/‖p‖_2²), which in turn, by the reduction described in Section 2.2, implies a lower bound of Ω(1/(‖p‖_2 · log(n))) on the complexity of testing identity to p.
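The gap between this ℓ2-based bound and the ℓ2/3-based bound of [55] can already be seen numerically. A small sketch, on a skewed distribution with a few heavy elements and many light ones (the parameters n and α below are illustrative choices):

```python
import math

def inv_l2(p):
    """1/||p||_2, the quantity governing the CLT-based lower bound."""
    return 1.0 / math.sqrt(sum(x * x for x in p))

def quasinorm_23(p):
    """The l_{2/3}-quasinorm: (sum of |p_i|^{2/3})^{3/2}."""
    return sum(x ** (2.0 / 3.0) for x in p) ** 1.5

n, alpha = 2 ** 16, 0.5
heavy = int(n ** alpha) // 2                            # n^alpha / 2 elements of mass 1/n^alpha
p = [1.0 / n ** alpha] * heavy + [1.0 / n] * (n // 2)   # plus n/2 elements of mass 1/n

print(inv_l2(p))                       # Theta(n^{alpha/2}): set by the few heavy elements
print(quasinorm_23(p))                 # Theta(sqrt(n)):     set by the many light elements
print(n ** (alpha / 2), math.sqrt(n))  # the two benchmarks, 16.0 and 256.0
```

For this family, the CLT-based bound is polynomially weaker than the ℓ2/3-quasinorm, which is exactly the deficiency the K-functional repairs below.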
Unfortunately, the above lower bound is not as strong as hoped, and in particular, far weaker than the ‖p^{−max}_{−ε}‖_{2/3} bound of [55].9 Indeed, it turns out that the CLT-based bound in Equation (2) is only tight for distributions satisfying ‖p‖_∞ = O(‖p‖_2²) and is, in general, too crude for our purposes. Instead, we look for stronger concentration of measure inequalities that "beat" the CLT. To this end, we shall use powerful tools from the theory of interpolation spaces. Specifically, we consider Peetre's K-functional between the ℓ1 and ℓ2 spaces. Loosely speaking, this is the operator defined for t > 0 by

κ_p(t) = inf_{p = p′ + p′′} ( ‖p′‖_1 + t · ‖p′′‖_2 ).

This K-functional can be thought of as an interpolation norm between the ℓ1 and ℓ2 norms of the distribution p (and accordingly, for any fixed t, it defines a norm on the space ℓ1 + ℓ2).10 In particular, note that for large values of t, the function κ_p(t) is close to ‖p‖_1, whereas for small values of t, it behaves like t‖p‖_2.

8 We remark that since these codes are not perfectly p-balanced, a minor modification to the reduction needs to be done. See Section 7.1 for details.

9 For example, fix α ∈ (0, 1), and consider the distribution p ∈ Δ([n]) in which n/2 elements are of mass 1/n, and n^α/2 elements are of mass 1/n^α. It is straightforward to verify that 1/‖p‖_2 = Θ(n^{α/2}), whereas ‖p^{−max}_{−ε}‖_{2/3} = Θ(√n). (Intuitively, this is because the ℓ2-norm is mostly determined by the few heavy elements, whereas the ℓ2/3-quasinorm is mostly determined by the numerous light elements.)

10 Interestingly, Holmstedt [35] showed that the infimum is approximately attained by partitioning p = (p′, p′′), such that p′ consists of the heaviest t² coordinates of p and p′′ consists of the rest (for more detail, see Proposition 6.3).
The foregoing connection is due to Montgomery-Smith [40], who established the following concentration of measure inequality for weighted sums of Rademacher random variables:

Pr[ ∑_{i=1}^n p(i) · X_i ≥ κ_p(t) ] ≤ e^{−t²/2}.    (3)

Furthermore, he proved that this concentration bound is essentially tight (see Section 6 for a precise statement). Plugging Equation (3) into Equation (1), we obtain a lower bound of Ω(κ_p^{−1}(1 − 2ε)/log(n)) on the complexity of testing identity to p.
To understand and complement this result, we describe in the next subsection a nearly tight upper bound for this problem, also expressed in terms of this K-functional; this implies that the unexpected connection is, in fact, not a coincidence, but instead captures an intrinsic aspect of the identity testing question. We also give a natural interpretation of this bound, showing that the size of the effective support of p (roughly, the number of supported elements that constitute the vast majority of the mass of p) is a good proxy for the parameter κ_p^{−1}(1 − 2ε), and thus for the complexity of testing identity to p.
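The asymptotics mentioned in footnote 5 (Ω(√n) for the uniform distribution, Ω(n^{1/4}) for the Binomial) can be sanity-checked numerically. The sketch below approximates κ_p(t) via Holmstedt's split from footnote 10 (mass of the ⌊t²⌋ heaviest entries, plus t times the ℓ2-norm of the rest); since this matches the true infimum only up to universal constants, the printed values are indicative rather than exact:

```python
import math

def holmstedt_kappa(p, t):
    """Approximation of Peetre's K-functional kappa_p(t) between l1 and l2:
    mass of the floor(t^2) heaviest entries + t * l2-norm of the rest."""
    q = sorted(p, reverse=True)
    m = int(t * t)
    return sum(q[:m]) + t * math.sqrt(sum(x * x for x in q[m:]))

def kappa_inverse(p, v, iters=60):
    """Approximately the smallest t with kappa_p(t) >= v, by binary search."""
    lo, hi = 0.0, math.sqrt(len(p)) + 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if holmstedt_kappa(p, mid) >= v:
            hi = mid
        else:
            lo = mid
    return hi

n = 10_000
uniform = [1.0 / n] * n
print(holmstedt_kappa(uniform, 5.0))  # ~ t * ||p||_2 = 5/sqrt(n) in the small-t regime
print(kappa_inverse(uniform, 0.8))    # Theta(sqrt(n)) for the uniform distribution

m = 4096
binom = [math.comb(m, i) / 2 ** m for i in range(m + 1)]  # Binomial(m, 1/2) pmf
print(kappa_inverse(binom, 0.8))      # Theta(m^{1/4}) for the Binomial distribution
```

The two regimes of κ_p are visible here: for the uniform distribution the inverse grows like √n, while for the Binomial, whose mass concentrates on Θ(√m) elements, it grows only like m^{1/4}.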

Using the Q-norm Proxy to Obtain an Upper Bound
Toward obtaining an upper bound on the sample complexity of testing identity to p in terms of the K-functional, it will actually be convenient to look at a related quantity, known as the Q-norm [40]. At a high level, the Q-norm of a distribution p, for a given parameter T ∈ N, is the maximum one can reach by partitioning the domain of p into T sets and taking the sum of the ℓ2-norms of these T subvectors. That is,

‖p‖_{Q(T)} = max over partitions A_1 ∪ ··· ∪ A_T = [n] of ∑_{j=1}^T ‖p_{A_j}‖_2,

where p_{A_j} denotes the restriction of p to the coordinates in A_j. Astashkin [6], following up on Montgomery-Smith [40], showed that the Q-norm constitutes a good approximation of the K-functional, by proving that κ_p(t) and ‖p‖_{Q(⌈t²⌉)} agree up to universal constant factors. In Section 6, we further generalize this claim and show that it is possible to get a tradeoff in the upper bound; specifically, we prove that κ_p(t) ≤ ‖p‖_{Q(2t²)}. Thus, it suffices to prove an upper bound on distribution identity testing in terms of the Q-norm.
From an algorithmic point of view, it is not immediately clear that switching to this Q-norm is of any help. However, we will argue that this value captures, in a very quantitative sense, the notion of the sparsity of p. As a first step, observe that if ‖p‖_{Q(T)} = 1, then the distribution p is supported on at most T elements. To see this, denote by p_{A_j} the restriction of the sequence p to the indices in A_j, and note that if ‖p‖_{Q(T)} = ∑_{j=1}^T ‖p_{A_j}‖_2 = 1, then, by the monotonicity of ℓp-norms and since ∑_{j=1}^T ‖p_{A_j}‖_1 = 1, we must have ‖p_{A_j}‖_2 = ‖p_{A_j}‖_1 for every j; this forces each subvector p_{A_j} to be supported on at most a single element. Now, it turns out that it is possible to obtain a robust version of the foregoing observation, yielding a sparsity lemma that, roughly speaking, shows that if ‖p‖_{Q(T)} ≥ 1 − ε, then 1 − O(ε) of the mass of p is concentrated on T elements; in this case, we say that p has O(ε)-effective support of size T. (See Lemma 7.7 for a precise statement of the sparsity lemma.) This property of the Q-norm suggests the following natural test for identity to a distribution p: simply fix T such that ‖p‖_{Q(T)} = 1 − ε, and apply one of the standard procedures for testing identity to a distribution with support size T, which require O(√T) samples. But by the previous discussion, we have ‖p‖_{Q(2t²)} ≥ κ_p(t), so setting T = 2t² for the "right" choice of t = κ_p^{−1}(1 − 2ε) translates to an O(t) upper bound, which is what we were aiming for.
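The sparsity interpretation can be made concrete. In the sketch below, computing the true maximum over partitions is sidestepped by a simple greedy partition (the T − 1 heaviest elements as singletons, the rest lumped into one part), which only lower-bounds ‖p‖_{Q(T)}; the effective-support routine is the proxy discussed above:

```python
import math

def q_norm_greedy(p, T):
    """Lower bound on ||p||_{Q(T)}: the T-1 heaviest elements as singleton
    parts, the rest in one part (whose l2-norm is then counted)."""
    q = sorted(p, reverse=True)
    head, tail = q[:T - 1], q[T - 1:]
    return sum(head) + math.sqrt(sum(x * x for x in tail))

def effective_support(p, eps):
    """Smallest T such that the T heaviest elements carry mass >= 1 - eps."""
    q = sorted(p, reverse=True)
    mass, T = 0.0, 0
    while mass < 1 - eps:
        mass += q[T]
        T += 1
    return T

n, eps = 1000, 0.1
uniform = [1 / n] * n
geometric = [0.5 ** (i + 1) for i in range(n - 1)]
geometric.append(1 - sum(geometric))  # pad so the pmf sums to exactly 1

print(effective_support(uniform, eps))    # ~900: uniform has no sparsity to exploit
print(effective_support(geometric, eps))  # 4: mass 1 - 2^{-4} > 0.9 on the top 4 elements
print(q_norm_greedy(geometric, 5))        # already close to 1 at T = 5
```

The contrast matches the discussion: for the uniform distribution, the effective support (and hence the test's O(√T) cost) is as large as the domain, whereas for a rapidly decaying distribution a constant-size support already captures 1 − ε of the mass.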

PRELIMINARIES
Notation. We write [n] for the (ordered) set of integers {1, . . . , n}, and ln, log for, respectively, the natural and binary logarithms. We use the notation Ω̃(f) to hide polylogarithmic dependencies on the argument, i.e., for expressions of the form Ω(f/log^c f) (for some absolute constant c). Throughout the article, we denote by Δ(Ω) the set of discrete probability distributions over domain Ω. When the domain is a subset of the natural numbers N, we shall identify a distribution p ∈ Δ(Ω) with the sequence (p_i)_{i∈N} ∈ ℓ1 corresponding to its probability mass function (pmf). For a subset S ⊆ Ω, we denote by p|_S the normalized projection of p to S (so p|_S is a probability distribution).
For an alphabet Σ, we denote the projection of x ∈ Σ^n to a subset of coordinates I ⊆ [n] by x|_I. For i ∈ [n], we write x_i = x|_{i} to denote the projection to a singleton. We denote the relative Hamming distance, over alphabet Σ, between two strings x ∈ Σ^n and y ∈ Σ^n by dist(x, y) := |{i ∈ [n] : x_i ≠ y_i}|/n. If dist(x, y) ≤ ε, then we say that x is ε-close to y, and otherwise, we say that x is ε-far from y. Similarly, we denote the relative Hamming distance of x from a non-empty set S ⊆ Σ^n by dist(x, S) := min_{y∈S} dist(x, y). If dist(x, S) ≤ ε, then we say that x is ε-close to S, and otherwise, we say that x is ε-far from S.

Distribution Testing. A property of distributions over Ω is a subset P ⊆ Δ(Ω), consisting of all distributions that have the property. Given two distributions p, q ∈ Δ(Ω), the ℓ1-distance between p and q is defined as the ℓ1-distance between their pmfs, namely, ‖p − q‖_1 = ∑_{i∈Ω} |p_i − q_i|. Given a property P ⊆ Δ(Ω) and a distribution p ∈ Δ(Ω), we then define the distance of p to P as ℓ1(p, P) = inf_{q∈P} ‖p − q‖_1.
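For concreteness, the distance notions defined above translate directly into code; a minimal sketch:

```python
def rel_hamming(x, y):
    """Relative Hamming distance dist(x, y) between equal-length strings."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def dist_to_set(x, S):
    """dist(x, S) = min over y in S of dist(x, y), for a non-empty set S."""
    return min(rel_hamming(x, y) for y in S)

def l1_dist(p, q):
    """l1-distance between two pmfs over the same (indexed) domain."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

print(rel_hamming("10110", "10011"))                 # 0.4: differ in 2 of 5 positions
print(dist_to_set("10110", ["10011", "10100"]))      # 0.2: the second string is closer
print(l1_dist([0.5, 0.5, 0.0], [0.25, 0.25, 0.5]))   # 1.0
```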
A testing algorithm for a fixed property P is then a randomized algorithm T, which takes as input n and ε ∈ (0, 1], and is granted access to independent samples from an unknown distribution p; it satisfies the following: (i) if p ∈ P, the algorithm outputs accept with probability at least 2/3; (ii) if ℓ1(p, P) ≥ ε, it outputs reject with probability at least 2/3.
In other words, T must accept with high probability if the unknown distribution has the property, and reject if it is ε-far from having it. The sample complexity of the algorithm is the number of samples it draws from the distribution in the worst case.
Inequalities. We now state a standard probabilistic result that some of our proofs will rely on, the Paley-Zygmund anticoncentration inequality.

Theorem 3.1 (Paley-Zygmund Inequality). Let X be a non-negative random variable with finite variance. Then, for any θ ∈ [0, 1],

Pr[X > θ · E[X]] ≥ (1 − θ)² · E[X]² / E[X²].

We will also require the following version of the rearrangement inequality, due to Hardy and Littlewood (cf. for instance [12, Theorem 2.2]): for non-negative measurable functions f and g,

∫ f(x) · g(x) dx ≤ ∫ f*(x) · g*(x) dx,

where f*, g* denote the symmetric decreasing rearrangements of f, g, respectively.
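As a quick sanity check of Theorem 3.1, one can verify the inequality on an empirical sample; the exponential distribution below is an arbitrary illustrative choice:

```python
import random

def paley_zygmund_holds(xs, theta):
    """Check Pr[X > theta * E[X]] >= (1 - theta)^2 * E[X]^2 / E[X^2]
    on an empirical sample xs of a non-negative random variable."""
    n = len(xs)
    ex = sum(xs) / n
    ex2 = sum(x * x for x in xs) / n
    lhs = sum(1 for x in xs if x > theta * ex) / n
    rhs = (1 - theta) ** 2 * ex * ex / ex2
    return lhs >= rhs

random.seed(0)
sample = [random.expovariate(1.0) for _ in range(100_000)]
print(all(paley_zygmund_holds(sample, th) for th in (0.0, 0.25, 0.5, 0.9)))  # True
```

For the exponential distribution the margins are large (e.g., at θ = 1/2 the left side is about e^{−1/2} ≈ 0.61 while the right side is 1/8), so the empirical check passes comfortably despite sampling noise.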
Error-Correcting Codes. Let k, n ∈ N, and let Σ be a finite alphabet. A code is a one-to-one function C : Σ^k → Σ^n that maps messages to codewords, where k and n are called the code's dimension and block length, respectively. The rate of the code, measuring the redundancy of the encoding, is defined to be ρ := k/n. We will sometimes identify the code C with its image C(Σ^k). In particular, we shall write c ∈ C to indicate that there exists x ∈ {0,1}^k such that c = C(x), and say that c is a codeword of C. The relative distance of a code is the minimal relative distance between two distinct codewords of C, and is denoted by δ(C) := min_{c≠c′∈C} dist(c, c′). We say that C is an asymptotically good code if it has constant rate and constant relative distance. We shall make extensive use of asymptotically good codes that are balanced, that is, codes in which each codeword consists of the same number of 0's and 1's.

Proposition 3.3 (Good Balanced Codes). For any constant δ ∈ [0, 1/3), there exists a good balanced code C : {0,1}^k → {0,1}^n with relative distance δ and constant rate. Namely, there exists a constant ρ > 0 such that the following holds.
Proof. Fix any code C with relative distance δ and constant rate (denoted ρ). We transform C : {0, 1}^k → {0, 1}^n into a balanced code C′ : {0, 1}^k → {0, 1}^{2n} by representing 0 and 1 as the balanced strings 01 and 10 (respectively). More accurately, we let C′(x) be the string obtained from C(x) by replacing each bit b with the pair b·b̄, where · denotes concatenation and b̄ is the negation of the bit b. It is immediate to check that this transformation preserves the relative distance, and that C′ is a balanced code with rate ρ/2. □

On Uniformity. For the sake of notation and clarity, throughout this work, we define all algorithms and objects non-uniformly. Namely, we fix the relevant parameter (typically n ∈ ℕ), and restrict ourselves to inputs or domains of size n (for instance, probability distributions over domain [n]). However, we still view it as a generic parameter and allow ourselves to write asymptotic expressions such as O(n). Moreover, although our results are stated in terms of non-uniform algorithms, they can be extended to the uniform setting in a straightforward manner.
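The bit-doubling transformation above can be sketched as follows (a minimal illustration with hypothetical 4-bit codewords; function names are ours):

```python
def balance(codeword: str) -> str:
    """Map each bit b of a codeword to the balanced pair (b, not-b):
    0 -> 01 and 1 -> 10, doubling the block length and halving the rate."""
    return "".join("10" if b == "1" else "01" for b in codeword)

def hamming(u: str, v: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(u, v))

c1, c2 = "1100", "1010"
b1, b2 = balance(c1), balance(c2)
assert b1.count("1") == b2.count("1") == len(b1) // 2  # each output is balanced
assert hamming(b1, b2) == 2 * hamming(c1, c2)          # relative distance preserved
```

Each differing input bit yields exactly two differing output positions out of 2n, so the relative distance is unchanged while the rate drops from ρ to ρ/2.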

THE METHODOLOGY: FROM COMMUNICATION COMPLEXITY
TO DISTRIBUTION TESTING

In this section, we adapt the methodology for proving property testing lower bounds via reductions from communication complexity, due to Blais, Brody, and Matulef [15], to the setting of distribution testing. As observed in References [15, 16], to prove lower bounds on the query complexity of non-adaptive testers, it suffices to reduce from one-way communication complexity. We show that for distribution testers (which are inherently non-adaptive), it suffices to reduce from the more restricted communication complexity model of private-coin simultaneous message passing (SMP).
Recall that a private-coin SMP protocol for a communication complexity predicate f : {0, 1}^k × {0, 1}^k → {0, 1} consists of three computationally unbounded parties: two players (commonly referred to as Alice and Bob) and a referee. Alice and Bob receive inputs x, y ∈ {0, 1}^k. Each of the players simultaneously (and independently) sends a message to the referee, based on its input and (private) randomness. The referee is then required to successfully compute f(x, y) with probability at least 2/3, using its private randomness and the messages received from Alice and Bob. The communication complexity of an SMP protocol is the total number of bits sent by Alice and Bob. The private-coin SMP complexity of f, denoted SMP(f), is the minimum communication complexity over all SMP protocols that compute f with probability at least 2/3.
Generally, to reduce an SMP problem f to ε-testing a distribution property Π, Alice and Bob can send messages m_A(x, r_A, ε) and m_B(y, r_B, ε) (respectively) to the referee, where r_A and r_B are the private random strings of Alice and Bob. Subsequently, the referee uses the messages m_A(x, r_A, ε) and m_B(y, r_B, ε), as well as its own private randomness, to feed the property tester samples from a distribution p that satisfies the following conditions: (completeness) if f(x, y) = 1, then p ∈ Π; and (soundness) if f(x, y) = 0, then p is ε-far from Π. We shall focus on a special type of the foregoing reductions, which is particularly convenient to work with and suffices for all of our lower bounds. Loosely speaking, in these reductions Alice and Bob both send the referee samples from sub-distributions that can be combined by the referee to obtain samples from a distribution that satisfies the completeness and soundness conditions. The following lemma gives a framework for proving lower bounds based on such reductions.
Lemma 4.1. Let f : {0, 1}^k × {0, 1}^k → {0, 1} be an SMP predicate, and let Π be a property of distributions over [n]. Suppose there exist a distribution p(x, y) ∈ Δ([n]), mappings x ↦ (p_A(x), α(x)) and y ↦ (p_B(y), β(y)) with p_A(x), p_B(y) ∈ Δ([n]) and α(x), β(y) > 0, such that: (decomposability) p(x, y) = (α(x)/(α(x)+β(y)))·p_A(x) + (β(y)/(α(x)+β(y)))·p_B(y); (completeness) if f(x, y) = 1, then p(x, y) ∈ Π; (soundness) if f(x, y) = 0, then p(x, y) is ε-far from Π; and α, β can each be encoded with O(log n) bits.

Then, every ε-tester for Π needs Ω(SMP(f)/log n) samples.

Proof. Suppose there exists an ε-tester for Π with sample complexity s. Let x, y ∈ {0, 1}^k be the inputs of Alice and Bob (respectively) for the SMP problem. Alice computes the distribution p_A(x) and the "decomposability parameter" α = α(x), and sends s independent samples from p_A(x), as well as the parameter α. Analogously, Bob computes p_B(y) and its parameter β = β(y), and sends s independent samples from p_B(y), as well as the parameter β. Subsequently, the referee generates a sequence of s independent samples from p(x, y), where each sample is drawn as follows: with probability α/(α+β), use a (fresh) sample from Alice's samples, and with probability 1 − α/(α+β), use a (fresh) sample from Bob's samples. Finally, the referee feeds the generated samples to the ε-tester for Π.
The above procedure indeed allows the referee to retrieve, with probability one, s independent samples from the distribution (α/(α+β))·p_A(x) + (β/(α+β))·p_B(y), which equals p(x, y) by the decomposability condition. If (x, y) ∈ f^{-1}(1), then by the completeness condition p(x, y) ∈ Π, and so the ε-tester for Π is successful with probability at least 2/3. Similarly, if (x, y) ∈ f^{-1}(0), then by the soundness condition p(x, y) is ε-far from Π, and so the ε-tester for Π is again successful with probability at least 2/3. Finally, note that since each one of the samples provided by Alice and Bob requires sending log n bits, the total communication complexity of the protocol is 2s·log n + O(log n) (the last term accounting for the cost of sending α and β); hence s = Ω(SMP(f)/log n). □

We conclude this section by stating a well-known SMP lower bound on the equality problem. Let Eq_k : {0, 1}^k × {0, 1}^k → {0, 1} be the equality predicate, i.e., Eq_k(x, y) = 1 if and only if x = y. In this work, we shall frequently use the following (tight) lower bound on the Eq_k predicate:

Theorem 4.2 (Newman and Szegedy [42]). For every k ∈ ℕ, it holds that SMP(Eq_k) = Ω(√k).
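The referee's sample-combining step in the proof of Lemma 4.1 can be sketched as follows (a simplified illustration; the toy distributions and all names are ours):

```python
import random

def referee_samples(samples_a, samples_b, alpha, beta, s, rng):
    """Combine Alice's and Bob's sample streams into s i.i.d. samples from the
    mixture (alpha/(alpha+beta)) * p_A + (beta/(alpha+beta)) * p_B, consuming
    each received sample at most once ("fresh" samples)."""
    it_a, it_b = iter(samples_a), iter(samples_b)
    out = []
    for _ in range(s):
        # Flip a biased coin to decide whose (fresh) sample to forward.
        out.append(next(it_a) if rng.random() < alpha / (alpha + beta) else next(it_b))
    return out

rng = random.Random(0)
# Toy run: p_A uniform on {0, 1}, p_B uniform on {2, 3}, with alpha = beta.
a = [rng.choice([0, 1]) for _ in range(100)]
b = [rng.choice([2, 3]) for _ in range(100)]
mixed = referee_samples(a, b, alpha=1, beta=1, s=50, rng=rng)
assert len(mixed) == 50 and all(x in {0, 1, 2, 3} for x in mixed)
```

Since each forwarded sample is used only once, the s outputs are independent draws from the stated mixture, exactly as the proof requires.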
Proof. Assume there exists a q-sample ε-tester for the uniform distribution, with error probability 1/6. For a sufficiently large k ∈ ℕ, let C : {0, 1}^k → {0, 1}^n be a balanced code as promised by Proposition 3.3 with relative distance ε; namely, there exists an absolute constant ρ > 0 such that k ≥ ρn. Given inputs x, y ∈ {0, 1}^k, Alice sets A := {i ∈ [n] : C(x)_i = 1} and sends samples from the uniform distribution u_A over A, while Bob sets B := {i ∈ [n] : C(y)_i = 0} and sends samples from the uniform distribution u_B over B; since the code is balanced, |A| = |B| = n/2, and the referee (taking each player's samples with probability 1/2) obtains samples from p(x, y) := (1/2)·u_A + (1/2)·u_B.

Completeness. If (x, y) ∈ Eq_k^{-1}(1), then C(x) = C(y) and A = [n] \ B. This implies that p(x, y) is indeed the uniform distribution on [n], as desired.

Soundness. If (x, y) ∈ Eq_k^{-1}(0), then dist(C(x), C(y)) > ε, and therefore |A △ B̄| > εn by construction. Since p(x, y) assigns mass 2/n to each element of A ∩ B = A \ B̄, and mass 0 to any element of Ā ∩ B̄ = B̄ \ A, we have ‖p(x, y) − u‖1 = (1/n)·|A △ B̄| > ε; that is, p(x, y) is ε-far from uniform.
The desired Ω(√n/log n) lower bound then immediately follows from Lemma 4.1 and Theorem 4.2.
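The core of the reduction, building p(x, y) from two codewords and measuring its distance from uniform, can be illustrated as follows (a toy sketch with hypothetical 4-bit balanced codewords; A and B are the supports used by Alice and Bob):

```python
from fractions import Fraction

def reduction_distribution(cx: str, cy: str):
    """Given balanced codewords C(x), C(y) of length n, build
    p = (1/2) * uniform(A) + (1/2) * uniform(B), where A = {i : C(x)_i = 1}
    and B = {i : C(y)_i = 0}. p is uniform on [n] iff C(x) = C(y)."""
    n = len(cx)
    A = [i for i in range(n) if cx[i] == "1"]
    B = [i for i in range(n) if cy[i] == "0"]
    p = [Fraction(0)] * n
    for i in A:
        p[i] += Fraction(1, 2 * len(A))
    for i in B:
        p[i] += Fraction(1, 2 * len(B))
    return p

def l1_to_uniform(p):
    """Exact l1 distance between p and the uniform distribution on [n]."""
    n = len(p)
    return sum(abs(q - Fraction(1, n)) for q in p)

# Equal codewords -> exactly uniform; unequal -> distance = (Hamming distance)/n.
assert l1_to_uniform(reduction_distribution("1100", "1100")) == 0
assert l1_to_uniform(reduction_distribution("1100", "1010")) == Fraction(2, 4)
```

Exact rational arithmetic makes the yes-case ("exactly uniform") and the no-case ("distance equals the relative Hamming distance") checkable without floating-point slack.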

Obtaining ε-Dependency
In this section, we explain how to generalize the reduction from the previous section to obtain some dependence (albeit non-optimal) on the distance parameter ε in the lower bound. This generalization relies on an extension of the methodology of Lemma 4.1: instead of having the referee define the distribution p(x, y) as a mixture of p_A(x) and p_B(y) (namely, p(x, y) = (α(x)/(α(x)+β(y)))·p_A(x) + (β(y)/(α(x)+β(y)))·p_B(y)), he will instead use a (random) combination function F_ε, a function of ε and its private coins only. Given this function, which maps a larger domain of size m = Θ(n/ε²) to [n], p(x, y) will be defined as the distribution induced on [n] by applying F_ε to the mixture (1/2)·(p_A(x) + p_B(y)) ∈ Δ([m]). More simply, this allows Alice and Bob to send to the referee samples from their respective distributions on a much larger domain m ≫ n; the referee, who has on its side chosen how to randomly partition this large domain into only n different "buckets," converts these draws from Alice and Bob into samples from the induced distributions on the n buckets, and takes a mixture of these two distributions instead. By choosing each bucket to have size roughly 1/ε², we expect this random "coarsening" of Alice and Bob's distributions to yield a distribution at distance only Ω(ε) from uniformity (instead of constant distance) in the no-case, but now letting us get a lower bound in terms of the original support size m, i.e., Ω̃(√(n/ε²)), instead of Ω̃(√n) as before.
Proof of Theorem 5.2. We reduce from Eq_k, where k ∈ ℕ is again assumed big enough (in particular, with regard to 1/ε²). Alice and Bob act as in Section 5, separately creating (a, b) := (C(x), C(y)) and the corresponding distributions p_A(x), p_B(y) ∈ Δ([m]). This is where we deviate from the proof of Theorem 5.1: indeed, setting n := cε²m (where c > 0 is an absolute constant determined later), the referee will combine the samples from p_A(x) and p_B(y) in a different way, to emulate a distribution p(x, y) ∈ Δ([n]), that is, with a much smaller support than that of p_A(x), p_B(y) (instead of setting p(x, y) to be, as before, a mixture of the two).
Note, furthermore, that each sample sent by Alice and Bob (who have no knowledge of the randomly chosen F_ε) can be encoded with O(log m) = O(log(n/ε)) bits. We then turn to establishing the analogues, in this generalized reduction, of the last two conditions of Lemma 4.1, i.e., the completeness and soundness. The former, formally stated below, will be an easy consequence of the previous section.

Claim 1. If x = y, then p(x, y) is the uniform distribution on [n].
Proof. As in the proof of Theorem 5.1, in this case the distribution p′(x, y) := (1/2)·(p_A(x) + p_B(y)) ∈ Δ([m]) is uniform; since each "bucket" B_j := F_ε^{-1}(j) has the same size, this implies that p(x, y)(j) = p′(x, y)(B_j) = 1/n for all j ∈ [n]. □

Establishing the soundness, however, is not as straightforward:

Claim 2. If x ≠ y, then with probability at least 1/100 (over the choice of the equipartition (B_1, ..., B_n)), p(x, y) is ε-far from uniform.
Proof. Before delving into the proof, we provide a high-level idea of why this holds. Since the partition was chosen uniformly at random, in expectation each element j ∈ [n] will have probability E[p(x, y)(j)] = E[p′(x, y)(B_j)] = 1/n. However, since a constant fraction of elements i ∈ [m] (before the random partition) has probability mass either 0 or 2/m (as in the proof of Theorem 5.1), and each bucket B_j contains r = 1/(cε²) elements chosen uniformly at random, we expect the fluctuations of p′(x, y)(B_j) around its expectation to be of the order of Ω(√r/m) = Ω(ε/n) with constant probability; summing over all j's, this will give us the distance Ω(ε) we want.
To make this argument precise, we assume x ≠ y, so that the codewords C(x) and C(y) disagree on more than δm positions; and we define H ⊆ [m] as the set of "high" elements, i.e., those with mass 2/m under p′(x, y), and L ⊆ [m] as the set of "low" elements, i.e., those with mass 0. For any j ∈ [n], we then let the random variables H^(j), L^(j) be the number of high and low elements of [m] in the bucket B_j, respectively. From the definitions, we get that p = p(x, y) satisfies p(j) = (r + H^(j) − L^(j))/m for every j ∈ [n]. Furthermore, it is easy to see that E[p(j)] = r/m = 1/n for all j ∈ [n], where the expectation is over the choice of the equipartition by the referee.
As previously discussed, we will analyze the deviation from this expectation; more precisely, we want to show that with good probability, a constant fraction of the j's will be such that p(j) deviates from 1/n by at least an additive Ω(√r/m) = Ω(ε/n). This anticoncentration guarantee will be a consequence of applying the Paley-Zygmund inequality (Theorem 3.1) to Z^(j) := (H^(j) − L^(j))² ≥ 0; in view of applying it, we need to analyze the first two moments of this random variable.

Proof. Fix any j ∈ [n]. We write for convenience X and Y for, respectively, H^(j) and L^(j). The distribution of (X, Y, r − (X + Y)) is then a multivariate hypergeometric distribution (cf. Reference [38]) with three classes: the high elements, the low elements, and the remaining elements of [m].
Denote by Hypergeom(n, K, N) the hypergeometric distribution describing n draws, without replacement, from a set of N elements of which K are considered successes. Without loss of generality (discarding disagreeing positions if necessary), assume the codewords disagree on exactly δm positions, so that |H| = |L| = δm/2. Conditioning on U := X + Y, we then have X | U ∼ Hypergeom(U, δm/2, δm). Moreover, U itself is hypergeometrically distributed, with U ∼ Hypergeom(r, δm, m). We can thus write Z^(j) = (X − Y)² = (2X − U)². By straightforward, yet tedious, calculations involving the computation of E[(2X − U)² | U] and E[(2X − U)⁴ | U] (after expanding and using the known moments of the hypergeometric distribution),12 we obtain the desired expressions for E[Z^(j)] and E[(Z^(j))²], the last equality using δ = 1/3.
We can now apply the Paley-Zygmund inequality to Z^(j). Doing so, we obtain that for r ≤ m/4 (with some slack) and any θ ∈ [0, 1], Pr[Z^(j) ≥ θ²·E[Z^(j)]] ≥ (1 − θ²)²·E[Z^(j)]²/E[(Z^(j))²]. By the lemma above, the right-hand side converges to (1 − θ²)²/3 when m → ∞, and therefore is at least (1 − θ²)²/4 for m big enough. We set θ := 1/√2 to obtain the following: There exists M ≥ 0 such that, for every m ≥ M,

Pr[|H^(j) − L^(j)| ≥ δ√r/4] ≥ 1/16.    (4)
Equation (4) implies that the number K of good indices j ∈ [n] satisfying |H^(j) − L^(j)| ≥ δ√r/4 is in expectation at least n/16, and by an averaging argument,13 K = Ω(n) with probability at least 1/100. Whenever this happens, the distance from p to uniform is at least (√c/(40√3))·ε, and choosing c ≥ 4800 so that √c/(40√3) ≥ 1 yields the claim. □

From this lemma, we can complete the reduction: Given a tester T for uniformity with sample complexity q, we first convert it by standard amplification into a tester T′ with failure probability 1/200 and sample complexity O(q). The referee can provide T′ with samples from the distribution p(x, y) and, on input ε:
• If x = y, then T′ will return reject with probability at most 1/200;
• If x ≠ y, then T′ will return reject with probability at least (199/200)·(1/100) > 1/200;
so repeating the protocol independently a constant (fixed in advance) number of times and taking a majority vote enables the referee to solve Eq_k with probability at least 2/3. Since SMP(Eq_k) = Ω(√k) (Theorem 4.2), this yields a lower bound of Ω(√k/log m) = Ω̃(√(n/ε²)) on the sample complexity of T′, and therefore of T.

12 One can also use a formal computation system, e.g., Mathematica.
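The coarsening phenomenon at the heart of Claim 2 can be checked empirically. The following simulation (illustrative parameters of our choosing, with δ = 1/3 as in the proof) measures the ℓ1 distance of the coarsened distribution from uniform:

```python
import random

def coarsened_distance(m, r, delta, seed):
    """Put mass 2/m on delta*m/2 'high' elements, 0 on delta*m/2 'low' elements,
    and 1/m elsewhere; coarsen [m] into n = m // r random buckets of size r, and
    return the l1 distance of the induced distribution from uniform on [n]."""
    rng = random.Random(seed)
    half = int(delta * m / 2)
    mass = [2] * half + [0] * half + [1] * (m - 2 * half)  # masses in units of 1/m
    rng.shuffle(mass)  # a random equipartition == shuffle, then chop into blocks
    n = m // r
    dist = 0.0
    for j in range(n):
        bucket_mass = sum(mass[j * r : (j + 1) * r]) / m
        dist += abs(bucket_mass - 1 / n)
    return dist

# Per-bucket fluctuations of order sqrt(r)/m add up to a detectable l1 distance.
d = coarsened_distance(m=12000, r=100, delta=1/3, seed=42)
assert d > 0.005
```

With bucket size r, each bucket's mass deviates from 1/n by roughly √r/m, so the total distance is of order n·√r/m, matching the Ω(ε) behavior the proof establishes.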

THE K-FUNCTIONAL: AN UNEXPECTED JOURNEY
A quantity that will play a major role in our results is the K-functional between ℓ1 and ℓ2, a specific case of the key operator in interpolation theory introduced by Peetre [45]. We start by recalling below the definition and some of its properties, before establishing (for our particular setting) results that will be crucial to us. (For more on the K-functional and its use in functional analysis, the reader is referred to References [12] and [6].)
Recall that for a ∈ ℓ1 + ℓ2 and t > 0, the K-functional is defined as κ_a(t) := inf{‖a′‖1 + t·‖a″‖2 : a = a′ + a″}. In other terms, as t varies, the quantity κ_a(t) interpolates between the ℓ1 and ℓ2 norms of the sequence a (and accordingly, for any fixed t, it defines a norm on ℓ1 + ℓ2). In particular, note that for large values of t the function κ_a(t) is close to ‖a‖1, whereas for small values of t the function κ_a(t) is close to t·‖a‖2 (see Corollary 6.5). We henceforth focus on the case of K_{ℓ1,ℓ2}, although some of the results mentioned hold in the general setting of arbitrary Banach spaces X_0, X_1.

Proposition 6.2 ([12, Proposition 1.2]). For any a ∈ ℓ1 + ℓ2, κ_a is continuous, increasing, and concave. Moreover, the function t ↦ κ_a(t)/t is decreasing.

Although no closed-form expression is known for κ_a, it will be necessary for us to understand its behavior, and we therefore seek good upper and lower bounds on its value. We start with the following inequality, due to Holmstedt [35], which, loosely speaking, shows that the infimum in the definition of κ_a(t) is roughly attained by partitioning a = (a_1, a_2) such that a_1 consists of the t² heaviest coordinates of a, and a_2 consists of the rest.
Proposition 6.3 (Holmstedt [35]). For any a ∈ ℓ1 + ℓ2 and t > 0,

(1/4)·(Σ_{i=1}^{⌊t²⌋} a*_i + t·(Σ_{i=⌊t²⌋+1}^∞ (a*_i)²)^{1/2}) ≤ κ_a(t) ≤ Σ_{i=1}^{⌊t²⌋} a*_i + t·(Σ_{i=⌊t²⌋+1}^∞ (a*_i)²)^{1/2},

where a* is a non-increasing permutation of the sequence (|a_i|)_{i∈ℕ}.
(We remark that for our purposes, this constant-factor gap between the left-hand and right-hand sides is not innocuous, as we will later need to study the behavior of the inverse of the function κ_a.) Incomparable bounds on κ_a were obtained in Reference [40], relating it to a different quantity, the "Q-norm," which we discuss and generalize next.

Approximating the K-Functional by the Q-norm
Loosely speaking, the Q-norm of a vector a (for a given parameter T) is a mixed ℓ1/ℓ2 norm: It is the maximum one can reach by partitioning the components of a into T sets and taking the sum of the ℓ2 norms of these T subvectors. Although not straightforward to interpret, this intuitively captures the notion of sparsity of a: Indeed, if a is supported on k elements, then its Q-norm becomes equal to the ℓ1 norm for parameter T ≥ k.

Proposition 6.4 ([6, Lemma 2.2], after [40, Lemma 2]). For arbitrary a ∈ ℓ2 and T ∈ ℕ, define the norm ‖a‖_{Q(T)} := max{ Σ_{j=1}^T (Σ_{i∈A_j} a_i²)^{1/2} : (A_1, ..., A_T) a partition of ℕ }. Then, for any a ∈ ℓ2 and t > 0 such that t² ∈ ℕ, we have ‖a‖_{Q(t²)} ≤ κ_a(t) ≤ √2·‖a‖_{Q(t²)}.

As we shall see shortly, one can generalize this result further, obtaining a tradeoff in the upper bound. Before turning to this extension in Lemmas 6.6 and 6.7, we first state several other properties of the K-functional implied by the above: Moreover, for a supported on finitely many elements, it is the case that lim_{t→∞} κ_a(t) = ‖a‖1.
Proof. The first two points follow by definition; turning to item (iii), we first note that the upper bound is a direct consequence of the definition of κ_a as an infimum (as, for all t > 0, κ_a(t) ≤ ‖a‖1). (This itself ensures the limit as t → ∞ exists by monotone convergence, as κ_a is a non-decreasing bounded function.) The lower bound follows from that of Proposition 6.3, which guarantees that for all t > 0, κ_a(t) ≥ (1/4)·Σ_{i=1}^{⌊t²⌋} a*_i. Finally, the last point can be obtained immediately from, e.g., the lower-bound side of Proposition 6.4 and the upper bound given in item (iii) above. □

Lemma 6.6. For any a ∈ ℓ2 and t such that t² ∈ ℕ, we have ‖a‖_{Q(t²)} ≤ κ_a(t) ≤ ‖a‖_{Q(2t²)}.

Proof of Lemma 6.6. We follow and adapt the proof of Reference [6, Lemma 2.2] (itself similar to that of Reference [40, Lemma 2]). The first inequality is immediate: Indeed, for any sequence c ∈ ℓ2, by the definition of ‖c‖_{Q(t²)} and the monotonicity of the ℓp norms, we have ‖c‖_{Q(t²)} ≤ ‖c‖1; and by Cauchy-Schwarz, for any partition (A_j)_{1≤j≤t²} of ℕ, Σ_{j=1}^{t²} ‖c_{A_j}‖2 ≤ t·(Σ_{j=1}^{t²} ‖c_{A_j}‖2²)^{1/2} = t·‖c‖2, and thus ‖c‖_{Q(t²)} ≤ t·‖c‖2. This yields the lower bound, as by the triangle inequality ‖a‖_{Q(t²)} ≤ ‖a′‖_{Q(t²)} + ‖a″‖_{Q(t²)} ≤ ‖a′‖1 + t·‖a″‖2 for every decomposition a = a′ + a″.
We turn to the upper bound. As ℓ2(ℝ) is a symmetric space and κ_a = κ_{|a|}, without loss of generality we can assume that (a_k)_{k∈ℕ} is non-negative and monotone non-increasing, i.e., a_1 ≥ a_2 ≥ ··· ≥ a_k ≥ ···. We will rely on the characterization of κ_a as κ_a(t) = sup{Σ_{k=1}^∞ a_k b_k : max(‖b‖∞, t^{-1}‖b‖2) ≤ 1} (see, e.g., Reference [6, Lemma 2.2] for a proof). The first step is to establish the existence of a "nice" sequence b ∈ ℓ2 arbitrarily close to this supremum:

Claim 3. For every δ > 0, there exists a non-negative, monotone non-increasing sequence b* ∈ ℓ2 with max(‖b*‖∞, t^{-1}‖b*‖2) ≤ 1 such that Σ_{k=1}^∞ a_k b*_k ≥ κ_a(t) − δ.

Proof. By the above characterization, there exists a sequence b with max(‖b‖∞, t^{-1}‖b‖2) ≤ 1 such that Σ_k a_k b_k ≥ κ_a(t) − δ. We now claim that we can further take b to be non-negative and monotone non-increasing as well. The first part is immediate, as replacing negative terms by their absolute values can only increase the sum (since a is itself non-negative). For the second part, we will invoke the Hardy-Littlewood rearrangement inequality (Theorem 3.2), which states that for any two non-negative functions f, g vanishing at infinity, the integral ∫_ℝ f·g is maximized when f and g are non-increasing. We apply this inequality to a, b, letting a*, b* be the non-increasing rearrangements of a, b (in particular, we have a = a*) and introducing the step functions f_a := Σ_{k=1}^∞ a_k·1_{(k−1,k]} and f_b := Σ_{k=1}^∞ b_k·1_{(k−1,k]}, where 1_{(a,b]} is the indicator function of the interval (a, b]. The functions f_a, f_b satisfy the hypotheses of Theorem 3.2. Thus, we get Σ_k a_k b_k = ∫_ℝ f_a·f_b ≤ ∫_ℝ f*_a·f*_b = Σ_k a_k b*_k. Moreover, it is immediate to check that max(‖b*‖∞, t^{-1}‖b*‖2) ≤ 1. □

The next step is to relate the inner product Σ_{k=1}^∞ a_k b*_k to the Q-norm of a:

Claim 4. Fix t > 0 such that t² ∈ ℕ, and let b* ∈ ℓ2 be any non-increasing, non-negative sequence with max(‖b*‖∞, t^{-1}‖b*‖2) ≤ 1. Then Σ_{k=1}^∞ a_k b*_k ≤ ‖a‖_{Q(2t²)}.

Proof. We proceed constructively, by exhibiting a partition of ℕ into 2t² sets A_1, ..., A_{2t²} such that Σ_{k=1}^∞ a_k b*_k ≤ Σ_{j=1}^{2t²} (Σ_{i∈A_j} a_i²)^{1/2}. This will prove the claim, by definition of ‖a‖_{Q(2t²)} as the supremum over all such partitions.
Specifically, we inductively choose n_0, n_1, ..., n_T ∈ {0, ..., ∞} as follows, where T := t²/c for some c > 0 to be chosen later (satisfying T ∈ ℕ). If 0 = n_0 < n_1 < ··· < n_m are already set, then n_{m+1} := min{ν > n_m : Σ_{i=n_m+1}^ν (b*_i)² > c}, with n_{m+1} := ∞ if no such ν exists. Since every finite step consumes squared mass more than c, and ‖b*‖2² ≤ t² = cT, it follows that n_T = ∞. Let m* be the first index such that n_{m*+1} > n_{m*} + 1; note that this implies (by monotonicity of b*) that (b*_i)² ≤ c for all i ≥ n_{m*} + 1. We let A_m := {n_{m−1}+1, ..., n_m} for 1 ≤ m ≤ T, and split Σ_k a_k b*_k into the contributions of the indices up to n_{m*} and of those beyond. Since ‖b*‖∞ ≤ 1 and n_{m−1} + 1 = n_m for all m ≤ m* (so each such A_m is a singleton), the first term can be bounded as Σ_{k=1}^{n_{m*}} a_k b*_k ≤ Σ_{m=1}^{m*} (Σ_{i∈A_m} a_i²)^{1/2}. Turning to the second term, we recall that (b*_i)² ≤ c for all i ≥ n_{m*} + 1, so that Σ_{i=n_{m−1}+1}^{n_m} (b*_i)² ≤ 2c for all m ≥ m* + 1; by Cauchy-Schwarz, this allows us to bound the second term as Σ_{k=n_{m*}+1}^∞ a_k b*_k ≤ Σ_{m=m*+1}^T ‖b*_{A_m}‖2·(Σ_{i∈A_m} a_i²)^{1/2} ≤ √(2c)·Σ_{m=m*+1}^T (Σ_{i∈A_m} a_i²)^{1/2}. Therefore, by combining the two, we get that Σ_{k=1}^∞ a_k b*_k ≤ max(1, √(2c))·Σ_{m=1}^T (Σ_{i∈A_m} a_i²)^{1/2} = Σ_{m=1}^T (Σ_{i∈A_m} a_i²)^{1/2} ≤ ‖a‖_{Q(2t²)}, the last equality by choosing c := 1/2 (so that T = 2t² and √(2c) = 1). □

We now fix an arbitrary δ > 0, and we let b* be as promised by Claim 3. As this sequence satisfies the assumptions of Claim 4, putting the two results together leads to κ_a(t) − δ ≤ Σ_k a_k b*_k ≤ ‖a‖_{Q(2t²)}. Since this holds for all δ > 0, taking the limit as δ → 0 gives the (upper bound of the) lemma. □
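For small vectors, the Q-norm can be computed by brute force over partitions, which makes the sparsity behavior discussed above (equality with the ℓ1 norm once T reaches the support size, a strict gap otherwise) easy to verify. An illustrative sketch, with toy vectors of our choosing:

```python
import itertools
import math

def q_norm(a, T):
    """||a||_{Q(T)}: maximum, over partitions of the support into at most T sets,
    of the sum of the l2 norms of the corresponding sub-vectors (brute force)."""
    idx = [i for i, x in enumerate(a) if x != 0]
    best = 0.0
    # Assigning each support index a block label in {0, ..., T-1} enumerates
    # every partition into at most T sets.
    for labels in itertools.product(range(T), repeat=len(idx)):
        total = 0.0
        for block in range(T):
            total += math.sqrt(sum(a[i] ** 2 for i, l in zip(idx, labels) if l == block))
        best = max(best, total)
    return best

a_sparse = [0.5, 0, 0.3, 0, 0.2]  # support size 3
assert abs(q_norm(a_sparse, 3) - sum(a_sparse)) < 1e-12  # equals the l1 norm when T >= support
a_dense = [1.0, 1.0, 1.0]
assert q_norm(a_dense, 2) < sum(a_dense)                 # strictly below the l1 norm otherwise
```

The exponential enumeration is only viable for tiny supports; it is meant purely to illustrate the definition, not to be an efficient algorithm.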
We observe that, with similar techniques, one can also establish the following generalization of Proposition 6.4:

Lemma 6.7 (Generalization of Proposition 6.4). For any a ∈ ℓ2, t > 0, and α ∈ [1, ∞) such that αt² ∈ ℕ, a similar upper bound holds in terms of ‖a‖_{Q(αt²)}, with a constant factor depending on α.

Proof of Lemma 6.7 (Sketch). We again follow the proof of [6, Lemma 2.2], up to the inductive definition of n_1, ..., n_j, which we modify so that it follows that n_{αt²} = ∞. Therefore, for any δ > 0, the same chain of inequalities goes through; since this holds for all δ > 0, taking the limit gives the (upper bound of the) lemma. □

We note that further inequalities relating κ_a to other functionals of a were obtained in Reference [34].

Concentration Inequalities for Weighted Rademacher Sums
The connection between the K-functional and tail bounds on weighted sums of Rademacher random variables was first made by Montgomery-Smith [40], to whom the following result is due (we here state a version with slightly improved constants):

Theorem 6.8. Let (X_i)_{i∈ℕ} be a sequence of independent Rademacher random variables, i.e., uniform on {−1, 1}. Then, for any a ∈ ℓ2 and t > 0, Pr[Σ_i a_i X_i ≥ κ_a(t)] ≤ e^{−t²/2}; and, for any fixed c > 0 and all t ≥ 1, a matching lower bound on the tail probability at c·κ_a(t) holds.

One can interpret the above theorem as stating that the (inverse of the) K-functional κ_a is the "right" parameter to consider in these tail bounds, while standard Chernoff or Hoeffding bounds depend instead on the quantity ‖a‖2. Before giving the proof of this theorem, we remark that similar statements and improvements can be found in References [6, 34]; below, we closely follow the argument of the latter.
Proof of Theorem 6.8. The upper bound can be found in, e.g., Reference [40] or Reference [6, Theorem 2.2]. For the lower bound, we mimic the proof due to Astashkin, improving the parameters of some of the lemmas it relies on.

Lemma 6.9 (Small improvement of (2.14) in Reference [6, Lemma 2.3]). If a = (a_k)_{k≥1} ∈ ℓ2, then, for any λ ∈ (0, 1), the bound stated in (2.14) holds with improved constants.

Proof of Lemma 6.9. The proof is exactly the same, but when invoking (1.10) for p = 4, we use the tight version proven there for p = 2m (instead of the more general version that also applies to odd values of p): since m = 2, we get (2m)!/(2^m·m!) = 3.
(Lemma 6.9) □ By taking the limit as δ → 0⁺, we then obtain the claimed bound. This takes care of the case where t²/c is an integer. If this is not the case, then we consider s := √(c·(⌊t²/c⌋ + 1)), so that t² ≤ s² ≤ t² + c. The monotonicity of κ_a then ensures that the bound established for s carries over to t, which concludes the proof. □

Some Examples
To gain intuition about the behavior of κ_a, we now compute tight asymptotic expressions for it in several instructive cases, specifically for some natural examples of probability distributions in Δ([n]). From the lower bound of Proposition 6.4 and the fact that κ_p(t) ≤ ‖p‖1 for any p ∈ ℓ1, it is clear that as soon as t ≥ √n, κ_p(t) = 1 for any p ∈ Δ([n]). It suffices then to consider the case 0 < t ≤ √n.
The Uniform Distribution. We let p be the uniform distribution on [n]: p_k = 1/n for all k ∈ [n]. By considering a partition of [n] into t² sets of size n/t², the lower bound of Proposition 6.4 yields κ_p(t) ≥ ‖p‖_{Q(t²)} ≥ t/√n. However, by definition, κ_p(t) = inf_{p′+p″=p}(‖p′‖1 + t·‖p″‖2) ≤ t·‖p‖2 = t/√n, and thus κ_p(t) = t/√n for 0 < t ≤ √n. We remark that in this case, the upper bound of Holmstedt from Proposition 6.3 only results in a weaker estimate, given by a function f that is neither concave nor non-decreasing, and not even bounded by 1; it is instructive to note that this shows it could not possibly have been the right expression for κ_p (and therefore that Proposition 6.3 cannot be tight in general). From the above, we can now compare the behavior of κ_p^{-1}(1 − 2ε), for ε ∈ (0, 1/2), to the "2/3-norm functional" introduced by Valiant and Valiant [55].

The Harmonic Distribution. We now consider the case of the (truncated) Harmonic distribution, letting p ∈ Δ([n]) be defined as p_k = 1/(k·H_n) for all k ∈ [n] (H_n being the nth Harmonic number). By considering a partition of [n] into t² − 1 sets of size 1 and one of size n − t² + 1, the lower bound of Proposition 6.4 yields a corresponding lower bound, while Holmstedt's inequality provides the matching upper estimate. For t = O(1), this implies that κ_p(t) = o(1); however, for t = ω(1) (but still at most √n), an asymptotic development of both upper and lower bounds shows that κ_p(t) = Θ(log t/log n). Using this expression, we can again compare the behavior of κ_p^{-1}(1 − 2ε), for ε ∈ (0, 1/2), to the 2/3-norm functional of Reference [55].

6:24 E. Blais et al.
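The closed-form value κ_p(t) = t/√n for the uniform distribution can be verified numerically: for a non-negative sequence, a short convexity argument shows the infimum defining κ_a(t) is attained at a componentwise threshold a″ = min(a, τ), so scanning thresholds recovers it. An illustrative sketch (parameters are ours):

```python
import math

def k_functional(a, t, grid=10**4):
    """Estimate kappa_a(t) = inf ||a'||_1 + t*||a''||_2 over a = a' + a''.
    For non-negative a the infimum is attained at a'' = min(a, tau) componentwise,
    so a grid search over the threshold tau recovers it to high accuracy."""
    hi = max(a)
    best = sum(a)  # tau = 0 gives a'' = 0 and the value ||a||_1
    for k in range(1, grid + 1):
        tau = hi * k / grid
        app = [min(x, tau) for x in a]  # candidate a''
        val = sum(x - y for x, y in zip(a, app)) + t * math.sqrt(sum(y * y for y in app))
        best = min(best, val)
    return best

n, t = 400, 5.0
uniform = [1 / n] * n
# For the uniform distribution, kappa_p(t) = t / sqrt(n) whenever t <= sqrt(n).
assert abs(k_functional(uniform, t) - t / math.sqrt(n)) < 1e-6
```

Every candidate decomposition yields an upper bound on κ_a(t), so the grid search never undershoots; for the uniform distribution, the optimum (τ = 1/n, i.e., a″ = p) lies exactly on the grid.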

IDENTITY TESTING, REVISITED
For any x ∈ (0, 1/2) and sequence a ∈ ℓ1, we let t_x := κ_a^{-1}(1 − 2x) = inf{t > 0 : κ_a(t) ≥ 1 − 2x}, where κ_a is the K-functional of a as previously defined. Armed with the results and characterizations from the previous section, we will first, in Section 7.1, describe an elegant reduction from communication complexity leading to a lower bound on instance-optimal identity testing parameterized by the quantity t_ε. Guided by this lower bound, we will then, in Section 7.2, consider this result from the upper bound viewpoint, and in Theorem 7.6 we will establish that this parameter does indeed capture the sample complexity of the problem. Finally, Section 7.3 is concerned with tightening our lower bound using different arguments: specifically, showing that the bound that appeared naturally as a consequence of our communication complexity approach can, in hindsight, be established and slightly strengthened with standard distribution testing arguments.
We will follow the argument outlined in Section 2.2: namely, applying the same overall idea as in the reduction for uniformity testing, but with an error-correcting code specifically designed for the distribution p instead of a standard Hamming one. To prove Theorem 7.1, we thus first need to define and obtain codes with properties that are tailored for our reduction, which we do next.

Balanced p-weighted Codes.
Recall that in our reductions so far, the first step is for Alice and Bob to apply a code to their inputs; typically, we chose that code to be a balanced code with constant rate, and linear distance with respect to the uniform distribution (i.e., with good Hamming distance). To obtain better bounds on a case-by-case basis, it will be useful to consider a generalization of these codes, under a different distribution:

Definition 7.2 (p-distance). For any n ∈ ℕ, given a probability distribution p ∈ Δ([n]), we define the p-distance on {0, 1}^n, denoted dist_p, as the weighted Hamming distance dist_p(z, w) := Σ_{i∈[n]: z_i≠w_i} p_i. A p-weighted code is a code whose distance guarantee is with respect to the p-distance.
Recall that the "vanilla" reduction in Section 5 relies on balanced codes. We generalize the balance property to the p-distance and allow the following relaxation.
Hence, we have k ≤ n − log Vol_{(F₂ⁿ, dist_p)}(γ/2).

Proof of Proposition 7.5. The probability that a randomly chosen code C fails to have relative p-distance γ can be bounded by a union bound over pairs of codewords, in terms of the volume of p-distance balls. Hence, for sufficiently small k = Ω(n − log Vol_{(F₂ⁿ, dist_p)}(ε)), the probability that a random code is a p-weighted code with relative distance γ is at least 2/3; fix such k. Similarly, the probability that a random codeword deviates significantly from balance (under the p-distance) is small; thus, the probability that a random code is τ-balanced is at least 2/3, and so, with probability at least 1/3, a random code satisfies the proposition's hypothesis. □

We now establish a connection between the rate of p-weighted codes and the K-functional of p, as introduced in Section 6. For correctness, note that if x = y, then A = B̄, which implies q = p. However, if x ≠ y, then by the (p-weighted) distance of C, we have dist_p(C(x), C(y)) > γ, and so p(A ∩ B) + p([n] \ (A ∪ B)) > γ. Note that every i ∈ A ∩ B satisfies q_i = 2p_i, and every i ∈ [n] \ (A ∪ B) is not supported in q. Therefore, we have ‖p − q‖1 > ε. The referee can therefore invoke the identity-testing algorithm to distinguish between p and q with probability 1 − (1/6 + 1/6) = 2/3. This implies that the number of samples s used by any such tester must satisfy s·log n = Ω(√k). Finally, by Claim 5, we have k = Ω(t_ε²), and therefore we obtain a lower bound of s = Ω(t_ε/log n).

The Upper Bound
Inspired by the results of the previous section, it is natural to wonder whether the dependence on t ε of the lower bound is the "right" one. Our next theorem shows that this is the case: The parameter t ε does, in fact, capture the sample complexity of the problem.
Theorem 7.6. There exists an absolute constant c > 0 such that the following holds. Given any fixed distribution p ∈ Δ([n]) and parameter ε ∈ (0, 1], and granted sample access to an unknown distribution q ∈ Δ([n]), one can test p = q vs. ‖p − q‖1 > ε with O(max(t_{cε}/ε², 1/ε)) samples from q. (Moreover, one can take c = 1/18.)

7.2.1 High-level Idea. As discussed in Section 2.4, the starting point of the proof is the connection between the K-functional and the "Q-norm" obtained in Lemma 6.6: indeed, this result ensures that for T = 2t²_{O(ε)}, there exists a partition of the domain into sets A_1, ..., A_T such that Σ_{j=1}^T ‖p_{A_j}‖2 ≥ 1 − O(ε), where p_{A_j} is the restriction of the sequence p to the indices in A_j. But by the monotonicity of ℓp norms, we know that Σ_{j=1}^T ‖p_{A_j}‖2 ≤ Σ_{j=1}^T ‖p_{A_j}‖1 = Σ_{j=1}^T Σ_{i∈A_j} p_i = ‖p‖1 = 1. Therefore, what we obtain is, in fact, that Σ_{j=1}^T (‖p_{A_j}‖1 − ‖p_{A_j}‖2) ≤ O(ε). Now, if the right-hand side were exactly 0, then this would imply ‖p_{A_j}‖1 = ‖p_{A_j}‖2 for all j, and thus that p has (at most) one non-zero element in each A_j. Therefore, testing identity to p would boil down to testing identity on a distribution with support size T, which can be done with O(√T/ε²) samples. This is not actually the case, of course: the right-hand side is only small, and not exactly zero. Yet, one can show that a robust version of the above holds, making this intuition precise: in Lemma 7.7, we show that, on average, most of the probability mass of p is concentrated on a single point from each A_j. This sparsity implies that testing identity to p on this set of T points is indeed enough, leading to the theorem.

Proof of Theorem 7.6. Let p ∈ Δ([n]) be a fixed, known distribution, assumed monotone non-increasing without loss of generality: p_1 ≥ p_2 ≥ ··· ≥ p_n. Given ε ∈ (0, 1/2), we let t_ε be as above, namely, such that κ_p(t_ε) ≥ 1 − 2ε. From this, it follows by Lemma 6.6 that ‖p‖_{Q(T)} ≥ 1 − 2ε, where we set T := 2t_ε². Choose A_1, ..., A_T to be a partition of [n] achieving the maximum (since we are in the finite, discrete case) defining ‖p‖_{Q(T)}; and let p̃ be the sub-distribution on T elements defined as follows. For each j ∈ [T], choose i_j := arg max_{i∈A_j} p_i, and set p̃(j) := p(i_j).
We let s > 1 be a (non-integer) parameter to be chosen later. Suppose, first, that α ≤ (s/(s+1))·p(A), or equivalently α ≤ s·p(a*). In that case, we obtain the first bound, where we relied on an elementary inequality valid for x ∈ [0, s]. If, on the other hand, α > s·p(a*), then we obtain the second bound, using the fact that p(a*) is the maximum probability of any element, so that the total mass α has to be spread among at least s + 1 elements (recall that s will be chosen not to be an integer).
Optimizing these two bounds leads to the choice of s. Putting it together and summing over all j ∈ [T], we obtain the claimed bound. □

Remark 1. We observe that, although efficiently computing κ_p(·) (and a fortiori κ_p^{-1}(·)) or ‖p‖_{Q(·)} is not immediate, the above algorithm is efficient, and can be implemented to run in time O(n + T·log n + √T/ε²). The reason is that knowing beforehand the value of T is not necessary: given p (e.g., as an unsorted sequence of n values) and ε, it is enough to retrieve the biggest values of p until they sum to 1 − O(ε); the number of elements retrieved will, by our proof, be at most T (and this can be done in time O(n + T·log n) by using, e.g., a max-heap). It only remains to apply the above testing algorithm to the set of (at most) T elements thus obtained.
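The preprocessing step described in Remark 1 can be sketched as follows (the constant in the mass threshold is illustrative, not the one from the proof):

```python
import heapq

def effective_support(p, eps):
    """Retrieve the heaviest elements of p until they capture mass 1 - 3*eps
    (the constant 3 is illustrative). By the analysis in the proof, the number
    of retrieved elements is at most T = 2 * t_eps^2."""
    heap = [(-mass, i) for i, mass in enumerate(p)]
    heapq.heapify(heap)  # max-heap via negated masses
    total, chosen = 0.0, []
    while heap and total < 1 - 3 * eps:
        neg_mass, i = heapq.heappop(heap)
        chosen.append(i)
        total -= neg_mass
    return chosen

p = [0.4, 0.3, 0.2, 0.05, 0.03, 0.02]
support = effective_support(p, eps=0.05)
assert support == [0, 1, 2]  # these three elements already hold mass 0.9 >= 0.85
```

Heapifying takes O(n) and each retrieval O(log n), matching the O(n + T·log n) preprocessing cost claimed in the remark.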

Tightening the Lower Bound
As a last step, one may want to strengthen the lower bound obtained by the communication complexity reduction of Theorem 7.1. We here describe how this can be achieved using more standard arguments from distribution testing. However, we stress that these arguments are in some sense applicable only "after the fact," that is, after Section 7.1 revealed the connection to the K-functional, and the bound we should aim for. Specifically, we prove the following:

Theorem 7.9. For any p ∈ Δ([n]) and any ε ∈ (0, 1/2), any algorithm testing identity to p must have sample complexity Ω(t_ε/ε).

Proof. Fix p ∈ Δ([n]) and ε ∈ (0, 1/2) as above, and consider the corresponding value t_ε; we assume that t_ε ≥ 2, as otherwise there is nothing to prove.15 Without loss of generality, as we could always consider a sufficiently good approximation and take the limit in the end, we further assume the infimum defining κ_p is attained: let h, ℓ ∈ [0, 1]^n be such that p = h + ℓ and κ_p(t_ε) = ‖h‖1 + t_ε·‖ℓ‖2 = 1 − 2ε < 1 (note that the right inequality is strict because ε > 0; moreover, ‖ℓ‖2 > 0, since if ‖ℓ‖2 = 0, then ‖ℓ‖1 = 0 and h = p, but then κ_p(t_ε) = ‖p‖1 = 1). In particular, this implies ‖ℓ‖1 = 1 − ‖h‖1 ≥ 2ε > 0. With this in hand, we will apply the following theorem, due to Valiant and Valiant:

Theorem 7.10 ([55, Theorem 4.2]). Given a distribution p ∈ Δ([n]), and associated values (ε_i)_{i∈[n]} such that ε_i ∈ [0, p_i] for each i, define the distribution over distributions Q by the following process: independently for each domain element i, set uniformly at random q_i = p_i ± ε_i, and then normalize q to be a distribution. Then there exists a constant c > 0 such that it takes at least c·(Σ_{i=1}^n ε_i⁴/p_i²)^{−1/2} samples to distinguish p from Q with success probability 2/3. Further, with probability at least 1/2, the ℓ1 distance between p and a uniformly random distribution from Q is at least (1/2)·Σ_{i=1}^n ε_i.

We want to invoke the above theorem with ℓ being, roughly speaking, the "random perturbation" to p.
Indeed, since ℓ has small ℓ₂ norm, of order O(1/√t_ε) by Equation (16) (which yields a good lower bound on the number of samples needed), and has ℓ₁ norm Ω(ε) (which yields the required ℓ₁ distance), this seems to be a natural choice.
In view of this, set α := 2ε/‖ℓ‖₁ ∈ (0, 1) and, for i ∈ [n], ε_i := αℓ_i ≤ ℓ_i ≤ p_i, so that ε_i ∈ [0, p_i]. Theorem 7.9 will then be a direct consequence of the next two claims:

¹⁵Indeed, an immediate lower bound of Ω(1/ε) on this problem holds.
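For concreteness, the perturbation ensemble of Theorem 7.10, together with the choice ε_i = αℓ_i made above, can be sketched as follows. This is an illustrative sketch (names are ours); it assumes ℓ_i ≤ p_i componentwise, which holds here since p = h + ℓ with h, ℓ ≥ 0.

```python
import random

def perturbation_sizes(ell, eps):
    """The choice made in the proof above: alpha = 2*eps/||ell||_1 and
    eps_i = alpha * ell_i, so that sum(eps_i) = 2*eps."""
    alpha = 2 * eps / sum(ell)
    return [alpha * li for li in ell]

def perturbed_distribution(p, eps_i, rng=random.Random(0)):
    """Draw one distribution q from the ensemble Q of Theorem 7.10:
    independently set q_i = p_i + s_i * eps_i with a uniform random sign
    s_i, then renormalize q to sum to 1.  Requires 0 <= eps_i[i] <= p[i]."""
    q = [pi + rng.choice((-1, 1)) * ei for pi, ei in zip(p, eps_i)]
    total = sum(q)
    return [qi / total for qi in q]
```

With this choice, ∑ᵢ ε_i = 2ε, so a typical draw from Q is at distance about ε from p, as needed.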
distributions such as being binomially distributed, Poisson binomially distributed, and having a log-concave probability mass function. Throughout this section, we fix ε to be a small constant and refer to testing with respect to proximity parameter Θ(ε).
Monotonicity on the Integer Line and the Boolean Hypercube. We start with the problem of testing monotonicity on the integer line, that is, testing whether a distribution p ∈ Δ([n]) has a monotone probability mass function. Consider the "vanilla" reduction presented in Section 5. Note that for yes-instances, we obtain the uniform distribution, which is monotone. For no-instances, however, we obtain a distribution p that has mass 1/n on a (1 − ε)-fraction of the domain, is unsupported on an (ε/2)-fraction of the domain, and has mass 2/n on the remaining (ε/2)-fraction. Typically, such a p is Ω(1)-far from being monotone; however, it could be the case that the first (respectively, last) εn/2 elements have mass 0 and the last (respectively, first) εn/2 elements have mass 2/n, in which case p is perfectly monotone. To remedy this, all we have to do is let the referee emulate a distribution p′ ∈ Δ([3n]) that places a scaled copy of p in the middle third of the domain and pads both ends uniformly: p′_i = p_{i−n}/3 for n < i ≤ 2n, and p′_i = 1/(3n) otherwise. Yes-instances are then mapped to the uniform distribution over [3n], whereas for no-instances the middle third contains elements both lighter and heavier than the padded thirds, so p′ is Ω(ε)-far from monotone.

The idea above can be extended to monotonicity over the hypercube as follows. We start with the uniformity reduction, this time over the domain {0, 1}^n. As before, yes-instances are mapped to the uniform distribution over the hypercube, which is monotone, and no-instances are mapped to a distribution that has mass 1/2^n on a (1 − ε)-fraction of the domain, is unsupported on an (ε/2)-fraction of the domain, and has mass 1/2^{n−1} on the remaining (ε/2)-fraction, but could potentially be monotonically increasing (or decreasing). This time, however, the "boundary" is larger than the "edges" of the integer line, and we cannot afford to pad it with elements of weight 1/2^n. Instead, the referee, who receives from the players samples drawn from a distribution p ∈ Δ({0, 1}^n), emulates a distribution p′ ∈ Δ({0, 1}^{n+1}) over a larger hypercube whose additional coordinate decides between a regular and a negated copy of p; that is, p′(z) = p(z_1, …, z_n)/2 if z_{n+1} = 0, and p′(z) = p(1 − z_1, …, 1 − z_n)/2 if z_{n+1} = 1 (where the referee chooses z_{n+1} ∈ {0, 1} independently and uniformly at random for each new sample, negating the sample when z_{n+1} = 1). Hence, even if p is monotonically increasing (or decreasing), the emulated distribution p′ is Ω(ε)-far from monotone. By the above, we obtain Ω̃(√n) and Ω̃(2^{n/2}) lower bounds on the sample complexity of testing monotonicity on the line and on the hypercube, respectively.

k-modality. Recall that a distribution p ∈ Δ([n]) is said to be k-modal if its probability mass function has at most k "peaks" and "valleys." Such distributions are natural generalizations of monotone (for k = 0) and unimodal (for k = 1) distributions. Fix a sublinear k, and consider the uniformity reduction presented in Section 5, with the additional step of letting the referee apply a random permutation to the domain [n] (similarly to the reduction shown in Section 5.1). Note that yes-instances are still mapped to the uniform distribution (which is clearly k-modal), and no-instances are mapped to distributions with mass 1/n, 2/n, and 0 on a (1 − ε)-, (ε/2)-, and (ε/2)-fraction of the domain, respectively. Intuitively, applying a random permutation of the domain to such a distribution "spreads" the elements with masses 0 and 2/n nearly uniformly, causing many level changes (i.e., high modality); indeed, it is straightforward to verify that, with high probability over the choice of a random permutation of the domain, such a distribution is Ω(ε)-far from k-modal. This yields an Ω̃(√n) lower bound on the sample complexity of testing k-modality, nearly matching the best known lower bound of Ω(max(√n, k/log k)) following from Reference [19], for k/log k = O(√n).
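As an illustration, the referee's emulation for the integer-line case can be sketched in a few lines. We assume one concrete padding realizing this idea, namely mass p_{i−n}/3 on the middle third of [3n] and uniform mass 1/(3n) on each padding element (the function names are ours):

```python
import random

def padded_sample(sample_from_p, n, rng=random.Random(0)):
    """Emulate one sample from p' in Delta([3n]), given sample access to
    p in Delta([n]): p' puts mass p_i/3 on the middle third (shifted by n)
    and uniform mass 1/(3n) on each of the 2n padding elements.
    `sample_from_p` returns an index in {0, ..., n-1}."""
    r = rng.random()
    if r < 1 / 3:                 # middle third: a fresh sample from p, shifted
        return n + sample_from_p()
    elif r < 2 / 3:               # left pad: uniform over {0, ..., n-1}
        return rng.randrange(n)
    else:                         # right pad: uniform over {2n, ..., 3n-1}
        return 2 * n + rng.randrange(n)
```

Each region receives total mass 1/3, so one call to `sample_from_p` per emulated sample suffices, preserving the sample complexity of the reduction up to constants.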
Symmetric Sparse Support. Consider the property of distributions p ∈ Δ([n]) whose probability mass function, when projected to its support, is mirrored around the middle of the domain. That is, p is said to have a symmetric sparse support if there exists S = {i_1 < i_2 < ⋯ < i_{2ℓ}} ⊆ [n] with i_ℓ = n/2 such that: (1) p(i) = 0 for all i ∈ [n] \ S, and (2) p(i_{ℓ+1−j}) = p(i_{ℓ+j}) for all 1 ≤ j ≤ ℓ. We sketch a proof of an Ω̃(√n) lower bound on the sample complexity of testing this property. Once again, we begin with the uniformity reduction presented in Section 5, obtaining samples from a distribution p ∈ Δ([n/2]). The referee then emulates samples from the distribution p′ ∈ Δ([n]) that is distributed as p on its left half and uniformly on its right half; that is, p′_i = p_i/2 for i ∈ [n/2], and p′_i = 1/n otherwise. Note that yes-instances are mapped to the uniform distribution over [n], which has symmetric sparse support, and no-instances are mapped to distributions in which the right half is uniformly distributed while the left half contains εn/4 elements of mass 2/n; hence p′ is Ω(ε)-far from having symmetric sparse support.
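The emulation just described (p′ equal to p, halved, on the left half and uniform on the right half) translates directly into a sampling routine; a minimal sketch, with names of our choosing:

```python
import random

def right_uniform_sample(sample_from_p, n, rng=random.Random(0)):
    """Referee emulation for the symmetric-support reduction: given sample
    access to p in Delta([n/2]), emulate p' in Delta([n]) that equals p/2
    on the left half and is uniform (mass 1/n) on the right half.
    Here n is even and `sample_from_p` returns an index in {0, ..., n/2-1}."""
    if rng.random() < 0.5:
        return sample_from_p()               # left half, distributed as p
    return n // 2 + rng.randrange(n // 2)    # right half, uniform
```

Both halves carry total mass 1/2, so a fair coin decides which half to sample, again using one call to `sample_from_p` per emulated sample.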
Other Properties. As mentioned above, techniques similar to those in the reductions above (as well as in the identity testing reduction of Section 7, invoked on a specific p, e.g., the Bin(n, 1/2) distribution) can be applied to obtain nearly tight lower bounds of Ω̃(√n) (respectively, Ω̃(n^{1/4})) for the properties of being log-concave and having a monotone hazard rate (respectively, being binomially and Poisson binomially distributed). See, e.g., Reference [20] for the formal definitions of these properties.

TESTING WITH CONDITIONAL SAMPLES
In this section, we show that reductions from communication complexity protocols can be used to obtain lower bounds on the sample complexity of distribution testers that are augmented with conditional samples. These testing algorithms, first introduced in References [21,22], aim to address scenarios that arise both in theory and practice yet are not fully captured by the standard distribution testing model.
In more detail, algorithms for testing with conditional samples are distribution testers that, in addition to sample access to a distribution p ∈ Δ(Ω), can ask for samples from p conditioned on the sample belonging to a subset S ⊆ Ω. It turns out that testers with conditional samples are much stronger than standard distribution testers, leading in many cases to exponential savings (or even more) in the sample complexity. In fact, these testing algorithms can often maintain their power even if they only have the ability to query subsets of a particular structure.
One of the most commonly studied restricted conditional-sampling models is the PAIRCOND model [21]. In this model, the tester can either obtain standard samples from p or specify two distinct indices i, j ∈ Ω and get a sample from p conditioned on membership in S = {i, j}. As shown in References [18, 21], even under this restriction one can obtain constant- or polylog(n)-query testers for many properties, such as uniformity, identity, closeness, and monotonicity (all of which require Ω(√n) or more samples in the standard sampling setting). This, along with the inherent difficulty of proving hardness results against adaptive algorithms, makes proving lower bounds in this setting a challenging task; indeed, the PAIRCOND lower bounds established in the aforementioned works are quite intricate.
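To make the access model concrete, here is a toy implementation of a PAIRCOND oracle. It is illustrative only (a tester does not know p; this class merely models the oracle it queries): conditioning on {i, j} returns i with probability p_i/(p_i + p_j).

```python
import random

class PairCondOracle:
    """Toy model of PAIRCOND access: standard samples from a distribution
    given as a list p of probabilities, plus samples conditioned on a
    pair {i, j}."""

    def __init__(self, p, seed=0):
        self.p = p
        self.rng = random.Random(seed)

    def sample(self):
        # Standard sampling oracle for p.
        return self.rng.choices(range(len(self.p)), weights=self.p, k=1)[0]

    def pair_cond(self, i, j):
        # Sample from p conditioned on the outcome lying in {i, j}
        # (assumes p_i + p_j > 0): returns i with probability p_i/(p_i + p_j).
        pi, pj = self.p[i], self.p[j]
        return i if self.rng.random() < pi / (pi + pj) else j
```

For instance, comparing a heavy element against a candidate zero-mass element with `pair_cond` deterministically reveals which of the two is supported, which is the kind of leverage that makes PAIRCOND testers so much stronger than standard samplers.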
We will prove, via a reduction from communication complexity, a strong lower bound on the sample complexity of any PAIRCOND algorithm for testing junta distributions, a class of distributions introduced in Reference [5] (see definition below).
Since PAIRCOND algorithms are stronger than standard distribution testers (in particular, they can make adaptive queries), we shall reduce from the general randomized communication complexity model (rather than from the SMP model, as we did for standard distribution testers). In this model, Alice and Bob are given inputs x and y as well as a common random string, and the parties aim to compute a function f (x, y) using the minimum amount of communication.
We say that a distribution p ∈ Δ({0, 1} n ) is a k-junta distribution (with respect to the uniform distribution) if its probability mass function is only influenced by k of its variables. We outline below a proof of the following lower bound.