
Settling the Complexity of Testing Grainedness of Distributions, and Application to Uniformity Testing in the Huge Object Model

Clément L. Canonne, School of Computer Science, University of Sydney, Australia · Sayantan Sen, Centre for Quantum Technologies, National University of Singapore, Singapore · Joy Qiping Yang, School of Computer Science, University of Sydney, Australia
Abstract

In this work, we study the problem of testing m-grainedness of probability distributions over an n-element universe 𝒰, or, equivalently, of whether a probability distribution is induced by a multiset S ⊆ 𝒰 of size |S| = m. Recently, Goldreich and Ron (Computational Complexity, 2023) proved that Ω(n^c) samples are necessary for testing this property, for any c < 1 and m = Θ(n). They also conjectured that Ω(m/log m) samples are necessary for testing this property when m = Θ(n). In this work, we positively settle this conjecture.

Using a known connection to the Distributions over Huge Objects (DoHO) model introduced by Goldreich and Ron (TheoretiCS, 2023), we leverage our results to provide improved bounds for uniformity testing in the DoHO model.

Keywords and phrases:
Distribution testing, Uniformity testing, Huge Object Model, Lower bounds
Funding:
Clément L. Canonne: Supported by an ARC DECRA (DE230101329).
Sayantan Sen: Supported by the National Research Foundation, Singapore and ASTAR under its Quantum Engineering Programme NRF2021-QEP2-02-P05.
Joy Qiping Yang: Supported by a JD Technology Research Scholarship in Artificial intelligence.
Copyright and License:
© Clément L. Canonne, Sayantan Sen, and Joy Qiping Yang; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Streaming, sublinear and near linear time algorithms
Related Version:
Full Version: https://eccc.weizmann.ac.il/report/2024/196 [12]
Acknowledgements:
We would like to thank the anonymous reviewers of ITCS 2025 for their suggestions which improved the presentation of the paper. SS would like to thank Clément Canonne and the Theory CS group at the University of Sydney for the warm hospitality during his academic visit, where this work was initiated.
Editors:
Raghu Meka

1 Introduction

The field of distribution testing [7, 6] is concerned with providing statistically accurate information about large datasets or their underlying probability distributions, given very scarce data (sample size). Drawing insights from property testing [20, 27], distribution testing lies at the intersection of theoretical computer science, statistics, and learning theory; and has received significant attention over the past two decades, with many algorithms, insights, and new theoretical access models to the data being proposed and analyzed. We refer the reader to recent surveys [8, 4, 9] and textbook [19, Chapter 11] for more on distribution testing and property testing.

In the most standard access model, the algorithm accesses the underlying unknown probability distribution D (usually assumed to be over a known discrete domain of size n) by obtaining independent, identically distributed (i.i.d.) samples from it. The goal of the algorithm, given a fixed property Π ⊆ Δ_n (a subset of the n-dimensional probability simplex Δ_n) and a distance parameter ε ∈ (0,1], is then as follows:

  • If D ∈ Π, the algorithm must output accept with probability at least 2/3;

  • If 𝖽𝖳𝖵(D,Q) > ε for all Q ∈ Π, the algorithm must output reject with probability at least 2/3;

where 𝖽𝖳𝖵 denotes the total variation distance between distributions, and the probability is taken over the choice of the samples and the internal randomness of the algorithm. This is defined as the ε-testing of the property Π. The minimum number of samples s=s(n,ε) necessary to achieve this task in the worst case (over all possible distributions D) is the sample complexity of testing Π. That is, the testing task requires the algorithm, given as few samples as possible, to distinguish with high probability between distributions which have the property, and those which are “far” from having it.

Many other access models have been introduced, providing additional types of queries to the algorithm, or changing the distance metric, or both (see [8] for an overview): among them is the Distributions over Huge Objects (DoHO) model, recently introduced by Goldreich and Ron [23] to capture settings where full access to the data sampled is itself costly or impractical, due to the size of these objects. In the simplest version of this model, the distribution D is defined over the n-dimensional hypercube {0,1}^n, where n is assumed to be very large; given a sample x ∼ D, the algorithm must then choose which bits of x to observe, paying a unit cost for every such query made. The distance metric to quantify “farness” between distributions also differs from that of the standard formulation, and instead is taken to be the Earthmover distance (EMD) between probability distributions, with underlying metric chosen to be the (relative) Hamming distance between n-bit strings. (For the formal definition of this model, and the relation to the standard sampling model, see Section 1.3.)

Many different properties of distributions have been studied, some of them quite extensively: among those, uniformity (Π = {U_Ω}, consisting of the single uniform distribution over the whole domain Ω), generalized uniformity [5] (Π_U = {U_S : S ⊆ Ω}, consisting of all distributions uniform over some subset of the domain), parameterized uniformity (Π_m = {U_S : S ⊆ Ω, |S| = m}, consisting of all distributions uniform over some size-m subset of the domain) [23], and m-grainedness (denoted Π_m, which we will define shortly) [18, 22] are the most relevant to this work.

In this paper, we will focus on two distribution testing tasks, intimately related:

  • Testing m-grainedness of distributions in the standard sampling model (property Πm); and

  • Testing parameterized uniformity in the DoHO model.

Grainedness of distributions

A probability distribution D ∈ Δ_n over a discrete set Ω of n elements is said to be m-grained, for a given parameter m, if all its probabilities are integer multiples of 1/m. That is,

m·D(x) ∈ ℤ_{≥0},  for all x ∈ Ω.

Such distributions naturally arise due to quantization (e.g., binning of continuous or discrete distributions), or when sampling from datasets: that is, if S ⊆ Ω is a multiset of size m, uniformly sampling from S with replacement yields an m-grained distribution D_S over Ω.
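As a quick illustration of this definition (our own sketch, not part of the paper), the following snippet builds the distribution D_S induced by a small multiset S and checks that every probability is an integer multiple of 1/m:

```python
# A minimal illustration (ours): the empirical distribution of a size-m multiset S
# over a universe of n elements is m-grained, i.e., every probability is k/m.
from collections import Counter
from fractions import Fraction

def distribution_from_multiset(S, n):
    """Return D_S over {0, ..., n-1}, induced by uniform sampling (with replacement) from S."""
    m, counts = len(S), Counter(S)
    return [Fraction(counts.get(x, 0), m) for x in range(n)]

def is_m_grained(D, m):
    """Check that every probability is an integer multiple of 1/m."""
    return all((p * m).denominator == 1 for p in D)

if __name__ == "__main__":
    n, S = 6, [0, 0, 1, 3, 3, 3, 5, 5]          # multiset of size m = 8
    D = distribution_from_multiset(S, n)
    print(D, is_m_grained(D, len(S)))            # probabilities k/8; True
```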

Throughout this work, we will assume ε ∈ (0,1) is a constant; we may thus restrict our attention to m < n/ε = O(n), since the problem is trivial otherwise (see below). Previous works fixed m = Θ(n), but these results can also be extended to m = O(n), and therefore our results are stated in terms of m instead of n.

Recent work of Goldreich and Ron [22] showed that Ω(m^c) samples are necessary for testing this property, for any fixed c < 1. A sample complexity upper bound of O(m/log m) also follows from previous work of Valiant and Valiant [29], which led [22] to conjecture a lower bound of Ω(m/log m) samples. Our main contribution is to resolve this conjecture, showing that, indeed, Θ(m/log m) samples are necessary and sufficient for m-grainedness testing:

Theorem 1.1.

Let n be a sufficiently large integer, m = O(n), and let ε ∈ (0,1) be a sufficiently small constant. Then, ε-testing m-grainedness of distributions over [n] requires Ω(m/log m) samples.

We note that the restriction m = O(n) is necessary, as if m = n/α for some α < 1, every distribution is α-close to being m-grained. Recall that we assume ε is a small constant; in particular, the problem of testing m-grainedness becomes trivial when m ≥ n/ε, as in that case every distribution is ε-close to being m-grained. Along the way, we also present as a warmup a new proof of the lower bound of Ω(m^c) samples for m-grainedness testing:

Theorem 1.2.

Let n be a sufficiently large integer, m = O(n), and let ε ∈ (0,1) be a sufficiently small constant. Then, for any fixed constant c ∈ (0,1), ε-testing m-grainedness of distributions over [n] requires Ω(m^c) samples.

While this is a strictly weaker result than our main lower bound, our proof strategy differs significantly from that of [22], and may be of independent interest.

Parameterized uniformity testing in the DoHO model

In a separate work [23], Goldreich and Ron studied the problem of parameterized uniformity testing in the Huge Object model. Among other results, they established a connection between m-grainedness testing in the standard model and testing Πm in the DoHO model:

Theorem 1.3 ([23, Theorem 2.13]).

Assume that, for some constant ε ∈ (0,1), ε-testing m-grainedness in the standard model has sample complexity Ω(m/log m). Then, for every 1 ≤ m ≤ n and constant ε′ ∈ (0, ε/2), ε′-testing Π_m for distributions over {0,1}^n in the DoHO model has query complexity Ω(m/log m).

Our lower bound for m-grainedness immediately implies that this result holds unconditionally. In the same paper, the authors provided an algorithm for testing Π_m using Õ(m) samples and queries (for constant ε = Ω(1)), which together with the above – now unconditional – lower bound settles the complexity of testing Π_m in the DoHO model, up to logarithmic factors, for m ≤ n. One may be tempted to assume the same query complexity lower bound holds in all parameter regimes: perhaps surprisingly, our next result is a new and simple algorithm for testing Π_m in the DoHO model which takes O(m^{2/3}) samples and performs O(m^{2/3}·n) queries:

Theorem 1.4.

There is an algorithm which, given an integer m and constant ε ∈ (0,1), ε-tests the property Π_m of distributions over {0,1}^n in the DoHO model, taking 𝒪(m^{2/3}) samples and performing 𝒪(m^{2/3}·n) queries. Moreover, any ε-tester for Π_m in the DoHO model must take Ω(m^{2/3}) samples.

Notably, this improves the upper bound of Goldreich and Ron whenever n^3 ≤ m ≤ 2^n, and rules out any Ω̃(m) query complexity lower bound for the whole range of m.

1.1 Overview of our techniques

Lower bound for 𝒎-grainedness testing

Our lower bound approach follows Le Cam’s two-point method: we will design two distributions over distributions, D𝗒𝖾𝗌 and D𝗇𝗈, such that the distributions in D𝗒𝖾𝗌 are m-grained and the distributions in D𝗇𝗈 are “far” from being m-grained in variation distance. The name of the game is then to prove that it is information-theoretically impossible to distinguish between D𝗒𝖾𝗌 and D𝗇𝗈.

To achieve this, we will rely on the moment-matching technique, particularly suited to properties like m-grainedness (which are symmetric, i.e., invariant to relabelling of the domain elements). Broadly, if two probability distributions D𝗒𝖾𝗌 and D𝗇𝗈 have sufficiently similar moments, ‖D𝗒𝖾𝗌‖_t^t ≈ ‖D𝗇𝗈‖_t^t for, say, 1 ≤ t ≤ K, then it is hard to distinguish (a permutation of) D𝗒𝖾𝗌 from (a permutation of) D𝗇𝗈 using o(n^{1−1/K}/K) samples. That is, the more moments one can match, the more samples one needs to tell the two distributions apart. Note that setting K = Θ(log n) would then lead to an Ω(n/log n) lower bound (and that, as remarked later, in our case we can assume without loss of generality, for the sake of the lower bound argument, that m = Θ(n)). This approach has been used in the literature extensively [26, 29, 28, 30, 31, 32, 10], and has proven to be very successful in obtaining tight lower bounds for a range of symmetric properties. We will rely on the formulation of the technique described by [32], which maps the problem of matching moments of two full probability distributions (an n-dimensional object) to that of matching moments of two one-dimensional random variables U and V: these two random variables will then be used to generate the probability distributions; intuitively (and as described below), “n independent draws from U will give D𝗒𝖾𝗌(1), …, D𝗒𝖾𝗌(n)” (and similarly for D𝗇𝗈 and V).

Thus, a crucial ingredient in our lower bound is the construction of a pair of discrete random variables U and V whose first logarithmically many moments are identical. Using n independent copies of U (resp. V), we will then define a random measure over [n], which corresponds to a “yes”-instance D𝗒𝖾𝗌 (resp., a “no”-instance D𝗇𝗈). Thus, another requirement on our construction of U and V is that the first one should yield an m-grained distribution, while the second should correspond to a distribution far from being m-grained. We will formalize this in Lemma 2.2.

To build these two random variables U and V, we will, as previously done in the literature, rely on the properties of the Chebyshev polynomial Td of degree d=O(logn), defining U and V to be supported on disjoint subsets of the roots of a suitable polynomial P(x), where the probability mass assigned to a given root r is proportional to 1/|P(r)|. The idea is that some of the roots of P, those corresponding to U, will be multiples of 1/m (leading to m-grained probabilities) while the others, corresponding to V, will be odd multiples of 1/(2m) (leading to “far-from-m-grained” probabilities): for instance, we would take

P(x) = x·(x − 1/(2m))·(x − 1/m)·T_d(cx)

for some scaling constant c > 0. The remaining roots will be those of the Chebyshev polynomial T_d, which are there to ensure that we can match sufficiently many moments of U and V.

However, this approach, which underlies most of the previous work, cannot be directly used here: indeed, while the resulting two random variables U, V could be made to have many matching moments, and as such be hard to distinguish, doing so would put a lot of probability mass on the roots of the Chebyshev polynomial T_d for both U and V; and these roots are not multiples of 1/m. This would have the undesirable effect of making both our distributions far from being m-grained. Trying to fix this the obvious way requires ensuring that very little mass is put by U (and so V) on the roots of T_d, which in turn limits the number of moments that can be matched. (Note that fixing this by choosing not to use the Chebyshev polynomial T_d at all, but instead choosing P(x) of the form P(x) = x·(x − 1/(2m))·(x − 1/m)⋯(x − L/m) for some large constant integer L, does get part of the way there and provides a non-trivial result, leading to our (weaker) Ω(m^c) lower bound for every c < 1.)

To remedy this, we take a different route, using an idea we believe to be of independent interest and applicable to other lower bounds: instead of using Chebyshev polynomials directly, we will be using a modified version of Chebyshev polynomials, T̃_d, defined from T_d by first rounding its roots to multiples of 1/m. By carefully analyzing the resulting polynomial, we can show that it behaves, for our lower bound construction purposes, similarly to the Chebyshev polynomial, and so we can use it to define both U and V. This allows us to match more moments, leading to our final Ω(m/log m) lower bound.
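The rounding step itself is elementary. The following sketch (ours, with the paper's later choice Δ = m/(20d²) plugged in only for concreteness) computes the roots of the shifted Chebyshev polynomial T_d(1 − Δx) and rounds each up to the nearest multiple of 1/m, verifying that every root moves by at most 1/m:

```python
# Minimal sketch (ours) of the "modified Chebyshev" idea: round the roots of
# T_d(1 - Δx) up to the nearest multiple of 1/m, so they become m-grained values.
import math

def shifted_chebyshev_roots(d, Delta):
    # Roots of T_d(1 - Δx): x_j = (1 - cos((2j-1)π/(2d))) / Δ, for j = 1, ..., d.
    return [(1 - math.cos((2 * j - 1) * math.pi / (2 * d))) / Delta for j in range(1, d + 1)]

def round_up_to_grid(x, m):
    return math.ceil(m * x) / m

m, d = 7000, 20
Delta = m / (20 * d ** 2)                     # illustrative choice, as in Section 3
roots = shifted_chebyshev_roots(d, Delta)
rounded = [round_up_to_grid(t, m) for t in roots]
# Each rounded root is a multiple of 1/m and moves by at most 1/m.
assert all(t <= tt <= t + 1 / m + 1e-12 for t, tt in zip(roots, rounded))
```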

Upper bound for parameterized uniformity testing

In contrast, the idea underlying our upper bound for testing Πm in the DoHO model is relatively straightforward: namely, by making n queries per sample, we can simulate any testing algorithm in the standard sampling model, as we now have observed the full samples. This, along with the relation between TV distance and EMD distance, allows us to lift any s-sample tester for Πm in the standard sampling model to an s-sample, sn-query tester for Πm in the DoHO model.

The only issue with this plan is that there is no known (nontrivial) tester for Π_m in the standard sampling model. There is, however, a known testing algorithm for the related property of generalized uniformity, Π_U. Our key contribution here is then to use this known tester A to derive a tester B for Π_m in the standard sampling model. Notably, this is not as immediate as it seems, and our algorithm needs to use A in a whitebox way, and combine it with an additional test to estimate the ℓ_2 norm of the unknown distribution D. The subtlety here (and the reason for which we cannot use any A in a blackbox fashion, but need to use a specific algorithm due to [5]) is that being close to some uniform distribution over some subset does not immediately allow us to conclude about being close to some uniform distribution over some size-m subset – even if we are guaranteed that the (squared) ℓ_2 norm of D is close to 1/m.
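Schematically, the lifting argument of the first paragraph above looks as follows (our own sketch; sample_oracle and query_oracle are hypothetical stand-ins for the DoHO oracles of Definition 1.8):

```python
# Schematic sketch (ours) of the reduction: an s-sample tester in the standard
# sampling model becomes an s-sample, s*n-query tester in the DoHO model by
# querying all n coordinates of every sample it draws.
from typing import Callable, List

def lift_to_doho(standard_tester: Callable[[List[str]], bool], s: int, n: int):
    """`standard_tester` receives s fully-observed n-bit samples and accepts/rejects."""
    def doho_tester(sample_oracle, query_oracle):
        samples = []
        for j in range(s):
            sample_oracle()                                   # draw the j-th (hidden) sample
            bits = [query_oracle(j, i) for i in range(n)]     # n queries: read it in full
            samples.append("".join(map(str, bits)))
        return standard_tester(samples)
    return doho_tester
```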

1.2 Related work

The field of distribution testing has its roots in theoretical computer science with the work of Goldreich and Ron [21], who designed a uniformity testing algorithm as a tool to check whether a graph is an expander; it was then formally defined and introduced in [7]. [6] studied the problem of identity testing, which generalizes the problem of uniformity testing. Over the last two decades, this field has seen significant growth, and a host of new tools and techniques have emerged, see [25, 1, 18, 16, 14, 15, 11] to name a few. See the surveys [8, 4, 9] and the book of [19, Chapter 11] for more details.

The study of grained distributions was done implicitly in the work of [26], and later [18] studied it in more detail.

As mentioned earlier, the Huge Object model was introduced by [23]. Since then, there have been several works focusing on this model. [13] defined the notion of index-invariant properties, a set of properties that are invariant under permutations of the indices (these properties are different from label-invariant properties). They showed that index-invariant properties for which the VC dimension of the support set is bounded can be learned using a constant number of queries, depending only on the VC dimension. They also gave a tight separation between adaptive and non-adaptive testers for index-invariant and non-index-invariant properties. Later, [2] studied various different notions of adaptivity, and showed several separation results. Very recently, [3] studied the problem of support size testing in this model.

1.3 Preliminaries

For a positive integer n, let [n] denote the set {1,,n}. We will use the standard asymptotic notation 𝒪(),Ω(),Θ(), and, in some cases, the (somewhat less) standard notation O~(), which omits poly-logarithmic dependencies in the parameters for readability.

We will use several notions of distance between two distributions.

Definition 1.5.

Let D_1 and D_2 be two probability distributions over a domain Ω. The ℓ_1 distance between D_1 and D_2 is defined as

‖D_1 − D_2‖_1 = ∑_{x∈Ω} |D_1(x) − D_2(x)|.

The total variation distance between D1 and D2 is defined as:

𝖽𝖳𝖵(D_1, D_2) = ½ ‖D_1 − D_2‖_1 = sup_{S⊆Ω} (D_1(S) − D_2(S)).
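For concreteness, here is a small helper (ours) computing the total variation distance of Definition 1.5 as half the ℓ_1 distance; the example also previews a fact used in Section 4, namely that two uniform distributions whose supports overlap as much as possible are at TV distance 1 − min(|S|,|T|)/max(|S|,|T|):

```python
# Small helper (ours) matching Definition 1.5: total variation as half the ℓ1 distance.
def total_variation(D1, D2):
    """D1, D2: dicts mapping domain elements to probabilities."""
    support = set(D1) | set(D2)
    return 0.5 * sum(abs(D1.get(x, 0.0) - D2.get(x, 0.0)) for x in support)

# Uniform over {a,b} vs. uniform over {a,b,c,d}: TV = 1 - 2/4 = 0.5.
print(total_variation({"a": 0.5, "b": 0.5},
                      {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}))
```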
Definition 1.6 (EMD with respect to Hamming distance).

Let D1,D2 be two distributions defined over the n-dimensional Hamming cube {0,1}n and dH be the (relative) Hamming distance over {0,1}n. Then the Earth Mover distance (EMD) between D1 and D2 is defined as follows:

𝖽𝖤𝖬(D_1, D_2) := inf_{T ∈ 𝒯(D_1,D_2)} 𝔼_{(x,y)∼T}[d_H(x,y)],

where 𝒯(D_1, D_2) denotes the set of all couplings between D_1 and D_2.

Since we are working with the (relative) Hamming distance, d_H(x,y) ≤ 1. Then, directly from the definitions of 𝖽𝖤𝖬(D_1,D_2) and d_H(x,y), we get the following simple yet useful observation connecting the total variation and EMD distances between two distributions.

Observation 1.7 (Relation between EMD and TV distances).

Let D1 and D2 be two distributions over the n-dimensional Hamming cube {0,1}n. Then,

𝖽𝖤𝖬(D_1, D_2) ≤ 𝖽𝖳𝖵(D_1, D_2).
Definition 1.8 (Huge Object Model).

Consider a discrete distribution D supported over the n-dimensional Hamming cube {0,1}^n. D is accessed by obtaining i.i.d. samples, where each sample obtained is an n-bit Boolean string. In addition to the sampling oracle, there is also a query oracle associated with D, where the tester can query any index i ∈ [n] of any sample obtained (say the j-th sample s_j), and the query oracle will return the i-th bit of s_j. The goal is then to minimize both the total sample and query complexities required to decide a property.

Note that this Huge Object model is particularly suited for studying high-dimensional distributions, where the dimension of the underlying domain is so large that even reading a single sample in its entirety might not be feasible.
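For readers who prefer an interface, here is a toy rendition (ours) of the two oracles from Definition 1.8; the distribution is given explicitly only so that the example is self-contained, and the bookkeeping makes the sample/query accounting explicit:

```python
# A toy interface (ours) for the Huge Object model of Definition 1.8: samples are
# n-bit strings drawn i.i.d. from D, but the tester only sees the bits it queries.
import random

class HugeObjectOracle:
    def __init__(self, support, probs, seed=0):
        self.support, self.probs = support, probs   # D over {0,1}^n, given explicitly here
        self.rng = random.Random(seed)
        self.samples = []                            # hidden samples drawn so far
        self.queries = 0

    def draw_sample(self):
        """Sampling oracle: draws a fresh hidden sample; returns its index."""
        self.samples.append(self.rng.choices(self.support, weights=self.probs)[0])
        return len(self.samples) - 1

    def query(self, j, i):
        """Query oracle: returns the i-th bit (as a character) of the j-th sample, at unit cost."""
        self.queries += 1
        return self.samples[j][i]
```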

Finally, we state some technical results which will be used in our lower bound proofs:

Lemma 1.9 ([31, Lemma 4]).

Let W_1, W_2 be discrete random variables taking values in [0, Λ]. If 𝔼[W_1^t] = 𝔼[W_2^t] for 1 ≤ t ≤ T, then

𝖽𝖳𝖵(𝔼_{W_1}[𝖯𝗈𝗂(W_1)], 𝔼_{W_2}[𝖯𝗈𝗂(W_2)]) ≤ (eΛ/(2T))^T.
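As a sanity check (ours, not from the paper), one can verify the lemma numerically on a toy pair built from the roots of x(x−1)(x−2)(x−3) with weights proportional to 1/|P′| (split by the sign of P′, as in Section 2), scaled down so that the bound is non-vacuous:

```python
# Numeric sanity check (ours) of Lemma 1.9: U and V below match their first T = 2
# moments, and the observed TV between the Poisson mixtures stays below (eΛ/2T)^T.
import math

scale = 0.1
U = {1 * scale: 3 / 4, 3 * scale: 1 / 4}   # roots of x(x-1)(x-2)(x-3) with P' > 0, rescaled
V = {0 * scale: 1 / 4, 2 * scale: 3 / 4}   # roots with P' < 0, rescaled
Lam, T = 3 * scale, 2

def poisson_mixture(W, kmax=60):
    """pmf of Poi(w), w ~ W, truncated to {0, ..., kmax}."""
    return [sum(p * math.exp(-w) * w**k / math.factorial(k) for w, p in W.items())
            for k in range(kmax + 1)]

tv = 0.5 * sum(abs(a - b) for a, b in zip(poisson_mixture(U), poisson_mixture(V)))
print(tv, (math.e * Lam / (2 * T)) ** T)   # observed TV vs. the bound of Lemma 1.9
```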
Fact 1.10 ([10, Fact 7], [24]).

Let p be a degree-d polynomial with distinct roots r_1, …, r_d. Then, for every 0 ≤ k ≤ d−2, we have that ∑_{i=1}^{d} r_i^k / p′(r_i) = 0.
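This identity is easy to check numerically; a short verification of ours, with arbitrary distinct roots:

```python
# Quick numeric illustration (ours) of Fact 1.10 with d = 5 arbitrary distinct roots.
import numpy as np

roots = np.array([-2.0, -0.5, 0.3, 1.1, 4.0])
pprime = np.array([np.prod([r - q for q in roots if q != r]) for r in roots])
for k in range(len(roots) - 1):           # k = 0, ..., d-2
    print(k, np.sum(roots**k / pprime))   # ≈ 0, up to floating-point error
```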

We will be using the following two well-known trigonometric inequalities.

Fact 1.11.
  (i) For x ∈ [0, π], the following holds:

    sin x ≥ (2/π)·min(x, π − x).   (1)

  (ii) For x ∈ [−π/2, π/2], the following holds:

    (4/π)·|x| ≥ |sin x| ≥ (2/π)·|x|.   (2)

Organization

The rest of the paper is organized as follows. In Section 2, we present a new proof of the lower bound of Ω(m^c) samples. Then, in Section 3, we present the Ω(m/log m) lower bound for testing m-grainedness. In Section 4, we present our result on uniformity testing in the Huge Object model. We conclude in Section 5 with some open questions. Due to space constraints, the formal proofs of some claims are deferred to the full version of the paper [12].

2 An Ω(m^c) lower bound for testing m-grainedness

In this section, we prove Theorem 1.2, thereby presenting a new proof of the lower bound of [22].

See Theorem 1.2. Note that it suffices to consider the case m = Θ(n): if m = o(n), one can embed the hard instances in a larger domain, lifting the lower bound. As a result, we hereafter assume without loss of generality that m = Θ(n). We start by stating a relatively standard theorem which will be crucial in the proof of our sample complexity lower bound, and whose proof is deferred to the full version [12].

Theorem 2.1.

Fix positive integers m, n, s, where n > 4.3×10^8 and m ≤ C_0(n−1) for some absolute constant C_0. Suppose there exist random variables U and V, where U is supported on {0, 1/m, 2/m, …, m/m}; Pr[V ∈ {1/(2m), 3/(2m), …, (2m−1)/(2m)}] ≥ 2/3; and they satisfy the following three conditions:

max(U, V) ≤ 20 log²(n−1)/(n−1),   (3)
𝔼[U] = 𝔼[V] ≤ 1/(2(n−1)),   (4)
𝖽𝖳𝖵(𝔼[Poi(sU)], 𝔼[Poi(sV)]) ≤ 1/(20(n−1)).   (5)

Then any tester taking fewer than s/2 samples from an unknown distribution p cannot distinguish, with probability 3/5, between the case that p is m-grained and the case that p is at least 1/(8C_0)-far in TV distance from every m-grained distribution.

In order to prove Theorem 1.2, we will be using the following lemma.

Lemma 2.2.

There exist constants L, K such that for any n and m = L(n−1), there exist two discrete random variables U and V such that the following Equations (6), (7), (8) and (9) hold.

Support(U) ⊆ ℤ_{≥0}/m  and  Support(V) ⊆ (2ℤ_{≥0}+1)/(2m),   (6)
U, V ≤ L/m,   (7)
𝔼[U] = 𝔼[V] ≤ 1/(2(n−1)),   (8)
𝔼[U^t] = 𝔼[V^t] for all t ∈ [K].   (9)

In order to prove the above lemma, we will use the following polynomial to construct our U and V (fix any positive integer L):

P(x) = x·(x − 1/(2m))·(x − 2/(2m))·(x − 3/(2m))⋯(x − 2L/(2m))   (10)

In the context of Lemma 2.2, the random variable U will be supported on the roots 0, 1/m, 2/m, …, L/m. Similarly, the random variable V will be supported on the roots 1/(2m), 3/(2m), …, (2L−1)/(2m).
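Before the formal proof, here is a small exact-arithmetic check of ours for this construction, with toy values of m and L: the weights are proportional to 1/|P′(r)|, the supports are split by the sign of P′, and the first 2L−1 moments coincide, as claimed in Equation (9):

```python
# Small numeric check (ours) of the construction behind Lemma 2.2: weights ∝ 1/|P'(r)|
# on the roots of P from Equation (10), split by the sign of P'. U lives on multiples
# of 1/m, V on odd multiples of 1/(2m), and their first 2L-1 moments agree.
from fractions import Fraction

def build_U_V(m, L):
    roots = [Fraction(k, 2 * m) for k in range(2 * L + 1)]
    def Pprime(r):
        prod = Fraction(1)
        for q in roots:
            if q != r:
                prod *= (r - q)
        return prod
    U = {r: 1 / abs(Pprime(r)) for r in roots if Pprime(r) > 0}   # even k: multiples of 1/m
    V = {r: 1 / abs(Pprime(r)) for r in roots if Pprime(r) < 0}   # odd k: odd multiples of 1/(2m)
    ZU, ZV = sum(U.values()), sum(V.values())
    return {r: w / ZU for r, w in U.items()}, {r: w / ZV for r, w in V.items()}

def moment(W, t):
    return sum(p * r**t for r, p in W.items())

m, L = 12, 3
U, V = build_U_V(m, L)
print(all(moment(U, t) == moment(V, t) for t in range(1, 2 * L)))   # True: t = 1, ..., 2L-1
```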

Proof of Lemma 2.2.

We will first describe the construction of the random variables U and V, associated with the polynomial P. Namely, let Pr[U = r] ∝ 1/|P′(r)| for r in the support of U (the derivative is positive at every r in the support of U, i.e., r ∈ {0, 1/m, 2/m, …, L/m}). Similarly, we have Pr[V = r] ∝ 1/|P′(r)| for r in the support of V (likewise, all these derivatives are negative). The normalization term is simply

Z_P := ∑_{i=0}^{L} 1/|P′(i/m)| = ∑_{i=1}^{L} 1/|P′((2i−1)/(2m))|.

And we can show that

Z_P = \sum_{k=0}^{L} \binom{2L}{2k}\frac{(2m)^{2L}}{(2L)!} = \frac{(2m)^{2L}}{(2L)!}\sum_{k=0}^{L}\binom{2L}{2k} = \frac{(2m)^{2L}}{(2L)!}\, 2^{2L-1}.

Applying Fact 1.10 with k = 0, we can notice that the two summations above are equal (the derivative of P is positive at all roots in the support of U, and negative at all roots in the support of V).

Proof of Equation (7).

The largest value of U and V is the largest root of the polynomial P: max{U, V} ≤ L/m.

Proof of Equation (8).

Let us compute the derivative of the polynomial P from Equation (10). For any k ∈ {0, 1, …, 2L}, we have the following:

P′(k/(2m)) = \prod_{\ell=0}^{k-1}\frac{k-\ell}{2m} \cdot \prod_{\ell=k+1}^{2L}\frac{k-\ell}{2m}
  = (-1)^{k}\, \frac{k!\,(2L-k)!}{(2m)^{2L}}
  = (-1)^{k}\, \frac{(2L)!}{\binom{2L}{k}\,(2m)^{2L}}

Now let us compute 𝔼[U] as follows:

𝔼[U] = \Pr[U=0]\cdot 0 + \sum_{k=1}^{L} \Pr\!\left[U = \tfrac{2k}{2m}\right]\cdot\frac{2k}{2m}
  = \frac{1}{Z_P}\sum_{k=0}^{L}\frac{2k}{2m}\,\binom{2L}{2k}\frac{(2m)^{2L}}{(2L)!}
  = \frac{1}{m}\sum_{k=0}^{L}\frac{k\binom{2L}{2k}}{2^{2L-1}}
  = \frac{L}{2m}

We finally upper bound the expectation:

\frac{1}{2(n-1)} \ge \frac{L}{2m} = 𝔼[U] \iff m \ge L(n-1).   (11)
Proof of Equation (9).

Their first K = 2L−1 moments are matched by recalling Fact 1.10 (applied with k = t, for each 1 ≤ t ≤ 2L−1) and how U and V are constructed.

Proof of Theorem 1.2.

Using Lemma 1.9 on the random variables constructed in Lemma 2.2, setting m = 2L(n−1) = Θ(n) and noting that Λ = sL/m, we get that

𝖽𝖳𝖵(𝔼[Poi(sU)], 𝔼[Poi(sV)]) ≤ \left(\frac{e s L}{2Km}\right)^{K} ≤ \frac{1}{2(n-1)},

which gives,

s ≤ 2(n-1)\cdot\frac{2K}{e}\cdot\left(\frac{1}{2(n-1)}\right)^{1/K}.

Further simplification yields,

s ≤ \frac{4L-2}{e}\,\big(2(n-1)\big)^{1-\frac{1}{2L-1}}.

Applying Theorem 2.1 gives the lower bound s = Ω(n^{1−1/(2L−1)}). For any constant c < 1, one can choose L large enough such that c ≤ 1 − 1/(2L−1), which concludes our proof.

3 Ω(m/log m) lower bound for m-grainedness (Proof of Theorem 1.1)

In this section, we will be proving the following theorem:

See Theorem 1.1.

Similarly to the proof of Theorem 1.2, we will make use of Theorem 2.1 to prove our Ω(m/log m) lower bound. Namely, we will construct two random variables U and V which can be used to construct approximate probability vectors that are m-grained and (mostly) 2m-grained, respectively, and whose first Θ(log m) moments match. In particular, we will prove the following lemma:

Lemma 3.1.

For m = 70(n−1) and n > 20, there exist discrete random variables U and V such that the following hold:

Pr[V = 1/(2m)] ≥ 2/3,   (12)
Support(U) ⊆ ℤ_{≥0}/m,   (13)
0 ≤ U, V ≤ 20 log²(n−1)/(n−1),   (14)
𝔼[U] = 𝔼[V] ≤ 1/(20(n−1)),   (15)
𝔼[U^t] = 𝔼[V^t] for all t ∈ [3 log(n−1)].   (16)

3.1 Construction of 𝑼 and 𝑽

The main focus of this subsection is to establish Lemma 3.1.

Fact 1.10 allows us to define two random variables U and V with matching moments up to the degree of some polynomial – similar to the analysis in Section 2, by simply defining U and V through its roots and partitioning them by the signs of the derivatives at those roots. However, the challenge here is that we need to match moments up to a much higher degree (in Section 2, we matched them up to a constant degree; here we need to match them up to degree Θ(log m) = Θ(log n)), all while keeping all other conditions unchanged: U is m-grained and V is for the most part (2m)-grained; the maximum value of the support does not blow up; and, more importantly, the expectation can be bounded by O(1/n). Notably, if we try to match higher degrees – say, instead of a constant, we take L = Θ(log n) in Equation (11), with m = Θ(n), using the construction in Section 2 – the construction fails: the expectation 𝔼[U] will exceed O(1/n), and so when we take n copies of U or V, the result would not be (even approximately) a probability distribution.

To work around this and match degree (by implication, moments) as high as possible, we take the (shifted) Chebyshev polynomial of degree d and we will later use properties of this polynomial to construct our prior variables U and V. Recall that the shifted Chebyshev polynomial of the first kind is defined as follows:

pT_d(x) = T_d(1 − Δx) = 2^{d-1}\Delta^{d}\prod_{i=1}^{d}(x - t_i),   (17)

where T_d(θ) = cos(d·arccos θ) for θ ∈ [−1, 1]; t_i denotes the i-th root of the degree-d shifted Chebyshev polynomial, for every i ∈ [d]; and Δ is a parameter to be chosen later in the analysis.

However, as mentioned in the introduction, instead of using the shifted Chebyshev polynomial directly, we will be using a variant of it for our proof. For this purpose, we will define the polynomial p~Td, a slightly modified Chebyshev polynomial of degree d, by “rounding up to the nearest multiple of 1/m” its d roots:

p̃T_d(x) = T̃_d(1 − Δx) = 2^{d-1}\Delta^{d}\prod_{i=1}^{d}(x - \tilde t_i),   (18)

where t_i + 1/m ≥ t̃_i = ⌈m·t_i⌉/m ≥ t_i.

Putting these ideas together, we will focus on two polynomials, based on the Chebyshev polynomial and on the modified Chebyshev polynomial respectively (obtained by appending the corresponding Chebyshev polynomial to x(x − 1/(2m))(x − 1/m)):

p(x) = x·(x − 1/(2m))·(x − 1/m)·pT_d(x),   (19)
p̃(x) = x·(x − 1/(2m))·(x − 1/m)·p̃T_d(x).   (20)

Through polynomial p~ in Equation 20, we can construct two random variables U and V identified by their probability mass function. Namely, we define U as follows:

Pr[U = r] ∝ 1/|p̃′(r)|  if p̃′(r) > 0 and r is a root of p̃.   (21)

Similarly, we define the random variable V as follows:

Pr[V = r] ∝ 1/|p̃′(r)|  if p̃′(r) < 0 and r is a root of p̃.   (22)

For both U and V, we assign probability 0 to the roots not specified. Thus, by setting k = 0 in Fact 1.10 and noting that the negative and positive terms sum to 0, the normalization factors are equal for both U and V, and can be expressed as follows:

Observation 3.2.
Z_{\tilde p} = Z(\Delta, m, d) = \frac{1}{\tilde p'(0)} + \frac{1}{\tilde p'(1/m)} + \sum_{j=1}^{d}\frac{\mathbb{1}\{\tilde p'(\tilde t_j) > 0\}}{\tilde p'(\tilde t_j)} = \frac{1}{|\tilde p'(\frac{1}{2m})|} + \sum_{j=1}^{d}\frac{\mathbb{1}\{\tilde p'(\tilde t_j) < 0\}}{|\tilde p'(\tilde t_j)|}.   (23)

The reasons we are taking the shifted Chebyshev polynomial and “rounding the roots up” are two-fold: first, this kind of construction (before rounding) was used before to prove lower bounds of order Ω(n/log n), by constructing a polynomial of degree Θ(log n) with certain constraints [10]. Second, we need to make sure that the distributions derived from U are m-grained with very high probability. Yet, if we used the shifted Chebyshev polynomial directly, some of the roots in the support of U (those coming from T_d) would not be multiples of 1/m, and U would place non-trivial probability mass on them. Intuitively, we want to leverage the known properties of the shifted Chebyshev polynomial pT_d and argue that rounding its roots up, thus creating p̃T_d, only “mildly” changes those properties.

Since we are deciding the supports of U, V based on the sign of the derivative at each root of p̃, and arguing that p̃T_d is somewhat close to pT_d, we first need to make sure that the signs of the derivative at the three main roots remain the same as for p (the signs of the derivatives at the remaining roots, the ones induced by p̃T_d, will not affect the argument). Note that the sign of the polynomial p̃T_d when evaluated at 0, 1/(2m), 1/m is unchanged (same as for pT_d), as (0 − t_j) and (0 − t̃_j) have the same sign; we also have 1/(2m), 1/m ≤ t_j ≤ t̃_j, via a constraint on the setting of Δ, and therefore we can conclude that

pT_d(0)·p̃T_d(0) > 0;  pT_d(1/(2m))·p̃T_d(1/(2m)) > 0;  pT_d(1/m)·p̃T_d(1/m) > 0.
Table 1: Properties of the polynomial p and its d+3 distinct roots. The upper and lower bounds on |p′(r_1)|, |p′(r_2)|, |p′(r_3)| are proven in Claim A.7 of the full version [12].

roots | value of p′(r) | bound on |p′(r)|
r_1 = 0 | \frac{1}{2m^2}\, T_d(1) | Θ(1/m²)
r_2 = 1/(2m) | -\frac{1}{4m^2}\, T_d(1 - \frac{\Delta}{2m}) | Θ(1/m²)
r_3 = 1/m | \frac{1}{2m^2}\, T_d(1 - \frac{\Delta}{m}) | Θ(1/m²)
r_{\ell+3} = t_\ell = \frac{2}{\Delta}\sin^2\!\big(\frac{\pi}{2}\cdot\frac{2\ell-1}{2d}\big) | p′(t_\ell) = t_\ell (t_\ell - \frac{1}{2m})(t_\ell - \frac{1}{m})\, pT_d′(t_\ell) | Θ(\ell^5/(\Delta^2 d^4))

Next, we compute the corresponding derivatives of the two polynomials, evaluated at the roots t_ℓ and t̃_ℓ respectively.

p′(t_\ell) = t_\ell\big(t_\ell - \tfrac{1}{2m}\big)\big(t_\ell - \tfrac{1}{m}\big)\, 2^{d-1}\Delta^{d}\prod_{j\neq\ell}(t_j - t_\ell),   (24)
p̃′(t̃_\ell) = t̃_\ell\big(t̃_\ell - \tfrac{1}{2m}\big)\big(t̃_\ell - \tfrac{1}{m}\big)\, 2^{d-1}\Delta^{d}\prod_{j\neq\ell}(t̃_j - t̃_\ell).   (25)

Here, note that by construction, all the roots of p̃T_d are integer multiples of 1/m (see Equation (18)). Moreover, 0 and 1/m are m-grained roots of p̃(x) (see Equation (20)). On the other hand, the root 1/(2m) of the polynomial p̃(x) is, by definition, not a multiple of 1/m.

In the context of Lemma 3.1, the random variable U will be supported on the roots 0, 1/m, and a subset of the roots t̃_ℓ (ℓ ∈ [d]) of the modified Chebyshev polynomial p̃T_d from Equation (18). On the other hand, the random variable V will be supported on 1/(2m) and the complementary subset of the modified Chebyshev roots (disjoint from the ones in the support of U). The supports of U and V thus form a partition of the set of all roots of p̃.

To prove Lemma 3.1, we need a few claims, that we state below and whose proof can be found in the full version [12].

Claim 3.3 (Relation between the roots of p̃ and p).
|p̃′(t̃_\ell)| ≥ |p′(t_\ell)| \cdot \exp\!\big(-\tfrac{12\Delta}{m/d^2}\big)  for every ℓ ∈ [d].   (26)
1 ≤ \frac{|p̃′(0)|}{|p′(0)|},\ \frac{|p̃′(1/m)|}{|p′(1/m)|},\ \frac{|p̃′(1/(2m))|}{|p′(1/(2m))|} ≤ 6.   (27)

The proofs of Equation (26) and Equation (27) can be found in Appendix A.1 of the full version (Claims A.2 to A.5). From now on, we will assume that Claim 3.3 holds.

For the lower bound, we will be choosing Δ = m/(20d²) and d = 10 log(n−1).

Observation 3.4.

Let t_j and t̃_j be the roots of the Chebyshev and modified Chebyshev polynomials respectively, for any j ∈ [d]. When Δ ≤ m/(2d²), we have that

t̃_j ≥ t_j ≥ 1/m.
Proof.

Using Equation (2), we can lower bound t_j for any j:

t_j = \frac{1}{\Delta}\left(1 - \cos\!\left(\frac{2j-1}{2d}\pi\right)\right) = \frac{2}{\Delta}\sin^2\!\left(\frac{\pi}{2}\cdot\frac{2j-1}{2d}\right) \ge \frac{2}{\Delta}\left(\frac{2j-1}{2d}\right)^2 \ge \frac{1}{m},   (28)

where the last inequality follows from 2j−1 ≥ 1 (which holds for all j ≥ 1) and our assumption on Δ. Moreover, we also have:

t̃_j = ⌈m·t_j⌉/m ≥ t_j ≥ 1/m,

where the last inequality follows from Equation 28. This completes the proof.

We need the following two claims (Claims 3.5 and 3.6), whose proofs are deferred to Appendix A.2 of the full version [12], to show in Claim 3.7 that Pr[V = 1/(2m)] = Ω(1), as required by Equation (12).

Claim 3.5.

Suppose that 1 ≤ |p̃′(1/(2m))|/|p′(1/(2m))| ≤ 6 holds and Δ ≤ 2m/(9d²). Then we have

Pr[V = 1/(2m)] ∝ \frac{1}{|p̃′(\frac{1}{2m})|} ≥ \frac{2m^2}{3}.
Claim 3.6.

Suppose that Δ ≤ 2m/(9d²). For any j ∈ [d], let t̃_j denote the j-th root of the modified Chebyshev polynomial p̃T_d. Then we have:

Pr[V ≠ 1/(2m)] ∝ \sum_{j=1}^{d}\frac{\mathbb{1}\{p̃′(t̃_j) < 0\}}{|p̃′(t̃_j)|} ≤ \exp\!\left(\frac{12\Delta}{m/d^2}\right)\cdot\frac{66}{\pi}\,\Delta^2 d^4.
Claim 3.7.

If Δ ≤ m/(20d²), then

Pr[V = 1/(2m)] = \frac{1}{|p̃′(\frac{1}{2m})|\, Z} ≥ \frac{5}{6},

where Z is the normalizing constant defined in Equation 23.

Proof.

Following the definition of the random variable V (Equation 22) and the normalization term Zp~ (Equation 23), we can say that:

\Pr\!\left[V = \tfrac{1}{2m}\right] = \frac{1}{|p̃′(\frac{1}{2m})|\, Z} = \frac{1}{1 + \left(\sum_{j=1}^{d}\frac{\mathbb{1}\{p̃′(t̃_j) < 0\}}{|p̃′(t̃_j)|}\right)\cdot |p̃′(\tfrac{1}{2m})|}.

Combining Claims 3.5 and 3.6, we see that

\left(\sum_{j=1}^{d}\frac{\mathbb{1}\{p̃′(t̃_j) < 0\}}{|p̃′(t̃_j)|}\right)\cdot |p̃′(\tfrac{1}{2m})| ≤ \frac{3}{2m^2}\cdot\exp\!\left(\frac{12\Delta}{m/d^2}\right)\cdot\frac{66}{\pi}\,\Delta^2 d^4
 = \frac{99}{\pi}\exp\!\left(\frac{12\Delta}{m/d^2}\right)\frac{\Delta^2 d^4}{m^2}
 ≤ \frac{99}{\pi}\exp\!\left(\frac{12\cdot\frac{m}{20d^2}}{m/d^2}\right)\frac{\left(\frac{m}{20d^2}\right)^2 d^4}{m^2}   [Δ ≤ m/(20d²)]
 = \frac{99}{\pi}\exp\!\left(\frac{12}{20}\right)\cdot\frac{1}{400} ≤ \frac{1}{5}.

Thus, we can say that

\Pr\!\left[V = \tfrac{1}{2m}\right] = \frac{1}{1 + \left(\sum_{j=1}^{d}\frac{\mathbb{1}\{p̃′(t̃_j) < 0\}}{|p̃′(t̃_j)|}\right)\cdot |p̃′(\tfrac{1}{2m})|} ≥ \frac{5}{6}.

This completes the proof.

Now we are ready to prove Lemma 3.1.

Proof of Lemma 3.1.

We now prove that the random variables U and V defined in Section 3.1 satisfy Equations (12)–(16). We begin by setting Δ = m/(20d²).

Proof of Equation (12).

Using Claim 3.7, we know that

Pr[V = 1/(2m)] ≥ 5/6.
Proof of Equation (14).

The largest values of U and V are at most the largest root of the modified Chebyshev polynomial. Let us first upper bound the largest root of the (unmodified) Chebyshev polynomial:

\max_{\ell\in[d]} t_\ell = \max_{\ell\in[d]} \frac{1}{\Delta}\left(1 - \cos\!\left(\frac{2\ell-1}{2d}\pi\right)\right) = \frac{2}{\Delta}\sin^2\!\left(\frac{\pi}{2}\cdot\frac{2d-1}{2d}\right) ≤ \frac{2}{\Delta},

where the last inequality is obtained via Equation (2). So, the largest root of the modified Chebyshev polynomial satisfies:

\max_{\ell\in[d]} t̃_\ell = \frac{\lceil m\, t_d\rceil}{m} ≤ \frac{2}{\Delta} + \frac{1}{m} = \frac{40d^2 + 1}{m},

where the last equality follows from the fact that Δ = m/(20d²). Plugging in the values m = 70(n−1) and d = 10 log(n−1), we have the result.

Proof of Equation (15).

Recall that the random variable U is supported on the roots 0, 1/m, and a subset of the roots of the modified Chebyshev polynomial (Equation (18)). Let us start by computing an upper bound on 𝔼[U].

𝔼[U] ≤ \Pr[U=0]\cdot 0 + \Pr[U=\tfrac{1}{m}]\cdot\tfrac{1}{m} + \sum_{j=1}^{d}\Pr[U=t̃_j]\cdot t̃_j
 ≤ \Pr[U=\tfrac{1}{m}]\cdot\tfrac{1}{m} + \frac{1}{Z_{\tilde p}}\left(\sum_{j=1}^{d} t̃_j\cdot\frac{1}{|p̃'(t̃_j)|}\right)   [Z_{\tilde p} is the normalization term, see Equation (23)]
 ≤ \frac{1}{m} + \frac{1}{Z_{\tilde p}}\left(\sum_{j=1}^{d}\big(t_j + \tfrac{1}{m}\big)\,\frac{\exp\!\left(\frac{12\Delta}{m/d^2}\right)}{|p'(t_j)|}\right)   [t̃_j = \frac{\lceil m t_j\rceil}{m} and Equation (26)]
 ≤ \frac{1}{m} + \frac{1}{Z_{\tilde p}}\left(\sum_{j=1}^{d}\frac{(1+\frac{1}{2})\, t_j\,\exp\!\left(\frac{12\Delta}{m/d^2}\right)}{\frac{\Delta}{4}\, t_j^{3}\,\frac{d}{\sin\frac{2j-1}{2d}}}\right)
   [using the proof of Claim 3.6 (Appendix A.2, full version [12])]
 = \frac{1}{m} + \frac{1}{Z_{\tilde p}}\left(\sum_{j=1}^{d}\frac{6\,\exp\!\left(\frac{12\Delta}{m/d^2}\right)}{\Delta\left(\frac{4}{\Delta^2}\sin^4\!\big(\frac{\pi}{2}\cdot\frac{2j-1}{2d}\big)\right)\frac{d}{\sin\frac{2j-1}{2d}}}\right)   [t_j = \frac{2}{\Delta}\sin^2\!\big(\frac{\pi}{2}\cdot\frac{2j-1}{2d}\big)]
 ≤ \frac{1}{m} + \frac{1}{Z_{\tilde p}}\left(\sum_{j=1}^{d}\frac{6\,\exp\!\left(\frac{12\Delta}{m/d^2}\right)}{\left(\frac{4}{\Delta}\big(\frac{2j-1}{2d}\big)^4\right)\frac{d}{\frac{4}{\pi}\cdot\frac{2j-1}{2d}}}\right)   [via Equation (2)]
 = \frac{1}{m} + \frac{\exp\!\left(\frac{12\Delta}{m/d^2}\right)\,\Delta\, d^2}{Z_{\tilde p}}\left(\sum_{j=1}^{d}\frac{6}{\pi (j-1/2)^3}\right)
 ≤ \frac{1}{m} + \frac{17\,\exp\!\left(\frac{12\Delta}{m/d^2}\right)\,\Delta\, d^2}{Z_{\tilde p}}   [\sum_{i=1}^{\infty}\frac{6}{\pi(i-1/2)^3} ≤ 17]
 ≤ \frac{1}{m} + \frac{17\times 3\,\exp\!\left(\frac{12\Delta}{m/d^2}\right)\,\Delta\, d^2}{2m^2}   [Z_{\tilde p} ≥ \frac{1}{|p̃'(\frac{1}{2m})|} ≥ \frac{2m^2}{3}, via Equation (23) & Claim 3.5]
 ≤ \frac{7}{2m} = \frac{1}{20(n-1)}   [Δ ≤ \frac{m}{20d^2} and m = 70(n-1)]

Since Equation (16) holds, as we will prove below, we know that 𝔼[U] = 𝔼[V]. This implies that 𝔼[V] ≤ 1/(20(n−1)) holds as well.

Proof of Equation (16).

We have 𝔼[U^t] = 𝔼[V^t] for 1 ≤ t ≤ d+1 from Fact 1.10 and Observation 3.2.

3.2 Putting it together: proof of the Ω(m/log m) lower bound

We are now ready to prove the lower bound for m-grainedness testing (Theorem 1.1): we combine Lemma 3.1, Lemma 1.9 and Theorem 2.1 to obtain our main Ω(m/log m) lower bound.

Proof of Theorem 1.1.

Let us fix n > 20, m = 70(n−1), and let s ≥ 1 be a parameter (we will later set it to s = Θ(n/log n)). From Lemma 3.1, we know that there exist two discrete random variables U and V with the matching-moments properties. Now we will use Lemma 1.9 on sU, sV. Following Lemma 3.1 and Lemma 1.9, we have Λ ≤ s·20 log²(n−1)/(n−1) and T = 3 log(n−1):

𝖽𝖳𝖵(𝔼[Poi(sU)], 𝔼[Poi(sV)]) ≤ \left(\frac{e\, s\cdot\frac{20\log^2(n-1)}{n-1}}{3\log(n-1)}\right)^{3\log(n-1)}.

We want to satisfy Equation (5), as all the other conditions of Theorem 2.1 have already been verified:

\left(\frac{e\, s\cdot\frac{20\log^2(n-1)}{n-1}}{3\log(n-1)}\right)^{3\log(n-1)} ≤ \frac{1}{20(n-1)} \iff \frac{e\, s\cdot 20\log(n-1)}{3(n-1)} ≤ \left(\frac{1}{20(n-1)}\right)^{\frac{1}{3\log(n-1)}}.

For n ≥ 21, we have

\frac{1}{4} ≤ \left(\frac{1}{20(n-1)}\right)^{\frac{1}{3\log(n-1)}},

and thus, for s = \frac{3(n-1)}{80\, e\log(n-1)}, the left-hand side satisfies \frac{e\, s\cdot 20\log(n-1)}{3(n-1)} ≤ \frac{1}{4}. We can therefore invoke Theorem 2.1 and obtain the sample complexity lower bound of Theorem 1.1, noting that m = 70(n−1) = Θ(n). Thus,

\frac{s}{2} = \frac{3(n-1)}{160\, e\log(n-1)} = \Omega\!\left(\frac{m}{\log m}\right).

This concludes our proof.

4 Uniformity Testing in the DoHO Model

In this section, we prove the following result on parameterized uniformity testing in the DoHO model, improving on the known Õ(m)-query upper bound in the regime m ≥ n³.

Theorem 4.1.

For any m and constant ε > 0, the property Π_m of distributions over {0,1}^n, defined as

Π_m = {U_S : S ⊆ {0,1}^n, |S| = m},

can be ε-tested in the DoHO model using s = 𝒪(m^{2/3}/ε^6) samples and s·n queries. Moreover, Ω(m^{2/3}) samples are necessary.

Proof.

The algorithm itself is quite simple, and follows from combining existing algorithms for generalized uniformity testing in the standard sampling model [5, 17] with an additional “check” of the ℓ_2 norm of the distribution: the tester is described in Algorithm 1. The query complexity follows trivially from the stated sample complexity, as n queries suffice to read a sample in full.

Specifically, the algorithm is as follows:

Algorithm 1 Algorithm to test Πm.
1: Set δ ← 1/6, and α = Θ(ε).
2: Run the adaptive ℓ_2-norm estimator of [5, Lemma 3.1] to obtain a (1 ± ε/10)-estimate ρ̂ of ‖D‖_2² (with error probability δ) using O(√m/ε²) samples. If the algorithm does not terminate within this number of samples, or if the estimate is not in (1 ± ε/10)/m, output reject.
3: Run the generalized uniformity tester of [5] on s (fully queried) samples from D, with (TV) distance parameter α and error probability δ. If it rejects, output reject.
4:Output accept.
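To make the control flow of Algorithm 1 concrete, here is a schematic sketch of ours; estimate_l2_squared and generalized_uniformity_test are hypothetical placeholders for the subroutines of [5], whose implementations are not reproduced here:

```python
# Schematic sketch (ours) of Algorithm 1's control flow; the two subroutines are
# hypothetical placeholders for the ℓ2-norm estimator and the generalized
# uniformity tester of [5].
def test_parameterized_uniformity(samples, m, eps,
                                  estimate_l2_squared,            # hypothetical subroutine
                                  generalized_uniformity_test):   # hypothetical subroutine
    delta, alpha = 1 / 6, 0.1 * eps              # Step 1: δ = 1/6, α = Θ(ε)
    rho_hat = estimate_l2_squared(samples, eps / 10, delta)
    if rho_hat is None or not (1 - eps / 10) / m <= rho_hat <= (1 + eps / 10) / m:
        return "reject"                          # Step 2: ‖D‖_2^2 must be ≈ 1/m
    if not generalized_uniformity_test(samples, alpha, delta):
        return "reject"                          # Step 3: D must be close to some uniform U_T
    return "accept"                              # Step 4
```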

We now argue correctness. By a union bound, both subroutines behave as they should with overall probability at least 2/3: we hereafter assume their outputs are correct.

  • Completeness: Assume D ∈ Π_m. Then ‖D‖_2² = 1/m, so ρ̂ ∈ [(1−ε/10)/m, (1+ε/10)/m] and the first step does not reject. Similarly, D ∈ Π_U = ∪_{k≥1} Π_k, so the generalized uniformity tester (which is by definition a tester for Π_U) accepts. Overall, Algorithm 1 returns accept.

  • Soundness: By contrapositive, suppose Algorithm 1 returns accept. Since the second subroutine did not reject, it must be the case that D is ε/2-close (in total variation distance) to a distribution U_T, uniform on some subset T of some size k. Moreover, the algorithm of [5] provides the extra guarantee that (1 − O(ε))‖D‖_2² ≤ 1/k ≤ (1 + O(ε))‖D‖_2² (see [5, Lemma 3.4]). (This is why we chose to use this specific algorithm, instead of that of [17], which has better sample complexity with respect to ε but does not provide this extra guarantee.)

    Now, let U_S ∈ Π_m be a closest distribution to D, over all subsets S of size m:

    𝖽𝖳𝖵(D, U_S) ≤ 𝖽𝖳𝖵(D, U_T) + 𝖽𝖳𝖵(U_T, U_S) ≤ \frac{\varepsilon}{2} + 1 - \frac{\min(m,k)}{\max(m,k)},

    where the second inequality follows from the minimum TV distance between two uniform distributions on supports of sizes k and m, which is achieved when the supports overlap as much as possible, and equals 1 − min(m,k)/max(m,k). In particular, we have 1 − min(m,k)/max(m,k) ≤ ε/2 if

    \frac{m}{1-\frac{\varepsilon}{2}} ≥ k ≥ \left(1-\frac{\varepsilon}{2}\right) m.

    Now, to see why this holds, observe that by our first check, ‖D‖_2² is within a (1 ± ε/10) factor of 1/m, and, by the additional guarantee of the [5] tester (adjusting the constant in the setting of α), within a (1 ± ε/10) factor of 1/k. This implies that m and k are within a (1 ± ε/4) factor of each other, and thus that 1 − min(m,k)/max(m,k) ≤ ε/2. Overall, this establishes that 𝖽𝖳𝖵(D, U_S) ≤ ε.

Finally, the claimed lower bound follows directly from the lower bound on generalized uniformity testing of [17], which holds even when the target size of the support is given.

5 Conclusion and open problems

In this work, we studied the problem of testing m-grainedness of distributions. We established a new lower bound of Ω(m/log m) samples, improving on the previous lower bound of Ω(m^c) due to [22] and thereby settling a conjecture of that paper. Along the way, we also obtained an alternative, simpler proof of the Ω(m^c) lower bound. By leveraging a reduction between the testing models due to [23], our result implies an optimal lower bound for uniformity testing in the DoHO (Distributions over Huge Objects) model, settling another conjecture of [23]. Finally, we provided a simple tester for uniformity testing in this DoHO model, with improved sample complexity for a large range of the parameters.

Our work leaves open several new avenues of research in this context; we list two of them below:

  1. (i)

    Our lower bound for m-grainedness testing establishes the optimal dependence on the parameter m, for constant proximity parameter ε=Ω(1). It would be interesting to fully characterize the complexity of the question, including the dependence on ε.

  2. (ii)

    From [23], it is known that testing uniformity over an m-element subset of {0,1}^n requires Ω(m/log m) queries for m ≤ n, and that Õ(m) queries are sufficient for all m. Our own tester (Theorem 4.1) shows an upper bound of O(m^{2/3}·n) queries, better in the regime m ≥ n³. This leaves understanding the landscape of parameterized uniformity testing, especially in the intermediate regime n ≤ m ≤ n³, as an open and intriguing question.

References

  • [1] Jayadev Acharya, Constantinos Daskalakis, and Gautam Kamath. Optimal testing for properties of distributions. Advances in Neural Information Processing Systems (NeurIPS), 28, 2015.
  • [2] Tomer Adar and Eldar Fischer. Refining the adaptivity notion in the huge object model. In International Conference on Randomization and Computation (RANDOM), pages 45:1–45:16, 2024. doi:10.4230/LIPICS.APPROX/RANDOM.2024.45.
  • [3] Tomer Adar, Eldar Fischer, and Amit Levi. Support testing in the huge object model. In International Conference on Randomization and Computation (RANDOM), pages 46:1–46:16, 2024. doi:10.4230/LIPICS.APPROX/RANDOM.2024.46.
  • [4] Sivaraman Balakrishnan and Larry Wasserman. Hypothesis testing for high-dimensional multinomials: A selective review, 2018.
  • [5] Tugkan Batu and Clément L Canonne. Generalized uniformity testing. In Symposium on Foundations of Computer Science (FOCS), pages 880–889, 2017. doi:10.1109/FOCS.2017.86.
  • [6] Tugkan Batu, Eldar Fischer, Lance Fortnow, Ravi Kumar, Ronitt Rubinfeld, and Patrick White. Testing random variables for independence and identity. In Symposium on Foundations of Computer Science (FOCS), pages 442–451, 2001. doi:10.1109/SFCS.2001.959920.
  • [7] Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D Smith, and Patrick White. Testing that distributions are close. In Symposium on Foundations of Computer Science (FOCS), pages 259–269, 2000. doi:10.1109/SFCS.2000.892113.
  • [8] Clément L Canonne. A survey on distribution testing: Your data is big. but is it blue? Theory of Computing, pages 1–100, 2020.
  • [9] Clément L Canonne. Topics and techniques in distribution testing: A biased but representative sample. Foundations and Trends® in Communications and Information Theory, pages 1032–1198, 2022. doi:10.1561/0100000114.
  • [10] Clément L Canonne, Ilias Diakonikolas, Daniel Kane, and Sihan Liu. Nearly-tight bounds for testing histogram distributions. Advances in Neural Information Processing Systems (NeurIPS), 35:31599–31611, 2022.
  • [11] Clément L Canonne, Ayush Jain, Gautam Kamath, and Jerry Li. The price of tolerance in distribution testing. In Conference on Learning Theory (COLT), pages 573–624, 2022. URL: https://proceedings.mlr.press/v178/canonne22a.html.
  • [12] Clément L. Canonne, Sayantan Sen, and Joy Qiping Yang. Settling the complexity of testing grainedness of distributions, and application to uniformity testing in the huge object model. ECCC preprint, 2024. URL: https://eccc.weizmann.ac.il/report/2024/196.
  • [13] Sourav Chakraborty, Eldar Fischer, Arijit Ghosh, Gopinath Mishra, and Sayantan Sen. Testing of index-invariant properties in the huge object model. In Conference on Learning Theory (COLT), pages 3065–3136, 2023. URL: https://proceedings.mlr.press/v195/chakraborty23a.html.
  • [14] Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Sample-optimal identity testing with high probability. In International Colloquium on Automata, Languages, and Programming (ICALP), 2018.
  • [15] Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Collision-based testers are optimal for uniformity and closeness. Chic. J. Theor. Comput. Sci, 25:1–21, 2019. URL: http://cjtcs.cs.uchicago.edu/articles/2019/1/contents.html.
  • [16] Ilias Diakonikolas and Daniel M Kane. A new approach for testing properties of discrete distributions. In Symposium on Foundations of Computer Science (FOCS), pages 685–694, 2016. doi:10.1109/FOCS.2016.78.
  • [17] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Sharp bounds for generalized uniformity testing. Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • [18] Oded Goldreich. The uniform distribution is complete with respect to testing identity to a fixed distribution. In Electronic Colloquium on Computational Complexity (ECCC), page 1, 2016.
  • [19] Oded Goldreich. Introduction to property testing. Cambridge University Press, 2017. doi:10.1017/9781108135252.
  • [20] Oded Goldreich, Shafi Goldwasser, and Dana Ron. Property testing and its connection to learning and approximation. Journal of the ACM (JACM), pages 653–750, 1998. doi:10.1145/285055.285060.
  • [21] Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. Electron. Colloquium Comput. Complex., TR00-020, 2000. URL: https://eccc.weizmann.ac.il/eccc-reports/2000/TR00-020/index.html.
  • [22] Oded Goldreich and Dana Ron. A lower bound on the complexity of testing grained distributions. Computational Complexity (CC), page 11, 2023. doi:10.1007/S00037-023-00245-W.
  • [23] Oded Goldreich and Dana Ron. Testing distributions of huge objects. In TheoretiCS, page 78, 2023.
  • [24] metamorphy. Proving a formula for n-th degree polynomial with n distinct real roots, 2021. URL: https://math.stackexchange.com/questions/4074098/proving-a-formula-for-sum-j-1n-fracx-jkfx-j-for-f-an-n-th-degr.
  • [25] Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, pages 4750–4755, 2008. doi:10.1109/TIT.2008.928987.
  • [26] Sofya Raskhodnikova, Dana Ron, Amir Shpilka, and Adam Smith. Strong lower bounds for approximating distribution support size and the distinct elements problem. SIAM Journal on Computing (SICOMP), pages 813–842, 2009. doi:10.1137/070701649.
  • [27] Ronitt Rubinfeld and Madhu Sudan. Robust characterizations of polynomials with applications to program testing. SIAM Journal on Computing (SICOMP), pages 252–271, 1996. doi:10.1137/S0097539793255151.
  • [28] Gregory Valiant and Paul Valiant. Estimating the unseen: improved estimators for entropy and other properties. Journal of the ACM (JACM), pages 1–41, 2017. doi:10.1145/3125643.
  • [29] Paul Valiant. Testing symmetric properties of distributions. SIAM Journal on Computing (SICOMP), pages 1927–1968, 2011. doi:10.1137/080734066.
  • [30] Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, pages 3702–3720, 2016. doi:10.1109/TIT.2016.2548468.
  • [31] Yihong Wu and Pengkun Yang. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. The Annals of Statistics, pages 857–883, 2019.
  • [32] Yihong Wu, Pengkun Yang, et al. Polynomial methods in statistical inference: theory and practice. Foundations and Trends® in Communications and Information Theory, pages 402–586, 2020.