Settling the Complexity of Testing Grainedness of Distributions, and Application to Uniformity Testing in the Huge Object Model
Abstract
In this work, we study the problem of testing $m$-grainedness of probability distributions over an $n$-element universe $\mathcal{U}$, or, equivalently, of whether a probability distribution is induced by a multiset of size $N = mn$. Recently, Goldreich and Ron (Computational Complexity, 2023) proved that $\Omega(n^c)$ samples are necessary for testing this property, for any $c < 1$ and $m = \Theta(n)$. They also conjectured that $\Omega\left(\frac{m}{\log m}\right)$ samples are necessary for testing this property when $m = \Theta(n)$. In this work, we positively settle this conjecture.
Using a known connection to the Distributions over Huge Objects (DoHO) model introduced by Goldreich and Ron (TheoretiCS, 2023), we leverage our results to provide improved bounds for uniformity testing in the DoHO model.
Keywords and phrases: Distribution testing, Uniformity testing, Huge Object Model, Lower bounds
Funding: Clément L. Canonne: Supported by an ARC DECRA (DE230101329).
2012 ACM Subject Classification: Theory of computation → Streaming, sublinear and near linear time algorithms
Acknowledgements: We would like to thank the anonymous reviewers of ITCS 2025 for their suggestions which improved the presentation of the paper. SS would like to thank Clément Canonne and the Theory CS group at the University of Sydney for the warm hospitality during his academic visit, where this work was initiated.
Editors: Raghu Meka
1 Introduction
The field of distribution testing [7, 6] is concerned with providing statistically accurate information about large datasets or their underlying probability distributions, given very scarce data (sample size). Drawing insights from property testing [20, 27], distribution testing lies at the intersection of theoretical computer science, statistics, and learning theory, and has received significant attention over the past two decades, with many algorithms, insights, and new theoretical access models to the data being proposed and analyzed. We refer the reader to the recent surveys [8, 4, 9] and the textbook [19, Chapter 11] for more on distribution testing and property testing.
In the most standard access model, the algorithm accesses the underlying unknown probability distribution $p$ (usually assumed to be over a known discrete domain of size $n$, identified with $[n]$) by obtaining independent, identically distributed (i.i.d.) samples from it. The goal of the algorithm, given a fixed property $\mathcal{P}$ (a subset of the $(n-1)$-dimensional probability simplex $\Delta_n$) and a distance parameter $\varepsilon \in (0,1]$, is then as follows:
-
If $p \in \mathcal{P}$, the algorithm must output accept with probability at least $2/3$;
-
If $d_{TV}(p, q) > \varepsilon$ for all $q \in \mathcal{P}$, the algorithm must output reject with probability at least $2/3$;
where $d_{TV}$ denotes the total variation distance between distributions, and the probability is taken over the choice of the samples and the internal randomness of the algorithm. This task is referred to as $\varepsilon$-testing of the property $\mathcal{P}$. The minimum number of samples necessary to achieve this task in the worst case (over all possible distributions $p$) is the sample complexity of testing $\mathcal{P}$. That is, the testing task requires the algorithm, given as few samples as possible, to distinguish with high probability between distributions which have the property, and those which are “far” from having it.
Many other access models have been introduced, providing additional types of queries to the algorithm, or changing the distance metric, or both (see [8] for an overview): among them is the Distributions over Huge Objects (DoHO) model, recently introduced by Goldreich and Ron [23] to capture settings where full access to the data sampled is itself costly or impractical, due to the size of these objects. In the simplest version of this model, the distribution is defined over the $n$-dimensional hypercube $\{0,1\}^n$, where $n$ is assumed to be very large; given a sample $x \in \{0,1\}^n$, the algorithm must then choose which bits of $x$ to observe, paying a unit cost for every such query made. The distance metric to quantify “farness” between distributions also differs from that of the standard formulation, and instead is taken to be the Earth Mover distance (EMD) between probability distributions, with underlying metric chosen to be the (relative) Hamming distance between $n$-bit strings. (For the formal definition of this model, and the relation to the standard sampling model, see Section 1.3.)
Many different properties of distributions have been studied, some of them quite extensively: among those, uniformity (consisting of the single uniform distribution over the whole domain), generalized uniformity [5] (denoted $\mathcal{U}^{\star}$, consisting of all distributions uniform over some subset of the domain), parameterized uniformity (denoted $\mathcal{U}_m$, consisting of all distributions uniform over some size-$m$ subset of the domain) [23], and $m$-grainedness (denoted $\mathcal{G}_m$, which we will define shortly) [18, 22] are the most relevant to this work.
In this paper, we will focus on two distribution testing tasks, intimately related:
-
Testing $m$-grainedness of distributions in the standard sampling model (property $\mathcal{G}_m$); and
-
Testing parameterized uniformity ($\mathcal{U}_m$) in the DoHO model.
Grainedness of distributions
A probability distribution $p$ over a discrete set of $n$ elements is said to be $m$-grained, for a given parameter $m$, if all its probabilities are integer multiples of $\frac{1}{m}$. That is,
$$p(i) \in \left\{ \frac{k}{m} : k \in \{0, 1, \dots, m\} \right\} \quad \text{for every element } i.$$
Such distributions naturally arise due to quantization (e.g., binning of continuous or discrete distributions), or when sampling from datasets: that is, if $S$ is a multiset of size $m$, uniformly sampling from $S$ with replacement yields an $m$-grained distribution over the elements of $S$.
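To make the definition concrete, here is a small (illustrative, not from the paper) Python snippet constructing the $m$-grained distribution induced by a multiset, and checking grainedness exactly via rational arithmetic:

```python
from collections import Counter
from fractions import Fraction

def distribution_from_multiset(multiset):
    """Uniform sampling (with replacement) from a multiset S of size m
    induces an m-grained distribution: each probability is k/m, k integer."""
    m = len(multiset)
    counts = Counter(multiset)
    return {x: Fraction(k, m) for x, k in counts.items()}

def is_grained(dist, m):
    """Check that every probability is an integer multiple of 1/m."""
    return all((p * m).denominator == 1 for p in dist.values())

S = [1, 1, 2, 5, 5, 5]              # multiset of size m = 6
p = distribution_from_multiset(S)   # {1: 1/3, 2: 1/6, 5: 1/2}
print(is_grained(p, 6), is_grained(p, 4))   # True False
```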
Throughout this work, we will assume $n/m$ is a constant. Thus, $m = \Theta(n)$. Previous works fixed $m = n$, but these results can also be extended to any $m = \Theta(n)$, and therefore our results are stated in terms of $n$ instead of $m$.
Recent work of Goldreich and Ron [22] showed that $\Omega(n^c)$ samples are necessary for testing this property, for any fixed constant $c < 1$. A sample complexity upper bound of $O\left(\frac{n}{\log n}\right)$ also follows from previous work of Valiant and Valiant [29], which led [22] to conjecture a lower bound of $\Omega\left(\frac{n}{\log n}\right)$ samples. Our main contribution is to resolve this conjecture, showing that, indeed, $\Theta\left(\frac{n}{\log n}\right)$ samples are necessary and sufficient for $m$-grainedness testing:
Theorem 1.1.
Let $n$ be a sufficiently large integer, $m = \Theta(n)$, and let $\varepsilon > 0$ be a sufficiently small constant. Then, $\varepsilon$-testing $m$-grainedness of distributions over $[n]$ requires $\Omega\left(\frac{n}{\log n}\right)$ samples.
We note that the restriction $m = O(n)$ is necessary, as if $m \geq n/\varepsilon$ for some $\varepsilon > 0$, every distribution is $\varepsilon$-close to being $m$-grained: rounding each probability to a nearby integer multiple of $\frac{1}{m}$ (and correcting the total mass) moves the distribution by at most $O\left(\frac{n}{m}\right)$ in total variation distance. Recall that we assume $\varepsilon$ is a small constant. Moreover, the problem of testing $m$-grainedness becomes trivial when $\varepsilon = \Omega\left(\frac{n}{m}\right)$, as in that case every distribution is $\varepsilon$-close to being $m$-grained. Along the way, we also present as a warmup a new proof of the lower bound of $\Omega(n^c)$ samples, for every fixed constant $c < 1$, for $m$-grainedness testing:
Theorem 1.2.
Let $n$ be a sufficiently large integer, $m = \Theta(n)$, and let $\varepsilon > 0$ be a sufficiently small constant. Then, for any fixed constant $c < 1$, $\varepsilon$-testing $m$-grainedness of distributions over $[n]$ requires $\Omega(n^c)$ samples.
While this is a strictly weaker result than our main lower bound, our proof strategy differs significantly from that of [22], and may be of independent interest.
Parameterized uniformity testing in the DoHO model
In a separate work [23], Goldreich and Ron studied the problem of parameterized uniformity testing in the Huge Object model. Among other results, they established a connection between $m$-grainedness testing in the standard model and testing $\mathcal{U}_m$ in the DoHO model:
Theorem 1.3 ([23, Theorem 2.13]).
Assume that, for every constant $\varepsilon > 0$, $\varepsilon$-testing $m$-grainedness in the standard model has sample complexity $\Omega\left(\frac{m}{\log m}\right)$. Then, for every sufficiently large $n$ and constant $\varepsilon > 0$, $\varepsilon$-testing $\mathcal{U}_m$ for distributions over $\{0,1\}^n$ in the DoHO model has query complexity $\Omega\left(\frac{m}{\log m}\right)$.
Our lower bound for $m$-grainedness immediately implies that this result holds unconditionally. In the same paper, the authors provided an algorithm for testing $\mathcal{U}_m$ using $\tilde{O}(m)$ samples and queries (for constant $\varepsilon$), which together with the above – now unconditional – lower bound settles the complexity of testing $\mathcal{U}_m$ in the DoHO model, up to logarithmic factors, in that regime of $n$. One may be tempted to assume the same query complexity lower bound holds in all parameter regimes: perhaps surprisingly, our next result is a new and simple algorithm for testing $\mathcal{U}_m$ in the DoHO model which takes $\tilde{O}(m^{2/3})$ samples and performs $\tilde{O}(n \cdot m^{2/3})$ queries:
Theorem 1.4.
There is an algorithm which, given an integer $m$ and constant $\varepsilon > 0$, $\varepsilon$-tests the property $\mathcal{U}_m$ of distributions over $\{0,1\}^n$ in the DoHO model, taking $\tilde{O}(m^{2/3})$ samples and performing $\tilde{O}(n \cdot m^{2/3})$ queries. Moreover, any $\varepsilon$-tester for $\mathcal{U}_m$ in the DoHO model must take $\Omega(m^{2/3})$ samples.
Notably, this improves the upper bound of Goldreich and Ron whenever $n = o(m^{1/3})$, and rules out any $\Omega\left(\frac{m}{\log m}\right)$ query complexity lower bound holding over the whole range of $n$.
1.1 Overview of our techniques
Lower bound for -grainedness testing
Our lower bound approach follows Le Cam’s two-point method: we will design two distributions over distributions $\mathcal{D}_{\mathrm{yes}}$ and $\mathcal{D}_{\mathrm{no}}$ such that the distributions in $\mathcal{D}_{\mathrm{yes}}$ are $m$-grained and the distributions in $\mathcal{D}_{\mathrm{no}}$ are “far” from being $m$-grained in variation distance. The name of the game is then to prove that it is information-theoretically impossible to distinguish between $\mathcal{D}_{\mathrm{yes}}$ and $\mathcal{D}_{\mathrm{no}}$ given too few samples.
To achieve this, we will rely on the moment-matching technique, particularly suited to properties like $m$-grainedness (which are symmetric, i.e., invariant to relabelling of the domain elements). Broadly, if two probability distributions $p$ and $q$ have sufficiently similar moments $\sum_i p_i^{j}$ and $\sum_i q_i^{j}$, for say $j \in [d]$, then it is hard to distinguish (a permutation of) $p$ from (a permutation of) $q$ by using few samples. That is, the more moments one can match, the more samples one needs to tell two distributions apart. Note that matching $d = \Theta(\log n)$ moments would then lead to an $\Omega\left(\frac{n}{\log n}\right)$ lower bound (and that, as remarked later, in our case we can assume without loss of generality, for the sake of the lower bound argument, that $m = n$). This approach has been used in the literature extensively [26, 29, 28, 30, 31, 32, 10], and has proven to be very successful in obtaining tight lower bounds for a range of symmetric properties. We will rely on the formulation of the technique described by [32], which maps the problem of matching moments of two full probability distributions (an $n$-dimensional object) to that of matching moments of two single-dimensional random variables $X$ and $Y$: these two random variables will then be used to generate the probability distributions: intuitively (and as described below), “$n$ independent draws from $X$ will give $p$” (and similarly for $Y$ and $q$).
Thus, a crucial ingredient in our lower bound is the construction of a pair of discrete random variables $X$ and $Y$ whose first logarithmically many moments are identical. Using $n$ independent copies of $X$ (resp. $Y$), we will then define a random measure over $[n]$, which corresponds to a “yes”-instance $p$ (resp., a “no”-instance $q$). Thus, another requirement on our construction of $X$ and $Y$ is that the first one should yield an $m$-grained distribution, while the second should correspond to a distribution far from being $m$-grained. We will formalize this in Lemma 2.2.
To build these two random variables $X$ and $Y$, we will, as previously done in the literature, rely on the properties of the Chebyshev polynomial of degree $L$, defining $X$ and $Y$ to be supported on disjoint subsets of the roots of a suitable polynomial $q$, where the probability mass assigned to a given root $r$ is proportional to $\frac{1}{|q'(r)|}$. The idea is that some of the roots of $q$, those corresponding to $X$, will be multiples of $\frac{1}{m}$ (leading to $m$-grained probabilities) while the others, corresponding to $Y$, will be odd multiples of $\frac{1}{2m}$ (leading to “far-from-$m$-grained” probabilities): for instance, we would take
$$q(x) = \prod_{j=0}^{2K} \left( x - c \cdot \frac{j}{2} \right) \cdot T_L(x)$$
for some scaling constant $c > 0$. The remaining roots will be those of the Chebyshev polynomial $T_L$, which are there to ensure that we can match sufficiently many moments of $X$ and $Y$.
However, this approach, which underlies most of previous work, cannot be directly used here: indeed, while the resulting two random variables could be made to have many matching moments, and as such be hard to distinguish, doing so will put a lot of probability mass on the roots of the Chebyshev polynomial, both for $X$ and $Y$; and these roots are not multiples of $\frac{1}{2m}$. This would have the undesirable effect of making both our distributions far from being $m$-grained. Trying to fix this the obvious way requires ensuring that very little mass is put by $X$ (and so $Y$) on the roots of $T_L$, which in turn limits the number of moments that can be matched. (Note that fixing this by choosing not to use the Chebyshev polynomial at all, but instead choosing $q$ of the form $\prod_{j=0}^{2K} \left( x - c \cdot \frac{j}{2} \right)$ for some large constant integer $K$, does get part of the way there and provides a non-trivial result, leading to our (weaker) $\Omega(n^c)$ lower bound for every $c < 1$.)
To remedy this, we take a different route, using an idea we believe to be of independent interest and applicable to other lower bounds: instead of using Chebyshev polynomials directly, we will be using a modified version of Chebyshev polynomials, $\widetilde{T}_L$, defined from $T_L$ by first rounding its roots up to $m$-grained values. By carefully analyzing the resulting polynomial, we can show that it behaves, for our lower bound construction purposes, similarly to the Chebyshev polynomial, and so we can use it to define both $X$ and $Y$. This allows us to match more moments, leading to our final lower bound.
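As a quick numerical sanity check of this idea (with illustrative parameters, not the ones used in the analysis below), the following Python snippet scales the shifted Chebyshev roots, rounds them up to integers, and compares the derivative magnitudes of the original and root-rounded polynomials at their respective roots:

```python
import math

L = 8
rho = 4 * L**2   # scaling chosen so consecutive scaled roots are > 1 apart

# Roots of the degree-L Chebyshev polynomial shifted to [0, 1], scaled by rho
z = [rho * (1 + math.cos((2*i - 1) * math.pi / (2*L))) / 2
     for i in range(1, L + 1)]
z_round = [math.ceil(r) for r in z]   # round each root up ("grained" values)

def deriv_at_root(roots, j):
    """Derivative of prod_i (x - roots[i]), evaluated at roots[j]."""
    return math.prod(roots[j] - roots[i]
                     for i in range(len(roots)) if i != j)

# Rounding moves each root by < 1 while gaps exceed 1, so each factor (and
# hence the derivative) changes only mildly -- the property the analysis
# below makes quantitative.
for j in range(L):
    ratio = deriv_at_root(z_round, j) / deriv_at_root(z, j)
    print(f"{z[j]:9.2f} -> {z_round[j]:4d}   derivative ratio: {ratio:7.3f}")
```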
Upper bound for parameterized uniformity testing
In contrast, the idea underlying our upper bound for testing $\mathcal{U}_m$ in the DoHO model is relatively straightforward: namely, by making $n$ queries per sample, we can simulate any testing algorithm in the standard sampling model, as we now have observed the full samples. This, along with the relation between TV distance and EMD distance, allows us to lift any $s$-sample tester for $\mathcal{U}_m$ in the standard sampling model to an $s$-sample, $(n \cdot s)$-query tester for $\mathcal{U}_m$ in the DoHO model.
The only issue with this plan is that there is no known (nontrivial) tester for $\mathcal{U}_m$ in the standard sampling model. There is, however, a known testing algorithm for the related property of generalized uniformity, $\mathcal{U}^{\star}$. Our key contribution here is then to use this known tester to derive a tester for $\mathcal{U}_m$ in the standard sampling model. Notably, this is not as immediate as it seems, and our algorithm needs to use the tester of [5] in a whitebox way, and combine this with an additional test to estimate the $\ell_2$ norm of the unknown distribution $P$. The subtlety here (and the reason for which we cannot use any $\mathcal{U}^{\star}$-tester in a blackbox fashion, but need to use a specific algorithm due to [5]) is that $P$ being close to some uniform distribution over some subset does not immediately allow us to conclude that $P$ is close to some uniform distribution over some $m$-size subset – even if we are guaranteed that the $\ell_2$ norm of $P$ is close to $\frac{1}{\sqrt{m}}$.
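The lifting step itself is mechanical; a minimal sketch (with a hypothetical oracle interface, since the model is only formally defined in Section 1.3) looks as follows:

```python
def lift_to_doho(standard_tester, num_samples, n, sample_oracle, query_oracle):
    """Lift a standard-sampling-model tester to the DoHO model by reading
    every sample in full: num_samples samples, n * num_samples queries."""
    full_samples = []
    for i in range(num_samples):
        sample_oracle()                       # draw the i-th (hidden) sample
        bits = tuple(query_oracle(i, j) for j in range(n))
        full_samples.append(bits)             # now a fully-observed sample
    return standard_tester(full_samples)      # "accept" or "reject"
```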
1.2 Related work
The field of distribution testing has its roots in theoretical computer science with the work of Goldreich and Ron [21], who designed a uniformity testing algorithm as a tool to check whether a graph is an expander; the field was formally defined and introduced in [7]. [6] studied the problem of identity testing, which generalizes the problem of uniformity testing. Over the last two decades, this field has seen significant growth, and a host of new tools and techniques have emerged; see [25, 1, 18, 16, 14, 15, 11] to name a few. See the surveys [8, 4, 9] and the book of [19, Chapter 11] for more details.
The study of grained distributions was carried out implicitly in the work of [26], and later [18] studied them in more detail.
As mentioned earlier, the Huge Object model was introduced by [23]. Since then, there have been several works focusing on this model. [13] defined the notion of index-invariant properties, a set of properties that are invariant under permutations of the indices (these properties are different from label-invariant properties). They showed that index-invariant properties for which the VC dimension of the support set is bounded can be learned using a constant number of queries, depending only on the VC dimension. They also gave a tight separation between adaptive and non-adaptive testers for index-invariant and non-index-invariant properties. Later, [2] studied various different notions of adaptivity, and showed several separation results. Very recently, [3] studied the problem of support size testing in this model.
1.3 Preliminaries
For a positive integer $n$, let $[n]$ denote the set $\{1, \dots, n\}$. We will use the standard asymptotic notation $O(\cdot)$, $\Omega(\cdot)$, $\Theta(\cdot)$, and, in some cases, the (somewhat less) standard notation $\tilde{O}(\cdot)$, $\tilde{\Omega}(\cdot)$, which omits poly-logarithmic dependencies in the parameters for readability.
We will use several notions of distance between two distributions.
Definition 1.5.
Let $P$ and $Q$ be two probability distributions over a domain $\Omega$. The $\ell_2$ distance between $P$ and $Q$ is defined as
$$\|P - Q\|_2 := \left( \sum_{x \in \Omega} (P(x) - Q(x))^2 \right)^{1/2}.$$
The total variation distance between $P$ and $Q$ is defined as:
$$d_{TV}(P, Q) := \frac{1}{2} \sum_{x \in \Omega} |P(x) - Q(x)| = \max_{A \subseteq \Omega} \left( P(A) - Q(A) \right).$$
Definition 1.6 (EMD with respect to Hamming distance).
Let $P, Q$ be two distributions defined over the $n$-dimensional Hamming cube $\{0,1\}^n$, and let $d_H(x, y) := \frac{1}{n} \left| \{ i \in [n] : x_i \neq y_i \} \right|$ be the (relative) Hamming distance over $\{0,1\}^n$. Then the Earth Mover distance (EMD) between $P$ and $Q$ is defined as follows:
$$d_{\mathrm{EMD}}(P, Q) := \min_{\pi \in \Pi(P, Q)} \mathbb{E}_{(x, y) \sim \pi}\left[ d_H(x, y) \right],$$
where $\Pi(P, Q)$ denotes the set of all couplings between $P$ and $Q$.
Since we are working with the (relative) Hamming distance, $d_H(x, y) \leq 1$ for all $x, y \in \{0,1\}^n$. Then, directly from the definitions of $d_{TV}$ and $d_{\mathrm{EMD}}$, we get the following simple yet useful observation connecting the total variation and EMD distances between two distributions.
Observation 1.7 (Relation between EMD and TV distances).
Let $P$ and $Q$ be two distributions over the $n$-dimensional Hamming cube $\{0,1\}^n$. Then,
$$d_{\mathrm{EMD}}(P, Q) \leq d_{TV}(P, Q).$$
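For intuition, Observation 1.7 follows in one line from the coupling characterization of total variation distance, using that the relative Hamming distance satisfies $d_H(x, y) \leq \mathbb{1}\{x \neq y\}$:
$$d_{\mathrm{EMD}}(P, Q) = \min_{\pi \in \Pi(P, Q)} \mathbb{E}_{(x, y) \sim \pi}\left[ d_H(x, y) \right] \leq \min_{\pi \in \Pi(P, Q)} \Pr_{(x, y) \sim \pi}\left[ x \neq y \right] = d_{TV}(P, Q).$$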
Definition 1.8 (Huge Object Model).
Consider a discrete distribution $P$ supported over the $n$-dimensional Hamming cube $\{0,1\}^n$. $P$ is accessed by obtaining i.i.d. samples, where each sample obtained is an $n$-bit Boolean string. In addition to the sampling oracle, there is also a query oracle associated with $P$: the tester can query any index $j \in [n]$ of any sample obtained (say the $i$-th sample $x^{(i)}$), and the query oracle will return the $j$-th bit of $x^{(i)}$. The goal is then to minimize both the total sample and query complexities required to decide a property.
Note that this Huge Object model is particularly suited for studying high-dimensional distributions, where the dimension of the underlying domain is so large that even reading a single sample in its entirety might not be feasible.
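A minimal Python sketch of this access model (an illustrative interface of ours, not part of the formal definition) is given below; a tester is charged one unit per call to query:

```python
import random

class DoHOOracle:
    """Sketch of the Huge Object access model: i.i.d. samples from a
    distribution over {0,1}^n, revealed one queried bit at a time."""
    def __init__(self, strings, weights):
        self.strings, self.weights = strings, weights
        self.hidden_samples = []   # samples drawn so far (hidden from tester)
        self.num_queries = 0       # total query cost incurred

    def sample(self):
        """Draw a fresh sample; only a handle (its index) is revealed."""
        self.hidden_samples.append(random.choices(self.strings, self.weights)[0])
        return len(self.hidden_samples) - 1

    def query(self, i, j):
        """Reveal bit j of the i-th sample, at unit cost."""
        self.num_queries += 1
        return self.hidden_samples[i][j]
```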
Finally, we state here some technical results which will be used in our lower bound proofs:
Lemma 1.9 ([31, Lemma 4]).
Let $V$ and $V'$ be discrete random variables taking values in $[0, M]$. If $\mathbb{E}[V^{j}] = \mathbb{E}[V'^{j}]$ for all $j \in [L]$, then
$$d_{TV}\left( \mathbb{E}\left[ \mathrm{Poi}(V) \right], \mathbb{E}\left[ \mathrm{Poi}(V') \right] \right) \leq \left( \frac{2eM}{L} \right)^{L}.$$
Fact 1.10 ([10, Fact 7], [24]).
Let $q$ be a degree-$d$ polynomial with distinct roots $r_1, \dots, r_d$. Then, for every $0 \leq k \leq d - 2$, we have that $\sum_{j=1}^{d} \frac{r_j^{k}}{q'(r_j)} = 0$.
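To illustrate Fact 1.10 and the derivative-sign construction it enables (as used in Sections 2 and 3), the following self-contained Python snippet verifies both the identity and the resulting moment matching, for a toy polynomial with roots at consecutive half-integers; the specific polynomial is our choice, in the spirit of the constructions below:

```python
import math
from fractions import Fraction

K = 4
roots = [Fraction(j, 2) for j in range(2 * K + 1)]   # 0, 1/2, 1, ..., K

def dq(j):
    """q'(roots[j]) for q(x) = prod_i (x - roots[i])."""
    return math.prod(roots[j] - roots[i]
                     for i in range(len(roots)) if i != j)

# Fact 1.10: sum_j roots[j]^k / q'(roots[j]) = 0 for every 0 <= k <= d - 2.
for k in range(2 * K):
    assert sum(r**k / dq(j) for j, r in enumerate(roots)) == 0

# Consequence: putting mass prop. to 1/|q'(r)| on positive-derivative roots
# (X) and on negative-derivative roots (Y) yields equal normalizations and
# equal moments up to degree d - 2.
X = {r: 1 / dq(j) for j, r in enumerate(roots) if dq(j) > 0}
Y = {r: -1 / dq(j) for j, r in enumerate(roots) if dq(j) < 0}
assert sum(X.values()) == sum(Y.values())            # the k = 0 case
for k in range(1, 2 * K):
    assert sum(r**k * w for r, w in X.items()) \
        == sum(r**k * w for r, w in Y.items())
print(sorted(X), sorted(Y))   # integer support vs. half-integer support
```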
We will be using the following two well-known trigonometric inequalities.
Fact 1.11.
-
(i)
For $x \in \left[ 0, \frac{\pi}{2} \right]$, the following holds:
$$\frac{2x}{\pi} \leq \sin x \leq x. \quad (1)$$
-
(ii)
For $x \in [0, \pi]$, the following holds:
$$\cos x \leq 1 - \frac{2x^2}{\pi^2}. \quad (2)$$
Organization
The rest of the paper is organized as follows. In Section 2, we present a new proof of the lower bound of $\Omega(n^c)$ samples, for any constant $c < 1$. Then, in Section 3, we present the $\Omega\left(\frac{n}{\log n}\right)$ lower bound for testing $m$-grainedness. In Section 4, we present our results on uniformity testing in the Huge Object Model. We conclude in Section 5 with some open questions. Due to space constraints, the formal proofs of some claims are deferred to the full version of the paper [12].
2 An $\Omega(n^c)$ lower bound for testing $m$-grainedness
In this section, we prove Theorem 1.2, thereby presenting a new proof of the lower bound of [22].
See Theorem 1.2. Note that it suffices to consider the case $m = n$, as one can then embed the hard instances in a larger domain, lifting the lower bound to any $m = \Theta(n)$. As a result, we hereafter assume without loss of generality that $m = n$. We start by stating a relatively standard theorem which will be crucial in the proof of our sample complexity lower bound, and whose proof is deferred to the full version [12].
Theorem 2.1.
Fix positive integers $n$, $m$, $s$, and a bound $M > 0$, where $m = \Theta(n)$. Suppose there exist random variables $X$ and $Y$, where $X$ is supported on integer multiples of $\frac{n}{m}$; $Y$ places $\Omega(\varepsilon)$ probability mass on odd integer multiples of $\frac{n}{2m}$; and they satisfy the following three conditions,
$$\mathbb{E}[X] = \mathbb{E}[Y] = O(1); \quad (3)$$
$$X, Y \in [0, M] \text{ almost surely}; \quad (4)$$
$$d_{TV}\left( \mathbb{E}\left[ \mathrm{Poi}\left( \tfrac{s}{n} X \right) \right], \mathbb{E}\left[ \mathrm{Poi}\left( \tfrac{s}{n} Y \right) \right] \right) = o\left( \tfrac{1}{n} \right). \quad (5)$$
Then any tester taking fewer than $s$ samples from an unknown distribution $p$ cannot distinguish between the case that $p$ is $m$-grained and the case that $p$ is at least $\varepsilon$-far from any $m$-grained distribution in TV distance, with probability at least $2/3$.
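To make the role of $X$ and $Y$ concrete (in the moment-matching framework of [32]; the normalization details are our sketch, not the theorem’s exact construction): draw $X_1, \dots, X_n$ i.i.d. copies of $X$ and set $p_i := \frac{X_i}{n}$, so that
$$\mathbb{E}\left[ \sum_{i=1}^{n} p_i \right] = \mathbb{E}[X], \qquad X_i \in \tfrac{n}{m} \mathbb{Z}_{\geq 0} \implies p_i \in \tfrac{1}{m} \mathbb{Z}_{\geq 0},$$
so that, after a standard conditioning and normalization step, a typical realization is an $m$-grained probability vector; the same recipe applied to $Y$ instead yields vectors in which a constant fraction of the mass sits on probabilities that are odd multiples of $\frac{1}{2m}$, hence far from $m$-grained.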
In order to prove Theorem 1.2, we will be using the following lemma.
Lemma 2.2.
For every positive integer $K$, there exist discrete random variables $X$ and $Y$ such that the following hold:
$$X \text{ is supported on nonnegative integers, and } Y \text{ on odd multiples of } \tfrac{1}{2}; \quad (6)$$
$$\max\left( \mathrm{supp}(X) \cup \mathrm{supp}(Y) \right) \leq K; \quad (7)$$
$$\mathbb{E}[X] = \mathbb{E}[Y] \leq K; \quad (8)$$
$$\mathbb{E}[X^{j}] = \mathbb{E}[Y^{j}] \text{ for all } 1 \leq j \leq 2K - 1. \quad (9)$$
In order to prove the above lemma, we will use the following polynomial to construct our $X$ and $Y$ (fix any positive integer $K$):
$$q(x) := \prod_{j=0}^{2K} \left( x - \frac{j}{2} \right). \quad (10)$$
In the context of Lemma 2.2, the random variable $X$ will be supported on the integer roots $\{0, 1, \dots, K\}$. Similarly, the random variable $Y$ will be supported on the half-integer roots $\left\{ \frac{1}{2}, \frac{3}{2}, \dots, K - \frac{1}{2} \right\}$.
Proof of Lemma 2.2.
We will first describe the construction of the random variables $X$ and $Y$, associated with the polynomial $q$. Namely, let $\Pr[X = r] := \frac{1}{C \cdot q'(r)}$ for $r$ in the support of $X$ (every derivative $q'(r)$ is positive when $r$ is in the support of $X$, i.e., $r \in \{0, 1, \dots, K\}$). Similarly, we let $\Pr[Y = r] := \frac{1}{C \cdot |q'(r)|}$ for $r$ in the support of $Y$ (likewise, all these derivatives will be negative). The normalization term is simply
$$C := \sum_{r \in \mathrm{supp}(X)} \frac{1}{q'(r)}.$$
And we can show that
$$\sum_{r \in \mathrm{supp}(X)} \frac{1}{q'(r)} = \sum_{r \in \mathrm{supp}(Y)} \frac{1}{|q'(r)|}.$$
Applying Fact 1.10 with $k = 0$, we can notice that the two summations are equal ($q'$ is positive at all roots in the support of $X$, and negative at all roots in the support of $Y$).
Proof of Equation (7).
The largest value of $X$ and $Y$ is the largest root of the polynomial $q$: namely, $K$.
Proof of Equation (8).
Let us compute the derivative of the polynomial $q$ from Equation 10. For any root $r_j = \frac{j}{2}$ with $0 \leq j \leq 2K$, we have the following:
$$q'(r_j) = \prod_{i \neq j} \left( \frac{j - i}{2} \right) = \frac{(-1)^{j}\, j!\, (2K - j)!}{2^{2K}}.$$
Now let us compute $C$ as follows:
$$C = \sum_{\substack{0 \leq j \leq 2K \\ j \text{ even}}} \frac{2^{2K}}{j!\,(2K - j)!} = \frac{2^{2K}}{(2K)!} \sum_{\substack{0 \leq j \leq 2K \\ j \text{ even}}} \binom{2K}{j} = \frac{2^{4K - 1}}{(2K)!}.$$
We finally upper bound the expectation:
$$\mathbb{E}[X] = \frac{1}{C} \sum_{\substack{0 \leq j \leq 2K \\ j \text{ even}}} \frac{j}{2} \cdot \frac{2^{2K}}{j!\,(2K - j)!} = \frac{1}{2^{2K - 1}} \sum_{\substack{0 \leq j \leq 2K \\ j \text{ even}}} \binom{2K}{j} \cdot \frac{j}{2} \leq K. \quad (11)$$
Proof of Equation (9).
The first $2K - 1$ moments of $X$ and $Y$ are matched by recalling Fact 1.10 (applied for each $k \in \{1, \dots, 2K - 1\}$) and how $X$ and $Y$ are constructed.
Proof of Theorem 1.2.
Using Lemma 1.9 on the random variables constructed in Lemma 2.2, by setting $L = 2K - 1$ and $M = K$, we get that
$$d_{TV}\left( \mathbb{E}\left[ \mathrm{Poi}\left( \tfrac{s}{n} X \right) \right], \mathbb{E}\left[ \mathrm{Poi}\left( \tfrac{s}{n} Y \right) \right] \right) \leq \left( \frac{2 e s K}{(2K - 1)\, n} \right)^{2K - 1},$$
which, after further simplification, is $o\left( \frac{1}{n} \right)$ whenever $s = O\left( n^{1 - \Theta(1/K)} \right)$. Applying Theorem 2.1 gives the lower bound $\Omega\left( n^{1 - \Theta(1/K)} \right)$. For any constant $c < 1$, one can choose $K$ large enough such that $1 - \Theta(1/K) \geq c$, and this concludes our proof.
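As a numerical illustration of the indistinguishability phenomenon driving this proof (a toy experiment under our reconstruction of the construction, with $K = 2$, where $X$ and $Y$ match their first three moments):

```python
import math

# K = 2 instance: the first three moments of X and Y coincide.
X = {0: 0.125, 1: 0.75, 2: 0.125}   # supported on integers ("grained")
Y = {0.5: 0.5, 1.5: 0.5}            # supported on odd multiples of 1/2

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam**k / math.factorial(k)

def mixture_tv(rate, kmax=60):
    """TV distance between the mixtures E[Poi(rate*X)] and E[Poi(rate*Y)]."""
    return sum(abs(sum(w * poisson_pmf(rate * v, k) for v, w in X.items())
                   - sum(w * poisson_pmf(rate * v, k) for v, w in Y.items())) / 2
               for k in range(kmax))

for rate in (0.1, 0.25, 0.5, 1.0):  # rate plays the role of s/n per element
    print(rate, mixture_tv(rate))   # decays like rate**4 as rate -> 0
```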
3 The $\Omega\left(\frac{n}{\log n}\right)$ lower bound for $m$-grainedness (Proof of Theorem 1.1)
In this section, we will be proving the following theorem:
See Theorem 1.1.
Similar to the proof of Theorem 1.2, we will make use of Theorem 2.1 to prove our lower bound. Namely, we will construct two random variables $X$ and $Y$ such that they can be used to construct approximate probability vectors that are, respectively, $m$-grained and mostly (but not entirely) $m$-grained, and such that their first $\Theta(\log n)$ moments match. In particular, we will prove the following lemma:
Lemma 3.1.
For $L = \Theta(\log n)$ and $\rho = 4L^2$, there exist discrete random variables $X$ and $Y$ such that the following hold:
$$\Pr\left[ Y = \tfrac{1}{2} \right] = \Omega(1); \quad (12)$$
$$X \text{ is supported on nonnegative integers}; \quad (13)$$
$$\max\left( \mathrm{supp}(X) \cup \mathrm{supp}(Y) \right) = O(\log^2 n); \quad (14)$$
$$\mathbb{E}[X] = \mathbb{E}[Y] = O(1); \quad (15)$$
$$\mathbb{E}[X^{j}] = \mathbb{E}[Y^{j}] \text{ for all } 1 \leq j \leq L + 1. \quad (16)$$
3.1 Construction of $X$ and $Y$
The main focus of this subsection is to establish Lemma 3.1.
Fact 1.10 allows us to define two random variables $X$ and $Y$ with matching moments up to the degree of some polynomial – similar to the analysis in Section 2, by simply defining $X$ and $Y$ through its roots and partitioning them by the signs of the derivatives at those roots. However, the challenge here is that we need to match moments of much higher degree (in Section 2, we matched them up to a constant degree $2K - 1$; here we need to match them up to degree $\Theta(\log n)$), all while keeping every other condition unchanged: $X$ is $m$-grained and $Y$ is, for the most part, $m$-grained; the maximum value of the support does not blow up; and, more importantly, the expectation can still be bounded. Notably, if we try to match higher degree – say $\Theta(\log n)$ instead of a constant, by taking $K = \Theta(\log n)$ in Equation 11 – using the construction in Section 2, it fails miserably: the expectation will exceed any constant bound, and so when we take $n$ copies of $X$ or $Y$, the result would not be a distribution, even in an approximate sense.
To work around this and match degree (and, by implication, moments) as high as possible, we take the (shifted) Chebyshev polynomial of degree $L$, and we will later use properties of this polynomial to construct our prior variables $X$ and $Y$. Recall that the shifted Chebyshev polynomial of the first kind is defined as follows:
$$T_L(x) := \prod_{i=1}^{L} \left( x - \rho z_i \right), \quad (17)$$
where $z_i := \frac{1 + \cos\left( \frac{(2i - 1)\pi}{2L} \right)}{2}$, for $i \in [L]$, denotes the roots of the degree-$L$ (shifted) Chebyshev polynomial; and $\rho$ is a scaling parameter to be chosen in the analysis later.
However, as mentioned in the introduction, instead of using the shifted Chebyshev polynomial directly, we will be using a variant of it for our proof. For this purpose, we will define the polynomial $\widetilde{T}_L$, a slightly modified Chebyshev polynomial of degree $L$, by “rounding up to the nearest integer” its roots:
$$\widetilde{T}_L(x) := \prod_{i=1}^{L} \left( x - \widetilde{z}_i \right), \quad (18)$$
where $\widetilde{z}_i := \lceil \rho z_i \rceil$.
Putting these ideas together, we will focus on two polynomials, based on the Chebyshev polynomial and the modified Chebyshev polynomial respectively (obtained by appending the (modified) Chebyshev polynomial to the three roots $0$, $\frac{1}{2}$, and $1$):
$$q_1(x) := x \left( x - \tfrac{1}{2} \right) (x - 1) \cdot T_L(x), \quad (19)$$
$$q(x) := x \left( x - \tfrac{1}{2} \right) (x - 1) \cdot \widetilde{T}_L(x). \quad (20)$$
Through the polynomial $q$ in Equation 20, we can construct two random variables $X$ and $Y$, identified by their probability mass functions. Namely, we define $X$ as follows:
$$\Pr[X = r] := \frac{1}{C \cdot q'(r)}, \qquad r \in \{0, 1\} \cup \left\{ \widetilde{z}_i : q'(\widetilde{z}_i) > 0 \right\}. \quad (21)$$
Similarly, we define the random variable $Y$ as follows:
$$\Pr[Y = r] := \frac{1}{C \cdot |q'(r)|}, \qquad r \in \left\{ \tfrac{1}{2} \right\} \cup \left\{ \widetilde{z}_i : q'(\widetilde{z}_i) < 0 \right\}. \quad (22)$$
For both $X$ and $Y$, we assign probability $0$ to the roots not specified. Thus, by setting $k = 0$ in Fact 1.10 and noting that the negative and positive terms sum to $0$, the normalization factors are equal for both $X$ and $Y$, and can be expressed as follows:
Observation 3.2.
$$C = \sum_{r : q'(r) > 0} \frac{1}{q'(r)} = \sum_{r : q'(r) < 0} \frac{1}{|q'(r)|}. \quad (23)$$
The reasons we take the shifted Chebyshev polynomial and “round the roots up” are two-fold: first, this kind of construction (before rounding) was used to prove lower bounds of complexity $\Theta\left(\frac{n}{\log n}\right)$ before, by constructing a polynomial of degree $\Theta(\log n)$ with certain constraints [10]. Second, we need to make sure that the distributions derived from $X$ are $m$-grained with very high probability. Yet, using the shifted Chebyshev polynomial directly, some of the roots (the positive-derivative roots) contributing to $X$ would not be $m$-grained, and would carry non-trivial probability mass. Intuitively, we want to leverage the known properties of the shifted Chebyshev polynomial and argue that rounding up and creating $\widetilde{T}_L$ will only “mildly” change those properties.
Since we are deciding the supports of $X$ and $Y$ based on the sign of the derivative of $q$ at each of its roots, and arguing that $q$ is somewhat close to $q_1$, we need to first make sure that the signs of the derivative at the main three roots $0$, $\frac{1}{2}$, and $1$ remain the same as for $q_1$ (the signs of the derivative at the rest of the roots, the ones induced by $\widetilde{T}_L$, will not affect the argument). Note that the sign of the polynomial when evaluated at each of these three roots is not changed (same as $q_1$), as $\widetilde{T}_L$ and $T_L$ have the same sign there; we also have $\widetilde{z}_i > 1$ for every $i \in [L]$, via a constraint on the setting of $\rho$, and therefore we can conclude that
$$\mathrm{sign}\left( q'(r) \right) = \mathrm{sign}\left( q_1'(r) \right) \qquad \text{for } r \in \left\{ 0, \tfrac{1}{2}, 1 \right\}.$$
Next, we compute the corresponding derivatives of the two polynomials evaluated at the roots $0$, $\frac{1}{2}$, and $1$:
$$q_1'(0) = \tfrac{1}{2} T_L(0), \qquad q_1'\left( \tfrac{1}{2} \right) = -\tfrac{1}{4} T_L\left( \tfrac{1}{2} \right), \qquad q_1'(1) = \tfrac{1}{2} T_L(1); \quad (24)$$
$$q'(0) = \tfrac{1}{2} \widetilde{T}_L(0), \qquad q'\left( \tfrac{1}{2} \right) = -\tfrac{1}{4} \widetilde{T}_L\left( \tfrac{1}{2} \right), \qquad q'(1) = \tfrac{1}{2} \widetilde{T}_L(1). \quad (25)$$
Here, note that by construction, all the roots of $\widetilde{T}_L$ will be $m$-grained (see Equation 18). Moreover, $0$ and $1$ are $m$-grained roots of $q$ (see Equation 20). On the other hand, the root $\frac{1}{2}$ of the polynomial is not $m$-grained by definition.
In the context of Lemma 3.1, the random variable $X$ will be supported on the roots $0$, $1$, and a subset of the roots $\widetilde{z}_i$ of the modified Chebyshev polynomial from Equation 18, for $i \in [L]$. On the other hand, the random variable $Y$ will be supported over $\frac{1}{2}$ and a subset of the modified Chebyshev polynomial roots, disjoint from the subset supporting $X$. The supports of $X$ and $Y$ form a partition of all roots of $q$.
To prove Lemma 3.1, we need a few claims, which we state below and whose proofs can be found in the full version [12].
Claim 3.3 (Relation between the roots of $q_1$ and $q$).
(26)
(27)
The proofs of Equation 26 and Equation 27 can be found in Appendix A.1 of the full version (Claims A.2 to A.5). From now onwards, we will assume that Claim 3.3 holds.
For the lower bound, we will be choosing $L = \Theta(\log n)$ and $\rho = 4L^2 = \Theta(\log^2 n)$.
Observation 3.4.
Let $\rho z_i$ and $\widetilde{z}_i$ be the roots of the (scaled) Chebyshev and modified Chebyshev polynomials, for any $i \in [L]$. When $\rho \geq 4L^2$, we have that
$$\rho z_i \leq \widetilde{z}_i \leq 2 \rho z_i.$$
Proof.
Using Equation 2 (applied at $\pi - \frac{(2i-1)\pi}{2L}$), we can lower bound $\rho z_i$ for any $i \in [L]$,
$$\rho z_i = \frac{\rho}{2} \left( 1 + \cos \frac{(2i - 1)\pi}{2L} \right) \geq \frac{\rho}{2} \cdot \frac{2}{\pi^2} \left( \pi - \frac{(2i - 1)\pi}{2L} \right)^2 = \rho \cdot \frac{(2L - 2i + 1)^2}{4L^2} \geq 1, \quad (28)$$
where the last inequality follows from $(2L - 2i + 1)^2 \geq 1$ (which holds for all $i \in [L]$), and our assumption on $\rho$. Moreover, we also have:
$$\widetilde{z}_i = \lceil \rho z_i \rceil \leq \rho z_i + 1 \leq 2 \rho z_i,$$
where the last inequality follows from Equation 28. This completes the proof.
We need the following two claims (Claims 3.5 and 3.6), whose proofs are deferred to Appendix A.2 of the full version [12], to show that $\Pr\left[ Y = \frac{1}{2} \right] = \Omega(1)$ in Claim 3.7, as required by Equation 12.
Claim 3.5.
Suppose that Claim 3.3 holds and that $\rho \geq 4L^2$. Then we have,
Claim 3.6.
Suppose that $\rho \geq 4L^2$. For any $i \in [L]$, let $\widetilde{z}_i$ denote the $i$-th root of the modified Chebyshev polynomial $\widetilde{T}_L$. Then we have:
Claim 3.7.
$\Pr\left[ Y = \frac{1}{2} \right] = \Omega(1)$.
Proof.
Following the definition of the random variable $Y$ (Equation 22) and the normalization term $C$ (Equation 23), we can write $\Pr\left[ Y = \frac{1}{2} \right] = \frac{1}{C \cdot \left| q'\left( \frac{1}{2} \right) \right|}$, which we lower bound using Claims 3.5 and 3.6.
Thus, we can say that $\Pr\left[ Y = \frac{1}{2} \right] = \Omega(1)$.
This completes the proof.
Now we are ready to prove Lemma 3.1.
Proof of Lemma 3.1.
We now prove that the random variables $X$ and $Y$ defined in Section 3.1 satisfy Equations (12)–(16). We begin by setting $\rho = 4L^2$.
Proof of Equation (12).
Using Claim 3.7, we know that $\Pr\left[ Y = \frac{1}{2} \right] = \Omega(1)$.
Proof of Equation (14).
The largest values of $X$ and $Y$ are at most the largest root of the modified Chebyshev polynomial. Let us first upper bound the largest root of the (unmodified, scaled) Chebyshev polynomial:
$$\rho z_1 = \frac{\rho}{2} \left( 1 + \cos \frac{\pi}{2L} \right) \leq \rho \left( 1 - \frac{1}{4L^2} \right) \leq \rho,$$
where the first inequality is obtained via Equation 2. So, the largest root of the modified Chebyshev polynomial is:
$$\widetilde{z}_1 = \lceil \rho z_1 \rceil \leq \rho z_1 + 1 \leq \rho + 1 = O(\log^2 n).$$
The last equality follows from the fact that $\rho = 4L^2$. Plugging in the values of $\rho$ and $L$, we have the result.
Proof of Equation (15).
Recall that the random variable $X$ is supported on the roots $0$, $1$, and a subset of the roots of the modified Chebyshev polynomial (Equation 18). Let us start by computing an upper bound on $\mathbb{E}[Y]$.
Since Equation 16 holds, as we will prove below, we know that $\mathbb{E}[X] = \mathbb{E}[Y]$. This implies that the same upper bound holds for $\mathbb{E}[X]$ as well.
Proof of Equation (16).
This follows from Fact 1.10 applied to the polynomial $q$ (for each $k \in [L + 1]$), together with the construction of $X$ and $Y$, exactly as in the proof of Equation (9).
3.2 Putting it together: proof of the lower bound
We are now ready to prove the lower bound for $m$-grainedness testing (Theorem 1.1): we will combine the results from Lemma 3.1, Lemma 1.9, and Theorem 2.1 to establish our main lower bound.
Proof of Theorem 1.1.
Let us fix $L = \beta \log n$ for a suitable constant $\beta > 0$, and let $s$ be a parameter denoting the number of samples (we will later set it to $\Theta\left( \frac{n}{\log n} \right)$). From Lemma 3.1, we know that there exist two discrete random variables $X$ and $Y$ with the matching-moments property, supported on $[0, M]$ for $M = O(\log^2 n)$. Now we will use Lemma 1.9 for $V := \frac{s}{n} X$ and $V' := \frac{s}{n} Y$. Following Lemma 3.1 and Lemma 1.9, we have $\mathbb{E}[V^{j}] = \mathbb{E}[V'^{j}]$ for all $j \in [L + 1]$, and
$$d_{TV}\left( \mathbb{E}\left[ \mathrm{Poi}(V) \right], \mathbb{E}\left[ \mathrm{Poi}(V') \right] \right) \leq \left( \frac{2 e s M}{n (L + 1)} \right)^{L + 1}.$$
And we want this quantity to be small enough to satisfy Equation 5, as all other conditions have been matched in order to use Theorem 2.1.
For $s \leq \frac{n (L + 1)}{4 e M} = \Theta\left( \frac{n}{\log n} \right)$, we have
$$\left( \frac{2 e s M}{n (L + 1)} \right)^{L + 1} \leq 2^{-(L + 1)} = o\left( \frac{1}{n} \right)$$
(choosing $\beta$ large enough), and thus, for such $s$, we can invoke Theorem 2.1 and get the sample complexity lower bound of $\Omega\left( \frac{n}{\log n} \right)$ claimed in Theorem 1.1.
This concludes our proof.
4 Uniformity Testing in the DoHO Model
In this section, we prove the following result on parameterized uniformity testing in the DoHO model, improving on the known $\tilde{O}(m)$-query upper bound in the regime $n = o(m^{1/3})$.
Theorem 4.1.
For any integer $m \geq 1$ and constant $\varepsilon \in (0, 1]$, the property $\mathcal{U}_m$ of distributions over $\{0,1\}^n$ defined as
$$\mathcal{U}_m := \left\{ P : P \text{ is uniform over some } S \subseteq \{0,1\}^n \text{ with } |S| = m \right\}$$
can be $\varepsilon$-tested in the DoHO model using $\tilde{O}(m^{2/3})$ samples and $\tilde{O}(n \cdot m^{2/3})$ queries. Moreover, $\Omega(m^{2/3})$ samples are necessary.
Proof.
The algorithm itself is quite simple, and follows from combining existing algorithms for generalized uniformity testing in the standard sampling model [5, 17] with an additional “check” of the $\ell_2$ norm of the distribution: the tester is described in Algorithm 1. The query complexity follows trivially from the stated sample complexity, as $n$ queries per sample suffice to read each sample in full.
Specifically, the algorithm is as follows: (1) using a first batch of samples, estimate $\|P\|_2^2$ to within a multiplicative factor $\left( 1 \pm \frac{\varepsilon}{8} \right)$, and reject if the estimate is not within this factor of $\frac{1}{m}$; (2) run the generalized uniformity tester of [5] on a fresh batch of samples, and return its answer.
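In code, a hedged sketch of this two-step tester (our illustration of the steps just described, with illustrative thresholds and a collision-based $\ell_2$ estimate; generalized_uniformity_tester stands in for the algorithm of [5]) could look as follows:

```python
from collections import Counter

def test_parameterized_uniformity(samples_1, samples_2, m, eps,
                                  generalized_uniformity_tester):
    """Step 1: collision-based estimate of ||P||_2^2; reject if inconsistent
    with uniformity over m elements. Step 2: defer to a generalized
    uniformity tester run on a fresh batch of samples."""
    s = len(samples_1)
    counts = Counter(samples_1)
    collisions = sum(c * (c - 1) for c in counts.values())
    l2_sq_estimate = collisions / (s * (s - 1))   # unbiased for sum_x P(x)^2
    if not (1 - eps / 8) / m <= l2_sq_estimate <= (1 + eps / 8) / m:
        return "reject"
    return generalized_uniformity_tester(samples_2, eps / 2)
```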
We now argue correctness. By a union bound, both subroutines behave as they should with overall probability at least $2/3$: we hereafter assume their outputs are correct.
-
Completeness: Assume $P \in \mathcal{U}_m$. Then $\|P\|_2^2 = \frac{1}{m}$, so the estimate falls in the accepted range and the first step does not reject. Similarly, $P$ is uniform over some subset of size $m$, so the generalized uniformity tester (which is by definition a tester for $\mathcal{U}^{\star} \supseteq \mathcal{U}_m$) accepts. Overall, Algorithm 1 returns accept.
-
Soundness: By contrapositive, suppose Algorithm 1 returns accept. Since the second subroutine did not reject, it must be the case that $P$ is $\frac{\varepsilon}{2}$-close (in total variation distance) to a distribution $U_S$, uniform on some subset $S$ of a given size $s := |S|$. Moreover, the algorithm of [5] provides the extra guarantee that $s$ and $\frac{1}{\|P\|_2^2}$ are within a factor $\left( 1 \pm \frac{\varepsilon}{8} \right)$ of each other (see [5, Lemma 3.4]). (This is why we chose to use this specific algorithm, instead of that of [17], which has better sample complexity with respect to $\varepsilon$ but does not provide this extra guarantee.)
Now, let $U^{*}$ be a closest distribution to $P$ among uniform distributions over all subsets of size $m$:
$$d_{TV}(P, U^{*}) \leq d_{TV}(P, U_S) + d_{TV}(U_S, U^{*}) \leq \frac{\varepsilon}{2} + \left( 1 - \frac{\min(s, m)}{\max(s, m)} \right),$$
where the second inequality follows from the minimum TV distance between two uniform distributions on supports of sizes $s$ and $m$, which occurs when the supports of the distributions overlap as much as possible, and is equal to $1 - \frac{\min(s, m)}{\max(s, m)}$. In particular, we have $d_{TV}(P, U^{*}) \leq \varepsilon$ if
$$\frac{\min(s, m)}{\max(s, m)} \geq 1 - \frac{\varepsilon}{2}.$$
Now, to see why this holds, observe that by our first check, $\|P\|_2^2$ is within a factor $\left( 1 \pm \frac{\varepsilon}{8} \right)$ of $\frac{1}{m}$, and, by the additional guarantee of the [5] tester (adjusting the constants in the setting of its parameters), within a factor $\left( 1 \pm \frac{\varepsilon}{8} \right)$ of $\frac{1}{s}$. This implies that $s$ and $m$ are within a factor $\left( 1 \pm \frac{\varepsilon}{2} \right)$ of each other, and thus that $\frac{\min(s, m)}{\max(s, m)} \geq 1 - \frac{\varepsilon}{2}$. Overall, this establishes that $d_{TV}(P, U^{*}) \leq \varepsilon$, and therefore, by Observation 1.7, $d_{\mathrm{EMD}}(P, U^{*}) \leq \varepsilon$ as well.
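For completeness, the value $1 - \frac{\min(s, m)}{\max(s, m)}$ used above can be verified directly: for $s \leq m$ and nested supports $S \subseteq T$ with $|T| = m$,
$$d_{TV}(U_S, U_T) = \frac{1}{2} \left( \sum_{x \in S} \left( \frac{1}{s} - \frac{1}{m} \right) + \sum_{x \in T \setminus S} \frac{1}{m} \right) = 1 - \frac{s}{m} = 1 - \frac{\min(s, m)}{\max(s, m)},$$
and any non-nested pair of supports can only increase the distance.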
Finally, the claimed $\Omega(m^{2/3})$ lower bound follows directly from the lower bound on generalized uniformity testing of [17], which holds even when the target size of the support is given.
5 Conclusion and open problems
In this work, we studied the problem of testing $m$-grainedness of distributions. We established a new lower bound of $\Omega\left( \frac{n}{\log n} \right)$ samples, improving on the previous lower bound of $\Omega(n^c)$ (for any constant $c < 1$) due to [22], thereby settling a conjecture by [22]. Along the way, we also obtained an alternative, simpler proof of the $\Omega(n^c)$ lower bound. By leveraging a reduction between the testing models due to [23], our result implies an optimal (up to logarithmic factors) lower bound for uniformity testing in the DoHO (Distributions over Huge Objects) model, settling another conjecture of [23]. Finally, we provided a simple tester for uniformity testing in this DoHO model, with improved sample complexity for a large range of the parameters.
Our work leaves open several new avenues of research in this context; we list two of them below:
-
(i)
Our lower bound for $m$-grainedness testing establishes the optimal dependence on the parameter $n$, for constant proximity parameter $\varepsilon$. It would be interesting to fully characterize the complexity of the question, including the dependence on $\varepsilon$.
-
(ii)
From [23], it is known that testing uniformity over an $m$-element subset of $\{0,1\}^n$ requires $\Omega\left( \frac{m}{\log m} \right)$ queries, for sufficiently large $n$; and that $\tilde{O}(m)$ queries are sufficient. Our own tester (Theorem 4.1) shows an upper bound of $\tilde{O}(n \cdot m^{2/3})$ queries, better in the regime $n = o(m^{1/3})$. This leaves understanding the landscape of parameterized uniformity testing, especially in the intermediate regimes of $n$, as an open and intriguing question.
References
- [1] Jayadev Acharya, Constantinos Daskalakis, and Gautam Kamath. Optimal testing for properties of distributions. Advances in Neural Information Processing Systems (NeurIPS), 28, 2015.
- [2] Tomer Adar and Eldar Fischer. Refining the adaptivity notion in the huge object model. In International Conference on Randomization and Computation (RANDOM), pages 45:1–45:16, 2024. doi:10.4230/LIPICS.APPROX/RANDOM.2024.45.
- [3] Tomer Adar, Eldar Fischer, and Amit Levi. Support testing in the huge object model. In International Conference on Randomization and Computation (RANDOM), pages 46:1–46:16, 2024. doi:10.4230/LIPICS.APPROX/RANDOM.2024.46.
- [4] Sivaraman Balakrishnan and Larry Wasserman. Hypothesis testing for high-dimensional multinomials: A selective review, 2018.
- [5] Tugkan Batu and Clément L Canonne. Generalized uniformity testing. In Symposium on Foundations of Computer Science (FOCS), pages 880–889, 2017. doi:10.1109/FOCS.2017.86.
- [6] Tugkan Batu, Eldar Fischer, Lance Fortnow, Ravi Kumar, Ronitt Rubinfeld, and Patrick White. Testing random variables for independence and identity. In Symposium on Foundations of Computer Science (FOCS), pages 442–451, 2001. doi:10.1109/SFCS.2001.959920.
- [7] Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D Smith, and Patrick White. Testing that distributions are close. In Symposium on Foundations of Computer Science (FOCS), pages 259–269, 2000. doi:10.1109/SFCS.2000.892113.
- [8] Clément L Canonne. A survey on distribution testing: Your data is big. But is it blue? Theory of Computing, pages 1–100, 2020.
- [9] Clément L Canonne. Topics and techniques in distribution testing: A biased but representative sample. Foundations and Trends® in Communications and Information Theory, pages 1032–1198, 2022. doi:10.1561/0100000114.
- [10] Clément L Canonne, Ilias Diakonikolas, Daniel Kane, and Sihan Liu. Nearly-tight bounds for testing histogram distributions. Advances in Neural Information Processing Systems (NeurIPS), 35:31599–31611, 2022.
- [11] Clément L Canonne, Ayush Jain, Gautam Kamath, and Jerry Li. The price of tolerance in distribution testing. In Conference on Learning Theory (COLT), pages 573–624, 2022. URL: https://proceedings.mlr.press/v178/canonne22a.html.
- [12] Clément L. Canonne, Sayantan Sen, and Joy Qiping Yang. Settling the complexity of testing grainedness of distributions, and application to uniformity testing in the huge object model. ECCC preprint, 2024. URL: https://eccc.weizmann.ac.il/report/2024/196.
- [13] Sourav Chakraborty, Eldar Fischer, Arijit Ghosh, Gopinath Mishra, and Sayantan Sen. Testing of index-invariant properties in the huge object model. In Conference on Learning Theory (COLT), pages 3065–3136, 2023. URL: https://proceedings.mlr.press/v195/chakraborty23a.html.
- [14] Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Sample-optimal identity testing with high probability. In International Colloquium on Automata, Languages, and Programming (ICALP), 2018.
- [15] Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Collision-based testers are optimal for uniformity and closeness. Chic. J. Theor. Comput. Sci, 25:1–21, 2019. URL: http://cjtcs.cs.uchicago.edu/articles/2019/1/contents.html.
- [16] Ilias Diakonikolas and Daniel M Kane. A new approach for testing properties of discrete distributions. In Symposium on Foundations of Computer Science (FOCS), pages 685–694, 2016. doi:10.1109/FOCS.2016.78.
- [17] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Sharp bounds for generalized uniformity testing. Advances in Neural Information Processing Systems (NeurIPS), 2018.
- [18] Oded Goldreich. The uniform distribution is complete with respect to testing identity to a fixed distribution. In Electronic Colloquium on Computational Complexity (ECCC), page 1, 2016.
- [19] Oded Goldreich. Introduction to property testing. Cambridge University Press, 2017. doi:10.1017/9781108135252.
- [20] Oded Goldreich, Shafi Goldwasser, and Dana Ron. Property testing and its connection to learning and approximation. Journal of the ACM (JACM), pages 653–750, 1998. doi:10.1145/285055.285060.
- [21] Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. Electron. Colloquium Comput. Complex., TR00-020, 2000. URL: https://eccc.weizmann.ac.il/eccc-reports/2000/TR00-020/index.html.
- [22] Oded Goldreich and Dana Ron. A lower bound on the complexity of testing grained distributions. Computational Complexity (CC), page 11, 2023. doi:10.1007/S00037-023-00245-W.
- [23] Oded Goldreich and Dana Ron. Testing distributions of huge objects. In TheoretiCS, page 78, 2023.
- [24] metamorphy. Proving a formula for n-th degree polynomial with n distinct real roots, 2021. URL: https://math.stackexchange.com/questions/4074098/proving-a-formula-for-sum-j-1n-fracx-jkfx-j-for-f-an-n-th-degr.
- [25] Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, pages 4750–4755, 2008. doi:10.1109/TIT.2008.928987.
- [26] Sofya Raskhodnikova, Dana Ron, Amir Shpilka, and Adam Smith. Strong lower bounds for approximating distribution support size and the distinct elements problem. SIAM Journal on Computing (SICOMP), pages 813–842, 2009. doi:10.1137/070701649.
- [27] Ronitt Rubinfeld and Madhu Sudan. Robust characterizations of polynomials with applications to program testing. SIAM Journal on Computing (SICOMP), pages 252–271, 1996. doi:10.1137/S0097539793255151.
- [28] Gregory Valiant and Paul Valiant. Estimating the unseen: improved estimators for entropy and other properties. Journal of the ACM (JACM), pages 1–41, 2017. doi:10.1145/3125643.
- [29] Paul Valiant. Testing symmetric properties of distributions. SIAM Journal on Computing (SICOMP), pages 1927–1968, 2011. doi:10.1137/080734066.
- [30] Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, pages 3702–3720, 2016. doi:10.1109/TIT.2016.2548468.
- [31] Yihong Wu and Pengkun Yang. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. The Annals of Statistics, pages 857–883, 2019.
- [32] Yihong Wu, Pengkun Yang, et al. Polynomial methods in statistical inference: theory and practice. Foundations and Trends® in Communications and Information Theory, pages 402–586, 2020.