Improving and extending the testing of distributions for shape-restricted properties

Distribution testing deals with what information can be deduced about an unknown distribution over $\{1,\ldots,n\}$, where the algorithm is only allowed to obtain a relatively small number of independent samples from the distribution. In the extended conditional sampling model, the algorithm is also allowed to obtain samples from the restriction of the original distribution on subsets of $\{1,\ldots,n\}$. In 2015, Canonne, Diakonikolas, Gouleakis and Rubinfeld unified several previous results, and showed that for any property of distributions satisfying a "decomposability" criterion, there exists an algorithm (in the basic model) that can distinguish with high probability distributions satisfying the property from distributions that are far from it in the variation distance. We present here a more efficient yet simpler algorithm for the basic model, as well as very efficient algorithms for the conditional model, which until now was not investigated under the umbrella of decomposable properties. Additionally, we provide an algorithm for the conditional model that handles a much larger class of properties. Our core mechanism is a way of efficiently producing an interval-partition of $\{1,\ldots,n\}$ that satisfies a "fine-grain" quality. We show that with such a partition at hand we can directly move forward with testing individual intervals, instead of first searching for the "correct" partition of $\{1,\ldots,n\}$.


Historical background
In most computational problems that arise from modeling real-world situations, we are required to analyze large amounts of data to decide if it has a fixed property. The amount of data involved is usually too large for reading it in its entirety, both with respect to time and storage. In such situations, it is natural to ask for algorithms that can sample points from the data and obtain a significant estimate for the property of interest. The area of property testing addresses this issue by studying algorithms that look at a small part of the data and then decide if the object that generated the data has the property or is far (according to some metric) from having the property.
There has been a long line of research, especially in statistics, where the underlying object from which we obtain the data is modeled as a probability distribution. Here the algorithm is only allowed to ask for independent samples from the distribution, and has to base its decision on them. If the support of the underlying probability distribution is large, it is not practical to approximate the entire distribution. Thus, it is natural to study this problem in the context of property testing.
The specific sub-area of property testing that is dedicated to the study of distributions is called distribution testing. There, the input is a probability distribution (in this paper the domain is the set {1, 2, . . . , n}) and the objective is to distinguish whether the distribution has a certain property, such as uniformity or monotonicity, or is far in ℓ 1 distance from it. See [Can15] for a survey about the realm of distribution testing.
Testing properties of distributions was studied by Batu et al. in [BFR+00], where they gave a sublinear query algorithm for testing closeness of distributions supported over the set {1, 2, . . . , n}. They extended the idea of collision counting, which was implicitly used for uniformity testing in the work of Goldreich and Ron ([GR00]). Subsequently, various properties of probability distributions were studied, like testing identity with a known distribution ([BFF+01, VV14, ADK15, DK16]), testing independence of a distribution over a product space ([BFF+01, ADK15]), and testing k-wise independence ([AAK+07]).
In recent years, distribution testing has been extended beyond the classical model. A new model called the conditional sampling model was introduced. It first appeared independently in [CRS15] and [CFGM13]. In the conditional sampling model, the algorithm queries the input distribution µ with a set S ⊆ {1, 2, . . . , n}, and gets an index sampled according to µ conditioned on the set S. Notice that if S = {1, 2, . . . , n}, then this is exactly like in the standard model. The conditional sampling model allows adaptive querying of µ, since we can choose the set S based on the indexes sampled until now. Chakraborty et al ( [CFGM13]) and Canonne et al ( [CRS15]) showed that testing uniformity can be done with a number of queries not depending on n (the latter presenting an optimal test), and investigated the testing of other properties of distributions. In [CFGM13], it is also shown that uniformity can be tested with poly(log n) conditional samples by a non-adaptive algorithm. In this work, we study the distribution testing problem in the standard (unconditional) sampling model, as well as in the conditional model.
A line of work which is central to our paper, is the testing of distributions for structure. The objective is to test whether a given distribution has some structural properties like being monotone ( [BKR04]), being a k-histogram ( [ILR12,DK16]), or being log-concave ( [ADK15]). Canonne et al ( [CDGR15]) unified these results to show that if a property of distributions has certain structural characteristics, then membership in the property can be tested efficiently using samples from the distribution. More precisely, they introduced the notion of L-decomposable distributions as a way to unify various algorithms for testing distributions for structure. Informally, an L-decomposable distribution µ supported over {1, 2, . . . , n} is one that has an interval partition I of {1, 2, . . . , n} of size bounded by L, such that for every interval I, either the weight of µ on it is small or the distribution over the interval is close to uniform. A property C of distributions is L-decomposable if every distribution µ ∈ C is L-decomposable (L is allowed to depend on n). This generalizes various properties of distributions like being monotone, unimodal, log-concave etc. In this setting, their result for a set of distributions C supported over {1, 2, . . . , n} translates to the following: if every distribution µ from C is L-decomposable, then there is an efficient algorithm for testing whether a given distribution belongs to the property C.
To achieve their results, Canonne et al. ([CDGR15]) show that if a distribution µ supported over [n] is L-decomposable, then it is O(L log n)-decomposable where the intervals are of the form $[j2^i + 1, (j + 1)2^i]$. This presents a natural approach of computing the interval partition in a recursive manner, by bisecting an interval if it has a large probability weight and is not close to uniform. Once they get an interval partition, they learn the "flattening" of the distribution over this partition, and check if this distribution is close to the property C. The term "flattening" refers to the distribution resulting from making µ conditioned on any interval of the partition to be uniform. When applied to a partition corresponding to a decomposition of the distribution, the learned flattening is also close to the original distribution. Because of this, in the case where there is a promise that µ is L-decomposable, the above can be viewed as a learning algorithm, where they obtain an explicit distribution that is close to µ. Without the promise it can be viewed as an agnostic learning algorithm. For further elaboration of this connection see [Dia16].

Results and techniques
In this paper, we extend the body of knowledge about testing L-decomposable properties. We improve upon the previously known bound on the sample complexity, and give much better bounds when conditional samples are allowed. Additionally, for the conditional model, we provide a test for a broader family of properties, that we call atlas-characterizable properties.
Our approach differs from that of [CDGR15] in the manner in which we compute the interval partition. We show that a partition where most intervals that are not singletons have small probability weight is sufficient to learn the distribution µ, even though it is not the original L-decomposition of µ. We show that if a distribution µ is L-decomposable, then the "flattening" of µ with respect to this partition is close to µ. It turns out that such a partition can be obtained in "one shot" without resorting to a recursive search procedure.
We obtain a partition as above using a method of partition pulling that we develop here. Informally, a pulled partition is obtained by sampling indexes from µ, and taking the partition induced by the samples in the following way: each sampled index is a singleton interval, and the rest of the partition is composed of the maximal intervals between sampled indexes. Apart from the obvious simplicity of this procedure, it also has the advantage of providing a partition with a significantly smaller number of intervals, linear in L for a fixed ǫ, and with no dependency on n unless L itself depends on it. This makes our algorithm more efficient in query complexity than the one of [CDGR15] in the unconditional sampling model, and leads to a dramatically small sampling complexity in the (adaptive) conditional model.
Another feature of the partition pulling method is that it provides a partition with small weight intervals also when the distribution is not L-decomposable. This allows us to use the partition in a different manner later on, in the algorithm for testing atlas characterizable properties using conditional samples.
The main common ground between our approach for L-decomposable properties and that of [CDGR15] is the method of testing by implicit learning, as defined formally in [DLM + 07] (see [Ser10]). In particular, the results also provide a means to learn a distribution close to µ if µ satisfies the tested property. We also provide a test under the conditional query model for the extended class of atlas characterizable properties that we define below, which generalizes both decomposable properties and symmetric properties. A learning algorithm for this class is not provided; only an "atlas" of the input distribution rather than the distribution itself is learned.
Our result for unconditional testing (Theorem 7.4) gives a $\sqrt{nL}/\mathrm{poly}(\epsilon)$ query algorithm in the standard (unconditional) sampling model for testing an L-decomposable property of distributions. Our method of finding a good partition for µ using pulled partitions, that we explained above, avoids the log n factor present in Theorem 3.3 of [CDGR15]. We also avoid the additional $O(L^2)$ additive term present there. The same method enables us to extend our results to the conditional query model, which we present for both adaptive and non-adaptive algorithms. Table 1 summarizes our results and compares them with known lower bounds.
We study the problem of testing properties of probability distributions supported over [n], when given samples from the distribution. For two distributions µ and χ, we say that µ is ǫ-far from χ if they are far in the $\ell_1$ norm, that is, $d(\mu, \chi) = \sum_{i \in [n]} |\mu(i) - \chi(i)| > \epsilon$. For a property of distributions C, we say that µ is ǫ-far from C if for all χ ∈ C, d(µ, χ) > ǫ.
Definition 2.1. For a distribution µ over a domain I, we define the bias of µ to be $\mathrm{bias}(\mu) = \max_{i \in I} \mu(i) / \min_{i \in I} \mu(i) - 1$.
The following observation is easy and will be used implicitly throughout.
Observation 2.2. For any two distributions µ and χ over a domain I of size m, $d(\mu, \chi) \le m \|\mu - \chi\|_\infty$. Also, $\|\mu - U_I\|_\infty \le \frac{1}{m} \mathrm{bias}(\mu)$, where $U_I$ denotes the uniform distribution over I.
Proof. Follows from the definitions.
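As a small illustration of Definition 2.1 and Observation 2.2, the following sketch computes the distance and the bias for an explicitly given distribution and checks both inequalities on one example. The function names and the list representation of a distribution are assumptions made here for illustration only.

```python
from typing import Sequence


def l1_distance(mu: Sequence[float], chi: Sequence[float]) -> float:
    """d(mu, chi) = sum over i of |mu(i) - chi(i)|."""
    return sum(abs(a - b) for a, b in zip(mu, chi))


def linf_distance(mu: Sequence[float], chi: Sequence[float]) -> float:
    """Maximum pointwise difference between mu and chi."""
    return max(abs(a - b) for a, b in zip(mu, chi))


def bias(mu: Sequence[float]) -> float:
    """bias(mu) = max mu(i) / min mu(i) - 1; infinite if some mu(i) is 0."""
    smallest = min(mu)
    return float("inf") if smallest == 0 else max(mu) / smallest - 1


mu = [0.3, 0.2, 0.25, 0.25]
uniform = [1 / len(mu)] * len(mu)
m = len(mu)

# First part of Observation 2.2: d(mu, chi) <= m * ||mu - chi||_inf.
first_holds = l1_distance(mu, uniform) <= m * linf_distance(mu, uniform)
# Second part: ||mu - U||_inf <= bias(mu) / m.
second_holds = linf_distance(mu, uniform) <= bias(mu) / m
```
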
We study the problem, both in the standard model, where the algorithm is given indexes sampled from the distribution, as well as in the model of conditional samples. The conditional model was first studied in the independent works of Chakraborty et al ( [CFGM13]) and Canonne et al ( [CRS15]). We first give the definition of a conditional oracle for a distribution µ.

Definition 2.3 (Conditional oracle). A conditional oracle to a distribution µ supported over [n] is a black-box that takes as input a set A ⊆ [n], samples a point i ∈ A with probability $\mu(i) / \sum_{j \in A} \mu(j)$, and returns i. If µ(j) = 0 for all j ∈ A, then it chooses i ∈ A uniformly at random.
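The oracle of Definition 2.3 can be simulated when µ is known explicitly, which is useful for experimenting with conditional testers. The following is a minimal sketch under that assumption; the name `conditional_sample` and the representation of µ as a list indexed from 1 to n are hypothetical choices, not part of the model.

```python
import random
from typing import Iterable, Optional, Sequence


def conditional_sample(
    mu: Sequence[float],
    subset: Iterable[int],
    rng: Optional[random.Random] = None,
) -> int:
    """Return i in `subset` with probability mu(i) / mu(subset).

    Indices are 1-based, so mu(i) is stored at mu[i - 1]. If the whole
    subset has weight zero, a uniformly random element is returned.
    """
    if rng is None:
        rng = random.Random()
    elements = sorted(set(subset))
    if not elements:
        raise ValueError("the query set must be non-empty")
    weights = [mu[i - 1] for i in elements]
    if sum(weights) == 0:
        return rng.choice(elements)
    return rng.choices(elements, weights=weights, k=1)[0]
```

Querying with the full set {1, . . . , n} reduces to an ordinary unconditional sample.
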
Remark. The behaviour of the conditional oracle on sets A with µ(A) = 0 is as per the model of Chakraborty et al [CFGM13]. However, upper bounds in this model also hold in the model of Canonne et al [CRS15], and most lower bounds can be easily converted to it. Now we define conditional distribution testing algorithms. We will define and analyze both adaptive and non-adaptive conditional testing algorithms.
Definition 2.4. An adaptive conditional distribution testing algorithm for a property of distributions C, with parameters ǫ, δ > 0, and n ∈ N, with query complexity q(ǫ, δ, n), is a randomized algorithm with access to a conditional oracle of a distribution µ with the following properties:
• For each i ∈ [q], at the i-th phase, the algorithm generates a set $A_i \subseteq [n]$, based on $j_1, j_2, \ldots, j_{i-1}$ and its internal coin tosses, and calls the conditional oracle with $A_i$ to receive an element $j_i$, drawn independently of $j_1, j_2, \ldots, j_{i-1}$.
• Based on the received elements $j_1, j_2, \ldots, j_q$ and its internal coin tosses, the algorithm accepts or rejects the distribution µ.
If µ ∈ C, then the algorithm accepts with probability at least 1 − δ, and if µ is ǫ-far from C, then the algorithm rejects with probability at least 1 − δ.
Definition 2.5. A non-adaptive conditional distribution testing algorithm for a property of distributions C, with parameters ǫ, δ > 0, and n ∈ N, with query complexity q(ǫ, δ, n), is a randomized algorithm with access to a conditional oracle of a distribution µ with the following properties:
• The algorithm chooses sets $A_1, \ldots, A_q$ (not necessarily distinct) based on its internal coin tosses, and then queries the conditional oracle to respectively obtain $j_1, \ldots, j_q$.
• Based on the received elements $j_1, \ldots, j_q$ and its internal coin tosses, the algorithm accepts or rejects the distribution µ.
If µ ∈ C, then the algorithm accepts with probability at least 1 − δ, and if µ is ǫ-far from C, then the algorithm rejects with probability at least 1 − δ.

Large deviation bounds
The following large deviation bounds will be used in the analysis of our algorithms through the rest of the paper.

Basic distribution procedures
The following is a folklore result about learning any distribution supported over [n], that we prove here for completeness.
Proof. Take $m = \frac{n + \log(2/\delta)}{2\epsilon^2}$ samples from µ, and for each i ∈ [n], let $m_i$ be the number of times i was sampled. Define $\mu'(i) = m_i/m$. Now, we show that $\max_{S \subseteq [n]} |\mu(S) - \mu'(S)| \le \epsilon/2$. The lemma follows from this since the $\ell_1$ distance is equal to twice this amount.
Proof. Take $m = \frac{\log(2n/\delta)}{2\epsilon^2}$ samples from µ, and for each i ∈ [n], let $m_i$ be the number of times i was sampled. For each i ∈ [n], define $\mu'(i) = m_i/m$.
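Both proofs above use the same empirical estimator and differ only in the number of samples. The following sketch implements it with the sample count taken from the first proof. The sampler interface `sample()` (a zero-argument callable returning an index in [n]) and the function name are assumptions made for illustration, and log is read as the natural logarithm.

```python
import math
import random
from collections import Counter
from typing import Callable, List


def learn_empirical(
    sample: Callable[[], int], n: int, eps: float, delta: float
) -> List[float]:
    """Estimate mu over {1, ..., n} by empirical frequencies.

    Uses m = ceil((n + log(2/delta)) / (2 * eps**2)) samples and
    returns mu' with mu'(i) = m_i / m stored at position i - 1.
    """
    m = math.ceil((n + math.log(2 / delta)) / (2 * eps ** 2))
    counts = Counter(sample() for _ in range(m))
    return [counts[i] / m for i in range(1, n + 1)]


# Usage with an explicitly known distribution, for experimentation only.
rng = random.Random(1)
true_mu = [0.5, 0.3, 0.2]
estimate = learn_empirical(
    lambda: rng.choices([1, 2, 3], weights=true_mu, k=1)[0],
    n=3, eps=0.1, delta=0.01,
)
```
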

Fine partitions and how to pull them
We define the notion of η-fine partitions of a distribution µ supported over [n], which are central to all our algorithms.
Definition 3.1 (η-fine interval partition). Given a distribution µ over [n], an η-fine interval partition of µ is an interval-partition $\mathcal{I} = (I_1, I_2, \ldots, I_r)$ of [n] such that for all j ∈ [r], $\mu(I_j) \le \eta$, excepting the case $|I_j| = 1$. The length $|\mathcal{I}|$ of an interval partition $\mathcal{I}$ is the number of intervals in it.
Algorithm 1 below is the pulling mechanism. The idea is to take independent unconditional samples from µ, make them into singleton intervals in our interval-partition I, and then take the intervals between these samples as the remaining intervals in I.
Algorithm 1: Pulling an η-fine partition
Input: Distribution µ supported over [n], parameters η > 0 (fineness) and δ > 0 (error probability)
1 Take $m = \frac{3}{\eta} \log \frac{3}{\eta\delta}$ unconditional samples from µ
2 Arrange the indices sampled in increasing order $i_1 < i_2 < \cdots < i_r$ without repetition and set $i_0 = 0$
3 for j = 1, . . . , r do
4 if $i_j > i_{j-1} + 1$ then add the interval $\{i_{j-1} + 1, \ldots, i_j - 1\}$ to I
5 Add the singleton interval $\{i_j\}$ to I
6 if $i_r < n$ then add the interval $\{i_r + 1, \ldots, n\}$ to I
7 return I
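The pulling mechanism of Algorithm 1 can be sketched in a few lines. This is an illustrative rendering, assuming a zero-argument sampler `sample()` that returns an index in [n], intervals represented as inclusive pairs (lo, hi), and log read as the natural logarithm; none of these conventions are fixed by the text.

```python
import math
from typing import Callable, List, Tuple

Interval = Tuple[int, int]  # inclusive endpoints (lo, hi)


def pull_partition(
    sample: Callable[[], int], n: int, eta: float, delta: float
) -> List[Interval]:
    """Pull an interval partition of {1, ..., n} from unconditional samples."""
    m = math.ceil((3 / eta) * math.log(3 / (eta * delta)))
    # Sampled indices in increasing order, without repetition.
    points = sorted({sample() for _ in range(m)})
    partition: List[Interval] = []
    prev = 0
    for i in points:
        if i > prev + 1:
            # Maximal gap strictly between two consecutive sampled indices.
            partition.append((prev + 1, i - 1))
        partition.append((i, i))  # each sampled index is a singleton
        prev = i
    if prev < n:
        partition.append((prev + 1, n))
    return partition
```

For instance, if the only indices ever sampled are 3 and 7 and n = 10, the result is the partition {1, 2}, {3}, {4, 5, 6}, {7}, {8, 9, 10}.
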
Lemma 3.2. Let µ be a distribution that is supported over [n], and η, δ > 0, and suppose that these are fed to Algorithm 1. Then, with probability at least 1 − δ, the set of intervals I returned by Algorithm 1 is an η-fine interval partition of µ of length $O\left(\frac{1}{\eta} \log \frac{1}{\eta\delta}\right)$.
Proof. Let I be the set of intervals returned by Algorithm 1. The guarantee on the length of I follows from the number of samples taken in Step 1, noting that |I| ≤ 2r + 1 = O(m). Let J be a maximal set of pairwise disjoint minimal intervals I in [n], such that µ(I) ≥ η/3 for every interval I ∈ J . Note that every i for which µ(i) ≥ η/3 necessarily appears as a singleton interval {i} ∈ J . Also clearly |J | ≤ 3/η.
We shall first show that if an interval I ′ is such that µ(I ′ ) ≥ η, then it fully contains some interval I ∈ J . Then, we shall show that, with probability at least 1 − δ, the samples taken in Step 1 include an index from every interval I ∈ J . By Steps 2 to 6 of the algorithm and the above, this implies the statement of the lemma.
Let I ′ be an interval such that µ(I ′ ) ≥ η, and assume on the contrary that it contains no interval from J . Clearly it may intersect, without containing, at most two intervals $I_l, I_r$ ∈ J . Also, $\mu(I' \cap I_l) < \eta/3$ because otherwise we could have replaced $I_l$ with $I' \cap I_l$ in J , and the same holds for $\mu(I' \cap I_r)$. But this means that $\mu(I' \setminus (I_l \cup I_r)) > \eta/3$, and so we could have added $I' \setminus (I_l \cup I_r)$ to J , again a contradiction.
Let I ∈ J . The probability that an index from I is not sampled is at most $(1 - \eta/3)^{3\log(3/\eta\delta)/\eta} \le \delta\eta/3$. By a union bound over all I ∈ J , with probability at least 1 − δ the samples taken in Step 1 include an index from every interval in J .
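For completeness, the estimate in the last step can be spelled out as follows. This is a short worked derivation, assuming that log denotes the natural logarithm and using only the inequality $1 - x \le e^{-x}$ together with the bound |J | ≤ 3/η noted above.

```latex
\left(1 - \frac{\eta}{3}\right)^{\frac{3}{\eta}\log\frac{3}{\eta\delta}}
  \le \exp\left(-\frac{\eta}{3} \cdot \frac{3}{\eta}\log\frac{3}{\eta\delta}\right)
  = \frac{\eta\delta}{3},
\qquad
|\mathcal{J}| \cdot \frac{\eta\delta}{3}
  \le \frac{3}{\eta} \cdot \frac{\eta\delta}{3}
  = \delta .
```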
The following is a definition of a variation of a fine partition, where we allow some intervals of small total weight to violate the original requirements. In our applications, γ will be larger than η by a factor of L, which would allow us through the following Algorithm 2 to avoid having additional log L factors in our expressions for the unconditional and the adaptive tests.
Algorithm 2: Pulling an (η, γ)-fine partition
Input: Distribution µ supported over [n], parameters η, γ > 0 (fineness) and δ > 0 (error probability)
1 Take $m = \frac{3}{\eta} \log \frac{5}{\gamma\delta}$ unconditional samples from µ
2 Perform Step 2 through Step 6 of Algorithm 1
3 return I
Lemma 3.4. Let µ be a distribution that is supported over [n], and γ, η, δ > 0, and suppose that these are fed to Algorithm 2. Then, with probability at least 1 − δ, the set of intervals I returned by Algorithm 2 is an (η, γ)-fine interval partition of µ of length $O\left(\frac{1}{\eta} \log \frac{1}{\gamma\delta}\right)$.
Proof. Let I be the set of intervals returned by Algorithm 2. The guarantee on the length of I follows from the number of samples taken in Step 1.
As in the proof of Lemma 3.2, let J be a maximal set of pairwise disjoint minimal intervals I in [n], such that µ(I) ≥ η/3 for every interval I ∈ J . Here, also define the set J ′ to be the set of maximal intervals in $[n] \setminus \bigcup_{I \in \mathcal{J}} I$. Note now that J ∪ J ′ is an interval partition of [n]. Note also that between every two consecutive intervals of J ′ lies an interval of J . Finally, since J is maximal, all intervals in J ′ are of weights less than η/3.
Recalling the definition $H_{\mathcal{I}} = \{I \in \mathcal{I} : \mu(I) > \eta, |I| > 1\}$, as in the proof of Lemma 3.2 every $I' \in H_{\mathcal{I}}$ must fully contain an interval from J from which no point was sampled. Moreover, I ′ may not fully contain intervals from J from which any points were sampled.
Note furthermore that the weight of such an I ′ is not more than 5 times the total weight of the intervals in J that it fully contains. To see this, recall that the at most two intervals from J that intersect I ′ without being contained in it have intersections of weight not more than η/3. Also, there may be intervals of J ′ intersecting I ′ , each of weight at most η/3. However, because there is an interval of J between any two consecutive intervals of J ′ , the number of intervals from J ′ intersecting I ′ is at most 1 more than the number of intervals of J fully contained in I ′ . Thus the number of intersecting intervals from J ∪ J ′ is not more than 5 times the number of fully contained intervals from J , and together with their weight bounds we get the bound on µ(I ′ ).
Let I ∈ J . The probability that an index from I is not sampled is at most $(1 - \eta/3)^{3\log(5/\gamma\delta)/\eta} \le \delta\gamma/5$. By applying the Markov bound over all I ∈ J (along with their weights), with probability at least 1 − δ the samples taken in Step 1 include an index from every interval in J but at most a subset of them of total weight at most γ/5. By the above this means that $\sum_{I \in H_{\mathcal{I}}} \mu(I) \le \gamma$.

Handling decomposable distributions
The notion of L-decomposable distributions was defined and studied in [CDGR15]. They showed that a large class of properties, such as monotonicity and log-concavity, are L-decomposable. We now formally define L-decomposable distributions and properties, as given in [CDGR15].
The second condition in the definition of a (γ, L)-decomposable distribution is identical to saying that bias(µ ↾ I j ) ≤ γ. An L-decomposable property is now defined in terms of all its members being decomposable distributions.
Recall that part of the algorithm for learning such distributions is finding (through pulling) what we referred to as a fine partition. Such a partition may still have intervals where the conditional distribution over them is far from uniform. However, we shall show that for L-decomposable distributions, the total weight of such "bad" intervals is not very high.
The next lemma shows that every fine partition of an (γ, L)-decomposable distribution has only a small weight concentrated on "non-uniform" intervals and thus it will be sufficient to deal with the "uniform" intervals.
Proof. Let I = (I 1 , I 2 , . . . , I ℓ ) be the L-decomposition of µ, where ℓ ≤ L. Let I ′ = (I ′ 1 , I ′ 2 , . . . , I ′ r ) be an interval partition of [n] such that for all j ∈ [r], µ(I ′ j ) ≤ γ/L or |I ′ j | = 1. Any interval I ′ j for which bias(µ ↾ I ′ j ) > γ, is either completely inside an interval I k such that µ(I k ) ≤ γ/L, or intersects more than one interval (and in particular |I ′ j | > 1). There are at most L − 1 intervals in I ′ that intersect more than one interval in I. The sum of the weights of all such intervals is at most γ.
For any interval I k of I such that µ(I k ) ≤ γ/L, the sum of the weights of intervals from I ′ that lie completely inside I k is at most γ/L. Thus, the total weight of all such intervals is bounded by γ. Therefore, the sum of the weights of intervals I ′ j such that bias(µ ↾ I ′ j ) > γ is at most 2γ. In order to get better bounds, we will use the counterpart of this lemma for the more general (two-parameter) notion of a fine partition.
Proof. Let I = (I 1 , I 2 , . . . , I ℓ ) be the L-decomposition of µ, where ℓ ≤ L. Let I ′ = (I ′ 1 , I ′ 2 , . . . , I ′ r ) be an interval partition of [n] such that for a set H I of total weight at most γ, for all I ′ j ∈ I \ H I , µ(I ′ j ) ≤ γ/L or |I ′ j | = 1. Exactly as in the proof of Lemma 4.3, the total weight of intervals I ′ j ∈ I \ H I for which bias(µ ↾ I ′ j ) > γ is at most 2γ. In the worst case, all intervals in H I are also such that bias(µ ↾ I ′ j ) > γ, adding at most γ to the total weight of such intervals.
As previously mentioned, we are not learning the actual distribution but a "flattening" thereof. We next formally define the flattening of a distribution µ with respect to an interval partition I. Afterwards we shall describe its advantages and how it can be learned.
Definition 4.5. Given a distribution µ supported over [n] and a partition I = (I 1 , I 2 , . . . , I ℓ ), of [n] to intervals, the flattening of µ with respect to I is a distribution µ I , supported over [n], such that for i ∈ I j , µ I (i) = µ(I j )/|I j |.
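To make Definition 4.5 concrete, the following sketch computes the flattening of an explicitly given distribution. The representation of µ as a list indexed from 1 to n and of intervals as inclusive pairs (lo, hi) are conventions assumed here for illustration.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # inclusive endpoints (lo, hi)


def flatten(mu: Sequence[float], partition: Sequence[Interval]) -> List[float]:
    """Return the flattening of mu with respect to an interval partition.

    Every index i in an interval I_j receives the value mu(I_j) / |I_j|,
    so the weight of each interval is preserved and spread uniformly.
    """
    flat = [0.0] * len(mu)
    for lo, hi in partition:
        weight = sum(mu[lo - 1:hi])  # mu(I_j), with 1-based indices
        size = hi - lo + 1
        for i in range(lo, hi + 1):
            flat[i - 1] = weight / size
    return flat


mu = [0.1, 0.3, 0.2, 0.4]
flat = flatten(mu, [(1, 2), (3, 3), (4, 4)])
```

Here the interval {1, 2} of weight 0.4 is replaced by two entries of 0.2 each, while the singletons are left unchanged.
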
The following lemma shows that the flattening of any distribution µ, with respect to any interval partition that has only small weight on intervals far from uniform, is close to µ.
Lemma 4.6. Let µ be a distribution supported on [n], and let I = (I 1 , I 2 , . . . , I r ) be an interval Proof. We split the sum d(µ, µ I ) into parts, one for I j such that d(µ ↾ I j , U I j ) ≤ γ, and one for the remaining intervals. For For We know that the sum of µ(I j ) over all I j such that d(µ ↾ I j , U I j ) ≥ γ is at most η. Using Equations 2 and 1, and summing up over all the sets I j ∈ I, the lemma follows.
The good thing about a flattening (for an interval partition of small length) is that it can be efficiently learned. For this we first make a technical definition and note some trivial observations: Definition 4.7 (coarsening). Given µ and I, where |I| = ℓ, we define the coarsening of µ according This is a distribution, and for any two distributionsμ I andχ I we have d(µ I , χ I ) = d(μ I ,χ I ). Moreover, ifμ I is a coarsening of a distribution µ over [n], then µ I is the respective flattening of µ.
Proof. All of this follows immediately from the definitions.
The following lemma shows how learning can be achieved. We will ultimately use this in conjunction with Lemma 4.6 as a means to learn a whole distribution through its flattening.
Proof. First, note that an unconditional sample fromμ I can be simulated using one unconditional sample from µ. To obtain it, take the index i sampled from µ, and set j to be the index for which i ∈ I j . Using Lemma 2.8, we can now obtain a distributionμ ′ I , supported over [ℓ], such that with probability at least 1 − δ, d(μ I ,μ ′ I ) ≤ ǫ. To finish, we construct and output µ ′ I as per Observation 4.8.
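The simulation described in this proof can be sketched as follows: each sample from µ is mapped to the index of the interval containing it, the coarsened distribution is estimated by empirical frequencies, and the estimate is spread back uniformly over each interval. The number of samples `m` is left as a parameter because the bound of Lemma 2.8 is not reproduced here; the sampler interface and the names are likewise assumptions for illustration.

```python
import bisect
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Interval = Tuple[int, int]  # inclusive endpoints (lo, hi)


def learn_flattening(
    sample: Callable[[], int], partition: Sequence[Interval], m: int
) -> List[float]:
    """Estimate the flattening of mu from m unconditional samples.

    `partition` must list the intervals in increasing order and cover
    {1, ..., n}, where n is the right endpoint of the last interval.
    """
    starts = [lo for lo, _ in partition]
    counts: Counter = Counter()
    for _ in range(m):
        i = sample()
        # Index j of the interval that contains the sampled point i.
        j = bisect.bisect_right(starts, i) - 1
        counts[j] += 1
    n = partition[-1][1]
    flat = [0.0] * n
    for j, (lo, hi) in enumerate(partition):
        weight = counts[j] / m  # empirical weight of the j-th interval
        for i in range(lo, hi + 1):
            flat[i - 1] = weight / (hi - lo + 1)
    return flat
```
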

Weakly tolerant interval uniformity tests
To unify our treatment of learning and testing with respect to L-decomposable properties across all three models (unconditional, adaptive-conditional and non-adaptive-conditional), we first define what it means to test a distribution µ for uniformity over an interval I ⊆ [n]. The following definition is technical in nature, but it is what we need as a building block for our learning and testing algorithms.
Definition 5.1 (weakly tolerant interval tester). A weakly tolerant interval tester is an algorithm T that takes as input a distribution µ over [n], an interval I ⊆ [n], a maximum size parameter m, a minimum weight parameter γ, an approximation parameter ǫ and an error parameter δ, and satisfies the following.
In all other cases, the algorithm may accept or reject with arbitrary probability.
For our purposes we will use three weakly tolerant interval testers, one for each model. First, a tester for uniformity which uses unconditional samples, a version of which has already appeared implicitly in [GR00]. We state below the tester with the best dependence on n and ǫ. We first state it in its original form, where I is the whole of [n], implying that m = n and γ = 1, and δ = 1/3.
The needed adaptation to our purpose is straightforward.
Proof. To adapt the tester of Lemma 5.2 to general m and γ, we just take samples according to µ and keep those samples that lie in I. This simulates samples from µ ↾ I , over which we employ the original tester. This gives a tester using $O(\sqrt{m}/(\gamma\epsilon^2))$ unconditional samples and providing an error parameter of, say, δ = 2/5 (the extra error is due to the probability of not getting enough samples from I even when µ(I) ≥ γ). To move to a general δ, we repeat this O(log(1/δ)) times and take the majority vote.
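The filtering step of this proof can be sketched as a simple rejection loop. The names `restricted_samples`, `needed` and `budget` are hypothetical; in the setting of the proof, `budget` would be the total number of unconditional samples drawn, and returning `None` corresponds to the failure event of not getting enough samples from I.

```python
from typing import Callable, List, Optional, Tuple

Interval = Tuple[int, int]  # inclusive endpoints (lo, hi)


def restricted_samples(
    sample: Callable[[], int], interval: Interval, needed: int, budget: int
) -> Optional[List[int]]:
    """Simulate `needed` samples from mu restricted to `interval`.

    Draws at most `budget` unconditional samples and keeps those that
    fall in the interval. Returns None if the budget is exhausted first.
    """
    lo, hi = interval
    kept: List[int] = []
    for _ in range(budget):
        i = sample()
        if lo <= i <= hi:
            kept.append(i)
            if len(kept) == needed:
                return kept
    return None
```

Each kept sample is distributed exactly according to µ conditioned on I, so any uniformity tester for the whole domain can be run on the output unchanged.
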
Next, a tester that uses adaptive conditional samples. For this we use the following tester from [CRS15] (see also [CFGM13]). Its original statement does not have the weak tolerance (acceptance for small bias) guarantee, but it is easy to see that the proof there works for the stronger assertion. This time we skip the question of how to adapt the original algorithm from I = [n] and δ = 1/3 to the general parameters here. This is since γ does not matter (due to using adaptive conditional samples), the query complexity is independent of the domain size to begin with, and the move to a general δ > 0 is by standard amplification.
Finally, a tester that uses non-adaptive conditional samples. For this to work, it is also very important that the queries do not depend on I (but only on n and γ). We only state the lemma here; the algorithm itself is presented and analyzed in Section 8.

Assessing an interval partition
Through either Lemma 3.2 or Lemma 3.4 we know how to construct a fine partition, and then through either Lemma 4.3 or Lemma 4.4 respectively we know that if µ is decomposable, then most of the weight is concentrated on intervals with a small bias. However, eventually we would like a test that works for decomposable and non-decomposable distributions alike. For this we need a way to assess an interval partition as to whether it is indeed suitable for learning a distribution. This is done through a weighted sampling of intervals, for which we employ a weakly tolerant tester. The following is the formal description, given as Algorithm 3.

Algorithm 3: Assessing a partition
Input: A distribution µ supported over [n], parameters c, r, an interval partition I satisfying |I| ≤ r, parameters ǫ, δ > 0, a weakly tolerant interval uniformity tester T taking input values (µ, I, m, γ, ǫ, δ).
1 for s = 20 log(1/δ)/ǫ times do
2 Take an unconditional sample from µ and let I ∈ I be the interval that contains it
3 Use the tester T with input values (µ, I, n/c, ǫ/r, ǫ, δ/2s)
4 if test rejects then add I to B
5 if |B| > 4ǫs then reject else accept
To analyze it, first, for a fine interval partition, we bound the total weight of intervals where the weakly tolerant tester is not guaranteed a small error probability; recall that T as used in Step 3 guarantees a correct output only for an interval I satisfying µ(I) ≥ ǫ/r and |I| ≤ n/c.
Observation 6.1. Define $N_{\mathcal{I}} = \{I \in \mathcal{I} : |I| > n/c \text{ or } \mu(I) < \epsilon/r\}$. If I is (η, γ)-fine, where cη + γ ≤ ǫ, then $\mu\left(\bigcup_{I \in N_{\mathcal{I}}} I\right) \le 2\epsilon$.
Proof. Intervals in N I must fall into at least one of the following categories.
• Intervals in H I , whose total weight is bounded by γ by the definition of a fine partition.
• Intervals whose weight is less than ǫ/r. Since there are at most r such intervals (since |I| ≤ r) their total weight is bounded by ǫ.
• Intervals whose size is more than n/c and are not in H I . Every such interval is of weight bounded by η (by the definition of a fine partition) and clearly there are no more than c of those, giving a total weight of cη.
Summing these up concludes the proof.
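The control flow of Algorithm 3 can be sketched as follows, with the weakly tolerant tester passed in as a black box. The callable signature `tester(interval, m, gamma, eps, delta)` returning True on acceptance is a hypothetical interface chosen for this sketch, as are the sampler `sample()` and the pair representation of intervals. Rejected intervals are counted per draw, which is an assumption about how |B| is meant to be read, in line with the analysis counting the sampled intervals.

```python
import bisect
import math
from typing import Callable, Sequence, Tuple

Interval = Tuple[int, int]  # inclusive endpoints (lo, hi)
Tester = Callable[[Interval, float, float, float, float], bool]


def assess_partition(
    sample: Callable[[], int],
    partition: Sequence[Interval],
    n: int, c: float, r: float,
    eps: float, delta: float,
    tester: Tester,
) -> bool:
    """Return True (accept) if few sampled intervals are rejected by the tester."""
    s = math.ceil(20 * math.log(1 / delta) / eps)
    starts = [lo for lo, _ in partition]
    rejected = 0
    for _ in range(s):
        # Weighted sampling of an interval: draw i from mu, take its interval.
        i = sample()
        interval = partition[bisect.bisect_right(starts, i) - 1]
        # Tester is invoked with size bound n/c, weight bound eps/r,
        # approximation eps and error parameter delta / (2s).
        if not tester(interval, n / c, eps / r, eps, delta / (2 * s)):
            rejected += 1
    return rejected <= 4 * eps * s
```
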
The following "completeness" lemma states that the typical case for a fine partition of a decomposable distribution, i.e. the case where most intervals exhibit a small bias, is correctly detected.
then Algorithm 3 accepts with probability at least 1 − δ.
Proof. Note by Observation 6.1 that the total weight of G I \ N I is at least 1 − 3ǫ. By the Chernoff bound of Lemma 2.6, with probability at least 1 − δ/2 all but at most 4ǫs of the intervals drawn in Step 2 fall into this set.
Finally, note that if I as drawn in Step 2 belongs to this set, then with probability at least 1 − δ/2s the invocation of T in Step 3 will accept it, so by a union bound with probability at least 1 − δ/2 all sampled intervals from this set will be accepted. All events occur together and make the algorithm accept with probability at least 1 − δ, concluding the proof.
The following "soundness" lemma states that if too much weight is concentrated on intervals where µ is far from uniform in the ℓ 1 distance, then the algorithm rejects. Later we will show that this is the only situation where µ cannot be easily learned through its flattening according to I.
Proof. Note by Observation 6.1 that the total weight of F I \ N I is at least 5ǫ. By the Chernoff bound of Lemma 2.6, with probability at least 1 − δ/2 at least 4ǫs of the intervals drawn in Step 2 fall into this set.
Finally, note that if I as drawn in Step 2 belongs to this set, then with probability at least 1 − δ/2s the invocation of T in Step 3 will reject it, so by a union bound with probability at least 1 − δ/2 all sampled intervals from this set will be rejected. All events occur together and make the algorithm reject with probability at least 1 − δ, concluding the proof.
Finally, we present the query complexity of the algorithm. It is presented as generally quadratic in log(1/δ), but this can be made linear easily by first using the algorithm with δ = 1/3, and then repeating it O(log(1/δ)) times and taking the majority vote. When we use this lemma later on, both r and c will be linear in the decomposability parameter L for a fixed ǫ, and δ will be a fixed constant.
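The majority-vote amplification mentioned above can be illustrated with a short Python sketch. The constant 18 is an assumption of this sketch: it comes from a standard Hoeffding bound for a base tester that errs with probability at most 1/3, and it is not taken from the paper. The callback `base_test` is hypothetical.

```python
import math


def rounds_needed(delta):
    """Odd number of repetitions for which Hoeffding's inequality bounds the
    majority-vote error by delta, assuming each round errs w.p. at most 1/3."""
    return math.ceil(18 * math.log(1 / delta)) | 1  # "| 1" rounds up to an odd number


def amplify(base_test, delta):
    """Repeat a two-sided tester and return the majority verdict.

    base_test -- hypothetical zero-argument callable returning True (accept)
                 or False (reject), using fresh samples on every call.
    """
    rounds = rounds_needed(delta)
    accepts = sum(1 for _ in range(rounds) if base_test())
    return 2 * accepts > rounds
```

The number of repetitions grows linearly in log(1/δ), which is what turns the quadratic dependence into a linear one.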

Proof. A single (unconditional) sample is taken each time Step 2 is reached, and all other samples are taken by the invocation of T in Step 3. This makes the total number of samples s(q + 1) = O(q log(1/δ)/ǫ).
The bound for each individual sampling model follows by plugging in Lemma 5.3, Lemma 5.4 and Lemma 5.5 respectively. For the last one it is important that the tester makes its queries completely independently of I, as otherwise the algorithm would not have been non-adaptive.

Learning and testing decomposable distributions and properties
Here we finally put things together to produce a learning algorithm for L-decomposable distributions. This algorithm is not only guaranteed to learn with high probability a distribution that is decomposable, but is also guaranteed with high probability to not produce a wrong output for any distribution (though it may plainly reject a distribution that is not decomposable). This is presented in Algorithm 4. We present it with a fixed error probability 1/3 because this is what we use later on, but it is not hard to move to a general δ.
In this situation, in particular by Lemma 4.6 we have that d(µ I , µ) ≤ 15ǫ/20 (in fact this can be bounded much smaller here), and with probability at least 8/9 (by Lemma 4.9) Step 4 provides a distribution that is ǫ/10-close to µ I and hence ǫ-close to µ.
Next we show soundness, that the algorithm will with high probability not mislead about the distribution, whether it is decomposable or not.
Proof. Consider the interval partition I. By Lemma 3.4, with probability at least 8/9 it is (ǫ/2000L, ǫ/2000)-fine. When this happens, if I is such that j:d(µ↾ I j ,U I j ) µ(I j ) > 7ǫ/20, then by Lemma 6.3 with probability at least 8/9 the algorithm will reject in Step 3, and we are done (recall that here a rejection is an allowable outcome).
On the other hand, if I is such that j:d(µ↾ I j ,U I j ) µ(I j ) ≤ 7ǫ/20, then by Lemma 4.6 we have that d(µ I , µ) ≤ 15ǫ/20, and with probability at least 8/9 (by Lemma 4.9) Step 4 provides a distribution that is ǫ/10-close to µ I and hence ǫ-close to µ, which is also an allowable outcome.
And finally, we plug in the sample complexity bounds.

ǫ) is a bound on the number of samples that each invocation of T inside Algorithm 3 requires. In particular, Algorithm 4 can be implemented either as an unconditional sampling algorithm taking √ nL/poly(ǫ) many samples, an adaptive conditional sampling algorithm taking L/poly(ǫ) many samples, or a non-adaptive conditional sampling algorithm taking Lpoly(log n, 1/ǫ) many samples.
Proof. The three summands in the general expression follow from the sample complexity calculations of Lemma 3.4 for Step 1, Lemma 6.4 for Step 2, and Lemma 4.9 for Step 4, respectively. Also note that all samples outside Step 2 are unconditional.
The bound for each individual sampling model follows from the respective bound stated in Lemma 6.4.
Let us now summarize the above as a theorem.
Theorem 7.4. Algorithm 4 is capable of learning an (ǫ/2000, L)-decomposable distribution, giving with probability at least 2/3 a distribution that is ǫ-close to it, such that for no distribution will it give as output a distribution ǫ-far from it with probability more than 1/3.
It can be implemented either as an unconditional sampling algorithm taking √ nL/poly(ǫ) many samples, an adaptive conditional sampling algorithm taking L/poly(ǫ) many samples, or a non-adaptive conditional sampling algorithm taking Lpoly(log n, 1/ǫ) many samples.
Let us now move to the immediate application of the above for testing decomposable properties. The algorithm achieving this is summarized as Algorithm 5.
Algorithm 5: Testing L-decomposable properties.
Proof. The number and the nature of the samples are determined fully by the application of Algorithm 4 in Step 1, and are thus the same as in Theorem 7.4. Also by this theorem, for a distribution µ ∈ C, with probability at least 2/3 an ǫ/2-close distribution µ ′ will be produced, and so it will be accepted in Step 2.
Finally, if µ is ǫ-far from C, then with probability at least 2/3 Step 1 will either produce a rejection, or again produce µ ′ that is ǫ/2-close to µ. In the latter case, µ ′ will be ǫ/2-far from C by the triangle inequality, and so Step 2 will reject in either case.

A weakly tolerant tester for the non-adaptive conditional model
Given a distribution µ, supported over [n], and an interval I ⊆ [n] such that µ(I) ≥ γ, we give a tester that uses non-adaptive conditional queries to µ to distinguish between the cases bias(µ ↾ I ) ≤ ǫ/100 and d(µ ↾ I , U I ) > ǫ, using ideas from [CFGM13]. A formal description of the test is given as Algorithm 6. It is formulated here with error probability δ = 1/3. Lemma 5.5 is obtained from this the usual way, by repeating the algorithm O(1/δ) times and taking the majority vote.
We first make the observation that makes Algorithm 6 suitable for a non-adaptive setting.
Observation 8.1. Algorithm 6 can be implemented using only non-adaptive conditional queries to the distribution µ, that are chosen independently of I.
Proof. First, note that the algorithm samples elements from µ at three places. Initially, it samples unconditionally from µ in Step 1, and then it performs conditional samples from the sets U k in Steps 10 and 13. In Steps 10 and 13, the samples are conditioned on sets U k , where k depends on I. However, observe that we can sample from all sets U k , for all 0 ≤ k ≤ log n, at the beginning, and then use just the samples from the appropriate U k at Steps 10 and 13. This only increases the bound on the number of samples by a factor of log n. Thus we have only non-adaptive queries, all of which are made at the start of the algorithm, independently of I.
The following lemma is used in Step 6 of our algorithm.

11 if the same element from I ∩ U k has been sampled twice then reject.
If we did not obtain sufficiently many samples (either because µ(I) < γ or due to a low probability event) then we just output an arbitrary distribution supported on I.
Lemma 8.3 (Completeness). If µ(I) ≥ γ and bias(µ ↾ I ) ≤ ǫ/100, then Algorithm 6 accepts with probability at least 2/3.
Proof. First note that if |I| ≤ $\log^{10} n$, then we use Lemma 8.2 to test the distance of µ ↾ I to uniform with probability at least 9/10 in Step 7. For the remaining part of the proof, we will assume that |I| > $\log^{10} n$.
For a set U k chosen by the algorithm, and any i ∈ I ∩ U k , the probability that it is sampled twice in Step 11 is at most $\binom{\log^3 n}{2}\left(\frac{\mu(i)}{\mu(U_k)}\right)^2$. Since µ(U k ) ≥ µ(I ∩ U k ), the probability of sampling twice in Step 11 is at most $\binom{\log^3 n}{2}\left(\frac{\mu(i)}{\mu(I\cap U_k)}\right)^2$. By Observation 2.2 bias(µ ↾ I ) ≤ ǫ/100 implies . (3) From Equation 3 we get the following for all U k . .
(4) Therefore, the probability that the algorithm samples the same element in I ∩ U k at Step 11 twice is bounded as follows.
Since |I ∩ U k | ≥ $\log^8 n$ for the k chosen in Step 9, we can bound the sum as follows.
Therefore, with probability at least 1 − o(1), the algorithm does not reject at Step 11.
To show that the algorithm accepts with probability at least 2/3 in Step 18, we proceed as follows. Combining Equations 3 and 4, we get the following.
We now argue that in this case, the test does not reject at Step 14, for the k chosen in Step 12. Observe that E[µ(I ∩ U k )] ≥ p k γ. Also, the expected size of the set I ∩ U k is p k |I|. Since the k chosen in Step 12 is such that |I|p k ≥ $\frac{2}{3}\log^8 n$, with probability at least 1 − exp(−O($\log^8 n$)), p k |I|/2 ≤ |I ∩ U k | ≤ 2p k |I| (and in particular Step 14 does not reject). Therefore from Equation 4, we get that, with probability at least 1 − exp(−O($\log^8 n$)), µ(I ∩ U k ) ≥ p k γ/3. Since E[µ(U k )] = p k , we can conclude using Markov's inequality that, with probability at least 9/10, µ(U k ) ≤ 10p k . The expected number of samples from I ∩ U k among the m k samples used in Step 17 is m k µ(I ∩ U k )/µ(U k ). Therefore, with probability at least 9/10, the expected number of samples from I ∩ U k among the m k samples is at least m k γ/30. Therefore, with probability at least 9/10 − o(1), at least m k γ/40 elements of I ∩ U k are sampled, and the tester does not reject at Step 14. The indexes that are sampled in Step 13 that lie in I ∩ U k are distributed according to µ ↾ I∩U k and we know that |I ∩ U k | ≤ 2|I|p k ≤ $\frac{8}{3}\log^8 n$. Therefore, with probability at least 9/10, we get a distribution µ ′ such that Step 17. Therefore, the test correctly accepts in Step 18 for the k chosen in Step 12.
Now we prove the soundness of the tester mentioned above. First we state a lemma from Chakraborty et al. [CFGM13].

Lemma 8.4 ([CFGM13], adapted for intervals). Let µ be a distribution, and I ⊆ [n] be an interval such that d(µ ↾ I , U I ) ≥ ǫ. Then the following two conditions hold.
1. There exists a set
2. There exists an index j ∈ {3, . . . , log |I| / log(1+ǫ/3)}, and a set B j of cardinality at least
Now we analyze the case where d(µ ↾ I , U I ) > ǫ.

Proof. Observe that when |I| ≤ $\log^{10} n$, the algorithm rejects with probability at least 9/10 in Step 7. For the remainder of the proof, we will assume that |I| > $\log^{10} n$. We analyze two cases according to the value of j given by Lemma 8.4.
1. Suppose that j > 2 is such that |B j | ≥ $\frac{\epsilon^2 |I|}{96(1+\epsilon/3)^j \log |I|}$, and $(1+\epsilon/3)^j \le \log^6 n$. The expected number of elements from this set that are chosen in U k is at least $\frac{\epsilon^2 |I| p_k}{96(1+\epsilon/3)^j \log |I|}$. For the choice of k made in Step 12, we have |I|p k ≥ $\frac{2}{3}\log^8 n$. The probability that no index from B j is chosen in U k is $(1-p_k)^{|B_j|}$, which is at most $\left(1-\frac{2\log^8 n}{3|I|}\right)^{\epsilon^2 |I|/(96(1+\epsilon/3)^j \log |I|)}$. Since $(1+\epsilon/3)^j \le \log^6 n$, this is at most $\exp\left(-\frac{\epsilon^2 \log n}{144}\right)$. Therefore, with probability 1 − o(1), at least one element i is chosen from B j .
Since |B 1 | ≥ ǫ|I|/2, the probability that no element from B 1 is chosen in U k is at most $(1-p_k)^{\epsilon|I|/2}$. Substituting for p k , we can conclude that, with probability 1 − o(1), at least one element i ′ is chosen from the set B 1 .
The algorithm will reject with high probability in Step 18, unless it has already rejected in Step 14.
Furthermore, the probability that B j ∩ U k is empty is $(1-p_k)^{|B_j|}$. Substituting the values of |B j | and p k , we get that Pr[B j ∩ U k = ∅] ≤ $\exp(-\epsilon^2 \log n/48)$. Therefore, with probability at least 1 − $\exp(-\epsilon^2 \log n/48)$, U k contains an element of B j .
From this bound, we get that µ ↾ U k (i) ≥ $\frac{(1+\epsilon)^{j-1}\mu(I)}{|I|\mu(U_k)}$. The expected value of µ(U k ) is p k . By Markov's inequality, with probability at least 9/10, µ(U k ) ≤ 10p k . The probability that i is sampled at most once in this case is at most $\log^3(n)\left(1-\frac{\gamma}{40(1+\epsilon/3)\log^2 n}\right)^{\log^3(n)-1}$. Therefore, with probability at least 1 − o(1), i is sampled at least twice and the tester rejects at Step 11.
Proof of Lemma 5.5. Given the input values (µ, I, m, γ, ǫ, δ), we iterate Algorithm 6 O(1/δ) independent times with input values (µ, I, γ, ǫ) (we may ignore m here), and take the majority vote. The sample complexity is evident from the description of the algorithm. If indeed µ(I) ≥ γ then Lemma 8.3 and Lemma 8.5 provide that every round gives the correct answer with probability at least 2/3, making the majority vote correct with probability at least 1 − δ.

Introducing properties characterized by atlases
In this section, we give a testing algorithm for properties characterized by atlases, which we formally define next. We will show in the next subsection that distributions that are L-decomposable are, in particular, characterized by atlases. First we start with the definition of an inventory. That is, we keep count of the function values over the interval including repetitions, but ignore their order. In particular, for a distribution µ = (p 1 , . . . , p n ) over [n], the inventory of µ over [a, b] is the multiset M corresponding to (p a , . . . , p b ).
Definition 9.2 (atlas). Given a distribution µ over [n], and an interval partition I = (I 1 , . . . , I k ) of [n], the atlas A of µ over I is the ordered pair (I, M), where M is the sequence of multisets (M 1 , . . . , M k ) so that M j is the inventory of µ over I j for every j ∈ [k]. In this setting, we also say that µ conforms to A.
We note that there can be many distributions over [n] whose atlas is the same. We will also denote by an atlas A any ordered pair (I, M) where I is an interval partition of [n] and M is a sequence of multisets of the same length, so that the total sum of all members of all multisets is 1. It is a simple observation that for every such A there exists at least one distribution that conforms to it. The length of an atlas |A| is defined as the shared length of its interval partition and sequence of multisets.
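To make the definitions of inventory and atlas concrete, here is a small Python sketch. The representation choices are assumptions made for illustration only: a distribution is a list of exact `Fraction` values, an interval partition is a list of 1-indexed inclusive pairs (a, b), and a multiset is a `collections.Counter`.

```python
from collections import Counter
from fractions import Fraction


def inventory(mu, a, b):
    """Multiset of the values mu(a), ..., mu(b) (1-indexed, inclusive); order is ignored."""
    return Counter(mu[a - 1:b])


def atlas(mu, partition):
    """Atlas of mu over an interval partition: the pair (partition, sequence of inventories)."""
    return (tuple(partition), tuple(inventory(mu, a, b) for a, b in partition))


def conforms(mu, A):
    """True if mu has exactly the inventories recorded in the atlas A."""
    partition, inventories = A
    return all(inventory(mu, a, b) == M for (a, b), M in zip(partition, inventories))


def atlas_length(A):
    """Length |A|: the shared length of the partition and the multiset sequence."""
    return len(A[0])
```

For example, (1/2, 1/4, 1/8, 1/8) and (1/4, 1/2, 1/8, 1/8) share the same atlas over the partition ([1, 2], [3, 4]), which illustrates the remark that many distributions can conform to one atlas.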
We now define what it means for a property to be characterized by atlases, and state our main theorem concerning such properties.
Definition 9.3. For a function k : R + × N → N, we say that a property of distributions C is k-characterized by atlases if for every n ∈ N and every ǫ > 0 we have a set A of atlases of lengths bounded by k(ǫ, n), so that every distribution µ over [n] satisfying C conforms to some A ∈ A, while on the other hand no distribution µ that conforms to any A ∈ A is ǫ-far from satisfying C.
Theorem 9.4. If C is a property of distributions that is k-characterized by atlases, then for any ǫ > 0 there is an adaptive conditional testing algorithm for C with query complexity k(ǫ/5, n) · poly(log n, 1/ǫ) (and error probability bound 1/3).

Applications and examples
We first show that L-decomposable properties are in particular characterized by atlases.
Lemma 9.5. If C is a property of distributions that is L-decomposable, then C is k-characterized by atlases, where k(ǫ, n) = L(ǫ/3, n).
Proof. Every distribution µ ∈ C that is supported over [n] defines an atlas in conjunction with the interval partition of the L-decomposition of µ for L = L(γ, n). Let A be the set of all such atlases. We will show that C is L(3γ, n)-characterized by A.
Let µ ∈ C. Since µ is L-decomposable, µ conforms to the atlas given by the L-decomposition and it is in A as defined above. Now suppose that µ conforms to an atlas A ∈ A, where I = (I 1 , . . . , I ℓ ) is the sequence of intervals. By the construction of A, there exists a distribution χ ∈ C that conforms with A. Now, for each j ∈ [ℓ] such that µ(I j ) ≤ γ/L, we have (noting that χ(I j ) = µ(I j )) i∈I j Noting that µ and χ have the same maximum and minimum over I j (as they have the same inventory), for each j ∈ [ℓ] and i ∈ I j , we know that |µ(i) − χ(i)| ≤ max i∈I j µ(i) − min i∈I j µ(i). Therefore, for all j ∈ [ℓ] such that max i∈I j µ(i) ≤ (1 + γ) min i∈I j µ(i), |µ(i) − χ(i)| ≤ γ min i∈I j µ(i). Therefore, Finally, recall that since A came from an L-decomposition of χ, all intervals are covered by the above cases. Summing up Equations 5 and 6 for all j ∈ {1, 2, . . . , ℓ}, we obtain d(µ, χ) ≤ 3γ.
Note that atlases characterize also properties that do not have shape restriction. The following is a simple observation.
Observation 9.6. If C is a property of distributions that is symmetric over [n], then C is 1-characterized by atlases.
It was shown in Chakraborty et al [CFGM13] that such properties are efficiently testable with conditional queries, so Theorem 9.4 in particular generalizes this result. Also, the notion of characterization by atlases provides a natural model for tolerant testing, as we will see in the next subsection.

Atlas characterizations and tolerant testing
We now show that for all properties of distributions that are characterized by atlases, there are efficient tolerant testers as well. In [CDGR15], it was shown that for a large class of distribution properties that have "semi-agnostic" learners, there are efficient tolerant testers. In this subsection, we show that when the algorithm is given conditional query access, there are efficient tolerant testers for the larger class of properties that are characterized by atlases, including decomposable properties that otherwise do not lend themselves to tolerant testing.
The mechanism presented here will also be used in the proof of Theorem 9.4 itself. First, we give a definition of tolerant testing. We note that the definition extends naturally to algorithms that make conditional queries to a distribution.
Definition 10.1. Let C be any property of probability distributions. An (η, ǫ)-tolerant tester for C with query complexity q and error probability δ, is an algorithm that samples q elements x 1 , . . . , x q from a distribution µ, accepts with probability at least 1−δ if d(µ, C) ≤ η, and rejects with probability at least 1 − δ if d(µ, C) ≥ η + ǫ.
It is shown in [CDGR15] that for every α > 0, there is an ǫ > 0 that depends on α, such that there is an (ǫ, α − ǫ)-tolerant tester for certain shape-restricted properties. On the other hand, tolerant testing using unconditional queries for other properties, such as the (1-decomposable) property of being uniform, requires Ω(n/ log n) many samples ([VV10]). We prove that, in the presence of conditional query access, there is an (η, ǫ)-tolerant tester for every η, ǫ > 0 such that η + ǫ < 1 for all properties of probability distributions that are characterized by atlases.
We first present a definition and prove an easy lemma that will be useful later on.
Proof. It is an easy observation that there exists an I-preserving permutation σ that moves χ to χ ′ . Now let µ be the distribution that conforms to A ′ such that d(µ, χ) ≤ ǫ, and let µ ′ be the distribution that results from having σ operate on µ. It is not hard to see that µ ′ conforms to A ′ (which has the same interval partition as A), and that it is ǫ-close to χ ′ .
For a property C of distributions that is k-characterized by atlases, let C η be the property of distributions of being η-close to C. The following lemma states that C η is also k-characterized by atlases. This lemma will also be important for us outside the context of tolerant testing per-se.
Lemma 10.4. Let C be a property of probability distributions that is k(ǫ, n)-characterized by atlases. For any η > 0, let C η be the set of all probability distributions µ such that d(µ, C) ≤ η. Then, there is a set A η of atlases, of length at most k, such that every µ ∈ C η conforms to at least one atlas in A η , and every distribution that conforms to an atlas in A η is η + ǫ-close to C, and is ǫ-close to C η .
Proof. Since C is k-characterized by atlases, there is a set of atlases A of length at most k(ǫ, n) such that for each µ ∈ C, there is an atlas A ∈ A to which it conforms, and any χ that conforms to an atlas A ∈ A is ǫ-close to C. Now, let A η be obtained by taking each atlas A ∈ A, and adding all atlases, with the same interval partition, corresponding to distributions that are η-close to conforming to A.
First, note that the new atlases that are added have the same interval partitions as atlases in A, and hence have the same length bound k(ǫ, n). To complete the proof of the lemma, we need to prove that every µ ∈ C η conforms to some atlas in A η , and that no distribution that conforms to A η is η + ǫ-far from C.
Take any µ ∈ C η . There exists some distribution µ ′ ∈ C such that d(µ, µ ′ ) ≤ η. Since C is k-characterized by atlases, there is some atlas A ∈ A such that µ ′ conforms with A. Also, observe that µ is η-close to conforming to A through µ ′ . Therefore, there is an atlas A ′ with the same interval partition as A that was added in A η , which is the atlas corresponding to the distribution µ. Hence, there is an atlas in A η to which µ conforms.
Conversely, let χ be a distribution that conforms with an atlas A ′ ∈ A η . From the construction of A η , we know that there is an atlas A ∈ A with the same interval partition as A ′ , and there is a distribution χ ′ that conforms to A ′ and is η-close to conforming to A. Therefore, by Lemma 10.3 χ is also η-close to conforming to A. Let µ ′ be the distribution conforming to A such that d(χ, µ ′ ) ≤ η. Since µ ′ conforms to an atlas A ∈ A, d(µ ′ , C) ≤ ǫ. Therefore, by the triangle inequality, d(χ, C) ≤ η + ǫ.
Using Lemma 10.4 we get the following corollary of Theorem 9.4 about tolerant testing of distributions characterized by atlases.

Some useful lemmas about atlases and characterizations
We start with a definition and a lemma, providing an alternative equivalent definition of properties k-characterizable by atlases.
Definition 11.1 (permutation-resistant distributions). For a function k : R + × N → N, a property C of probability distributions is k-piecewise permutation resistant if for every n ∈ N, every ǫ > 0, and every distribution µ over [n] in C, there exists a partition I of [n] into up to k(ǫ, n) intervals, so that every I-preserving permutation of [n] transforms µ into a distribution that is ǫ-close to a distribution in C.
Lemma 11.2. For k : R + × N → N, a property C of probability distributions over [n] is k-piecewise permutation resistant if and only if it is k-characterized by atlases.
Proof. If C is k-piecewise permutation resistant, then for each distribution µ ∈ C, there exists an interval partition I of [n] such that every I-preserving permutation of [n] transforms µ into a distribution that is ǫ-close to C. Each distribution µ thus gives an atlas over I, and the collection of these atlases for all µ ∈ C characterizes the property C. Therefore, C is k-characterized by atlases.
Conversely, let C be a property of distributions that is k-characterized by atlases and let A be the set of atlases. For each µ ∈ C, let A µ be the atlas in A that characterizes µ and let I µ be the interval partition corresponding to this atlas. Now, every I µ -preserving permutation σ of µ gives a distribution µ σ that has the same atlas A µ . Since C is k-characterized by atlases, µ σ is ǫ-close to C. Therefore, C is k-piecewise permutation resistant as well.
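The key step of this proof, that a permutation acting inside the intervals of a partition leaves the atlas unchanged, can be checked on a toy example. The sketch below assumes that an I-preserving permutation is one that maps every interval of I onto itself; the helper names are hypothetical and the partition is given as 1-indexed inclusive pairs.

```python
import random


def permute_within_intervals(mu, partition, rng):
    """Apply a random permutation that maps each interval (a, b) onto itself."""
    out = list(mu)
    for a, b in partition:
        block = out[a - 1:b]
        rng.shuffle(block)  # rearranges values only inside the interval
        out[a - 1:b] = block
    return out


def same_atlas(mu, nu, partition):
    """True if mu and nu have identical inventories over every interval."""
    return all(sorted(mu[a - 1:b]) == sorted(nu[a - 1:b]) for a, b in partition)
```

Whatever permutation is drawn, `same_atlas(mu, permute_within_intervals(mu, partition, rng), partition)` holds, while a permutation that moves mass across intervals will in general change the atlas.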
We now prove the following lemma about ǫ/k-fine partitions of distributions characterized by atlases, having a similar flavor as Lemma 4.3 for L-decomposable properties. Since we cannot avert a poly(log n) dependency anyway, for simplicity we use the 1-parameter variant of fine partitions.
Lemma 11.3. Let C be a property of distributions that is k(ǫ, n)-characterized by atlases through A. For any µ ∈ C, any ǫ/k-fine interval partition I ′ of µ, and the corresponding atlas A ′ = (I ′ , M ′ ) for µ (not necessarily in A), any distribution µ ′ that conforms to A ′ is 3ǫ-close to C.
Proof. Let A = (I, M) be the atlas from A to which µ conforms, and let I ′ = (I ′ 1 , I ′ 2 , . . . , I ′ r ) be an ǫ/k-fine interval partition of µ. Let P ⊆ I ′ be the set of intervals that intersect more than one interval in I. Since I ′ is ǫ/k-fine, and the length of A is at most k, $\mu\bigl(\bigcup_{I'_j \in P} I'_j\bigr) \le \epsilon$ (note that P cannot contain singletons). Also, since µ ′ conforms to A ′ , we have $\mu'\bigl(\bigcup_{I'_j \in P} I'_j\bigr) \le \epsilon$. Let µ̃ be a distribution supported over [n] obtained as follows: For each interval I ′ j ∈ P, µ̃(i) = µ(i) for every i ∈ I ′ j . For each interval I ′ j ∈ I ′ \ P, µ̃(i) = µ ′ (i) for every i ∈ I ′ j . Note that the inventories of µ̃ and µ are identical over any I ′ j in I ′ \ P. From this it follows that µ̃ also conforms to A, and in particular µ̃ is a distribution. To see this, for any I j in I partition it to its intersection with the members of I ′ \ P contained in it, and all the rest. For the former we use that µ and µ ′ have the same inventories, and for the latter we specified that µ̃ has the same values as µ.
The main idea of our test for a property of distributions k-characterized by atlases starts with a γ/k-fine partition I obtained through Algorithm 1. We then show how to compute an atlas A with this interval partition such that there is a distribution µ I that conforms to A that is close to µ. We use the ǫ-trimming sampler from [CFGM13] to obtain such an atlas corresponding to I. To test if µ is in C, we show that it is sufficient to check if there is some distribution conforming to A that is close to a distribution in C.

An adaptive test for properties characterized by atlases
Our main technical lemma, which we state here and prove in Section 13, is the following possibility of "learning" an atlas of an unknown distribution for an interval partition I, under the conditional sampling model.
First, we show how this implies Theorem 9.4. To prove it, we give as Algorithm 7 a formal description of the test.
Algorithm 7: Adaptive conditional tester for properties k-characterized by atlases
Input: A distribution µ supported over [n], a function k : (0, 1] × N → N, accuracy parameter ǫ > 0, a property C of distributions that is k-characterized by the set of atlases A
1 Use Algorithm 1 with input values (µ, ǫ/5k(ǫ/5, n), 1/6) to obtain a partition I ′ with |I ′ | ≤ 20k(ǫ/5, n) log(n) log(1/ǫ)/ǫ
2 Use Lemma 12.1 with accuracy parameter ǫ/5 and error parameter 1/6 to obtain an atlas A I ′ corresponding to I ′
3 if there exists χ ∈ C that is ǫ/5-close to conforming to A I ′ then accept else reject
Lemma 12.2 (completeness). Let C be a property of distributions that is k-characterized by atlases, and let µ be any distribution supported over [n]. If µ ∈ C, then with probability at least 2/3 Algorithm 7 accepts.
Proof. In Step 2, with probability at least 5/6 > 2/3, we get an atlas A I ′ such that there is a distribution µ I ′ conforming to it that is ǫ/5-close to µ. Step 3 then accepts on account of χ = µ.
Lemma 12.3 (soundness). Let C be a property of distributions that is k-characterized by atlases, and let µ be any distribution supported over [n]. If d(µ, C) > ǫ, then with probability at least 2/3 Algorithm 7 rejects.
Proof. With probability at least 2/3, for k = k(ǫ/5, n) we get an ǫ/5k-fine partition I ′ in Step 1, as well as an atlas A I ′ in Step 2 such that there is a distribution µ I ′ conforming to it that is ǫ/5-close to µ. Suppose that the algorithm accepted in Step 3 on account of χ ∈ C. Then there is a χ ′ that is ǫ/5-close to χ and conforms to A I ′ . By Lemma 10.4, the property of being ǫ/5-close to C is itself k-characterized by atlases. Let A ǫ/5 be the collection of atlases characterizing it. Using Lemma 11.3 with χ ′ and A ǫ/5 , we know that χ ′ is 3ǫ/5-close to some χ, which is in C ǫ/5 and thus ǫ/5-close to C. Since χ ′ is also ǫ/5-close to µ, we obtain that µ is ǫ-close to C by the triangle inequality, contradicting d(µ, C) > ǫ.
Proof of Theorem 9.4. Given a distribution µ, supported on [n], and a property C of distributions that is k-characterized by atlases, we use Algorithm 7. The correctness follows from Lemmas 12.2 and 12.3. The number of samples made in Step 1 is clearly dominated by the number of samples in Step 2, which is k(ǫ/5, n) · poly(log n, 1/ǫ).

Constructing an atlas for a distribution
Before we prove Lemma 12.1, we will define the notion of value-distances and prove lemmas that will be useful for the proof of the theorem.
Definition 13.1 (value-distance). Given two multisets A,B of real numbers, both of the same size (e.g. two inventories over an interval [a, b]), the value-distance between them is the minimum ℓ 1 distance between a vector that conforms to A and a vector that conforms to B.
The following observation gives a simple method to calculate the value-distances between two multisets A and B.
Observation 13.2. The value-distance between A and B is equal to the ℓ 1 distance between the two vectors conforming to the respective sorting of the two multisets.
Proof. Given two vectors v and w corresponding to A and B achieving the value-distance, first assume that the smallest value of A is not larger than the smallest value of B. Assume without loss of generality (by permuting both v and w) that v 1 is of the smallest value among those of A. It is not hard to see that, in this case, one can make w 1 to be of the smallest value among those of B without increasing the distance (by swapping the two values), and from here one can proceed by induction over |A| = |B|.
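Observation 13.2 translates directly into a short computation. The Python sketch below computes the value-distance by sorting, and compares it against a brute-force minimum over all orderings on small inputs; the function names are hypothetical and multisets are represented as plain lists.

```python
from itertools import permutations


def value_distance(A, B):
    """Value-distance via Observation 13.2: l1 distance between the sorted vectors."""
    if len(A) != len(B):
        raise ValueError("multisets must have the same size")
    return sum(abs(x - y) for x, y in zip(sorted(A), sorted(B)))


def value_distance_bruteforce(A, B):
    """Definition 13.1 taken literally: minimise the l1 distance over all orderings of B.

    Fixing the order of A loses no generality, since permuting both vectors
    simultaneously does not change their l1 distance. Exponential time; for checking only.
    """
    A = list(A)
    return min(sum(abs(x - y) for x, y in zip(A, perm)) for perm in permutations(B))
```

On small multisets the two functions agree, which is the content of the observation; the sorted version runs in O(m log m) time for multisets of size m.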
We now prove two lemmas that will be useful for the proof of Lemma 12.1.
Lemma 13.3. Let A and B be two multisets of the same size, both with members whose values range in {0, α 1 , . . . , α r }. Let m j be the number of appearances of α j in A, and n j the corresponding number in B. If m j ≤ n j for every 1 ≤ j ≤ r, then the value-distance between A and B is bounded by $\sum_{j=1}^{r} (n_j - m_j)\alpha_j$.
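The bound of Lemma 13.3 can be checked numerically on small examples. The sketch below is only an illustration, with hypothetical helper names; it assumes non-negative values, represents multisets as lists, and computes the value-distance by sorting as in Observation 13.2.

```python
from collections import Counter


def value_distance(A, B):
    """l1 distance between the sorted vectors of two equal-size multisets."""
    return sum(abs(x - y) for x, y in zip(sorted(A), sorted(B)))


def lemma_13_3_bound(A, B):
    """Sum over nonzero values alpha_j of (n_j - m_j) * alpha_j.

    Raises ValueError if some nonzero value appears more often in A than in B,
    since the lemma's hypothesis m_j <= n_j would then fail.
    """
    count_a, count_b = Counter(A), Counter(B)
    nonzero_values = (set(count_a) | set(count_b)) - {0}
    if any(count_a[v] > count_b[v] for v in nonzero_values):
        raise ValueError("hypothesis m_j <= n_j violated")
    return sum((count_b[v] - count_a[v]) * v for v in nonzero_values)
```

Intuitively, the m j copies of α j in A are matched with copies of α j in B at no cost, and each of the remaining n j − m j copies in B is matched with a zero of A at cost α j .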
Definition 13.5 (ǫ-trimming sampler). An ǫ-trimming sampler with s samples for a distribution µ supported over [n], is an algorithm that has conditional query access to the distribution µ and returns s pairs of values (r, µ̄(r)) (for r = 0, µ̄(r) is not output) from a distribution µ̄ supported on {0} ∪ [n], such that $\sum_{i\in[n]} |\mu(i)-\bar{\mu}(i)| \le 4\epsilon$, and each r is independently drawn from µ̄. Furthermore, there is a set P of poly(log n, 1/ǫ) real numbers such that for all i either µ̄(i) = 0 or µ̄(i) ∈ P .
The existence of an ǫ-trimming sampler with a small set of values was proved in [CFGM13]. Let us formally state this lemma.

Proving the main lemma
The proof of Lemma 12.1 depends on the following technical lemma.
Lemma 13.7. Let µ be a distribution supported over [n], and let P = (P 0 , P 1 , P 2 , . . . , P r ) be a partition of [n] into r + 1 subsets with the following properties. Now, $\sum_{k\in[r]} p_k(|P_k| - m_k) \le \sum_{k\in[r]} (p_k|P_k| - \alpha'_k)$. Under the above assumption on the value of α k , for every k such that p k |P k | < 1/r, we have α ′ k ≥ p k |P k | − 2ǫ/r. Hence, this difference is at most 2ǫ/r. For every k such that p k |P k | ≥ 1/r, α ′ k ≥ (1 − 2ǫ)p k |P k |. For any such k, the difference p k |P k | − α ′ k is at most 2ǫp k |P k |. Therefore, $\sum_{k\in[r]} p_k(|P_k| - m_k)$ is at most 4ǫ.
Proof of Lemma 12.1. Given a distribution µ and an interval partition I = (I 1 , I 2 , . . . , I r ), let µ̄ be the distribution presented by the ǫ/8-trimming sampler in Lemma 13.6. Let I j,k ⊆ [n] be the set of indexes i such that i ∈ I j and $\bar{\mu}(i) = (1+\epsilon/8)^{k-1}\frac{\epsilon}{8n}$. Thus, each interval I j in I is now split into subsets I j,0 , I j,1 , I j,2 , . . . , I j,ℓ , where $\ell \le \log n \log(8/\epsilon)/\log^2(1+\epsilon/8)$ and I j,0 is the set of indexes in I j such that µ̄(i) = 0.
Using Lemma 13.6, with s = r · poly(log n, 1/ǫ, log(1/δ)) samples from the distribution µ̄, we can estimate, with probability at least 1 − δ, the values m j,k such that m j,k ≤ |I j,k | for all k > 0, and the following holds. For every j, let M̃ I j be the inventory provided by m j,1 , . . . , m j,ℓ and $m_{j,0} = |I_j| - \sum_{k\in[\ell]} m_{j,k}$. Thus, we have an inventory sequence M̃ I = (M̃ I 1 , M̃ I 2 , . . . , M̃ I r ) that is ǫ/10-close in value-distance to the corresponding atlas of µ̄ (where we need to add the interval {0} to the partition to cover its entire support). Corresponding to M̃ I , there is a vector µ̃ that is ǫ/10-close to µ̄. Using Lemma 13.4, we have a distribution µ̂ that is ǫ/2-close to µ̄. Since the [n] portion of µ̄ is ǫ/2-close to µ, by the triangle inequality, µ̂ is ǫ-close to µ. Thus A = (I, M I ), where M I is obtained by multiplying all members of M̃ I by the same factor used to produce µ̂ from µ̃, is an atlas for a distribution that is ǫ-close to µ. We need s · poly(log n, log s, 1/ǫ, log(1/δ)) conditional samples from µ to get s samples from µ̄. Therefore, we require r · poly(log n, 1/ǫ, log(1/δ)) conditional samples to construct the atlas A I .