Abstract 1 Introduction 2 Preliminaries 3 Refutation, Learning, and Testable Learning with Queries 4 MQ-SQ Lower Bounds References

Limitations of Membership Queries in Testable Learning

Jane Lange ORCID Massachusetts Institute of Technology, Cambridge, MA, USA Mingda Qiao ORCID University of Massachusetts Amherst, MA, USA
Abstract

Membership queries (MQ) often yield speedups for learning tasks, particularly in the distribution-specific setting. We show that in the testable learning model of Rubinfeld and Vasilyan [21], membership queries cannot decrease the time complexity of testable learning algorithms beyond the complexity of sample-only distribution-specific learning. In the testable learning model, the learner must output a hypothesis whenever the data distribution satisfies a desired property, and if it outputs a hypothesis, the hypothesis must be near-optimal.

We give a general reduction from sample-based refutation of boolean concept classes, as presented in [23, 17], to testable learning with queries (TL-Q). This yields lower bounds for TL-Q via the reduction from learning to refutation given in [17]. The result is that, relative to a concept class and a distribution family, no m-sample TL-Q algorithm can be super-polynomially more time-efficient than the best m-sample PAC learner.

Finally, we define a class of “statistical” MQ algorithms that encompasses many known distribution-specific MQ learners, such as those based on influence estimation or subcube-conditional statistical queries. We show that TL-Q algorithms in this class imply efficient statistical-query refutation and learning algorithms. Thus, combined with known SQ dimension lower bounds, our results imply that these efficient membership query learners cannot be made testable.

Keywords and phrases:
Testable learning, PAC learning
Funding:
Jane Lange: Supported in part by NSF Awards CCF-2006664, DMS-2022448, CCF-2310818, and Big George Fellowship
Copyright and License:
[Uncaptioned image] © Jane Lange and Mingda Qiao; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Query learning
Related Version:
Full Version: https://arxiv.org/abs/2512.02279
Editor:
Shubhangi Saraf

1 Introduction

In distribution-specific PAC learning, a learning algorithm is only required to output a competitive hypothesis when the data distribution satisfies some property. Distribution-specific PAC often allows for much more efficient learning than distribution-free PAC, but with the following shortcoming: if the distribution does not satisfy the property, then the behavior of the learner is completely undefined.

A testable agnostic learning algorithm [21] alleviates this shortcoming by combining a distribution-specific learner with a tester for the desired property. It may output a hypothesis or it may reject the distribution and output . The testable learner is run on i.i.d. samples from an unknown distribution 𝒟 over 𝒳×{0,1}, and has the following behavior:

  • Soundness: If the learner outputs a hypothesis h, then with high probability

    Pr(x,y)𝒟[h(x)y]𝗈𝗉𝗍+ε,

    where 𝗈𝗉𝗍 is the error of the best concept in the concept class. A semi-agnostic variant of this condition, with 𝗈𝗉𝗍+ε replaced by O(𝗈𝗉𝗍)+ε, has also been considered.

  • Completeness: If the distribution 𝒟 has the desired property, then the learner outputs a hypothesis (instead of ) with high probability.

There is a significant body of work studying the sample and time complexities of testable learning for various concept classes and distribution properties in both the agnostic (𝗈𝗉𝗍+ε) and semi-agnostic (O(𝗈𝗉𝗍)+ε) settings [21, 13, 8, 12, 22]. There are efficient algorithms for testable learning in cases where distribution-free learning cannot be done efficiently.

1.1 The Power of Membership Queries in Agnostic Learning

One might hope to testably learn more efficiently by strengthening the learner’s access to the data distribution. In the membership query (MQ) model, we think of the data as being drawn from a distribution 𝒟x over 𝒳 and labeled by some unknown function f:𝒳{0,1}. The learner gets i.i.d. samples from 𝒟x and may also query any point x𝒳 and receive its label f(x).

The work of [10] shows that under the standard cryptographic assumption of one-way functions, membership queries can speed up agnostic learning in the distribution-specific setting, but not in the distribution-free setting. In the distribution-free setting, every concept class that can be agnostically learned with MQs can be agnostically learned with just random examples. In contrast, in the uniform-distribution-specific setting, there exists a concept class that can be learned strictly more efficiently with MQs.

Separations exist for more “natural” concept classes under the stronger assumption that learning sparse parities with noise (LSPN) is hard. For example, over the uniform distribution on {0,1}n, k-juntas can be learned in poly(n)2O(k) time with membership queries [3, 20]. On the other hand, there is a statistical query lower bound of nΩ(k) [4], and if LSPN is hard then one cannot hope to do better than this bound with random examples. Similarly, polynomial-size decision trees can be learned improperly in polynomial time [18, 14] and properly in nO(loglogn) time [1] with membership queries, while there is an SQ lower bound of nΩ(logn) [2], and LSPN implies one cannot do better than this bound either.

1.2 Limitations of Membership Queries in Testable Learning

If membership queries do help in the distribution-specific setting but do not help in the distribution-free setting, one may then naturally wonder whether they ought to help in the testable setting as well. A testable learner with queries (TL-Q) has both sample access to the unknown data distribution and membership query access to the unknown function, and must satisfy the soundness and completeness guarantees of ordinary testable learning.

Question 1.

How much can membership queries speed up the task of testable learning?

Our results show that membership queries are quite weak in the TL-Q setting. Particularly, whenever agnostic learning with random examples is hard – as is believed to be the case for juntas and decision trees – testable learning is hard as well, even with queries.

Theorem 2 (Corollary 17, informal).

If a concept class 𝒞 is agnostically testably learnable with queries in time t over a distribution 𝒟, then it is agnostically learnable with random examples in time poly(t) over 𝒟 as well.

Corollary 3.

If LSPN is hard, then no concept class containing k-parities as a subset can be agnostically testably learned in no(k) time over the uniform distribution, even with membership queries.

Furthermore, we show that SQ lower bounds rule out a large class of natural query-based learning algorithms. We define a class of “statistical” membership query (MQ-SQ) algorithms – those that use membership queries only to sample from particular distributions over the input domain. For example, algorithms that use MQs only to estimate influences or to make SQs over large subsets of {0,1}n are MQ-SQ algorithms (this includes the aforementioned uniform-distribution algorithms of [18, 14, 1]). We show that such algorithms cannot be “made testable” with respect to the uniform distribution without introducing non-statistical use of membership queries, due to the SQ lower bounds.

Theorem 4 (Theorem 29, informal).

If a concept class 𝒞 is testably learnable in time t over a distribution 𝒟 by an MQ-SQ algorithm, then the SQ dimension of 𝒞 with respect to 𝒟 is at most poly(t).

1.3 Technical Overview

Our reductions are through the intermediate task of refutation. Refutation, as presented in [23, 17], is the problem of distinguishing examples correlated with some function in the concept class from examples labeled uniformly at random. The work of [23] shows that in the distribution-free setting, refutation and realizable learning are polynomially equivalent, and the work of [17] shows an analogous statement in the distribution-specific agnostic setting. By giving an efficient reduction from distribution-specific refutation (without queries) to testable learning (with queries), we show that distribution-specific agnostic learning reduces to TL-Q as well.

As a warm-up, consider a special case of refutation where the labels are promised to either be completely random or exactly match some function in the class 𝒞. Let the distribution be uniform over {0,1}n. Assume for simplicity that all functions in 𝒞 are balanced, i.e., 𝔼x{0,1}n[f(x)]=1/2.

Definition 5 (Exact refutation over the uniform distribution, informal).

An exact refutation algorithm for a concept class 𝒞 takes an m-tuple {(x1,y1),,(xm,ym)} of examples where the x’s are drawn uniformly at random from {0,1}n. It outputs either 𝗇𝗈𝗂𝗌𝖾 or 𝗌𝗍𝗋𝗎𝖼𝗍𝗎𝗋𝖾 with the following guarantees:

  • Completeness: If the examples are consistent with some g𝒞, then

    Pr[𝒜 outputs 𝗌𝗍𝗋𝗎𝖼𝗍𝗎𝗋𝖾]2/3.
  • Soundness: If the yi’s are drawn i.i.d. from 𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(1/2), then

    Pr[𝒜 outputs 𝗇𝗈𝗂𝗌𝖾]2/3.

Suppose we want to implement exact refutation using a TL-Q algorithm. If we had any agnostic learning algorithm that did not require queries, the task would be trivially easy: split {(x1,y1),,(xm,ym)} into training and test sets, run the learner on the training set, estimate the error of the returned hypothesis on the test set, and output 𝗌𝗍𝗋𝗎𝖼𝗍𝗎𝗋𝖾 if the test error is, say, 1/10. If we are in the 𝗌𝗍𝗋𝗎𝖼𝗍𝗎𝗋𝖾 case, the error will be ε, and if we are in the 𝗇𝗈𝗂𝗌𝖾 case, with high probability the error will be close to 1/2.

Instead we have to answer queries, so we will answer them randomly. Specifically, we will draw a random function f:{0,1}n{0,1}, and whenever the TL-Q algorithm wants to make a query, we will answer according to the f we chose. We will filter both the training and test sets to just those points where y=f(x). This means that we essentially sample from a domain that is uniform over the portion of {0,1}n where y(x) agrees with f(x).

As before, if the TL-Q algorithm produces a hypothesis, we will output 𝗌𝗍𝗋𝗎𝖼𝗍𝗎𝗋𝖾 if the error is 1/10 and 𝗇𝗈𝗂𝗌𝖾 if the error is greater. But since TL-Q can also reject the instance and output , if it does so, we will output 𝗌𝗍𝗋𝗎𝖼𝗍𝗎𝗋𝖾.

Notice that in the noise case, each xi is filtered out independently with probability 1/2; therefore the distribution of samples is uniform. By completeness of the TL-Q algorithm, it must then output a hypothesis, and with high probability the error will be close to 1/2. In the structure case, however, the distribution of xi’s may be far from uniform, in which case the TL-Q algorithm may output . It may also output a hypothesis – but by soundness, the hypothesis must have error ε, since the samples come from a distribution such that y=f(x) for every x in its support.

Our reduction from refutation to TL-Q is basically a generalization of this strategy, adapted to handle unbalanced functions and accept functions that are close to, but not exactly, in 𝒞.

1.3.1 An SQ-Preserving Reduction

We observe that some membership query algorithms, such as those of [18, 14, 1], use membership queries only to estimate statistical properties of the unknown function. We roughly categorize MQ-SQ queries as follows (formalized in Definition 20):

  • Standard SQs: queries of the form 𝔼x𝒟[ϕ(x)] or 𝔼x𝒟[ϕ(x)f(x)] for a test function ϕ. These don’t require membership queries to implement.

  • Pair SQs: queries of the form 𝔼x𝒟[ϕ(x)f(x)f(π(x))], where π is a permutation of the domain without fixed points. This generalizes influence estimation.

  • Customized distribution SQs: Any of the above queries, where the expectation is taken over a specific (and sufficiently spread-out) distribution 𝒟 instead of the unknown distribution 𝒟. This generalizes making SQs over restrictions of {0,1}n.

Our goal is to simulate an MQ-SQ testable learner by making only SQs to the unknown distribution 𝒟refut. (over 𝒳×{0,1}) in the refutation instance. As in the non-SQ setting, we choose a random function to be the target function and answer queries according to that random function.

For example, the customized-distribution query 𝔼x𝒟[f(x)ϕ(x)] is easy to simulate with just one SQ to 𝒟refut.: simply estimate the mean p𝔼(x,y)𝒟refut.[y], and answer the MQ-SQ with p𝔼x𝒟[ϕ(x)] (sampling from 𝒟 requires neither samples nor membership queries, since it’s the customized distribution). The value of this MQ-SQ concentrates around this estimate as each f(x) is an independent random variable with mean p.

Pair SQs are handled similarly, though in this case the random variables f(x)f(π(x)) are not independent. However, since the dependence graph of these variables decomposes into cycles, we can partition the graph into large independent sets and prove concentration of the variables within the independent sets. Thus, for example, the MQ-SQ 𝔼x𝒟[ϕ(x)f(x)f(π(x))] can be answered with p𝔼(x,y)𝒟refut.[ϕ(x)y], which is an SQ to 𝒟refut..

1.4 Related Work and Discussion

The power of membership queries

It is well known that in the realizable setting, PAC learners with membership queries are strictly stronger than PAC learners without them under standard cryptographic assumptions [9, 11]. The work of [10] establishes an equivalence between PAC with random examples and PAC with membership queries in the distribution-free agnostic setting, and a separation in the distribution-specific agnostic setting.

Testable learning and friends

There are many papers that address the computational and sample complexities of testably learning various natural concept classes; these works are in the standard agnostic testable learning model. Some examples, but certainly not all, are the works of [21, 13, 8, 12, 22]. Some works addressing related problems include [16, 15, 19].

The work of [13] characterizes the sample complexity of testable learning by the Rademacher complexity, which is especially relevant to this work in light of the result of [17], which establishes refutation complexity as an analogue of Rademacher complexity for the computationally bounded setting.

Learning and refutation

A connection between learning and refutation was first introduced in [6] as a means of proving computational lower bounds for learning based on the assumption that refuting random CSPs is hard, and other works including those of [7, 5] use this method to give conditional lower bounds for various learning problems. Of particular relevance to this work are [23, 17], which give polynomial equivalences between PAC learning and refutation.

1.4.1 Directions for Future Work

While our work reduces ordinary sample-based PAC learning to query-based testable learning, we do not resolve the strongest, most natural question on the power of membership queries: whether sample-based testable learning reduces to query-based testable learning. This would be a strictly stronger lower bound for TL-Q than anything obtainable through refutation, as there are function classes for which refutation is known to be easier than testable learning with samples – for example, the class of monotone functions. It is proven in [21] that testably learning monotone functions on the uniform distribution requires 2Ω(n) samples. On the other hand, agnostic learning (and therefore refutation) can be done in 2O~(n) time and samples.

We leave this possible stronger lower bound as an open question for future work.

Conjecture 6.

If a concept class 𝒞 is agnostically testably learnable with queries in time t over a distribution 𝒟, then it is agnostically testably learnable with samples in time poly(t) over 𝒟 as well.

We also remark that in the semi-agnostic setting, our method relates semi-agnostic TL-Q to weak agnostic learning. It is an open question for future work to resolve the connection between semi-agnostic TL-Q and semi-agnostic learning as well.

2 Preliminaries

2.1 Distances and Errors

Definition 7 (Distance of functions and distance to a concept class).

Relative to a distribution 𝒟 over 𝒳×{0,1} with 𝒳-marginal 𝒟x, we denote

dist𝒟x(f,g)=Prx𝒟x[f(x)g(x)]
err𝒟(f)=Pr(x,y)𝒟[f(x)y].

We also use the following notation to denote the classification error of the most accurate concept in a class, which we often refer to as 𝗈𝗉𝗍:

dist𝒟x(f,𝒞)=infg𝒞dist𝒟x(f,g).
err𝒟(𝒞)=infg𝒞err𝒟(g).

2.2 Refutation and Learning

We state a definition of refutation similar to the definition presented in [17]. We have modified it to use classification error rather than correlation, for ease of use in our {0,1}-labeled setting ([17] uses {1,1} labels).

Definition 8 (η-refutation).

Let 𝒞{f:𝒳{0,1}} be a concept class over a finite input domain 𝒳, and let be a family of distributions on 𝒳. An η-refutation algorithm 𝒜 for 𝒞 on with m samples is an algorithm that takes an m-tuple of labeled examples {(x1,y1),,(xm,ym)} and outputs either 𝗇𝗈𝗂𝗌𝖾 or 𝗌𝗍𝗋𝗎𝖼𝗍𝗎𝗋𝖾. If the examples are i.i.d. from a distribution 𝒟 over 𝒳×{0,1} such that the marginal on 𝒳 is some 𝒟x, then the following guarantees hold:

  • Completeness: If there exists g𝒞 such that err𝒟(g)η, then

    Pr{(xi,yi)}𝒟internal randomness of 𝒜[𝒜 outputs 𝗌𝗍𝗋𝗎𝖼𝗍𝗎𝗋𝖾]2/3.
  • Soundness: If the yi’s are drawn i.i.d. from 𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(1/2), then

    Pr{(xi,yi)}𝒟internal randomness of 𝒜[𝒜 outputs 𝗇𝗈𝗂𝗌𝖾]2/3.

We also define a similar but stronger task:

Definition 9 (Biased (α,η)-refutation).

A biased-(α,η)-refutation algorithm is as above except the soundness condition is the following:

  • Soundness: For all p[α,1α], if the yi’s are drawn i.i.d. from 𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(p), then

    Pr{(xi,yi)}𝒟internal randomness of 𝒜[𝒜 outputs 𝗇𝗈𝗂𝗌𝖾]2/3.

Here we state definitions and facts used in [17]’s reduction from refutation to agnostic learning. Again we modify these statements to use classification error rather than correlation.

Definition 10 (Weak agnostic learning).

A (γ,α)-weak agnostic learner for the concept class 𝒞 over the distribution 𝒟 outputs a hypothesis h satisfying the following:

err𝒟(h)1+αγ2+γerr𝒟(𝒞).
Lemma 11 (Learning by refutation: Lemma 6 of [17]).

Suppose there is an η-refutation algorithm for the class 𝒞 over distribution 𝒟 running in T(n) time with m samples. Then there is an algorithm that runs in T(n)m2ε2 and uses O(m3ε2) samples to agnostically learn 𝒞 on 𝒟 with excess error 12η+ε.

2.3 Testable Learning with Queries

We now define semi-agnostic testable learning with queries.

Definition 12 (Testable learning with queries, or TL-Q).

A concept class 𝒞 over the input domain 𝒳 is (c,ε,δ)-PAC-testably-learnable with q queries, m samples, and t time, on a set of distributions over 𝒳, if there is a t-time algorithm 𝒜 that takes m samples from an unknown distribution 𝒟 and q membership queries to an unknown function f, and outputs h𝒞{} such that:

  • Soundness: If h, then

    Pr[dist𝒟(h,f)c𝗈𝗉𝗍+ε]δ.
  • Completeness: If 𝒟, then

    Pr[h=]δ.

3 Refutation, Learning, and Testable Learning with Queries

3.1 A General Reduction from Refutation to TL-Q

In this section, we show that if a class is efficiently testably-learnable, then it is efficiently refutable as well, with polynomial dependence on the sample, time, and query complexity of the testable learner.

In fact, our Algorithm 1 solves the harder problem of biased refutation (Definition 9), though the distinction between the two will not be relevant until Section 3.3. A biased refutation algorithm can always be used to solve the standard unbiased refutation problem, as the soundness guarantee for biased refutation must always hold when p=1/2. To avoid confusion, we let 𝒟refut. denote the distribution of labeled pairs (x,y)𝒳×{0,1} in the refutation instance, while 𝒟 denotes the unknown marginal distribution in a TL-Q instance.

Algorithm 1 BiasedRefutation(𝗌𝖺𝗆𝗉𝗅𝖾𝗌,η,ε,m,q,c).

The main result of this section is the following:

Theorem 13.

Let 𝒞 be (c,ε,110) PAC-testably-learnable with m samples, q queries, and t time, on a distribution family satisfying

m+q/ε21sup𝒟x(𝒟x2).

Then for any ε satisfying ε2cksup𝒟x(𝒟x2) for sufficiently large constant k, and any η<1/24εc, 𝒞 is (cη+4ε,η)-refutable over all members of with m samples and t time, where

m=O(m+1/ε2ε+q)
t=O(m+t).

Algorithm 1 essentially draws a random function f of bias p, where p is the mean of the labels in the refutation distribution 𝒟refut., and filters the samples to just pairs where y=f(x). The drawing of f is “lazy:” to draw f and answer queries to it, it suffices to draw each value of f(x) from 𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(p) the first time we need to know f(x), then store (x,f(x)) in a table for consistency. We implement the random coin by reading a label, as the labels are distributed as 𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(p).

3.1.1 Properties of the Filtered Sample Distribution

Before proving the theorem, it will be useful to have the following supporting claims about the distribution that the sets S𝗍𝗋𝖺𝗂𝗇 and S𝗍𝖾𝗌𝗍 are drawn from. This distribution 𝒟 – which is the unknown distribution that the TL-Q instance is running on – depends on the random function f. Formally, the PMF of 𝒟 is the following, and one may observe that the construction of S in lines 7-11 of Algorithm 1 produces samples from this distribution:

Definition 14 (Filtered sample distribution).

Let 𝒟refut. be a distribution over 𝒳×{0,1} with 𝒳-marginal 𝒟x and let f:𝒳{0,1} be a p-biased random function. Let the function y(x) be defined as 𝔼(x,y)𝒟refut.[yx=x]. Then the filtered sample distribution 𝒟 is defined by the following PMF:

𝒟(x)=𝒟x(x)Z(p(1y(x))(1f(x))+(1p)y(x)f(x)),

where Z is the normalization factor

Z=𝔼x𝒟x[(1p)y(x)f(x)+p(1y(x))(1f(x))].

We state the properties below; the proofs are deferred to the full version of this paper.

Lemma 15.

For every δ0, it holds with probability at least 12exp(Ω(δ2p2(1p)2𝒟x22)) over the randomness of f that

|Zp(1p)|δp(1p).
Claim 16.

Let g:𝒳{0,1} be an arbitrary function. For any δ, with probability at least 1exp(Ω((δp)2(1p)2𝒟x22)) over the randomness of f, we have

Prx𝒟[g(x)f(x)]Pr(x,y)𝒟refut.[g(x)y]+δ.

3.1.2 Proof of Theorem 13

With the above properties in hand, we will now prove the main theorem of this section.

Proof of Theorem 13.

Let 𝒜 be the (c,ε,1/10)-testable learner for 𝒞. We will show that Algorithm 1 is a (cη+4ε,η) refutation algorithm over any 𝒟refut. with 𝒳-marginal 𝒟x such that 𝒟x. Since the samples come from a refutation instance, one of the following must hold:

  • Structure: Pr(x,y)𝒟refut.[g(x)y]η for some g𝒞, or

  • Noise: y𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(p) for some p[cη+4ε,1cη4ε].

We will refer to 𝔼(x,y)𝒟refut.[y] as p regardless of whether we are in the structure case or the noise case. We consider the following possibilities:

  • Structure: In this case, there is some function g𝒞 such that Pr(x,y)𝒟refut.[g(x)y]η. By setting the constant C1 large enough, we have by Hoeffding’s inequality that Pr[|p^p|>ε]1/100. With the remaining probability, if p<ε or p>1ε we would output 𝗌𝗍𝗋𝗎𝖼𝗍𝗎𝗋𝖾 after line 4, so we will assume from here that εp1ε.

    We will denote by 𝒟 the distribution over 𝒳 from which our sample set S is drawn, as discussed in Section 3.1.1. Since εp1ε, by our assumption that ε2cksup𝒟x(𝒟x2) we have

    (ε/c)2p2(1p)2𝒟x22Ω(ε4c2𝒟x22)Ω(k2).

    By Claim 16, setting the constant k to be large enough, we then have

    dist𝒟(f,𝒞)Prx𝒟[f(x)g(x)]+ε/cη+ε/c

    with probability at least 1exp(Ω(k2))99/100 over the randomness of f. When this happens, 𝒜 has two possible sound behaviors: either output , or output h satisfying

    dist𝒟(h,f)c(η+ε/c)+εcη+2ε.

    By the TL-Q guarantee, it produces one of these sound behaviors with probability at least 9/10. By setting the constant C3 large enough, by Hoeffding’s inequality if 𝒜 outputs a hypothesis h, it satisfies 𝖾𝗋𝗋^(h)dist𝒟(h,f)+ε with probability at least

    1exp(2ε2|T|)99/100.

    When this happens, we have 𝖾𝗋𝗋^(h)cη+3ε, so we successfully output 𝗌𝗍𝗋𝗎𝖼𝗍𝗎𝗋𝖾.

  • Noise: In this case, the labels are drawn from 𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(p), and the elements are drawn from 𝒟. Since y(x)=p for all x, we have

    𝒟(x)=𝒟x(x)p(1p)ZandZ=z𝒳𝒟x(x)p(1p).

    Thus, 𝒟=𝒟x, so by completeness of 𝒜, it must output with probability at most 1/10. With the remaining probability, 𝒜 outputs a hypothesis h; we will argue that with high probability its error on S𝗍𝖾𝗌𝗍 is at least min(p,1p)ε>cη+3ε. We return an error if any element in the test set appears anywhere else, so assuming this does not happen, each label in the test set is a new independent draw from 𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(p). Thus we have:

    PrxT[h(x)f(x)] =1|T|xT:h(x)=0𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(p)+1|T|xT:h(x)=1𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(1p)
    1|T|𝖡𝗂𝗇𝗈𝗆𝗂𝖺𝗅(min(p,1p),|T|).

    By a Hoeffding bound it follows that 𝖾𝗋𝗋^(h)min(p,1p)ε with probability at least

    12exp(2ε2|T|) 99/100.

    Thus, we successfully output 𝗇𝗈𝗂𝗌𝖾 with high probability.

In both cases the refutation algorithm succeeds with probability 4/5 conditioned on not returning an error due to either insufficient samples, duplicate samples, or overlap between the query set and the test set.

We will set the sample complexity so that the probability of insufficient samples is small. Let B be the number of samples reserved for drawing from 𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(p) or estimating p^. Each of the remaining mB samples is included in S with probability p(1p). Thus we will set mB100(cη+4ε)(1cη4ε)(m+C3/ε2). Then we have by a Chernoff bound,

Pr[|S|<m+C3ε2]exp((99/100)22p(1p)(mB))1/100.

To set B, we need 2 draws for each of the samples and 1 draw for each membership query; by setting C2 large enough this requirement is satisfied. Thus the total sample complexity is

mO(m+1/ε2ε+q)

and the total time complexity is O(m+t).

Finally, we will bound the probability of duplicate samples and overlap with the query set. By the assumption that m+1/ε21/𝒟x2 and the fact that each pair of samples collides with probability 𝒟x22, it follows from a union bound over the pairs of samples in S that w.h.p. there are no duplicates in S𝗍𝗋𝖺𝗂𝗇S𝗍𝖾𝗌𝗍. Furthermore, by the assumption that q/ε21/𝒟x21/𝒟x, it follows that every set of size q has distributional mass ε2. Thus, with high probability none of the C3/ε2 elements in the test set appear in the query set.

Union bounding over all the failure probabilities in each case, the total success probability remains at least 2/3.

3.2 TL-Q Implies Sample-Based Learnability

A corollary of Theorem 13 is the fact that TL-Q implies efficient sample-based, distribution-specific agnostic learning.

Corollary 17 (Testable learning with queries implies learning with samples).

Let 𝒞 be (c,ε,110) PAC-testably-learnable with m samples, q queries, and t time, on a distribution family satisfying

m+q/ε21sup𝒟x(𝒟x2).

Then for any ε satisfying ε2cksup𝒟x(𝒟x2) for sufficiently large constant k, there is an agnostic learner for 𝒞 over with excess error 11/c+O(ε). In particular, when c=1, i.e. 𝒞 is fully agnostically learnable in TL-Q, 𝒞 is fully agnostically learnable with samples.

The sample complexity of the learner is

O((m)3/ε2))wherem=O(m+1/ε2ε+q)

and the time complexity is

O((m)2(m+t)ε2).
Proof.

By Theorem 13, there is a (cη+4ε,η)-refutation algorithm for any η<1/24εc. Observe that biased (α,η)-refutation is at least as strong as η-refutation: the (α,η)-refutation algorithm can be used to solve η-refutation, as p=1/2 is always in the range [α,1α]. By Lemma 11, this gives an agnostic learner with excess error 121/24εc+ε=11/c+O(ε). The time and sample bounds are obtained by combining the bounds in Lemma 11 with those in Theorem 13.

3.3 Realizably Learning Juntas via Exact Refutation

In the above subsections, we gave a general reduction from TL-Q to agnostic learning, citing as a black box the learning-by-refutation lemma of [17], Lemma 11. This lemma yields learners whose excess error depends on the refutation gap parameter η, which in our case must necessarily be smaller than 1/2c, as one cannot hope to distinguish a function that is (1/2c)-close to the concept class from a random function using a c-semi-agnostic learner. Thus the performance of this learner quickly degrades with c, becoming trivial when c=2.

For the class of sparse juntas over the uniform distribution on {0,1}n, we give a realizable learner from an algorithm that solves the easier task of exact refutation (η=0). Exact refutation reduces to c-semi-agnostic TL-Q for any value of c, via Theorem 13. Thus we show that for juntas, even for large values of c, semi-agnostic TL-Q is as hard as learning with samples. We state the result below; the proof is deferred to the full version of this paper.

Lemma 18.

Let ε2n/4 and m+q/ε22n/2. For any constant c, if the class of k-juntas is (c,ε,110)-PAC-testably-learnable with q queries, m samples, and t time over the uniform distribution on {0,1}n, then the class of k-juntas is agnostically learnable over the uniform distribution with excess error O(ε) and confidence 1δ, with sample complexity

m=(m+1/ε2ε+q)2O(k)nlog2(n/δ)

and time complexity

t=(m+1/ε2ε+q+t)2O(k)nlog(n/δ).

Since k-sparse parities are a subclass of k-juntas, we conclude that even semi-agnostic TL-Q cannot be done in no(k) time, under the assumption that LSPN is hard.

Corollary 19.

If LSPN requires nΩ(k) time, then for any constant c, c-semi-agnostic TL-Q for k-juntas over the uniform distribution requires nΩ(k) time.

4 MQ-SQ Lower Bounds

We introduce a class of “MQ-SQ” (membership-query-statistical-query) algorithms that capture many existing learning algorithms that use membership queries. Then, we prove an SQ-analogue of the reduction in Section 3.1: an MQ-SQ testable learner for a class 𝒞 implies an SQ algorithm for refutation (Definition 9). This result, together with a reduction from SQ weak learning to SQ refutation, allows us to prove lower bounds against MQ-SQ algorithms for testably learning several fundamental concept classes, including parity functions, k-juntas, and decision trees.

4.1 Five Types of MQ-SQs

Let 𝒳 denote the instance space and f:𝒳{0,1} denote the target function in the TL-Q instance. Let 𝒟Δ(𝒳) be the unknown marginal distribution over 𝒳. An MQ-SQ oracle for (f,𝒟) answers the following five types of queries up to a small error.

Definition 20 (MQ-SQ Oracle).

An MQ-SQ oracle with tolerance τ0 answers the following five types of queries within an additive error of τ, given any test function ϕ:𝒳[0,1], any distribution 𝒟Δ(𝒳), and any permutation π:𝒳𝒳 without fixed points:

  • Type I: 𝔼x𝒟[ϕ(x)f(x)].

  • Type II: 𝔼x𝒟[ϕ(x)f(x)f(π(x))].

  • Type III: 𝔼x𝒟[ϕ(x)].

  • Type IV: 𝔼x𝒟[ϕ(x)f(x)].

  • Type V: 𝔼x𝒟[ϕ(x)f(x)f(π(x))].

For concreteness, consider the case that 𝒳={0,1}n is the hypercube. By setting 𝒟 to the uniform distribution over {0,1}n (denoted by 𝒰) in Type I queries, we recover the usual statistical query model for learning over 𝒰. Queries of Types II and V allow us to estimate the correlation between f and a permuted version of f (weighted by ϕ) over both a generic customized distribution 𝒟 and the unknown marginal 𝒟, respectively. For instance, setting π:xxi and 𝒟=𝒰 allows us to estimate the influence of variable xi with respect to f.

When the distribution 𝒟 in a Type I MQ-SQ is degenerate (i.e., with all its probability mass on a single point x0𝒳), the MQ-SQ reduces to a usual membership query at x0. Our reduction for MQ-SQ algorithms only works when no such queries are made. More concretely, the reduction requires that the squared 2-norm of 𝒟, 𝒟22x𝒳[𝒟(x)]2, is sufficiently small in all queries that the algorithm makes. As we will see later, when many existing query-based learning algorithms are implemented using MQ-SQs, 𝒟 is usually uniform over a size-2Ω(n) subset of {0,1}n. This implies 𝒟222Ω(n), so our reduction applies to most of the interesting parameter regimes.

4.2 Implementing Query-Based Learning Algorithms Using MQ-SQs

We note that many existing MQ-based learning algorithms (or components thereof) for the uniform distribution 𝒰 over the hypercube 𝒳={0,1}n can be implemented using MQ-SQs. Examples of well-known MQ algorithms and their MQ-SQ implementations are given in the full version of this paper.

4.3 MQ-SQ Testable Learning Implies SQ Refutation

We will show that, if there is an efficient MQ-SQ algorithm that testably learns class 𝒞, the same class can be refuted by an efficient SQ algorithm. To this end, we first recall the definition of a (usual) SQ-based algorithm in the context of refutation (Definition 9). As in Section 3.1, we let 𝒟refut. denote the distribution of labeled pairs in the refutation instance and 𝒟 denote the unknown marginal distribution in a TL-Q instance.

Definition 21 (SQ oracle for refutation).

An SQ oracle for refutation with tolerance τ0 answers queries of form 𝔼(x,y)𝒟refut.[ϕ(x,y)] within an additive error of τ given a test function ϕ:𝒳×{0,1}[0,1].

Recap: Reduce refutation to TL-Q

We start by recalling the reduction from Section 3.1. Let pPr(x,y)𝒟refut.[y=1] be the fraction of positive labels in the refutation instance. We consider the TL-Q instance on the same concept class 𝒞, where the target function f:𝒳{0,1} is chosen as a random p-biased function, i.e., each function value f(x) is sampled from 𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(p) independently. The marginal distribution 𝒟Δ(𝒳) is the distribution of the output x𝒳 produced by the following procedure:

  • Sample (x,y)𝒟refut..

  • If y=f(x)=1, output x with probability 1p.

  • If y=f(x)=0, output x with probability p.

  • If no output is produced, return to the first step.

Formally, the probability mass function of 𝒟 is given by

𝒟(x)=1Z𝒟x(x)[(1p)y(x)f(x)+p(1y(x))(1f(x))], (1)

where 𝒟xΔ(𝒳) is the 𝒳-marginal of 𝒟refut.,

y(x)𝔼(x,y)𝒟refut.[yx=x]

is the conditional expectation of yx over 𝒟refut., and

Zx𝒳𝒟x(x)[(1p)y(x)f(x)+p(1y(x))(1f(x))] (2)

is a normalization factor.

Simulation of oracle queries

We state the following lemmas, whose proofs are deferred to the full version of this paper.

Lemma 22 (Types I and II).

The following holds for every ε0, ϕ:𝒳[0,1], 𝒟Δ(𝒳) and permutation π:𝒳𝒳 without fixed points:

  • With probability at least 12eΩ(ε2/𝒟22) over the randomness of f,

    |𝔼x𝒟[ϕ(x)f(x)]p𝔼x𝒟[ϕ(x)]|ε.
  • With probability at least 16eΩ(ε2/𝒟22) over the randomness of f,

    |𝔼x𝒟[ϕ(x)f(x)f(π(x))]p2𝔼x𝒟[ϕ(x)]|ε.
Lemma 23 (Types III, IV and V).

The following holds for every ε0, ϕ:𝒳[0,1] and permutation π:𝒳𝒳 without fixed points:

  • With probability at least 14eΩ(ε2p2(1p)2/𝒟x22) over the randomness of f,

    |𝔼x𝒟[ϕ(x)]𝔼(x,y)𝒟refut.[ϕ(x)]|ε.
  • With probability at least 14eΩ(ε2p2(1p)2/𝒟x22) over the randomness of f,

    |𝔼x𝒟[ϕ(x)f(x)]𝔼(x,y)𝒟refut.[ϕ(x)y]|ε.
  • With probability at least 18eΩ(ε2p2(1p)2/𝒟x22) over the randomness of f,

    |𝔼x𝒟[ϕ(x)f(x)f(π(x))]p𝔼(x,y)𝒟refut.[ϕ(x)y]|ε.

Now, we put everything together and prove our main result on MQ-SQ algorithms for testable learning.

Proposition 24.

Let 𝒞 be a concept class of boolean functions over instance space 𝒳. Suppose that there is a (c,ε,δ)-PAC MQ-SQ algorithm that testably learns 𝒞 over distribution family Δ(𝒳) using at most q queries to an MQ-SQ oracle with tolerance τ>0. Then, there is an algorithm that solves biased-(α,η)-refutation on 𝒞 over the same distribution family by making at most q=q+O(1) queries to an SQ oracle with tolerance τ=τ/4 and has a failure probability of at most δ=δ+O(q)eΩ(τ2B), assuming the following:

  1. 1.

    α>cη+ε+(c+4)τ+6τ;

  2. 2.

    Bp2(1p)2/𝒟x22 for every 𝒟x, where p=𝔼(x,y)𝒟refut.[y] is the average label in the refutation instance;

  3. 3.

    B1/𝒟22 holds for all MQ-SQs of Types I and II that 𝒜 makes.

Proof.

Let 𝒜 denote the hypothetical MQ-SQ algorithm that testably learns 𝒞. We construct a new algorithm, denoted by 𝒜, that refutes 𝒞 using an SQ oracle by simulating the execution of 𝒜.

Recall that we defined p=𝔼(x,y)𝒟refut.[y]. As the first step of the algorithm, 𝒜 queries the SQ oracle (Definition 21) with ϕ(x,y)=y to obtain an estimate p^[pτ,p+τ] for p.

Handling queries.

Whenever the simulated copy of 𝒜 makes an MQ-SQ, 𝒜 answers the query as follows:

  • When 𝒜 makes a Type I query on 𝔼x𝒟[ϕ(x)f(x)], 𝒜 returns p^𝔼x𝒟[ϕ(x)].

  • When 𝒜 makes a Type II query on 𝔼x𝒟[ϕ(x)f(x)f(π(x))], 𝒜 returns p^2𝔼x𝒟[ϕ(x)].

  • When 𝒜 makes a Type III query on 𝔼x𝒟[ϕ(x)], 𝒜 queries the SQ oracle on the value of μ=𝔼(x,y)𝒟refut.[ϕ(x)] and then forwards the answer μ^[μτ,μ+τ] to 𝒜.

  • When 𝒜 makes a Type IV query on 𝔼x𝒟[ϕ(x)f(x)], 𝒜 queries the SQ oracle on the value of μ=𝔼(x,y)𝒟refut.[ϕ(x)y] and then forwards the answer μ^[μτ,μ+τ] to 𝒜.

  • When 𝒜 makes a Type V query on 𝔼x𝒟[ϕ(x)f(x)f(π(x))], 𝒜 queries the SQ oracle on the value of μ=𝔼(x,y)𝒟refut.[ϕ(x)y] and then forwards the answer μ^[μτ,μ+τ] multiplied by p^ to 𝒜.

In the first two cases, 𝒜 can exactly compute the expectations since it has full knowledge of ϕ:𝒳[0,1] and 𝒟Δ(𝒳).

Decision rule.

When 𝒜 terminates, 𝒜 decides on the refutation instance as follows:

  • If 𝒜 rejects the TL-Q instance, 𝒜 returns 𝗌𝗍𝗋𝗎𝖼𝗍𝗎𝗋𝖾, indicating that some f𝒞 has an error η on 𝒟refut..

  • If 𝒜 accepts and returns a function f^:𝒳{0,1}, 𝒜 makes the following three additional MQ-SQs on behalf of 𝒜 and answer them as described above:

    𝔼x𝒟[f^(x)],𝔼x𝒟[f(x)],and𝔼x𝒟[f^(x)f(x)].

    Let μ^1,μ^2,μ^3 denote the answers to the three queries, and let

    μ^=μ^1+μ^22μ^3

    be a weighted sum of them. Note that μ^ is intended to be an estimate of

    𝔼x𝒟[f^(x)+f(x)2f^(x)f(x)]=Prx𝒟[f^(x)f(x)].

    Finally, 𝒜 outputs 𝗇𝗈𝗂𝗌𝖾 (indicating that the labels are random) if μ^min{p^,1p^}5τ, and outputs 𝗌𝗍𝗋𝗎𝖼𝗍𝗎𝗋𝖾 otherwise.

Overview of analysis.

We first upper bound the number of SQs that 𝒜 makes. By construction, 𝒜 queries the SQ oracle at most once for every MQ-SQ made by 𝒜. In addition, 𝒜 makes one query at the beginning and at most three queries at the end. Thus, 𝒜 makes at most q=q+O(1) queries in total.

To analyze the correctness of 𝒜, let f:𝒳{0,1} be a random p-biased function obtained by independently drawing the function value f(x) from 𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(p) for each x𝒳. Note that f is only for the analysis; it is never used in algorithm 𝒜. Also, let 𝒟 denote the distribution over 𝒳 induced by 𝒟refut. and f (see Equation 1). We will first argue that, with high probability, the simulated copy of 𝒜 effectively runs on an instance of testable learning with target function f and marginal distribution 𝒟. We will then show that the decision made by 𝒜 is correct due to the intended behavior of 𝒜 on such an instance.

A good event.

Let good be the “good event” that the three conditions below hold simultaneously:

  • The simulated execution of 𝒜 coincides with its execution on the testable learning instance (f,𝒟) using an MQ-SQ oracle with tolerance τ. In other words, every MQ-SQ made by 𝒜 is answered up to an additive error of τ.

  • If the first condition holds, the output of 𝒜 is valid with respect to the testable learning instance.

  • If there exists f𝒞 that satisfies Pr(x,y)𝒟refut.[f(x)y]η (i.e., we are in the 𝗌𝗍𝗋𝗎𝖼𝗍𝗎𝗋𝖾 case), it holds that Prx𝒟[f(x)f(x)]η+τ.

By Lemmas 22 and 23, for each MQ-SQ made by 𝒜, the first condition gets violated with probability at most 8eΩ(τ2B), where B is the minimum between the value of 1/𝒟22 (among all queries of Types I and II) and p2(1p)2/𝒟x22. Applying the union bound to the q+4 queries shows that the first condition gets violated with probability at most O(q)eΩ(τ2B). Since 𝒜 is assumed to be (c,ε,δ)-PAC, the second condition gets violated with probability at most δ. Applying Claim 16 with δ=τ shows that the third condition gets violated with probability at most eΩ(τ2p2(1p)2/𝒟x22)eΩ(τ2B). Applying the union bound again gives

Pr[good]1δO(q)eΩ(τ2B).

In the rest of the proof, we show that event good implies that 𝒜 decides correctly.

Proof of completeness.

Suppose that some f𝒞 satisfies Pr(x,y)𝒟refut.[f(x)y]η, where η is the parameter of the refutation instance (Definition 9). The third condition of the good event good implies that the same f has with an error 2η over 𝒟. Then, the testable learner 𝒜 may output either or a function f^:𝒳{0,1} that satisfies

Prx𝒟[f^(x)f(x)]cPrx𝒟[f(x)f(x)]+εc(η+τ)+ε.

In the former case, 𝒜 would correctly output 𝗌𝗍𝗋𝗎𝖼𝗍𝗎𝗋𝖾. For the latter case, applying the identity 𝟙[b1b2]=b1+b22b1b2 for b1,b2{0,1} gives

Prx𝒟[f^(x)f(x)]=𝔼x𝒟[f^(x)]+𝔼x𝒟[f(x)]2𝔼x𝒟[f^(x)f(x)].

Assuming event good, 𝒜 obtains an estimate of each of the three expectations on the right-hand side above up to an additive error of τ. It then follows that the value of μ^ computed at the end satisfies

μ^Prx𝒟[f^(x)f(x)]+4τcη+ε+(c+4)τ.

Since min{p,1p}α>cη+ε+(c+4)τ+6τ, we have

μ^<min{p,1p}6τmin{p^,1p^}5τ

in this case, and 𝒜 would correctly output 𝗌𝗍𝗋𝗎𝖼𝗍𝗎𝗋𝖾.

Proof of soundness.

Suppose that the distribution 𝒟refut. in the refutation instance is the product distribution of some 𝒟x and 𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(p). Then, the resulting marginal distribution 𝒟 in the testable learning instance is exactly 𝒟x. Thus, assuming that 𝒜 is correct, 𝒜 would accept and output a function f^:𝒳{0,1}. Then, at the end of 𝒜, we compute

μ^𝔼x𝒟[f^(x)]+𝔼x𝒟[f(x)]2𝔼x𝒟[f^(x)f(x)].

By the way in which 𝒜 handles the MQ-SQs, the three terms

𝔼x𝒟[f^(x)],𝔼x𝒟[f(x)],and𝔼x𝒟[f^(x)f(x)]

are approximated with

𝔼(x,y)𝒟refut.[f^(x)],𝔼(x,y)𝒟refut.[y],and𝔼(x,y)𝒟refut.[f^(x)y],

respectively. All the three values are obtained from querying the SQ oracle. Since the SQ oracle has a tolerance of τ, the value of μ^ is within an additive error of 4τ to

𝔼(x,y)𝒟refut.[f^(x)+y2f^(x)y]=Pr(x,y)𝒟refut.[f^(x)y],

which is exactly the error of f^ on distribution 𝒟refut.. Since 𝒟refut. is the product of 𝒟x and 𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(p), regardless of the choice of f^, Pr(x,y)𝒟refut.[f^(x)y] is at least min{p,1p}. Together with the fact that |pp^|τ, this further implies

μ^min{p,1p}4τmin{p^,1p^}5τ.

Thus, 𝒜 would correctly output 𝗇𝗈𝗂𝗌𝖾.

4.4 SQ Refutation Implies SQ Weak Learning

We show that an SQ algorithm for refutation implies an SQ algorithm for weakly learning the same concept class. Later, using the equivalence between SQ dimension and weak learning [2], we can lift SQ-dimension lower bounds to lower bounds against SQ-based refutation. By Proposition 24, this further leads to lower bounds against MQ-SQ algorithms for testable learning with queries.

We first recall the definition of SQ-based weak learning.

Definition 25 (SQ Weak learning).

Let 𝒞{f:𝒳{0,1}} be a concept class over a finite instance space 𝒳. Let 𝒟 be a given distribution over 𝒳 and f𝒞 be an unknown target function. An SQ oracle for (f,𝒟) with tolerance τ0 answers queries of form 𝔼x𝒟[ϕ(x)f(x)] up to an additive error of τ, where ϕ:𝒳[0,1] is any given test function. An SQ algorithm ε-weakly learns 𝒞 if it, by making queries to an SQ oracle, with probability 2/3 outputs a classifier f^:𝒳{0,1} that satisfies

Prx𝒟[f^(x)f(x)]12ε.

Intuitively, to solve refutation (Definition 9) using an SQ algorithm, in the “structure” case that some f𝒞 has a low error, the algorithm must query the SQ oracle using a test function that has a non-trivial correlation with f. Such a test function would then allow us to learn the unknown function up to an error that is better than random guessing.

Proposition 26.

Suppose that an SQ algorithm solves biased-(α,η)-refutation for concept class 𝒞 on distribution 𝒟 by making at most q queries with tolerance τ0. Then, assuming that α1/2Ω(τ), there is an SQ algorithm that Ω(τ)-weakly learns 𝒞 on 𝒟 by making at most q=O(q+1/τ) queries to an SQ oracle with tolerance τ=Ω(τ).

To prove the proposition, we will use the following simple fact: If a [0,1]-valued function has a non-trivial correlation with a sufficiently balanced boolean function, it can be rounded into a random binary classifier with a non-trivial accuracy in expectation. (If the boolean function is far from balanced, it can be easily learned by a constant function.)

Lemma 27.

The following holds for every δ0, distribution 𝒟 over 𝒳, and binary function f:𝒳{0,1} with mean p𝔼x𝒟[f(x)][1/2γ,1/2+γ]: Suppose that function ϕ:𝒳[0,1] satisfies 𝔼x𝒟[ϕ(x)f(x)]p𝔼x𝒟[ϕ(x)]δ. Then, for the random function ϕ~:𝒳{0,1} obtained from sampling each ϕ~(x) from 𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(ϕ(x)) independently, we have

𝔼ϕ~[Prx𝒟[ϕ~(x)f(x)]]12(2δ3γ).

Similarly, if 𝔼x𝒟[ϕ(x)f(x)]p𝔼x𝒟[ϕ(x)]δ, we have

𝔼ϕ~[Prx𝒟[1ϕ~(x)f(x)]]12(2δ3γ).
Proof.

It suffices to prove the first part; the second part follows by symmetry. By the identity 𝟙[b1b2]=b1+b22b1b2 for b1,b2{0,1}, it holds for every possible realization of ϕ~:𝒳{0,1} that

Prx𝒟[ϕ~(x)f(x)]=𝔼x𝒟[ϕ~(x)+f(x)2ϕ~(x)f(x)].

Taking an expectation over the randomness of ϕ~ shows that

𝔼ϕ~[Prx𝒟[ϕ~(x)f(x)]] =𝔼ϕ~[𝔼x𝒟[ϕ~(x)+f(x)2ϕ~(x)f(x)]]
=𝔼x𝒟[ϕ(x)+f(x)2ϕ(x)f(x)]
=𝔼x𝒟[ϕ(x)]+p2𝔼x𝒟[ϕ(x)f(x)]
p+𝔼x𝒟[ϕ(x)]2(p𝔼x𝒟[ϕ(x)]+δ)
=p+(12p)𝔼x𝒟[ϕ(x)]2δ,

where the fourth step applies the assumption that 𝔼x𝒟[ϕ(x)f(x)]p𝔼x𝒟[ϕ(x)]δ. Since 12p[2γ,2γ] and 𝔼x𝒟[ϕ(x)][0,1], the second term above is at most 2γ. It follows that the expected error of ϕ~ is at most p+2γ2δ(12+γ)+2γ2δ=12(2δ3γ).

Proof of Proposition 26.

Let 𝒜 denote the hypothetical SQ algorithm that solves biased-(α,η)-refutation for 𝒞. Let ε,τ=Θ(τ) be sufficiently small such that: (1) α1/2(ε+2τ); (2) τ4ε+22τ. We construct an SQ algorithm 𝒜 that weakly learns 𝒞 by simulating 𝒜 on the distribution of (x,f(x)) where x𝒟 and f𝒞 is the unknown target function in the weak learning instance:

  • Step 1: Query the SQ oracle (for the weak learning instance) to estimate the value of p𝔼x𝒟[f(x)] using the constant function ϕ(x)1. Let p^[pτ,p+τ] be the output of the oracle. If p^+τ1/2ε, output the constant function 0 and terminate. If p^τ1/2+ε, output the constant function 1 and terminate.

  • Step 2: Simulate the refutation algorithm 𝒜. Whenever 𝒜 tries to query the SQ oracle (for the refutation instance) with test function ϕ:𝒳×{0,1}[0,1], consider the function Δ:𝒳[1,1] defined as Δ(x)ϕ(x,1)ϕ(x,0) and Δ:𝒳[0,1] defined as Δ(x)Δ(x)+12.

  • Step 3: Query the SQ oracle (for weak learning) with test function Δ to estimate μ𝔼x𝒟[Δ(x)f(x)]. Let μ^[μτ,μ+τ] denote the output of the SQ oracle. Check whether it holds that

    |μ^p^𝔼x𝒟[Δ(x)]|τ22τ.

    If so, we compute 𝔼x𝒟[ϕ(x,0)+pΔ(x)], return the result to the refutation algorithm 𝒜, and continue the simulation by going back to Step 2. Otherwise, go to Step 4.

  • Step 4: We apply Lemma 27 to Δ and obtain a randomized boolean function ϕ~ from either Δ or 1Δ. We query the SQ oracle to obtain an estimate ε^ of

    Prx𝒟[ϕ~(x)f(x)]=𝔼x𝒟[ϕ~(x)]+𝔼x𝒟[f(x)]2𝔼x𝒟[ϕ~(x)f(x)].

    If ε^1/2ε3τ, we return the function ϕ~. Otherwise, repeat this step.

If 𝒜 outputs a constant classifier in the first step, the output clearly has an error 1/2ε. Thus, we may focus on the case that |p^1/2|ε+τ. Since |p^p|τ, we must have |p1/2|γε+2τ in this case. The rest of the proof proceeds in the following three steps:

  • If Δ has a low correlation with f (i.e., |μ^p^𝔼x𝒟[Δ(x)]|τ4τ holds in Step 3 of 𝒜), the answer 𝔼x𝒟[ϕ(x,0)+pΔ(x)] that we return to 𝒜 is a valid answer for an SQ oracle with tolerance τ.

  • If Δ has a high correlation with f (i.e., |μ^p^𝔼x𝒟[Δ(x)]|>τ4τ), we will find a good ϕ~ without repeating Step 4 too many times.

  • If Δ never has a high correlation with f, the execution of 𝒜 will be indistinguishable from that in the “noise” case of the refutation instance. Therefore, a high-correlation Δ must be found with a good probability.

Low correlation gives accurate answers.

Suppose that, for some test function ϕ:𝒳×{0,1}[0,1] chosen by 𝒜 and the corresponding Δ, it holds in Step 3 that

|μ^p^𝔼x𝒟[Δ(x)]|τ22τ.

Recall that μ^ is within an additive error of τ to μ=𝔼x𝒟[Δ(x)f(x)] and p^ is within error τ to p=𝔼x𝒟[f(x)]. We have

|𝔼x𝒟[Δ(x)f(x)]p𝔼x𝒟[Δ(x)]||μ^p^𝔼x𝒟[Δ(x)]|+2ττ2.

Then, the difference between the correct answer,

𝔼x𝒟[ϕ(x,f(x))]=𝔼x𝒟[ϕ(x,0)+(ϕ(x,1)ϕ(x,0))f(x)]=𝔼x𝒟[ϕ(x,0)+Δ(x)f(x)],

and the answer 𝔼x𝒟[ϕ(x,0)+pΔ(x)] returned by 𝒜 is exactly

|𝔼x𝒟[Δ(x)f(x)]p𝔼x𝒟[Δ(x)]|=2|𝔼x𝒟[Δ(x)f(x)]p𝔼x𝒟[Δ(x)]|τ.

In other words, 𝒜 simulates a valid SQ oracle with tolerance τ when the correlation is low.

High correlation gives good ϕ~.

Now, suppose that |μ^p^𝔼x𝒟[Δ(x)]|>τ22τ holds in Step 3. Again, since μ^μ=𝔼x𝒟[Δ(x)f(x)] and p^p=𝔼x𝒟[f(x)] hold up to error τ, we have

|𝔼x𝒟[Δ(x)f(x)]p𝔼x𝒟[Δ(x)]||μ^p^𝔼x𝒟[Δ(x)]|2τ>τ24τ.

By the assumption that τ4ε+22τ, the above implies |𝔼x𝒟[Δ(x)f(x)]p𝔼x𝒟[Δ(x)]|δ2ε+7τ. Recall that we assumed p[1/2γ,1/2+γ] for γ=ε+2τ. Applying Lemma 27 to Δ shows that Δ can be rounded to a random function ϕ~:𝒳{0,1} with an expected error of at most

12(2δ3γ)=12(ε+8τ).

By Markov’s inequality, the probability that ϕ~ has an error 1/2(ε+6τ) is at least

11/2(ε+8τ)1/2(ε+6τ)=2τ1/2(ε+6τ)=Ω(τ).

Note that in Step 4, ε^ is within an additive error of 3τ to the actual error of ϕ~. Then, if ϕ~ has an error 1/2(ε+6τ), we would have

ε^Prx𝒟[ϕ~(x)f(x)]+3τ12ε3τ,

and algorithm 𝒜 would terminate. Therefore, whenever Step 4 is entered, at most O(1/τ) repetitions are needed in expectation. This shows that 𝒜 makes at most O(q+1/τ) SQs in expectation.

The probability of making high-correlation queries.

Now, we argue that the hypothetical refutation algorithm 𝒜 must make the aforementioned high-correlation query. To this end, we couple the execution of 𝒜 simulated by our weak learner 𝒜 (the simulated copy) with a slight variant of it (the imaginary copy): In the imaginary copy, we never check the correlation or go to Step 4; we always return 𝔼x𝒟[ϕ(x,0)+pΔ(x)] for every query ϕ:𝒳×{0,1}[0,1] that 𝒜 makes.

Note that the imaginary copy of 𝒜 exactly runs on a refutation instance in which the distribution is the product distribution of 𝒟 and 𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(p), i.e., the label y is a p-biased coin flip regardless of x. Since we assumed that |p1/2|γ=ε+2τ1/2α, we have p[α,1α]. Then, by the soundness guarantee of 𝒜, the imaginary copy 𝒜 must output 𝗇𝗈𝗂𝗌𝖾 with probability at least 2/3.

In contrast, the simulated copy of 𝒜 runs on an instance in which the labels are consistent with f𝒞. Then, the completeness of 𝒜 ensures that the simulated copy outputs 𝗌𝗍𝗋𝗎𝖼𝗍𝗎𝗋𝖾 with probability 2/3. Therefore, in the coupling between the simulated and imaginary copies, they must diverge with probability at least 1/3. The only way for the two copies to disagree is that, during the execution of the simulated copy, the algorithm makes an SQ with a high-correlation test function. Therefore, algorithm 𝒜 outputs a classifier with error 1/2ε with probability at least 1/3. Since 𝒜 never returns an incorrect answer (i.e., a classifier with error >1/2ε), repeating 𝒜 a constant number of times would boost the success probability to 2/3, thereby giving an ε-weak learner for class 𝒞 on distribution 𝒟.

4.5 Put Everything Together

So far, we have established a reduction from SQ weak learning to SQ refutation (Proposition 26), and one from SQ refutation to MQ-SQ testable learning (Proposition 24). We now combine them and prove a lower bound for MQ-SQ testable learning in terms of the statistical query dimension (SQ dimension) of the concept class introduced by Blum, Furst, Jackson, Kearns, Mansour, and Rudich [2].

Definition 28 (SQ dimension).

The SQ dimension of a concept class 𝒞{f:𝒳{0,1}} on a distribution 𝒟 over 𝒳 is the maximum number d such that there exists f1,f2,,fd𝒞 that satisfy

Prx𝒟[fi(x)fj(x)][11/d32,1+1/d32]

for all ij[d].

Theorem 29.

The following holds for a sufficiently small constant ε0>0: Suppose that 𝒞 is a concept class with SQ dimension d on distribution 𝒟 and 𝒟22O(1/poly(d)). Let c1 and ε,δε0. Then, no MQ-SQ algorithm can (c,ε,δ)-testably learn 𝒞 on distribution 𝒟 by making qo(poly(d)) queries to an MQ-SQ oracle with tolerance τω(1/poly(d)) such that 𝒟22O(1/poly(d)) holds for all queries of types I and II.

For concreteness, consider the hypercube 𝒳={0,1}n and the uniform distribution over it. It is well-known that the family of parity functions has an SQ dimension of 2n. Furthermore, for every kn1Ω(1), both k-juntas and depth-k decision trees have SQ dimensions of nΩ(k), as they both contain all parity functions of k variables. Thus, Theorem 29 gives 2Ω(n) or nΩ(k) lower bounds against MQ-SQ testable learners for these classes. Regarding the constraint on 𝒟, if 𝒟 is the uniform distribution over a d-dimensional subcube, we need 2d=𝒟22O(1/poly(d)), so ensuring d=Ω(n) would suffice. (Recall from Section 4.2 that this condition holds when implementing many existing query-based learners as MQ-SQ algorithms.)

Proof.

Suppose towards a contradiction that such an MQ-SQ algorithm exists. Ignoring all the other parameters for now, Proposition 24 shows that there is an algorithm that refutes 𝒞 by making q=q+O(1) queries to an SQ oracle with tolerance τ=τ/4. Applying Proposition 26 then gives an algorithm that Ω(τ)-weakly learns parity functions using O(q+1/τ)=O(q+1/τ) queries to an SQ oracle with tolerance Θ(τ). By [2, Theorem 12], to Ω(1/d3)-weakly learn concept class 𝒞 using an SQ oracle with tolerance Ω(d1/3), at least Ω(d1/3) queries are needed. Since q,1/τ=o(poly(d)), we obtain a contradiction.

Now, we set the parameters in Propositions 24 and 26 carefully. We may assume that τε0/c without loss of generality; a smaller tolerance makes the SQ oracle (and thus the lower bound result) stronger. Since ε,δ,cτε0 are sufficiently small and τ=τ/4, we can choose α=0.1 and η=0 in Proposition 24 such that the first condition α>cη+ε+(c+4)τ+6τ is satisfied. Furthermore, the condition that α1/2Ω(τ) in Proposition 26 would also hold. Recall that the failure probability increases from δε0 to δ+O(q)eΩ(τ2B) in Proposition 24. Setting B=Θ((logq)/τ2)=o(poly(d)) suffices to control the new failure probability by 2ε0.

It remains to check the second and the third conditions of Proposition 24. For the third, we need the MQ-SQ testable learner to restrict the distribution 𝒟 in its queries such that 𝒟221/B. This is ensured by 𝒟22O(1/poly(d)) and Bo(poly(d)). Finally, to check the second condition that Bp2(1p)2/𝒟x22, we note that 𝒟x22=𝒟22O(1/poly(d)). Furthermore, by the way in which the reduction works in the proof of Proposition 26, whenever the refutation algorithm is called, the labels are nearly balanced, i.e., p2(1p)2=Ω(1). Therefore, the second condition is always satisfied by our choice of B=o(poly(d)). This completes the proof.

References

  • [1] Guy Blanc, Jane Lange, Mingda Qiao, and Li-Yang Tan. Properly learning decision trees in almost polynomial time. Journal of the ACM, 69(6):1–19, 2022. doi:10.1145/3561047.
  • [2] Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, and Steven Rudich. Weakly learning dnf and characterizing statistical query learning using fourier analysis. In Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, pages 253–262, 1994. doi:10.1145/195058.195147.
  • [3] Avrim L Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artificial intelligence, 97(1-2):245–271, 1997. doi:10.1016/S0004-3702(97)00063-5.
  • [4] Nader H Bshouty and Vitaly Feldman. On using extended statistical queries to avoid membership queries. Journal of Machine Learning Research, 2(Feb):359–395, 2002. URL: https://jmlr.org/papers/v2/bshouty02a.html.
  • [5] Amit Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 105–117, 2016. doi:10.1145/2897518.2897520.
  • [6] Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. From average case complexity to improper learning complexity. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 441–448, 2014. doi:10.1145/2591796.2591820.
  • [7] Amit Daniely and Shai Shalev-Shwartz. Complexity theoretic limitations on learning dnf’s. In Conference on Learning Theory, pages 815–830. PMLR, 2016. URL: http://proceedings.mlr.press/v49/daniely16.html.
  • [8] Ilias Diakonikolas, Daniel Kane, Vasilis Kontonis, Sihan Liu, and Nikos Zarifis. Efficient testable learning of halfspaces with adversarial label noise. Advances in Neural Information Processing Systems, 36:39470–39490, 2023.
  • [9] Ariel Elbaz, Homin K Lee, Rocco A Servedio, and Andrew Wan. Separating models of learning from correlated and uncorrelated data. The Journal of Machine Learning Research, 8:277–290, 2007. URL: https://jmlr.org/papers/v8/elbaz07a.html.
  • [10] Vitaly Feldman. On the power of membership queries in agnostic learning. The Journal of Machine Learning Research, 10:163–182, 2009. doi:10.5555/1577069.1577076.
  • [11] Vitaly Feldman and Shrenik Shah. Separating models of learning with faulty teachers. Theoretical computer science, 410(19):1903–1912, 2009. doi:10.1016/J.TCS.2009.01.017.
  • [12] Aravind Gollakota, Adam Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Tester-learners for halfspaces: Universal algorithms. Advances in Neural Information Processing Systems, 36:10145–10169, 2023.
  • [13] Aravind Gollakota, Adam R. Klivans, and Pravesh K. Kothari. A moment-matching approach to testable learning and a new characterization of rademacher complexity. In Symposium on Theory of Computing (STOC), pages 1657–1670, 2023. doi:10.1145/3564246.3585206.
  • [14] Parikshit Gopalan, Adam Tauman Kalai, and Adam R Klivans. Agnostically learning decision trees. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 527–536, 2008. doi:10.1145/1374376.1374451.
  • [15] Adam Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Learning intersections of halfspaces with distribution shift: Improved algorithms and sq lower bounds. In Shipra Agrawal and Aaron Roth, editors, Proceedings of Thirty Seventh Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pages 2944–2978. PMLR, 30 June–03 July 2024. URL: https://proceedings.mlr.press/v247/klivans24b.html.
  • [16] Adam Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Testable learning with distribution shift. In The Thirty Seventh Annual Conference on Learning Theory, pages 2887–2943. PMLR, 2024. URL: https://proceedings.mlr.press/v247/klivans24a.html.
  • [17] Pravesh K. Kothari and Roi Livni. Improper Learning by Refuting. In Innovations in Theoretical Computer Science (ITCS), pages 55:1–55:10, 2018. doi:10.4230/LIPIcs.ITCS.2018.55.
  • [18] Eyal Kushilevitz and Yishay Mansour. Learning decision trees using the fourier spectrum. In Symposium on Theory of Computing (STOC), pages 455–464, 1991.
  • [19] Cassandra Marcussen, Ronitt Rubinfeld, and Madhu Sudan. Quality control in sublinear time: a case study via random graphs. arXiv preprint arXiv:2508.16531, 2025. doi:10.48550/arXiv.2508.16531.
  • [20] Elchanan Mossel, Ryan O’Donnell, and Rocco A Servedio. Learning functions of k relevant variables. Journal of Computer and System Sciences, 69(3):421–434, 2004. doi:10.1016/J.JCSS.2004.04.002.
  • [21] Ronitt Rubinfeld and Arsen Vasilyan. Testing distributional assumptions of learning algorithms. In Symposium on Theory of Computing (STOC), pages 1643–1656, 2023. doi:10.1145/3564246.3585117.
  • [22] Lucas Slot, Stefan Tiegel, and Manuel Wiedmer. Testably learning polynomial threshold functions. Advances in Neural Information Processing Systems, 37:3781–3831, 2024.
  • [23] Salil Vadhan. On learning vs. refutation. In Conference on Learning Theory (COLT), pages 1835–1848, 2017. URL: http://proceedings.mlr.press/v65/vadhan17a.html.