Limitations of Membership Queries in Testable Learning

Lange, Jane; Qiao, Mingda

doi:10.4230/LIPIcs.ITCS.2026.91

Jane Lange

Massachusetts Institute of Technology, Cambridge, MA, USA

Mingda Qiao

University of Massachusetts Amherst, MA, USA

Abstract

Membership queries (MQ) often yield speedups for learning tasks, particularly in the distribution-specific setting. We show that in the testable learning model of Rubinfeld and Vasilyan [21], membership queries cannot decrease the time complexity of testable learning algorithms beyond the complexity of sample-only distribution-specific learning. In the testable learning model, the learner must output a hypothesis whenever the data distribution satisfies a desired property, and if it outputs a hypothesis, the hypothesis must be near-optimal.

We give a general reduction from sample-based refutation of boolean concept classes, as presented in [23, 17], to testable learning with queries (TL-Q). This yields lower bounds for TL-Q via the reduction from learning to refutation given in [17]. The result is that, relative to a concept class and a distribution family, no $m$ -sample TL-Q algorithm can be super-polynomially more time-efficient than the best $m$ -sample PAC learner.

Finally, we define a class of “statistical” MQ algorithms that encompasses many known distribution-specific MQ learners, such as those based on influence estimation or subcube-conditional statistical queries. We show that TL-Q algorithms in this class imply efficient statistical-query refutation and learning algorithms. Thus, combined with known SQ dimension lower bounds, our results imply that these efficient membership query learners cannot be made testable.

Keywords and phrases:

Testable learning, PAC learning

Funding:

Jane Lange: Supported in part by NSF Awards CCF-2006664, DMS-2022448, CCF-2310818, and Big George Fellowship

2012 ACM Subject Classification:

Theory of computation $\rightarrow$ Query learning

Editor:

Shubhangi Saraf

Series and Publisher:

Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik

1 Introduction

In distribution-specific PAC learning, a learning algorithm is only required to output a competitive hypothesis when the data distribution satisfies some property. Distribution-specific PAC often allows for much more efficient learning than distribution-free PAC, but with the following shortcoming: if the distribution does not satisfy the property, then the behavior of the learner is completely undefined.

A testable agnostic learning algorithm [21] alleviates this shortcoming by combining a distribution-specific learner with a tester for the desired property. It may output a hypothesis or it may reject the distribution and output $\bot$ . The testable learner is run on i.i.d. samples from an unknown distribution $\mathcal{D}$ over $\mathcal{X}\times\{0,1\}$ , and has the following behavior:

$\blacksquare$

Soundness: If the learner outputs a hypothesis $h$ , then with high probability

$\Pr_{(x,y)\sim\mathcal{D}}[h(x)\neq y]\leq\mathsf{opt}+\varepsilon,$

where $\mathsf{opt}$ is the error of the best concept in the concept class. A semi-agnostic variant of this condition, with $\mathsf{opt}+\varepsilon$ replaced by $O(\mathsf{opt})+\varepsilon$ , has also been considered.
$\blacksquare$

Completeness: If the distribution $\mathcal{D}$ has the desired property, then the learner outputs a hypothesis (instead of $\bot$ ) with high probability.

There is a significant body of work studying the sample and time complexities of testable learning for various concept classes and distribution properties in both the agnostic ( $\mathsf{opt}+\varepsilon$ ) and semi-agnostic ( $O(\mathsf{opt})+\varepsilon$ ) settings [21, 13, 8, 12, 22]. There are efficient algorithms for testable learning in cases where distribution-free learning cannot be done efficiently.

1.1 The Power of Membership Queries in Agnostic Learning

One might hope to testably learn more efficiently by strengthening the learner’s access to the data distribution. In the membership query (MQ) model, we think of the data as being drawn from a distribution $\mathcal{D}_{x}$ over $\mathcal{X}$ and labeled by some unknown function $f:\mathcal{X}\to\{0,1\}$ . The learner gets i.i.d. samples from $\mathcal{D}_{x}$ and may also query any point $x\in\mathcal{X}$ and receive its label $f(x)$ .

The work of [10] shows that under the standard cryptographic assumption of one-way functions, membership queries can speed up agnostic learning in the distribution-specific setting, but not in the distribution-free setting. In the distribution-free setting, every concept class that can be agnostically learned with MQs can be agnostically learned with just random examples. In contrast, in the uniform-distribution-specific setting, there exists a concept class that can be learned strictly more efficiently with MQs.

Separations exist for more “natural” concept classes under the stronger assumption that learning sparse parities with noise (LSPN) is hard. For example, over the uniform distribution on $\{0,1\}^{n}$ , $k$ -juntas can be learned in $\mathop{\mathrm{poly}}(n)\cdot 2^{O(k)}$ time with membership queries [3, 20]. On the other hand, there is a statistical query lower bound of $n^{\Omega(k)}$ [4], and if LSPN is hard then one cannot hope to do better than this bound with random examples. Similarly, polynomial-size decision trees can be learned improperly in polynomial time [18, 14] and properly in $n^{O(\log\log n)}$ time [1] with membership queries, while there is an SQ lower bound of $n^{\Omega(\log n)}$ [2], and LSPN implies one cannot do better than this bound either.

1.2 Limitations of Membership Queries in Testable Learning

If membership queries do help in the distribution-specific setting but do not help in the distribution-free setting, one may then naturally wonder whether they ought to help in the testable setting as well. A testable learner with queries (TL-Q) has both sample access to the unknown data distribution and membership query access to the unknown function, and must satisfy the soundness and completeness guarantees of ordinary testable learning.

Question 1.

How much can membership queries speed up the task of testable learning?

Our results show that membership queries are quite weak in the TL-Q setting. Particularly, whenever agnostic learning with random examples is hard – as is believed to be the case for juntas and decision trees – testable learning is hard as well, even with queries.

Theorem 2 (Corollary 17, informal).

If a concept class $\mathcal{C}$ is agnostically testably learnable with queries in time $t$ over a distribution $\mathcal{D}$ , then it is agnostically learnable with random examples in time $\mathop{\mathrm{poly}}(t)$ over $\mathcal{D}$ as well.

Corollary 3.

If LSPN is hard, then no concept class containing $k$ -parities as a subset can be agnostically testably learned in $n^{o(k)}$ time over the uniform distribution, even with membership queries.

Furthermore, we show that SQ lower bounds rule out a large class of natural query-based learning algorithms. We define a class of “statistical” membership query (MQ-SQ) algorithms – those that use membership queries only to sample from particular distributions over the input domain. For example, algorithms that use MQs only to estimate influences or to make SQs over large subsets of $\{0,1\}^{n}$ are MQ-SQ algorithms (this includes the aforementioned uniform-distribution algorithms of [18, 14, 1]). We show that such algorithms cannot be “made testable” with respect to the uniform distribution without introducing non-statistical use of membership queries, due to the SQ lower bounds.

Theorem 4 (Theorem 29, informal).

If a concept class $\mathcal{C}$ is testably learnable in time $t$ over a distribution $\mathcal{D}$ by an MQ-SQ algorithm, then the SQ dimension of $\mathcal{C}$ with respect to $\mathcal{D}$ is at most $\mathop{\mathrm{poly}}(t)$ .

1.3 Technical Overview

Our reductions are through the intermediate task of refutation. Refutation, as presented in [23, 17], is the problem of distinguishing examples correlated with some function in the concept class from examples labeled uniformly at random. The work of [23] shows that in the distribution-free setting, refutation and realizable learning are polynomially equivalent, and the work of [17] shows an analogous statement in the distribution-specific agnostic setting. By giving an efficient reduction from distribution-specific refutation (without queries) to testable learning (with queries), we show that distribution-specific agnostic learning reduces to TL-Q as well.

As a warm-up, consider a special case of refutation where the labels are promised to either be completely random or exactly match some function in the class $\mathcal{C}$ . Let the distribution be uniform over $\{0,1\}^{n}$ . Assume for simplicity that all functions in $\mathcal{C}$ are balanced, i.e., $\operatorname*{\mathbb{E}}_{x\sim\{0,1\}^{n}}\left[f(x)\right]=1/2$ .

Definition 5 (Exact refutation over the uniform distribution, informal).

An exact refutation algorithm for a concept class $\mathcal{C}$ takes an $m$ -tuple $\{(x_{1},y_{1}),\ldots,(x_{m},y_{m})\}$ of examples where the $x$ ’s are drawn uniformly at random from $\{0,1\}^{n}$ . It outputs either $\mathsf{noise}$ or $\mathsf{structure}$ with the following guarantees:

$\blacksquare$

Completeness: If the examples are consistent with some $g\in\mathcal{C}$ , then

$\mathop{{\Pr}\/}[\mathcal{A}\text{ outputs }\mathsf{structure}]\geq 2/3.$
$\blacksquare$

Soundness: If the $y_{i}$ ’s are drawn i.i.d. from $\mathsf{Bernoulli}(1/2)$ , then

$\mathop{{\Pr}\/}[\mathcal{A}\text{ outputs }\mathsf{noise}]\geq 2/3.$

Suppose we want to implement exact refutation using a TL-Q algorithm. If we had any agnostic learning algorithm that did not require queries, the task would be trivially easy: split $\{(x_{1},y_{1}),\ldots,(x_{m},y_{m})\}$ into training and test sets, run the learner on the training set, estimate the error of the returned hypothesis on the test set, and output $\mathsf{structure}$ if the test error is, say, $\leq 1/10$ . If we are in the $\mathsf{structure}$ case, the error will be $\leq\varepsilon$ , and if we are in the $\mathsf{noise}$ case, with high probability the error will be close to 1/2.

With a TL-Q algorithm, however, we must also answer its membership queries, and we will answer them randomly. Specifically, we will draw a random function $f:\{0,1\}^{n}\to\{0,1\}$, and whenever the TL-Q algorithm wants to make a query, we will answer according to the $f$ we chose. We will filter both the training and test sets to just those points where $y=f(x)$. This means that we essentially sample from the distribution that is uniform over the portion of $\{0,1\}^{n}$ where the label $y$ agrees with $f(x)$.

As before, if the TL-Q algorithm produces a hypothesis, we will output $\mathsf{structure}$ if the error is $\leq 1/10$ and $\mathsf{noise}$ if the error is greater. But since TL-Q can also reject the instance and output $\bot$ , if it does so, we will output $\mathsf{structure}$ .

Notice that in the noise case, each $x_{i}$ is filtered out independently with probability 1/2; therefore the distribution of samples is uniform. By completeness of the TL-Q algorithm, it must then output a hypothesis, and with high probability the error will be close to 1/2. In the structure case, however, the distribution of $x_{i}$ ’s may be far from uniform, in which case the TL-Q algorithm may output $\bot$ . It may also output a hypothesis – but by soundness, the hypothesis must have error $\leq\varepsilon$ , since the samples come from a distribution such that $y=f(x)$ for every $x$ in its support.

Our reduction from refutation to TL-Q is basically a generalization of this strategy, adapted to handle unbalanced functions and accept functions that are close to, but not exactly, in $\mathcal{C}$ .
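The warm-up strategy above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's Algorithm 1: the names `exact_refute` and `erm_dictator` are hypothetical, the rejection symbol $\bot$ is modeled as `None`, and the toy learner is empirical risk minimization over single-coordinate "dictator" functions, which never needs to reject because it is a valid agnostic learner under any distribution.

```python
import random


def exact_refute(samples, tlq_learner, threshold=0.1, seed=0):
    """Warm-up reduction sketch: decide 'structure' vs 'noise' with a TL-Q learner.

    samples     -- list of (x, y) pairs with hashable x and y in {0, 1}
    tlq_learner -- hypothetical callable (train_points, query) -> hypothesis or None
    """
    rng = random.Random(seed)
    table = {}  # lazily drawn uniformly random function f

    def f(x):
        # Draw f(x) the first time it is needed, then answer consistently.
        if x not in table:
            table[x] = rng.randrange(2)
        return table[x]

    # Keep only the examples whose label agrees with the random function.
    kept = [x for x, y in samples if y == f(x)]
    half = len(kept) // 2
    train, test = kept[:half], kept[half:]

    h = tlq_learner(train, f)  # membership queries are answered by f
    if h is None:              # the learner rejected the distribution
        return "structure"
    err = sum(h(x) != f(x) for x in test) / max(len(test), 1)
    return "structure" if err <= threshold else "noise"


def erm_dictator(train, query):
    """Toy learner (assumption): pick the coordinate that best matches the labels."""
    n = len(train[0])
    best = min(range(n), key=lambda i: sum(x[i] != query(x) for x in train))
    return lambda x: x[best]
```

On examples labeled by a dictator, every kept point satisfies $y=f(x)$, so the learner recovers the coordinate and the test error is zero; on uniformly random labels, $f$ is independent of every dictator on fresh test points and the error stays near 1/2.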

1.3.1 An SQ-Preserving Reduction

We observe that some membership query algorithms, such as those of [18, 14, 1], use membership queries only to estimate statistical properties of the unknown function. We roughly categorize MQ-SQ queries as follows (formalized in Definition 20):

$\blacksquare$

Standard SQs: queries of the form $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)\right]$ or $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)f(x)\right]$ for a test function $\phi$ . These don’t require membership queries to implement.
$\blacksquare$

Pair SQs: queries of the form $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)f(x)f(\pi(x))\right]$ , where $\pi$ is a permutation of the domain without fixed points. This generalizes influence estimation.
$\blacksquare$

Customized distribution SQs: Any of the above queries, where the expectation is taken over a specific (and sufficiently spread-out) distribution $\mathcal{D}^{\star}$ instead of the unknown distribution $\mathcal{D}$ . This generalizes making SQs over restrictions of $\{0,1\}^{n}$ .

Our goal is to simulate an MQ-SQ testable learner by making only SQs to the unknown distribution $\mathcal{D}^{\mathrm{refut.}}$ (over $\mathcal{X}\times\{0,1\}$ ) in the refutation instance. As in the non-SQ setting, we choose a random function to be the target function and answer queries according to that random function.

For example, the customized-distribution query $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}^{\star}}\left[f(x)\phi(x)\right]$ is easy to simulate with just one SQ to $\mathcal{D}^{\mathrm{refut.}}$: simply estimate the mean $p\coloneqq\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[y\right]$, and answer the MQ-SQ with $p\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}^{\star}}\left[\phi(x)\right]$ (sampling from $\mathcal{D}^{\star}$ requires neither samples nor membership queries, since it is the customized distribution). The value of this MQ-SQ concentrates around this estimate, as each $f(x)$ is an independent random variable with mean $p$.

Pair SQs are handled similarly, though in this case the random variables $f(x)f(\pi(x))$ are not independent. However, since the dependence graph of these variables decomposes into cycles, we can partition the graph into large independent sets and prove concentration of the variables within the independent sets. Thus, for example, the MQ-SQ $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)f(x)f(\pi(x))\right]$ can be answered with $p\cdot\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[\phi(x)y\right]$, which is an SQ to $\mathcal{D}^{\mathrm{refut.}}$.
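The concentration behind both simulated answers can be checked numerically. The sketch below is a toy experiment rather than the paper's construction: it assumes a uniform distribution over `n_points` integers, a hypothetical test function `phi`, the cyclic shift as the fixed-point-free permutation $\pi$, and labels equal to $f(x)$ (as they are on filtered samples). The function name `simulate_mq_sq` is made up for illustration.

```python
import random


def simulate_mq_sq(n_points=20000, p=0.3, seed=0):
    """Compare true MQ-SQ values for a p-biased random f with their simulated answers."""
    rng = random.Random(seed)
    # p-biased random function f on the domain {0, ..., n_points - 1}.
    f = [1 if rng.random() < p else 0 for _ in range(n_points)]
    # Hypothetical bounded test function phi.
    phi = [1.0 if x % 3 == 0 else -1.0 for x in range(n_points)]

    def pi(x):
        # Cyclic shift: a permutation of the domain without fixed points.
        return (x + 1) % n_points

    # Customized-distribution query E[f(x) phi(x)] versus the answer p * E[phi(x)].
    mean_phi = sum(phi) / n_points
    custom_true = sum(f[x] * phi[x] for x in range(n_points)) / n_points
    custom_answer = p * mean_phi

    # Pair query E[phi(x) f(x) f(pi(x))] versus the answer p * E[phi(x) y], with y = f(x).
    pair_true = sum(phi[x] * f[x] * f[pi(x)] for x in range(n_points)) / n_points
    pair_answer = p * custom_true
    return custom_true, custom_answer, pair_true, pair_answer
```

In this experiment each true value should land within a few multiples of $1/\sqrt{\texttt{n\_points}}$ of its simulated answer, which is the kind of gap an SQ tolerance can absorb.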

1.4 Related Work and Discussion

The power of membership queries

It is well known that in the realizable setting, PAC learners with membership queries are strictly stronger than PAC learners without them under standard cryptographic assumptions [9, 11]. The work of [10] establishes an equivalence between PAC with random examples and PAC with membership queries in the distribution-free agnostic setting, and a separation in the distribution-specific agnostic setting.

Testable learning and friends

There are many papers that address the computational and sample complexities of testably learning various natural concept classes; these works are in the standard agnostic testable learning model. Some examples, but certainly not all, are the works of [21, 13, 8, 12, 22]. Some works addressing related problems include [16, 15, 19].

The work of [13] characterizes the sample complexity of testable learning by the Rademacher complexity, which is especially relevant to this work in light of the result of [17], which establishes refutation complexity as an analogue of Rademacher complexity for the computationally bounded setting.

Learning and refutation

A connection between learning and refutation was first introduced in [6] as a means of proving computational lower bounds for learning based on the assumption that refuting random CSPs is hard, and other works including those of [7, 5] use this method to give conditional lower bounds for various learning problems. Of particular relevance to this work are [23, 17], which give polynomial equivalences between PAC learning and refutation.

1.4.1 Directions for Future Work

While our work reduces ordinary sample-based PAC learning to query-based testable learning, we do not resolve the strongest, most natural question on the power of membership queries: whether sample-based testable learning reduces to query-based testable learning. This would be a strictly stronger lower bound for TL-Q than anything obtainable through refutation, as there are function classes for which refutation is known to be easier than testable learning with samples – for example, the class of monotone functions. It is proven in [21] that testably learning monotone functions on the uniform distribution requires $2^{\Omega(n)}$ samples. On the other hand, agnostic learning (and therefore refutation) can be done in $2^{\tilde{O}(\sqrt{n})}$ time and samples.

We leave this possible stronger lower bound as an open question for future work.

Conjecture 6.

If a concept class $\mathcal{C}$ is agnostically testably learnable with queries in time $t$ over a distribution $\mathcal{D}$ , then it is agnostically testably learnable with samples in time $\mathop{\mathrm{poly}}(t)$ over $\mathcal{D}$ as well.

We also remark that in the semi-agnostic setting, our method relates semi-agnostic TL-Q to weak agnostic learning. It is an open question for future work to resolve the connection between semi-agnostic TL-Q and semi-agnostic learning as well.

2 Preliminaries

2.1 Distances and Errors

Definition 7 (Distance of functions and distance to a concept class).

Relative to a distribution $\mathcal{D}$ over $\mathcal{X}\times\{0,1\}$ with $\mathcal{X}$ -marginal $\mathcal{D}_{x}$ , we denote

$\mathrm{dist}_{\mathcal{D}_{x}}(f,g)=\mathop{{\Pr}\/}_{x\sim\mathcal{D}_{x}}[f(x)\neq g(x)]$

$\mathrm{err}_{\mathcal{D}}(f)=\mathop{{\Pr}\/}_{(x,y)\sim\mathcal{D}}[f(x)\neq y].$

We also use the following notation to denote the classification error of the most accurate concept in a class, which we often refer to as $\mathsf{opt}$ :

$\mathrm{dist}_{\mathcal{D}_{x}}(f,\mathcal{C})=\inf_{g\in\mathcal{C}}\mathrm{dist}_{\mathcal{D}_{x}}(f,g).$

$\mathrm{err}_{\mathcal{D}}(\mathcal{C})=\inf_{g\in\mathcal{C}}\mathrm{err}_{\mathcal{D}}(g).$

2.2 Refutation and Learning

We state a definition of refutation similar to the definition presented in [17]. We have modified it to use classification error rather than correlation, for ease of use in our $\{0,1\}$ -labeled setting ([17] uses $\{-1,1\}$ labels).

Definition 8 ( $\eta$ -refutation).

Let $\mathcal{C}\subseteq\{f:\mathcal{X}\to\{0,1\}\}$ be a concept class over a finite input domain $\mathcal{X}$ , and let $\mathcal{F}$ be a family of distributions on $\mathcal{X}$ . An $\eta$ -refutation algorithm $\mathcal{A}$ for $\mathcal{C}$ on $\mathcal{F}$ with $m$ samples is an algorithm that takes an $m$ -tuple of labeled examples $\{(x_{1},y_{1}),\ldots,(x_{m},y_{m})\}$ and outputs either $\mathsf{noise}$ or $\mathsf{structure}$ . If the examples are i.i.d. from a distribution $\mathcal{D}$ over $\mathcal{X}\times\{0,1\}$ such that the marginal on $\mathcal{X}$ is some $\mathcal{D}_{x}\in\mathcal{F}$ , then the following guarantees hold:

$\blacksquare$

Completeness: If there exists $g\in\mathcal{C}$ such that $\mathrm{err}_{\mathcal{D}}(g)\leq\eta$ , then

$\mathop{{\Pr}\/}_{\begin{subarray}{c}\{(x_{i},y_{i})\}\sim\mathcal{D}\\ \text{internal randomness of }\mathcal{A}\end{subarray}}[\mathcal{A}\text{ outputs }\mathsf{structure}]\geq 2/3.$
$\blacksquare$

Soundness: If the $y_{i}$ ’s are drawn i.i.d. from $\mathsf{Bernoulli}(1/2)$ , then

$\mathop{{\Pr}\/}_{\begin{subarray}{c}\{(x_{i},y_{i})\}\sim\mathcal{D}\\ \text{internal randomness of }\mathcal{A}\end{subarray}}[\mathcal{A}\text{ outputs }\mathsf{noise}]\geq 2/3.$

We also define a similar but stronger task:

Definition 9 (Biased $(\alpha,\eta)$ -refutation).

A biased- $(\alpha,\eta)$ -refutation algorithm is as above except the soundness condition is the following:

$\blacksquare$

Soundness: For all $p\in[\alpha,1-\alpha]$ , if the $y_{i}$ ’s are drawn i.i.d. from $\mathsf{Bernoulli}(p)$ , then

$\mathop{{\Pr}\/}_{\begin{subarray}{c}\{(x_{i},y_{i})\}\sim\mathcal{D}\\ \text{internal randomness of }\mathcal{A}\end{subarray}}[\mathcal{A}\text{ outputs }\mathsf{noise}]\geq 2/3.$

Here we state definitions and facts used in [17]’s reduction from refutation to agnostic learning. Again we modify these statements to use classification error rather than correlation.

Definition 10 (Weak agnostic learning).

A $(\gamma,\alpha)$ -weak agnostic learner for the concept class $\mathcal{C}$ over the distribution $\mathcal{D}$ outputs a hypothesis $h$ satisfying the following:

$\mathrm{err}_{\mathcal{D}}(h)\leq\frac{1+\alpha-\gamma}{2}+\gamma\,\mathrm{err}_{\mathcal{D}}(\mathcal{C}).$

Lemma 11 (Learning by refutation: Lemma 6 of [17]).

Suppose there is an $\eta$-refutation algorithm for the class $\mathcal{C}$ over distribution $\mathcal{D}$ running in $T(n)$ time with $m$ samples. Then there is an algorithm that runs in time $T(n)\cdot\frac{m^{2}}{\varepsilon^{2}}$ and uses $O(\frac{m^{3}}{\varepsilon^{2}})$ samples to agnostically learn $\mathcal{C}$ on $\mathcal{D}$ with excess error $1-2\eta+\varepsilon$.

2.3 Testable Learning with Queries

We now define semi-agnostic testable learning with queries.

Definition 12 (Testable learning with queries, or TL-Q).

A concept class $\mathcal{C}$ over the input domain $\mathcal{X}$ is $(c,\varepsilon,\delta)$ -PAC-testably-learnable with $q$ queries, $m$ samples, and $t$ time, on a set $\mathcal{F}$ of distributions over $\mathcal{X}$ , if there is a $t$ -time algorithm $\mathcal{A}$ that takes $m$ samples from an unknown distribution $\mathcal{D}\in\mathcal{F}$ and $q$ membership queries to an unknown function $f^{\star}$ , and outputs $h\in\mathcal{C}\cup\{\bot\}$ such that:

$\blacksquare$

Soundness: If $h\neq\bot$ , then

$\Pr[\mathrm{dist}_{\mathcal{D}}(h,f^{\star})\geq c\cdot\mathsf{opt}+\varepsilon]\leq\delta.$
$\blacksquare$

Completeness: If $\mathcal{D}\in\mathcal{F}$ , then

$\Pr[h=\bot]\leq\delta.$

3 Refutation, Learning, and Testable Learning with Queries

3.1 A General Reduction from Refutation to TL-Q

In this section, we show that if a class is efficiently testably-learnable, then it is efficiently refutable as well, with polynomial dependence on the sample, time, and query complexity of the testable learner.

In fact, our Algorithm 1 solves the harder problem of biased refutation (Definition 9), though the distinction between the two will not be relevant until Section 3.3. A biased refutation algorithm can always be used to solve the standard unbiased refutation problem, as the soundness guarantee for biased refutation must always hold when $p=1/2$ . To avoid confusion, we let $\mathcal{D}^{\mathrm{refut.}}$ denote the distribution of labeled pairs $(x,y)\in\mathcal{X}\times\{0,1\}$ in the refutation instance, while $\mathcal{D}$ denotes the unknown marginal distribution in a TL-Q instance.

Algorithm 1 $\textsc{BiasedRefutation}(\mathsf{samples},\eta,\varepsilon,m,q,c)$.

The main result of this section is the following:

Theorem 13.

Let $\mathcal{C}$ be $(c,\varepsilon,\tfrac{1}{10})$ PAC-testably-learnable with $m$ samples, $q$ queries, and $t$ time, on a distribution family $\mathcal{F}$ satisfying

$m+q/\varepsilon^{2}\ll\frac{1}{\sup_{\mathcal{D}_{x}\in\mathcal{F}}(\|\mathcal{D}_{x}\|_{2})}.$

Then for any $\varepsilon$ satisfying $\varepsilon^{2}\geq ck\cdot\sup_{\mathcal{D}_{x}\in\mathcal{F}}(\|\mathcal{D}_{x}\|_{2})$ for sufficiently large constant $k$, and any $\eta<\frac{1/2-4\varepsilon}{c}$, $\mathcal{C}$ is $(c\eta+4\varepsilon,\eta)$-refutable over all members of $\mathcal{F}$ with $m^{\prime}$ samples and $t^{\prime}$ time, where

$m^{\prime}=O\left(\frac{m+1/\varepsilon^{2}}{\varepsilon}+q\right)$

$t^{\prime}=O(m^{\prime}+t).$

Algorithm 1 essentially draws a random function $f$ of bias $p$ , where $p$ is the mean of the labels in the refutation distribution $\mathcal{D}^{\mathrm{refut.}}$ , and filters the samples to just pairs where $y=f(x)$ . The drawing of $f$ is “lazy:” to draw $f$ and answer queries to it, it suffices to draw each value of $f(x)$ from $\mathsf{Bernoulli}(p)$ the first time we need to know $f(x)$ , then store $(x,f(x))$ in a table for consistency. We implement the random coin by reading a label, as the labels are distributed as $\mathsf{Bernoulli}(p)$ .
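The lazy drawing of $f$ can be illustrated with a small table-backed callable. This is a minimal sketch, assuming the reserved labels are passed in as a list; the class name `LazyBiasedFunction` and its methods are hypothetical and not taken from Algorithm 1.

```python
class LazyBiasedFunction:
    """A p-biased random function whose values are drawn on first use.

    Each fresh evaluation consumes one reserved label from the refutation
    sample, so the coin has the same bias p as the labels themselves.
    """

    def __init__(self, reserved_labels):
        self._coins = iter(reserved_labels)  # labels set aside to act as coins
        self._table = {}                     # (x -> f(x)) for consistency

    def __call__(self, x):
        if x not in self._table:
            try:
                self._table[x] = next(self._coins)
            except StopIteration:
                # Corresponds to the "insufficient samples" failure event.
                raise RuntimeError("ran out of reserved labels") from None
        return self._table[x]

    def evaluated_points(self):
        """Number of distinct points at which f has been fixed so far."""
        return len(self._table)
```

Repeated evaluations at the same point return the stored value and consume no further labels, which is why the number of reserved labels only needs to scale with the number of distinct samples and queries.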

3.1.1 Properties of the Filtered Sample Distribution

Before proving the theorem, it will be useful to have the following supporting claims about the distribution that the sets $S_{\mathsf{train}}$ and $S_{\mathsf{test}}$ are drawn from. This distribution $\mathcal{D}$ – which is the unknown distribution that the TL-Q instance is running on – depends on the random function $f$ . Formally, the PMF of $\mathcal{D}$ is the following, and one may observe that the construction of $S$ in lines 7-11 of Algorithm 1 produces samples from this distribution:

Definition 14 (Filtered sample distribution).

Let $\mathcal{D}^{\mathrm{refut.}}$ be a distribution over $\mathcal{X}\times\{0,1\}$ with $\mathcal{X}$-marginal $\mathcal{D}_{x}$ and let $f:\mathcal{X}\to\{0,1\}$ be a $p$-biased random function. Let the function $y(x)$ be defined as $\operatorname*{\mathbb{E}}_{(x^{\prime},y^{\prime})\sim\mathcal{D}^{\mathrm{refut.}}}\left[y^{\prime}\mid x^{\prime}=x\right]$. Then the filtered sample distribution $\mathcal{D}$ is defined by the following PMF:

$\mathcal{D}(x)=\frac{\mathcal{D}_{x}(x)}{Z}\cdot\left(p(1-y(x))(1-f(x))+(1-p)y(x)f(x)\right),$

where $Z$ is the normalization factor

$Z=\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}_{x}}\left[(1-p)y(x)f(x)+p(1-y(x))(1-f(x))\right].$
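For a small finite domain, the PMF of Definition 14 can be evaluated directly. The sketch below assumes the marginal, the conditional label mean $y(x)$, and the function $f$ are given as dictionaries keyed by domain points; the name `filtered_pmf` is hypothetical.

```python
def filtered_pmf(dx, y, f, p):
    """Return (pmf, z) for the filtered sample distribution.

    dx -- dict x -> D_x(x), the marginal probabilities
    y  -- dict x -> y(x) in [0, 1], the conditional mean of the label at x
    f  -- dict x -> f(x) in {0, 1}, the random function
    p  -- bias of f (the overall mean of the labels)
    """
    # Unnormalized mass at x: D_x(x) * ((1-p) y(x) f(x) + p (1-y(x)) (1-f(x))).
    weights = {
        x: dx[x] * ((1 - p) * y[x] * f[x] + p * (1 - y[x]) * (1 - f[x]))
        for x in dx
    }
    z = sum(weights.values())  # equals the expectation defining Z
    if z == 0:
        raise ValueError("filtered distribution has no mass")
    return {x: w / z for x, w in weights.items()}, z
```

When $y(x)=p$ for every $x$, each weight collapses to $\mathcal{D}_{x}(x)\cdot p(1-p)$, so the function returns the original marginal. When the labels are deterministic, points where $f(x)$ disagrees with the label receive zero mass.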

We state the properties below; the proofs are deferred to the full version of this paper.

Lemma 15.

For every $\delta\geq 0$, it holds with probability at least $1-2\exp\left(-\Omega\left(\frac{\delta^{2}\cdot p^{2}(1-p)^{2}}{\|\mathcal{D}_{x}\|_{2}^{2}}\right)\right)$ over the randomness of $f$ that

$\left|Z-p(1-p)\right|\leq\delta\cdot p(1-p).$
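A quick Monte Carlo experiment illustrates the concentration of $Z$ around $p(1-p)$. It is a toy check under assumptions, not a proof: the marginal is uniform over `n_points` integers (so $\|\mathcal{D}_{x}\|_{2}=1/\sqrt{\texttt{n\_points}}$), the labels are a deterministic function with mean $p=1/4$, and the function names are hypothetical.

```python
import random


def normalizer(y, f, p):
    """Z for the uniform marginal: average of (1-p) y f + p (1-y)(1-f)."""
    n = len(y)
    total = sum(
        (1 - p) * y[x] * f[x] + p * (1 - y[x]) * (1 - f[x]) for x in range(n)
    )
    return total / n


def relative_deviation(n_points=10000, seed=0):
    """Return |Z - p(1-p)| / (p(1-p)) for one draw of the random function f."""
    rng = random.Random(seed)
    # Deterministic labels: y(x) = 1 on every fourth point, so p = 1/4.
    y = [1 if x % 4 == 0 else 0 for x in range(n_points)]
    p = sum(y) / n_points
    # p-biased random function f, drawn independently of the labels.
    f = [1 if rng.random() < p else 0 for _ in range(n_points)]
    z = normalizer(y, f, p)
    return abs(z - p * (1 - p)) / (p * (1 - p))
```

With these parameters the standard deviation of $Z$ is roughly one percent of $p(1-p)$, so the relative deviation should be small for essentially every seed.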

Claim 16.

Let $g:\mathcal{X}\to\{0,1\}$ be an arbitrary function. For any $\delta$, with probability at least $1-\exp\left(-\Omega\left(\frac{(\delta p)^{2}(1-p)^{2}}{\|\mathcal{D}_{x}\|_{2}^{2}}\right)\right)$ over the randomness of $f$, we have

$\mathop{{\Pr}\/}_{x\sim\mathcal{D}}[g(x)\neq f(x)]\leq\mathop{{\Pr}\/}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}[g(x)\neq y]+\delta.$

3.1.2 Proof of Theorem 13

With the above properties in hand, we will now prove the main theorem of this section.

Proof of Theorem 13.

Let $\mathcal{A}$ be the $(c,\varepsilon,1/10)$ -testable learner for $\mathcal{C}$ . We will show that Algorithm 1 is a $(c\eta+4\varepsilon,\eta)$ refutation algorithm over any $\mathcal{D}^{\mathrm{refut.}}$ with $\mathcal{X}$ -marginal $\mathcal{D}_{x}$ such that $\mathcal{D}_{x}\in\mathcal{F}$ . Since the samples come from a refutation instance, one of the following must hold:

$\blacksquare$

Structure: $\mathop{{\Pr}\/}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}[g(x)\neq y]\leq\eta$ for some $g\in\mathcal{C}$ , or
$\blacksquare$

Noise: $y\sim\mathsf{Bernoulli}(p)$ for some $p\in[c\eta+4\varepsilon,1-c\eta-4\varepsilon]$ .

We will refer to $\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[y\right]$ as $p$ regardless of whether we are in the structure case or the noise case. We consider the following possibilities:

$\blacksquare$

Structure: In this case, there is some function $g\in\mathcal{C}$ such that $\mathop{{\Pr}\/}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}[g(x)\neq y]\leq\eta$ . By setting the constant $C_{1}$ large enough, we have by Hoeffding’s inequality that $\mathop{{\Pr}\/}[|\hat{p}-p|>\varepsilon]\leq 1/100$ . With the remaining probability, if $p<\varepsilon$ or $p>1-\varepsilon$ we would output $\mathsf{structure}$ after line 4, so we will assume from here that $\varepsilon\leq p\leq 1-\varepsilon$ .

We will denote by $\mathcal{D}$ the distribution over $\mathcal{X}$ from which our sample set $S$ is drawn, as discussed in Section 3.1.1. Since $\varepsilon\leq p\leq 1-\varepsilon$, by our assumption that $\varepsilon^{2}\geq ck\cdot\sup_{\mathcal{D}_{x}\in\mathcal{F}}(\|\mathcal{D}_{x}\|_{2})$ we have

$\frac{(\varepsilon/c)^{2}\cdot p^{2}(1-p)^{2}}{\|\mathcal{D}_{x}\|_{2}^{2}}\geq\Omega\left(\frac{\varepsilon^{4}}{c^{2}\|\mathcal{D}_{x}\|_{2}^{2}}\right)\geq\Omega(k^{2}).$

By Claim 16, setting the constant $k$ to be large enough, we then have

$\mathrm{dist}_{\mathcal{D}}(f,\mathcal{C})\leq\mathop{{\Pr}\/}_{x\sim\mathcal{D}}[f(x)\neq g(x)]\leq\mathop{{\Pr}\/}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}[g(x)\neq y]+\varepsilon/c\leq\eta+\varepsilon/c$

with probability at least $1-\exp(-\Omega(k^{2}))\geq 99/100$ over the randomness of $f$ . When this happens, $\mathcal{A}$ has two possible sound behaviors: either output $\bot$ , or output $h$ satisfying

$\mathrm{dist}_{\mathcal{D}}(h,f)\leq c(\eta+\varepsilon/c)+\varepsilon\leq c\eta+2\varepsilon.$

By the TL-Q guarantee, it produces one of these sound behaviors with probability at least $9/10$ . By setting the constant $C_{3}$ large enough, by Hoeffding’s inequality if $\mathcal{A}$ outputs a hypothesis $h\neq\bot$ , it satisfies $\widehat{\mathsf{err}}(h)\leq\mathrm{dist}_{\mathcal{D}}(h,f)+\varepsilon$ with probability at least

$1-\exp(-2\varepsilon^{2}|T|)\geq 99/100.$

When this happens, we have $\widehat{\mathsf{err}}(h)\leq c\eta+3\varepsilon$ , so we successfully output $\mathsf{structure}$ .

$\blacksquare$

Noise: In this case, the labels are drawn from $\mathsf{Bernoulli}(p)$ , and the elements are drawn from $\mathcal{D}$ . Since $y(x)=p$ for all $x$ , we have

$\mathcal{D}(x)=\frac{\mathcal{D}_{x}(x)\cdot p(1-p)}{Z}\quad\text{and}\quad Z=\sum_{z\in\mathcal{X}}\mathcal{D}_{x}(z)\cdot p(1-p)=p(1-p).$

Thus, $\mathcal{D}=\mathcal{D}_{x}$ , so by completeness of $\mathcal{A}$ , it must output $\bot$ with probability at most $1/10$ . With the remaining probability, $\mathcal{A}$ outputs a hypothesis $h$ ; we will argue that with high probability its error on $S_{\mathsf{test}}$ is at least $\min(p,1-p)-\varepsilon>c\eta+3\varepsilon$ . We return an error if any element in the test set appears anywhere else, so assuming this does not happen, each label in the test set is a new independent draw from $\mathsf{Bernoulli}(p)$ . Thus we have:

$\begin{aligned}\Pr_{x\sim T}[h(x)\neq f(x)]&=\frac{1}{|T|}\sum_{x\in T:h(x)=0}\mathsf{Bernoulli}(p)+\frac{1}{|T|}\sum_{x\in T:h(x)=1}\mathsf{Bernoulli}(1-p)\\ &\geq\frac{1}{|T|}\mathsf{Binomial}(\min(p,1-p),|T|).\end{aligned}$

By a Hoeffding bound it follows that $\widehat{\mathsf{err}}(h)\geq\min(p,1-p)-\varepsilon$ with probability at least

$1-2\exp(-2\varepsilon^{2}|T|)\geq 99/100.$

Thus, we successfully output $\mathsf{noise}$ with high probability.

In both cases the refutation algorithm succeeds with probability $\geq 4/5$ conditioned on not returning an error due to either insufficient samples, duplicate samples, or overlap between the query set and the test set.

We will set the sample complexity so that the probability of insufficient samples is small. Let $B$ be the number of samples reserved for drawing from $\mathsf{Bernoulli}(p)$ or estimating $\hat{p}$. Each of the remaining $m^{\prime}-B$ samples is included in $S$ with probability $p(1-p)$. Thus we will set $m^{\prime}-B\geq\frac{100}{(c\eta+4\varepsilon)(1-c\eta-4\varepsilon)}\cdot(m+C_{3}/\varepsilon^{2})$. Then we have by a Chernoff bound,

$\Pr\left[|S|<m+\frac{C_{3}}{\varepsilon^{2}}\right]\leq\exp\left(-\frac{(99/100)^{2}}{2}\cdot p(1-p)(m^{\prime}-B)\right)\leq 1/100.$

To set $B$ , we need 2 draws for each of the samples and 1 draw for each membership query; by setting $C_{2}$ large enough this requirement is satisfied. Thus the total sample complexity is

m^{\prime}\coloneqq O\left(\frac{m+1/\varepsilon^{2}}{\varepsilon}+q\right)

and the total time complexity is $O(m^{\prime}+t)$ .

Finally, we will bound the probability of duplicate samples and overlap with the query set. By the assumption that $m+1/\varepsilon^{2}\ll 1/\|\mathcal{D}_{x}\|_{2}$ and the fact that each pair of samples collides with probability $\|\mathcal{D}_{x}\|_{2}^{2}$ , it follows from a union bound over the pairs of samples in $S$ that w.h.p. there are no duplicates in $S_{\mathsf{train}}\cup S_{\mathsf{test}}$ . Furthermore, by the assumption that $q/\varepsilon^{2}\ll 1/\|\mathcal{D}_{x}\|_{2}\leq 1/\|\mathcal{D}_{x}\|_{\infty}$ , it follows that every set of size $q$ has distributional mass $\ll\varepsilon^{2}$ . Thus, with high probability none of the $C_{3}/\varepsilon^{2}$ elements in the test set appear in the query set.
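To make the two union bounds concrete, the sketch below evaluates them for the uniform distribution over $\{0,1\}^{n}$ , where $\|\mathcal{D}_{x}\|_{2}^{2}=\|\mathcal{D}_{x}\|_{\infty}=2^{-n}$ . The sample, test-set, and query counts are hypothetical values chosen only for illustration.

```python
from math import comb


def collision_bound(num_samples: int, l2_norm_sq: float) -> float:
    """Union bound over all pairs: each pair collides with probability ||D_x||_2^2."""
    return comb(num_samples, 2) * l2_norm_sq


def overlap_bound(test_size: int, q: int, linf_norm: float) -> float:
    """Union bound over test points: a fixed set of q points has mass <= q * ||D_x||_inf."""
    return test_size * q * linf_norm


n = 40
uniform_l2_sq = 2.0 ** (-n)  # ||U||_2^2 for the uniform distribution on {0,1}^n
uniform_linf = 2.0 ** (-n)   # ||U||_inf for the same distribution

print(collision_bound(1000, uniform_l2_sq))
print(overlap_bound(300, 10_000, uniform_linf))
```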

Union bounding over all the failure probabilities in each case, the total success probability remains at least 2/3. $\hfill\blacktriangleleft$

3.2 TL-Q Implies Sample-Based Learnability

A corollary of Theorem 13 is the fact that TL-Q implies efficient sample-based, distribution-specific agnostic learning.

Corollary 17 (Testable learning with queries implies learning with samples).

Let $\mathcal{C}$ be $(c,\varepsilon,\tfrac{1}{10})$ -PAC-testably-learnable with $m$ samples, $q$ queries, and $t$ time, on a distribution family $\mathcal{F}$ satisfying

m+q/\varepsilon^{2}\ll\frac{1}{\sup_{\mathcal{D}_{x}\in\mathcal{F}}(\|\mathcal{D}_{x}\|_{2})}.

Then for any $\varepsilon$ satisfying $\varepsilon^{2}\geq ck\cdot\sup_{\mathcal{D}_{x}\in\mathcal{F}}(\|\mathcal{D}_{x}\|_{2})$ for sufficiently large constant $k$ , there is an agnostic learner for $\mathcal{C}$ over $\mathcal{F}$ with excess error $1-1/c+O(\varepsilon)$ . In particular, when $c=1$ , i.e. $\mathcal{C}$ is fully agnostically learnable in TL-Q, $\mathcal{C}$ is fully agnostically learnable with samples.

The sample complexity of the learner is

O\left((m^{\prime})^{3}/\varepsilon^{2}\right)\quad\text{where}\quad m^{\prime}=O\left(\frac{m+1/\varepsilon^{2}}{\varepsilon}+q\right)

and the time complexity is

O\left(\frac{(m^{\prime})^{2}(m^{\prime}+t)}{\varepsilon^{2}}\right).

Proof.

By Theorem 13, there is a $(c\eta+4\varepsilon,\eta)$ -refutation algorithm for any $\eta<\frac{1/2-4\varepsilon}{c}$ . Observe that biased $(\alpha,\eta)$ -refutation is at least as strong as $\eta$ -refutation: the $(\alpha,\eta)$ -refutation algorithm can be used to solve $\eta$ -refutation, as $p=1/2$ is always in the range $[\alpha,1-\alpha]$ . By Lemma 11, this gives an agnostic learner with excess error $1-2\cdot\frac{1/2-4\varepsilon}{c}+\varepsilon=1-1/c+O(\varepsilon)$ . The time and sample bounds are obtained by combining the bounds in Lemma 11 with those in Theorem 13. $\hfill\blacktriangleleft$
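The dependence of this bound on $c$ can be tabulated directly. The sketch below simply evaluates the expression $1-2\cdot\frac{1/2-4\varepsilon}{c}+\varepsilon$ from the proof; the function name and the chosen values of $c$ and $\varepsilon$ are illustrative assumptions.

```python
def excess_error(c: float, eps: float) -> float:
    """Excess error 1 - 2*eta_max + eps, with eta_max = (1/2 - 4*eps) / c."""
    eta_max = (0.5 - 4.0 * eps) / c
    return 1.0 - 2.0 * eta_max + eps


for c in (1.0, 1.5, 2.0):
    print(c, excess_error(c, eps=0.01))
```

For $c=1$ the bound is $9\varepsilon$ , whereas for $c=2$ it already exceeds $1/2$ , the error of a trivial predictor.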

3.3 Realizably Learning Juntas via Exact Refutation

In the above subsections, we gave a general reduction from TL-Q to agnostic learning, citing as a black box the learning-by-refutation lemma of [17], Lemma 11. This lemma yields learners whose excess error depends on the refutation gap parameter $\eta$ , which in our case must necessarily be smaller than $1/2c$ , as one cannot hope to distinguish a function that is $(1/2c)$ -close to the concept class from a random function using a $c$ -semi-agnostic learner. Thus the performance of this learner quickly degrades with $c$ , becoming trivial when $c=2$ .

For the class of sparse juntas over the uniform distribution on $\{0,1\}^{n}$ , we give a realizable learner from an algorithm that solves the easier task of exact refutation ( $\eta=0$ ). Exact refutation reduces to $c$ -semi-agnostic TL-Q for any value of $c$ , via Theorem 13. Thus we show that for juntas, even for large values of $c$ , semi-agnostic TL-Q is as hard as learning with samples. We state the result below; the proof is deferred to the full version of this paper.

Lemma 18.

Let $\varepsilon\gg 2^{-n/4}$ and $m+q/\varepsilon^{2}\ll 2^{n/2}$ . For any constant $c$ , if the class of $k$ -juntas is $(c,\varepsilon,\tfrac{1}{10})$ -PAC-testably-learnable with $q$ queries, $m$ samples, and $t$ time over the uniform distribution on $\{0,1\}^{n}$ , then the class of $k$ -juntas is agnostically learnable over the uniform distribution with excess error $O(\varepsilon)$ and confidence $1-\delta$ , with sample complexity

m^{\prime}=\left(\frac{m+1/\varepsilon^{2}}{\varepsilon}+q\right)\cdot 2^{O(k)}\cdot n\log^{2}(n/\delta)

and time complexity

t^{\prime}=\left(\frac{m+1/\varepsilon^{2}}{\varepsilon}+q+t\right)\cdot 2^{O(k)}\cdot n\log(n/\delta).

Since $k$ -sparse parities are a subclass of $k$ -juntas, we conclude that even semi-agnostic TL-Q cannot be done in $n^{o(k)}$ time, under the assumption that LSPN is hard.

Corollary 19.

If LSPN requires $n^{\Omega(k)}$ time, then for any constant $c$ , $c$ -semi-agnostic TL-Q for $k$ -juntas over the uniform distribution requires $n^{\Omega(k)}$ time.

4 MQ-SQ Lower Bounds

We introduce a class of “MQ-SQ” (membership-query-statistical-query) algorithms that capture many existing learning algorithms that use membership queries. Then, we prove an SQ-analogue of the reduction in Section 3.1: an MQ-SQ testable learner for a class $\mathcal{C}$ implies an SQ algorithm for refutation (Definition 9). This result, together with a reduction from SQ weak learning to SQ refutation, allows us to prove lower bounds against MQ-SQ algorithms for testably learning several fundamental concept classes, including parity functions, $k$ -juntas, and decision trees.

4.1 Five Types of MQ-SQs

Let $\mathcal{X}$ denote the instance space and $f:\mathcal{X}\to\{0,1\}$ denote the target function in the TL-Q instance. Let $\mathcal{D}\in\Delta(\mathcal{X})$ be the unknown marginal distribution over $\mathcal{X}$ . An MQ-SQ oracle for $(f,\mathcal{D})$ answers the following five types of queries up to a small error.

Definition 20 (MQ-SQ Oracle).

An MQ-SQ oracle with tolerance $\tau\geq 0$ answers the following five types of queries within an additive error of $\tau$ , given any test function $\phi:\mathcal{X}\to[0,1]$ , any distribution $\mathcal{D}^{\star}\in\Delta(\mathcal{X})$ , and any permutation $\pi:\mathcal{X}\to\mathcal{X}$ without fixed points:

$\blacksquare$

Type I: $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}^{\star}}\left[\phi(x)f(x)\right]$ .
$\blacksquare$

Type II: $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}^{\star}}\left[\phi(x)f(x)f(\pi(x))\right]$ .
$\blacksquare$

Type III: $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)\right]$ .
$\blacksquare$

Type IV: $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)f(x)\right]$ .
$\blacksquare$

Type V: $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)f(x)f(\pi(x))\right]$ .

For concreteness, consider the case that $\mathcal{X}=\{0,1\}^{n}$ is the hypercube. By setting $\mathcal{D}^{\star}$ to the uniform distribution over $\{0,1\}^{n}$ (denoted by $\mathcal{U}$ ) in Type I queries, we recover the usual statistical query model for learning over $\mathcal{U}$ . Queries of Types II and V allow us to estimate the correlation between $f$ and a permuted version of $f$ (weighted by $\phi$ ), over a customized distribution $\mathcal{D}^{\star}$ and over the unknown marginal $\mathcal{D}$ , respectively. For instance, setting $\pi:x\mapsto x^{\oplus i}$ and $\mathcal{D}^{\star}=\mathcal{U}$ allows us to estimate the influence of variable $x_{i}$ with respect to $f$ .
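The influence example can be made concrete with a small sketch. It assumes the standard definition $\mathrm{Inf}_{i}(f)=\Pr_{x\sim\mathcal{U}}[f(x)\neq f(x^{\oplus i})]$ and uses the identity $\mathbbm{1}[b_{1}\neq b_{2}]=b_{1}+b_{2}-2b_{1}b_{2}$ : since $x^{\oplus i}$ is uniform whenever $x$ is, $\mathrm{Inf}_{i}(f)=2\operatorname*{\mathbb{E}}[f(x)]-2\operatorname*{\mathbb{E}}[f(x)f(x^{\oplus i})]$ , a combination of one Type I and one Type II query with $\phi\equiv 1$ . The helper names are hypothetical, and the queries are answered exactly (tolerance $0$ ) by enumeration.

```python
from itertools import product


def flip(x, i):
    """Return x with its i-th bit flipped; this permutation has no fixed points."""
    return x[:i] + (1 - x[i],) + x[i + 1:]


def type_I(f, phi, points):
    """Exact Type I answer E_{x~D*}[phi(x) f(x)] for D* uniform over `points`."""
    return sum(phi(x) * f(x) for x in points) / len(points)


def type_II(f, phi, pi, points):
    """Exact Type II answer E_{x~D*}[phi(x) f(x) f(pi(x))] for D* uniform over `points`."""
    return sum(phi(x) * f(x) * f(pi(x)) for x in points) / len(points)


def influence(f, i, n):
    """Inf_i(f) = 2 E[f(x)] - 2 E[f(x) f(x^{xor i})] over the uniform distribution."""
    cube = list(product((0, 1), repeat=n))
    one = lambda x: 1.0
    mean = type_I(f, one, cube)
    corr = type_II(f, one, lambda x: flip(x, i), cube)
    return 2.0 * mean - 2.0 * corr


f_and = lambda x: x[0] & x[1]  # a 2-junta on 3 bits
print([influence(f_and, i, 3) for i in range(3)])
```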

When the distribution $\mathcal{D}^{\star}$ in a Type I MQ-SQ is degenerate (i.e., with all its probability mass on a single point $x_{0}\in\mathcal{X}$ ), the MQ-SQ reduces to a usual membership query at $x_{0}$ . Our reduction for MQ-SQ algorithms only works when no such queries are made. More concretely, the reduction requires that the squared $2$ -norm of $\mathcal{D}^{\star}$ , $\|\mathcal{D}^{\star}\|_{2}^{2}\coloneqq\sum_{x\in\mathcal{X}}[\mathcal{D}^{\star}(x)]^{2}$ , is sufficiently small in all queries that the algorithm makes. As we will see later, when many existing query-based learning algorithms are implemented using MQ-SQs, $\mathcal{D}^{\star}$ is usually uniform over a size- $2^{\Omega(n)}$ subset of $\{0,1\}^{n}$ . This implies $\|\mathcal{D}^{\star}\|_{2}^{2}\leq 2^{-\Omega(n)}$ , so our reduction applies to most of the interesting parameter regimes.
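The contrast between the two regimes is easy to check numerically. The sketch below compares a point mass (a plain membership query) with the uniform distribution over a subcube, taken here as an illustrative instance of a large subset; the helper names and the choice of fixed coordinates are assumptions.

```python
from itertools import product


def l2_norm_sq(dist):
    """Squared 2-norm: the sum of squared probability masses."""
    return sum(mass * mass for mass in dist.values())


def uniform_on_subcube(n, fixed):
    """Uniform distribution over points of {0,1}^n agreeing with `fixed` (index -> bit)."""
    points = [x for x in product((0, 1), repeat=n)
              if all(x[i] == b for i, b in fixed.items())]
    return {x: 1.0 / len(points) for x in points}


subcube = uniform_on_subcube(10, {0: 1, 3: 0, 7: 1})  # 2^7 points remain
point_mass = {(0,) * 10: 1.0}                         # degenerate D*

print(l2_norm_sq(subcube), l2_norm_sq(point_mass))
```

A uniform distribution over $N$ points has squared $2$ -norm exactly $1/N$ , whereas a point mass attains the maximum value $1$ .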

4.2 Implementing Query-Based Learning Algorithms Using MQ-SQs

We note that many existing MQ-based learning algorithms (or components thereof) for the uniform distribution $\mathcal{U}$ over the hypercube $\mathcal{X}=\{0,1\}^{n}$ can be implemented using MQ-SQs. Examples of well-known MQ algorithms and their MQ-SQ implementations are given in the full version of this paper.

4.3 MQ-SQ Testable Learning Implies SQ Refutation

We will show that, if there is an efficient MQ-SQ algorithm that testably learns class $\mathcal{C}$ , the same class can be refuted by an efficient SQ algorithm. To this end, we first recall the definition of a (usual) SQ-based algorithm in the context of refutation (Definition 9). As in Section 3.1, we let $\mathcal{D}^{\mathrm{refut.}}$ denote the distribution of labeled pairs in the refutation instance and $\mathcal{D}$ denote the unknown marginal distribution in a TL-Q instance.

Definition 21 (SQ oracle for refutation).

An SQ oracle for refutation with tolerance $\tau\geq 0$ answers queries of form $\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[\phi(x,y)\right]$ within an additive error of $\tau$ given a test function $\phi:\mathcal{X}\times\{0,1\}\to[0,1]$ .

Recap: Reduce refutation to TL-Q

We start by recalling the reduction from Section 3.1. Let $p\coloneqq\Pr_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[y=1\right]$ be the fraction of positive labels in the refutation instance. We consider the TL-Q instance on the same concept class $\mathcal{C}$ , where the target function $f:\mathcal{X}\to\{0,1\}$ is chosen as a random $p$ -biased function, i.e., each function value $f(x)$ is sampled from $\mathsf{Bernoulli}(p)$ independently. The marginal distribution $\mathcal{D}\in\Delta(\mathcal{X})$ is the distribution of the output $x\in\mathcal{X}$ produced by the following procedure:

$\blacksquare$

Sample $(x,y)\sim\mathcal{D}^{\mathrm{refut.}}$ .
$\blacksquare$

If $y=f(x)=1$ , output $x$ with probability $1-p$ .
$\blacksquare$

If $y=f(x)=0$ , output $x$ with probability $p$ .
$\blacksquare$

If no output is produced, return to the first step.

Formally, the probability mass function of $\mathcal{D}$ is given by

\mathcal{D}(x)=\frac{1}{Z}\cdot\mathcal{D}_{x}(x)\left[(1-p)\cdot y(x)\cdot f(x)+p\cdot(1-y(x))\cdot(1-f(x))\right],

(1)

where $\mathcal{D}_{x}\in\Delta(\mathcal{X})$ is the $\mathcal{X}$ -marginal of $\mathcal{D}^{\mathrm{refut.}}$ ,

y(x)\coloneqq\operatorname*{\mathbb{E}}_{(x^{\prime},y^{\prime})\sim\mathcal{D}^{\mathrm{refut.}}}\left[y^{\prime}\mid x^{\prime}=x\right]

is the conditional expectation of $y\mid x$ over $\mathcal{D}^{\mathrm{refut.}}$ , and

Z\coloneqq\sum_{x\in\mathcal{X}}\mathcal{D}_{x}(x)\left[(1-p)\cdot y(x)\cdot f(x)+p\cdot(1-y(x))\cdot(1-f(x))\right]

(2)

is a normalization factor.
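Equation (1) can be evaluated directly on a small domain. The sketch below is illustrative only: distributions and functions are represented as dictionaries, and the four-point domain and its numbers are made up. It checks two facts used in the reduction: when labels are independent $\mathsf{Bernoulli}(p)$ draws (so $y(x)=p$ for every $x$ ), the induced marginal equals $\mathcal{D}_{x}$ ; and when labels are deterministic, $\mathcal{D}$ is supported only on points where $f$ agrees with the label.

```python
def induced_marginal(d_x, y, f, p):
    """Probability mass function of D from Equation (1).

    d_x: dict x -> D_x(x); y: dict x -> E[y | x]; f: dict x -> f(x) in {0, 1}.
    """
    weights = {
        x: d_x[x] * ((1 - p) * y[x] * f[x] + p * (1 - y[x]) * (1 - f[x]))
        for x in d_x
    }
    z = sum(weights.values())  # normalization factor Z from Equation (2)
    return {x: w / z for x, w in weights.items()}


d_x = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}
f = {0: 1, 1: 0, 2: 1, 3: 0}

# Noise case: y(x) = p everywhere, so D coincides with D_x.
noise = induced_marginal(d_x, {x: 0.25 for x in d_x}, f, p=0.25)

# Deterministic labels: y(x) in {0, 1}, and p = Pr[y = 1] = 0.75.
labels = {0: 1, 1: 1, 2: 0, 3: 0}
structured = induced_marginal(d_x, labels, f, p=0.75)

print(noise)
print(structured)
```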

Simulation of oracle queries

We state the following lemmas, whose proofs are deferred to the full version of this paper.

Lemma 22 (Types I and II).

The following holds for every $\varepsilon\geq 0$ , $\phi:\mathcal{X}\to[0,1]$ , $\mathcal{D}^{\star}\in\Delta(\mathcal{X})$ and permutation $\pi:\mathcal{X}\to\mathcal{X}$ without fixed points:

$\blacksquare$

With probability at least $1-2e^{-\Omega(\varepsilon^{2}/\|\mathcal{D}^{\star}\|_{2}^{2})}$ over the randomness of $f$ ,

$\left|\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}^{\star}}\left[\phi(x)f(x)\right]-p\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}^{\star}}\left[\phi(x)\right]\right|\leq\varepsilon.$
$\blacksquare$

With probability at least $1-6e^{-\Omega(\varepsilon^{2}/\|\mathcal{D}^{\star}\|_{2}^{2})}$ over the randomness of $f$ ,

$\left|\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}^{\star}}\left[\phi(x)f(x)f(\pi(x))\right]-p^{2}\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}^{\star}}\left[\phi(x)\right]\right|\leq\varepsilon.$
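A small simulation illustrates the first bullet of Lemma 22; it is a sanity check under assumed parameters, not a substitute for the proof. We take $\mathcal{D}^{\star}$ uniform over $N$ points (so $\|\mathcal{D}^{\star}\|_{2}^{2}=1/N$ ), draw an arbitrary test function $\phi$ and a random $p$ -biased $f$ , and measure the gap between $\operatorname*{\mathbb{E}}[\phi(x)f(x)]$ and $p\cdot\operatorname*{\mathbb{E}}[\phi(x)]$ . The function name, the seed, and the choice of $\phi$ are assumptions.

```python
import random


def simulate_type_I_gap(n_points: int, p: float, seed: int) -> float:
    """Gap |E[phi f] - p E[phi]| for D* uniform over n_points and a p-biased f."""
    rng = random.Random(seed)
    phi = [rng.random() for _ in range(n_points)]                # test function in [0, 1]
    f = [1 if rng.random() < p else 0 for _ in range(n_points)]  # random p-biased function
    lhs = sum(a * b for a, b in zip(phi, f)) / n_points
    rhs = p * sum(phi) / n_points
    return abs(lhs - rhs)


for n_points in (16, 256, 4096):
    print(n_points, simulate_type_I_gap(n_points, p=0.3, seed=0))
```

The standard deviation of the gap is at most $\sqrt{p(1-p)/N}$ , so the printed gaps should typically shrink as $N$ grows, matching the $\varepsilon^{2}/\|\mathcal{D}^{\star}\|_{2}^{2}$ dependence in the exponent.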

Lemma 23 (Types III, IV and V).

The following holds for every $\varepsilon\geq 0$ , $\phi:\mathcal{X}\to[0,1]$ and permutation $\pi:\mathcal{X}\to\mathcal{X}$ without fixed points:

$\blacksquare$

With probability at least $1-4e^{-\Omega(\varepsilon^{2}\cdot p^{2}(1-p)^{2}/\|\mathcal{D}_{x}\|_{2}^{2})}$ over the randomness of $f$ ,

$\left|\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)\right]-\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[\phi(x)\right]\right|\leq\varepsilon.$
$\blacksquare$

With probability at least $1-4e^{-\Omega(\varepsilon^{2}\cdot p^{2}(1-p)^{2}/\|\mathcal{D}_{x}\|_{2}^{2})}$ over the randomness of $f$ ,

$\left|\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)f(x)\right]-\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[\phi(x)\cdot y\right]\right|\leq\varepsilon.$
$\blacksquare$

With probability at least $1-8e^{-\Omega(\varepsilon^{2}\cdot p^{2}(1-p)^{2}/\|\mathcal{D}_{x}\|_{2}^{2})}$ over the randomness of $f$ ,

$\left|\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)f(x)f(\pi(x))\right]-p\cdot\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[\phi(x)\cdot y\right]\right|\leq\varepsilon.$

Now, we put everything together and prove our main result on MQ-SQ algorithms for testable learning.

Proposition 24.

Let $\mathcal{C}$ be a concept class of boolean functions over instance space $\mathcal{X}$ . Suppose that there is a $(c,\varepsilon,\delta)$ -PAC MQ-SQ algorithm that testably learns $\mathcal{C}$ over distribution family $\mathcal{F}\subseteq\Delta(\mathcal{X})$ using at most $q$ queries to an MQ-SQ oracle with tolerance $\tau>0$ . Then, there is an algorithm that solves biased- $(\alpha,\eta)$ -refutation on $\mathcal{C}$ over the same distribution family $\mathcal{F}$ by making at most $q^{\prime}=q+O(1)$ queries to an SQ oracle with tolerance $\tau^{\prime}=\tau/4$ and has a failure probability of at most $\delta^{\prime}=\delta+O(q)\cdot e^{-\Omega(\tau^{2}B)}$ , assuming the following:

1.

$\alpha>c\eta+\varepsilon+(c+4)\tau+6\tau^{\prime}$ ;
2.

$B\leq p^{2}(1-p)^{2}/\|\mathcal{D}_{x}\|_{2}^{2}$ for every $\mathcal{D}_{x}\in\mathcal{F}$ , where $p=\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[y\right]$ is the average label in the refutation instance;
3.

$B\leq 1/\|\mathcal{D}^{\star}\|_{2}^{2}$ holds for all MQ-SQs of Types I and II that $\mathcal{A}$ makes.

Proof.

Let $\mathcal{A}$ denote the hypothetical MQ-SQ algorithm that testably learns $\mathcal{C}$ . We construct a new algorithm, denoted by $\mathcal{A}^{\prime}$ , that refutes $\mathcal{C}$ using an SQ oracle by simulating the execution of $\mathcal{A}$ .

Recall that we defined $p=\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[y\right]$ . As the first step of the algorithm, $\mathcal{A}^{\prime}$ queries the SQ oracle (Definition 21) with $\phi(x,y)=y$ to obtain an estimate $\widehat{p}\in[p-\tau^{\prime},p+\tau^{\prime}]$ for $p$ .

Handling queries.

Whenever the simulated copy of $\mathcal{A}$ makes an MQ-SQ, $\mathcal{A}^{\prime}$ answers the query as follows:

$\blacksquare$

When $\mathcal{A}$ makes a Type I query on $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}^{\star}}\left[\phi(x)f(x)\right]$ , $\mathcal{A}^{\prime}$ returns $\widehat{p}\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}^{\star}}\left[\phi(x)\right]$ .
$\blacksquare$

When $\mathcal{A}$ makes a Type II query on $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}^{\star}}\left[\phi(x)f(x)f(\pi(x))\right]$ , $\mathcal{A}^{\prime}$ returns ${\widehat{p}}^{2}\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}^{\star}}\left[\phi(x)\right]$ .
$\blacksquare$

When $\mathcal{A}$ makes a Type III query on $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)\right]$ , $\mathcal{A}^{\prime}$ queries the SQ oracle on the value of $\mu=\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[\phi(x)\right]$ and then forwards the answer $\widehat{\mu}\in[\mu-\tau^{\prime},\mu+\tau^{\prime}]$ to $\mathcal{A}$ .
$\blacksquare$

When $\mathcal{A}$ makes a Type IV query on $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)f(x)\right]$ , $\mathcal{A}^{\prime}$ queries the SQ oracle on the value of $\mu=\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[\phi(x)\cdot y\right]$ and then forwards the answer $\widehat{\mu}\in[\mu-\tau^{\prime},\mu+\tau^{\prime}]$ to $\mathcal{A}$ .
$\blacksquare$

When $\mathcal{A}$ makes a Type V query on $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)f(x)f(\pi(x))\right]$ , $\mathcal{A}^{\prime}$ queries the SQ oracle on the value of $\mu=\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[\phi(x)\cdot y\right]$ and then forwards the answer $\widehat{\mu}\in[\mu-\tau^{\prime},\mu+\tau^{\prime}]$ multiplied by $\widehat{p}$ to $\mathcal{A}$ .

In the first two cases, $\mathcal{A}^{\prime}$ can exactly compute the expectations since it has full knowledge of $\phi:\mathcal{X}\to[0,1]$ and $\mathcal{D}^{\star}\in\Delta(\mathcal{X})$ .
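The query-handling rules above can be sketched as a small class. This is a minimal illustration under stated assumptions rather than a definitive implementation: the SQ oracle is modeled as a hypothetical callable that receives a test function $\phi(x,y)$ and returns an estimate of its expectation over $\mathcal{D}^{\mathrm{refut.}}$ , distributions $\mathcal{D}^{\star}$ are dictionaries from points to masses, and all class and method names are invented for this example.

```python
from typing import Callable, Hashable

Point = Hashable
# Hypothetical interface: maps phi(x, y) in [0, 1] to an estimate of
# E_{(x,y)~D^refut}[phi(x, y)] within additive error tau'.
SQOracle = Callable[[Callable[[Point, int], float]], float]


class MQSQSimulator:
    """Answers MQ-SQs of Types I-V using only an SQ oracle for refutation."""

    def __init__(self, sq_oracle: SQOracle) -> None:
        self.sq_oracle = sq_oracle
        self.sq_calls = 0
        # First step of A': estimate p = E[y] with the test function phi(x, y) = y.
        self.p_hat = self._ask(lambda x, y: float(y))

    def _ask(self, phi):
        self.sq_calls += 1
        return self.sq_oracle(phi)

    @staticmethod
    def _mean(phi, d_star):
        # Exact expectation of phi under the fully known distribution D*.
        return sum(mass * phi(x) for x, mass in d_star.items())

    def type_I(self, phi, d_star):
        return self.p_hat * self._mean(phi, d_star)

    def type_II(self, phi, d_star, pi):
        # pi does not enter the simulated answer; no SQ oracle call is made.
        return self.p_hat ** 2 * self._mean(phi, d_star)

    def type_III(self, phi):
        return self._ask(lambda x, y: phi(x))

    def type_IV(self, phi):
        return self._ask(lambda x, y: phi(x) * y)

    def type_V(self, phi, pi):
        # Same SQ as Type IV, scaled by p_hat; pi is again unused.
        return self.p_hat * self.type_IV(phi)


# Toy refutation instance with an exact (tolerance 0) oracle.
d_refut = {(0, 1): 0.25, (0, 0): 0.25, (1, 1): 0.25, (1, 0): 0.25}
exact_oracle = lambda phi: sum(m * phi(x, y) for (x, y), m in d_refut.items())
sim = MQSQSimulator(exact_oracle)
uniform = {0: 0.5, 1: 0.5}
one = lambda x: 1.0
swap = lambda x: 1 - x
print(sim.p_hat, sim.type_I(one, uniform), sim.type_II(one, uniform, swap))
```

Types I and II consume no SQ oracle calls, which is why the simulation uses at most one SQ per MQ-SQ.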

Decision rule.

When $\mathcal{A}$ terminates, $\mathcal{A}^{\prime}$ decides on the refutation instance as follows:

$\blacksquare$

If $\mathcal{A}$ rejects the TL-Q instance, $\mathcal{A}^{\prime}$ returns $\mathsf{structure}$ , indicating that some $f^{*}\in\mathcal{C}$ has an error $\leq\eta$ on $\mathcal{D}^{\mathrm{refut.}}$ .
$\blacksquare$

If $\mathcal{A}$ accepts and returns a function $\widehat{f}:\mathcal{X}\to\{0,1\}$ , $\mathcal{A}^{\prime}$ makes the following three additional MQ-SQs on behalf of $\mathcal{A}$ and answers them as described above:

$\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\widehat{f}(x)\right],\quad\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[f(x)\right],\quad\text{and}\quad\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\widehat{f}(x)f(x)\right].$

Let $\widehat{\mu}_{1},\widehat{\mu}_{2},\widehat{\mu}_{3}$ denote the answers to the three queries, and let

$\widehat{\mu}=\widehat{\mu}_{1}+\widehat{\mu}_{2}-2\widehat{\mu}_{3}$

be a weighted sum of them. Note that $\widehat{\mu}$ is intended to be an estimate of

$\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\widehat{f}(x)+f(x)-2\widehat{f}(x)f(x)\right]=\Pr_{x\sim\mathcal{D}}\left[\widehat{f}(x)\neq f(x)\right].$

Finally, $\mathcal{A}^{\prime}$ outputs $\mathsf{noise}$ (indicating that the labels are random) if $\widehat{\mu}\geq\min\{\widehat{p},1-\widehat{p}\}-5\tau^{\prime}$ , and outputs $\mathsf{structure}$ otherwise.
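The decision rule can be written as a short function. The sketch below assumes the three answers $\widehat{\mu}_{1},\widehat{\mu}_{2},\widehat{\mu}_{3}$ have already been obtained; the function name, argument names, and example values are illustrative.

```python
def decide(accepted: bool, mu1: float, mu2: float, mu3: float,
           p_hat: float, tau_prime: float) -> str:
    """Decision rule of A' given the outcome of the simulated learner A.

    mu1, mu2, mu3 are the answers to the queries on E[f_hat], E[f], E[f_hat * f].
    """
    if not accepted:
        return "structure"
    mu_hat = mu1 + mu2 - 2.0 * mu3  # estimate of Pr[f_hat(x) != f(x)]
    threshold = min(p_hat, 1.0 - p_hat) - 5.0 * tau_prime
    return "noise" if mu_hat >= threshold else "structure"


# Estimated error 0.5 is at least the threshold 0.45, so the labels look random.
print(decide(True, 0.5, 0.5, 0.25, p_hat=0.5, tau_prime=0.01))
# Estimated error of about 0.1 is well below the threshold.
print(decide(True, 0.5, 0.5, 0.45, p_hat=0.5, tau_prime=0.01))
```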

Overview of analysis.

We first upper bound the number of SQs that $\mathcal{A}^{\prime}$ makes. By construction, $\mathcal{A}^{\prime}$ queries the SQ oracle at most once for every MQ-SQ made by $\mathcal{A}$ . In addition, $\mathcal{A}^{\prime}$ makes one query at the beginning and at most three queries at the end. Thus, $\mathcal{A}^{\prime}$ makes at most $q^{\prime}=q+O(1)$ queries in total.

To analyze the correctness of $\mathcal{A}^{\prime}$ , let $f:\mathcal{X}\to\{0,1\}$ be a random $p$ -biased function obtained by independently drawing the function value $f(x)$ from $\mathsf{Bernoulli}(p)$ for each $x\in\mathcal{X}$ . Note that $f$ is only for the analysis; it is never used in algorithm $\mathcal{A}^{\prime}$ . Also, let $\mathcal{D}$ denote the distribution over $\mathcal{X}$ induced by $\mathcal{D}^{\mathrm{refut.}}$ and $f$ (see Equation 1). We will first argue that, with high probability, the simulated copy of $\mathcal{A}$ effectively runs on an instance of testable learning with target function $f$ and marginal distribution $\mathcal{D}$ . We will then show that the decision made by $\mathcal{A}^{\prime}$ is correct due to the intended behavior of $\mathcal{A}$ on such an instance.

A good event.

Let $\mathcal{E}^{\textsf{good}}$ be the “good event” that the three conditions below hold simultaneously:

$\blacksquare$

The simulated execution of $\mathcal{A}$ coincides with its execution on the testable learning instance $(f,\mathcal{D})$ using an MQ-SQ oracle with tolerance $\tau$ . In other words, every MQ-SQ made by $\mathcal{A}$ is answered up to an additive error of $\tau$ .
$\blacksquare$

If the first condition holds, the output of $\mathcal{A}$ is valid with respect to the testable learning instance.
$\blacksquare$

If there exists $f^{*}\in\mathcal{C}$ that satisfies $\Pr_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[f^{*}(x)\neq y\right]\leq\eta$ (i.e., we are in the $\mathsf{structure}$ case), it holds that $\Pr_{x\sim\mathcal{D}}\left[f^{*}(x)\neq f(x)\right]\leq\eta+\tau$ .

By Lemmas 22 and 23, for each MQ-SQ made by $\mathcal{A}$ , the first condition gets violated with probability at most $8e^{-\Omega(\tau^{2}B)}$ , where $B$ is the minimum between the value of $1/\|\mathcal{D}^{\star}\|_{2}^{2}$ (among all queries of Types I and II) and $p^{2}(1-p)^{2}/\|\mathcal{D}_{x}\|_{2}^{2}$ . Applying the union bound to the $\leq q+4$ queries shows that the first condition gets violated with probability at most $O(q)\cdot e^{-\Omega(\tau^{2}B)}$ . Since $\mathcal{A}$ is assumed to be $(c,\varepsilon,\delta)$ -PAC, the second condition gets violated with probability at most $\delta$ . Applying Claim 16 with $\delta=\tau$ shows that the third condition gets violated with probability at most $e^{-\Omega(\tau^{2}p^{2}(1-p)^{2}/\|\mathcal{D}_{x}\|_{2}^{2})}\leq e^{-\Omega(\tau^{2}B)}$ . Applying the union bound again gives

\Pr\left[\mathcal{E}^{\textsf{good}}\right]\geq 1-\delta-O(q)\cdot e^{-\Omega(\tau^{2}B)}.

In the rest of the proof, we show that event $\mathcal{E}^{\textsf{good}}$ implies that $\mathcal{A}^{\prime}$ decides correctly.

Proof of completeness.

Suppose that some $f^{*}\in\mathcal{C}$ satisfies $\Pr_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[f^{*}(x)\neq y\right]\leq\eta$ , where $\eta$ is the parameter of the refutation instance (Definition 9). The third condition of the good event $\mathcal{E}^{\textsf{good}}$ implies that the same $f^{*}$ has an error $\leq\eta+\tau$ over $\mathcal{D}$ . Then, the testable learner $\mathcal{A}$ may output either $\bot$ or a function $\widehat{f}:\mathcal{X}\to\{0,1\}$ that satisfies

\Pr_{x\sim\mathcal{D}}\left[\widehat{f}(x)\neq f(x)\right]\leq c\cdot\Pr_{x\sim\mathcal{D}}\left[f^{*}(x)\neq f(x)\right]+\varepsilon\leq c(\eta+\tau)+\varepsilon.

In the former case, $\mathcal{A}^{\prime}$ would correctly output $\mathsf{structure}$ . For the latter case, applying the identity $\mathbbm{1}\left[b_{1}\neq b_{2}\right]=b_{1}+b_{2}-2b_{1}b_{2}$ for $b_{1},b_{2}\in\{0,1\}$ gives

\Pr_{x\sim\mathcal{D}}\left[\widehat{f}(x)\neq f(x)\right]=\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\widehat{f}(x)\right]+\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[f(x)\right]-2\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\widehat{f}(x)f(x)\right].

Assuming event $\mathcal{E}^{\textsf{good}}$ , $\mathcal{A}^{\prime}$ obtains an estimate of each of the three expectations on the right-hand side above up to an additive error of $\tau$ . It then follows that the value of $\widehat{\mu}$ computed at the end satisfies

\widehat{\mu}\leq\Pr_{x\sim\mathcal{D}}\left[\widehat{f}(x)\neq f(x)\right]+4\tau\leq c\eta+\varepsilon+(c+4)\tau.

Since $\min\{p,1-p\}\geq\alpha>c\eta+\varepsilon+(c+4)\tau+6\tau^{\prime}$ , we have

\widehat{\mu}<\min\{p,1-p\}-6\tau^{\prime}\leq\min\{\widehat{p},1-\widehat{p}\}-5\tau^{\prime}

in this case, and $\mathcal{A}^{\prime}$ would correctly output $\mathsf{structure}$ .

Proof of soundness.

Suppose that the distribution $\mathcal{D}^{\mathrm{refut.}}$ in the refutation instance is the product distribution of some $\mathcal{D}_{x}\in\mathcal{F}$ and $\mathsf{Bernoulli}(p)$ . Then, the resulting marginal distribution $\mathcal{D}$ in the testable learning instance is exactly $\mathcal{D}_{x}\in\mathcal{F}$ . Thus, assuming that $\mathcal{A}$ is correct, $\mathcal{A}$ would accept and output a function $\widehat{f}:\mathcal{X}\to\{0,1\}$ . Then, at the end of $\mathcal{A}^{\prime}$ , we compute

\widehat{\mu}\approx\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\widehat{f}(x)\right]+\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[f(x)\right]-2\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\widehat{f}(x)f(x)\right].

By the way in which $\mathcal{A}^{\prime}$ handles the MQ-SQs, the three terms

\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\widehat{f}(x)\right],\quad\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[f(x)\right],\quad\text{and}\quad\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\widehat{f}(x)f(x)\right]

are approximated with

\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[\widehat{f}(x)\right],\quad\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[y\right],\quad\text{and}\quad\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[\widehat{f}(x)\cdot y\right],

respectively. All the three values are obtained from querying the SQ oracle. Since the SQ oracle has a tolerance of $\tau^{\prime}$ , the value of $\widehat{\mu}$ is within an additive error of $4\tau^{\prime}$ to

\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[\widehat{f}(x)+y-2\widehat{f}(x)\cdot y\right]=\Pr_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[\widehat{f}(x)\neq y\right],

which is exactly the error of $\widehat{f}$ on distribution $\mathcal{D}^{\mathrm{refut.}}$ . Since $\mathcal{D}^{\mathrm{refut.}}$ is the product of $\mathcal{D}_{x}$ and $\mathsf{Bernoulli}(p)$ , regardless of the choice of $\widehat{f}$ , $\Pr_{(x,y)\sim\mathcal{D}^{\mathrm{refut.}}}\left[\widehat{f}(x)\neq y\right]$ is at least $\min\{p,1-p\}$ . Together with the fact that $|p-\widehat{p}|\leq\tau^{\prime}$ , this further implies

\widehat{\mu}\geq\min\{p,1-p\}-4\tau^{\prime}\geq\min\{\widehat{p},1-\widehat{p}\}-5\tau^{\prime}.

Thus, $\mathcal{A}^{\prime}$ would correctly output $\mathsf{noise}$ . $\hfill\blacktriangleleft$

4.4 SQ Refutation Implies SQ Weak Learning

We show that an SQ algorithm for refutation implies an SQ algorithm for weakly learning the same concept class. Later, using the equivalence between SQ dimension and weak learning [2], we can lift SQ-dimension lower bounds to lower bounds against SQ-based refutation. By Proposition 24, this further leads to lower bounds against MQ-SQ algorithms for testable learning with queries.

We first recall the definition of SQ-based weak learning.

Definition 25 (SQ Weak learning).

Let $\mathcal{C}\subseteq\{f:\mathcal{X}\to\{0,1\}\}$ be a concept class over a finite instance space $\mathcal{X}$ . Let $\mathcal{D}$ be a given distribution over $\mathcal{X}$ and $f^{*}\in\mathcal{C}$ be an unknown target function. An SQ oracle for $(f^{*},\mathcal{D})$ with tolerance $\tau\geq 0$ answers queries of form $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)\cdot f^{*}(x)\right]$ up to an additive error of $\tau$ , where $\phi:\mathcal{X}\to[0,1]$ is any given test function. An SQ algorithm $\varepsilon$ -weakly learns $\mathcal{C}$ if it, by making queries to an SQ oracle, with probability $\geq 2/3$ outputs a classifier $\widehat{f}:\mathcal{X}\to\{0,1\}$ that satisfies

\Pr_{x\sim\mathcal{D}}\left[\widehat{f}(x)\neq f^{*}(x)\right]\leq\frac{1}{2}-\varepsilon.

Intuitively, to solve refutation (Definition 9) using an SQ algorithm, in the “structure” case that some $f^{*}\in\mathcal{C}$ has a low error, the algorithm must query the SQ oracle using a test function that has a non-trivial correlation with $f^{*}$ . Such a test function would then allow us to learn the unknown function up to an error that is better than random guessing.

Proposition 26.

Suppose that an SQ algorithm solves biased- $(\alpha,\eta)$ -refutation for concept class $\mathcal{C}$ on distribution $\mathcal{D}$ by making at most $q$ queries with tolerance $\tau\geq 0$ . Then, assuming that $\alpha\leq 1/2-\Omega(\tau)$ , there is an SQ algorithm that $\Omega(\tau)$ -weakly learns $\mathcal{C}$ on $\mathcal{D}$ by making at most $q^{\prime}=O(q+1/\tau)$ queries to an SQ oracle with tolerance $\tau^{\prime}=\Omega(\tau)$ .

To prove the proposition, we will use the following simple fact: If a $[0,1]$ -valued function has a non-trivial correlation with a sufficiently balanced boolean function, it can be rounded into a random binary classifier with a non-trivial accuracy in expectation. (If the boolean function is far from balanced, it can be easily learned by a constant function.)

Lemma 27.

The following holds for every $\delta,\gamma\geq 0$ , distribution $\mathcal{D}$ over $\mathcal{X}$ , and binary function $f^{*}:\mathcal{X}\to\{0,1\}$ with mean $p\coloneqq\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[f^{*}(x)\right]\in[1/2-\gamma,1/2+\gamma]$ : Suppose that function $\phi:\mathcal{X}\to[0,1]$ satisfies $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)f^{*}(x)\right]-p\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)\right]\geq\delta$ . Then, for the random function $\widetilde{\phi}:\mathcal{X}\to\{0,1\}$ obtained from sampling each $\widetilde{\phi}(x)$ from $\mathsf{Bernoulli}(\phi(x))$ independently, we have

\operatorname*{\mathbb{E}}_{\widetilde{\phi}}\left[\Pr_{x\sim\mathcal{D}}\left[\widetilde{\phi}(x)\neq f^{*}(x)\right]\right]\leq\frac{1}{2}-(2\delta-3\gamma).

Similarly, if $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)f^{*}(x)\right]-p\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)\right]\leq-\delta$ , we have

\operatorname*{\mathbb{E}}_{\widetilde{\phi}}\left[\Pr_{x\sim\mathcal{D}}\left[1-\widetilde{\phi}(x)\neq f^{*}(x)\right]\right]\leq\frac{1}{2}-(2\delta-3\gamma).

Proof.

It suffices to prove the first part; the second part follows by symmetry. By the identity $\mathbbm{1}\left[b_{1}\neq b_{2}\right]=b_{1}+b_{2}-2b_{1}b_{2}$ for $b_{1},b_{2}\in\{0,1\}$ , it holds for every possible realization of $\widetilde{\phi}:\mathcal{X}\to\{0,1\}$ that

\Pr_{x\sim\mathcal{D}}\left[\widetilde{\phi}(x)\neq f^{*}(x)\right]=\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\widetilde{\phi}(x)+f^{*}(x)-2\widetilde{\phi}(x)f^{*}(x)\right].

Taking an expectation over the randomness of $\widetilde{\phi}$ shows that

	$\displaystyle\operatorname*{\mathbb{E}}_{\widetilde{\phi}}\left[\Pr_{x\sim\mathcal{D}}\left[\widetilde{\phi}(x)\neq f^{*}(x)\right]\right]$	$\displaystyle=\operatorname*{\mathbb{E}}_{\widetilde{\phi}}\left[\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\widetilde{\phi}(x)+f^{*}(x)-2\widetilde{\phi}(x)f^{*}(x)\right]\right]$
		$\displaystyle=\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)+f^{*}(x)-2\phi(x)f^{*}(x)\right]$
		$\displaystyle=\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)\right]+p-2\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)f^{*}(x)\right]$
		$\displaystyle\leq p+\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)\right]-2\left(p\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)\right]+\delta\right)$
		$\displaystyle=p+(1-2p)\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)\right]-2\delta,$

where the fourth step applies the assumption that $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)f^{*}(x)\right]-p\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)\right]\geq\delta$ . Since $1-2p\in[-2\gamma,2\gamma]$ and $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x)\right]\in[0,1]$ , the second term above is at most $2\gamma$ . It follows that the expected error of $\widetilde{\phi}$ is at most $p+2\gamma-2\delta\leq\left(\frac{1}{2}+\gamma\right)+2\gamma-2\delta=\frac{1}{2}-(2\delta-3\gamma)$ . $\hfill\blacktriangleleft$
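The bound of Lemma 27 can be checked exactly on a small domain, since the expected error of the rounded classifier equals $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}[\phi(x)+f^{*}(x)-2\phi(x)f^{*}(x)]$ and requires no sampling. The sketch below is illustrative; the function name, the dictionary representation, and the example values are assumptions.

```python
def lemma27_check(d, phi, f_star):
    """Return (expected error of the rounded phi, the bound 1/2 - (2*delta - 3*gamma)).

    d: dict x -> D(x); phi: dict x -> [0, 1]; f_star: dict x -> {0, 1}.
    delta and gamma are taken to be the tightest values for this instance.
    """
    p = sum(d[x] * f_star[x] for x in d)
    mean_phi = sum(d[x] * phi[x] for x in d)
    corr = sum(d[x] * phi[x] * f_star[x] for x in d)
    delta = corr - p * mean_phi  # correlation advantage of phi
    gamma = abs(p - 0.5)         # imbalance of f_star
    err = sum(d[x] * (phi[x] + f_star[x] - 2.0 * phi[x] * f_star[x]) for x in d)
    return err, 0.5 - (2.0 * delta - 3.0 * gamma)


uniform4 = {x: 0.25 for x in range(4)}

# Balanced target: gamma = 0 and the bound holds with equality.
err_bal, bound_bal = lemma27_check(
    uniform4, {0: 0.75, 1: 0.75, 2: 0.25, 3: 0.25}, {0: 1, 1: 1, 2: 0, 3: 0})

# Unbalanced target: gamma = 1/4 and the bound is loose.
err_unb, bound_unb = lemma27_check(
    uniform4, {0: 1.0, 1: 1.0, 2: 1.0, 3: 0.0}, {0: 1, 1: 1, 2: 1, 3: 0})

print(err_bal, bound_bal, err_unb, bound_unb)
```

In the balanced case the expected error and the bound coincide at $1/4$ , showing that the $2\delta$ gain cannot be improved in general when $\gamma=0$ .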

Proof of Proposition 26.

Let $\mathcal{A}$ denote the hypothetical SQ algorithm that solves biased- $(\alpha,\eta)$ -refutation for $\mathcal{C}$ . Let $\varepsilon,\tau^{\prime}=\Theta(\tau)$ be sufficiently small such that: (1) $\alpha\leq 1/2-(\varepsilon+2\tau^{\prime})$ ; (2) $\tau\geq 4\varepsilon+22\tau^{\prime}$ . We construct an SQ algorithm $\mathcal{A}^{\prime}$ that weakly learns $\mathcal{C}$ by simulating $\mathcal{A}$ on the distribution of $(x,f^{*}(x))$ where $x\sim\mathcal{D}$ and $f^{*}\in\mathcal{C}$ is the unknown target function in the weak learning instance:

$\blacksquare$

Step 1: Query the SQ oracle (for the weak learning instance) to estimate the value of $p\coloneqq\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[f^{*}(x)\right]$ using the constant function $\phi(x)\equiv 1$ . Let $\widehat{p}\in[p-\tau^{\prime},p+\tau^{\prime}]$ be the output of the oracle. If $\widehat{p}+\tau^{\prime}\leq 1/2-\varepsilon$ , output the constant function $0$ and terminate. If $\widehat{p}-\tau^{\prime}\geq 1/2+\varepsilon$ , output the constant function $1$ and terminate.
$\blacksquare$

Step 2: Simulate the refutation algorithm $\mathcal{A}$ . Whenever $\mathcal{A}$ tries to query the SQ oracle (for the refutation instance) with test function $\phi:\mathcal{X}\times\{0,1\}\to[0,1]$ , consider the function $\Delta:\mathcal{X}\to[-1,1]$ defined as $\Delta(x)\coloneqq\phi(x,1)-\phi(x,0)$ and $\Delta^{\prime}:\mathcal{X}\to[0,1]$ defined as $\Delta^{\prime}(x)\coloneqq\frac{\Delta(x)+1}{2}$ .
$\blacksquare$

Step 3: Query the SQ oracle (for weak learning) with test function $\Delta^{\prime}$ to estimate $\mu\coloneqq\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta^{\prime}(x)\cdot f^{*}(x)\right]$. Let $\widehat{\mu}\in[\mu-\tau^{\prime},\mu+\tau^{\prime}]$ denote the output of the SQ oracle. Check whether it holds that

$\left|\widehat{\mu}-\widehat{p}\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta^{\prime}(x)\right]\right|\leq\frac{\tau}{2}-2\tau^{\prime}.$

If so, we compute $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x,0)+p\cdot\Delta(x)\right]$ , return the result to the refutation algorithm $\mathcal{A}$ , and continue the simulation by going back to Step 2. Otherwise, go to Step 4.
$\blacksquare$

Step 4: We apply Lemma 27 to $\Delta^{\prime}$ and obtain a randomized boolean function $\widetilde{\phi}$ from either $\Delta^{\prime}$ or $1-\Delta^{\prime}$ . We query the SQ oracle to obtain an estimate $\widehat{\varepsilon}$ of

$\Pr_{x\sim\mathcal{D}}\left[\widetilde{\phi}(x)\neq f^{*}(x)\right]=\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\widetilde{\phi}(x)\right]+\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[f^{*}(x)\right]-2\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\widetilde{\phi}(x)\cdot f^{*}(x)\right].$

If $\widehat{\varepsilon}\leq 1/2-\varepsilon-3\tau^{\prime}$ , we return the function $\widetilde{\phi}$ . Otherwise, repeat this step.
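The query handling in Steps 2 and 3 can be sketched on a finite domain as follows. This is an illustrative sketch only: the function name `steps_2_and_3`, the explicit `noise` argument standing in for the SQ oracle's error, and the two example test functions are assumptions not taken from the text. In particular, the `label_aware` query is built from $f^{*}$ purely to exhibit a high-correlation case; an actual refutation algorithm has no access to $f^{*}$.

```python
from fractions import Fraction as F

def steps_2_and_3(phi, weights, f_star, tau, tau_p, noise=F(0)):
    """One pass of Steps 2-3 on the finite domain {0, ..., len(weights)-1}.

    phi[x] = (phi(x, 0), phi(x, 1)); `noise` (with |noise| <= tau_p) is a
    stand-in for the SQ oracle's error and is added to both estimates."""
    def ex(g):  # exact expectation over x ~ D
        return sum(w * g(x) for x, w in enumerate(weights))

    delta = [F(b) - F(a) for a, b in phi]            # Delta(x) in [-1, 1]
    delta_p = [(d + 1) / 2 for d in delta]           # Delta'(x) in [0, 1]
    p = ex(lambda x: f_star[x])
    p_hat = p + noise                                # Step 1 estimate of p
    mu_hat = ex(lambda x: delta_p[x] * f_star[x]) + noise
    gap = abs(mu_hat - p_hat * ex(lambda x: delta_p[x]))
    if gap <= tau / 2 - 2 * tau_p:
        # Low correlation: answer as if labels were independent p-biased coins.
        return "answer", ex(lambda x: phi[x][0] + p * delta[x])
    return "round", delta_p                          # proceed to Step 4

weights = [F(1, 4)] * 4
f_star = [1, 1, 0, 0]
tau, tau_p = F(1, 5), F(1, 200)
label_blind = [(F(1, 2), F(1, 2))] * 4               # ignores the label
label_aware = [(1 - y, y) for y in f_star]           # phi(x, y) = 1[y = f*(x)]
r1 = steps_2_and_3(label_blind, weights, f_star, tau, tau_p)
r2 = steps_2_and_3(label_aware, weights, f_star, tau, tau_p)
```

The label-blind query has $\Delta\equiv 0$ and is answered directly, while the label-aware query has gap $1/4>\tau/2-2\tau^{\prime}$ and is handed to Step 4.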

If $\mathcal{A}^{\prime}$ outputs a constant classifier in the first step, the output clearly has an error $\leq 1/2-\varepsilon$ . Thus, we may focus on the case that $|\widehat{p}-1/2|\leq\varepsilon+\tau^{\prime}$ . Since $|\widehat{p}-p|\leq\tau^{\prime}$ , we must have $|p-1/2|\leq\gamma\coloneqq\varepsilon+2\tau^{\prime}$ in this case. The rest of the proof proceeds in the following three steps:

$\blacksquare$

If $\Delta^{\prime}$ has a low correlation with $f^{*}$ (i.e., $\left|\widehat{\mu}-\widehat{p}\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta^{\prime}(x)\right]\right|\leq\frac{\tau}{2}-2\tau^{\prime}$ holds in Step 3 of $\mathcal{A}^{\prime}$), the answer $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x,0)+p\cdot\Delta(x)\right]$ that we return to $\mathcal{A}$ is a valid answer for an SQ oracle with tolerance $\tau$.
$\blacksquare$

If $\Delta^{\prime}$ has a high correlation with $f^{*}$ (i.e., $\left|\widehat{\mu}-\widehat{p}\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta^{\prime}(x)\right]\right|>\frac{\tau}{2}-2\tau^{\prime}$ in Step 3), we will find a good $\widetilde{\phi}$ without repeating Step 4 too many times.
$\blacksquare$

If $\Delta^{\prime}$ never has a high correlation with $f^{*}$ , the execution of $\mathcal{A}$ will be indistinguishable from that in the “noise” case of the refutation instance. Therefore, a high-correlation $\Delta^{\prime}$ must be found with a good probability.

Low correlation gives accurate answers.

Suppose that, for some test function $\phi:\mathcal{X}\times\{0,1\}\to[0,1]$ chosen by $\mathcal{A}$ and the corresponding $\Delta^{\prime}$ , it holds in Step 3 that

\left|\widehat{\mu}-\widehat{p}\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta^{\prime}(x)\right]\right|\leq\frac{\tau}{2}-2\tau^{\prime}.

Recall that $\widehat{\mu}$ is within an additive error of $\tau^{\prime}$ to $\mu=\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta^{\prime}(x)\cdot f^{*}(x)\right]$ and $\widehat{p}$ is within error $\tau^{\prime}$ to $p=\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[f^{*}(x)\right]$. We have

\left|\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta^{\prime}(x)\cdot f^{*}(x)\right]-p\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta^{\prime}(x)\right]\right|\leq\left|\widehat{\mu}-\widehat{p}\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta^{\prime}(x)\right]\right|+2\tau^{\prime}\leq\frac{\tau}{2}.

Then, the difference between the correct answer,

\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x,f^{*}(x))\right]=\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x,0)+(\phi(x,1)-\phi(x,0))\cdot f^{*}(x)\right]=\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x,0)+\Delta(x)\cdot f^{*}(x)\right],

and the answer $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x,0)+p\cdot\Delta(x)\right]$ returned by $\mathcal{A}^{\prime}$ is exactly

\left|\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta(x)\cdot f^{*}(x)\right]-p\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta(x)\right]\right|=2\cdot\left|\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta^{\prime}(x)\cdot f^{*}(x)\right]-p\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta^{\prime}(x)\right]\right|\leq\tau.

In other words, $\mathcal{A}^{\prime}$ simulates a valid SQ oracle with tolerance $\tau$ when the correlation is low.
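The identity used in the last display, namely that the gap between the true answer and the returned answer equals twice the correlation of $\Delta^{\prime}$ with $f^{*}$, holds for every test function. A small exact check on a randomly generated instance is sketched below; the domain size, the random seed, and the helper `ex` are arbitrary choices made for illustration.

```python
import random
from fractions import Fraction as F

random.seed(0)
n = 8
weights = [F(1, n)] * n                               # D uniform on n points
f_star = [random.randint(0, 1) for _ in range(n)]
# phi[x] = (phi(x, 0), phi(x, 1)), each value a multiple of 1/10 in [0, 1]
phi = [(F(random.randint(0, 10), 10), F(random.randint(0, 10), 10))
       for _ in range(n)]

def ex(g):
    """Exact expectation over x ~ D."""
    return sum(w * g(x) for x, w in enumerate(weights))

p = ex(lambda x: f_star[x])
delta = [b - a for a, b in phi]                       # Delta = phi(.,1) - phi(.,0)
delta_p = [(d + 1) / 2 for d in delta]                # Delta' = (Delta + 1) / 2
true_answer = ex(lambda x: phi[x][f_star[x]])         # E[phi(x, f*(x))]
returned = ex(lambda x: phi[x][0] + p * delta[x])     # answer given to A
lhs = abs(true_answer - returned)
rhs = 2 * abs(ex(lambda x: delta_p[x] * f_star[x]) - p * ex(lambda x: delta_p[x]))
```

Because all quantities are exact rationals, `lhs == rhs` holds with no floating-point slack.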

High correlation gives good $\widetilde{\phi}$ .

Now, suppose that $\left|\widehat{\mu}-\widehat{p}\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta^{\prime}(x)\right]\right|>\frac{\tau}{2}-2\tau^{\prime}$ holds in Step 3. Again, since $\widehat{\mu}\approx\mu=\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta^{\prime}(x)\cdot f^{*}(x)\right]$ and $\widehat{p}\approx p=\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[f^{*}(x)\right]$ hold up to error $\tau^{\prime}$, we have

\left|\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta^{\prime}(x)\cdot f^{*}(x)\right]-p\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta^{\prime}(x)\right]\right|\geq\left|\widehat{\mu}-\widehat{p}\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta^{\prime}(x)\right]\right|-2\tau^{\prime}>\frac{\tau}{2}-4\tau^{\prime}.

By the assumption that $\tau\geq 4\varepsilon+22\tau^{\prime}$, the above implies $\left|\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta^{\prime}(x)\cdot f^{*}(x)\right]-p\cdot\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\Delta^{\prime}(x)\right]\right|\geq\delta\coloneqq 2\varepsilon+7\tau^{\prime}$. Recall that we assumed $p\in[1/2-\gamma,1/2+\gamma]$ for $\gamma=\varepsilon+2\tau^{\prime}$. Applying Lemma 27 to $\Delta^{\prime}$ shows that either $\Delta^{\prime}$ or $1-\Delta^{\prime}$ (depending on the sign of the correlation) can be rounded to a random function $\widetilde{\phi}:\mathcal{X}\to\{0,1\}$ with an expected error of at most

\frac{1}{2}-(2\delta-3\gamma)=\frac{1}{2}-(\varepsilon+8\tau^{\prime}).

By Markov’s inequality, the probability that $\widetilde{\phi}$ has an error $\leq 1/2-(\varepsilon+6\tau^{\prime})$ is at least

1-\frac{1/2-(\varepsilon+8\tau^{\prime})}{1/2-(\varepsilon+6\tau^{\prime})}=\frac{2\tau^{\prime}}{1/2-(\varepsilon+6\tau^{\prime})}=\Omega(\tau).

Note that in Step 4, $\widehat{\varepsilon}$ is within an additive error of $3\tau^{\prime}$ to the actual error of $\widetilde{\phi}$ . Then, if $\widetilde{\phi}$ has an error $\leq 1/2-(\varepsilon+6\tau^{\prime})$ , we would have

\widehat{\varepsilon}\leq\Pr_{x\sim\mathcal{D}}\left[\widetilde{\phi}(x)\neq f^{*}(x)\right]+3\tau^{\prime}\leq\frac{1}{2}-\varepsilon-3\tau^{\prime},

and algorithm $\mathcal{A}^{\prime}$ would terminate. Therefore, whenever Step 4 is entered, at most $O(1/\tau)$ repetitions are needed in expectation. This shows that $\mathcal{A}^{\prime}$ makes at most $O(q+1/\tau)$ SQs in expectation.
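The parameter arithmetic in this part of the proof can be traced with exact rationals. The sketch below takes $\tau$ at the smallest value permitted by condition (2), which is an illustrative choice, and the concrete values of `eps` and `tau_p` are hypothetical.

```python
from fractions import Fraction as F

def step4_parameters(eps, tau_p):
    """Trace delta, the advantage 2*delta - 3*gamma, and the Markov bound."""
    tau = 4 * eps + 22 * tau_p             # condition (2) with equality
    gamma = eps + 2 * tau_p                # |p - 1/2| <= gamma
    delta = tau / 2 - 4 * tau_p            # equals 2*eps + 7*tau_p
    advantage = 2 * delta - 3 * gamma      # equals eps + 8*tau_p
    target = F(1, 2) - (eps + 6 * tau_p)   # error level accepted in Step 4
    success = 1 - (F(1, 2) - advantage) / target   # Markov's inequality
    return tau, delta, advantage, success, 1 / success

eps, tau_p = F(1, 100), F(1, 1000)
tau, delta, advantage, success, expected_repeats = step4_parameters(eps, tau_p)
```

With these values the per-attempt success probability is bounded below by $2\tau^{\prime}/(1/2-\varepsilon-6\tau^{\prime})=1/242$, so the geometric bound on the expected number of repetitions of Step 4 is $242$, matching the $O(1/\tau)$ claim up to constants.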

The probability of making high-correlation queries.

Now, we argue that the hypothetical refutation algorithm $\mathcal{A}$ must make the aforementioned high-correlation query. To this end, we couple the execution of $\mathcal{A}$ simulated by our weak learner $\mathcal{A}^{\prime}$ (the simulated copy) with a slight variant of it (the imaginary copy): In the imaginary copy, we never check the correlation or go to Step 4; we always return $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[\phi(x,0)+p\cdot\Delta(x)\right]$ for every query $\phi:\mathcal{X}\times\{0,1\}\to[0,1]$ that $\mathcal{A}$ makes.

Note that the imaginary copy of $\mathcal{A}$ runs exactly on a refutation instance in which the distribution is the product distribution of $\mathcal{D}$ and $\mathsf{Bernoulli}(p)$, i.e., the label $y$ is a $p$-biased coin flip regardless of $x$. Since we assumed that $|p-1/2|\leq\gamma=\varepsilon+2\tau^{\prime}\leq 1/2-\alpha$, we have $p\in[\alpha,1-\alpha]$. Then, by the soundness guarantee of $\mathcal{A}$, the imaginary copy of $\mathcal{A}$ must output $\mathsf{noise}$ with probability at least $2/3$.

In contrast, the simulated copy of $\mathcal{A}$ runs on an instance in which the labels are consistent with $f^{*}\in\mathcal{C}$ . Then, the completeness of $\mathcal{A}$ ensures that the simulated copy outputs $\mathsf{structure}$ with probability $\geq 2/3$ . Therefore, in the coupling between the simulated and imaginary copies, they must diverge with probability at least $1/3$ . The only way for the two copies to disagree is that, during the execution of the simulated copy, the algorithm makes an SQ with a high-correlation test function. Therefore, algorithm $\mathcal{A}^{\prime}$ outputs a classifier with error $\leq 1/2-\varepsilon$ with probability at least $1/3$ . Since $\mathcal{A}^{\prime}$ never returns an incorrect answer (i.e., a classifier with error $>1/2-\varepsilon$ ), repeating $\mathcal{A}^{\prime}$ a constant number of times would boost the success probability to $2/3$ , thereby giving an $\varepsilon$ -weak learner for class $\mathcal{C}$ on distribution $\mathcal{D}$ . $\hfill\blacktriangleleft$
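For completeness, the two probability estimates in the preceding paragraph can be written out as follows. The first is a union bound over the two failure events of the coupled copies; the second assumes that the repetitions of $\mathcal{A}^{\prime}$ are independent, an assumption that the text leaves implicit.

```latex
\begin{align*}
\Pr[\text{the two copies diverge}]
  &\ge \Pr\bigl[\text{simulated copy outputs } \mathsf{structure}
        \text{ and imaginary copy outputs } \mathsf{noise}\bigr] \\
  &\ge 1 - \Pr[\text{simulated copy does not output } \mathsf{structure}]
         - \Pr[\text{imaginary copy does not output } \mathsf{noise}] \\
  &\ge 1 - \tfrac{1}{3} - \tfrac{1}{3} = \tfrac{1}{3}, \\[4pt]
\Pr[\text{all of } k \text{ independent runs of } \mathcal{A}^{\prime} \text{ fail}]
  &\le \left(\tfrac{2}{3}\right)^{k},
  \qquad \left(\tfrac{2}{3}\right)^{3} = \tfrac{8}{27} < \tfrac{1}{3}.
\end{align*}
```

In particular, under the independence assumption, $k=3$ repetitions already bring the success probability to at least $2/3$.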

4.5 Put Everything Together

So far, we have established a reduction from SQ weak learning to SQ refutation (Proposition 26), and one from SQ refutation to MQ-SQ testable learning (Proposition 24). We now combine them and prove a lower bound for MQ-SQ testable learning in terms of the statistical query dimension (SQ dimension) of the concept class introduced by Blum, Furst, Jackson, Kearns, Mansour, and Rudich [2].

Definition 28 (SQ dimension).

The SQ dimension of a concept class $\mathcal{C}\subseteq\{f:\mathcal{X}\to\{0,1\}\}$ on a distribution $\mathcal{D}$ over $\mathcal{X}$ is the maximum number $d$ such that there exist $f_{1},f_{2},\ldots,f_{d}\in\mathcal{C}$ that satisfy

\Pr_{x\sim\mathcal{D}}\left[f_{i}(x)\neq f_{j}(x)\right]\in\left[\frac{1-1/d^{3}}{2},\frac{1+1/d^{3}}{2}\right]

for all $i\neq j\in[d]$ .
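As a sanity check of Definition 28 on a tiny case, the sketch below enumerates all parity functions on $\{0,1\}^{4}$ under the uniform distribution and computes their pairwise disagreement probabilities. The bitmask encoding of subsets and the helper name `parity` are illustrative choices, not notation from the text.

```python
def parity(mask, x):
    """chi_S(x): XOR of the bits of x selected by the bitmask `mask` (0 or 1)."""
    return bin(mask & x).count("1") % 2

n = 4
points = range(2 ** n)          # the hypercube {0,1}^n, encoded as integers
masks = range(2 ** n)           # all subsets S, including the empty parity
disagreements = {
    (s, t): sum(parity(s, x) != parity(t, x) for x in points) / 2 ** n
    for s in masks for t in masks if s < t
}
```

Every pair of distinct parities disagrees on exactly half of the cube, so all $2^{n}$ parities satisfy the condition of the definition simultaneously.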

Theorem 29.

The following holds for a sufficiently small constant $\varepsilon_{0}>0$ : Suppose that $\mathcal{C}$ is a concept class with SQ dimension $d$ on distribution $\mathcal{D}$ and $\|\mathcal{D}\|_{2}^{2}\leq O(1/\mathop{\mathrm{poly}}(d))$ . Let $c\geq 1$ and $\varepsilon,\delta\leq\varepsilon_{0}$ . Then, no MQ-SQ algorithm can $(c,\varepsilon,\delta)$ -testably learn $\mathcal{C}$ on distribution $\mathcal{D}$ by making $q\leq o(\mathop{\mathrm{poly}}(d))$ queries to an MQ-SQ oracle with tolerance $\tau\geq\omega(1/\mathop{\mathrm{poly}}(d))$ such that $\|\mathcal{D}^{\star}\|_{2}^{2}\leq O(1/\mathop{\mathrm{poly}}(d))$ holds for all queries of types I and II.

For concreteness, consider the hypercube $\mathcal{X}=\{0,1\}^{n}$ and the uniform distribution over it. It is well known that the family of parity functions has an SQ dimension of $2^{n}$. Furthermore, for every $k\leq n^{1-\Omega(1)}$, both $k$-juntas and depth-$k$ decision trees have SQ dimensions of $n^{\Omega(k)}$, as they both contain all parity functions of $\leq k$ variables. Thus, Theorem 29 gives $2^{\Omega(n)}$ or $n^{\Omega(k)}$ lower bounds against MQ-SQ testable learners for these classes. Regarding the constraint on $\mathcal{D}^{\star}$, if $\mathcal{D}^{\star}$ is the uniform distribution over a $d^{\prime}$-dimensional subcube, we need $2^{-d^{\prime}}=\|\mathcal{D}^{\star}\|_{2}^{2}\leq O(1/\mathop{\mathrm{poly}}(d))$, so ensuring $d^{\prime}=\Omega(n)$ would suffice. (Recall from Section 4.2 that this condition holds when implementing many existing query-based learners as MQ-SQ algorithms.)
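The two quantities in this discussion are easy to compute for concrete parameters. The sketch below counts parities on at most $k$ of $n$ variables, which is the family the text uses to lower-bound the SQ dimension of $k$-juntas and depth-$k$ decision trees, and evaluates $\|\mathcal{D}^{\star}\|_{2}^{2}$ for a uniform subcube distribution. The function names and the values $n=100$, $k=3$, $d^{\prime}=10$ are hypothetical.

```python
from fractions import Fraction as F
from math import comb

def num_parities_up_to(n, k):
    """Number of parity functions on at most k of the n variables."""
    return sum(comb(n, i) for i in range(k + 1))

def subcube_l2_norm_sq(d_prime):
    """Squared l2 norm of the uniform distribution on a d'-dimensional
    subcube: 2^{d'} points, each of probability mass 2^{-d'}."""
    return 2 ** d_prime * F(1, 2 ** d_prime) ** 2

d = num_parities_up_to(100, 3)
norm = subcube_l2_norm_sq(10)
```

Here `d` grows like $n^{k}$ for fixed $k$, while `norm` equals $2^{-d^{\prime}}$, which is the identity used in the paragraph above.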

Proof.

Suppose towards a contradiction that such an MQ-SQ algorithm exists. Ignoring all the other parameters for now, Proposition 24 shows that there is an algorithm that refutes $\mathcal{C}$ by making $q^{\prime}=q+O(1)$ queries to an SQ oracle with tolerance $\tau^{\prime}=\tau/4$. Applying Proposition 26 then gives an algorithm that $\Omega(\tau)$-weakly learns $\mathcal{C}$ using $O(q^{\prime}+1/\tau^{\prime})=O(q+1/\tau)$ queries to an SQ oracle with tolerance $\Theta(\tau)$. By [2, Theorem 12], to $\Omega(1/d^{3})$-weakly learn concept class $\mathcal{C}$ using an SQ oracle with tolerance $\Omega(d^{-1/3})$, at least $\Omega(d^{1/3})$ queries are needed. Since $q,1/\tau=o(\mathop{\mathrm{poly}}(d))$, we obtain a contradiction.

Now, we set the parameters in Propositions 24 and 26 carefully. We may assume that $\tau\leq\varepsilon_{0}/c$ without loss of generality; a smaller tolerance makes the SQ oracle (and thus the lower bound result) stronger. Since $\varepsilon,\delta,c\tau\leq\varepsilon_{0}$ are sufficiently small and $\tau^{\prime}=\tau/4$ , we can choose $\alpha=0.1$ and $\eta=0$ in Proposition 24 such that the first condition $\alpha>c\eta+\varepsilon+(c+4)\tau+6\tau^{\prime}$ is satisfied. Furthermore, the condition that $\alpha\leq 1/2-\Omega(\tau)$ in Proposition 26 would also hold. Recall that the failure probability increases from $\delta\leq\varepsilon_{0}$ to $\delta+O(q)\cdot e^{-\Omega(\tau^{2}B)}$ in Proposition 24. Setting $B=\Theta((\log q)/\tau^{2})=o(\mathop{\mathrm{poly}}(d))$ suffices to control the new failure probability by $2\varepsilon_{0}$ .

It remains to check the second and the third conditions of Proposition 24. For the third, we need the MQ-SQ testable learner to restrict the distribution $\mathcal{D}^{\star}$ in its queries such that $\|\mathcal{D}^{\star}\|_{2}^{2}\leq 1/B$. This is ensured by $\|\mathcal{D}^{\star}\|_{2}^{2}\leq O(1/\mathop{\mathrm{poly}}(d))$ and $B\leq o(\mathop{\mathrm{poly}}(d))$. Finally, to check the second condition that $B\leq p^{2}(1-p)^{2}/\|\mathcal{D}_{x}\|_{2}^{2}$, we note that $\|\mathcal{D}_{x}\|_{2}^{2}=\|\mathcal{D}\|_{2}^{2}\leq O(1/\mathop{\mathrm{poly}}(d))$. Furthermore, by the way in which the reduction works in the proof of Proposition 26, whenever the refutation algorithm is called, the labels are nearly balanced, i.e., $p^{2}(1-p)^{2}=\Omega(1)$. Therefore, the second condition is always satisfied by our choice of $B=o(\mathop{\mathrm{poly}}(d))$. This completes the proof. $\hfill\blacktriangleleft$

References

[1] Guy Blanc, Jane Lange, Mingda Qiao, and Li-Yang Tan. Properly learning decision trees in almost polynomial time. Journal of the ACM, 69(6):1–19, 2022. doi:10.1145/3561047.
[2] Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, and Steven Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, pages 253–262, 1994. doi:10.1145/195058.195147.
[3] Avrim L Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artificial intelligence, 97(1-2):245–271, 1997. doi:10.1016/S0004-3702(97)00063-5.
[4] Nader H Bshouty and Vitaly Feldman. On using extended statistical queries to avoid membership queries. Journal of Machine Learning Research, 2(Feb):359–395, 2002. URL: https://jmlr.org/papers/v2/bshouty02a.html.
[5] Amit Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 105–117, 2016. doi:10.1145/2897518.2897520.
[6] Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. From average case complexity to improper learning complexity. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 441–448, 2014. doi:10.1145/2591796.2591820.
[7] Amit Daniely and Shai Shalev-Shwartz. Complexity theoretic limitations on learning DNF’s. In Conference on Learning Theory, pages 815–830. PMLR, 2016. URL: http://proceedings.mlr.press/v49/daniely16.html.
[8] Ilias Diakonikolas, Daniel Kane, Vasilis Kontonis, Sihan Liu, and Nikos Zarifis. Efficient testable learning of halfspaces with adversarial label noise. Advances in Neural Information Processing Systems, 36:39470–39490, 2023.
[9] Ariel Elbaz, Homin K Lee, Rocco A Servedio, and Andrew Wan. Separating models of learning from correlated and uncorrelated data. The Journal of Machine Learning Research, 8:277–290, 2007. URL: https://jmlr.org/papers/v8/elbaz07a.html.
[10] Vitaly Feldman. On the power of membership queries in agnostic learning. The Journal of Machine Learning Research, 10:163–182, 2009. doi:10.5555/1577069.1577076.
[11] Vitaly Feldman and Shrenik Shah. Separating models of learning with faulty teachers. Theoretical computer science, 410(19):1903–1912, 2009. doi:10.1016/J.TCS.2009.01.017.
[12] Aravind Gollakota, Adam Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Tester-learners for halfspaces: Universal algorithms. Advances in Neural Information Processing Systems, 36:10145–10169, 2023.
[13] Aravind Gollakota, Adam R. Klivans, and Pravesh K. Kothari. A moment-matching approach to testable learning and a new characterization of Rademacher complexity. In Symposium on Theory of Computing (STOC), pages 1657–1670, 2023. doi:10.1145/3564246.3585206.
[14] Parikshit Gopalan, Adam Tauman Kalai, and Adam R Klivans. Agnostically learning decision trees. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 527–536, 2008. doi:10.1145/1374376.1374451.
[15] Adam Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Learning intersections of halfspaces with distribution shift: Improved algorithms and SQ lower bounds. In Shipra Agrawal and Aaron Roth, editors, Proceedings of Thirty Seventh Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pages 2944–2978. PMLR, 30 June–03 July 2024. URL: https://proceedings.mlr.press/v247/klivans24b.html.
[16] Adam Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Testable learning with distribution shift. In The Thirty Seventh Annual Conference on Learning Theory, pages 2887–2943. PMLR, 2024. URL: https://proceedings.mlr.press/v247/klivans24a.html.
[17] Pravesh K. Kothari and Roi Livni. Improper Learning by Refuting. In Innovations in Theoretical Computer Science (ITCS), pages 55:1–55:10, 2018. doi:10.4230/LIPIcs.ITCS.2018.55.
[18] Eyal Kushilevitz and Yishay Mansour. Learning decision trees using the Fourier spectrum. In Symposium on Theory of Computing (STOC), pages 455–464, 1991.
[19] Cassandra Marcussen, Ronitt Rubinfeld, and Madhu Sudan. Quality control in sublinear time: a case study via random graphs. arXiv preprint arXiv:2508.16531, 2025. doi:10.48550/arXiv.2508.16531.
[20] Elchanan Mossel, Ryan O’Donnell, and Rocco A Servedio. Learning functions of k relevant variables. Journal of Computer and System Sciences, 69(3):421–434, 2004. doi:10.1016/J.JCSS.2004.04.002.
[21] Ronitt Rubinfeld and Arsen Vasilyan. Testing distributional assumptions of learning algorithms. In Symposium on Theory of Computing (STOC), pages 1643–1656, 2023. doi:10.1145/3564246.3585117.
[22] Lucas Slot, Stefan Tiegel, and Manuel Wiedmer. Testably learning polynomial threshold functions. Advances in Neural Information Processing Systems, 37:3781–3831, 2024.
[23] Salil Vadhan. On learning vs. refutation. In Conference on Learning Theory (COLT), pages 1835–1848, 2017. URL: http://proceedings.mlr.press/v65/vadhan17a.html.


