
Estimating Euclidean Distance to Linearity

Andrej Bogdanov, University of Ottawa, Canada · Lorenzo Taschin, EPFL, Lausanne, Switzerland
Abstract

Given oracle access to a real-valued function on the n-dimensional Boolean cube, how many queries does it take to estimate the squared Euclidean distance to its closest linear function within ϵ? Our main result is that O(log³(1/ϵ)·1/ϵ²) queries suffice. Not only is the query complexity independent of n but it is optimal up to the polylogarithmic factor.

Our estimator evaluates f on pairs correlated by noise rates chosen to cancel out the low-degree contributions to f while leaving the linear part intact. The query complexity is optimized when the noise rates are multiples of Chebyshev nodes.

In contrast, we show that the dependence on n is unavoidable in two closely related settings. For estimation from random samples, Θ(√n/ϵ + 1/ϵ²) samples are necessary and sufficient. For agnostically learning a linear approximation with ϵ mean-square regret under the uniform distribution, Ω(n/√ϵ) nonadaptively chosen queries are necessary, while O(n/ϵ) random samples are known to be sufficient (Linial, Mansour, and Nisan).

Our upper bounds apply to functions with bounded 4-norm. Our lower bounds apply even to ±1-valued functions.

Keywords and phrases:
sublinear-time algorithms, statistical estimation, analysis of boolean functions, property testing, regression
Copyright and License:
© Andrej Bogdanov and Lorenzo Taschin; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Streaming, sublinear and near linear time algorithms; Theory of computation → Randomness, geometry and discrete structures; Theory of computation → Probabilistic computation
Acknowledgements:
We thank Gautam Prakriya for insightful discussions and the anonymous ITCS reviewers for corrections and helpful comments.
Funding:
This work was supported by an NSERC discovery grant.
Editors:
Raghu Meka

1 Introduction

Finding linear approximations is perhaps the most important problem in statistics and data science. Linear regression [11, Section 9.2] seeks to learn the “best” linear predictor ℓ for f from labeled data (x, f(x)). Owing to its wide applicability and computational advantages, approximation error is often measured as the minimum squared loss 𝐄[(f(x) − ℓ(x))²] under some distribution on examples.

Our main question of interest is whether estimating the minimum squared loss can be performed more efficiently than learning the predictor. A highly efficient estimator can be potentially used to decide whether learning a linear model is sensible at all.

We aim to highlight the conceptual distinctions between estimation and learning. We demonstrate that a complexity gap between the two problems is already exhibited by arguably the simplest distribution on examples: The uniform distribution over the Boolean cube {±1}n.

Under the uniform distribution, the minimum squared loss is the distance between the function of interest, viewed as a vector in 2ⁿ-dimensional Euclidean space, and the n-dimensional subspace spanned by the linear functions. In this setting estimating the squared distance is essentially equivalent to estimating the squared length of the projection.

The power of linear regression in machine learning applications is derived to a great extent from the distribution-free nature of the training algorithms. In contrast our results concern linear regression for a known distribution. We believe that distribution-specific regression is interesting for several reasons.

First, restrictive assumptions on the data distribution may improve performance. For example, distributions in which some features linearly approximate other features can be problematic for popular algorithms like gradient descent. The uniform distribution, as a model of pairwise independence, avoids such obstacles.

Second, distribution-specific algorithms sometimes serve as a stepping stone to distribution-free ones. In the context of property testing this strategy was successful in obtaining distribution-free algorithms for combinatorial properties like monotonicity [6] as well as algebraic ones like linearity and low-degree representability [6, 5].

Third, negative results about learning and estimation are stronger in the distribution-specific model. Distribution-free learning must succeed for all concepts in the given class and all distributions on examples. To understand the limitations it is natural to decouple the complexities of the concept class and of the distribution. Do learning and estimation remain hard even when the distribution is simple?

In this work we focus on information-theoretic measures of complexity. Our principal measure of performance is the query complexity m. The time complexity of all our algorithms is linear in mn, namely in the size of the labeled dataset. We study both algorithms that choose their queries and ones that are provided with uniformly random examples.

Our results

Our results are summarized in Table 1. Our main contribution is Theorem 1: The squared linear projection of f

𝐖1[f] = 𝐄[f²] − min_{ℓ linear} 𝐄[(f(x) − ℓ(x))²] (1)

can be estimated within ϵ with Õ(1/ϵ²) chosen queries. (The query complexities of the squared projection and the squared distance to linearity are the same up to a constant factor, since 𝐄[f²] can be estimated from O(1/ϵ²) samples assuming ‖f‖₄ is bounded. For similar reasons, so are the squared projection and the squared distance to the space of affine functions.) In contrast, learning the requisite approximation and obtaining the same estimate from random samples both entail query complexities that grow at least as fast as some root of n.

Table 1: Query complexity of ϵ-estimating 𝐖1[f] and ϵ-learning the linear part of f. All positive results assume ‖f‖₄ ≤ 1. All negative results hold even for ±1-valued f.

              chosen queries                          random samples
estimation    O(log³(1/ϵ)·1/ϵ²) (Theorem 1)           O(√n/ϵ + 1/ϵ²) (Theorem 18)
              Ω(1/ϵ²) (Theorem 8)                     Ω(√n/ϵ + 1/ϵ²) (Theorems 19 and 8)
learning      O(n/ϵ) [8]                              O(n/ϵ) [8]
              Ω(n/√ϵ) non-adaptive (Theorem 11)       Ω(n/√ϵ) (Theorem 11)

We found this result surprising from a Fourier-analytic perspective. Our quantity of interest is the sum of squares of all n first-level Fourier coefficients of f. Up to the polylogarithmic factor, estimating the sum has the same query complexity as estimating any of its individual terms, or estimating the squared mean of f.

In the decision version of estimation, also known as tolerant testing [10], the objective is to distinguish functions that are θ-close to linear from those that are (θ+ϵ)-far. In particular, the algorithm of Theorem 1 is a tolerant tester for any value of θ. In the intolerant limit θ = 0, algorithms with O(1/ϵ) query complexity are known [1, 3]. For functions over ℝⁿ the query complexity is also O(1/ϵ) under Gaussian measure [7], and O(log(1/ϵ)·1/ϵ) under arbitrary measure [5]. With the exception of [7], these testers appear not to be robust against small perturbations as their soundness is analyzed with respect to the Hamming distance. The tester of Khot and Moshkovitz [7] can be viewed as an estimator, albeit one with large constant bias.

Our positive results apply to functions with bounded 4-norm. Boundedness of the 2-norm is clearly insufficient: The point function f(x) = 2^{n/2}·𝟙(x = a) for a random point a has unit 2-norm and unit squared distance to linearity, yet is indistinguishable from the all-zero function with o(2ⁿ) queries. We did not investigate whether p-norm boundedness for 2 < p < 4 is sufficient.

In contrast, in Theorem 11 we show that proper learning of a linear hypothesis for f that is within ϵ squared distance of optimal has non-adaptive query complexity linear in n, specifically Ω(n/√ϵ). Our technique also yields an Ω(log(n)/√ϵ) general (adaptive) lower bound. On the positive side, it is known that O(n/ϵ) random samples are sufficient for learning via empirical risk minimization [8]. The √ϵ gap between the lower and upper bounds is reminiscent of analogous gaps in distribution-free agnostic learning of deterministic concept classes [2].

In Theorems 18 and 19 we show that the sample complexity of estimating 𝐖1[f] is √n/ϵ + 1/ϵ² up to constant factor. As a consequence of Theorems 18 and 11, estimation requires fewer random samples than learning in the regime ϵ = ω(n^{−2/3}).

Techniques

Estimation from chosen queries

Algorithm 1 estimates the linear projection of f by a suitably scaled product f(x)f(y) evaluated on a pair (x,y) of correlated inputs. The query complexity can be decoupled from the input length n already when the pairs (xi,yi) are independent, marginally uniform, and of sufficiently low correlation ρ=𝐄xiyi. For simplicity assume f is unbiased and ±1-valued. By orthogonal decomposition (see (2)),

𝐄[f(x)f(y)] = ρ‖f^{=1}‖² + ρ²‖f^{=2}‖² + ⋯ + ρⁿ‖f^{=n}‖²,

where f^{=d} is the degree-d part of f (see the precise definition in Section 2). By Parseval's identity, ρ^{−1}f(x)f(y) estimates 𝐖1[f] = ‖f^{=1}‖² with bias at most Σ_{k>1} |ρ|^{k−1}‖f^{=k}‖² ≤ |ρ|. By choosing the bias-variance tradeoff ρ ≈ ϵ/2, empirically averaging this estimator yields an ϵ-approximation of 𝐖1[f] with O(1/ϵ⁴) queries.
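To make this baseline concrete, here is a small Monte Carlo sketch of our own (the test function, n, ρ, and m are arbitrary illustrative choices, not taken from the paper): it averages ρ^{−1}f(x)f(y) over ρ-correlated pairs and compares the result with 𝐖1[f] computed by brute force.

import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n, rho, m = 10, 0.1, 500_000            # bias of the estimator is at most rho

def f(x):                               # an arbitrary +/-1-valued test function
    return np.sign(x[:, 0] + 0.6 * x[:, 1] * x[:, 2] + 0.3 * np.prod(x[:, :5], axis=1) + 0.05)

cube = np.array(list(product([1, -1], repeat=n)))
W1 = sum(np.mean(f(cube) * cube[:, i]) ** 2 for i in range(n))   # ground truth by brute force

x = np.where(rng.random((m, n)) < 0.5, 1, -1)                    # uniform samples
e = np.where(rng.random((m, n)) < (1 + rho) / 2, 1, -1)          # rho-biased noise, so y = x*e is rho-correlated with x
estimate = np.mean(f(x) * f(x * e)) / rho

print(W1, estimate)   # they agree up to an O(rho) bias plus O(1/(rho*sqrt(m))) sampling noise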

To improve the query complexity we work with a more general class of distributions on pairs (x,y). Barring additional information on f, it is sensible that the distribution should place equal probability on all pairs (x,y) whose difference has a fixed Hamming weight. A special class of such distributions is mixtures of ρ-correlated pairs over some distribution d(ρ) on noise rates.

The uniform mixture of ρ-correlated and (−ρ)-correlated pairs already improves the query complexity to O(1/ϵ³). The reason is that the related Fourier decomposition

𝐄[(sign of correlation)·f(x)f(y)] = ρ‖f^{=1}‖² + ρ³‖f^{=3}‖² + ρ⁵‖f^{=5}‖² + ⋯

cancels out all even levels, thus reducing the bias of the estimator from ρ to ρ².

To achieve our query complexity of O(log³(1/ϵ)·1/ϵ²) we construct a mixture of noise rates α₁ρ, …, α_ℓρ that annihilates levels 2 up to ℓ−1 while minimizing the contribution of levels ℓ and higher. A close to optimal error is achieved using Chebyshev interpolation. We give evidence that an Ω(log²(1/ϵ)·1/ϵ²) lower bound is inherent in this approach (Claim 6), but we do not know if it can be bypassed in general.

In Remark 7 we explain why the low complexity of Algorithm 1 owes not only to the choice of interpolation scheme, but to some serendipity in the constraints (4) on the noise rates.

The proof of Theorem 8 is based on a reduction from the problem of estimating the bias of a coin.

Learning lower bound

Theorem 11 is proved in two steps. Proposition 12 addresses the regime of constant error ϵ. It is proved by exhibiting a family of 2^{Ω(n)} functions whose linear projections are Ω(1)-far apart. The existence of this family is established using the probabilistic method in Lemma 13. An algorithm of sublinear query complexity is unlikely to disambiguate among members of the family and therefore cannot be accurate.

In Proposition 15 we give a self-reduction that converts a learner of linear approximations with error ϵ and query complexity q into one with constant error and query complexity O(√ϵ·q); a query lower bound for constant error therefore translates into an Ω(1/√ϵ) times larger lower bound for error ϵ. The reduction applies to any query model.

Estimation from samples

The random variable A(x,y) = f(x)f(y)⟨x,y⟩ with random x, y is an unbiased estimator of 𝐖1[f]. To improve its variance at optimal query cost we average it over all m(m−1)/2 pairs of examples (Algorithm 2 and Theorem 18). For constant ϵ, the lower bound in Theorem 19 is exhibited by a sign-rounded linear function with independent Gaussian coefficients. Such a function is weakly pseudorandom against o(√n) examples yet has constant correlation with its linear part ℓ. We achieve the optimal dependence on ϵ by adding Gaussian noise to ℓ.

Extensions and future work

We believe that a result analogous to Theorem 1 can be obtained for functions over ℝⁿ under standard Gaussian measure using related techniques. It would be interesting to investigate the general setting of product distributions under the Efron–Stein decomposition. The biased Boolean cube could be a good test case.

Theorem 1 can be extended to approximate the degree-d part of f with log^{O(d)}(1/ϵ)·1/ϵ² queries. Both extensions might be sensible in the context of regression, where a mix of categorical (Boolean) and numerical (Gaussian) data and higher-degree models (capturing e.g. decision trees) are more realistic.

An interesting open question is whether the polylogarithmic factor can be eliminated. We speculate that this may be possible under the stronger assumption that f is bounded in infinity-norm.

Organization

Section 2 has a brief background on Fourier analysis of Boolean functions. In Sections 3, 4, and 5 we present our results on estimation from chosen queries, learning, and estimation from samples, respectively.

2 Fourier analysis over the Boolean cube

We use standard notation from Boolean function analysis [9]. All functions are real-valued over the Boolean cube {±1}ⁿ under uniform measure. Such functions live in a Hilbert space under the inner product 𝐄[fg]. The p-norm of a function is ‖f‖_p = 𝐄[|f|^p]^{1/p}, with p = 2 usually omitted. The monomials {∏_{i∈S} xᵢ : S ⊆ [n]} form an orthonormal basis. f admits the orthogonal decompositions

f = f^{=0} + f^{=1} + ⋯ + f^{=n} = 𝐄f + f^{=1} + f^{≥2}

with f^{=j} (resp., f^{≥j}) being the projection onto the subspace spanned by monomials of degree exactly (resp., at least) j. Our main object of interest is the linear part f^{=1}, especially its weight

𝐖1[f] = ‖f^{=1}‖² = 𝐄[f·f^{=1}].

The orthogonal decomposition can be further refined into the complete Fourier decomposition

f(x) = Σ_{S⊆[n]} f̂(S)·∏_{i∈S} xᵢ,  where  f̂(S) = 𝐄[f(x)·∏_{i∈S} xᵢ].

By orthogonality, 𝐄[fg] = Σᵢ 𝐄[f^{=i}g^{=i}] = Σ_S f̂(S)ĝ(S) (Plancherel's formula). In particular, when f = g,

‖f‖² = Σⱼ ‖f^{=j}‖² = Σ_S f̂(S)².  (Parseval's identity)

By orthogonality, the distance to linearity 𝐄[(f(x) − ℓ(x))²] is minimized when ℓ = f^{=1}. The minimum equals

min_{ℓ linear} 𝐄[(f(x) − ℓ(x))²] = ‖f‖² − ‖f^{=1}‖² = 𝐄[f²] − 𝐖1[f]

as in (1).

A ρ-biased string in {±1}ⁿ is one in which the coordinates are independent and ρ-biased. We use xy for the pointwise product of strings x and y in {±1}ⁿ. Multiplying by a ρ-biased string and averaging the outcome has the effect of dampening the higher-degree terms:

𝐄_{ρ-biased e}[f(xe)] = f^{=0}(x) + ρ·f^{=1}(x) + ρ²·f^{=2}(x) + ⋯ (2)
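As a concrete illustration of this background (our own sketch; the toy function and parameters are arbitrary), the following brute-force code computes Fourier coefficients and level weights over the cube and checks Parseval's identity and the damping identity (2) numerically for small n.

import numpy as np
from itertools import combinations, product

n = 4
cube = np.array(list(product([1, -1], repeat=n)))          # all 2^n points of the cube

def fhat(f, S):
    """Fourier coefficient f^(S) = E[f(x) * prod_{i in S} x_i]."""
    chi = np.prod(cube[:, list(S)], axis=1)
    return np.mean(f(cube) * chi)

def level(f, x, k):
    """The degree-k part f^{=k}(x) at a single point x."""
    return sum(fhat(f, S) * np.prod(x[list(S)]) for S in combinations(range(n), k))

f = lambda x: np.sign(x[..., 0] + x[..., 1] + x[..., 2] * x[..., 3] + 0.5)

# Parseval: E[f^2] equals the sum of the level weights ||f^{=k}||^2
weights = [sum(fhat(f, S) ** 2 for S in combinations(range(n), k)) for k in range(n + 1)]
assert np.isclose(np.mean(f(cube) ** 2), sum(weights))

# Identity (2): E_{rho-biased e}[f(xe)] = sum_k rho^k f^{=k}(x)
rho, x, rng = 0.3, cube[0], np.random.default_rng(0)
e = np.where(rng.random((200000, n)) < (1 + rho) / 2, 1, -1)
print(np.mean(f(x * e)), sum(rho ** k * level(f, x, k) for k in range(n + 1)))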

3 Estimation from chosen queries

Theorem 1.

There is an algorithm that makes O(log³(‖f‖₄²/ϵ)·‖f‖₄⁴/ϵ²) queries to f and outputs a value within ϵ of 𝐖1[f] with probability at least 2/3.

The algorithm takes the empirical average of m instantiations of Algorithm 1 with parameters m = 96ℓ³/ϵ′², ℓ = log(1/ϵ′) + 1, ρ = ϵ′^{1/(ℓ−1)}/6, and ϵ′ = ϵ/‖f‖₄². (We may assume ϵ′ is an inverse power of two as this doesn't change the asymptotics.)

Algorithm 1 Estimator W̃1[f] from chosen queries.
Parameters: 0 < ρ < 1 and ℓ
Setup: Set α₁, …, α_ℓ ∈ [−1,1] as in (5). Calculate d₁, …, d_ℓ as in (6).
Sample i from {1, …, ℓ} with probability |dᵢ|/‖d‖₁.
Sample a uniformly random x and an (αᵢρ)-biased eᵢ.
Output W̃1[f] = ρ^{−1}·‖d‖₁·sign(dᵢ)·f(x)·f(x·eᵢ)
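The following Python sketch (ours, not from the paper) implements a batch of m runs of Algorithm 1 with the Chebyshev nodes (5) and weights (6); f is assumed to accept an array of ±1 rows, and eps stands for ϵ′ = ϵ/‖f‖₄², mirroring the parameter choices quoted above.

import numpy as np
from numpy.polynomial import chebyshev as C

def estimate_W1(f, n, eps, m, rng):
    """Empirical average of m runs of Algorithm 1 (sketch of the Theorem 1 setup)."""
    ell = max(2, int(np.log2(1 / eps)) + 1)
    alpha = np.cos(np.pi * (2 * np.arange(1, ell + 1) - 1) / (2 * ell))       # Chebyshev nodes, eq. (5)
    c = np.zeros(ell)
    for j in range(1, ell, 2):                                                # Chebyshev expansion of d, eq. (6)
        c[j] = (2 / ell) * (-1) ** ((j - 1) // 2) * j
    d = C.chebval(alpha, c)                                                   # d_i = d(alpha_i)
    d1 = np.abs(d).sum()
    rho = eps ** (1 / (ell - 1)) / 6
    i = rng.choice(ell, size=m, p=np.abs(d) / d1)                             # pick a noise rate per run
    x = rng.choice([1, -1], size=(m, n))
    e = np.where(rng.random((m, n)) < (1 + alpha[i, None] * rho) / 2, 1, -1)  # (alpha_i * rho)-biased bits
    return np.mean((d1 / rho) * np.sign(d[i]) * f(x) * f(x * e))

For ±1-valued f the averaged output tracks 𝐖1[f], with the bias controlled by Claim 2 and the fluctuation controlled by Claim 3 below.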

The conditional expectation of this estimator given x is f(x)g(x), where

g = ρ^{−1} Σᵢ dᵢ N_{αᵢρ}f.

Here N_ρ is the noise operator N_ρf(x) = 𝐄[f(xe)] for a ρ-biased e. The constants dᵢ, αᵢ are chosen so that the first Fourier level of f survives but all other levels among the first ℓ−1 are annihilated:

g^{=j} = {0, if j = 0 or 2 ≤ j < ℓ;  f^{=j}, if j = 1} (3)

The theorem is proved by balancing the bias and the variance of the estimator W~1. The next two claims bound the bias and the variance respectively. Their proofs are in Section 3.1.

Claim 2.

|𝐄W̃1[f] − 𝐖1[f]| ≤ ‖d‖₁·ρ^{ℓ−1}·‖f‖².

Claim 3.

W̃1[f] has variance at most ρ^{−2}‖d‖₁²‖f‖₄⁴.

The quantity ‖d‖₁ governs both. We analyze it next. Expanding g in the Fourier basis and applying (2), the constraints (3) translate to

ρ^{−1} Σᵢ₌₁^ℓ dᵢ(αᵢρ)^j = {0, if j = 0 or 2 ≤ j < ℓ;  1, if j = 1}.

We seek a short solution to the linear system

Σᵢ dᵢαᵢ^j = {0, if j = 0 or 2 ≤ j < ℓ;  1, if j = 1} (4)

in the unknowns d₁, …, d_ℓ. The existence of a short solution strongly depends on the choice of evaluation points α₁, …, α_ℓ. The Chebyshev nodes

αᵢ = cos(π(2i−1)/(2ℓ)) (5)

are close to best possible for this purpose, as will be argued in Corollary 5 and Claim 6.

Claim 4.

dᵢ = d(αᵢ), where

d(t) = (2/ℓ) Σ_{odd j=1}^{ℓ−1} (−1)^{(j−1)/2}·j·T_j(t), (6)

and T_j is the j-th Chebyshev polynomial of the first kind.

Proof.

The first ℓ Chebyshev polynomials T₀, …, T_{ℓ−1} are orthogonal with respect to the inner product over the Chebyshev nodes:

Σᵢ T_j(αᵢ)T_{j′}(αᵢ) = {0, if j ≠ j′;  ℓ/2, if j = j′ ≠ 0;  ℓ, if j = j′ = 0}.

Using the facts T₀(t) = 1, T₁(t) = t, and that the derivative T_j′(0) is zero for even j and equals (−1)^{(j−1)/2}·j for odd j, we derive the unique Chebyshev expansion

d(t) = Σ_{j=0}^{ℓ−1} d̂(j)·T_j(t). (7)

By orthogonality, d̂(j) equals (2/ℓ)Σᵢ d(αᵢ)T_j(αᵢ) when j ≠ 0, and half that when j = 0. In the latter case, the first equation in (4) says that

d̂(0) = (1/ℓ)Σᵢ d(αᵢ)T₀(αᵢ) = (1/ℓ)Σᵢ dᵢ = 0.

Otherwise, the polynomial T_j(t) − T_j′(0)·t has no linear term, so

Σᵢ d(αᵢ)·(T_j(αᵢ) − T_j′(0)·αᵢ)

expands as a linear combination of the sums Σᵢ dᵢαᵢ^k for k ≠ 1, k ≤ ℓ−1. By (4) such linear combinations vanish. Therefore if j ≥ 1,

d̂(j) = (2/ℓ)Σᵢ dᵢ·T_j′(0)·αᵢ = (2/ℓ)·T_j′(0) = {(2/ℓ)·(−1)^{(j−1)/2}·j, if j is odd;  0, if j is even}.

Plugging into (7) produces the desired formula (6).
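As a sanity check of Claim 4 (our own, using numpy's Chebyshev utilities), the weights d(αᵢ) defined by (6) can be verified numerically to solve the linear system (4), and their 1-norm can be compared against the ℓ^{3/2} bound derived next.

import numpy as np
from numpy.polynomial import chebyshev as C

for ell in range(2, 12):
    alpha = np.cos(np.pi * (2 * np.arange(1, ell + 1) - 1) / (2 * ell))    # nodes (5)
    c = np.zeros(ell)
    for j in range(1, ell, 2):
        c[j] = (2 / ell) * (-1) ** ((j - 1) // 2) * j                      # expansion (6)
    d = C.chebval(alpha, c)
    V = np.vander(alpha, ell, increasing=True).T                           # V[j, i] = alpha_i^j
    rhs = np.zeros(ell); rhs[1] = 1                                        # right-hand side of (4)
    assert np.allclose(V @ d, rhs)                                         # levels 0 and 2..ell-1 annihilated, level 1 preserved
    print(ell, np.abs(d).sum(), ell ** 1.5)                                # ||d||_1 stays below ell^{3/2}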

Corollary 5.

‖d‖ ≤ ℓ.

Proof.

By orthogonality we have the explicit formula

‖d‖² = ℓ·d̂(0)² + (ℓ/2)·d̂(1)² + ⋯ + (ℓ/2)·d̂(ℓ−1)² = (2/ℓ) Σ_{odd j=1}^{ℓ−1} j²,

which is at most twice the sum of the odd integers between 1 and ℓ. Those can be matched into ℓ/2 pairs that add up to at most ℓ, giving the ℓ² upper bound.

Therefore ‖d‖₁ ≤ ℓ^{3/2}. This bound is within a √ℓ factor of optimal for any choice of roots:

Claim 6.

For any choice of −1 ≤ α₁, …, α_ℓ ≤ 1, any d that solves (4) has 1-norm ‖d‖₁ ≥ ℓ − 2.

Matching this bound (up to constant factor) would result in a log(‖f‖₄²/ϵ) factor improvement in query complexity. We do not know if d as in (6) achieves it already.

Proof.

Set j equal to ℓ−1 or ℓ−2, whichever is odd. Take the linear combination of the equations (4) weighted by the coefficients t₀, …, t_j of T_j. As T_j is bounded by 1 on [−1,1],

‖d‖₁ ≥ |Σᵢ dᵢT_j(αᵢ)| = |Σ_{i,k} dᵢαᵢ^k t_k| = |t₁| = j ≥ ℓ − 2.

 Remark 7.

By the same argument, had the right-hand side of (4) been replaced by 𝟙(j = ℓ−1), there would have been no solution d of 1-norm less than 2^{ℓ−2}. Thus the low complexity of Algorithm 1 owes not only to the choice of noise parameters αᵢ but to the fortunate form of the right-hand side in the linear system (4).

3.1 Proof of Theorem 1

Proof of Theorem 1.

By Chebyshev's inequality, the empirical average of m samples of an estimator with variance v is within 2√(v/m) of its mean except with probability 1/4. By Claims 2 and 3, with probability at least 3/4,

|W̃1[f] − 𝐖1[f]| ≤ ‖d‖₁ρ^{ℓ−1}‖f‖² + 2√(ρ^{−2}‖d‖₁²/m)·‖f‖₄².

As ‖f‖ ≤ ‖f‖₄, it is sufficient that the choices of ρ, ℓ, m satisfy

‖d‖₁ρ^{ℓ−1} ≤ ϵ′/2  and  2√(ρ^{−2}‖d‖₁²/m) ≤ ϵ′/2,

for ϵ′ = ϵ/‖f‖₄². By Corollary 5, ‖d‖₁ ≤ √ℓ·‖d‖ ≤ ℓ^{3/2}. The choice ρ = ϵ′^{1/(ℓ−1)}/6 ensures that for ℓ ≥ 2,

‖d‖₁ρ^{ℓ−1} ≤ ℓ^{3/2}·6^{−(ℓ−1)}·ϵ′ ≤ ϵ′/2.

As ℓ = log(1/ϵ′) + 1,

2√(ρ^{−2}‖d‖₁²/m) ≤ 24√(ℓ³/m),

which is at most ϵ′/2 by our choice of m = 96ℓ³/ϵ′².

Proof of Claim 2.

The bias of the estimator is 𝐄[fg] − 𝐖1[f] = 𝐄[fg] − ‖f^{=1}‖². The first ℓ levels of g are given by (3), so 𝐄[f^{<ℓ}g^{<ℓ}] = ‖f^{=1}‖². As for the levels j ≥ ℓ, by (2),

g^{=j} = ρ^{−1} Σᵢ dᵢ(N_{αᵢρ}f)^{=j} = ρ^{−1} Σᵢ dᵢ(αᵢρ)^j·f^{=j},

from where

|𝐄W̃1[f] − 𝐖1[f]| = |𝐄[f^{≥ℓ}g^{≥ℓ}]|
 ≤ Σ_{j≥ℓ} ρ^{j−1} Σᵢ |dᵢαᵢ^j|·‖f^{=j}‖² (by the triangle inequality)
 ≤ ρ^{ℓ−1}‖d‖₁ Σ_{j≥ℓ} ‖f^{=j}‖² (|αᵢ| ≤ 1 and ρ < 1)
 ≤ ρ^{ℓ−1}‖d‖₁‖f‖².

Proof of Claim 3.

The operator N_ρ is contractive: ‖N_ρ g‖ ≤ ‖g‖ for all ρ and g. Therefore

𝐕𝐚𝐫 W̃1[f] ≤ 𝐄[W̃1[f]²]
 = ρ^{−2}‖d‖₁²·𝐄ᵢ 𝐄[f²·N_{αᵢρ}(f²) | i]
 ≤ ρ^{−2}‖d‖₁²·𝐄ᵢ √(𝐄[f⁴ | i])·√(𝐄[N_{αᵢρ}(f²)² | i]) (by Cauchy–Schwarz)
 ≤ ρ^{−2}‖d‖₁²·‖f²‖·‖f²‖ (by contractivity)
 = ρ^{−2}‖d‖₁²‖f‖₄⁴.

3.2 Lower bound

Theorem 8.

For every o(1/ϵ²)-query algorithm A there exists a Boolean-valued f such that |A(f) − 𝐖1[f]| > ϵ with probability at least 1/4, as long as ϵ = Ω(n^{3/2}·2^{−n/2}).

The advantage of a distinguisher D with respect to random variables f and g is the absolute difference between the probabilities that D accepts f and that it accepts g. The maximum possible distinguishing advantage is known to equal the minimum possible probability of the event f ≠ g under all couplings of f and g, i.e., joint distributions of (f,g) that are consistent with their marginals.

Fact 9.

Assuming 0 ≤ β < α < 0.9, distinguishing between independent α-biased and β-biased coin flips with advantage at least 1/4 requires Ω(1/(α−β)²) samples.

Proof.

There are several proofs for the case β = 0, see for example [13]. The general case reduces to this special case. Consider the reduction that takes the outcome of a coin flip, outputs it with probability 1−β, and outputs 1 otherwise. This reduction maps unbiased coins into β-biased ones and (α−β)/(1−β)-biased coins into α-biased ones. Distinguishing the latter with advantage 1/4 therefore requires Ω((1−β)²/(α−β)²) = Ω(1/(α−β)²) samples.

Proof of Theorem 8.

Choose any 0.5 < β < α < 0.9 with α − β = 5ϵ. Let f (resp., g) be a probabilistic Boolean function in which the values f(x) are independent and αx₁-biased (resp., βx₁-biased). By Claim 10, 𝐖1[f] is at least α² − ϵ and 𝐖1[g] is at most β² + ϵ, except with probability at most 1/4. Therefore

𝐖1[f] − 𝐖1[g] ≥ α² − β² − 2ϵ = (α−β)(α+β) − 2ϵ ≥ 3ϵ

by our choice of α and β. Had A been ϵ-accurate with probability 3/4, by a union bound,

Pr[|A(f) − A(g)| ≤ ϵ] ≤ Pr[|A(f) − 𝐖1[f]| ≥ ϵ] + Pr[|A(g) − 𝐖1[g]| ≥ ϵ]
 + Pr[|𝐖1[f] − 𝐖1[g]| < 3ϵ]
 ≤ 3/4,

so Pr[f ≠ g] ≥ Pr[|A(f) − A(g)| > ϵ] > 1/4. As this holds for an arbitrary coupling of f and g, the two can be 1/4-distinguished.

We now show this is impossible, namely that f and g cannot be 1/4-distinguished with q = o(1/ϵ²) queries. For if they could be by some algorithm A, then the following algorithm could distinguish α-biased from β-biased coin flips with q flips, contradicting Fact 9: Whenever A makes a new query x, flip a coin, multiply the outcome by x₁, and provide this answer. The view of A when the reduction is run with α-biased and β-biased coins is identical to its interactions with f and g, respectively.

Claim 10.

Let h be a random Boolean function whose values are independent ±1 bits. Then |𝐖1[h] − 𝐖1[𝐄h]| ≤ ϵ except with probability at most O(n³·2^{−n}/ϵ²).

Proof.

By independence, 𝐕𝐚𝐫[ĥ(i)] ≤ 2^{−n} for every i, so by Chebyshev's inequality |ĥ(i) − 𝐄ĥ(i)| ≤ ϵ/(2n) except with probability O(n²·2^{−n}/ϵ²). Therefore |ĥ(i)² − (𝐄ĥ(i))²| = |ĥ(i) + 𝐄ĥ(i)|·|ĥ(i) − 𝐄ĥ(i)| ≤ ϵ/n. The claim follows from a union bound over the n coordinates and the triangle inequality.

For this argument to result in an Ω(1/ϵ²) lower bound it is essential that α and β be bounded away both from 0 and from 1: Had they been too close to 1, the coins would have been distinguishable from O(1/ϵ) samples. Had they been too close to 0, the typical distance between 𝐖1[f] and 𝐖1[g] would have been O(ϵ²).

4 Learning

In the setting of distribution-free (agnostic) learning of a linear approximation from independent random samples, Ω(n/ϵ) samples are necessary for properly learning a bounded n-dimensional function with respect to expected square loss error [12]. This bound can be matched by an improper learning algorithm but not by empirical risk minimization [14].

Shamir [12] points out that in general, loss minimization (output a linear hypothesis ℓ′ such that 𝐄[(f(x) − ℓ′(x))²] is ϵ-close to best possible) and model approximation (find ℓ′ such that 𝐄[(ℓ′(x) − ℓ*(x))²] ≤ ϵ for the ℓ* that minimizes 𝐄[(f(x) − ℓ*(x))²]) are not equivalent problems.

In the context of learning under the uniform distribution, however, the two problems are equivalent to approximating the projection f^{≤1} onto the space of affine functions. Linial, Mansour, and Nisan [8] showed that this approximation can be calculated from O(n/ϵ) samples via empirical risk minimization.

We show a query complexity lower bound of Ω(n/√ϵ) for proper agnostic learning, even when the learner is given non-adaptive query access to f.

Theorem 11.

For every non-adaptive algorithm A that makes o(n/√ϵ) queries and outputs an affine hypothesis, there is an f : {±1}ⁿ → {±1} such that ‖f − A(f)‖² ≤ ‖f − f^{≤1}‖² + ϵ only with probability 2^{−Ω(n)}, assuming ϵ ≥ 2^{−(1−o(1))n/2}.

Our proof technique also gives an Ω(log(n)/√ϵ) query complexity lower bound for adaptive A.

Theorem 11 is proved in two steps. In Proposition 12 we establish it for a sufficiently small but constant ϵ. In Proposition 15 we give a reduction that turns an agnostic learner of linear approximations with mean-square error ϵ and q queries into one with constant mean-square error and O(√ϵ·q) queries. The reduction is flexible with respect to the learning model, and even to the task. It can also be applied to estimation, but gives a worse dependence on ϵ than what we have in Theorems 8 and 19.

4.1 Lower bound for constant error

Proposition 12.

For every κ < 1 there exist ϵ, η > 0 such that for n sufficiently large and for every non-adaptive algorithm A that makes at most κn queries and outputs an affine hypothesis, there exists f : {±1}ⁿ → {±1} such that ‖f − A(f)‖² ≤ ‖f − f^{≤1}‖² + ϵ with probability at most 2^{−ηn}.

A (log n − O(1)) adaptive query complexity lower bound follows from the same assumptions, as adaptive algorithms that make q Boolean-valued queries can be simulated by 2^q non-adaptive queries.

The proof relies on the following lemma, which states that there is a large set of Boolean functions whose close-to-best linear approximations are mutually far apart. Any candidate learning algorithm with too few queries does not have enough information to disambiguate between these functions and is unlikely to output an approximately correct hypothesis.

Lemma 13.

For every ρ < 1 there exists ϵ > 0 such that for sufficiently large n, there is a set 𝒫 of 2^{ρn} pairs of functions (f, ℓ) over the domain {−1,1}ⁿ, where f is Boolean-valued and ℓ is linear and real-valued, such that

  1. For every (f, ℓ) ∈ 𝒫, ‖f^{=1} − ℓ‖ ≤ √ϵ.

  2. For all distinct (f, ℓ), (f′, ℓ′) ∈ 𝒫, ‖ℓ − ℓ′‖ ≥ 4√ϵ.

By the triangle inequality:

Corollary 14.

For every ρ < 1 there exist ϵ > 0 and a set ℋ of 2^{ρn} Boolean functions such that for every pair f ≠ g in ℋ, f^{=1} and g^{=1} are at least 2√ϵ far apart.

By Yao's minimax principle, to prove Proposition 12 it suffices (and is necessary) to give a probability distribution over the possible inputs f and show that any non-adaptive algorithm that makes too few queries is likely to fail on this distribution. Our distribution is uniform over the collection ℋ from Corollary 14: As log|ℋ| is larger than the query complexity, the algorithm does not have enough information to disambiguate between candidate inputs in ℋ and is therefore unlikely to be correct.

Proof of Proposition 12.

Choose f at random from the set ℋ of functions in Corollary 14, instantiated with ρ satisfying 1 − ρ = (1−κ)/2. Let f_Q denote the restriction of f to the query set Q. Conditioned on the choice of Q, by the law of total expectation,

𝐄[1/|{g ∈ ℋ : g_Q = f_Q}|] = Σ_{a ∈ Supp f_Q} (1/|{g ∈ ℋ : g_Q = a}|)·Pr_f[f_Q = a]
 = Σ_{a ∈ Supp f_Q} (1/|ℋ|)·Pr_g[g_Q = a]^{−1}·Pr_f[f_Q = a]
 = |Supp f_Q| / |ℋ|
 ≤ 2^q / |ℋ|,

where q is the query complexity. By our choice of parameters this is at most 2^{−(1−κ)n/2}.

The left-hand side upper bounds the probability that the algorithm succeeds. For conditioned on the view of A and the choice of f, its input is equally likely to have been any function in the set S_f = {g ∈ ℋ : g_Q = f_Q}. However, the output h of A(f) can be accurate for at most one function in this set: If

‖g − h‖² ≤ ‖g − g^{≤1}‖² + ϵ  and
‖g′ − h‖² ≤ ‖g′ − g′^{≤1}‖² + ϵ,

then by Pythagoras' theorem ‖g^{≤1} − h‖², ‖g′^{≤1} − h‖² ≤ ϵ. By the triangle inequality, g^{≤1} and g′^{≤1} are 2√ϵ-close, so it must be that g = g′. Therefore A(f) succeeds with probability at most 1/|S_f|.

4.2 Trading accuracy for queries

We describe a reduction R that, given access to a high-accuracy learner A that requires access to many queries, learns with low accuracy but fewer queries.

When A makes a query, the reduction flips a coin with success probability δ = 2√(ϵ/ϵ₀). If the coin flip succeeds, the reduction forwards the query and returns the answer to A. If it fails, the reduction answers the query randomly. When A outputs its answer h, the reduction returns h/δ.

The reduction can be implemented in any query model (random samples, non-adaptive, adaptive). We show that it is effective in the context of approximately learning linear approximations.
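A minimal sketch of the reduction R in the random-sample model (ours; the learner interface, the affine hypothesis being a coefficient vector, and the value of δ taken from Proposition 15 are assumptions of the sketch):

import numpy as np

def reduce_R(learner, sample, eps, eps0, rng):
    """Wrap an error-eps learner A into a cheaper constant-error learner (sketch of R)."""
    delta = 2 * np.sqrt(eps / eps0)             # forwarding probability, as in Proposition 15

    def noisy_sample():
        x, fx = sample()                        # a labeled example (x, f(x))
        if rng.random() < delta:
            return x, fx                        # forward the true label to A
        return x, rng.choice([1, -1])           # otherwise answer with a fresh random sign

    h = learner(noisy_sample)                   # h: coefficient vector of an affine hypothesis for the noisy function
    return h / delta                            # rescaling recovers an approximation for f itself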

We show correctness for Boolean-valued functions as it matches our application, but the proposition should hold more generally under any p-norm bound.

Proposition 15.

If A makes q queries and outputs h such that ‖h − f^{≤1}‖² ≤ ϵ given a Boolean-valued f as input, then R^A(f) makes at most q′ = 4√(ϵ/ϵ₀)·q queries except with probability 2^{−Ω(q′)}, and outputs h′ such that ‖h′ − f^{≤1}‖² ≤ ϵ₀ except with probability O(n/(ϵ·2ⁿ)).

Proof.
Query complexity.

The number of queries is a Binomial(q, δ) random variable. By a Chernoff bound it exceeds 2δq with probability at most 2^{−Ω(δq)}.

Correctness.

The input to A provided by R is indistinguishable from a function f′ obtained by corrupting f by noise of rate δ, namely f′(x) = N(x)f(x), where the N(x) are independent δ-biased bits. By the triangle inequality,

‖h/δ − f^{≤1}‖ ≤ (1/δ)‖h − f′^{≤1}‖ + (1/δ)‖f′^{≤1} − δf^{≤1}‖.

By our choice of δ, the first term on the right is at most √ϵ₀/2, so it suffices to upper bound the second one by the same amount. In expectation, using Parseval's identity,

𝐄‖f′^{≤1} − δf^{≤1}‖² = (1−δ²)·(n+1)·2^{−n}. (8)

By Markov's inequality, (1/δ)‖f′^{≤1} − δf^{≤1}‖ ≤ √ϵ₀/2 except with probability O(n/(ϵ·2ⁿ)).

As an aside, R can be implemented in a “complexity-preserving” manner in the sense that if A is only guaranteed to work on “low-complexity” inputs f (e.g. those of small circuit complexity) then R^A works on all inputs of slightly lower complexity. The reason is that Proposition 15 (specifically (8)) only relies on the pairwise independence of the noise.

Proof of Theorem 11.

Fix κ = 1/2 and let ϵ₀ be the corresponding ϵ from Proposition 12. Suppose A succeeds with higher probability. By Proposition 15, R^A makes o(n) < n/2 queries, succeeds with probability 2^{−o(n)}, and outputs an ϵ₀-approximation, violating Proposition 12.

An Ω(log(n)/√ϵ) adaptive query lower bound follows from the same argument.

4.3 Proof of Lemma 13

Claim 16.

For every ϵ there exists a Boolean function f : {−1,1}ⁿ → {−1,1} such that ‖f^{=1} − ℓ‖² ≤ ϵ for sufficiently large n, where ℓ(x) = (γ/√n)(x₁ + ⋯ + xₙ), assuming 4√6·γ²·exp(−1/(4γ²)) ≤ ϵ.

The proof uses the following special case of Khintchine’s inequality:

Fact 17 (Khintchine’s inequality).

For every linear function ℓ : {−1,1}ⁿ → ℝ, ‖ℓ‖₄⁴ ≤ 3‖ℓ‖₂⁴.

Proof of Claim 16.

We show that a random f from a suitable distribution satisfies the desired property with nonzero probability. For each x, f(x) is a random {−1,1} outcome with bias

ℓ(x)·𝟙(|ℓ(x)| ≤ 1) = {ℓ(x), if |ℓ(x)| ≤ 1;  0, if not}.

Figure 1: Illustration of f. When |ℓ(x)| ≤ 1, f(x) is ℓ(x)-biased. Otherwise, it is unbiased.

The values of f are chosen independently, conditioned on f(−x) = −f(x) for every x. (This folding simplifies the proof a little bit.) The construction is illustrated in Figure 1. By construction, f is balanced. By Parseval's identity,

‖f^{=1} − ℓ‖² = Σᵢ₌₁ⁿ (f̂(i) − γ/√n)². (9)

We analyze the concentration of f̂(i). In expectation, by linearity of expectation,

𝐄f̂(i) = 𝐄[xᵢf(x)]
 = 𝐄[xᵢ·𝐄[f(x) | x]]
 = 𝐄[xᵢ·ℓ(x)·𝟙(|ℓ(x)| ≤ 1)]
 = 𝐄[xᵢℓ(x)] − 𝐄[xᵢℓ(x)·𝟙(|ℓ(x)| > 1)]
 = γ/√n − ĝ(i),

where g is the function g(x) = ℓ(x)·𝟙(|ℓ(x)| > 1). As g is a symmetric function, its first-level Fourier coefficients are all equal and must therefore have absolute value

|ĝ(i)| = √(ĝ(1)² + ⋯ + ĝ(n)²)/√n
 ≤ √(𝐄[g(x)²])/√n (by Parseval's identity)
 = √(𝐄[ℓ(x)²·𝟙(|ℓ(x)| > 1)])/√n
 ≤ (𝐄[ℓ(x)⁴])^{1/4}·Pr[|ℓ(x)| > 1]^{1/4}/√n (by Cauchy–Schwarz)
 ≤ 3^{1/4}·γ·Pr[|ℓ(x)| > 1]^{1/4}/√n (by Fact 17)
 ≤ 6^{1/4}·(γ/√n)·exp(−1/(8γ²)) (by Hoeffding's bound)
 ≤ √(ϵ/4n) (by the assumption on γ).

As f̂(i) is the average of 2ⁿ/2 independent outcomes, by Hoeffding's bound

Pr[|f̂(i) − 𝐄f̂(i)| > √(ϵ/4n)] ≤ 2exp(−Ω(2ⁿϵ/n)).

Assuming n is sufficiently large so that this bound is less than 1/n, by the triangle inequality and a union bound, |f̂(i) − γ/√n| ≤ √(ϵ/n) for all i with positive probability. By (9) there must then exist a choice of f for which ‖f^{=1} − ℓ‖² ≤ ϵ.

Proof of Lemma 13.

Let γ satisfy 4√6·γ²·exp(−1/(4γ²)) = ϵ, and let 𝒞 be a code over {−1,1}ⁿ of rate ρ and minimum relative Hamming distance 4ϵ/γ². Such codes exist for sufficiently large n by the Gilbert–Varshamov bound. Let (f, ℓ) be the pair of functions from Claim 16. The collection 𝒫 consists of the pairs (f_σ, ℓ_σ) for σ ∈ 𝒞, where g_σ(x) denotes the function g(σ₁x₁, …, σₙxₙ).

As g_σ is obtained from g by a permutation of the domain, ‖f_σ^{=1} − ℓ_σ‖² = ‖f^{=1} − ℓ‖² ≤ ϵ, proving property 1. As for property 2, by Parseval's identity, for σ ≠ σ′ ∈ 𝒞,

‖ℓ_σ − ℓ_σ′‖² = (γ²/n) Σᵢ (σᵢ − σᵢ′)² = (4γ²/n) Σᵢ 𝟙(σᵢ ≠ σᵢ′) ≥ 16ϵ

by our choice of parameters.

5 Estimation from uniform samples

The queries in Algorithm 1 come in correlated pairs (x, x·eᵢ). They can be emulated from independent pairs (x, y) after reweighting by the likelihood ratio, thereby preserving the bias of the estimator. However, the variance of the likelihood ratio grows exponentially in Θ(nρ²), leading to a comparable blowup in the estimator variance (which becomes 2^{Ω(nρ²)}/ρ²). Such an algorithm is not query-efficient, as even the seemingly optimal choice ρ ≈ 1/√n fails to attain sublinear complexity.

5.1 The algorithm

Instead of adapting Algorithm 1, our pair-based sample mean estimator directly calculates the empirical mean of the unbiased estimator A(x,y) = f(x)f(y)⟨x,y⟩ averaged over all pairs of samples. The reuse of samples creates correlations, but their covariances are dominated by the individual variances.

Algorithm 2 Pair-based sample mean estimator.
given samples x₁, …, x_m i.i.d. uniform over {−1,1}ⁿ
return W̃1[f] = (m choose 2)^{−1} Σ_{i<j≤m} A(xᵢ, xⱼ), where A(x,y) = f(x)f(y)⟨x,y⟩

The algorithm can be implemented in time linear in mn (the bit complexity of the samples) by evaluating the equivalent formula

W̃1[f] = (1/(m(m−1)))·(Σᵢ₌₁ⁿ (Σⱼ f(xⱼ)·xⱼᵢ)² − n·Σⱼ f(xⱼ)²),

where xⱼᵢ denotes the i-th coordinate of xⱼ.
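A short sketch of Algorithm 2 via this linear-time formula (ours; the majority test function and sample size are arbitrary, and f is assumed to accept a batch of ±1 rows):

import numpy as np

def pair_based_W1(f, samples):
    """Pair-based sample mean estimator (Algorithm 2) in time linear in m*n."""
    m, n = samples.shape
    labels = f(samples)                                   # f(x_j) for each sample
    col = samples.T @ labels                              # sum_j f(x_j) * x_{j,i} for each coordinate i
    return ((col ** 2).sum() - n * (labels ** 2).sum()) / (m * (m - 1))

rng = np.random.default_rng(0)
maj3 = lambda x: np.sign(x[:, 0] + x[:, 1] + x[:, 2] + 0.5)   # 3-bit majority; its W1 is 3/4
print(pair_based_W1(maj3, rng.choice([1, -1], size=(20000, 12))))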
Theorem 18.

For m = O(√n·‖f‖₄²/ϵ + ‖f‖₄⁴/ϵ²) independent and uniform samples, |W̃1[f] − 𝐖1[f]| ≤ ϵ with probability at least 2/3.

Proof.

When x and y are random, A, and therefore W̃1[f], is an unbiased estimator of 𝐖1[f]:

𝐄[A(x,y)] = 𝐄[f(x)f(y)⟨x,y⟩]
 = Σᵢ₌₁ⁿ 𝐄[f(x)xᵢ·f(y)yᵢ]
 = Σᵢ₌₁ⁿ 𝐄[f(x)xᵢ]²
 = Σᵢ₌₁ⁿ f̂(i)² = 𝐖1[f].

The variance of W̃1[f] is the sum of the (co)variances of the pairs A(xᵢ,xⱼ) and A(xᵢ′,xⱼ′), indexed by i<j and i′<j′, scaled by (m choose 2)^{−2}. Only intersecting index pairs have a nonzero contribution. There are (m choose 2) of them with intersection size 2 and 2(m−2)·(m choose 2) with intersection size 1, resulting in

𝐕𝐚𝐫 W̃1[f] = (2/(m(m−1)))·v + (4(m−2)/(m(m−1)))·c,

where for independent x,y,z,

v = 𝐕𝐚𝐫 A(x,y)
 ≤ 𝐄[A(x,y)²]
 = 𝐄[f(x)²f(y)²⟨x,y⟩²]
 ≤ √(𝐄[f(x)⁴f(y)⁴])·√(𝐄[⟨x,y⟩⁴]) (by Cauchy–Schwarz)
 ≤ √3·n·‖f‖₄⁴ (by Fact 17)

and

c = 𝐂𝐨𝐯[A(x,y), A(x,z)]
 ≤ 𝐄[A(x,y)A(x,z)]
 = 𝐄[𝐄[A(x,y) | x]²]
 = 𝐄[f(x)²·𝐄[f(y)⟨x,y⟩ | x]²]
 ≤ √(𝐄[f(x)⁴])·√(𝐄[𝐄[f(y)⟨x,y⟩ | x]⁴]) (by Cauchy–Schwarz)
 = ‖f‖₄²·‖f^{=1}‖₄² (by Plancherel's formula)
 ≤ ‖f‖₄²·√3·‖f^{=1}‖² (by Fact 17)
 ≤ √3·‖f‖₄⁴.

Summing up, 𝐕𝐚𝐫 W̃1[f] = O(n/m² + 1/m)·‖f‖₄⁴. The conclusion follows by applying Chebyshev's inequality to W̃1[f].

5.2 Lower bound

We show that for sample-based algorithms, unlike query-based ones, the dependence on the dimension is inherent for this problem. More precisely, the following theorem says that Algorithm 2 is optimal in the sample model.

Theorem 19.

For every (possibly randomized) A there exists an f : {±1}ⁿ → {±1} so that when given m independent examples (x, f(x)) with uniform x, A outputs a value within ϵ of 𝐖1[f] with probability at most 1/2 + O(mϵ/√n) + O((n/ϵ)·2^{−n}).

The proof in fact shows that the sample complexity bound in Theorem 18 is tight in all parameters, even if ‖f‖₄ is replaced by ‖f‖_∞ in the statement.

To prove Theorem 19, we construct a function f that is weakly pseudorandom given o(√n/ϵ) examples but for which 𝐖1[f] is likely to be Ω(ϵ). In contrast, 𝐖1[f] for a random function is concentrated around n·2^{−n}.

The function f is the sign of

g_{Z,N}(x) = √(ϵ/n)·⟨x, Z⟩ + √(1−ϵ)·N(x),

where Z is an n-dimensional standard normal vector and N is a standard normal function over {−1,1}ⁿ (the 2ⁿ values N(x) are independent standard normals as x ranges over {−1,1}ⁿ).
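For intuition, a small simulation of this construction (ours; n and ϵ are arbitrary choices) draws g_{Z,N}, takes signs, and computes 𝐖1 by brute force; the result is on the order of ϵ, whereas a uniformly random function has first-level weight around n·2^{−n}.

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n, eps = 12, 0.1
cube = np.array(list(product([1, -1], repeat=n)))            # all 2^n inputs

Z = rng.standard_normal(n)
N = rng.standard_normal(len(cube))                           # an independent Gaussian per input
f = np.sign(np.sqrt(eps / n) * cube @ Z + np.sqrt(1 - eps) * N)
r = rng.choice([1, -1], size=len(cube))                      # a uniformly random function

W1 = lambda vals: sum(np.mean(vals * cube[:, i]) ** 2 for i in range(n))
print(W1(f), W1(r), n * 2.0 ** -n)                           # W1(f) = Theta(eps); W1(r) is about n/2^n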

Claim 20.

The statistical distance between (x₁, O(x₁)), …, (x_m, O(x_m)) when O = f and when O is a random function is at most O(mϵ/√n).

Proof.

It is sufficient to show the claim for the real-valued functions g=gZ,N and N, as f and a random Boolean function are obtained by taking signs of g and N, respectively, and postprocessing cannot increase statistical distance.

We will assume without loss of generality that x₁, …, x_m are all distinct, as repetitions can only decrease the statistical distance. Fixing x₁, …, x_m, the values g(x₁), …, g(x_m) are jointly centered Gaussian with covariance matrix

Σᵢⱼ = {(ϵ/n)·⟨xᵢ, xⱼ⟩, if i ≠ j;  1, if i = j}.

The covariance matrix of N(x₁), …, N(x_m) is the identity Id. The conditional statistical distance given x₁, …, x_m is at most [4]

O(‖Σ − Id‖_F) = O((ϵ/n)·√(Σ_{i≠j} ⟨xᵢ, xⱼ⟩²)).

By Cauchy–Schwarz, the average, unconditional statistical distance is at most

O((ϵ/n)·√(Σ_{i≠j} 𝐄⟨xᵢ, xⱼ⟩²)) = O((ϵ/n)·√(m²n)) = O(mϵ/√n).

Claim 21.

For every z such that n/64 ≤ ‖z‖² ≤ 4n, 𝐄[⟨X, z⟩·sign g_{z,N}(X) | N] = Ω(√(ϵn)) except with probability O(2^{−n}/ϵ).

Proof.

The expression inside the expectation is (up to a √n factor) of the form t·sign(√ϵ·t + √(1−ϵ)·N) for t = ⟨X, z⟩/√n. For fixed t and standard normal N,

𝐄[t·sign(√ϵ·t + √(1−ϵ)·N)] = |t|·Pr(|N| < √(ϵ/(1−ϵ))·|t|) ≥ |t|·Pr(|N| < √ϵ·|t|). (10)

As the right-hand side is always nonnegative,

𝐄[⟨X,z⟩·sign g_{z,N}(X)] ≥ 𝐄[⟨X,z⟩·sign g_{z,N}(X) | |⟨X,z⟩| ≥ ‖z‖/3]·Pr(|⟨X,z⟩| ≥ ‖z‖/3)
 ≥ (‖z‖/3)·Pr(|N| < √ϵ/24)·Pr(|⟨X,z⟩| ≥ ‖z‖/3) (by (10))
 = Ω(√n·Pr(|N| < √ϵ/24)) (by Khintchine's inequality)
 = Ω(√(ϵn)).

By the independence of the values sign g_{z,N}(x) across x,

𝐕𝐚𝐫 𝐄[⟨X,z⟩·sign g_{z,N}(X) | N] = 𝐕𝐚𝐫(2^{−n} Σₓ ⟨x,z⟩·sign g_{z,N}(x))
 = 2^{−2n} Σₓ ⟨x,z⟩²·𝐕𝐚𝐫(sign g_{z,N}(x))
 ≤ 2^{−2n} Σₓ ⟨x,z⟩²
 = ‖z‖²·2^{−n}.

The claim follows from Chebyshev’s inequality.

Claim 22.

𝐖1[f] = Ω(ϵ) except with probability O(2^{−n}/ϵ).

Proof.

By the Cauchy–Schwarz inequality, for any linear function ℓ,

⟨f, ℓ⟩² = ⟨f^{=1}, ℓ⟩² ≤ ‖f^{=1}‖²·‖ℓ‖² = 𝐖1[f]·‖ℓ‖².

The function ℓ(x) = ⟨x, Z⟩ has squared 2-norm Σᵢ Zᵢ², which is between n/64 and 4n except with probability 2^{−n}. Applying Claim 21,

𝐖1[f] ≥ ⟨f, ℓ⟩²/‖ℓ‖² = 𝐄[⟨X, Z⟩·sign g_{Z,N}(X) | N, Z]² / 𝐄[⟨X, Z⟩² | Z] = Ω(ϵ).

Claim 23.

For a random Boolean function r, 𝐖1[r] ≤ ϵ except with probability (n/ϵ)·2^{−n}.

Proof.

By symmetry the expectation of 𝐖1[r] is n·2^{−n}. The bound follows from Markov's inequality.

Proof of Theorem 19.

Applying Claim 22 with the ϵ in the definition of f scaled up by a suitable constant factor, 𝐖1[f] > 3ϵ except with probability O(2^{−n}/ϵ). By Claim 23, 𝐖1[r] ≤ ϵ except with probability (n/ϵ)·2^{−n}.

Let D be the distinguisher that accepts if A's output is greater than 2ϵ and rejects if not. If A is correct with probability at least (1+δ)/2 on every input, then D accepts f with probability at least 1/2 + δ/2 − O(2^{−n}/ϵ), while D accepts r with probability at most 1/2 − δ/2 + (n/ϵ)·2^{−n}. As D's distinguishing advantage cannot exceed A's, the statistical distance between A's views on inputs f and r is at least δ − O((n/ϵ)·2^{−n}). By Claim 20 it is at most O(mϵ/√n), from where δ ≤ O(mϵ/√n) + O((n/ϵ)·2^{−n}).

References

  • [1] Mitali Bafna, Srikanth Srinivasan, and Madhu Sudan. Local decoding and testing of polynomials over grids. Random Struct. Algorithms, 57(3):658–694, 2020. doi:10.1002/RSA.20933.
  • [2] Shai Ben-David and Ruth Urner. The sample complexity of agnostic learning under deterministic labels. In Proceedings of The 27th Conference on Learning Theory, volume 35, pages 527–542, 13–15 June 2014. URL: https://proceedings.mlr.press/v35/ben-david14.html.
  • [3] Andrej Bogdanov and Gautam Prakriya. Direct Sum and Partitionability Testing over General Groups. In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021), volume 198, pages 33:1–33:19, 2021. doi:10.4230/LIPIcs.ICALP.2021.33.
  • [4] Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional gaussians with the same mean. arXiv preprint, 2018. arXiv:1810.08693.
  • [5] Noah Fleming and Yuichi Yoshida. Distribution-free testing of linear functions on ℝⁿ. arXiv preprint, 2019. arXiv:1909.03391.
  • [6] Shirley Halevy and Eyal Kushilevitz. Distribution-free property-testing. SIAM J. Comput., 37(4):1107–1138, 2007. doi:10.1137/050645804.
  • [7] Subhash Khot and Dana Moshkovitz. NP-hardness of approximately solving linear equations over reals. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 413–420, 2011. doi:10.1145/1993636.1993692.
  • [8] Nathan Linial, Yishay Mansour, and Noam Nisan. Constant depth circuits, fourier transform, and learnability. Journal of the ACM (JACM), 40(3):607–620, 1993. doi:10.1145/174130.174138.
  • [9] Ryan O’Donnell. Analysis of boolean functions. Cambridge University Press, 2014.
  • [10] Michal Parnas, Dana Ron, and Ronitt Rubinfeld. Tolerant property testing and distance approximation. J. Comput. Syst. Sci., 72(6):1012–1042, 2006. doi:10.1016/j.jcss.2006.03.002.
  • [11] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, USA, 2014.
  • [12] Ohad Shamir. The sample complexity of learning linear predictors with the squared loss. J. Mach. Learn. Res., 16(1):3475–3486, January 2015. doi:10.5555/2789272.2912110.
  • [13] Madhur Tulsiani. Lecture 5: Information and coding theory. Lecture Notes, Winter 2021. https://home.ttic.edu/~madhurt/courses/infotheory2021/l5.pdf.
  • [14] Tomas Vaškevičius and Nikita Zhivotovskiy. Suboptimality of constrained least squares and improvements via non-linear predictors. Bernoulli, 29, February 2023. doi:10.3150/22-BEJ1465.