Estimating Euclidean Distance to Linearity
Abstract
Given oracle access to a real-valued function $f$ on the $n$-dimensional Boolean cube, how many queries does it take to estimate the squared Euclidean distance to its closest linear function within $\epsilon$? Our main result is that $\mathrm{poly}\log(1/\epsilon)/\epsilon^{2}$ queries suffice. Not only is the query complexity independent of $n$, it is also optimal up to the polylogarithmic factor.
Our estimator evaluates $f$ on pairs of inputs correlated by noise rates chosen to cancel out the low-degree contributions to $f$ while leaving the linear part intact. The query complexity is optimized when the noise rates are multiples of Chebyshev nodes.
In contrast, we show that the dependence on $n$ is unavoidable in two closely related settings. For estimation from random samples, a number of samples polynomial in $n$ is necessary and sufficient. For agnostically learning a linear approximation with mean-square regret $\epsilon$ under the uniform distribution, a number of nonadaptively chosen queries linear in $n$ is necessary, while polynomially many random samples are known to be sufficient (Linial, Mansour, and Nisan).
Our upper bounds apply to functions with bounded 4-norm. Our lower bounds apply even to $\pm 1$-valued functions.
Keywords and phrases: sublinear-time algorithms, statistical estimation, analysis of Boolean functions, property testing, regression
2012 ACM Subject Classification: Theory of computation → Streaming, sublinear and near linear time algorithms; Theory of computation → Randomness, geometry and discrete structures; Theory of computation → Probabilistic computation
Acknowledgements: We thank Gautam Prakriya for insightful discussions and the anonymous ITCS reviewers for corrections and helpful comments.
Funding: This work was supported by an NSERC Discovery Grant.
Editors: Raghu Meka
Series and Publisher: Leibniz International Proceedings in Informatics (LIPIcs), Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Finding linear approximations is perhaps the most important problem in statistics and data science. Linear regression [11, Section 9.2] seeks to learn the "best" linear predictor for a label $y$ from labeled examples $(x, y)$. Owing to its wide applicability and computational advantages, approximation error is often measured as the minimum squared loss under some distribution on examples.
Our main question of interest is whether estimating the minimum squared loss can be performed more efficiently than learning the predictor. A highly efficient estimator can be potentially used to decide whether learning a linear model is sensible at all.
We aim to highlight the conceptual distinctions between estimation and learning. We demonstrate that a complexity gap between the two problems is already exhibited by arguably the simplest distribution on examples: the uniform distribution over the Boolean cube $\{-1,1\}^n$.
Under the uniform distribution, the minimum squared loss is the squared distance between the function of interest, viewed as a vector in $2^n$-dimensional Euclidean space, and the $n$-dimensional subspace spanned by the linear functions. In this setting estimating the squared distance is essentially equivalent to estimating the squared length of the projection.
The power of linear regression in machine learning applications is derived to a great extent from the distribution-free nature of the training algorithms. In contrast our results concern linear regression for a known distribution. We believe that distribution-specific regression is interesting for several reasons.
First, restrictive assumptions on the data distribution may improve performance. For example, distributions in which some features linearly approximate other features can be problematic for popular algorithms like gradient descent. The uniform distribution, as a model of pairwise independence, avoids such obstacles.
Second, distribution-specific algorithms sometimes serve as a stepping stone to distribution-free ones. In the context of property testing this strategy was successful in obtaining distribution-free algorithms for combinatorial properties like monotonicity [6] as well as algebraic ones like linearity and low-degree representability [6, 5].
Third, negative results about learning and estimation are stronger in the distribution-specific model. Distribution-free learning must succeed for all concepts in the given class and all distributions on examples. To understand the limitations it is natural to decouple the complexities of the concept class and of the distribution. Do learning and estimation remain hard even when the distribution is simple?
In this work we focus on information-theoretic measures of complexity. Our principal measure of performance is the query complexity. The time complexity of all our algorithms is linear in the size of the labeled dataset. We study both algorithms that choose their queries and ones that are provided with uniformly random examples.
Our results
Our results are summarized in Table 1. Our main contribution is Theorem 1: The squared linear projection of $f$,

$$\mathbf{W}^{1}[f] = \sum_{i=1}^{n} \hat{f}(i)^{2}, \tag{1}$$

can be estimated within $\epsilon$ with $\mathrm{poly}\log(1/\epsilon)/\epsilon^{2}$ chosen queries.¹ In contrast, learning the requisite approximation and obtaining the same estimate from random samples both entail query complexities that grow at least as fast as some root of $n$.

¹ The query complexities of the squared projection and the squared distance to linearity are the same up to a constant factor, since $\|f\|^2$ can be estimated from random samples assuming the 4-norm of $f$ is bounded. For similar reasons, so are the squared projection onto and the distance to the space of affine functions.
Table 1: Summary of our results.

| | chosen queries | random samples |
|---|---|---|
| estimation | upper bound: Theorem 1; lower bound: Theorem 8 | upper bound: Theorem 18; lower bound: Theorems 19 + 8 |
| learning | upper bound: [8]; non-adaptive lower bound: Theorem 11 | upper bound: [8]; lower bound: Theorem 11 |
We found this result surprising from a Fourier-analytic perspective. Our quantity of interest is the sum of squares of all first-level Fourier coefficients of $f$. Up to the polylogarithmic factor, estimating the sum has the same query complexity as estimating any one of its individual terms, or estimating the squared mean of $f$.
In the decision version of estimation, also known as tolerant testing [10], the objective is to distinguish functions that are close to linear from those that are far. In particular, the algorithm of Theorem 1 is a tolerant tester for any setting of the two distance thresholds. In the intolerant limit, testers of lower query complexity are known [1, 3]. For functions over $\mathbb{R}^n$, efficient testers are also known under the Gaussian measure [7] and under arbitrary measure [5]. With the exception of [7], these testers appear not to be robust against small perturbations, as their soundness is analyzed with respect to the Hamming distance. The tester of Khot and Moshkovitz [7] can be viewed as an estimator, albeit one with a large constant bias.
Our positive results apply to functions with bounded 4-norm. Boundedness of the 2-norm is clearly insufficient: the point function supported on a single random point of the cube, scaled to unit 2-norm, has unit squared distance to linearity, yet is indistinguishable from the all-zero function with $o(2^n)$ queries. We did not investigate whether $p$-norm boundedness for $2 < p < 4$ is sufficient.
In contrast, in Theorem 11 we show that properly learning a linear hypothesis for $f$ that is within squared distance $\epsilon$ of optimal has non-adaptive query complexity linear in $n$. Our technique also yields a general (adaptive) lower bound. On the positive side, it is known that random samples suffice for learning via empirical risk minimization [8]. The gap between the lower and upper bounds is reminiscent of analogous gaps in distribution-free agnostic learning of deterministic concept classes [2].
Techniques
Estimation from chosen queries
Algorithm 1 estimates the linear projection of $f$ by a suitably scaled product $f(x)f(y)$ evaluated on a pair of correlated inputs. The query complexity can be decoupled from the input length already when the pairs $(x, y)$ are independent, marginally uniform, and of sufficiently low correlation $\rho$. For simplicity assume $f$ is unbiased and $\pm 1$-valued. By orthogonal decomposition (see (2)),

$$\mathop{\mathbb{E}}[f(x)f(y)] = \sum_{d \ge 1} \rho^{d}\, \|f^{=d}\|^{2},$$

where $f^{=d}$ is the degree-$d$ part of $f$ (see the precise definition in Section 2). By Parseval's identity, $\rho^{-1} f(x)f(y)$ estimates $\mathbf{W}^1[f]$ with bias $O(\rho)$. By choosing the bias-variance tradeoff $\rho \approx \epsilon$, empirically averaging this estimator yields an $\epsilon$-approximation of $\mathbf{W}^1[f]$ with $O(1/\epsilon^4)$ queries.
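To make this warm-up concrete, here is a minimal Python sketch of the single-noise-rate estimator. The function names and the Majority test function are ours, not from the paper; the paper's Algorithm 1 uses the mixture of noise rates described below.

```python
import numpy as np

def estimate_W1_single_rho(f, n, rho, m, rng):
    """Average of f(x)f(y)/rho over m independent rho-correlated pairs (x, y).
    For unbiased +-1-valued f its mean is sum_{d>=1} rho^(d-1) ||f^{=d}||^2,
    i.e. W^1[f] up to a bias of O(rho)."""
    X = rng.choice([-1, 1], size=(m, n))                   # uniform x
    keep = rng.random((m, n)) < (1 + rho) / 2              # flip each bit with prob (1-rho)/2
    Y = np.where(keep, X, -X)                              # rho-correlated y
    return np.mean(f(X) * f(Y)) / rho

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    maj3 = lambda Z: np.sign(Z[:, 0] + Z[:, 1] + Z[:, 2])  # W^1[MAJ3] = 3/4
    print(estimate_W1_single_rho(maj3, n=20, rho=0.1, m=100000, rng=rng))
```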
To improve the query complexity we work with a more general class of distributions on pairs $(x, y)$. Barring additional information on $f$, it is sensible that the distribution should place equal probability on all pairs whose difference has a fixed Hamming weight. A special class of such distributions is mixtures of $\rho$-correlated pairs over some distribution on noise rates $\rho$.
The uniform mixture of $\rho$- and $(-\rho)$-correlated pairs already improves the query complexity to $O(1/\epsilon^3)$. The reason is that the related Fourier decomposition

$$\tfrac{1}{2}\bigl(\mathop{\mathbb{E}}[f(x)f(y_{\rho})] + \mathop{\mathbb{E}}[f(x)f(y_{-\rho})]\bigr) = \sum_{d \text{ odd}} \rho^{d}\, \|f^{=d}\|^{2}$$

cancels out all even levels, thus reducing the bias of the estimator from $O(\rho)$ to $O(\rho^{2})$.
To achieve our query complexity of $\mathrm{poly}\log(1/\epsilon)/\epsilon^{2}$ we construct a mixture of noise rates that annihilates all levels up to a cutoff $k$, except the first, while minimizing the contribution of levels $k+1$ and higher. A close-to-optimal error is achieved using Chebyshev interpolation. We give evidence that a polylogarithmic loss is inherent in this approach (Claim 6), but we do not know if it can be bypassed in general.
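As an illustration of the overall scheme, the sketch below solves our guessed form of the annihilation constraints at scaled Chebyshev nodes and importance-samples the resulting signed mixture. The scaling `lam`, the cutoff `k`, and the exact form of the constraint system are our assumptions for illustration, not the paper's parameter choices.

```python
import numpy as np

def chebyshev_mixture(k, lam=0.75):
    """Noise rates mu_j (Chebyshev nodes scaled by lam) and coefficients c_j
    solving sum_j c_j mu_j^d = [d == 1] for d = 1..k (our guess at the form of
    system (4); take k even so that no node is zero)."""
    j = np.arange(1, k + 1)
    mu = lam * np.cos((2 * j - 1) * np.pi / (2 * k))
    V = np.vstack([mu ** d for d in range(1, k + 1)])
    rhs = np.zeros(k); rhs[0] = 1.0
    return mu, np.linalg.solve(V, rhs)

def estimate_W1_mixture(f, n, k, m, rng):
    """Importance-sample the signed mixture: pick node j with prob |c_j|/||c||_1,
    query a mu_j-correlated pair, reweight by sign(c_j)*||c||_1, and average.
    Assumes f is unbiased, as in the warm-up above."""
    mu, c = chebyshev_mixture(k)
    p = np.abs(c) / np.abs(c).sum()
    j = rng.choice(k, size=m, p=p)                        # noise rate used by each sample
    X = rng.choice([-1, 1], size=(m, n))
    keep = rng.random((m, n)) < (1 + mu[j, None]) / 2
    Y = np.where(keep, X, -X)                             # mu_j-correlated copies of the rows of X
    w = np.sign(c[j]) * np.abs(c).sum()                   # importance weights
    return np.mean(w * f(X) * f(Y))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    maj3 = lambda Z: np.sign(Z[:, 0] + Z[:, 1] + Z[:, 2])  # W^1[MAJ3] = 3/4
    print(estimate_W1_mixture(maj3, n=20, k=6, m=300000, rng=rng))
```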
In Remark 7 we explain why the low complexity of Algorithm 1 owes not only to the choice of interpolation scheme, but to some serendipity in the constraints (6) on the noise rates.
The proof of Theorem 8 is based on a reduction from the problem of estimating the bias of a coin.
Learning lower bound
Theorem 11 is proved in two steps. Proposition 12 addresses the regime of constant error $\epsilon$. It is proved by exhibiting a large family of functions whose linear projections are pairwise far apart. The existence of this family is established using the probabilistic method in Lemma 13. An algorithm of sublinear query complexity is unlikely to disambiguate among members of the family and therefore cannot be accurate.
In Proposition 15 we give a self-reduction for learning linear approximations: a learner with query complexity $q$ and error $\epsilon$ yields one with query complexity roughly $\epsilon q$ and constant error. The reduction applies to any query model.
Estimation from samples
The random variable $f(x)f(y)\langle x, y\rangle$ with random independent $x$ and $y$ is an unbiased estimator of $\mathbf{W}^1[f]$. To improve its variance at optimal query cost we average it over all pairs of examples (Algorithm 2 and Theorem 18). For constant $\epsilon$, the lower bound in Theorem 19 is exhibited by a sign-rounded linear function with independent Gaussian coefficients. Such a function is weakly pseudorandom against a bounded number of examples yet has constant correlation with its linear part. We achieve the optimal dependence on $\epsilon$ by adding Gaussian noise to the construction.
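For completeness, the one-line unbiasedness computation behind this estimator (a standard calculation using only the independence of $x$ and $y$):

$$\mathop{\mathbb{E}}_{x,y}\bigl[f(x)f(y)\langle x, y\rangle\bigr] = \sum_{i=1}^{n} \mathop{\mathbb{E}}_x[f(x)x_i]\,\mathop{\mathbb{E}}_y[f(y)y_i] = \sum_{i=1}^{n} \hat f(i)^2 = \mathbf{W}^1[f].$$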
Extensions and future work
We believe that a result analogous to Theorem 1 can be obtained for functions over $\mathbb{R}^n$ under the standard Gaussian measure using related techniques. It would be interesting to investigate the general setting of product distributions under the Efron–Stein decomposition. The biased Boolean cube could be a good test case.
Theorem 1 can be extended to approximate the weight of the degree-$d$ part of $f$ with a dimension-independent number of queries. Both extensions might be sensible in the context of regression, where a mix of categorical (Boolean) and numerical (Gaussian) data and higher-degree models (capturing e.g. decision trees) are more realistic.
An interesting open question is whether the polylogarithmic factor can be eliminated. We speculate that this may be possible under the stronger assumption that $f$ is bounded in infinity-norm.
Organization
2 Fourier analysis over the Boolean cube
We use standard notation from Boolean function analysis [9]. All functions are real-valued over the Boolean cube $\{-1,1\}^n$ under the uniform measure. Such functions live in a Hilbert space under the inner product $\langle f, g\rangle = \mathbb{E}_x[f(x)g(x)]$. The $p$-norm of a function is $\|f\|_p = \mathbb{E}[|f|^p]^{1/p}$, with $p = 2$ usually omitted. The monomials $x^S = \prod_{i \in S} x_i$, $S \subseteq [n]$, form an orthonormal basis. Every $f$ admits the orthogonal decompositions

$$f = f^{=0} + f^{=1} + \cdots + f^{=n} = f^{=0} + f^{=1} + f^{\geq 2},$$

with $f^{=d}$ (resp., $f^{\geq d}$) being the projection onto the subspace spanned by monomials of degree exactly $d$ (resp., at least $d$). Our main object of interest is the linear part $f^{=1}$, especially its weight

$$\mathbf{W}^{1}[f] = \|f^{=1}\|^{2}.$$

The orthogonal decomposition can be further refined into the complete Fourier decomposition

$$f = \sum_{S \subseteq [n]} \hat{f}(S)\, x^{S}, \qquad \hat{f}(S) = \langle f, x^{S}\rangle.$$

By orthogonality, $\langle f, g\rangle = \sum_S \hat{f}(S)\hat{g}(S)$ (Plancherel's formula). In particular, $\|f\|^2 = \sum_S \hat{f}(S)^2$ when $g = f$ (Parseval's identity).

By orthogonality, the distance of $f$ to linearity is minimized when the linear approximation equals $f^{=1}$. The minimum squared distance equals $\|f\|^2 - \mathbf{W}^1[f]$, with $\mathbf{W}^1[f] = \sum_i \hat f(i)^2$ as in (1).

A $\mu$-biased string $z$ in $\{-1,1\}^n$ is one in which the coordinates are independent and $\mu$-biased, that is, $\mathbb{E}[z_i] = \mu$. We use $xz$ for the pointwise product of strings $x$ and $z$ in $\{-1,1\}^n$. Multiplying by a $\mu$-biased string and averaging the outcome has the effect of dampening the higher-degree terms:

$$\mathop{\mathbb{E}}_{z}[f(xz)] = \sum_{d=0}^{n} \mu^{d}\, f^{=d}(x). \tag{2}$$
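A brute-force numerical check of identity (2) for a small $n$ (a sketch; the Majority test function and the convention $\mathbb{E}[z_i] = \mu$ are ours):

```python
import numpy as np
from itertools import product

n, mu = 4, 0.3
f = lambda v: np.sign(v[..., 0] + v[..., 1] + v[..., 2])     # MAJ on the first 3 coordinates

cube = np.array(list(product([-1, 1], repeat=n)))            # all 2^n points
subsets = list(product([0, 1], repeat=n))
char = lambda x, S: np.prod(np.where(S, x, 1), axis=-1)      # monomial x^S
fhat = np.array([np.mean(f(cube) * char(cube, np.array(S, bool))) for S in subsets])

x = cube[7]
# Right-hand side of (2): sum over S of mu^{|S|} * fhat(S) * x^S.
rhs = sum(mu ** sum(S) * fh * char(x, np.array(S, bool)) for S, fh in zip(subsets, fhat))

# Left-hand side of (2): E_z[f(xz)] for independent mu-biased z_i (E[z_i] = mu).
rng = np.random.default_rng(1)
z = np.where(rng.random((200000, n)) < (1 + mu) / 2, 1, -1)
lhs = np.mean(f(x * z))
print(lhs, rhs)   # the two agree up to sampling error
```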
3 Estimation from chosen queries
Theorem 1.
There is an algorithm that makes $\mathrm{poly}\log(1/\epsilon)/\epsilon^{2}$ queries to a function $f$ of bounded 4-norm and outputs a value within $\epsilon$ of $\mathbf{W}^1[f]$ with probability at least $2/3$.
The algorithm takes the empirical average of independent instantiations of Algorithm 1 with suitable settings of its parameters. (We may assume $\epsilon$ is an inverse power of two, as this doesn't change the asymptotics.)
The conditional bias of this estimator is governed by the coefficients $c_1, \dots, c_k$ and noise rates $\mu_1, \dots, \mu_k$ of the mixture, where $T_{\mu}$ denotes the noise operator $T_{\mu} f(x) = \mathbb{E}_z[f(xz)]$ for a $\mu$-biased $z$. The constants are chosen so that the first Fourier level of $f$ survives but all other levels among the first $k$ are annihilated:

$$\sum_{j=1}^{k} c_j\, T_{\mu_j} f^{=d} = \begin{cases} f^{=1}, & d = 1,\\ 0, & d \le k,\ d \neq 1. \end{cases} \tag{3}$$
The theorem is proved by balancing the bias and the variance of the estimator. The next two claims bound the bias and the variance respectively. Their proofs are in Section 3.1.
Claim 2.
.
Claim 3.
has variance at most .
The 1-norm of the coefficient vector $c = (c_1, \dots, c_k)$ governs both. We analyze it next. Expanding in the Fourier basis and applying (2), the constraints (3) translate to the linear system

$$\sum_{j=1}^{k} c_j\, \mu_j^{d} = \begin{cases} 1, & d = 1,\\ 0, & 2 \le d \le k, \end{cases} \tag{4}$$

in the unknowns $c_1, \dots, c_k$. The existence of a short solution strongly depends on the choice of evaluation points $\mu_1, \dots, \mu_k$. Suitably scaled Chebyshev nodes,

$$\mu_j \propto \cos\!\left(\frac{(2j-1)\pi}{2k}\right), \qquad j = 1, \dots, k, \tag{5}$$

are close to best possible for this purpose, as will be argued in Corollary 5 and Claim 6.
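To get a feel for the behavior of this system, the following sketch solves our guessed form of (4) at scaled Chebyshev nodes and reports the 1-norm of the solution (which controls the variance) together with the worst residual on the next $k$ levels (which controls the bias). The scaling `lam = 0.5` is an arbitrary illustration value, not the paper's choice.

```python
import numpy as np

def mixture_solution(k, lam=0.5):
    """Solve sum_j c_j mu_j^d = [d == 1], d = 1..k, at Chebyshev nodes scaled
    by lam (our guess at the form of system (4); take k even so no node is 0)."""
    j = np.arange(1, k + 1)
    mu = lam * np.cos((2 * j - 1) * np.pi / (2 * k))
    V = np.vstack([mu ** d for d in range(1, k + 1)])
    rhs = np.zeros(k); rhs[0] = 1.0
    return mu, np.linalg.solve(V, rhs)

for k in (4, 6, 8, 10, 12):
    mu, c = mixture_solution(k)
    onenorm = np.abs(c).sum()                                 # grows only polynomially in k
    residual = max(abs(c @ mu ** d) for d in range(k + 1, 2 * k + 1))
    print(k, round(onenorm, 1), residual)                     # residual beyond level k shrinks with k
```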
Claim 4.
, where
| (6) |
and $T_i$ denotes the $i$-th Chebyshev polynomial of the first kind.
Proof.
The first Chebyshev polynomials are orthogonal with respect to inner product over the Chebyshev nodes:
Using the facts , and that is zero for even and for odd we derive the unique Chebyshev expansion
| (7) |
By orthogonality, equals , when and half that when . In the latter case, the first equation in (4) says that
Otherwise, the polynomial has no linear term, so
expands as a linear combination of for . By (4) such linear combinations vanish. Therefore if ,
Corollary 5.
.
Proof.
By orthogonality we have the explicit formula
which is at most twice the sum of odd integers between and . Those can be matched into pairs that add up to giving the upper bound.
Therefore . This bound is within a factor of optimal for any choice of roots:
Claim 6.
For any choice of , any that solves (4) has length .
Matching this bound (up to a constant factor) would result in a polylogarithmic-factor improvement in the query complexity. We do not know if the choice of $c$ in (6) achieves it already.
Proof.
Set equal to or , whichever is odd. Take a linear combination of equations (4) weighted by the coefficients of . As is bounded on ,
Remark 7.
By the same argument, had the right-hand side of (6) been replaced by , there would have been no solution of 1-norm less than . Thus the low complexity of Algorithm 1 owes not only to the choice of noise parameters but to the fortunate form of the right-hand side in the linear system (6).
3.1 Proof of Theorem 1
Proof of Theorem 1.
By Chebyshev’s inequality, the empirical average of samples of an estimator with variance is within of its mean except with probability . By 2 and 3, with probability at least ,
As , it is sufficient that the choices of satisfy
for . By Claim 5, . The choice ensures that for ,
As ,
which is at most by our choice .
Proof of Claim 2.
Proof of Claim 3.
The operator $T_{\mu}$ is contractive: $\|T_{\mu} f\|_p \le \|f\|_p$ for all $f$ and $p \ge 1$. The variance bound then follows by applying the Cauchy–Schwarz inequality and contractivity.
3.2 Lower bound
Theorem 8.
For every -query algorithm there exists a Boolean-valued such that with probability at least , as long as .
The advantage of a distinguisher $A$ with respect to a pair of random variables $X$ and $Y$ is the absolute difference between the probabilities that $A$ accepts $X$ and that it accepts $Y$. The maximum possible distinguishing advantage is known to equal the minimum possible probability of the event $X \neq Y$ under all couplings of $X$ and $Y$, i.e., joint distributions of pairs that are consistent with the two marginals.
Fact 9.
Assuming , distinguishing between independent and -biased coin flips with advantage at least requires samples.
Proof.
There are several proofs for the case , see for example [13]. The general case reduces to this special case. Consider the reduction that takes the outcome of a coin flip, outputs it with probability , and outputs otherwise. This reduction maps unbiased coins into -biased ones and -biased coins into -biased ones. Distinguishing the latter with advantage therefore requires samples.
Proof of Theorem 8.
Choose any with . Let (resp., ) be a probabilistic Boolean function in which the values are independent -biased, (resp., -biased). By 10 is at least , and is at most except with probability at most . Therefore
by our choice of and . Had been -accurate with probability , by a union bound,
so . As this holds for an arbitrary coupling of and , the two can be -distinguished.
We now show this is impossible, namely that the two function distributions cannot be distinguished with this advantage using this many queries. For if they could be by some algorithm $A$, then the following algorithm could distinguish the two kinds of biased coin flips with the same number of queries, contradicting Fact 9: whenever $A$ makes a new query $x$, flip a coin, multiply the outcome by a fixed character evaluated at $x$, and provide this answer. The view of $A$ when the reduction is given the two kinds of biased coins is identical to its interactions with the two function distributions, respectively.
Claim 10.
Let be a random Boolean function whose values are independent bits. Then except with probability at most .
Proof.
By independence, for every so by Chebyshev’s inequality, except with probability . Therefore . The claim follows from a union bound and the triangle inequality.
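A quick numeric illustration of the phenomenon behind Claim 10 (the exact probability bound and parameters in the claim are not reproduced here): for a uniformly random $\pm 1$-valued function on a modest cube, $\mathbf{W}^1$ is tiny.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n = 12
cube = np.array(list(product([-1, 1], repeat=n)))      # all 2^n points
f = rng.choice([-1, 1], size=len(cube))                # a uniformly random Boolean function

fhat_level1 = cube.T @ f / len(cube)                   # \hat f(i) = E[f(x) x_i]
print((fhat_level1 ** 2).sum())                        # about n / 2^n, i.e. roughly 0.003 here
```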
For this argument to result in a lower bound it is essential that and be bounded away both from and from : Had they been too close to the coins would have been distinguishable from samples. Had they been too close to , the typical distance between and would have been .
4 Learning
In the setting of distribution-free (agnostic) learning of a linear approximation from independent random samples, a number of samples growing with the dimension is necessary for properly learning a bounded $n$-dimensional function with respect to expected square loss [12]. This bound can be matched by an improper learning algorithm but not by empirical risk minimization [14].
Shamir [12] points out that in general, loss minimization (output a linear hypothesis whose loss is close to the best possible) and model approximation (output a hypothesis that is close to the loss minimizer) are not equivalent problems.
In the context of learning under the uniform distribution, however, the two problems are equivalent to approximating the projection onto the space of linear functions. Linial, Mansour, and Nisan [8] showed that the approximation can be calculated from random samples via empirical risk minimization.
We show a query complexity lower bound that is linear in $n$ for proper agnostic learning, even when the learner is given non-adaptive query access to $f$.
Theorem 11.
For every non-adaptive algorithm that makes queries and outputs an affine hypothesis, there is an such that only with probability assuming .
Our proof technique also gives a query complexity lower bound for adaptive algorithms.
Theorem 11 is proved in two steps. In Proposition 12 we establish it for a sufficiently small but constant error. In Proposition 15 we give a reduction from agnostically learning linear approximations with constant mean-square error and few queries to the same problem with mean-square error $\epsilon$ and proportionally more queries. The reduction is flexible with respect to the learning model, and even to the task. It can also be applied to estimation, but gives a worse dependence on $\epsilon$ than what we have in Theorems 8 and 19.
4.1 Lower bound for constant error
Proposition 12.
For every there exist such that for sufficiently large and for every non-adaptive algorithm that makes at most queries and outputs an affine hypothesis, there exists such that with probability at most .
An adaptive query complexity lower bound follows from the same assumptions, as adaptive algorithms whose queries receive Boolean-valued answers can be simulated with non-adaptive queries.
The proof relies on the following lemma, which states that there is a large set of Boolean functions whose close-to-best linear approximations are mutually far apart. Any candidate learning algorithm with too few queries does not have enough information to disambiguate between these functions and is unlikely to output an approximately correct hypothesis.
Lemma 13.
For every constant error parameter there exist constants such that for sufficiently large $n$, there is a large set of pairs of functions over the domain $\{-1,1\}^n$, each pair consisting of a Boolean-valued function and a real-valued linear function, satisfying:
1. Within every pair, the Boolean function and the linear function are close.
2. The linear functions belonging to distinct pairs are far apart.
By the triangle inequality:
Corollary 14.
For every constant error parameter there exists a large set of Boolean functions whose best linear approximations are pairwise far apart.
By Yao’s Minimax principle, to prove Proposition 12 it suffices (and is necessary) to give a probability distribution over the possible inputs and show that any non-adaptive algorithm that makes too few queries is likely to fail on this distribution. Our distribution is uniform over the collection from Corollary 14: As is larger than the query complexity, the algorithm does not have enough information to disambiguate between candidate inputs in and is therefore unlikely to be correct.
Proof of Proposition 12.
Choose at random from the set of functions in Corollary 14 instantiated with satisfying . Let denote the restriction of on the query set . Conditioned on the choice of , by the law of total expectation
where is the query complexity. By our choice of parameters this is at most .
The left-hand side upper bounds the probability that the algorithm succeeds. For conditioned on the view of and the choice of , its input is equally likely to have been any function in the set . However, the output of could be accurate for at most one function in this set: If
by Pythagoras’ theorem . By the triangle inequality and are -close, so it must be that . Therefore succeeds with probability at most .
4.2 Trading accuracy for queries
We describe a reduction that, given a high-accuracy learner $A$ that requires many queries, learns with lower accuracy from fewer queries.
When $A$ makes a query, the reduction flips a coin with success probability $\delta$. If the coin flip succeeds, the reduction forwards the query to $f$ and returns the answer to $A$. If it fails, the reduction answers the query with a uniformly random sign. When $A$ outputs its hypothesis, the reduction rescales it and returns the result.
The reduction can be implemented in any query model (random samples, non-adaptive, adaptive). We show that it is effective in the context of approximately learning linear approximations.
We show correctness for Boolean-valued functions as this matches our application, but the proposition should hold more generally under a suitable norm bound.
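A schematic of the query-forwarding wrapper in Python. This is a hedged sketch: the class and function names are ours, and the final rescaling of the affine hypothesis by $1/\delta$ is our reading of the last step of the reduction.

```python
import numpy as np

class QueryThinner:
    """Wraps an oracle for f: with probability delta, forward the query;
    otherwise answer with a fresh uniformly random sign."""
    def __init__(self, oracle, delta, rng):
        self.oracle, self.delta, self.rng = oracle, delta, rng
        self.forwarded = 0                      # queries actually sent to f

    def query(self, x):
        if self.rng.random() < self.delta:
            self.forwarded += 1
            return self.oracle(x)
        return self.rng.choice([-1, 1])

def run_reduction(learner, oracle, delta, rng):
    """Run the high-accuracy learner against the thinned oracle and rescale
    the affine hypothesis it returns (a vector of coefficients) by 1/delta."""
    thinned = QueryThinner(oracle, delta, rng)
    h = learner(thinned.query)                  # learner returns its hypothesis coefficients
    return np.asarray(h) / delta, thinned.forwarded
```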
Proposition 15.
If makes queries and outputs such that given Boolean-valued as input, then makes at most queries except with probability , and outputs such that except with probability .
Proof.
Query complexity.
The number of queries is a random variable. By a Chernoff bound it exceeds with probability at most .
Correctness.
The input to $A$ provided by the reduction is indistinguishable from a function obtained by corrupting $f$ by noise, namely the pointwise product of $f$ with independent $\delta$-biased bits. By the triangle inequality,
By our choice of , the first term on the right is at most , so it suffices to upper bound the second one by that much. In expectation, using Parseval’s identity,
| (8) |
By Markov’s inequality, except with probability .
As an aside, the reduction can be implemented in a "complexity-preserving" manner in the sense that if $A$ is only guaranteed to work on "low-complexity" inputs (e.g. those of small circuit complexity), then the reduction works on all inputs of slightly lower complexity. The reason is that Proposition 15 (specifically (8)) only relies on the pairwise independence of the noise.
Proof of Theorem 11.
Fix and let be the corresponding from Proposition 12. Suppose succeeds with higher probability. By Proposition 15 makes queries, succeeds with probability , and outputs an -approximation, violating Proposition 12.
An adaptive query lower bound follows from the same argument.
4.3 Proof of Lemma 13
Claim 16.
For every there exists a Boolean function such that for sufficiently large , where , assuming .
The proof uses the following special case of Khintchine’s inequality:
Fact 17 (Khintchine’s inequality).
For every linear function , .
Proof of 16.
We show that a random from a suitable distribution satisfies the desired property with nonzero probability. For each , is a random outcome with bias
The values of are chosen independently conditioned on for every . (This folding simplifies the proof a little bit.) The construction is illustrated in Figure 1. By construction, is balanced. By Parseval’s identity,
| (9) |
We analyze the concentration of . In expectation, and by linearity of expectation,
where is the function . As is a symmetric function, its first-level Fourier coefficients are equal and must therefore have absolute value
| by Parseval’s identity | ||||
| by Cauchy-Schwarz | ||||
| by 17 | ||||
| by Hoeffding’s bound | ||||
| by the assumption on . | ||||
As is the average of independent outcomes, by Hoeffding’s bound
Assuming is sufficiently large so that is less than , by the triangle inequality and a union bound, for all with positive probability. By (9) there must exist a choice of for which .
Proof of Lemma 13.
Let $C$ be a code over $\{-1,1\}^n$ of constant rate and constant minimum relative Hamming distance; such codes exist for sufficiently large $n$ by the Gilbert–Varshamov bound. Let $(f, \ell)$ be the pair of functions from Claim 16. The collection consists of the pairs indexed by the codewords $c \in C$, obtained from $(f, \ell)$ by precomposing with the shift $x \mapsto cx$.
As is a permutation of , proving property 1. As for property 2, by Parseval’s identity, for ,
by our choice of parameters.
5 Estimation from uniform samples
The queries in Algorithm 1 come in correlated pairs $(x, y)$. They can be emulated from independent uniform pairs after reweighting by the likelihood ratio, thereby preserving the bias of the estimator. However, the variance of the likelihood ratio grows exponentially in $\rho^2 n$, leading to a comparable blowup in the estimator variance. Such an algorithm is not query efficient, as even the seemingly optimal choice of parameters fails to attain sublinear complexity.
5.1 The algorithm
Instead of adapting Algorithm 1, our pair-based sample mean estimator directly calculates the empirical mean of the unbiased estimator $f(x)f(y)\langle x, y\rangle$ averaged over all pairs of samples. The reuse of samples creates correlations among the terms, but their covariances are dominated by the individual variances.
The algorithm can be implemented in time linear in the bit complexity of the samples by evaluating an algebraically equivalent formula that avoids iterating over all pairs.
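One natural reading of the pairwise estimator and its linear-time evaluation, as a Python sketch (the exact normalization in the paper's Algorithm 2 may differ; the Majority test function is ours):

```python
import numpy as np

def estimate_W1_from_samples(X, y):
    """Pair-based estimate of W^1[f] from samples (x_j, f(x_j)): the average of
    f(x_j) f(x_l) <x_j, x_l> over all ordered pairs j != l, computed in linear
    time as (||sum_j f(x_j) x_j||^2 - sum_j f(x_j)^2 ||x_j||^2) / (q (q - 1))."""
    q, n = X.shape
    s = (y[:, None] * X).sum(axis=0)                  # sum_j f(x_j) x_j
    diag = np.sum((y ** 2) * np.sum(X ** 2, axis=1))  # the j = l terms to subtract
    return (s @ s - diag) / (q * (q - 1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, q = 50, 4000
    X = rng.choice([-1, 1], size=(q, n))
    y = np.sign(X[:, 0] + X[:, 1] + X[:, 2])          # labels of MAJ3: W^1 = 3/4
    print(estimate_W1_from_samples(X, y))
```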
Theorem 18.
For independent and uniform samples, with probability at least .
Proof.
When $x$ and $y$ are independent and uniform, each term $f(x)f(y)\langle x, y\rangle$ is an unbiased estimator of $\mathbf{W}^1[f]$, and therefore so is their average.
The variance of the pairwise average is the normalized sum of the covariances of the terms indexed by pairs of sample pairs. Only intersecting pairs have a nonzero contribution. There are $O(q^3)$ and $O(q^2)$ pairs of intersection size one and two, respectively.
For independent $x$, $y$, and $y'$, the intersection-one covariance is bounded using the Cauchy–Schwarz inequality followed by Khintchine's inequality (Fact 17), and the intersection-two covariance is bounded using Cauchy–Schwarz, Plancherel's formula, and Fact 17.
Summing these contributions bounds the variance as required. The conclusion follows by applying Chebyshev's inequality to the pairwise average.
5.2 Lower bound
We show that for sample-based algorithms (unlike query-based ones) the dependence on the dimension is inherent for this problem. More precisely, the following theorem says that Algorithm 2 is optimal in the sample model.
Theorem 19.
For every (possibly randomized) there exists so that when given independent examples with uniform , outputs a value within of with probability at most .
The proof in fact shows that the query complexity bound in Theorem 18 is tight in all parameters, even if is weakened to in the statement.
To prove Theorem 19, we construct a function that is weakly pseudorandom given few examples but whose linear weight $\mathbf{W}^1$ is likely to be bounded away from zero. In contrast, for a random function $\mathbf{W}^1$ is concentrated near zero.
The function is the sign of the sum of a random linear function and an independent noise term,
where the linear function has coefficients given by an $n$-dimensional standard normal vector, and the noise is a standard normal function over $\{-1,1\}^n$ (its values are independent standard normals as its argument ranges over $\{-1,1\}^n$), suitably scaled.
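A Python sketch of this construction (the relative scaling $\sqrt{n}$ of the noise term and the caching of its values are our illustrative choices, not the paper's exact parameters). Estimating its $\mathbf{W}^1$ with the pair-based estimator of Section 5.1 shows a value bounded away from zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
g = rng.standard_normal(n)                 # hidden standard normal coefficients
noise = {}                                 # a consistent "standard normal function" over {-1,1}^n

def F(x):
    """sign(<g, x> + sqrt(n) * h(x)) with h a fixed random Gaussian function."""
    key = x.tobytes()
    if key not in noise:
        noise[key] = rng.standard_normal()
    return np.sign(x @ g + np.sqrt(n) * noise[key])

q = 3000
X = rng.choice([-1, 1], size=(q, n))
y = np.array([F(x) for x in X])
s = (y[:, None] * X).sum(axis=0)           # sum_j F(x_j) x_j
print((s @ s - q * n) / (q * (q - 1)))     # roughly a constant, not ~n/2^n as for a random function
```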
Claim 20.
The statistical distance between when and is a random function is at most .
Proof.
It is sufficient to show the claim for the real-valued functions and , as and a random Boolean function are obtained by taking signs of and , respectively, and postprocessing cannot increase statistical distance.
We will assume without loss of generality that are all distinct as repetitions can only decrease statistical distance. Fixing , are jointly centered Gaussian with covariance matrix
The covariance matrix of is the identity . The conditional statistical distance given is at most [4]
By Cauchy-Schwarz, the average, unconditional statistical distance is at most
Claim 21.
For every such that , except with probability .
Proof.
The expression inside the expectation is (up to a factor) of the form for . For fixed and standard normal
| (10) |
As the right-hand side is always nonnegative,
By independence of the values across ,
The claim follows from Chebyshev’s inequality.
Claim 22.
except with probability .
Proof.
By the Cauchy-Schwarz inequality, for any linear function ,
The function has squared -norm , which is between and except with probability . Applying Claim 21,
Claim 23.
For a random Boolean function , except with probability .
Proof.
By symmetry the expectation is . The bound follows from Markov’s inequality.
Proof of Theorem 19.
Applying Claims 22 with in the definition of scaled up by a suitable constant factor, except with probability . By Claim 23, except with probability .
Let be the distinguisher that accepts if ’s output is greater than and rejects if not. If is correct with probability at least on every input, then accepts with probability at least while accepts with probability at most . As ’s distinguishing advantage cannot exceed ’s, the statistical distance between ’s views on inputs and is at least . By Claim 20 it is at most , from where .
References
- [1] Mitali Bafna, Srikanth Srinivasan, and Madhu Sudan. Local decoding and testing of polynomials over grids. Random Struct. Algorithms, 57(3):658–694, 2020. doi:10.1002/RSA.20933.
- [2] Shai Ben-David and Ruth Urner. The sample complexity of agnostic learning under deterministic labels. In Proceedings of The 27th Conference on Learning Theory, volume 35, pages 527–542, 13–15 June 2014. URL: https://proceedings.mlr.press/v35/ben-david14.html.
- [3] Andrej Bogdanov and Gautam Prakriya. Direct Sum and Partitionability Testing over General Groups. In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021), volume 198, pages 33:1–33:19, 2021. doi:10.4230/LIPIcs.ICALP.2021.33.
- [4] Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional gaussians with the same mean. arXiv preprint, 2018. arXiv:1810.08693.
- [5] Noah Fleming and Yuichi Yoshida. Distribution-free testing of linear functions on $\mathbb{R}^n$. arXiv preprint, 2019. arXiv:1909.03391.
- [6] Shirley Halevy and Eyal Kushilevitz. Distribution-free property-testing. SIAM J. Comput., 37(4):1107–1138, 2007. doi:10.1137/050645804.
- [7] Subhash Khot and Dana Moshkovitz. NP-hardness of approximately solving linear equations over reals. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 413–420, 2011. doi:10.1145/1993636.1993692.
- [8] Nathan Linial, Yishay Mansour, and Noam Nisan. Constant depth circuits, Fourier transform, and learnability. Journal of the ACM (JACM), 40(3):607–620, 1993. doi:10.1145/174130.174138.
- [9] Ryan O’Donnell. Analysis of boolean functions. Cambridge University Press, 2014.
- [10] Michal Parnas, Dana Ron, and Ronitt Rubinfeld. Tolerant property testing and distance approximation. J. Comput. Syst. Sci., 72(6):1012–1042, 2006. doi:10.1016/j.jcss.2006.03.002.
- [11] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, USA, 2014.
- [12] Ohad Shamir. The sample complexity of learning linear predictors with the squared loss. J. Mach. Learn. Res., 16(1):3475–3486, January 2015. doi:10.5555/2789272.2912110.
- [13] Madhur Tulsiani. Lecture 5: Information and coding theory. Lecture Notes, Winter 2021. https://home.ttic.edu/~madhurt/courses/infotheory2021/l5.pdf.
- [14] Tomas Vaškevičius and Nikita Zhivotovskiy. Suboptimality of constrained least squares and improvements via non-linear predictors. Bernoulli, 29, February 2023. doi:10.3150/22-BEJ1465.
