Local decoding and testing of polynomials over grids

The well-known DeMillo-Lipton-Schwartz-Zippel lemma says that $n$-variate polynomials of total degree at most $d$ over grids, i.e. sets of the form $A_1 \times A_2 \times \cdots \times A_n$, form error-correcting codes (of distance at least $2^{-d}$ provided $\min_i\{|A_i|\}\geq 2$). In this work we explore their local decodability and local testability. While these aspects have been studied extensively when $A_1 = \cdots = A_n = \mathbb{F}_q$ are the same finite field, the setting when $A_i$'s are not the full field does not seem to have been explored before. In this work we focus on the case $A_i = \{0,1\}$ for every $i$. We show that for every field (finite or otherwise) there is a test whose query complexity depends only on the degree (and not on the number of variables). In contrast we show that decodability is possible over fields of positive characteristic (with query complexity growing with the degree of the polynomial and the characteristic), but not over the reals, where the query complexity must grow with $n$. As a consequence we get a natural example of a code (one with a transitive group of symmetries) that is locally testable but not locally decodable. Classical results on local decoding and testing of polynomials have relied on the 2-transitive symmetries of the space of low-degree polynomials (under affine transformations). Grids do not possess this symmetry: So we introduce some new techniques to overcome this handicap and in particular use the hypercontractivity of the (constant weight) noise operator on the Hamming cube.


Introduction
Low-degree polynomials have played a central role in computational complexity. (See for instance [26,8,5,20,22,18,27,3,2] for some of the early applications.) One of the key properties of low-degree $n$-variate polynomials underlying many of the applications is the "DeMillo-Lipton-Schwartz-Zippel" distance lemma [10,25,28], which upper bounds the number of zeroes that a non-zero low-degree polynomial may have over "grids", i.e., over domains of the form $A_1 \times \cdots \times A_n$. This turns the space of polynomials into an error-correcting code (first observed by Reed [23] and Muller [19]) and many applications are built around this class of codes. These applications have also motivated a rich collection of tools including polynomial time (global) decoding algorithms for these codes, and "local decoding" [4,17,9] and "local testing" [24,1,14] procedures for these codes. Somewhat strikingly though, many of the tools associated with these codes don't work (at least not immediately) for all grid-like domains, but work only for the specific case of the domain being the vector space $\mathbb{F}^n$, where $\mathbb{F}$ is the field over which the polynomial is defined and $\mathbb{F}$ is finite. The simplest example of such a gap in knowledge was the case of "global decoding". Here, given a function $f : \prod_{i=1}^n A_i \to \mathbb{F}$ as a truth-table, the goal is to find a nearby polynomial (up to half the distance of the underlying code) in time polynomial in $|\prod_i A_i|$. When the domain equals $\mathbb{F}^n$, such algorithms date back to the 1950s. However, the case of general $A_i$ remained open till 2016, when Kim and Kopparty [16] finally solved this problem.
In this paper we initiate the study of local decoding and testing algorithms for polynomials when the domain is not a vector space. For uniformity, we consider the case of polynomials over hypercubes (i.e., when A i = {0, 1} ⊆ F for every i). We describe the problems formally next and then describe our results.

Distance, Local Decoding and Local Testing
We start with some brief notation. For finite sets $A_1, \ldots, A_n \subseteq \mathbb{F}$ and functions $f, g : A_1 \times \cdots \times A_n \to \mathbb{F}$, let the distance between $f$ and $g$, denoted $\delta(f, g)$, be the quantity $\Pr_a[f(a) \neq g(a)]$ where $a$ is drawn uniformly from $A_1 \times \cdots \times A_n$. We say $f$ is $\delta$-close to $g$ if $\delta(f, g) \leq \delta$, and $\delta$-far otherwise. For a family of functions $\mathcal{F} \subseteq \{h : A_1 \times \cdots \times A_n \to \mathbb{F}\}$, let $\delta(\mathcal{F}) = \min_{f \neq g \in \mathcal{F}} \{\delta(f, g)\}$.
To set the context for some of the results on local decoding and testing, we first recall the distance property of polynomials. If $|A_i| \geq 2$ for every $i$, the polynomial distance lemma asserts that the distance between any two distinct degree $d$ polynomials is at least $2^{-d}$. Of particular interest is the fact that for fixed $d$ this distance is bounded away from 0, independent of $n$ or $|\mathbb{F}|$ or the structure of the sets $A_i$. In turn this behavior has effectively led to "local decoding" and "local testing" algorithms with complexity depending only on $d$; we define these notions and elaborate on this sentence next.
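As a small sanity check of the distance lemma on the hypercube, the following Python sketch enumerates every non-zero multilinear polynomial of degree at most 2 in 3 variables whose coefficients lie in $\{-1, 0, 1\}$ (over the rationals) and reports the smallest fraction of points of $\{0,1\}^3$ on which such a polynomial is non-zero. The restriction to this coefficient set and all helper names are assumptions made purely for illustration.

```python
from fractions import Fraction
from itertools import combinations, product

def monomials(n, d):
    """All subsets of range(n) of size at most d (the multilinear monomials)."""
    return [S for size in range(d + 1) for S in combinations(range(n), size)]

def evaluate(coeffs, monos, x):
    """Evaluate sum_S coeffs[S] * prod_{i in S} x_i at a point x in {0,1}^n."""
    return sum(c for c, S in zip(coeffs, monos) if all(x[i] for i in S))

def min_nonzero_fraction(n, d, coeff_values=(-1, 0, 1)):
    """Smallest fraction of non-zero values over all non-zero polynomials tried."""
    monos = monomials(n, d)
    points = list(product((0, 1), repeat=n))
    best = Fraction(1)
    for coeffs in product(coeff_values, repeat=len(monos)):
        if not any(coeffs):
            continue  # skip the zero polynomial
        nonzero = sum(1 for x in points if evaluate(coeffs, monos, x) != 0)
        best = min(best, Fraction(nonzero, len(points)))
    return best

print(min_nonzero_fraction(3, 2))  # 1/4, attained e.g. by x_0 * x_1
```

The minimum equals $2^{-d}$ here because the monomial $x_0 x_1$ is non-zero on exactly a quarter of the cube, while the lemma rules out anything smaller.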
Given a family of functions F from the domain A 1 × · · · × A n to F, we say F is (δ, q)-locally decodable if there exists a probabilistic algorithm that, given a ∈ A 1 × · · · × A n and oracle access to a function f : A 1 × · · · × A n → F that is δ-close to some function p ∈ F, makes at most q oracle queries to f and outputs p(a) with probability at least 3/4. (The existence of a (δ, q)-local decoder for F in particular implies that δ(F) ≥ 2δ.) We say that F is (δ, q)-locally testable if there exists a probabilistic algorithm that makes q queries to an oracle for f : A 1 × · · · × A n → F and accepts with probability at least 3/4 if f ∈ F and rejects with probability at least 3/4 if f is δ-far from every function in F.
When $A_1 = \cdots = A_n = \mathbb{F}$ (and so $\mathbb{F}$ is finite) it was shown by Kaufman and Ron [14] (with similar results in Jutla et al. [13]) that the family of $n$-variate degree $d$ polynomials over $\mathbb{F}$ is $(\delta, q)$-locally decodable and $(\delta, q)$-locally testable for some $\delta = \exp(-d)$ and $q = \exp(d)$. In particular both $q$ and $1/\delta$ are bounded for fixed $d$, independent of $n$ and $\mathbb{F}$. Indeed in both cases $\delta$ is lower bounded by a constant factor of $\delta(\mathcal{F}(n, d))$ and $q$ is upper bounded by a polynomial in the inverse of $\delta(\mathcal{F}(n, d))$, where $\mathcal{F}(n, d)$ denotes the family of degree $d$ $n$-variate polynomials over $\mathbb{F}$, seemingly suggesting that the testability and decodability may be consequences of the distance. If so, this phenomenon should extend to the case of other sets $A_i \neq \mathbb{F}$. Does it? We explore this question in this paper.
In what follows we say that the family of degree $d$ $n$-variate polynomials is locally decodable (resp. testable) if there is a bounded $q = q(d)$ and a positive $\delta = \delta(d)$ such that $\mathcal{F}(n, d)$ is $(\delta, q)$-locally decodable (resp. testable) for every $n$. The specific question we address below is when the family of degree $d$ $n$-variate polynomials is locally decodable and testable when the domain is $\{0, 1\}^n$. (We stress that the choice of $\{0, 1\}^n$ as domain is partly for simplicity and is equivalent to the setting of $|A_i| = 2$ for all $i$. Working with domains of other (and varying) sizes would lead to quantitative changes and we do not consider that setting in this paper.)

Main Results
Our first result (Theorem 3.2) shows that even the space of degree 1 polynomials is not locally decodable over fields of zero characteristic or over fields of large characteristic. This statement already stresses the main difference between the vector space setting (domain being $\mathbb{F}^n$) and the "grid" setting (domain being $\{0, 1\}^n$). One key reason underlying this difference is that the domain $\mathbb{F}^n$ has a rich group of symmetries that preserve the space of degree $d$ polynomials, whereas the space of symmetries is much smaller when the domain is $\{0, 1\}^n$. Specifically the space of degree $d$ polynomials over $\mathbb{F}^n$ is "affine-invariant" (invariant under all affine maps from $\mathbb{F}^n$ to $\mathbb{F}^n$). The richness of this group of symmetries is well-known to lead to local decoding algorithms (see for instance [1]) and this explains the local decodability of $\mathcal{F}(n, d)$ over the domain $\mathbb{F}^n$. Of course the absence of this rich group of symmetries does not rule out local decodability, and so some work has to be done to establish Theorem 3.2. We give an overview of the proof in Section 1.3 and then give the proof in Section 5.
Our second result (Theorem 3.3) shows, in contrast, that the class of degree $d$ polynomials over fields of small characteristic is locally decodable. Specifically, we show that there is a $q = q(d, p) < \infty$ and $\delta = \delta(d, p) > 0$ such that $\mathcal{F}(n, d)$ over the domain $\{0, 1\}^n$ over a (possibly infinite) field $\mathbb{F}$ of characteristic $p$ is $(\delta, q)$-locally decodable. This is perhaps the first local-decodability result for polynomials over infinite fields. A key technical ingredient that leads to this result, which may be of independent interest, is that when $n = 2p^t$ (twice a power of the characteristic of $\mathbb{F}$) and $g$ is a degree $d$ polynomial for $d < n/2$, then $g(0)$ can be determined from the values of $g$ on the inputs of Hamming weight exactly $n/2$ (see Lemma 6.1). Again, we give an overview of the proof in Section 1.3 and then give the actual proof in Section 6.
Our final, and main technical, result (Theorem 3.1) shows somewhat surprisingly that F(n, d) is always (i.e., over all fields) locally testable. This leads to perhaps the simplest natural example of a locally testable code that is not locally decodable. We remark there are of course many examples of such codes (see, for instance, the locally testable codes of Dinur [11]) but these are results of careful constructions and in particular not very symmetric. On the other hand F(n, d) over {0, 1} n does possess moderate symmetry and in particular the automorphism group is transitive. We remark that for both our positive results (Theorems 3.3 and 3.1), the algorithms themselves are not obvious and the analysis leads to further interesting questions. We elaborate on these in the next section.

Overview of proofs
Impossibility of local decoding over fields of large characteristic. In Section 5 we show that even the family of affine functions over $\{0, 1\}^n$ is not locally decodable. The main idea behind this construction and proof is to show that the value of an affine function $\ell : \{0, 1\}^n \to \mathbb{F}$ at $1^n$ cannot be determined from its values on any set $S$ if $|S|$ is small (specifically $|S| = o(\log n/\log\log n)$) and $S$ contains only "balanced" elements (i.e., $x \in S \Rightarrow |\sum_i x_i - (n/2)| = O(\sqrt{n})$). Since the space of affine functions from $\{0, 1\}^n$ to $\mathbb{F}$ forms a vector space, this in turn translates to showing that no set of up to $|S|$ balanced vectors contains the vector $1^n$ in its affine span (over $\mathbb{F}$), and we prove this in Lemma 5.2. Going from Lemma 5.2 to Theorem 5.3 is relatively standard in the case of finite fields. We show that if one picks a random linear function and simply erases its values on imbalanced inputs, this leads to only a small fraction of error, but its value at $1^n$ is not decodable with $o(\log n/\log\log n)$ queries. (Indeed many of the ingredients go back to the work of [6], who show that a canonical non-adaptive algorithm is effectively optimal for linear codes, though their results are stated in terms of local testing rather than local decoding.) In the case of infinite fields one has to be careful since one cannot simply work with functions that are chosen uniformly at random. Instead we work with random linear functions with bounded coefficients. The bound on the coefficients leads to mild complications due to border effects that need care. In Section 5.2 we show how to overcome these complications using a counting (or encoding) argument.
The technical heart of this part is thus the proof of Lemma 5.2 and we give some idea of this proof next. Suppose $S = \{x^1, \ldots, x^t\}$ contained $x^0 = 1^n$ in its affine span and suppose $|\sum_{j=1}^n x^i_j - (n/2)| \leq n/s$ for all $i$. Let $a_1, \ldots, a_t \in \mathbb{F}$ be coefficients such that $x^0 = \sum_i a_i x^i$ with $\sum_i a_i = 1$. Our proof involves reasoning about the size of the coefficients $a_1, \ldots, a_t$. To get some intuition why this may help, note that $n/2 = \sum_{j=1}^n x^0_j - (n/2) = \sum_i a_i \left(\sum_{j=1}^n x^i_j - (n/2)\right) \leq (n/s) \cdot \sum_i |a_i|$. So in particular if the $a_j$'s are small, specifically if $|a_j| \leq 1$, then we conclude $t = \Omega(s)$. But what happens if large $a_j$'s are used? To understand this, we first show that the coefficients need not be too large (as a function of $t$); see Lemma 5.1. We then use this to prove Lemma 5.2. The details are in Section 5.1.
Local decodability over fields of small characteristic. The classical method to obtain a $q$-query local decoder is to find, given a target point $x^0 \in \mathbb{F}^n$, a distribution on queries $x^1, \ldots, x^q \in \mathbb{F}^n$ such that (1) $P(x^0)$ is determined by $P(x^1), \ldots, P(x^q)$ for every degree $d$ polynomial $P$, and (2) the query $x^i$ is independent of $x^0$ (so that an oracle $f$ that usually equals $P$ will satisfy $P(x^i) = f(x^i)$ for all $i$, with probability at least 3/4). Classical reductions used the "2-transitivity" of the underlying space of automorphisms to guarantee that $x^i$ is independent of $x^j$ for every pair $i \neq j \in \{0, \ldots, q\}$, a stronger property than required! Unfortunately, our automorphism space is not "2-transitive" but it turns out we can still find a distribution that satisfies the minimal needs.
Specifically, in our reduction we identify a parameter $k = k(p, d)$ and map each variable $x_\ell$ to either $y_j$ or $1 - y_j$ for some $j = j(\ell) \in [k]$. This reduces the $n$-variate decoding task with oracle access to $f(x_1, \ldots, x_n)$ to a $k$-variate decoding task with access to the function $g(y_1, \ldots, y_k)$. Since there are only $2^k$ distinct inputs to $g$, decoding can be solved with at most $2^k$ queries (if it can be solved at all). The choice of whether $x_\ell$ is mapped to $y_j$ or $1 - y_j$ is determined by $x^0_\ell$ so that $f(x^0) = g(0^k)$. Thus given $x^0$, the only randomness is in the choice of $j(\ell)$. We choose $j(\ell)$ uniformly and independently from $[k]$ for each $\ell$. For $y \in \{0, 1\}^k$, let $x^y$ denote the corresponding query in $\{0, 1\}^n$ (i.e., $g(y) = f(x^y)$). Given our choices, $x^y$ is not independent of $x^0$ for every choice of $y$. Indeed if $y$ has Hamming weight 1, then $x^y$ is very likely to have Hamming distance $\approx n/k$ from $x^0$, which is far from independent. However if $y \in \{0, 1\}^k$ is a balanced vector with exactly $k/2$ 1s (so in particular we will need $k$ to be even), then it turns out $x^y$ is indeed independent of $x^0$. So we query only those $x^y$ for which $y$ is balanced. But this leads to a new challenge: can $P(0^k)$ be determined from the values of $P(y)$ for balanced $y$s? It turns out that for a careful choice of $k$ (and this is where the small characteristic plays a role) the value of a degree $d$ polynomial at 0 is indeed determined by its values on balanced inputs (see Lemma 6.1) and this turns out to be sufficient to build a decoding algorithm over fields of small characteristic. Details may be found in Section 6.
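To make the role of the characteristic concrete, the following Python sketch checks, by linear algebra over $\mathrm{GF}(p)$, whether the value at $0^k$ of a degree-$d$ multilinear polynomial is a linear function of its values on balanced inputs. It is only an illustration on tiny parameters, not the proof of Lemma 6.1; the function names and the rank-based criterion (the evaluation functional at $0^k$ lies in the span of the balanced evaluation functionals) are our own choices for this sketch.

```python
from itertools import combinations, product

def monomials(n, d):
    """All subsets of range(n) of size at most d (the multilinear monomials)."""
    return [S for size in range(d + 1) for S in combinations(range(n), size)]

def eval_row(x, monos):
    """Monomial values at x; a polynomial's value at x is <row, coefficients>."""
    return [int(all(x[i] for i in S)) for S in monos]

def rank_mod_p(rows, p):
    """Rank of an integer matrix over GF(p), p prime, by Gaussian elimination."""
    rows = [[v % p for v in row] for row in rows]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)  # modular inverse (Python 3.8+)
        rows[rank] = [(v * inv) % p for v in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                factor = rows[i][col]
                rows[i] = [(u - factor * v) % p for u, v in zip(rows[i], rows[rank])]
        rank += 1
    return rank

def zero_determined_by_balanced(k, d, p):
    """True iff P(0^k) is a GF(p)-linear function of P on weight-k/2 inputs."""
    monos = monomials(k, d)
    balanced = [x for x in product((0, 1), repeat=k) if sum(x) == k // 2]
    rows = [eval_row(x, monos) for x in balanced]
    target = eval_row((0,) * k, monos)
    return rank_mod_p(rows, p) == rank_mod_p(rows + [target], p)

print(zero_determined_by_balanced(4, 1, 2))  # True: k = 2 * 2, characteristic 2
print(zero_determined_by_balanced(4, 1, 3))  # False: characteristic 3
```

In characteristic 3 the affine function $1 + y_1 + y_2 + y_3 + y_4$ vanishes on every balanced input of $\{0,1\}^4$ but equals 1 at the origin, which is why the second check fails.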
Local testability over all fields. We now turn to the main technical result of the paper, namely the local testability of polynomials over grids. All previous analyses of local testability of polynomials with query complexity independent of the number of variables have relied on symmetry either implicitly or explicitly. (See for example [15] for further elaboration.) Furthermore many also depend on the local decodability explicitly; and in our setting we seem to have insufficient symmetry and definitely no local decodability. This forces us to choose the test and analysis quite carefully.
It turns out that among existing approaches to analyses of local tests, the one due to Bhattacharyya et al. [7] (henceforth BKSSZ) seems to make the least use of local decodability, and our hope is to be able to simulate this analysis in our case. But the question remains: "which tester should we use?". This is a non-trivial question since the BKSSZ test is a natural one in a setting with sufficient symmetry, but their analysis relies crucially on the ability to view their test as a sequence of restrictions: Given a function $f : \mathbb{F}^n \to \mathbb{F}$ they produce a sequence of functions $f = f_n, f_{n-1}, \ldots, f_k$, where the function $f_r$ is an $r$-variate function obtained by restricting $f_{r+1}$ to a codimension one affine subspace. Their test finally checks to see if $f_k$ is a degree $d$ polynomial. To emulate this analysis, we design a somewhat artificial test: We also produce a sequence of functions $f_n, f_{n-1}, \ldots, f_k$ with $f_r$ being an $r$-variate function. Since we do not have the luxury to restrict to arbitrary subspaces, we instead derive $f_r$ from $f_{r+1}(z_1, \ldots, z_{r+1})$ by setting $z_i = z_j$ or $z_i = 1 - z_j$ for some random pair $i, j$ (since these are the only simple affine restrictions that preserve the domain). We stop when the number of variables $k$ is small enough (and hopefully a number depending on $d$ alone and not on $n$ or $\mathbb{F}$). We then test that the final function has degree $d$.
The analysis of this test is not straightforward even given previous works, but we are able to adapt the analyses to our setting. Two new ingredients that appear in our analyses are the hypercontractivity of the hypercube with the constant weight noise operator (analyzed by Polyanskiy [21]) and the intriguing stochastics of a random set-union problem. We explain our analysis and where the above appear next.
We start with the part which is more immediate from the BKSSZ analysis. This corresponds to a key step in the BKSSZ analysis where it is shown that if $f_{r+1}$ is far from degree $d$ polynomials then, with high probability, so is $f_r$. This step is argued via contradiction. If $f_r$ is close to the space of degree $d$ polynomials for many restrictions, then from the many polynomials that agree with $f_r$ (for many of the restrictions) one can glue together an $(r+1)$-variate polynomial that is close to $f_{r+1}$. This step is mostly algebraic and works out in our case also, though the actual algebra is different and involves more cases. (See Lemma 4.6 and its proof in Section 4.2.) The new part in our analysis is in the case where $f_n$ is moderately close to some low-degree polynomial $P$. In this case we would still like to show that the test rejects $f_n$ with positive probability. In both BKSSZ and in our analysis this is shown by showing that the $2^k$ queries into $f_n$ (that give the entire truth table of the function $f_k$) satisfy the property that $f_n$ is not equal to $P$ on exactly one of the queried points. Note that the value of $f_k(y)$ is obtained by querying $f$ at some point, which we denote $x^y$. In the BKSSZ analysis $x^a$ and $x^b$ are completely independent given $a \neq b \in \{0, 1\}^k$. (Note that the mapping from $y$ to $x^y$ is randomized and depends on the random choices of the tester.) In our setting the behavior of $x^a$ and $x^b$ is more complex and depends on both the set of coordinates $j$ such that $a_j \neq b_j$ and on the number of indices $i \in [n]$ such that the variable $x_i$ is mapped to variable $y_j$. Our analysis ends up depending on two new ingredients: (1) The number of variables $x_i$ that map to any particular variable $y_j$ is $\Omega(n/k)$ with probability at least $2^{-O(k)}$ (see Corollary 4.9). This part involves the analysis of a random set-union process elaborated on below.
(2) Once the exact number of indices $i$ such that $x_i$ maps to $y_j$ is fixed for every $j \in [k]$ and none of the sets is too small, the distribution of $x^a$ and $x^b$ is sufficiently independent to ensure that the events $f(x^a) \neq P(x^a)$ and $f(x^b) \neq P(x^b)$ co-occur with probability much smaller than the individual probabilities of these events. This part uses the hypercontractivity of the hypercube but under an unusual noise operator corresponding to the "constant weight operator", fortunately analyzed by Polyanskiy [21]. Invoking his theorem we are able to conclude the proof of this section.
We now briefly expand on the "random set-union" process alluded to above. Recall that our process starts with $n$ variables, and at each stage a pair of remaining variables is identified and given the same name. (We may ignore the complications due to the complementation of the form $z_i = 1 - z_j$ for this part.) Equivalently we start with $n$ sets $X_1, \ldots, X_n$ with $X_i = \{i\}$ initially. We then pick two random sets and merge them. We stop when there are $k$ sets left and our goal is to understand the likelihood that one of the sets turns out to be too tiny. (The expected size of a set is $n/k$ and too tiny corresponds to being smaller than $n/(4k)$.) It turns out that the distribution of set sizes produced by this process has a particularly clean description as follows: Randomly arrange the elements 1 to $n$ on a cycle and consider the partition into $k$ sets in which each set starts at a special element and ends just before the next special element as we go clockwise around the cycle, where the elements in $\{1, \ldots, k\}$ are the special ones. The sizes of the parts of this partition are distributed identically to the sizes of the $k$ sets produced by the merging process! For example, when $k = 2$ the two sets have sizes distributed uniformly from 1 to $n - 1$. In particular the set sizes are not strongly concentrated around $n/k$; nevertheless the probability that no set is tiny is not too small, and this suffices for our analysis.
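The merging process is small enough to analyze exactly for tiny $n$. The Python sketch below computes, with exact rational arithmetic, the distribution of the multiset of set sizes once $k$ sets remain, assuming (as in the description above) that each unordered pair of current sets is merged with equal probability. The function names are ours and the code is meant only as an illustration of the process, not as part of the analysis.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations

def final_size_profile(n, k):
    """Exact distribution of the sorted tuple of set sizes when k sets remain."""
    dist = {(1,) * n: Fraction(1)}
    for m in range(n, k, -1):              # m = current number of sets
        num_pairs = m * (m - 1) // 2
        nxt = defaultdict(Fraction)
        for sizes, prob in dist.items():
            for i, j in combinations(range(m), 2):   # each pair equally likely
                rest = [s for t, s in enumerate(sizes) if t != i and t != j]
                merged = tuple(sorted(rest + [sizes[i] + sizes[j]]))
                nxt[merged] += prob / num_pairs
        dist = dict(nxt)
    return dist

def random_set_size(n, k):
    """Distribution of the size of one uniformly chosen set among the final k."""
    out = defaultdict(Fraction)
    for sizes, prob in final_size_profile(n, k).items():
        for s in sizes:
            out[s] += prob / k
    return dict(out)

print(random_set_size(6, 2))
```

For $n = 6$ and $k = 2$ every size from 1 to 5 receives probability exactly $1/5$, matching the uniform behavior described above for $k = 2$.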
Details of this analysis may be found in Section 4.
Organization. In Section 2 we start with some preliminaries including the main definitions and some of the tools we will need later. In Section 3 we give a formal statement of our results. In Section 4 we present and analyze the local tester over all fields. In Section 5 we show that over fields of large (or zero) characteristic, local decoding is not possible. Finally in Section 6 we give a local decoder and its analysis over fields of small characteristic.

Basic notation
Fix a field F and an n ∈ N. We consider functions f : {0, 1} n → F that can be written as multilinear polynomials of total degree at most d. We denote this space by F(n, d; F). The space of all functions from {0, 1} n to F will be denoted simply as F(n; F). (We will simplify these to F(n, d) and F(n) respectively, if the field F is clear from context.) Given f, g ∈ F(n), we use δ(f, g) to denote the fractional Hamming distance between f and g.

Local Testers and Decoders
Let F be any field. We define the notion of a local tester and local decoder for subspaces of F(n).
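Definition 2.1 (Local tester). Fix $q \in \mathbb{N}$ and $\delta \in (0, 1)$. Let $\mathcal{F}'$ be any subspace of $\mathcal{F}(n)$.
We say that a randomized algorithm $T$ is a $(\delta, q)$-local tester for $\mathcal{F}'$ if on an input $f \in \mathcal{F}(n)$, the algorithm does the following.
• $T$ makes at most $q$ non-adaptive queries to $f$ and either accepts or rejects.
• If $f \in \mathcal{F}'$, then $T$ accepts with probability at least 3/4.
• If $f$ is $\delta$-far from every function in $\mathcal{F}'$, then $T$ rejects with probability at least 3/4.
We say that a tester is adaptive if the queries it makes to the input $f$ depend on the answers to its earlier queries. Otherwise, we say that the tester is non-adaptive.
Definition 2.2 (Local decoder). Fix $q \in \mathbb{N}$ and $\delta \in (0, 1)$. Let $\mathcal{F}'$ be any subspace of $\mathcal{F}(n)$.
We say that a randomized algorithm $T$ is a $(\delta, q)$-local decoder for $\mathcal{F}'$ if on an input $f \in \mathcal{F}(n)$ and $x \in \{0, 1\}^n$, the algorithm does the following.
• $T$ makes at most $q$ queries to $f$ and outputs $b \in \mathbb{F}$.
• If $f$ is $\delta$-close to some $P \in \mathcal{F}'$, then $b = P(x)$ with probability at least 3/4.
We say that a decoder is adaptive if the queries it makes to the input $f$ depend on the answers to its earlier queries. Otherwise, we say that the decoder is non-adaptive.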

Some basic facts about binomial coefficients
where H(·) is the binary entropy function.

Hypercontractivity theorem for spherical averages.
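In this section, let $\mathbb{R}$ be the underlying field. Let $\eta \in (0, 1)$ be arbitrary. We define a smoothing operator $T_\eta$, which maps $\mathcal{F}(r)$ to itself, by $T_\eta F(x) = \mathbb{E}_J[F(x \oplus J)]$, where $J$ is a uniformly random subset of $[r]$ of size $\eta r$ and $x \oplus J$ is the point $y \in \{0, 1\}^r$ obtained by flipping $x$ at exactly the coordinates in $J$.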
Recall that for any $F \in \mathcal{F}(r)$ and any $p \geq 1$, $\|F\|_p = \left(\mathbb{E}_{x \in \{0,1\}^r}[|F(x)|^p]\right)^{1/p}$. We will use the following hypercontractivity theorem of Polyanskiy [21].
, 1} r are chosen as follows: x ′ ∈ {0, 1} r and I ′ ∈ [r] ηr are chosen i.u.a.r., and we set x ′′ = x ′ ⊕ I ′ . Then we have where C is the constant from Theorem 2.4.
Proof. Let F : {0, 1} n → {0, 1} ⊆ R be the indicator function of the set E. Note that we have By the Cauchy-Schwarz inequality and Theorem 2.4 we get where for the last inequality we have used the fact that for η 0 ∈ [0, 1] we have Putting the upper bound on F p together with the fact that F 2 ≤ √ δ and (1), we get the claim.

Results
We show upper and lower bounds for testing and decoding polynomial codes over grids. All our upper bounds hold in the non-adaptive setting, while our lower bounds hold in the stronger adaptive setting.
Our first result is that for any choice of the field F (possibly even infinite), the space of functions F(n, d) is locally testable. More precisely, we show the following.
Theorem 3.1 ($\mathcal{F}(n, d)$ has a local tester for any field). Let $\mathbb{F}$ be any field. Fix a positive integer $d$ and any $n \in \mathbb{N}$. Then the space $\mathcal{F}(n, d; \mathbb{F})$ has a non-adaptive $(\varepsilon, q)$-local tester for $q = 2^{O(d)} \cdot \mathrm{poly}(1/\varepsilon)$.
In contrast, we show that the space F(n, d) is not locally decodable over fields of large characteristic, even for d = 1.
Theorem 3.2 ($\mathcal{F}(n, d)$ does not have a local decoder for large characteristic). Let $n \in \mathbb{N}$ be a growing parameter. Let $\mathbb{F}$ be any field such that either $\mathrm{char}(\mathbb{F}) = 0$ or $\mathrm{char}(\mathbb{F}) \geq n^2$. Then any adaptive $(\varepsilon, q)$-local decoder for $\mathcal{F}(n, 1; \mathbb{F})$ that corrects an $\varepsilon$ fraction of errors must satisfy $q = \Omega_\varepsilon(\log n/\log\log n)$.
Complementing the above result, we can show that if $\mathrm{char}(\mathbb{F})$ is a constant, then in fact the space $\mathcal{F}(n, d)$ does have a local decoding procedure.
Theorem 3.3 ($\mathcal{F}(n, d)$ has a local decoder for constant characteristic). Let $\mathrm{char}(\mathbb{F}) = p$ be a positive constant. Fix any $d, n \in \mathbb{N}$. There is a $k \leq pd$ such that the space $\mathcal{F}(n, d; \mathbb{F})$ has a non-adaptive $(1/2^{O(k)}, 4^k)$-local decoder.

A local tester for F (n, d) over any field
We now present our local tester and its analysis. The reader may find the overview from Section 1.3 helpful while reading the below.
We start by introducing some notation for this section. Throughout, fix any field F. We consider functions f : {0, 1} I → F where I is a finite set of positive integers and indexes into the set of variables {X i | i ∈ I}. We denote this space as F(I). Similarly, F(I, d) is defined to be the space of functions of degree at most d over the variables indexed by I.
The following is the test we use to check if a given function f : {0, 1} I → F is close to F(I, d).
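Test $T_{k,I}(f_I)$
Notation. Given two variables $X$ and $Y$ and $a \in \{0, 1\}$, "replacing $X$ by $a \oplus Y$" refers to substituting $X$ by $Y$ if $a = 0$ and by $1 - Y$ if $a = 1$.
• If $|I| > k$, then
- Choose a random $a \in \{0, 1\}$ and distinct $i_0, j_0 \in I$ at random and replace $X_{j_0}$ by $a \oplus X_{i_0}$. Let $f_{I'}$ denote the resulting restriction of $f_I$, where $I' = I \setminus \{j_0\}$.
- Run $T_{k,I'}(f_{I'})$ and output what it outputs.
• If $|I| = k$, then
- Choose a uniformly random bijection $\sigma : I \to [k]$ and a uniformly random $a \in \{0, 1\}^I$, and replace each $X_i$ ($i \in I$) by $a_i \oplus Y_{\sigma(i)}$.
- Check if the restricted function $g(Y_1, \ldots, Y_k) \in \mathcal{F}(k, d)$ by querying $g$ on all its inputs. Accept if so and reject otherwise.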
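The restriction test described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' reference implementation: the oracle is an integer-valued function on $\{0,1\}^n$, the final renaming step is taken to be a uniformly random bijection onto $Y_1, \ldots, Y_k$ together with independent random complementations, and the degree check uses Möbius inversion to read off multilinear coefficients. All function names are hypothetical.

```python
import random
from itertools import combinations, product

def has_degree_at_most(g, k, d):
    """Check that g: {0,1}^k -> numbers has multilinear degree <= d.

    The coefficient of prod_{i in S} Y_i is sum_{T subset S} (-1)^{|S|-|T|} g(1_T),
    so g has degree <= d iff this vanishes for every S with |S| > d.
    """
    for size in range(d + 1, k + 1):
        for S in combinations(range(k), size):
            coeff = 0
            for t in range(size + 1):
                for T in combinations(S, t):
                    point = tuple(1 if i in T else 0 for i in range(k))
                    coeff += (-1) ** (size - t) * g[point]
            if coeff != 0:
                return False
    return True

def run_test(f, n, k, d, rng):
    """One run of the restriction test on an oracle f: {0,1}^n -> integers."""
    root = list(range(n))   # invariant: X_i equals X_{root[i]} xor flip[i]
    flip = [0] * n
    alive = list(range(n))
    while len(alive) > k:
        i0, j0 = rng.sample(alive, 2)      # two distinct surviving variables
        a = rng.randrange(2)
        for i in range(n):                 # replace X_{j0} by a xor X_{i0}
            if root[i] == j0:
                root[i], flip[i] = i0, flip[i] ^ a
        alive.remove(j0)
    # Assumed base case: random renaming to Y_1..Y_k with random complements.
    rng.shuffle(alive)
    sigma = {var: pos for pos, var in enumerate(alive)}
    b = [rng.randrange(2) for _ in range(k)]
    g = {}
    for y in product((0, 1), repeat=k):
        x = tuple(y[sigma[root[i]]] ^ b[sigma[root[i]]] ^ flip[i] for i in range(n))
        g[y] = f(x)                        # one oracle query per y
    return has_degree_at_most(g, k, d)

def low_degree(x):
    """A degree-2 polynomial over the integers, used as a sample oracle."""
    return x[0] * x[1] + 3 * x[2] - 1

print(all(run_test(low_degree, 8, 4, 2, random.Random(seed)) for seed in range(20)))
```

Substituting a variable by another variable or by its complement never raises the degree, so a function of degree at most $d$ is accepted on every run; the interesting direction, rejection of functions far from degree $d$, is what the analysis in this section establishes.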
Theorem 3.1 immediately follows from Theorem 4.4 since to get an $(\varepsilon, 2^{O(d)})$-tester, we repeat the test $T_{k,[n]}$ $t = 2^{O(d)} \cdot \mathrm{poly}(1/\varepsilon)$ many times and accept if and only if each iteration of the test accepts. If the input function $f \in \mathcal{F}(n)$ is of degree at most $d$, this test accepts with probability 1. Otherwise, this test rejects with probability at least 3/4 for suitably chosen $t$ as above. The number of queries made by the test is $2^k \cdot t$.
Parameters. For the rest of this section, we use the following parameters. We choose for a large absolute constant $M \in \mathbb{N}$ and set where $C$ is the absolute constant from Corollary 2.5. The constant $M$ is chosen so that Note that the second constraint is satisfied for a large enough absolute constant $M$ since we have which can be made arbitrarily small for large enough constant $M$. Further, we set
The following are the two main lemmas used to establish Theorem 4.4.
Lemma 4.5 (Small distance lemma). Fix any $I$ such that $|I| = r > k + 1$ and $f_I :$
Lemma 4.6 (Large distance lemma). Fix any $I$ such that $|I| = r$ satisfies $r^2 > 100\ell^2$ and $f_I :$
With the above lemmas in place, we show how to finish the proof of Theorem 4.4.
Proof of Theorem 4.4. Fix any I and consider the behaviour of the test T k,I on f I . Assume |I| = n. A single run of T k,I produces a sequence of functions f n = f I , f n−1 , . . . , f k , where f r is a function on r variables. Let I n = I, I n−1 , . . . , I k be the sequence of index sets produced. We have f r : {0, 1} Ir → F. Note that k ≥ 100ℓ by (4) and (5).
• F r is the event that δ d (f r ) > ε 1 .
For any f I , one of F r , C r , or E r occurs with probability 1. If either E n or C n occurs, then by Lemma 4.5 we are done. Therefore, we assume that F n holds.
We note that one of the following possibilities must occur: either all the f r satisfy We handle each of these cases somewhat differently.
Clearly, if F k holds, then deg(f I k ) > d and hence T k,I k rejects f I k with probability 1. On the other hand, by Lemma 4.5, we see that

Thus, we have
Let $\mathcal{E}$ denote the event $\neg\left(\bigvee_{r=k}^{n-1} E_r \vee \bigwedge_{r=k}^{n-1} F_r\right)$. Notice that if event $\mathcal{E}$ occurs, there must be an $r \geq k$ such that $C_r$ occurs but we also have $F_{r+1} \wedge F_{r+2} \wedge \cdots \wedge F_n$. By Lemma 4.6, the probability of this is upper bounded by $100\ell^2/r^2$ for each $r \geq k$.
By a conditional probability argument, we see that where we have used the fact that k ≥ 100ℓ and for the second inequality we also use (1 − x) ≥ exp(−2x) for x ∈ [0, 1/2]. Plugging the above into (6), we get the theorem.

Proof of Small Distance Lemma (Lemma 4.5)
We start with a brief overview of the proof of Lemma 4.5. Suppose f I is δ-close to some polynomial P for some δ ≤ ε 1 . As mentioned in Section 1.3, our aim is to show that the (random) restriction g of f obtained above and the corresponding restriction Q of P differ at only one point. Then we will be done since any two distinct degree-d polynomials on {0, 1} k must differ on at least 2 points (if k > d) and hence the restricted function g cannot be a degree-d polynomial.
Note that the restriction is effectively given by $a \in \{0, 1\}^I$ and $\varphi : I \to [k]$ such that $g(y) = f_I(x(y))$ where $x(y) = (x_i(y))_{i \in I}$ is given by $x_i(y_1, \ldots, y_k) = y_{\varphi(i)} \oplus a_i$. ($\varphi$ is obtained by a sequence of replacements followed by the bijection $\sigma$.) Similarly we define $Q(y) = P(x(y))$. To analyze the test, we consider the queries $\{x(y)\}_{y \in \{0,1\}^k}$ made to the oracle for $f_I$. For every fixed $y \in \{0, 1\}^k$ the randomness (in $a$ and $\varphi$) leads to a random query $x(y) \in \{0, 1\}^I$ to $f_I$, and it is not hard to show that for each fixed $y$, $x(y)$ is uniformly distributed over $\{0, 1\}^I$. Hence, the probability that $g$ and $Q$ differ at any fixed $y \in \{0, 1\}^k$ is exactly $\delta$.
We would now like to say that for distinct $y', y'' \in \{0, 1\}^k$, the probability that $g$ and $Q$ differ at both $y'$ and $y''$ is much smaller than $\delta$. This would be true if, for example, $x(y')$ and $x(y'')$ were independent of each other, but this is unfortunately not the case. For example, consider the case when no $X_i$ ($i \in I$) is identified with the variable $Y_k$ (i.e., for every $i \in I$, $\varphi(i) \neq k$). In this case, $x(y') = x(y'')$ for every $y'$ and $y''$ that differ only at the $k$th position. More generally, if the number of variables that are identified with $Y_k$ is very small (much smaller than the expected number $r/k$) then $x(y')$ and $x(y'')$ would be heavily correlated if $y'$ and $y''$ differed in only the $k$th coordinate.
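The dependence between two queries can be seen concretely: once the map from variables to buckets and the complementation pattern are fixed, the Hamming distance between $x(y')$ and $x(y'')$ equals the total size of the buckets indexed by the coordinates where $y'$ and $y''$ differ. The Python sketch below checks this identity on a randomly drawn example; the map `phi` is sampled independently per coordinate purely for illustration (this is not the distribution induced by the test), coordinates are indexed from 0, and all names are hypothetical.

```python
import random
from collections import Counter
from itertools import product

def query_point(y, phi, a):
    """x(y) with x_i = y_{phi(i)} xor a_i (coordinates indexed from 0)."""
    return tuple(y[phi[i]] ^ a[i] for i in range(len(phi)))

def hamming(u, v):
    """Number of coordinates where u and v differ."""
    return sum(s != t for s, t in zip(u, v))

def predicted_distance(y1, y2, phi):
    """Sum of bucket sizes |phi^{-1}(j)| over coordinates j where y1, y2 differ."""
    bucket = Counter(phi)
    return sum(bucket[j] for j in range(len(y1)) if y1[j] != y2[j])

rng = random.Random(1)
r, k = 12, 3
phi = [rng.randrange(k) for _ in range(r)]   # illustrative map only
a = [rng.randrange(2) for _ in range(r)]
for y1, y2 in product(product((0, 1), repeat=k), repeat=2):
    actual = hamming(query_point(y1, phi, a), query_point(y2, phi, a))
    assert actual == predicted_distance(y1, y2, phi)
```

In particular, an empty bucket makes two distinct query indices collide on the same point of the cube, and a tiny bucket makes the two queries nearly identical, which is the correlation the analysis has to rule out.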
So, the first step in our proof is to analyze the above restriction process and show that with reasonable probability, for every Y j there are many variables (close to the expected number) mapped to it, i.e., |φ −1 (j)| is Ω(r/k) for every j ∈ [k]. To get to this analysis we first give an alternate (non-iterative) description of the test T k,I and analyze it by exploring the random set-union process mentioned in Section 1.3. We note that this process and its analysis may be independently interesting.
Once we have a decent lower bound on $\min_j |\varphi^{-1}(j)|$, we can use the hypercontractivity theorem of Polyanskiy (Theorem 2.4) to argue that for any $y' \neq y''$, the inputs $x(y')$ and $x(y'')$ are somewhat negatively correlated (see Corollary 2.5). We note that the distribution of the pair $(x(y'), x(y''))$ is not the usual noisy hypercube distribution, and so the usual hypercontractivity does not help. But this is where the strength of Polyanskiy's hypercontractivity comes in handy: even after we fix the Hamming distance between $x(y')$ and $x(y'')$, the symmetry of the space leads to enough randomness to apply Theorem 2.4. This application already allows us to show a weak version of Lemma 4.5 and hence a weak version of our final tester.
To prove Lemma 4.5 in full strength as stated, we note that stronger parameters for the lemma are linked to stronger negative correlation between x(y ′ ) and x(y ′′ ) for various y ′ and y ′′ . It turns out that this is directly related to the Hamming distance of y ′ and y ′′ : specifically, we would like their Hamming distance to not be too close to 0 or to k. Hence, we would like to restrict our attention to a subset T of the query points of {0, 1} k that form such a "code". At the same time, however, we need to ensure that, as for {0, 1} k , any two distinct degree-d polynomials cannot differ at exactly one point in T . We construct such a set T in Claim 4.10, and use it to prove Lemma 4.5.
We now begin the formal proof with an alternate but equivalent (non-recursive) description of test T k,I for |I| = r > k.
• For i ∈ 1, . . . , k • Check if the restricted function g(Y 1 , . . . , Y k ) is of degree at most d by querying g on all its inputs. Accept if so and reject otherwise.
Proposition 4.7. The iterative description above is equivalent to test T k,I .
We now begin the analysis of the test T k,I . As stated above, the first step is to understand the distribution of the number of X i (i ∈ I) eventually identified with Y j (for various j ∈ [k]). We will show (Corollary 4.9) that with reasonable probability, each Y j has Ω(r/k) X i s that are identified with it.
Fix any bijection $\pi : [r] \to [r]$. For $i, j$ such that $i \geq j$ and $i \in \{k, \ldots, r\}$, we define $B_{j,i}$ to be the index set of those variables that are identified with $X_{\pi(j)}$ (or its complement) in the first $r - i$ rounds of substitution. Formally,
$$B_{j,i} = \begin{cases} \{j\} & \text{if } i = r, \\ B_{j,i+1} & \text{if } i < r \text{ and } p(i+1) \neq j, \\ B_{j,i+1} \cup B_{i+1,i+1} & \text{if } i < r \text{ and } p(i+1) = j. \end{cases}$$
For $j \in [k]$, let $B_j = B_{j,k}$. This is the set of $i$ such that $X_{\pi(i)}$ is "eventually" identified with $X_{\pi(j)}$ (or its complement). For $i \in [r]$, we define b

To analyze the distribution of the "buckets" $B_1, \ldots, B_k$, it will be helpful to look at an equivalent way of generating this distribution. We do this by sampling the buckets in "reverse": i.e., we start with the $j$th bucket being the singleton set $\{j\}$ and for each $i = k+1, \ldots, r$, we add $i$ to the $j$th bucket if $i$ falls into the $j$th bucket.
Formally, for each j ∈ [k], define the set B ′ j,i to be B j ∩ [i]. Note that we have In particular, we see that for any i ≥ k + 1, This yields the following equivalent way of sampling sets from the above distribution. . . , i k in σ (σ(i 2 ) < · · · < σ(i k )). Define C 1 , . . . , C k as follows: • . . .
Then the distribution of (C 1 , . . . , C k ) is identical to the distribution of (B 1 , . . . , B k ).
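The equivalence of the two sampling processes can be checked exhaustively for small parameters. The sketch below is illustrative only and rests on two assumptions that are not spelled out in the surviving text: that each element $i > k$ picks a "parent" $p(i)$ uniformly from $\{1, \ldots, i-1\}$ and joins its bucket (so that $i$ joins bucket $j$ with probability $|B'_{j,i-1}|/(i-1)$), and that $C_j$ consists of $j$ together with the elements of $\{k+1, \ldots, r\}$ that appear after $j$ and before the next element of $[k]$ in $\sigma$. All function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations, product


def buckets_from_parents(r, k, parent):
    """Element i > k joins the bucket that already contains parent[i] < i."""
    bucket_of = {j: j for j in range(1, k + 1)}
    for i in range(k + 1, r + 1):
        bucket_of[i] = bucket_of[parent[i]]
    return tuple(
        frozenset(i for i, b in bucket_of.items() if b == j)
        for j in range(1, k + 1)
    )


def buckets_from_ordering(k, ordering):
    """Element e joins the bucket of the last element of [k] seen so far."""
    buckets = {j: set() for j in range(1, k + 1)}
    current = None
    for e in ordering:
        if e <= k:
            current = e
        buckets[current].add(e)
    return tuple(frozenset(buckets[j]) for j in range(1, k + 1))


def forward_distribution(r, k):
    """Exact law of (B_1, ..., B_k), assuming p(i) is uniform on {1, ..., i-1}."""
    outcomes = list(product(*[range(1, i) for i in range(k + 1, r + 1)]))
    dist = defaultdict(Fraction)
    for choice in outcomes:
        parent = dict(zip(range(k + 1, r + 1), choice))
        dist[buckets_from_parents(r, k, parent)] += Fraction(1, len(outcomes))
    return dict(dist)


def permutation_distribution(r, k):
    """Exact law of (C_1, ..., C_k) over orderings of [r] that start with 1."""
    orderings = [(1,) + rest for rest in permutations(range(2, r + 1))]
    dist = defaultdict(Fraction)
    for ordering in orderings:
        dist[buckets_from_ordering(k, ordering)] += Fraction(1, len(orderings))
    return dict(dist)
```

Under these assumptions, comparing `forward_distribution(5, 2)` with `permutation_distribution(5, 2)` gives identical dictionaries, which is the content of the lemma for $r = 5$, $k = 2$.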
Proof. Assume σ is sampled by starting with the element 1 and then inserting the elements i = 2, . . . , r one by one in a random position after 1 (since we are sampling σ such that σ(1) = 1). Simultaneously, consider the evolution of the jth bucket. Let C j,i denote the jth bucket after elements 2, . . . , i have been inserted. Note that no matter how the first k elements are ordered in σ, the element j ∈ [k] goes to the jth bucket at the end of the sampling process. Thus, after having inserted 2, . . . , k, we have C j,k = {j}.
We now insert $(i+1)$ for each $i$ such that $k \leq i < r$. The position of $i+1$ is a uniform random position after the first position. For each $i$, the probability that $i+1$ ends up in the $j$th bucket can be seen to be $|C_{j,i}|/i$, exactly as in (7). This shows that $(C_1, \ldots, C_k)$ has the same distribution as $(B_1, \ldots, B_k)$.

Proof. We assume that $r > 4k$ since otherwise the statement to be proved is trivial (as each $|B_j| \geq 1$ with probability 1). By Lemma 4.8 it suffices to prove the above statement for the sets $(C_1, \ldots, C_k)$. Now, say a permutation $\sigma$ of $[r]$ fixing 1 is chosen u.a.r. and we set $C_j$ as in Lemma 4.8. We view the process of sampling $\sigma$ as happening in two stages: we first choose a random linear ordering of $A = \{k+1, \ldots, r\}$, i.e. a random bijection $\sigma' : A \to [r-k]$, and then insert the elements $2, \ldots, k$ one by one at random locations in this ordering. (The position of the element 1 is of course determined.) Condition on any choice of $\sigma'$. For $j \in \{2, \ldots, k\}$, let $C'_j = \{i \mid (j-1)r/k \leq \sigma'(i) \leq (j-1)r/k + \lceil r/(2k) \rceil\}$. Fix any bijection $\tau : \{2, \ldots, k\} \to \{2, \ldots, k\}$.
Consider the probability that on inserting 2, . . . , k into the ordering σ ′ , each j ∈ {2, . . . , k} is inserted between two elements of C ′ τ (j) . Call this event E τ . Conditioned on this event, it can be seen that for each j ∈ {2, . . . , k}, the jth bucket C j has size at least Similarly, conditioned on E τ , we have |C 1 | ≥ r/k ≥ r/(4k). Since this holds for each τ and the events E τ are mutually exclusive, we have We now analyze Pr [E τ ] for any fixed τ . Conditioned on the positions of 2, . . . , j − 1, the probability that σ(j) ∈ C ′ τ (j) is at least (r/(2k)) · (1/r) = 1/(2k). Therefore we have

Thus, we get
Pr ∀j ∈ {2, . . . , k}, where we have used the Stirling approximation for the final inequality. This concludes the proof of the corollary.
Note that the sets B j are determined by our choice of p. For the rest of the section, we condition on a choice of p = p 0 such that Corollary 4.9 holds. We now show how to finish the proof of Lemma 4.5.
We prove the following two claims.
Claim 4.10. There is a non-empty set $T \subseteq \{0,1\}^k$ such that:
• $|T| \leq \binom{k}{\leq d} + 1$,
• Given distinct $y', y'' \in T$, $\Delta'(y', y'') \geq k/4$,
• No pair of polynomials $P, P' \in \mathcal{F}(I, d)$ can differ at exactly one input from $T$.

For each input $y \in \{0,1\}^k$ to the restricted polynomial $g$, let $x(y) \in \{0,1\}^I$ be the corresponding input to $f_I$. Let $S$ denote the multiset $\{x(y) \mid y \in T\}$. This is a subset of the set of inputs on which $f_I$ is queried.
Claim 4.11. Let p = p 0 be as chosen above. With probability at least δ · (|T |/2) over the choice of π and a, we have |S ∩ E| = 1 (i.e. there is a unique y ∈ T such that x(y) ∈ E).
Assuming Claims 4.10 and 4.11, we have proved Lemma 4.5 since with probability at least $\frac{1}{2^{O(k)}} \cdot \delta \cdot (|T|/2)$ (cf. Corollary 4.9 and Claim 4.11), the restricted function $g(Y_1, \ldots, Y_k)$ differs from the restriction $P'(Y_1, \ldots, Y_k)$ of $P$ at exactly one point in $T$. However, by our choice of the set $T$, any two polynomials from $\mathcal{F}(k, d)$ that differ on $T$ must differ on at least two inputs from $T$. Hence, $g$ cannot be a degree-$d$ polynomial, and thus the test rejects.

Proof of Claim 4.10
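Given functions $f, g \in \mathcal{F}(k)$, we define their inner product $\langle f, g \rangle$ by $\langle f, g \rangle = \sum_{y \in \{0,1\}^k} f(y) g(y)$. Recall that $\mathcal{F}(k, d)^\perp$ is defined to be the set of all $f \in \mathcal{F}(k)$ such that $\langle f, g \rangle = 0$ for each $g \in \mathcal{F}(k, d)$.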
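We will construct $T$ by finding a suitable non-zero $f \in \mathcal{F}(k, d)^\perp$ and setting $T = \mathrm{Supp}(f)$, where $\mathrm{Supp}(f) = \{y \in \{0,1\}^k \mid f(y) \neq 0\}$. Thus, we need $f$ to satisfy the following properties, which mirror the three conditions of Claim 4.10.

1. $|\mathrm{Supp}(f)| \leq \binom{k}{\leq d} + 1$.
2. Given distinct $y', y'' \in \mathrm{Supp}(f)$, $\Delta'(y', y'') \geq k/4$.
3. No pair of polynomials $P, P' \in \mathcal{F}(I, d)$ can differ at exactly one input from $\mathrm{Supp}(f)$.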
We first observe that Property 3 is easily satisfied. To see this, assume that $g_1, g_2 \in \mathcal{F}(k, d)$ differ at exactly one point, say $y'$, from $\mathrm{Supp}(f)$. Then, since $g = g_1 - g_2 \in \mathcal{F}(k, d)$ and $f \in \mathcal{F}(k, d)^\perp$, we must have $\langle f, g \rangle = 0$. On the other hand, since $\mathrm{Supp}(g) \cap \mathrm{Supp}(f) = \{y'\}$, we have
$$\langle f, g \rangle = \sum_{y \in \{0,1\}^k} f(y) g(y) = f(y') g(y') \neq 0,$$
which yields a contradiction. Hence, we see that $g_1$ and $g_2$ cannot differ at exactly one point in $\mathrm{Supp}(f)$.
We thus need to choose a non-zero $f \in \mathcal{F}(k, d)^\perp$ so that Properties 1 and 2 hold. Note that to ensure that $f \in \mathcal{F}(k, d)^\perp$, it suffices to ensure that for each $A \subseteq [k]$ of size at most $d$ we have
$$\sum_{y \in \{0,1\}^k} f(y) \prod_{i \in A} y_i = 0. \quad (8)$$
The number of such $A$ is $N = \binom{k}{\leq d}$. To ensure that Properties 1 and 2 hold, it suffices to ensure that $\mathrm{Supp}(f) \subseteq U$ where $U \subseteq \{0,1\}^k$ is a set of size $N + 1$ so that any distinct $y', y'' \in U$ satisfy $\Delta(y', y'') \in [k/4, 3k/4]$. (Note that this implies that $\Delta'(y', y'') \geq k/4$.) To see that such a set $U$ exists, consider the following standard greedy procedure for finding such a set $U$: starting with an empty set, we repeatedly choose an arbitrary point $z$ to add to $U$ and then remove all points at Hamming distance at most $k/4$ or at least $3k/4$ from $z$ from future consideration. Note that this procedure is guaranteed to produce at least $2^k / \left(2 \binom{k}{\leq k/4}\right)$ many points. By Fact 2.3 and our choice of $k$ (see (2) and (4)) we have
$$\frac{2^k}{2 \binom{k}{\leq k/4}} \geq N + 1.$$
Hence, the above greedy procedure can be used to produce a set $U$ of size $N + 1$ as required.
Since we assume that $\mathrm{Supp}(f) \subseteq U$, ensuring (8) reduces to ensuring the following for each $A \subseteq [k]$ of size at most $d$:
$$\sum_{y \in U} f(y) \prod_{i \in A} y_i = 0. \quad (9)$$
Choosing $f(y)$ ($y \in U$) so that the above holds reduces to solving a system of $N$ homogeneous linear equations (one for each $A$) in $|U| = N + 1$ unknowns. By standard linear algebra, this system has a non-zero solution. This yields a non-zero $f \in \mathcal{F}(k, d)^\perp$ with the required properties.
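The construction in this proof can be carried out mechanically for small parameters. The following is a minimal sketch, not the authors' procedure verbatim: it assumes the field is $\mathbb{Q}$ (using exact `Fraction` arithmetic), scans $\{0,1\}^k$ in lexicographic order for the greedy step, and uses hypothetical helper names (`greedy_code`, `nullspace_vector`, `dual_function`). It keeps only points whose distance to every chosen point lies in $[k/4, 3k/4]$, then solves the $N \times (N+1)$ homogeneous system for $f$.

```python
from fractions import Fraction
from itertools import combinations, product


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def greedy_code(k, size):
    """Greedily pick points of {0,1}^k with pairwise distance in [k/4, 3k/4]."""
    chosen = []
    for z in product((0, 1), repeat=k):
        if all(k <= 4 * hamming(z, u) <= 3 * k for u in chosen):
            chosen.append(z)
            if len(chosen) == size:
                return chosen
    raise ValueError("greedy procedure ran out of points; k is too small")


def nullspace_vector(rows, ncols):
    """Non-zero rational v with rows * v = 0 (requires ncols > rank)."""
    m = [[Fraction(x) for x in row] for row in rows]
    pivots = []  # (row index, column index) of each pivot
    r = 0
    for c in range(ncols):
        if r == len(m):
            break
        pr = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pr is None:
            continue
        m[r], m[pr] = m[pr], m[r]
        pivot = m[r][c]
        m[r] = [x / pivot for x in m[r]]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                factor = m[i][c]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        pivots.append((r, c))
        r += 1
    pivot_cols = {c for _, c in pivots}
    free = next(c for c in range(ncols) if c not in pivot_cols)
    v = [Fraction(0)] * ncols
    v[free] = Fraction(1)           # set one free variable to 1
    for row, col in pivots:
        v[col] = -m[row][free]      # back-substitute from the reduced rows
    return v


def dual_function(k, d):
    """Return (f, monomials): f is non-zero, supported inside the greedy set U,
    and orthogonal to every monomial of degree at most d."""
    monomials = [A for s in range(d + 1) for A in combinations(range(k), s)]
    U = greedy_code(k, len(monomials) + 1)
    rows = [[int(all(y[i] for i in A)) for y in U] for A in monomials]
    values = nullspace_vector(rows, len(U))
    return {y: v for y, v in zip(U, values) if v != 0}, monomials
```

For instance, `dual_function(8, 1)` works with $N = 9$ monomials and a greedy set of 10 points; the support of the returned `f` plays the role of $T$.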

Proof of Claim 4.11
Let $y', y''$ be any two distinct points in $T$. Let $\Delta$ denote $\Delta(y', y'')$ and $\Delta'$ denote $\Delta'(y', y'')$. We show that
$$\Pr_{\pi, a}[x(y') \in E] = \delta \quad (10)$$
where $C$ is the absolute constant from the statement of Corollary 2.5. Given (10) and (11) we are done since we can argue by inclusion-exclusion as follows.
Pr π,a (10) and (11)) Note that by our choice of ε 1 (see (3)) and Fact 2.3 we have , which along with our previous computation yields This finishes the proof of the Claim using (10) and (11). We now prove (10) and (11). To prove (10), we consider the distribution of x(y ′ ) for any fixed y ′ ∈ {0, 1} k . Condition on any choice of π. For any i ∈ [r], let A i = {j |i ∈ i ′ B j,i ′ }. Note that π(j) < π(i) for each j ∈ A i . We have which is a uniform random bit even after conditioning on all a j for j < i. In particular, it follows that for each choice of π, x(y ′ ) is a uniformly random element of {0, 1} I . This immediately implies (10). Also note that since x(y ′ ) has the same distribution for each choice of π, the random variables x(y ′ ) and π are independent from each other.
To prove (11), we will use our corollary to Polyanskiy's Hypercontractivity theorem (Corollary 2.5). Let D ⊆ [k] be the set of coordinates where y ′ and y ′′ differ. Condition on any choice of x(y ′ ) ∈ {0, 1} n . By (12), the point x(y ′′ ) satisfies, for each i, . Or equivalently, for any h ∈ I, we have Hence, we may equivalently sample the pair (x(y ′ ), x(y ′′ )) as follows: Choose x(y ′ ) ∈ {0, 1} I uniformly at random, and choose independently a random set I ′ ⊆ I of size |B D | and flip x(y ′ ) exactly in the coordinates in I ′ to get x(y ′′ ).
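The equivalent sampling procedure described above is easy to write down. The sketch below is only an illustration: `n` stands for $|I|$, `m` stands for the number of flipped coordinates (written $|B_D|$ in the text), and the function name is hypothetical.

```python
import random


def sample_pair(n, m, rng):
    """Sample x uniformly from {0,1}^n, then flip exactly m uniformly chosen
    coordinates of x to obtain the second point."""
    x = [rng.randrange(2) for _ in range(n)]
    flips = set(rng.sample(range(n), m))
    x_flipped = [bit ^ (i in flips) for i, bit in enumerate(x)]
    return x, x_flipped
```

Unlike the usual noisy hypercube distribution, the Hamming distance between the two points is always exactly `m`, which is why a constant-weight form of hypercontractivity is needed.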

Proof of Large Distance Lemma (Lemma 4.6)
We follow the proof of [7, Lemma 12].
Given a triple $(i, j, b) \in I^2 \times \{0,1\}$ with $i, j$ distinct, call $(i, j, b)$ a bad triple if the restricted function $f'_I$ obtained when the test chooses $i_0 = i$, $j_0 = j$ and $a = b$ is $\varepsilon_0$-close to $\mathcal{F}(I \setminus \{j\}, d)$. To prove Lemma 4.6, it suffices to show that the number of bad triples is at most $100\ell^2$. To do this, we bound instead the number of bad pairs, which are defined to be pairs $(i, j)$ for which there exists $b \in \{0,1\}$ such that $(i, j, b)$ is a bad triple. Note that $(i, j)$ is a bad pair iff $(j, i)$ is. Hence, the set of bad pairs $(i, j)$ defines an undirected graph $G_{bad}$. If there are fewer than $25\ell^2$ edges in $G_{bad}$, we are done since this implies that there are at most $50\ell^2$ bad pairs and hence at most $100\ell^2$ bad triples. Otherwise, $G_{bad}$ has at least $25\ell^2$ edges and it is easy to see that one of the following two cases must occur:
• $G_{bad}$ has a matching with at least $\ell + 1$ edges, or
• $G_{bad}$ has a star with at least $\ell + 1$ edges.
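One way to see the dichotomy: take any maximal matching. If it has at most $\ell$ edges, its at most $2\ell$ endpoints touch every edge, so some endpoint has degree at least $25\ell^2/(2\ell) > \ell$. The sketch below follows this argument for a simple graph given as a list of distinct unordered pairs; the function name and return format are hypothetical and chosen only for illustration.

```python
def matching_or_star(edges, ell):
    """Return ("matching", edges) or ("star", (center, edges)) with ell + 1
    edges each, or None if this particular search finds neither."""
    matching, used = [], set()
    for u, v in edges:                      # greedy maximal matching
        if u not in used and v not in used:
            matching.append((u, v))
            used.update((u, v))
    if len(matching) >= ell + 1:
        return "matching", matching[: ell + 1]

    incident = {}                           # vertex -> list of incident edges
    for u, v in edges:
        incident.setdefault(u, []).append((u, v))
        incident.setdefault(v, []).append((u, v))
    if not incident:
        return None
    center = max(incident, key=lambda w: len(incident[w]))
    if len(incident[center]) >= ell + 1:
        return "star", (center, incident[center][: ell + 1])
    return None
```

On a graph with at least $25\ell^2$ edges (and $\ell \geq 1$) the counting argument above guarantees that one of the two branches succeeds, so `None` can only be returned for sparser inputs.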
We show that in each case, we can find a polynomial $P \in \mathcal{F}(I, d)$ that is $\varepsilon_1$-close to $f_I$, which will contradict the assumption that $\delta_d(f_I) > \varepsilon_1$ and hence finish the proof of the lemma.
We first note that in either the matching or the star case, we can replace some variables X with 1 ⊕ X in f I (note that this does not change δ d (f I )) to ensure that the bad triples that give rise to the bad pairs are all of the form (X, X ′ , 0): i.e., all the bad triples come from identifying variables (and not from identifying a variable with the complement of another).
Let t 1 = (X i 1 , X j 1 , 0), . . . , t ℓ+1 = (X i ℓ+1 , X j ℓ+1 , 0) denote the bad triples obtained above (in either the matching or the star case). Each triple t h defines the subset R h ⊆ {0, 1} I where the variables X i h and X j h take the same values; let R ′ h denote the complement of R h . Note that each |R h | = 2 r−1 . Furthermore, it follows from the form of the triples that for each h we have where the latter function is obtained by identifying the variables X i h and X j h in f I . We will show the following claim.
Assuming the above claim, we show that the polynomial P above is actually ε 1 -close to f I , which contradicts our assumption about δ d (f I ).
Consider a uniformly random input x ∈ {0, 1} n . We have For each h, we have Hence, we obtain Plugging the above into (13), we get where the final inequality follows from our choice of ε 0 and ℓ (see (5)). This is a contradiction to our assumption on δ d (f I ), which concludes the proof of Lemma 4.6 assuming Claim 4.12.

Proof of Claim 4.12
We now prove Claim 4.12. The proof is a case analysis based on whether G bad has a large matching or a large star. For any h ∈ [ℓ + 1] and any polynomial Q ∈ F(I, d), we denote Q| h the polynomial obtained by identifying the variables X i h and X j h . We want to define a polynomial P such that for each h ∈ [ℓ + 1], we have As in [7], the crucial observation that will help us find a P as above is the following. Fix any distinct h, h ′ and consider P (h) | h ′ and P (h ′ ) | h . Note that these polynomials are both naturally defined on the set of inputs R h,h ′ := R h ∩ R h ′ . However, since f I is ε 1 -close to P (h) and P (h ′ ) on R h and R h ′ respectively, we see that where for the second inequality we have used the fact that Since any pair of distinct polynomials of degree d disagree on at least a (1/2 d ) fraction of inputs in R h,h ′ , we see that P (h) | h ′ = P (h ′ ) | h as polynomials. We record this fact below.
The Matching case of Claim 4.12. Let $(X_{i_1}, X_{j_1}, 0), \ldots, (X_{i_{\ell+1}}, X_{j_{\ell+1}}, 0)$ be the set of bad triples that give rise to the distinct edges of the matching in $G_{bad}$. By renaming variables we assume that $I = [r]$ and the bad triples are all of the form $(X_1, X_2, 0), \ldots, (X_{2\ell+1}, X_{2\ell+2}, 0)$.
Assume that for each $h \in [\ell+1]$,
$$P^{(h)}(X) = \sum_{S \subseteq I : |S| \leq d} \alpha^{(h)}_S X_S,$$
where $X_S = \prod_{i \in S} X_i$ (note that $P^{(h)} \in \mathcal{F}(I \setminus \{2h\}, d)$ and hence does not involve $X_{2h}$). For any $h$, if $|S| > d$ or $S \ni 2h$, we define $\alpha^{(h)}_S = 0$. Note that we have for any distinct $i, j \in [\ell+1]$ In particular, Claim 4.13 implies the following for $S \subseteq I$ such that $|S| \leq d$ and $i, j$ distinct such that $S \cap \{2i-1, 2i, 2j-1, 2j\} = \emptyset$,
$$\alpha^{(i)}_S = \alpha^{(j)}_S. \quad (16)$$
We define the polynomial $P(X) = \sum_{S \subseteq I : |S| \leq d} \alpha_S X_S$ as follows. For each $S \in \binom{I}{\leq d}$, set $\alpha_S = \alpha^{(j)}_S$ for any $j$ such that $S \cap \{2j-1, 2j\} = \emptyset$: since $|S| \leq d \leq \ell$, there is at least one such $j \in [\ell+1]$. By (16), we see that any choice of $j$ as above yields the same coefficient $\alpha_S$. Note that Let $\alpha_S|_j$ denote the coefficient of $X_S$ in $P|_j$. Now we show that $P|_i = P^{(i)}$ for each choice of $i \in [\ell+1]$ by comparing coefficients of monomials and showing that $\alpha_S|_i = \alpha^{(i)}_S$ for each $S$ such that $|S| \leq d$. That will conclude the proof of the matching case of Claim 4.12. Fix any $S$ such that $|S| \leq d$. We consider three cases.
• $S \ni 2i$: In this case, α S | i = α
• $S \cap \{2i-1, 2i\} = \{2i-1\}$: In this case, let $T = S \setminus \{2i-1\}$ and fix $j \in [\ell+1]$ such that $j \neq i$ and $T \cap \{2j-1, 2j\} = \emptyset$. We see that

The Star case of Claim 4.12. We proceed as in the matching case, except that the definition of $P$ will be somewhat more involved. By renaming variables we assume that $I = [r]$ and that the bad triples are all of the form $(X_1, X_r, 0), (X_2, X_r, 0), \ldots, (X_{\ell+1}, X_r, 0)$. Assume that for each $h \in [\ell+1]$,
$$P^{(h)}(X) = \sum_{S \subseteq I : |S| \leq d} \alpha^{(h)}_S X_S,$$
where $X_S = \prod_{i \in S} X_i$ (note that $P^{(h)} \in \mathcal{F}(I \setminus \{r\}, d)$ and hence does not involve $X_r$). For any $h$, if $|S| > d$ or $S \ni r$, we define $\alpha^{(h)}_S = 0$. For any distinct $i, j \in [\ell+1]$ with $i < j$, we assume that $P^{(i)}|_j$ and $P^{(j)}|_i$ are obtained by replacing $X_j$ with $X_i$. We thus have Using Claim 4.13 and comparing coefficients of $P^{(i)}|_j$ and $P^{(j)}|_i$, we get for $i \neq j$ and $S$ such that $S \cap \{i, j\} = \emptyset$, We now define the polynomial
$$P(X) = \sum_{S \subseteq I \setminus \{r\} : |S| \leq d} \beta_S X_S + \sum_{S \subseteq I : S \ni r, |S| \leq d} \gamma_S X_S$$
as follows.
• For $S \not\ni r$, we define $\beta_S$ to be $\alpha^{(i)}_S$ for any $i \in [\ell+1]$ such that $i \notin S$. Since $|S| \leq d < \ell + 1$ there is such an $i$. Note that by (21), the choice of $i$ is immaterial.
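• For $S \ni r$, we let $T = S \setminus \{r\}$. Note that $|T| < d$. We define $\gamma_{T \cup \{r\}}$ by downward induction on $|T|$ as follows:
$$\gamma_{T \cup \{r\}} = \alpha^{(i)}_{T \cup \{i\}} - \beta_{T \cup \{i\}} - \gamma_{T \cup \{i, r\}} \quad \text{for any } i \in [\ell+1] \setminus T, \quad (22)$$
where we assume that $\gamma_{T \cup \{r\}} = 0$ for $|T| \geq d$.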
We will show first by downward induction on |T | that these coefficients are independent of the choice of i ∈ [ℓ + 1] \ T . Fix i, j ∈ [ℓ + 1] \ T . In the base case |T | = d − 1, we have (by (21) and α T ∪{i,j} = 0) T ∪{i,j} − β T ∪{i,j} − γ T ∪{i,j,r} (Applying definition of γ T ∪{i,r} and induction) (22) and rearranging terms.) We now conclude by showing that the restriction of P obtained by replacing X r by X i equals the polynomial P (i) . Let P | i denote the restriction of P obtained by replacing X r by X i and let α S | i be its coefficients. Note that T ∪{i} X T ∪{i} (By definition of β S and γ T ∪{r} ) = P (i) .
This concludes the proof for the star case.

Impossibility of local decoding when char(F) is large
In this section, we prove Theorem 5.3 which is a more detailed version of Theorem 3.2. Again we remind the reader that an overview may be found in Section 1.3.
Let $n$ be a growing parameter and $\mathbb{F}$ a field of characteristic 0 or positive characteristic greater than $n^2$. For the results in this section, it will be easier to deal with the domain $\{-1,1\}^n$ rather than $\{0,1\}^n$. Since there is a natural invertible affine map that maps $\{0,1\}$ to $\{-1,1\}$ (i.e. $a \mapsto 1 - 2a$), this change of input space is without loss of generality.

Local linear spans of balanced vectors
Let $u \in \mathbb{F}^n$ and $U \subseteq \mathbb{F}^n$. For any integer $t \in \mathbb{N}$, we say that $u$ is in the $t$-span of $U$ if it can be written as a linear combination of at most $t$ elements of $U$. For $x \in \{-1,1\}^n$, we use $|x|$ to denote the absolute value of the sum of the entries of $x$ over $\mathbb{Z}$. In this section, we wish to show that if the vector $1^n$ is in the $t$-span of balanced vectors, i.e., vectors $x$ with $|x| \leq n/s$, then $t$ must grow as a function of $s$.
As explained earlier, we first establish a bound on the size of the solutions of linear systems over $\mathbb{Q}$ with few variables or few constraints. This fact is well-known, but we prove it here for completeness.
• If $\mathbb{F}$ is a field of characteristic zero and the system has a solution in $\mathbb{F}^s$, then there exist integers $a_1, \ldots, a_s, b \in \mathbb{Z}$ with $|a_i|, |b| \leq t!$ such that $x_i = a_i/b$ is a solution to $Mx = u$. In particular, there is a solution in $\mathbb{Q}^s$.
• If $\mathbb{F}$ is a field of characteristic $p$ and if the system has a solution in $\mathbb{F}^s$, then there exist integers $a_1, \ldots, a_s, b \in \mathbb{Z}$ with $|a_i|, |b| \leq t!$ such that $x_i = a_i/b \pmod p$ is a solution to $Mx = u$. In particular, there is a solution in $\mathbb{F}_p^s$.

Proof. Note that we can assume that $M$ has full column rank. This is because $Mx = u$ has a solution iff $\tilde{M}x = u$ has a solution, where $\tilde{M}$ is the submatrix of $M$ obtained by keeping a maximal set of linearly independent columns of $M$. When the columns are linearly independent, $s$ is at most $r$ and hence $t = \min\{r, s\} = s$.
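We start with the zero characteristic case. Let $M'$ be an invertible $s \times s$ submatrix of $M$ formed by a set of $s$ linearly independent rows of $M$ and let $u' \in \mathbb{F}^s$ be the vector corresponding to these rows. Note that the solution $x$ is uniquely determined by $M'x = u'$. We now apply Cramer's rule to see that the solution is given by $x_i = \det(M'_i)/\det(M')$, where $M'_i$ denotes $M'$ with its $i$th column replaced by $u'$. When the entries of $M$ and $u$ lie in $\{-1, 0, 1\}$, as in the application below, each of these determinants is an integer of absolute value at most $s! = t!$. The characteristic $p$ case is similar, with the only difference being that the solution now is given by $x_i = \det(M'_i) \cdot \det(M')^{-1} \pmod p$.

We now turn to the main technical lemma of this section, showing that $1^n$ is not in the linear span of a small number of nearly balanced elements of $\{-1,1\}^n$.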
Lemma 5.2. Let $n, s = s(n) \in \mathbb{N}$ with $s(n) \leq n$. Let $S = \{x \in \{-1,1\}^n \mid |x| \leq n/s\}$. Then $x_0 = 1^n$ is not in the $t$-span of $S$ unless $t \geq \log s/\log\log s$, provided $\mathbb{F}$ is a field of zero characteristic or of characteristic $p \geq 2n^2$.
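Proof. We first consider the case when $\mathbb{F}$ is of zero characteristic. Note that in this case $\mathbb{Q} \subseteq \mathbb{F}$. Suppose $x_0 \in \mathrm{Span}\{x_1, \ldots, x_t\}$ with $x_0 = \sum_{i=1}^t c_i x_i$. Note that the $c_i$'s are expressible as the solution to a linear system $Mz = u$ where $M$ and $u$ have entries in $\{-1, 0, 1\}$ and $M$ is an $n \times t$ matrix. By Lemma 5.1 we have that $c_i \in \mathbb{Q}$ with $|c_i| \leq t!$ (more specifically we have $c_i = a_i/b$ with $|a_i| \leq t!$ and this implies $|c_i| \leq t!$). Summing the coordinates of both sides of $x_0 = \sum_{i=1}^t c_i x_i$, we thus have
$$n = \sum_{i=1}^t c_i \sum_{j=1}^n (x_i)_j \leq t \cdot t! \cdot (n/s) \leq (t+1)! \cdot (n/s).$$
We thus conclude that $(t+1)! \geq s$ and thus $t \geq \log s/\log\log s$.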
In the case of a field $\mathbb{F}$ of positive characteristic, we proceed as above and let $x_0 = \sum_{i=1}^t c_i x_i$. By Lemma 5.1 we have that there are integers $a_i, b$ with $|a_i|, |b| \leq t!$ such that $c_i = a_i/b \pmod p$ is a solution to $x_0 = \sum_{i=1}^t c_i x_i$. Now consider $b \cdot n$: summing coordinates, we get $b \cdot n = \sum_{i=1}^t a_i \sum_{j=1}^n (x_i)_j \pmod p$. We now show that this implies $(t+1)! \geq \min\{p/(2n), s\} = s$ (where the equality follows from $p \geq 2n^2$ and $s \leq n$). Assume $(t+1)! \leq p/(2n)$. Then we have $n \leq |b \cdot n| \leq t! \cdot n < p/2$ over the integers, and $\left|\sum_{i=1}^t a_i \sum_{j=1}^n (x_i)_j\right| \leq (t+1)!\,(n/s) < p/2$ also over the integers. Since the two quantities are congruent modulo $p$ and both are smaller than $p/2$ in absolute value, they are equal over the integers. We again conclude that $n \leq (t+1)!\,(n/s)$ and so $(t+1)! \geq s$ as claimed. The lemma follows.
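For very small $n$ the smallest $t$ for which $1^n$ lies in the $t$-span of $S$ can be found by brute force. The sketch below works over $\mathbb{Q}$ only (the characteristic zero case), takes $|x|$ to be the absolute value of the coordinate sum, and uses hypothetical helper names; it is meant as a toy illustration of the statement of Lemma 5.2, not as part of the proof.

```python
from fractions import Fraction
from itertools import combinations, product


def rank(rows):
    """Rank over the rationals via Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    rk = 0
    ncols = len(m[0]) if m else 0
    for c in range(ncols):
        pr = next((i for i in range(rk, len(m)) if m[i][c] != 0), None)
        if pr is None:
            continue
        m[rk], m[pr] = m[pr], m[rk]
        for i in range(rk + 1, len(m)):
            if m[i][c] != 0:
                factor = m[i][c] / m[rk][c]
                m[i] = [a - factor * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk


def in_span(vectors, target):
    """True iff target is a rational linear combination of the given vectors."""
    vectors = list(vectors)
    return rank(vectors + [target]) == rank(vectors)


def min_span_size(n, s):
    """Smallest t such that 1^n is in the t-span of {x : |sum(x)| <= n/s},
    or None if 1^n is not in the span of the whole set."""
    ones = [1] * n
    balanced = [list(x) for x in product((-1, 1), repeat=n)
                if abs(sum(x)) * s <= n]
    for t in range(1, n + 1):
        if any(in_span(subset, ones) for subset in combinations(balanced, t)):
            return t
    return None
```

For $n = 4$ and $s = 2$ this search returns $t = 3$, consistent with the requirement $(t+1)! \geq s$; for $n = 4$ and $s = 4$ every vector in $S$ has coordinate sum zero, so $1^n$ is never in the span and the function returns `None`.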

Proof of Theorem 3.2
We now state and prove Theorem 5.3 which immediately implies Theorem 3.2.
Proof. The proof of the theorem will use the minimax principle. Specifically, we design a "hard" probability distribution D over functions that are ε-close to F(n, 1) such that any deterministic decoder that decodes the value of a random function (chosen according to D) at the point 1 n while making very few queries will fail to decode the value with probability at least 1/4.
We start with the case of positive characteristic, which is somewhat simpler to describe. Let $\mathrm{char}(\mathbb{F}) = p > n^2$. We define the hard distribution $D$ as follows. Let $E = \{x \in \{-1,1\}^n \mid |x| > n/s\}$, so that, by the Chernoff bound (see, e.g., [12]), $|E| \leq \varepsilon 2^n$. Let $S = \{-1,1\}^n \setminus E$.
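We now sample a random function $f \sim D$ as follows:
• Choose $a_1, \ldots, a_n \in \mathbb{F}_p \subseteq \mathbb{F}$ uniformly at random and independently. Let $\ell(X_1, \ldots, X_n) = \sum_i a_i X_i \in \mathcal{F}(n, 1)$.
• Set $f(x) = \ell(x)$ for $x \in S$ and $f(x) = 0$ for $x \in E$.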
Let $A$ be any deterministic decoding algorithm for decoding $f(1^n)$. Assume that the worst case number of queries $t$ made by $A$ satisfies $t < \log s/\log\log s$. W.l.o.g. we assume that $A$ always makes exactly $t$ queries and also that none of these queries are made to inputs $x \in E$ (since at these points $f(x)$ is known to take the value 0). Additionally, we may assume that these queries correspond to linearly independent inputs since if a query point $x$ is a linear combination of previous queries, then $\ell(x) = \langle a, x \rangle$ can be determined from the answers to previous queries. Let $x_1, \ldots, x_t$ be the (adaptive) queries made by $A$ on the random function $f$. After these queries are made, the algorithm has $\ell(x_i) = \langle a, x_i \rangle$ for each $i \in [t]$, where $a = (a_1, \ldots, a_n)$. However, by Lemma 5.2, we know that $1^n$ is not in the $t$-span of the inputs in $S$ and hence, given the values $\ell(x_1), \ldots, \ell(x_t)$, $\ell(1^n) = \sum_i a_i$ is still distributed uniformly over $\mathbb{F}_p$. Hence, the probability that the algorithm outputs $\ell(1^n)$ correctly is at most $1/p < 3/4$.

Now consider the case when $\mathrm{char}(\mathbb{F}) = 0$. We define our hard distribution $D$ exactly as above except that the coefficients $a_1, \ldots, a_n$ are chosen i.u.a.r. from $\{-N, \ldots, N\}$ where $N = n^{\lceil \log s/\log\log s \rceil}$.
Let A be any deterministic decoding algorithm for decoding f (1 n ) as above. Again, we assume that A always makes t ≤ log s/ log log s many queries corresponding to linearly independent inputs, and also that none of these queries are made to inputs x ∈ E.
Let A ⊆ {−N, . . . , N } n be the set of coefficients of linear polynomials ℓ such that A is able to decode ℓ(1 n ) = i a i correctly.
To bound the size of $A$, we use an encoding argument. Consider any $(a_1, \ldots, a_n) \in A$ and let $\ell(X) = \sum_i a_i X_i$. Let $x_1, \ldots, x_t$ be the queries made on input $\ell$. Given $\langle a, x_i \rangle$ for $i \in [t]$, the algorithm determines $\sum_i a_i = \langle a, 1^n \rangle$. Hence, at this point the algorithm has $\langle a, x \rangle$ for $x \in I' = \{x_1, \ldots, x_t, 1^n\}$. Note that $I'$ is a set of dimension $t + 1$ since by Lemma 5.2, $1^n$ is not in the $t$-span of $S$. We can thus find a subset $I'' = \{e_{i_1}, \ldots, e_{i_{n-t-1}}\}$ of the set of standard basis vectors $\{e_1, \ldots, e_n\}$ of size $n - t - 1$ so that $I = I' \cup I''$ is a basis for $\mathbb{F}^n$.
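Define an encoding function $\mathcal{E}$ on $A$ by $\mathcal{E}(a) = (\langle a, x \rangle)_{x \in I \setminus \{1^n\}}$. Note that each $\langle a, x_j \rangle \in \{-Nn, \ldots, Nn\}$ since $a \in \{-N, \ldots, N\}^n$ and $x_j \in \{-1,1\}^n$. We claim that $\mathcal{E}$ is 1-1. This is because, on being given $\mathcal{E}(a)$ as above, we can determine $\langle a, x \rangle$ for each $x \in I$ by the following argument: $\mathcal{E}(a)$ directly gives us $\langle a, x \rangle$ for each $x \in I \setminus \{1^n\}$ and by construction of $x_1, \ldots, x_t$, we know that $\langle a, x_1 \rangle, \ldots, \langle a, x_t \rangle$ determines the value of $\langle a, 1^n \rangle$. Thus, we have $\langle a, x \rangle$ for each $x \in I$ and as $I$ is a basis for $\mathbb{F}^n$, we can obtain $a \in \mathbb{F}^n$ as well.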
which implies that the relative size of A inside {−N, . . . , N } n is at most 3/4. This concludes the proof.

Local decoding when char(F) is small
In this section, we give a local decoder over fields of small characteristic. An overview of this construction may be found in Section 1.3. Let $p$ be a prime of constant size and let $\mathbb{F}$ be any (possibly infinite) field of characteristic $p$. Let $d$ be the degree parameter and $k$ be the smallest power of $p$ that is strictly greater than $d$. Note that $k \leq pd$. We show that the space $\mathcal{F}(n, d)$ has a $\left(1/\left(4\binom{2k}{k}\right), \binom{2k}{k}\right)$-local decoder, hence proving Theorem 3.3.
The main technical tool we use is a suitable linear relation on the space F(2k, d), which we describe now. We say that a set S ⊆ {0, 1} 2k is useful if for every polynomial G ∈ F(2k, d), G(0 2k ) is determined by the restriction of the function G to the inputs in S. Let B ⊆ {0, 1} 2k denote the set of all balanced inputs (i.e. inputs of Hamming weight exactly k). Lemma 6.1. Fix d, k as above. Then the set B ⊆ {0, 1} 2k of balanced inputs is useful.
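The proof of the above lemma will use Lucas' theorem, which we recall below.

Theorem 6.2 (Lucas' theorem). Let $p$ be any prime and $a, b \in \mathbb{N}$. Let $a_1, \ldots, a_\ell \in \{0, \ldots, p-1\}$ and $b_1, \ldots, b_\ell \in \{0, \ldots, p-1\}$ be the digits in the $p$-ary expansion of $a$ and $b$, i.e., $a = \sum_{j \in [\ell]} a_j p^{j-1}$ and $b = \sum_{j \in [\ell]} b_j p^{j-1}$. Then, we have
$$\binom{a}{b} \equiv \prod_{j \in [\ell]} \binom{a_j}{b_j} \pmod p.$$

Corollary 6.3. Let $p$, $d$ and $k$ be as above. For $i \in \{0, \ldots, d\}$, we have $\binom{d+k-i}{k-i} \not\equiv 0 \pmod p$ if and only if $i = 0$.

Proof. Note that by Lucas' theorem (Theorem 6.2), $\binom{a}{b} \equiv 0 \pmod p$ if and only if there are digits $a_j, b_j$ in the $p$-ary expansions of $a$ and $b$ respectively with $a_j < b_j$.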
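Consider first the case when $i = 0$. Let $a = d + k$ and $b = k$. Let
$$a = \sum_{j \in [\ell]} a_j p^{j-1}, \qquad b = \sum_{j \in [\ell]} b_j p^{j-1}, \quad (23)$$
where $a_j, b_j \in \{0, \ldots, p-1\}$ and $k = p^{\ell-1}$. Then, we have $b_j = 0$ for each $j < \ell$ and $b_\ell = a_\ell = 1$. Hence by Lucas' theorem, we have $\binom{a}{b} \not\equiv 0 \pmod p$.

Now consider the case when $i \in [d]$. Let $a = d + k - i$ and $b = k - i$. Again write $a, b$ as in (23) with $k = p^{\ell-1}$. In this case, we have $a_\ell = 1$ but $b_\ell = 0$, the latter due to the fact that $b < k$. Hence if we consider $a' = \sum_{j \in [\ell-1]} a_j p^{j-1}$ and $b' = \sum_{j \in [\ell-1]} b_j p^{j-1}$, we get $a' = d - i < b' = k - i$. Therefore, there must exist $j \in [\ell-1]$ such that $a_j < b_j$. From Lucas' theorem, it now follows that $\binom{a}{b} \equiv 0 \pmod p$.

Proof of Lemma 6.1. Let $G \in \mathcal{F}(2k, d)$ be written as $G(Y) = \sum_{I \subseteq [2k] : |I| \leq d} \alpha_I Y_I$, where $Y_I$ denotes $\prod_{i \in I} Y_i$.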
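Let $B'$ denote all those inputs in $B$ where the last $k - d$ bits are set to 0. We will compute the sum of $G$ on inputs from $B'$. But let us first consider a monomial $Y_I$ and see what its sum over $y \in B'$ looks like. The monomial evaluates to 1 on $y \in B'$ if $y_i = 1$ for every $i \in I$, and evaluates to 0 otherwise. If $I$ is contained in the first $d + k$ coordinates, there are exactly $\binom{d+k-|I|}{k-|I|}$ choices of $y \in B'$ that satisfy $y_i = 1$ for every $i \in I$; otherwise there are none. Thus, summing over $y \in B'$, we get $\sum_{y \in B'} y_I = \binom{d+k-|I|}{k-|I|}$ in the former case and 0 in the latter. Summing over all monomials we get:
$$\sum_{y \in B'} G(y) = \sum_{I : |I| \leq d} \alpha_I \sum_{y \in B'} y_I.$$
By Corollary 6.3, it follows that for $i \in \{0, \ldots, d\}$, we have $\binom{d+k-i}{k-i} \not\equiv 0 \pmod p$ if and only if $i = 0$, and so $\sum_{y \in B'} G(y) = \binom{d+k}{k} \cdot \alpha_\emptyset$. Let $c = \binom{d+k}{k} \pmod p$. We have $c \in \mathbb{F}_p^* \subseteq \mathbb{F}^*$ and in particular $c$ is invertible in $\mathbb{F}$, and $\sum_{y \in B'} G(y) = c \cdot \alpha_\emptyset = c \cdot G(0^{2k})$. Hence, we get $G(0^{2k}) = c^{-1} \cdot \sum_{y \in B'} G(y)$. Therefore, $G(0^{2k})$ is determined by the restriction of $G$ to $B'$ and hence also by its restriction to $B$.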
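The identity $G(0^{2k}) = c^{-1} \sum_{y \in B'} G(y)$ can be checked numerically. The sketch below is an illustration under simplifying assumptions: the field is taken to be the prime field $\mathbb{F}_p$ itself, polynomials are stored as dictionaries from sets of variable indices to coefficients, and all function names are hypothetical. The modular inverse uses the three-argument `pow` with exponent `-1`, which needs Python 3.8 or later.

```python
import random
from itertools import combinations
from math import comb


def smallest_power_above(p, d):
    """Smallest power of p that is strictly greater than d."""
    k = 1
    while k <= d:
        k *= p
    return k


def random_poly(num_vars, d, p, rng):
    """Random multilinear polynomial of degree <= d with coefficients in F_p."""
    return {frozenset(S): rng.randrange(p)
            for size in range(d + 1)
            for S in combinations(range(num_vars), size)}


def evaluate(poly, y, p):
    """Evaluate the polynomial at a 0/1 point y, modulo p."""
    return sum(c for S, c in poly.items() if all(y[i] for i in S)) % p


def decode_origin(query, p, d):
    """Recover G(0^{2k}) from the values of G on B', the weight-k inputs
    whose last k - d coordinates are zero."""
    k = smallest_power_above(p, d)
    c = comb(d + k, k) % p          # non-zero mod p by Corollary 6.3
    total = 0
    for ones in combinations(range(d + k), k):
        y = [0] * (2 * k)
        for i in ones:
            y[i] = 1
        total += query(tuple(y))
    return (pow(c, -1, p) * total) % p
```

A typical use is to draw `G = random_poly(2 * k, d, p, rng)` and compare `decode_origin(lambda y: evaluate(G, y, p), p, d)` with the constant coefficient `G[frozenset()]`.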
The decoder. We now give the formal description of the decoder. Let the decoder be given oracle access to $f$ with the promise that $f$ is $1/\left(4\binom{2k}{k}\right)$-close to some $F \in \mathcal{F}(n, d)$. Let the input to the decoder be $x \in \{0,1\}^n$. The problem is to find $F(x)$.
We describe the decoder below:

Decoder $D^f_k(x)$.
• Choose a function $h : [n] \to [2k]$ uniformly at random.
• For $i \in [2k]$ and $j \in [n]$ such that $h(j) = i$, identify $X_j$ with $Y_i \oplus x_j$.
• Let $g(Y_1, \ldots, Y_{2k})$ and $G(Y_1, \ldots, Y_{2k})$ be the restrictions of $f$ and $F$ respectively. Assuming $g|_B = G|_B$, query $g$ at all inputs in $B$ and decode $G(0^{2k})$ from $G|_B$. Output the value decoded.
The main theorem of this section is the following. Note that this implies Theorem 3.3.
Theorem 6.4. Let $\mathbb{F}$ be a field of characteristic $p$. For integer $d \geq 0$, let $k$ be the smallest power of $p$ greater than $d$. Then the decoder $D_k$ is a $\left(1/\left(4\binom{2k}{k}\right), \binom{2k}{k}\right)$-local decoder for $\mathcal{F}(n, d; \mathbb{F})$.
Proof. The bound on the query complexity of the decoder is clear from the description of $D_k$. So we only need to argue that the decoder outputs the value of $F(x)$ correctly with probability at least 3/4. The crucial observation is that for each fixed $y \in B$, querying $g(y)$ amounts to querying $f$ at a uniformly random point $z \in \{0,1\}^n$, where the randomness comes from the choice of $h$. This is because for each $j \in [n]$, we have $z_j = y_{h(j)} \oplus x_j$ where $h : [n] \to [2k]$ is a uniformly random function. Since $y$ is balanced, each $y_{h(j)}$ is a uniformly random bit, and these bits are independent across $j$. Hence we see that $z$ is distributed uniformly over $\{0,1\}^n$. Thus, if $\delta(f, F) \leq 1/\left(4\binom{2k}{k}\right)$, then by a union bound over the $\binom{2k}{k}$ queries, with probability at least 3/4 all the random queries made lie outside the error set $E = \{z \in \{0,1\}^n \mid f(z) \neq F(z)\}$, and in this case the decoder is able to access the function $G|_B$ at each input $y \in B$. By Lemma 6.1, this allows the decoder to determine $G(0^{2k})$. Noting that the image of $0^{2k}$ in $\{0,1\}^n$ is exactly $x$, we thus see that the decoder outputs $F(x)$ correctly.
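The whole decoder fits in a few lines. The following is a sketch rather than a definitive implementation: it assumes the field is $\mathbb{F}_p$, that `f` is a Python callable on 0/1 tuples returning values modulo `p`, and that $G(0^{2k})$ is recovered through the linear relation on $B' \subseteq B$ from the proof of Lemma 6.1 (the text only requires that $G|_B$ determines $G(0^{2k})$). The name `local_decode` is hypothetical.

```python
import random
from itertools import combinations
from math import comb


def local_decode(f, x, p, d, rng):
    """One run of the decoder: a guess for F(x) using C(2k, k) queries to f."""
    n = len(x)
    k = 1
    while k <= d:                                   # smallest power of p above d
        k *= p
    h = [rng.randrange(2 * k) for _ in range(n)]    # random map [n] -> [2k]
    c_inv = pow(comb(d + k, k) % p, -1, p)          # c is non-zero mod p
    total = 0
    for ones in combinations(range(2 * k), k):      # every balanced y in B
        y = [0] * (2 * k)
        for i in ones:
            y[i] = 1
        # X_j is identified with Y_{h(j)} xor x_j, so g(y) = f(z).
        z = tuple(y[h[j]] ^ x[j] for j in range(n))
        value = f(z)
        if all(i < d + k for i in ones):            # y lies in B'
            total += value
    return (c_inv * total) % p
```

When `f` agrees with a degree-$d$ polynomial $F$ everywhere, the output equals $F(x)$ for every choice of $h$; with a small fraction of errors, the union bound in the proof above shows that all queries avoid the error set with probability at least 3/4.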