Computing Majority by Constant Depth Majority Circuits with Low Fan-in Gates

We study the following computational problem: for which values of $k$ can the majority of $n$ bits $\text{MAJ}_n$ be computed by a depth two formula in which each gate computes a majority function of at most $k$ bits? The corresponding computational model is denoted by $\text{MAJ}_k \circ \text{MAJ}_k$. We observe that the minimum value of $k$ for which there exists a $\text{MAJ}_k \circ \text{MAJ}_k$ circuit that has high correlation with the majority of $n$ bits is equal to $\Theta(n^{1/2})$. We then show that for a randomized $\text{MAJ}_k \circ \text{MAJ}_k$ circuit computing the majority of $n$ input bits with high probability for every input, the minimum value of $k$ is equal to $n^{2/3+o(1)}$. We show a worst case lower bound: if a $\text{MAJ}_k \circ \text{MAJ}_k$ circuit computes the majority of $n$ bits correctly on all inputs, then $k\geq n^{13/19+o(1)}$. This lower bound exceeds the optimal value for randomized circuits and thus is unreachable by purely randomized techniques. For depth $3$ circuits we show that a circuit with $k= O(n^{2/3})$ can compute $\text{MAJ}_n$ correctly on all inputs.


Introduction
In this paper we study majority functions and circuits consisting of them. These functions and circuits arise for various reasons in many areas of Computational Complexity (see e.g. [13,15,8]). In particular, the iterated majority function (or recursive majority), which consists of iterated application of the majority of a small number of variables to itself, turns out to be of great importance: it helps in various constructions and provides an example of a function with interesting complexity properties in various models [9,12,14,10].
One of the most prominent examples to illustrate this is the proof by Valiant [19] that the majority $\text{MAJ}_n$ of $n$ variables can be computed by a boolean circuit of depth $5.3 \log n$. The construction of Valiant is randomized and there is no known deterministic construction achieving the same (or even a reasonably close) depth parameter. The construction works as follows. Consider a uniform boolean formula (that is, a tree-like circuit) consisting of $5.3 \log n$ alternating layers of AND and OR gates of fan-in 2. For each input of the circuit substitute a random input variable of the function $\text{MAJ}_n$. Valiant showed that this circuit computes $\text{MAJ}_n$ with positive probability. Note that AND and OR gates are precisely $\text{MAJ}_2$ functions with different threshold values. Thus this construction can be viewed as a computation of $\text{MAJ}_n$ by a circuit consisting of $\text{MAJ}_2$ gates. There are versions of this construction with circuits consisting of $\text{MAJ}_3$ gates (see, e.g., [5]).
In this paper we study what happens in this setting if we restrict the depth of the circuit to a small constant. That is, we study for which $k$ the function $\text{MAJ}_n$ can be computed by a small depth circuit consisting of $\text{MAJ}_k$ gates. We mostly concentrate on depth 2 and denote the corresponding model by $\text{MAJ}_k \circ \text{MAJ}_k$. For example, the majority of $n = 7$ bits $x_1, x_2, \dots, x_7$ can be computed with a $\text{MAJ}_k \circ \text{MAJ}_k$ circuit for $k = 5$. We study which upper and lower bounds on $k$ can be shown.
More context for the problem under consideration comes from the study of boolean circuits of constant depth. The class $\text{TC}^0$ of boolean functions computable by polynomial size constant depth circuits consisting of MAJ gates plays one of the central roles in this area. Its natural generalization is the class in which, instead of MAJ gates, one can use arbitrary linear threshold gates, that is, analogs of majorities in which the variables are summed with arbitrary integer coefficients and compared with an arbitrary integer threshold. It is known that exponential size coefficients suffice to express any threshold function. To show that this generalized class is actually the same as $\text{TC}^0$ it is enough to show that any linear threshold function can be computed by a constant depth circuit consisting of threshold functions with polynomial-size coefficients (polynomial size coefficients can be simulated in $\text{TC}^0$ by repetition of variables). It was shown by Siu and Bruck in [18] that any linear threshold function can be computed by a polynomial size depth-3 majority circuit. This result was improved to depth 2 by Goldmann, Håstad and Razborov in [4]. More generally, it was shown in [4] that a depth-$d$ polynomial size threshold circuit can be computed by a depth-$(d + 1)$ polynomial size majority circuit, in particular establishing the class of depth-2 threshold circuits as one of the weakest classes for which we currently do not know superpolynomial size lower bounds. The best lower bound known so far is $\Omega(n^{3/2}/\log^3 n)$ by Kane and Williams [11]. Note, however, that the result of [4] does not translate to the monotone setting. Hofmeister in [6] showed that there is a monotone linear threshold function requiring an exponential size depth-2 monotone majority circuit. Recently this result was extended by Chen, Oliveira and Servedio [2] to monotone majority circuits of arbitrary constant depth.
Our setting can be viewed as a scale down of the setting of [4] and [6]. In [4,6] exponential weight threshold functions are compared to depth-2 threshold circuits with polynomial weights. In our setting we compare weight-n threshold functions with depth-2 threshold circuits with weights k. In this paper we consider monotone setting.
Another context for our study comes from the study of lower bounds against $\text{TC}^0$. Allender and Koucký in [1] showed that to prove that some function is not in $\text{TC}^0$ it is enough to show that some self-reducible function requires circuit size at least $n^{1+\varepsilon}$ when computed by a constant depth majority circuit. As an intermediate result they show that $\text{MAJ}_n$ can be computed by an $O(1)$-depth circuit consisting of $\text{MAJ}_{n^\varepsilon}$ gates and of size $O(n \log n)$. This setting is similar to ours; however, in this paper we are interested in the precise depth and we do not pose additional bounds on the size of the circuit (note, however, that the bound on the fan-in $k$ of the gates and the bound on the depth $d$ of the circuit naturally imply the bound of $O(k^d)$ on the size of the circuit).
We consider three models of computation of the majority function: computation on most of the inputs (that is, high correlation with the function), randomized computation with small error probability on all inputs, and deterministic computation with no errors. We prove the following lower and upper bounds for our setting.
• Circuits with high correlation. We observe that the minimum value of $k$ for which there exists a $\text{MAJ}_k \circ \text{MAJ}_k$ circuit that computes $\text{MAJ}_n$ correctly on a 2/3 fraction of all the inputs is equal to $\Theta(n^{1/2})$. A lower bound is proved by observing that a circuit with $k = \alpha n^{1/2}$ cannot even read a large fraction of the input bits when the constant $\alpha$ is small enough. We show that in this case the circuit errs on many inputs. An upper bound is proved for the following natural circuit: pick $k = \Theta(n^{1/2})$ random subsets of the $n$ input bits of size $k$, compute the majority for each of them, and then compute the majority of the results. Such a circuit computes $\text{MAJ}_n$ correctly with high probability on inputs whose weight is not too close to $n/2$. By tuning the parameters appropriately, we ensure that the middle layers of the boolean hypercube (containing inputs where the circuit errs with high probability) constitute only a small fraction of all the inputs.
• Randomized circuits. We prove that for a probabilistic distribution $C$ of $\text{MAJ}_k \circ \text{MAJ}_k$ circuits with the property that for every input $A \in \{0, 1\}^n$ the probability that $C(A) = \text{MAJ}_n(A)$ is at least $1 - \varepsilon$ for a constant $\varepsilon > 0$, the minimum value of $k$ is $n^{2/3}$, up to polylogarithmic factors. A lower bound is proved by showing that a small circuit must err on a large fraction of minterms/maxterms of $\text{MAJ}_n$. Roughly, the majority function has many inputs $A \in \{0, 1\}^n$ with the property that changing a single bit in $A$ changes the value of the function (these are precisely minterms and maxterms of $\text{MAJ}_n$). If $k$ is small enough, a $\text{MAJ}_k \circ \text{MAJ}_k$ circuit can reflect such a change in the value only for a small fraction of inputs. To show an upper bound, we split the $n$ input bits into blocks and for each block compute several middle values of the bits of this block in sorted order. We then compute the majority of all the resulting values. We show that by tuning the parameters appropriately, one can ensure that this circuit errs only on a polynomially small fraction of inputs.
• Deterministic circuits. The trivial upper bound on $k$ is $k \le n$. We do not have any nontrivial upper bound on $k$ for depth 2 circuits. We do, however, have examples for $n = 7, 9, 11$ of circuits with $k = n - 2$. For depth 3 we have an upper bound $O(n^{2/3})$, which coincides with the optimal value for depth 2 randomized circuits up to a polylogarithmic factor. We prove this upper bound by extending the construction of the upper bound for depth 2 randomized circuits. We use an extra layer of the circuit to preorder the inputs. Regarding the lower bound for depth 2, we observe that the following simple special case cannot compute $\text{MAJ}_n$: each gate is a standard majority (that is, with threshold $k/2$) of exactly $k = n - 2$ distinct variables. Next, we proceed to the main result of the paper. We show that the minimum value of $k$ for which there is a depth 2 circuit computing $\text{MAJ}_n$ on all inputs is at least $n^{13/19}$ up to a polylogarithmic factor.
Note that this lower bound exceeds the optimal value of $k$ for randomized circuits. Thus, despite the fact that randomized techniques are extensively used for studying majority and circuits constructed from it and prove to be very powerful (recall for example Valiant's result [19]), in our setting we use combinatorial methods to prove a lower bound that is unreachable for a purely probabilistic approach. The proof of this result, however, is still probabilistic: in essence we consider a circuit with $k$ smaller than $n^{13/19}$ and build a distribution on inputs that fools this circuit. The catch is that the distribution is tailored to fool this particular circuit: it is constructed via a non-trivial process that involves the values of the gates of the circuit on various inputs.
The rest of the paper is organized as follows. In Section 2 we give necessary definitions and collect technical statements. In Section 3 we study circuits computing the function with high correlation. In Section 4 we give bounds for randomized circuits. In Section 5 we study deterministic circuits. Finally, in Section 6 we give concluding remarks and state several open problems. Most of the proofs are moved from the main text to Appendix.

Definitions and Preliminaries
In this section we will give necessary definitions and collect technical statements that we will use throughout the paper.
We are going to study circuits computing the well known boolean majority function defined as follows: $\text{MAJ}_n(x_1, \dots, x_n) = \left[\sum_{i=1}^{n} x_i \ge n/2\right]$. Here, $[\cdot]$ denotes the standard Iverson bracket: for a predicate $P$, $[P] = 1$ if $P$ is true, and $[P] = 0$ if $P$ is false. Abusing notation, we will also use $[m]$ to denote the set $\{1, 2, \dots, m\}$.
It will be convenient to use $X = \{x_1, x_2, \dots, x_n\}$ for the set of $n$ input bits. For an assignment $A : X \to \{0, 1\}$, by $w(A)$ we denote the weight of $A$, that is, $\sum_{x \in X} A(x)$. For a subset of input variables $S \subseteq X$, by $w_S(A)$ we denote the weight of $A$ on $S$: $w_S(A) = \sum_{x \in S} A(x)$. By $\text{MAJ}_S(X)$ we denote the majority function on $S$: $\text{MAJ}_S(X) = \left[\sum_{x \in S} x \ge |S|/2\right]$. In particular, $\text{MAJ}_X$ is just $\text{MAJ}_n$.
An assignment $A : X \to \{0, 1\}$ is called a minterm of $\text{MAJ}_n$ if $\text{MAJ}_n(A) = 1$, but flipping any 1 to 0 in $A$ results in an assignment $A'$ such that $\text{MAJ}_n(A') = 0$. A maxterm is defined similarly with the roles of 0 and 1 interchanged.
The majority function is a special case of a threshold function: $f(x_1, \dots, x_n) = \left[\sum_i c_i x_i \ge t\right]$. For such a function $f$ and an assignment $A$, we write $\operatorname{diff}(f, A) = \sum_i c_i A(x_i) - t$ for the difference between the value of the linear form on $A$ and the threshold. The $\text{MAJ}_k \circ \text{MAJ}_k$ computational model that we study in this paper is defined as a depth two formula (we will also call it a circuit) consisting of arbitrary threshold gates of the form $\left[\sum_i c_i x_i \ge t\right]$ where the $c_i$'s are positive integers (this, in particular, means that the model is monotone) and $\sum_i c_i \le k$. At the same time, abusing notation, by $\text{MAJ}_n$ and $\text{MAJ}_X$ we always mean the standard majority function. We note that the coefficients $c_i$ can be simulated by repetition of variables (note that $k$ upper bounds the sum of the coefficients). So the generalization of $\text{MAJ}_k$ in the circuit compared to $\text{MAJ}_n$ is that we allow an arbitrary threshold. We note, however, that if we are interested in the value of $k$ up to a constant factor (which we usually are), it is not an actual generalization, since any threshold can be simulated by substituting constants 0 and 1 as inputs to the circuit.
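To make the model concrete, here is a minimal Python sketch of a weighted threshold gate and of a depth two formula built from such gates. The names `Gate`, `evaluate`, `diff`, `fan_in` and `evaluate_depth2` are hypothetical helpers introduced only for illustration. In particular, `diff` is assumed here to be the value of the linear form minus the threshold, which is how the quantity is used in the lower bound proof later in the paper.

```python
from typing import Dict, List, Tuple

# A gate is (weights, threshold): weights maps an input index to a positive
# integer coefficient c_i, and the gate outputs [sum_i c_i * x_i >= threshold].
Gate = Tuple[Dict[int, int], int]


def linear_form(gate: Gate, values: List[int]) -> int:
    """Value of sum_i c_i * x_i on the given 0/1 values."""
    weights, _ = gate
    return sum(c * values[i] for i, c in weights.items())


def evaluate(gate: Gate, values: List[int]) -> int:
    """Output of the threshold gate (0 or 1)."""
    return int(linear_form(gate, values) >= gate[1])


def diff(gate: Gate, values: List[int]) -> int:
    """Assumed definition: linear form minus threshold."""
    return linear_form(gate, values) - gate[1]


def fan_in(gate: Gate) -> int:
    """Sum of the coefficients; the model requires this to be at most k."""
    return sum(gate[0].values())


def standard_majority(indices: List[int]) -> Gate:
    """MAJ over a multiset of inputs; [sum >= m/2] equals [sum >= ceil(m/2)]."""
    weights: Dict[int, int] = {}
    for i in indices:
        weights[i] = weights.get(i, 0) + 1
    return weights, (len(indices) + 1) // 2


def evaluate_depth2(bottom: List[Gate], top: Gate, values: List[int]) -> int:
    """Feed the outputs of the bottom gates into the top gate."""
    middle = [evaluate(g, values) for g in bottom]
    return evaluate(top, middle)
```

Repeating an index in `standard_majority` corresponds to simulating a coefficient by repetition of a variable, so `fan_in` counts occurrences with multiplicity.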
For a gate G at the bottom level of a MAJ k • MAJ k circuit, by X(G) we denote the set of its input bits.

Tail Bounds and Binomial Coefficients Estimates
We will use the following versions of Chernoff-Hoeffding bound (see, e.g., [3]).
We will also need the following well known estimates for the binomial coefficients (see, e.g., [16, Section 4.2]): Lemma 2. The middle binomial coefficient is about $n^{1/2}$ times smaller than $2^n$. To make it smaller than $2^n$ by an arbitrary polynomial factor, it is enough to step away from the middle by about $\Theta(\sqrt{n \ln n})$ ($0 < c < 1$ is a constant below):
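The two qualitative statements about binomial coefficients can be checked numerically. The following Python sketch is only an illustration: the function names and the parameter `scale` (the multiple of $\sqrt{n \ln n}$ by which we step away from the middle) are assumptions made for this example and are not the constants of the lemmas.

```python
import math


def central_fraction(n: int) -> float:
    """Fraction of n-bit strings on the middle layer: C(n, n/2) / 2^n."""
    return math.comb(n, n // 2) / 2 ** n


def shifted_fraction(n: int, scale: float) -> float:
    """Fraction of strings on the layer about scale*sqrt(n ln n) above the middle."""
    shift = round(scale * math.sqrt(n * math.log(n)))
    return math.comb(n, n // 2 + shift) / 2 ** n


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        # The first product stays bounded (the middle layer is ~ 2^n / sqrt(n));
        # the second one shrinks, i.e. the shifted layer loses more than a factor n.
        print(n, central_fraction(n) * math.sqrt(n), shifted_fraction(n, 1.0) * n)
```

Python integers are exact, so the ratios are computed without overflow even for large $n$.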

Hypergeometric Distribution
The hypergeometric distribution is defined in the following way. Consider a set $S$ of size $m$ and its subset $S'$ of size $k$. Select (uniformly) a random subset $T$ of size $t$ in $S$. Then the random variable $|T \cap S'|$ has a hypergeometric distribution. The values $m$, $k$ and $t$ are parameters here. We will need the following basic properties of this distribution. For the sake of completeness their proofs can be found in the Appendix (Section 7.1). Lemma 4. Suppose in the hypergeometric distribution $k = k(m) \le m/2$ (that is, $k$ may depend on $m$). Let $t = t(m)$ be a function with $\varepsilon m < t < (1 - \varepsilon)m$ for some constant $0 < \varepsilon < 1$. Consider an arbitrary antichain $\mathcal{A}$ on $S'$ (that is, a family of subsets of $S'$ none of which is a subset of another). Then the probability $\operatorname{Prob}[T \cap S' \in \mathcal{A}]$ is $O(1/\sqrt{k})$ for $m \to \infty$, and the constant inside $O(\cdot)$ depends on $\varepsilon$, but does not depend on $m$, $k$ and $t$.
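The definition translates directly into a sampling procedure. Below is a minimal Python sketch, assuming without loss of generality that $S = \{0, \dots, m-1\}$ and $S' = \{0, \dots, k-1\}$; the function names are hypothetical. It compares the empirical mean of $|T \cap S'|$ with the standard value $tk/m$ of the hypergeometric expectation.

```python
import random


def sample_overlap(m: int, k: int, t: int, rng: random.Random) -> int:
    """Draw a uniform t-subset T of S = {0..m-1} and return |T ∩ S'|, S' = {0..k-1}."""
    chosen = rng.sample(range(m), t)
    return sum(1 for element in chosen if element < k)


def empirical_mean(m: int, k: int, t: int, trials: int, seed: int = 0) -> float:
    """Average overlap over independent draws; should be close to t * k / m."""
    rng = random.Random(seed)
    return sum(sample_overlap(m, k, t, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    m, k, t = 100, 20, 50
    print(empirical_mean(m, k, t, trials=2000), t * k / m)
```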

Circuits with High Correlation
In this section, we prove that the minimum value of k for which there exists a MAJ k • MAJ k circuit that computes MAJ n correctly on, say, 2/3 fraction of all the inputs, is equal to Θ(n 1/2 ).

Upper Bound
Proof Sketch. The required circuit is straightforward: we just pick $k$ random subsets $S_1, S_2, \dots, S_k$ of $X$ of size $k$, compute the majority for each of them, and then compute the majority of the results: $C(X) = \text{MAJ}_k(\text{MAJ}_{S_1}(X), \text{MAJ}_{S_2}(X), \dots, \text{MAJ}_{S_k}(X))$. The resulting circuit has a high probability of error on the middle layers of the boolean hypercube. We, however, select the parameters so that all the inputs from these middle layers constitute only a small $\varepsilon/2$ fraction. We then show that among all the remaining inputs (not belonging to the middle layers) there is only a fraction $\varepsilon/2$ (of all the inputs) where $\text{MAJ}_n$ may be computed incorrectly. Overall, this gives a circuit that errs on at most an $\varepsilon$ fraction of the inputs. A detailed proof is provided in Section 7.2 in the Appendix.
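The random-subsets construction from the proof sketch is easy to simulate. The following Python sketch builds one random circuit and estimates, on uniformly random inputs, how often it agrees with $\text{MAJ}_n$. The concrete values of `n`, `k`, `trials` and the seed are arbitrary choices for illustration, not the parameters from the proof.

```python
import random
from typing import List


def maj(bits: List[int]) -> int:
    """Standard majority: [sum of bits >= len(bits) / 2]."""
    return int(2 * sum(bits) >= len(bits))


def random_depth2_circuit(n: int, k: int, rng: random.Random) -> List[List[int]]:
    """k random k-element subsets of the n input positions (the bottom gates)."""
    return [rng.sample(range(n), k) for _ in range(k)]


def evaluate(circuit: List[List[int]], x: List[int]) -> int:
    """Majority of the majorities over the chosen subsets."""
    return maj([maj([x[i] for i in subset]) for subset in circuit])


def estimate_agreement(n: int, k: int, trials: int, seed: int = 0) -> float:
    """Fraction of sampled uniform inputs on which the circuit equals MAJ_n."""
    rng = random.Random(seed)
    circuit = random_depth2_circuit(n, k, rng)
    hits = 0
    for _ in range(trials):
        x = [rng.randint(0, 1) for _ in range(n)]
        hits += int(evaluate(circuit, x) == maj(x))
    return hits / trials


if __name__ == "__main__":
    # k is a small multiple of sqrt(n); larger multiples give higher agreement.
    print(estimate_agreement(n=400, k=61, trials=300))
```

Sampling inputs uniformly concentrates the weight within a few multiples of $\sqrt{n}$ of $n/2$, which is exactly the region where the choice of the constant in $k = \Theta(n^{1/2})$ matters.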

Lower Bound
Next we show that this upper bound is tight.
Proof Sketch. Let $k = \alpha n^{1/2}$ for a small enough constant $\alpha = \alpha(\varepsilon)$. Note that such a circuit can read at most $k^2 = \alpha^2 n$ of the input bits. This means that the circuit errs on a large number of inputs. All formal estimates are given in Section 7.2 in the Appendix.

Randomized Circuits
The upper bound from the previous section, however, is not enough to obtain a randomized circuit since the construction in Theorem 6 has a very high error probability on the middle layers of the boolean cube. By a randomized circuit here we mean a probabilistic distribution on deterministic circuits computing the function correctly on every input with high probability.
It is not difficult to see that the existence of a randomized circuit is equivalent to an existence of a deterministic circuit computing the function correctly on most of minterms and maxterms (the proof of the following lemma can be found in Section 7.3 in the Appendix).
Lemma 8. If there exists a randomized circuit C in MAJ k • MAJ k computing MAJ n with error probability ε, then there exists a deterministic circuit C in MAJ k • MAJ k computing MAJ n incorrectly on at most ε fraction of minterms and maxterms. Conversely, if there exists a deterministic circuit C in MAJ k • MAJ k computing MAJ n incorrectly on at most ε fraction of minterms and maxterms, then there exists a randomized circuit C in MAJ k • MAJ k computing MAJ n with error probability at most 2ε.
So from now on instead of probabilistic circuits we study deterministic circuits with high accuracy on two middle layers of {0, 1} n .

Upper Bound
Theorem 9. There exists a randomized $\text{MAJ}_k \circ \text{MAJ}_k$ circuit computing $\text{MAJ}_n$ incorrectly on each input with probability at most $1/\operatorname{poly}(n)$ for $k = O(n^{2/3} \log^{1/2} n)$.
Proof Sketch. Partition the set of $n$ input bits into $n^{1/3}$ blocks of size $p = n^{2/3}$. For each block, compute threshold gates on the bits of this block for the middle threshold values, those within about $t \approx n^{1/3} \log^{1/2} n$ of $p/2$, and return the majority of the results. By selecting the right value of $t$, this gives a circuit that computes $\text{MAJ}_n$ incorrectly only on a $1/\operatorname{poly}(n)$ fraction of inputs. The detailed proof is given in Section 7.3 in the Appendix.
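The block construction can be illustrated by a small Python sketch. The exact thresholds used in the paper are not visible in the proof sketch, so the choice below is an assumption: for every block of even size `p` we use the `2t` thresholds `p/2 - t + 1, ..., p/2 + t`, and the top gate is a standard majority of all results. With this choice the circuit is exact whenever every block weight is within `t` of `p/2`, and it can err when a block is too unbalanced.

```python
from itertools import product
from typing import Sequence


def maj(bits: Sequence[int]) -> int:
    """Standard majority: [sum of bits >= len(bits) / 2]."""
    return int(2 * sum(bits) >= len(bits))


def block_circuit(x: Sequence[int], p: int, t: int) -> int:
    """Depth two circuit: per-block threshold gates, then a majority on top."""
    assert len(x) % p == 0 and p % 2 == 0
    middle = []
    for start in range(0, len(x), p):
        weight = sum(x[start:start + p])
        # Assumed thresholds p/2 - t + 1, ..., p/2 + t: a block whose weight is
        # within t of p/2 contributes exactly (weight - p/2 + t) ones.
        for j in range(-t + 1, t + 1):
            middle.append(int(weight >= p // 2 + j))
    return maj(middle)


def is_exact(n: int, p: int, t: int) -> bool:
    """Brute-force check against MAJ_n on all 2^n inputs (small n only)."""
    return all(block_circuit(x, p, t) == maj(x) for x in product((0, 1), repeat=n))


if __name__ == "__main__":
    print(is_exact(8, 4, 2))  # all thresholds 1..4 are present in every block
    print(is_exact(8, 4, 1))  # only thresholds 2 and 3 are present
```

Each bottom gate reads `p` bits and the top gate reads `2t` values per block, which matches the fan-in $O(n^{2/3} \log^{1/2} n)$ for $p = n^{2/3}$ and $t \approx n^{1/3} \log^{1/2} n$.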

Lower Bound
In this subsection we show that the upper bound of the previous subsection is essentially tight.
Proof Sketch. The majority function has many inputs $A \in \{0, 1\}^n$ with the property that changing a single bit in $A$ changes the value of the function (these are precisely minterms and maxterms of $\text{MAJ}_n$). If $k = \alpha n^{2/3}$ for a small enough constant $\alpha$, a $\text{MAJ}_k \circ \text{MAJ}_k$ circuit can reflect such a change in the value only for a small fraction of inputs. A detailed proof is given in Section 7.3 in the Appendix.

Deterministic Circuits
In this section, we consider MAJ k • MAJ k circuits that compute MAJ n correctly on all 2 n inputs.

Depth Two
In this section, we present $\text{MAJ}_k \circ \text{MAJ}_k$ circuits computing $\text{MAJ}_n$ on all inputs for $k = n - 2$ when $n = 7, 9, 11$. These circuits were found by extensive computer experiments (with the help of SAT-solvers). Though the examples below look quite "structured", we currently do not know how to generalize them to all values of $n$ (let alone how to construct such circuits for sublinear values of $k$). In the examples below, we provide $k = n - 2$ sequences consisting of $k = n - 2$ integers from $[n]$. These are exactly the input bits of the $k$ majority gates at the lower level of the circuit. That is, each gate computes the standard $\text{MAJ}_k$ function (whose threshold value is $k/2$).
n = 7:  4 5 6 7 8 9  1 2 3 4 5 6 7 10 11  1 2 3 4 5 8 9 10 11  1 2 3 6 7 8 9 10 11  1 4 5 6 7 8 9 10
Note that in the examples above there is always a gate in the circuit having one variable repeated more than once. Next we observe that this is unavoidable for $k = n - 2$.
Lemma 11. For odd $n$ there is no $\text{MAJ}_k \circ \text{MAJ}_k$ circuit computing $\text{MAJ}_n$ for $k = n - 2$ with all gates being standard majorities (that is, with the threshold $k/2$) and having exactly $k$ distinct variables in each gate on the bottom level.
We provide a proof of this lemma in Section 7.4 in the Appendix.
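Circuits given as lists of index sequences, as in the examples of this section, can be verified by brute force for small $n$. The Python sketch below assumes that every gate, including the top one, is a standard majority and that a variable repeated in a sequence is counted with multiplicity; `computes_majority` is a hypothetical helper name. The two circuits in the usage lines are toy examples for $n = 3$, not the circuits from the paper.

```python
from itertools import product
from typing import List, Sequence


def maj(bits: Sequence[int]) -> int:
    """Standard majority: [sum of bits >= len(bits) / 2]."""
    return int(2 * sum(bits) >= len(bits))


def computes_majority(n: int, gates: List[List[int]]) -> bool:
    """Check a depth two circuit against MAJ_n on all 2^n inputs.

    Each gate is a sequence of 1-based variable indices from [n]; repeated
    indices are allowed and act as integer weights.
    """
    for x in product((0, 1), repeat=n):
        middle = [maj([x[i - 1] for i in gate]) for gate in gates]
        if maj(middle) != maj(x):
            return False
    return True


if __name__ == "__main__":
    # Toy example: three copies of MAJ_3 at the bottom trivially compute MAJ_3.
    print(computes_majority(3, [[1, 2, 3], [1, 2, 3], [1, 2, 3]]))
    # Toy example: gates reading only x1 and x2 cannot compute MAJ_3.
    print(computes_majority(3, [[1, 2], [1, 2], [1, 2]]))
```

The running time is $2^n$ times the circuit size, which is fine for $n \le 11$ but does not replace the SAT-based search mentioned above.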

Depth Three
In this section we extend the proof of the upper bound for randomized depth-2 circuits (Theorem 9) to construct a circuit of depth 3 for $k = O(n^{2/3})$ computing majority on all inputs.
Proof Sketch. We adopt the strategy of the proof of Theorem 9. That is, we break the inputs into $O(n^{1/3})$ blocks, compute majorities on each block on the middle $O(n^{1/3})$ layers and then compute the majority of the results. We use the third layer of majority gates to induce additional structure on the inputs. The full proof is given in Section 7.4 in the Appendix.

Lower Bound
In this section we will push the lower bound on $k$ beyond $\Omega(n^{2/3})$ for depth-2 circuits computing $\text{MAJ}_n$ on all inputs.
We also show the following result for the special case of circuits with bounded weights.
Theorem 14. Suppose a $\text{MAJ}_k \circ \text{MAJ}_k$ circuit computes $\text{MAJ}_n$ on all inputs and uses only weights at most $W$ in the gates. Then $k = \Omega(n^{7/10} \cdot (\log n)^{-1/5} \cdot W^{-3/10})$.
In particular, we get the following corollary for circuits with unweighted gates.
Corollary 15. Suppose a $\text{MAJ}_k \circ \text{MAJ}_k$ circuit computes $\text{MAJ}_n$ on all inputs and each variable occurs in each gate of the bottom level at most once. Then $k = \Omega(n^{7/10} \cdot (\log n)^{-1/5})$.
The rest of this section is devoted to the unified proof of these lower bounds. To follow this proof it is convenient to think that $k = n^{2/3+\varepsilon}$ for some small $\varepsilon > 0$. In the end it will indeed be the case up to a logarithmic factor. In the proof we will calculate everything precisely in terms of the parameters $n$ and $k$, but we will also provide estimates assuming that $k = n^{2/3+\varepsilon}$. This is done in order to help the reader follow the proof.
Let F be a MAJ k • MAJ k formula computing MAJ n on all inputs from {0, 1} n . Denote by W the largest weight of a variable in gates of F .

Normalizing a formula
We start by "normalizing" F , that is, removing some pathological gates from F . We do this in two consecutive stages.
Stage 1: removing AND-like gates. We need that no gate can be fixed to 0 by assigning a small number of variables to 0 (here and in what follows we consider gates from the bottom level only). For this, assume that there is a gate that can be fixed to 0 by assigning 0 to fewer than $n/(100k) = n^{1/3-\varepsilon}/100$ variables. Take these variables and substitute 0 for them; this kills the gate (and might potentially introduce new gates with this property). We repeat this process until there are no bad gates left. Recall that the number of gates at the bottom level is at most $k = n^{2/3+\varepsilon}$, so there are at most $k$ steps in this process. Hence fewer than $n/100$ variables are fixed in total, and $n$ is replaced by $99n/100$. To simplify the presentation, we just assume that $|X| = n$ and that $F$ has no bad gates.
Stage 2: removing other pathological gates and variables. The formula $F$ contains at most $k^2 = n^{4/3+2\varepsilon}$ occurrences of variables (counting with multiplicities). Let $x^* \in X$ be a least frequent variable at the leaves. The number of occurrences of $x^*$ is at most $k^2/n = n^{1/3+2\varepsilon}$. In the following we consider only assignments $A$ with $\operatorname{diff}(\text{MAJ}_n, A) = -1$ that set $x^*$ to 0: $\mathcal{A}^* = \{A : X \to \{0, 1\} \mid \operatorname{diff}(\text{MAJ}_n, A) = -1,\ A(x^*) = 0\}$. We also focus on the gates from the first level that depend on $x^*$; denote this set by $\mathcal{G}^*$ (hence $|\mathcal{G}^*| \le k^2/n = n^{1/3+2\varepsilon}$). The total number of variables in the gates from $\mathcal{G}^*$ (counting with multiplicities) is at most $k|\mathcal{G}^*| \le k^3/n = n^{1+3\varepsilon}$.
We now additionally normalize the circuit. We get rid of the following bad gates and variables:
1. gates in $\mathcal{G}^*$ that can be fixed to 1 by assigning 1 to fewer than $n^2/(100k^2) = n^{2/3-2\varepsilon}/100$ variables in $X \setminus \{x^*\}$;
2. gates in $\mathcal{G}^*$ in which the weight of the variable $x^*$ is greater than $100k^3/n^2 = 100n^{3\varepsilon}$;
3. variables with total weight in all gates in $\mathcal{G}^*$ greater than $100k^3/n^2 = 100n^{3\varepsilon}$.
We do this by the following iterative procedure. If at some step we have a gate violating 1, we fix fewer than $n^2/(100k^2) = n^{2/3-2\varepsilon}/100$ variables of the gate among $X \setminus \{x^*\}$ to 1 to turn the gate into a constant. If we have a gate violating 2, we fix all the variables of the gate among $X \setminus \{x^*\}$ to 1 to turn the gate into a constant. If we have a variable violating 3, we fix the violating variable to 1. We note that if we fix all variables in $G \in \mathcal{G}^*$ except $x^*$ to 1, then the gate becomes constant. Indeed, if it is not constant, then the gate outputs 0 on the input with $x^* = 0$ and the rest of the variables equal to 1. Due to the monotonicity of the gate this means that the gate can be fixed to 0 by assigning 0 to the single variable $x^*$, and we got rid of the gates with this property at the first stage of the normalization.
Since there are at most $k^2/n = n^{1/3+2\varepsilon}$ gates in $\mathcal{G}^*$, we fix at most $n/100$ variables in case 1. Since the total weight of $x^*$ is at most $k^2/n = n^{1/3+2\varepsilon}$, case 2 occurs at most $n/(100k) = n^{1/3-\varepsilon}/100$ times. Since each gate has at most $k = n^{2/3+\varepsilon}$ variables, we fix at most $n/100$ variables in case 2. Since the total weight of all variables in $\mathcal{G}^*$ is at most $k^3/n = n^{1+3\varepsilon}$, we fix at most $n/100$ of them in case 3.
In particular, we have fixed all variables having weight greater than $100k^3/n^2 = 100n^{3\varepsilon}$ in some gate of $\mathcal{G}^*$, so from now on we can assume that $W \le 100k^3/n^2$.
Another important observation is that now each gate in $\mathcal{G}^*$ has at least $n^2/(100k^2)$ inputs. Otherwise the gate falls under the condition of case 1 above.
After this normalization $n$ is replaced by $97n/100$. To simplify the presentation, again, we assume that $|X| = n$ and the circuit $F$ is normalized. Note that after redefining $n$ the threshold of the function $\text{MAJ}_n$ we are computing is no longer $n/2$, but rather is $cn$ for some constant $c$ close to 1/2. This does not affect the computations in the rest of the proof.
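The three counting arguments of Stage 2 can be written out explicitly. The following worked block only restates the bounds given above; each line divides or multiplies the relevant total by the corresponding threshold.

```latex
\begin{align*}
\text{case 1:}\quad
  & \underbrace{\frac{k^2}{n}}_{\text{gates in } \mathcal{G}^*}
    \cdot \underbrace{\frac{n^2}{100k^2}}_{\text{variables fixed per gate}}
    = \frac{n}{100}, \\
\text{case 2:}\quad
  & \underbrace{\frac{k^2/n}{100k^3/n^2}}_{\text{gates violating 2}}
    = \frac{n}{100k},
    \qquad
    \frac{n}{100k} \cdot \underbrace{k}_{\text{variables per gate}} = \frac{n}{100}, \\
\text{case 3:}\quad
  & \underbrace{\frac{k^3/n}{100k^3/n^2}}_{\text{variables violating 3}}
    = \frac{n}{100}.
\end{align*}
% Altogether at most 3n/100 variables are fixed in Stage 2,
% so at least 97n/100 of the variables remain free.
```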

Analysis
The key idea is that if we have an assignment A ∈ A * with diff(MAJ n , A) = −1, then there is a gate G ∈ G * with −W ≤ diff(G, A) ≤ −1. Indeed, otherwise we can flip the variable x * , the value of MAJ n changes, but none of the gates changes their value. The plan of the proof is to construct an assignment that violates this condition. This will lead to a contradiction.
For an assignment $A \in \mathcal{A}^*$ (so that $\operatorname{diff}(\text{MAJ}_n, A) = -1$) and integer parameters $s$ and $d$ (to be chosen later), consider the following process walk($A$, $s$, $d$).
1: $A_0 \leftarrow A$
2: for $i = 1, \dots, s$ do
3:   if there is no gate $G \in \mathcal{G}^*$ with $-d \le \operatorname{diff}(G, A_{i-1}) < 0$ then
4:     stop
5:   else
6:     $G_i \leftarrow$ a gate $G \in \mathcal{G}^*$ with $-d \le \operatorname{diff}(G, A_{i-1}) < 0$
7:     $X_i \leftarrow$ set of variables $G_i$ depends on that are assigned 1 by $A_{i-1}$
8:     $y_i \leftarrow$ a uniform random variable from $X_i$
9:     $A_i \leftarrow$ assignment to $X$ resulting from flipping the value of $y_i$ in $A_{i-1}$
10:   end if
11: end for
Clearly, this process decreases the weight of the initial assignment $A$ by 1 at each iteration, for at most $s$ iterations. In particular, $w(A) - w(A_i) = i$. We now consider three cases.
Case 1. There exists an assignment $A \in \mathcal{A}^*$ with $\operatorname{diff}(\text{MAJ}_n, A) = -1$ such that walk($A$, $s$, $d$) stops after fewer than $s$ iterations for some choices of random bits. This means that after $t < s$ iterations, for all the gates $G$ in $\mathcal{G}^*$ we have that either $\operatorname{diff}(G, A_t) < -d$ or $\operatorname{diff}(G, A_t) \ge 0$. We randomly select a subset $T$ of $t$ variables from $Z = \{x \in X \setminus \{x^*\} : A_t(x) = 0\}$ and flip them. Denote the resulting assignment by $A'$. Clearly, $w(A) = w(A')$ and so $\operatorname{diff}(\text{MAJ}_n, A') = -1$. Therefore there must be a gate $G$ in $\mathcal{G}^*$ such that $-W \le \operatorname{diff}(G, A') < 0$. Thus, before flipping $t$ random variables, all the gates with negative difference have difference less than $-d$, while after the flipping, at least one gate $G$ has difference at least $-W$. Let $Z' = \{x \in X(G) \setminus \{x^*\} : A_t(x) = 0\}$. This means that the flipping changed the values of at least $r = (d - W)/W$ variables of $G$, that is, $|T \cap Z'| \ge r$.
Let $p$ be the probability that $|T \cap Z'| \ge r$, where the probability is taken over the random choice of $T$. By choosing the parameters $s$ and $d$ we will make $p$ small enough so that with non-zero probability no gate from $\mathcal{G}^*$ satisfies this. By the discussion above this leads to a contradiction, since flipping $x^*$ changes the value of the function, but not the value of the circuit. The probability that no gate from $\mathcal{G}^*$ satisfies $|T \cap Z'| \ge r$ is at least $1 - |\mathcal{G}^*|p$. The probability $p$ can be upper bounded using Lemma 5: $p \le \left(\frac{t|Z'|}{|Z|}\right)^r \le \left(\frac{2sk}{n}\right)^r$, where the second inequality follows since $t < s$, $|Z'| \le k$ and $|Z| \ge n/2$. We want the probability $1 - |\mathcal{G}^*|p$ to be positive. Since $|\mathcal{G}^*| \le k^2/n = n^{1/3+2\varepsilon}$ we get the following inequality on $s$, $d$, and $k$: $(k^2/n) \cdot (2sk/n)^r < 1$. We can satisfy this if $sk < n/4$ and $r \ge \log(k^2/n)$. Since $\log n > \log(k^2/n)$, for the latter it is enough to have $d = W \log n$. Overall, this case poses the following constraint for the considered parameters: $sk < n/4$. (2)
Case 2. For each assignment $A \in \mathcal{A}^*$ (i.e., $\operatorname{diff}(\text{MAJ}_n, A) = -1$) the process walk($A$, $s$, $d$) goes through all $s$ iterations for all choices of random bits. We consider two subcases here.
Case 2.1. For each assignment $A \in \mathcal{A}^*$ (i.e., $\operatorname{diff}(\text{MAJ}_n, A) = -1$) there exists a choice of variables $y_1, \dots, y_s$ at line 8 of the process walk($A$, $s$, $d$), such that for each gate $G \in \{G_1, \dots, G_s\}$ (recall that the gates $G_1, \dots, G_s$ are selected at line 6 of the process) we have $\operatorname{diff}(G, A) \le f$, where $f$ is again a positive parameter to be chosen later.
We estimate the expected number $E$ of gates $G$ from $\mathcal{G}^*$ that have $-d \le \operatorname{diff}(G, A) \le f$, where the expectation is taken over the random choice of $A$. Note that a particular gate $G \in \mathcal{G}^*$ may appear in the sequence $G_1, \dots, G_s$ at most $d$ times: the first time it appears, it must have $\operatorname{diff}(G, A_1) \le -1$ for the current assignment $A_1$, the next time it has $\operatorname{diff}(G, A_2) \le -2$ for the new current assignment $A_2$, and so on. If $Ed < s$ we get a contradiction: take an assignment $A \in \mathcal{A}^*$ with $\operatorname{diff}(\text{MAJ}_n, A) = -1$ such that the number of gates $G$ in $\mathcal{G}^*$ with $-d \le \operatorname{diff}(G, A) \le f$ is at most $E$; then it cannot be true that $-d \le \operatorname{diff}(G_i, A) \le f$ for all of $G_1, \dots, G_s$, since there are just not enough gates with this diff. Now we upper bound $E$. Due to the normalization stage any fixed gate has at least $n^2/(100k^2) = n^{2/3-2\varepsilon}/100$ variables in it. Note that for any $i$ the set of inputs $B$ to the gate $G$ that give $\operatorname{diff}(G, B) = i$ forms an antichain. Then due to Lemma 4 the probability for a gate to attain a certain value is at most $O(k/n) = O(1/n^{1/3-\varepsilon})$. Hence $E = O\left(\frac{k^2}{n} \cdot (d + f) \cdot \frac{k}{n}\right) = O\left(\frac{k^3 f}{n^2}\right)$, where for the last equality we add the constraint $f \ge d$. Overall, this case poses the following constraint for the parameters (up to a constant factor): $s \ge d \cdot f \cdot \frac{k^3}{n^2}$. (4)
Case 2.2. There exists an assignment $A \in \mathcal{A}^*$ (i.e., $\operatorname{diff}(\text{MAJ}_n, A) = -1$) such that for any choice of variables $y_1, \dots, y_s$, for at least one gate $G \in \{G_1, \dots, G_s\}$ we have $\operatorname{diff}(G, A) > f$. Fix a gate $G \in \mathcal{G}^*$ with $\operatorname{diff}(G, A) > f$. We are going to upper bound the probability (over the random choices of variables $y_1, \dots, y_s$) that $G$ appears among $G_1, \dots, G_s$ during the process. If this probability is less than $1/k$, then by the union bound with positive probability no such gate appears among $G_1, \dots, G_s$, which contradicts the case statement.
For G to appear among G 1 , . . . , G s , the process has to select a variable appearing in G at line 8 many times. Indeed, if G appears in the process, then its diff with the current assignment is negative. At the same time, in the beginning of the process diff(G, A) > f . Each time when the process reduces a variable at line 8 (that is, changes its value from 1 to 0), the value of the linear function computed at G decreases by at most W (just because W is the maximum weight of a variable in all the gates in G * ). Thus, it is enough to upper bound the probability that for a fixed gate G ∈ G * with diff(G, A) > f , the process selects a variable from X(G) at least f /W times.
Let $Y_1, \dots, Y_s$ be random 0/1-variables defined as follows: $Y_i = 1$ iff the $i$-th reduced variable appears in $G$ (i.e., $y_i \in X(G)$). Let $Y = \sum_{i=1}^{s} Y_i$. Our goal is to upper bound $\operatorname{Prob}(Y \ge f/W)$.
Let $H_1, \dots, H_l$ be all the gates that share at least one variable with $G$. Assume that on step $j$ we reduce a variable from $H_i$. Then $\operatorname{Prob}(Y_j = 1) = \frac{|\{x \in X(G) \cap X(H_i) : A_{j-1}(x) = 1\}|}{|\{x \in X(H_i) : A_{j-1}(x) = 1\}|}$. Due to stage 2.1 of the normalization process, $|\{x \in X(H_i) : A_{j-1}(x) = 1\}| \ge \frac{n^2}{200k^2}$. To see this, assume the contrary. Recall that $-d \le \operatorname{diff}(H_i, A_{j-1}) < 0$. This means that increasing at most $d$ variables (i.e., changing their values from 0 to 1) from $X(H_i)$ in $A_{j-1}$ results in an assignment of weight at most $\frac{n^2}{100k^2}$ that sets $H_i$ to 1. This, in turn, contradicts the fact that the circuit is normalized. Thus, $\operatorname{Prob}(Y_j = 1) \le \frac{200k^2}{n^2} \cdot |X(G) \cap X(H_i)|$. We are now going to use the fact that variables from a fixed gate $H_i$ can be reduced at most $d$ times. We upper bound $Y = \sum_{i=1}^{s} Y_i$ by the following random variable: $Z = \sum_{i=1}^{l} \sum_{j=1}^{d} Z_{ij}$, where each $Z_{ij}$ is a random 0/1-variable such that $\operatorname{Prob}(Z_{ij} = 1) = \frac{200k^2}{n^2} \cdot |X(G) \cap X(H_i)|$ and the $Z_{ij}$ are independent. That is, instead of reducing variables in some of the $H_i$'s in some random order, we reduce $d$ variables in each $H_i$. Thus we reduce the maximal possible number of variables in all gates. Clearly, for any $r$ we have $\operatorname{Prob}(Y \ge r) \le \operatorname{Prob}(Z \ge r)$.
Let us bound the expectation of $Z$. Since, due to the normalization, each variable of $G$ appears in other gates at most $100k^3/n^2 = 100n^{3\varepsilon}$ times, we have Overall we get $\mathbb{E}Z \le \frac{100dk^4/n^2}{n^2/200k^2} = 4 \cdot 10^4 \cdot \frac{dk^6}{n^4} = 4 \cdot 10^4 \cdot n^{6\varepsilon} \cdot W \cdot \log n$.
An application of the Chernoff-Hoeffding bound (Lemma 1) immediately implies that the probability that $Z$ is at least twice its expectation is exponentially small in $d \cdot \frac{k^6}{n^4}$. Since $d \cdot \frac{k^6}{n^4} = W \cdot \log n \cdot n^{9\varepsilon}$ grows asymptotically faster than $\log n$, we conclude that as desired. Overall, this gives us the following constraint: $f \ge 4 \cdot 10^4 \cdot d \cdot W \cdot \frac{k^6}{n^4} = 4 \cdot 10^4 \cdot n^{9\varepsilon} \cdot W^2 \cdot \log n$.

Tuning the parameters
It remains to set the parameters so that the inequalities (2)-(6) are satisfied and $k$ is as large as possible. The inequality (4) sets a lower bound on $s$ in terms of $f$, while (6) sets a lower bound on $f$. Putting them together gives a lower bound on $s$: $s \ge 4 \cdot 10^4 \cdot \frac{k^9}{n^6} \cdot W^3 \cdot \log^2 n$.
Combining it with the upper bound on $s$ from (2), we can set the following equality on $k$ and $n$: $\frac{n}{4k} = 4 \cdot 10^4 \cdot \frac{k^9}{n^6} \cdot W^3 \cdot \log^2 n$.
Thus $k = \Omega\left(\frac{n^{7/10}}{(\log n)^{1/5} W^{3/10}}\right)$ and it is easy to see that with this $k$ we can pick the other parameters to satisfy all the constraints (we set $f$ so that (6) turns into an equality; the inequalities (3) and (5) are satisfied since $W \le \frac{k^3}{n^2}$). This gives a proof of Theorem 14. For $W = 1$ we get $k = n^{7/10} \cdot (\log n)^{-1/5}$, which gives a proof of Corollary 15. For unbounded $W$ recall that we can assume $W \le \frac{k^3}{n^2}$ and thus $k = n^{13/19} \cdot (\log n)^{-2/19}$, and Theorem 13 follows.
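The exponents above can be double-checked with a few lines of exact arithmetic. The sketch below is only an illustration: it writes $k = n^x (\log n)^y$, drops all constant factors, and assumes $W = (k^3/n^2)^c$ with $c = 0$ (the case $W = 1$) or $c = 1$ (the case $W = k^3/n^2$). The function name `balance_exponents` and the parameter `c` are hypothetical and do not appear in the paper.

```python
from fractions import Fraction


def balance_exponents(c):
    """Solve n/(4k) = C * (k^9/n^6) * W^3 * log^2 n for k = n^x (log n)^y.

    Assumption (illustration only): W = (k^3/n^2)^c and constants are ignored.
    Matching powers of n:      1 - x = (9 + 9c) x - 6 - 6c
    Matching powers of log n:     -y = (9 + 9c) y + 2
    """
    c = Fraction(c)
    denom = 10 + 9 * c
    x = (7 + 6 * c) / denom      # exponent of n in k
    y = Fraction(-2) / denom     # exponent of log n in k
    return x, y


if __name__ == "__main__":
    print(balance_exponents(0))  # case W = 1
    print(balance_exponents(1))  # case W = k^3 / n^2
```

With `c = 0` this reproduces the exponents $7/10$ and $-1/5$, and with `c = 1` the exponents $13/19$ and $-2/19$ stated above.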

Conclusion and Open Problems
The most interesting question left open is whether one can prove non-trivial upper bounds for $k$ in the worst case. Currently, we do not know how to construct $\text{MAJ}_k \circ \text{MAJ}_k$ circuits computing $\text{MAJ}_n$ on all inputs even for $k = n - 2$ (though we have many examples of such circuits for $n = 7, 9, 11$), let alone for $k = n^{\varepsilon}$ with $\varepsilon < 1$.
Another natural open question is to get rid of the logarithmic gap between upper and lower bound for depth-2 randomized circuits.
A natural direction is to extend our studies to the case of non-monotone $\text{MAJ}_k \circ \text{MAJ}_k$ circuits. Many of our results naturally translate to larger depth circuits. Indeed, note that in the proofs of lower bounds we do not use the fact that the function on the top of the circuit is majority. In these proofs it can be any monotone function. Thus we can split a depth-$d$ circuit consisting of $\text{MAJ}_k$ gates into two parts: the bottom layer and the rest of the circuit. Then our lower bounds translate to this setting straightforwardly. It is interesting to proceed with the studies of larger depth majority circuits.
It is convenient to introduce the notation $c = \frac{t}{m}$. Note that then $\varepsilon < c < 1 - \varepsilon$. The probability above then can be rewritten as It is not hard to see that the maximum is achieved for $l$ equal to $ck$ (the probability is increasing for $l < ck$ as a function of $l$ and is decreasing for $l > ck$).
So we need to upper bound To bound the probability we will use Stirling's approximation; the following simple form will be enough: $n! \sim \left(\frac{n}{e}\right)^n \sqrt{n}$.
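As a quick numerical sanity check of the simple form of Stirling's approximation, the sketch below computes the ratio $n! / ((n/e)^n \sqrt{n})$ in log-space. The ratio tends to the constant $\sqrt{2\pi}$ from the full Stirling formula, so the simple form is accurate up to a constant factor. The function name `stirling_ratio` is a hypothetical helper introduced only for this illustration.

```python
import math


def stirling_ratio(n):
    """Return n! / ((n/e)^n * sqrt(n)), computed via logarithms to avoid overflow."""
    log_factorial = math.lgamma(n + 1)                       # log(n!)
    log_simple_form = n * (math.log(n) - 1) + 0.5 * math.log(n)
    return math.exp(log_factorial - log_simple_form)


if __name__ == "__main__":
    for n in (10, 100, 1000, 10**6):
        print(n, stirling_ratio(n), math.sqrt(2 * math.pi))
```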
Let us first consider binomial coefficients separately: where by d we denote 1
The resulting circuit has a high probability of error on the middle layers of the boolean hypercube. We will, however, select the parameters so that all the inputs from these middle layers constitute only a small $\varepsilon/2$ fraction. We will then show that among all the remaining inputs (not belonging to the middle layers) there is only a fraction $\varepsilon/2$ (of all the inputs) where $\text{MAJ}_n$ may be computed incorrectly. Overall, this gives a circuit that errs on at most an $\varepsilon$ fraction of the inputs.
Assignments from middle layers. Consider all the inputs whose weight differs from $n/2$ by at most $\alpha n^{1/2}$, where $\alpha = \alpha(\varepsilon)$ is a parameter to be chosen later. The number of such inputs is By choosing a small enough value for $\alpha = \alpha(\varepsilon)$, one ensures that this is at most $\frac{\varepsilon}{2} \cdot 2^n$.
Assignments from outside of middle layers. Now, fix an input $A \in \{0, 1\}^n$ of weight $n/2 + \alpha n^{1/2}$. Pick a random subset $S \subset X$ of size $k = \beta n^{1/2}$ (again, $\beta$ is a parameter to be defined later). We are going to lower bound the following probability (over the choices of $S$): The resulting lower bound will also hold for assignments $A$ of weight greater than $n/2 + \alpha n^{1/2}$ (the higher the weight of $A$, the larger the probability that $\text{MAJ}_S(A) = 1$). By symmetry, it will also give a lower bound on $\text{Prob}(\text{MAJ}_S(A) = 0)$ for assignments of weight at most $n/2 - \alpha n^{1/2}$.
The distribution of the weight of $A$ on $S$ is a hypergeometric distribution with mean $k \cdot \frac{w(A)}{n} = \beta n^{1/2}/2 + \beta\alpha = k/2 + \beta\alpha$.
By choosing a large enough value of $\beta$, one ensures that $\alpha\beta > 2$. Then Lemma 3 guarantees that for a constant $\gamma > 0$. Collecting (8) and (9) By the Chernoff-Hoeffding bound (Lemma 1), the probability that the resulting circuit (where the first level gates compute majorities over subsets $S_1, S_2, \dots, S_k$) computes $\text{MAJ}_X(A)$ incorrectly is By choosing a large enough value for $\beta$ one makes this expression small enough. Thus, there exists a choice of $S_1, S_2, \dots, S_k$ such that the fraction (among all $2^n$ inputs) of all the inputs from outside of the middle layers for which the corresponding circuit computes $\text{MAJ}_X$ incorrectly is at most $\varepsilon/2$. This gives a circuit that computes $\text{MAJ}_X$ correctly for at least a fraction $(1 - \varepsilon)$ of all the inputs.
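To get a feeling for the advantage of a single random first-level gate, one can compute the hypergeometric tail exactly for concrete parameters. The sketch below is an illustration under stated assumptions: the subset size $k$ is odd (so there are no ties), and the function name `prob_sample_majority_is_one` together with the sample values of $n$, $k$ and the weight are hypothetical choices, not taken from the paper.

```python
from math import comb


def prob_sample_majority_is_one(n, weight, k):
    """Exact Prob(MAJ_S(A) = 1) for a uniformly random k-subset S of n bits.

    A is any assignment with `weight` ones; k is assumed odd, so the majority
    on S is 1 iff S contains at least (k + 1) // 2 ones (hypergeometric tail).
    """
    total = comb(n, k)
    favourable = sum(
        comb(weight, j) * comb(n - weight, k - j)
        for j in range((k + 1) // 2, k + 1)
    )
    return favourable / total


if __name__ == "__main__":
    n, k = 10_000, 301               # k is roughly 3 * sqrt(n)
    for weight in (5000, 5100, 5200, 5400):
        print(weight, prob_sample_majority_is_one(n, weight, k))
```

The advantage over $1/2$ of one gate is modest, which is why the argument amplifies it by taking the majority of $k$ independent gates and applying the Chernoff-Hoeffding bound.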
Proof of Theorem 7. Let $k = \alpha n^{1/2}$ for a parameter $\alpha = \alpha(\varepsilon)$ to be chosen later. We are going to show that one can set this parameter so that a $\text{MAJ}_k \circ \text{MAJ}_k$ circuit errs on more than a fraction $\varepsilon$ of inputs. Note that such a circuit can read at most $k^2 = \alpha^2 n$ of the input bits. Let $R$ be the input bits that are read by the circuit $C$ and $U = X \setminus R$ be all the remaining input bits ($R$ and $U$ stand for read and unread). Then $|R| \le \alpha^2 n$. Intuitively, when $\alpha$ is small, the circuit does not even read a large fraction of input bits and for this reason errs on a large number of inputs. We formalize this intuition below.
If $|R| < \alpha^2 n$ it is convenient to extend $R$ so that $|R| = \alpha^2 n$ and $|U| = (1 - \alpha^2)n$; the circuit $C$ then reads only some of the input bits from $R$ and does not read any input bits from $U$. Let $\beta$ be a parameter to be chosen later. Denote by $C_R$, $F_R$, $C_U$, $F_U$ the sets of all assignments to the variables from $R$ and $U$, respectively, whose weight is close to or far from the middle value, respectively: We would like to set the parameters $\alpha$ and $\beta$ so that both $|F_U|$ and $|C_R|$ are large enough. Namely, that each of them has at least a fraction $1 - \varepsilon/10$ of all the corresponding assignments.
By Lemma 1, for a randomly chosen assignment $A : R \to \{0, 1\}$, On the other hand, We now tune the parameters. First, set $\beta = \alpha\sqrt{2 \ln \frac{10}{\varepsilon}}$ to ensure that (10) is at most $\varepsilon/10$. Then one can choose a small enough value for $\alpha$ so that (11) is also at most $2^{|U|} \cdot \varepsilon/10$. This is possible, since the function $\frac{\alpha}{(1 - \alpha^2)^{1/2}}$ decreases to 0 as $\alpha \to 0$. Now, break assignments from $F_U$ into pairs: $A$ and $\neg A$ (clearly, if the weight of $A$ is far from the middle, then so is the weight of $\neg A$, since $w(A) = |U| - w(\neg A)$). Consider an assignment $A \in F_U$, its mate $\neg A \in F_U$, and an assignment $B \in C_R$. Consider the following two assignments to $X$: $A \sqcup B$ and $\neg A \sqcup B$. Clearly, On the other hand, the circuit $C$ outputs the same value for both of them as it only reads the bits from $R$. This means that it errs on at least one of these two assignments. Since such pairs cover at least a fraction $(1 - \varepsilon/10)^2$ of all $2^n$ assignments, the circuit errs on at least a fraction $\frac{1}{2}(1 - \varepsilon/10)^2$ of them. For $\varepsilon \le 1/3$, this is greater than $\varepsilon$, a contradiction.
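The intuition that unread bits force errors can be checked exhaustively for tiny $n$. The sketch below is not the pairing argument of the proof but a brute-force illustration: for odd $n$ it computes the smallest possible error fraction of any function (not only a circuit) that depends only on the first $r$ input bits. The function name `min_error_fraction` is a hypothetical helper, and $\text{MAJ}_n$ is assumed to output 1 iff more than half of the bits are 1.

```python
from itertools import product


def min_error_fraction(n, r):
    """Least fraction of inputs on which a function of the first r bits
    (out of n, with n odd) can disagree with MAJ_n."""
    errors = 0
    for read in product((0, 1), repeat=r):
        # Count the extensions of the read bits on which MAJ_n equals 1.
        ones = sum(
            1
            for unread in product((0, 1), repeat=n - r)
            if 2 * (sum(read) + sum(unread)) > n
        )
        zeros = 2 ** (n - r) - ones
        # The best possible answer for this read pattern errs on the minority.
        errors += min(ones, zeros)
    return errors / 2 ** n


if __name__ == "__main__":
    for r in range(0, 10):
        print(r, min_error_fraction(9, r))
```

For $n = 9$ the error fraction only reaches 0 when all nine bits are read, and it stays a constant fraction away from 0 while a constant fraction of the bits is unread.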
We are going to set the parameters $p$ and $t$ such that this number is at most $\frac{2^n}{\text{poly}(n)}$. For this, take p = n
Proof of Theorem 10. Consider a $\text{MAJ}_k \circ \text{MAJ}_k$ circuit $C$ computing $\text{MAJ}_n$ for $k = \alpha n^{2/3}$. We will show that for a small enough value of the constant $\alpha$ such a circuit must err on more than an $\varepsilon$ fraction of minterms and maxterms.
For a function $f : \{0, 1\}^n \to \{0, 1\}$, define its boundary as follows: $\text{Bnd}(f) = \{(A, i) \in \{0, 1\}^n \times [n] : f(A) \ne f(A^i)\}$, where by $A^i$ we denote the assignment from $\{0, 1\}^n$ resulting from $A$ by flipping its $i$-th bit. In particular, by Lemma 2, $|\text{Bnd}(\text{MAJ}_n)| = \Omega(2^n \cdot n^{1/2})$. Below, we show that for a small enough value of $\alpha$, $|\text{Bnd}(C)|$ is much smaller than $|\text{Bnd}(\text{MAJ}_n)|$, which implies that $C$ errs on a large fraction of minterms and maxterms of $\text{MAJ}_n$.
Consider $(A, i) \in \text{Bnd}(C)$. This means that $C$ contains a gate $G$ at the bottom level such that $G(A) \ne G(A^i)$. Recall that $G$ is a monotone function on $l \le k$ variables. It is known (see, e.g., [15, Theorem 2.33]) that the influence of such a function is $O(l^{1/2})$: Note that by Lemma 2 any $A \in \{0, 1\}^l$ such that $G(A) \ne G(A^i)$ can be extended to a minterm/maxterm of $\text{MAJ}_n$ in $O(2^{n-l} \cdot (n - l)^{-1/2})$ ways. Thus, $G$ contributes at most $O(k^{1/2} \cdot 2^n \cdot n^{-1/2})$ pairs $(A, i)$ to $\text{Bnd}(C)$ (note that $(n - l)^{1/2} = \Theta(n^{1/2})$ since $l \le k = \Theta(n^{2/3})$). Since $C$ contains at most $k$ such gates, we conclude that $|\text{Bnd}(C)| = O(k^{3/2} \cdot 2^n \cdot n^{-1/2})$.
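For small $n$ the size of the boundary can be counted by brute force. The sketch below assumes the boundary of $f$ is the set of pairs $(A, i)$ with $f(A) \ne f(A^i)$, and that $\text{MAJ}_n$ outputs 1 iff more than half of the bits are 1. The names `maj` and `boundary_size` are hypothetical helpers introduced for this illustration.

```python
from itertools import product
from math import comb


def maj(bits):
    """Strict majority: 1 iff more than half of the bits are 1."""
    return int(2 * sum(bits) > len(bits))


def boundary_size(f, n):
    """Count pairs (A, i) such that flipping the i-th bit of A changes f."""
    count = 0
    for a in product((0, 1), repeat=n):
        fa = f(a)
        for i in range(n):
            flipped = a[:i] + (1 - a[i],) + a[i + 1:]
            if f(flipped) != fa:
                count += 1
    return count


if __name__ == "__main__":
    for n in (3, 5, 7, 9):
        half = (n + 1) // 2
        # For odd n the count has the closed form 2 * half * C(n, half).
        print(n, boundary_size(maj, n), 2 * half * comb(n, half))
```

For comparison, a gate that reads only the first three of seven bits, such as `lambda a: maj(a[:3])`, has a strictly smaller boundary than `maj` on all seven bits.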
For a small enough constant $\alpha$, In particular, there are at most $\frac{1}{10}\binom{n}{n/2}$ maxterms that contribute at least $n/10$ elements to $\text{Bnd}(C)$. Thus there are at least $\frac{9}{10}\binom{n}{n/2}$ maxterms that contribute fewer than $n/10$ elements to $\text{Bnd}(C)$. Since by our assumption $C$ computes $\text{MAJ}_n$ correctly on at least an $8/10$ fraction of maxterms, there is a set $M$ of at least $\frac{1}{2}\binom{n}{n/2}$ maxterms on which the computation of $C$ is correct, but the contribution to $\text{Bnd}(C)$ is small. That is, $M$ consists of assignments $A : X \to \{0, 1\}$ with $C(A) = 0$ for which there are at least $4n/10$ indices $i$ with $A_i = 0$ and $(A, i) \notin \text{Bnd}(C)$. From this we will deduce that $C$ computes $\text{MAJ}_X$ incorrectly on a large fraction of minterms.
Indeed, consider the following bipartite graph. The vertices of one part are the elements of $M$. For each $A \in M$ and for each $i \in [n]$ with the properties above there is an outgoing edge corresponding to the pair $(A, i)$. The other endpoint of this edge is labeled by $A^i$. Note that $A^i$ is a minterm of $\text{MAJ}_n$ and by the analysis above $C(A^i) = 0$. The vertices of the second part of the graph are thus labeled by minterms connected to maxterms in $M$. It is left to estimate the number of elements in the second part. For this, note that there are at least $\frac{1}{2}\binom{n}{n/2}$ vertices in $M$, each of degree at least $4n/10$. On the other hand, the degree of each vertex in the second part is at most $n/2$. From this it follows that there are at least $\frac{\frac{1}{2}\binom{n}{n/2} \cdot \frac{4n}{10}}{n/2} = \frac{4}{10}\binom{n}{n/2}$ vertices in the second part. Thus, the circuit $C$ gives the wrong output on at least $4/10$ of minterms, a contradiction.

Deterministic Circuits
Proof of Lemma 11. Suppose $n = 2l + 1$ and suppose there is a depth-2 circuit $F$ computing $\text{MAJ}_n$ that consists of standard majorities of exactly $2l - 1$ variables each and in which every gate on the bottom layer has distinct variables as its inputs. Consider the following undirected graph $G$. Its vertices are the inputs $x_1, \dots, x_n$. Two vertices $x_i$ and $x_j$ are connected if there is a gate on the bottom layer that receives as input all variables except $x_i$ and $x_j$. Thus, the graph $G$ has $n$ vertices and $n - 2$ edges.
Consider a minterm $A$ of the function $\text{MAJ}_n$. Its weight is $w(A) = l + 1$. For the circuit $F$ to output 1 on $A$ there should be at least $l$ gates on the bottom layer outputting 1 on $A$. For each of these gates to output 1 it has to receive at least $l$ ones among its inputs. This is equivalent to saying that at least one of the two variables that are not given as input to the gate should be 0.
Thus in terms of the graph G, for the circuit to compute the function correctly it is needed that for any coloring of l vertices of G in color 0 there are at least l edges that have an endpoint colored in 0. It is not hard to see that this is impossible. Below we provide a formal proof.
We will construct a coloring of l vertices into color 0 such that there are at most l − 1 edges having an endpoint colored in 0. Since G has n vertices and n − 2 edges we have that there are at least two connected components in G. For a connected component H of G, let p(H) denote the number of edges of H minus the number of its vertices, so that the values p(H) sum to −2 over all components. Now we are ready to color l vertices of the graph in the color 0. We color all vertices in the first several components and if needed we will color a part of one more component.
If after coloring l vertices we colored completely several components and have not started the next one, then clearly the sum of p(H) over colored components is negative and thus the number of edges with an endpoint colored in 0 is less than l.
Suppose we have colored several components and we need to color a part of the next component H. We will explain now how to do it. If p(H) = −1, then H is a tree. Color a part of H of needed size in such a way that the number of vertices in H colored in 0 is the same as the number of edges with an endpoint colored in 0 (for example, we can repeat the following procedure: color a leaf and remove it from the tree). Note that in the previous components the sum of the parameters p is negative and we are done. If p(H) = m ≥ 0 then the sum of parameters p of all colored components is at most −m − 2. Consider a spanning tree of H. It is obtained from H by removing m + 1 edges. Color a part of the spanning tree of H in such a way that the number of colored vertices in the spanning tree is the same as the number of edges with an endpoint colored in 0. If we return edges removed from H it will add at most m + 1 edges with an endpoint colored in 0. However, in all components in total the number of vertices colored in 0 is still greater than the number of edges with an endpoint colored in 0. Thus we have constructed a needed coloring and thus found an input on which the circuit gives the wrong output.
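The combinatorial claim behind the proof can be verified exhaustively for small $n$. The sketch below is an illustration under stated assumptions: vertices are numbered $0, \dots, n-1$, repeated edges are allowed (that is, two bottom gates may omit the same pair of variables), and the helper names `has_sparse_zero_set` and `check_all_graphs` are hypothetical.

```python
from itertools import combinations, combinations_with_replacement


def has_sparse_zero_set(n, edges):
    """True if some set of l = (n - 1) // 2 vertices touches at most l - 1 edges.

    Such a set, colored 0, yields a minterm on which fewer than l bottom
    gates output 1, so the circuit outputs 0 where MAJ_n is 1.
    """
    l = (n - 1) // 2
    for zeros in combinations(range(n), l):
        zero_set = set(zeros)
        touched = sum(1 for u, v in edges if u in zero_set or v in zero_set)
        if touched <= l - 1:
            return True
    return False


def check_all_graphs(n):
    """Check the claim for every multigraph with n vertices and n - 2 edges."""
    pairs = list(combinations(range(n), 2))
    return all(
        has_sparse_zero_set(n, edges)
        for edges in combinations_with_replacement(pairs, n - 2)
    )


if __name__ == "__main__":
    print(check_all_graphs(5))
```

The exhaustive check is only feasible for very small $n$; the component-based coloring in the proof is what handles the general case.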
Proof of Theorem 12. We adopt the strategy of the proof of Theorem 9. That is, we break inputs into $O(n^{1/3})$ blocks, compute majorities on each block on the middle $O(n^{1/3})$ layers and then compute the majority of the results. We use the third layer of majority gates to induce additional structure on the inputs.
We proceed to the formal proof. Partition the set of inputs into $b = n^{1/3}/2^{1/3}$ blocks of size $p = 2^{1/3} n^{2/3}$ each: $X = X_1 \sqcup X_2 \sqcup \dots \sqcup X_b$. For each block $X_i$, compute $[\sum_{x \in X_i} x \ge k]$ for all $k \in [p]$. This constitutes the first layer of the circuit. The outputs of these $p$ gates are just a permutation of $X_i$, namely $X_i$ sorted in decreasing order.
As an output of the first layer we again have an $n$-bit vector $Y$ with the same number of 1's and 0's as in the input, but in each block the bits are sorted in decreasing order. On the second layer we split $Y$ again into $b$ blocks of size $p$: $Y = Y_1 \sqcup Y_2 \sqcup \dots \sqcup Y_b$. But now block $Y_i$ consists of the bits of $Y$ with numbers $i, i + b, i + 2b, \dots, i + (p - 1)b$. For each block $Y_i$, we compute $[\sum_{y \in Y_i} y \ge k]$ for all $k \in [\frac{p}{2} - (\frac{n}{2})^{1/3} \,..\, \frac{p}{2} + (\frac{n}{2})^{1/3}]$. Thus on the second layer we have $2^{2/3} n^{1/3}$ gates for each of the $b = n^{1/3}/2^{1/3}$ blocks, that is, $2^{1/3} n^{2/3}$ outputs in total. Finally, on the third level we compute the majority of all of the outputs of the second layer.
Now we need to show that this circuit computes the majority for all possible inputs. Since both the circuit and the majority function are monotone, it is enough to ensure that the computation is correct on minterms and maxterms of majority.
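The three-layer construction can be simulated directly for a small instance. The sketch below is a minimal illustration under assumptions that are not fixed by the text: $n = 2m^3$ so that $b = m$ and $p = 2m^2$ are integers, $\text{MAJ}_n$ outputs 1 iff strictly more than $n/2$ bits are 1, the second layer uses the $2b$ thresholds $p/2 - b + 1, \dots, p/2 + b$, and the top gate is a strict majority. The function names `depth3_majority` and `verify` are hypothetical.

```python
from itertools import product


def depth3_majority(bits, m):
    """Evaluate the depth-3 threshold circuit on n = 2 * m**3 input bits."""
    b, p = m, 2 * m * m                  # b blocks of size p, so b * p = n
    assert len(bits) == b * p

    # Layer 1: gates [sum(X_j) >= t] for t = 1..p sort each block decreasingly.
    y = []
    for j in range(b):
        weight = sum(bits[j * p:(j + 1) * p])
        y.extend(int(weight >= t) for t in range(1, p + 1))

    # Layer 2: block Y_i takes every b-th bit of y; only middle thresholds
    # p/2 - b + 1, ..., p/2 + b are computed (2b gates per block).
    outputs = []
    for i in range(b):
        weight = sum(y[i::b])
        outputs.extend(
            int(weight >= t) for t in range(p // 2 - b + 1, p // 2 + b + 1)
        )

    # Layer 3: strict majority of the 2 * b * b second-layer outputs.
    return int(2 * sum(outputs) > len(outputs))


def verify(m):
    """Exhaustively compare the circuit with strict majority on all inputs."""
    n = 2 * m ** 3
    return all(
        depth3_majority(bits, m) == int(2 * sum(bits) > n)
        for bits in product((0, 1), repeat=n)
    )


if __name__ == "__main__":
    print(verify(2))  # n = 16, every gate has fan-in 8
```

For $m = 2$ every gate in the sketch has fan-in $8 = 2^{1/3} \cdot 16^{2/3}$, matching the bound $k = O(n^{2/3})$.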
Consider an input $A : X \to \{0, 1\}$ with $w(A) = n/2$. We will show that for each block $Y_i$, Indeed, since the variables in each $X_i$ are ordered and we include in $Y_i$ each $b$-th variable of each $X_j$, where in $\pm b^2$ the first $b$ factor corresponds to the error in each block $X_i$ and the other $b$ factor corresponds to the number of blocks $X_1, \dots, X_b$. On the other hand, we know that $w(A) = n/2$.
which implies (14). Now, (14) implies that the computation of the constructed circuit on $A$ is correct. Indeed, by (14), on the block $Y_i$ the assignment $A$ has at least $(\frac{p}{2} - b)$ zeroes and at least $(\frac{p}{2} - b)$ ones. This, in turn, means that by computing $[\sum_{y \in Y_i} y \ge k]$ only for middle values of $k$