Counting solutions to random CNF formulas

We give the first efficient algorithm to approximately count the number of solutions in the random k-SAT model when the density of the formula scales exponentially with k. The best previous counting algorithm, due to Montanari and Shah, was based on the correlation decay method, which works up to densities (1 + o_k(1)) · 2 log k / k, the Gibbs uniqueness threshold for the model. Instead, our algorithm harnesses a recent technique by Moitra to work for random formulas. The main challenge in our setting is to account for the presence of high-degree variables, whose marginals are hard to control and which cause significant correlations within the formula.


Introduction
Let Φ = Φ(k, n, m) be a k-CNF formula on n Boolean variables with m clauses chosen uniformly at random, where each clause has size k ≥ 3. The random formula Φ shows an interesting threshold behaviour: the asymptotic probability that Φ is satisfiable drops dramatically from 1 to 0 when the density α := m/n crosses a certain threshold α*. There has been tremendous progress on establishing this phase transition and pinpointing the threshold α* [25,19,3,4,12,15], guided by elaborate but non-rigorous methods in physics [28,27]. The exact value of the threshold α* is established in [15] for sufficiently large k; it is known that α* = 2^k ln 2 − (1/2)(1 + ln 2) + o_k(1) as k → ∞. In contrast, the "average case" computational complexity of random k-CNF formulas remains elusive. It is a notoriously hard problem to design algorithms that succeed in finding a satisfying assignment when the density of the formula Φ is close to (but smaller than) the satisfiability threshold α*. The best polynomial-time algorithm to find a satisfying assignment of Φ is due to Coja-Oghlan [8], and succeeds if α < (1 − o_k(1)) · 2^k ln k / k. It is known that beyond the density 2^k ln k / k the solution space of the formula undergoes a phase transition and becomes severely more complicated [2], so local algorithms are bound to fail to find a satisfying assignment in polynomial time (see for example [24,9,11]).
It is also a natural question to determine the number of satisfying assignments of Φ, denoted by Z(Φ), when the density is below the satisfiability threshold. It has been shown that (1/n) log Z(Φ) is concentrated around its expectation [1,13] for α < (1 − o_k(1)) · 2^k ln k / k. However, for the k-SAT model, there is no known formula for the expectation E[(1/n) log Z(Φ)] (though see [35,14] for progress along these lines for more symmetric models of random formulas). Regarding the algorithmic question, Montanari and Shah [31] have given an efficient algorithm to approximate log Z(Φ) if α ≤ (2 log k / k)(1 + o_k(1)), based on the correlation decay method and the uniqueness threshold of the Gibbs distribution. Note that this only gives an approximation to Z(Φ) within an exponential factor. Moreover, this threshold for α is exponentially lower than the satisfiability threshold. No efficient algorithm was known to give a more precise approximation.
In this paper, we address the algorithmic counting problem by giving the first fully polynomial-time approximation scheme (FPTAS) for the number of satisfying assignments of random k-CNF formulas when the density α is less than 2^{rk}, for sufficiently large k and some constant r > 0. Our bound is exponential in k and goes well beyond the uniqueness threshold of (2 log k / k)(1 + o_k(1)) which the correlation decay method requires. Our result is related to other algorithmic counting results on random graphs, such as counting colourings, independent sets, and other structures [33,37,16,26]. However, previous methods, such as Markov Chain Monte Carlo and Barvinok's method, appear to be difficult to apply to random formulas. Instead, our algorithm is the first adaptation of Moitra's method [30] to the random instance setting. We give a high-level overview of the techniques in Section 1.2.

The model and the main result
For k ≥ 3, let Φ = Φ(k, n, m) denote a k-SAT formula chosen uniformly at random from the set of all k-SAT formulas with n variables and m clauses. Specifically, Φ has n variables v_1, v_2, …, v_n and m clauses c_1, c_2, …, c_m. Each clause c_i has k literals ℓ_{i,1}, ℓ_{i,2}, …, ℓ_{i,k}, and each literal ℓ_{i,j} is chosen uniformly at random from the 2n literals {v_1, v_2, …, v_n, ¬v_1, ¬v_2, …, ¬v_n}. Note that each clause has exactly k literals (repetitions allowed), so there are (2n)^{km} possible formulas; we use Pr(·) to denote the uniform distribution on the set of all such formulas. Throughout, we will assume that m = ⌊nα⌋, where α > 0 is the density of the formula. We say that an event E holds w.h.p. if Pr(E) = 1 − o(1) as n → ∞.
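To make the model concrete, here is a minimal Python sketch of the sampling procedure just described; the signed-integer encoding of literals (+i for v_i, -i for ¬v_i) is our own convention, not the paper's.

```python
import random

def random_ksat(k, n, alpha, seed=None):
    """Sample a random k-SAT formula in the model above: m = floor(alpha * n)
    clauses, each consisting of k literals drawn independently and uniformly
    from the 2n literals (repetitions allowed)."""
    rng = random.Random(seed)
    m = int(alpha * n)
    return [[rng.choice([1, -1]) * rng.randint(1, n) for _ in range(k)]
            for _ in range(m)]

phi = random_ksat(k=3, n=10, alpha=2.0, seed=0)
```

Note that, as in the model, a clause may repeat a variable or even contain a literal twice.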
For a k-SAT formula Φ, we let Ω = Ω(Φ) denote the set of satisfying assignments of Φ.

Theorem 1.
There is a polynomial-time algorithm A and there are two constants r > 0 and k_0 ≥ 3 such that, for all k ≥ k_0 and all α < 2^{rk}, the following holds w.h.p. over the choice of the random k-SAT formula Φ = Φ(k, n, ⌊αn⌋). The algorithm A, given as input the formula Φ and a rational ε > 0, outputs in time poly(n, 1/ε) a number Z that satisfies e^{−ε}|Ω(Φ)| ≤ Z ≤ e^{ε}|Ω(Φ)|.
Throughout this paper, we will assume that k ≥ k_0 where k_0 is a sufficiently large constant. We will also assume that the density α of the formula Φ satisfies α < 2^{k/300}/k^3, so r can be taken to be 1/301 in Theorem 1. The constant 300 here is not optimised, but we do not expect to be able to use the current techniques to improve it substantially. Our main point is that an FPTAS exists for random k-CNF formulas at a density which is exponential in k. Finally, we assume that k^2 α ≥ 1; otherwise it is well-known (see, e.g., Theorem 3.6 in [34]) that w.h.p. every connected component of Φ, viewed as a hypergraph where variables correspond to vertices and clauses correspond to hyperedges, has size O(log n). In this case we can count the number of satisfying assignments by brute force.

Algorithm overview
We give a high-level overview of our algorithm here before giving the details. Approximately counting the satisfying assignments of a k-CNF formula has been a challenging problem for traditional algorithmic techniques, since the solution space (the set of satisfying assignments) is complicated and is not connected under the transitions of commonly-studied Markov chains. Recently some new approaches were introduced [30,20]. Most notably, the breakthrough work of Moitra [30] gives the first (and so far the only) efficient deterministic algorithm that can approximately count the satisfying assignments of k-CNF formulas in which each variable appears in at most d clauses, if, roughly, d ≲ 2^{k/60}. Inspired by this, Feng et al. [18] have also given an MCMC algorithm which applies when d ≲ 2^{k/20}. As our goal is to count satisfying assignments of sparse random k-CNF formulas, where these degree bounds do not hold but average degrees are small, it is natural to also choose Moitra's method in the random instance setting. However, the first difficulty is that Moitra's method relies on the fact that the marginal probability of each variable (the probability that the variable is true in a uniformly-chosen satisfying assignment) is nearly 1/2. This is necessary because Moitra's method involves solving a certain linear program (LP), and the size of this LP is polynomially bounded only if a certain process couples quickly. The proof that the process couples quickly relies on the fact that the marginals are nearly 1/2 (and certainly on the fact that they are bounded away from 0 and 1). In contrast, for a random k-CNF formula, although the average degree of variables is low, w.h.p. there are variables with degrees as high as Ω(log n / log log n). In the presence of these high-degree variables, the marginal probabilities of the variables can be arbitrarily close to 0 or 1, instead of 1/2.
Our solution to this issue is to separate out high-degree variables, as well as those that are heavily influenced by high-degree variables. To do this, we define a process to recursively label "bad" variables. At the start, all high-degree variables are bad. Then, all clauses containing more than k/10 bad variables are labelled bad, as are all variables that they contain. We run this process until no more bad clauses are found. We call the remaining variables and clauses of the formula "good". A key property is that all good variables have an upper bound on their degree and all good clauses contain at least 9k/10 good variables; this allows us to show that the marginal probabilities of good variables are close to 1/2. The next step is to attempt to apply Moitra's method. The goal of Moitra's method is to compute more precise estimates for the marginal probabilities of the variables; given accurate estimates on the marginal probabilities, it is then relatively easy to approximate the number of satisfying assignments using refined self-reducibility techniques.
Of course, we need to modify the method to deal with the bad variables, which still appear in the formula. We first explain Moitra's method and then proceed with our modifications. The first step is to mark variables so that every clause contains a good fraction of marked variables and a good fraction of unmarked variables. Then, for a particular marked variable v, we set up an LP. As noted earlier, the variables of the LP correspond to the states of a certain coupling process which couples two distributions on satisfying assignments using the marked variables: the first distribution is over satisfying assignments in which v is true, and the second over satisfying assignments in which v is false. Solving the LP recovers the transition probabilities of the coupling process and yields enough information to approximate the marginal probability of v.
In order to guarantee that the size of the LP is bounded by a polynomial in the size of the original CNF formula, we have to restrict the coupling process. The process can be viewed as a tree and it suffices to truncate this tree at a suitable level.
Thus, a crucial part of the proof (both in Moitra's case and in ours) is to show that the error caused by the truncation is sufficiently small. This error is small because, with high probability, branches of the coupling tree "die out" before reaching a large level, which in turn is because the marginals of marked variables stay near 1/2, even when conditioning on partial assignments.
In our case, where Φ is a random formula, the marginals are not all near 1/2, even without any conditioning. But the good variables do have marginals near 1/2. So we only mark/unmark good variables and we "give up" on bad variables. Given that we don't have any control over the bad variables, we have to modify the coupling process. Thus, whenever we meet a bad variable in the coupling process, we have to assume the worst case and treat this variable, and all bad variables connected to it, as if they have all failed the coupling, meaning that the disagreement spreads quickly over bad components.
The most important part of our analysis is to upper bound the size of connected bad components and how often we encounter them during the coupling process. Given these upper bounds, we are able to show that the coupling still dies out sufficiently quickly, so the error caused by the truncation is not too large. Solving the LP then allows us to estimate the marginals of the good variables. Given that the bad components have small size, this turns out to be enough information to estimate the number of satisfying assignments of the original formula (containing both good and bad clauses).
We conclude this summary by discussing the prospects for improving our work. Although we have given an efficient algorithm which works for densities that are exponentially large in k, the densities that we can handle are still small compared to the satisfiability threshold or to the threshold under which efficient search algorithms exist. Perhaps a modest start towards obtaining comparable thresholds for approximate counting algorithms would be to consider models whose state spaces are connected. For example, for monotone k-CNF formulas where each variable appears in at most d clauses, Hermon et al. [23] showed that efficient randomised algorithms exist if d ≤ c·2^{k/2} for some constant c > 0, which is optimal up to the constant c due to complementary hardness results [6]. They also showed that the same algorithm works for random regular monotone k-CNF formulas if the degree d ≤ c·2^k/k for some c > 0. It remains open whether an average-case bound of the same order can be achieved for random monotone k-CNF formulas.

The coupling tree

Identifying bad variables
We start by identifying bad variables; the method that we use is inspired by [12].

Definition 2.
Let Φ be a k-SAT formula. We say that a variable v of Φ is high-degree if Φ contains at least ∆ := 2^{k/300} occurrences of literals involving the variable v.
The reason that high-degree variables are harmful is that their marginal probabilities (when we sample uniformly from satisfying assignments) are not bounded away from 0 and 1. Also, any variable that shares clauses with high-degree variables may also have biased marginals. In our algorithm, we will not be able to control these high-degree variables or other variables that are affected by them. These variables contribute to the "bad" part of the formula Φ. Formally, denote the set of clauses of Φ by C and the set of variables by V. For each c ∈ C, let var(c) denote the set of variables in c. For each subset C′ of C, let var(C′) := ∪_{c∈C′} var(c). The bad variables and bad clauses of Φ are identified as follows:
1. V_0 (the initial bad variables) ← the set of high-degree variables;
2. C_0 ← the set of clauses with at least k/10 variables in V_0;
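The labelling process sketched in the overview (iterate until no new bad clause is found) can be written out as follows; the clause encoding and function names are ours, not the paper's.

```python
from collections import Counter

def label_bad(clauses, k, delta):
    """Iteratively label bad variables/clauses, following the process in the
    text: high-degree variables (>= delta literal occurrences) start out bad;
    any clause with at least k/10 bad variables becomes bad, and so do all of
    its variables; repeat until no new bad clause is found."""
    occ = Counter(abs(l) for c in clauses for l in c)
    bad_vars = {v for v, d in occ.items() if d >= delta}
    bad_clauses = set()
    changed = True
    while changed:
        changed = False
        for i, c in enumerate(clauses):
            if i in bad_clauses:
                continue
            if sum(1 for l in c if abs(l) in bad_vars) >= k / 10:
                bad_clauses.add(i)
                bad_vars.update(abs(l) for l in c)
                changed = True
    return bad_vars, bad_clauses
```

In the paper's terms, the returned sets are V_bad and C_bad; the remaining variables and clauses are the good ones.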

Marking good variables and identifying a satisfying assignment
Apart from the fact that we only mark variables in V_good, our marking follows the approach of Moitra [30]. Formally, a "marking" is an assignment from V_good to {marked, unmarked}. Using Observation 3 and applying the asymmetric version of the Lovász local lemma [17,36,22] together with the algorithmic version of the local lemma by Moser and Tardos [32], it is easy to prove the following lemma.

Lemma 8.
There exists a marking on V_good such that every good clause has at least 3k/10 marked variables and at least k/4 unmarked good variables, and such that there is a partial assignment of bad variables that satisfies all bad clauses. Furthermore, such a marking can be found in deterministic polynomial time.
We also use the Lovász local lemma to identify a partial assignment Λ* that we will use to apply self-reducibility.

Lemma 10. Let Φ = Φ(k, n, m) and let v_1, v_2, …, v_n be the variables of Φ. In each clause, order the literals in the order induced by the indices of their variables. Then there is a partial assignment Λ* of truth values to some subset of V_marked with the property that every clause c ∈ C_good is satisfied by its first k/20 literals corresponding to marked variables. Moreover, Λ* can be found in deterministic polynomial time.

The coupling tree
Fix a prefix Λ of the assignment Λ* from Lemma 10. Let Φ^Λ be the formula produced by simplifying Φ under Λ (remove clauses that are satisfied under Λ and remove all false literals). C^Λ denotes the clauses of Φ^Λ and V^Λ denotes its variables. We also define V^Λ_good = V_good ∩ V^Λ and C^Λ_good = C_good ∩ C^Λ. Ω^Λ denotes the set of satisfying assignments of Φ^Λ. For a variable v* ∈ V^Λ, let Ω^Λ_1 be the set of assignments in Ω^Λ in which v* is true, and let Ω^Λ_2 be the set of assignments in Ω^Λ in which v* is false. The algorithm estimates the marginal probability that v* is true by solving a certain LP which allows it to estimate the ratio |Ω^Λ_1|/|Ω^Λ_2|. The variables of the LP correspond to the states of a coupling process. The process couples the uniform distribution on Ω^Λ_1 with the uniform distribution on Ω^Λ_2. We can now describe the process via its "coupling tree" T^Λ.
For each node ρ there are partial assignments A_1(ρ) and A_2(ρ); the variables set in these partial assignments are those of Λ together with V_set(ρ). The set V_I(ρ) contains "problematic" variables; the details will become clear later. Roughly, these include variables in V_set(ρ) on which A_1(ρ) and A_2(ρ) disagree, variables contained in clauses that are not satisfied in some A_i(ρ) even though all their marked variables have already been set, and variables "affected" by bad variables during the coupling process. C_rem(ρ) is the set of remaining clauses to consider at descendants of ρ in the coupling.
The root of the coupling tree is the node ρ* with V_set(ρ*) = V_I(ρ*) = {v*}. The assignment A_1(ρ*) sets v* to T, the assignment A_2(ρ*) sets v* to F, and C_rem(ρ*) = C^Λ. Let n = |V|. In order to ensure that the size of the LP is bounded by a polynomial in n, we need to ensure that the size of the coupling tree is also bounded by a polynomial in n. To do this, we choose truncation depth L := C_0 (3k^2 ∆) log(n/ε), where C_0 is a sufficiently large constant. We then truncate the tree as follows.

Definition 12.
A node ρ of the coupling tree is a leaf if |V_I(ρ)| ≤ L and every c ∈ C_rem(ρ) has the property that var(c) is either contained in V_I(ρ) or disjoint from it; if |V_I(ρ)| > L, then ρ is a truncating node. We denote the set of leaves by L, the set of truncating nodes by T, and their union by L* := L ∪ T.
If ρ is not in L* then we define its four children as follows. The "first clause" of ρ is the first good clause c with a variable in V_I(ρ) and a variable in V^Λ \ V_I(ρ). (The definitions imply that such a clause exists.) The "first variable" u of ρ is the first (good) variable in marked(c) \ V_set(ρ). For each of the four pairs (τ_1, τ_2), where τ_1 and τ_2 are assignments from {u} to {T, F}, we create a child ρ_{τ_1,τ_2} of ρ using the following algorithm.

Algorithm 1: Constructing the child ρ_{τ_1,τ_2} of a non-truncating node ρ of the coupling tree, where τ_1, τ_2 are assignments from {u} to {T, F}, and u is the first variable of ρ.

Key property of the coupling tree for a random formula
Recall that the variables of the LP which is used to estimate the marginal of the variable v* of Φ^Λ correspond to the states of the coupling on the coupling tree T^Λ. We will define two LP variables P_{1,ρ} and P_{2,ρ} for each node ρ of T^Λ. In order to efficiently solve the LP, we need its size to be bounded by a polynomial in n, so we need the number of nodes of T^Λ to be bounded by a polynomial in n. For a random formula, this follows from the following key lemma, which is a main technical contribution of our work.

Lemma 14. W.h.p. over the choice of Φ, for every prefix Λ of Λ*, every node ρ in T^Λ has the property that |V_set(ρ)| ≤ 3k^3 αL + 1.
To see that Lemma 14 implies that the size of the coupling tree is at most a polynomial in n, note that the depth of the tree does not exceed max_{ρ∈T^Λ} |V_set(ρ)| ≤ 3k^3 αL + 1 = O(log(n/ε)). Also, each node has at most 4 children.
In the rest of this section, we sketch the proof of Lemma 14. We start by defining some graphs associated with Φ. The formula Φ naturally corresponds to a bipartite "factor graph" in which one side consists of the variables and the other of the clauses (a variable has an edge to a clause in the factor graph if one of its literals is contained in the clause). We also use two further graphs, G_Φ and H_Φ. A key combinatorial statement is Lemma 41 of the full version. The lemma says that if you take any "large" set of clauses Y that are connected in G_Φ and any large set V of the variables of Y, then there are not many clauses outside of Y that contain variables in V (there is no large set Z of such clauses). Obviously, the lemma does not hold for every Φ; it depends crucially on the random way in which Φ is chosen. The proof of Lemma 41 relies crucially on upper-bounding the probability that a set of clauses Y is connected in G_Φ. To do this, we sum over possible trees connecting the clauses in Y. We use the bound from Lemma 39 of the full version, which shows that the probability that any particular tree T appears in G_Φ is at most (k^2/n)^{|V(T)|−1}.

Proof of Lemma 14. Let Λ be a prefix of Λ* and let ρ be a node in T^Λ. Our goal is to prove |V_set(ρ)| ≤ 3k^3 αL + 1. We first consider the case in which ρ is not a truncating node, so |V_I(ρ)| ≤ L, and we show |V_set(ρ)| ≤ 3k^3 αL. The proof has two parts.
Part 1. V_set(ρ) ⊆ Γ+_{H_Φ}(V_I(ρ)). To prove Part 1, we consider any u ∈ V_set(ρ) \ V_I(ρ) and show that there is a clause c containing u and containing a variable in V_I(ρ).
We first rule out the case that u = v* by noting (from the construction of the coupling tree) that v* ∈ V_I(ρ) ∩ V_set(ρ).
So consider u ∈ V_set(ρ) \ V_I(ρ) and let ρ′ be the ancestor of ρ in the coupling tree such that u is the first variable of ρ′. The definition of the coupling tree guarantees that ρ′ is uniquely defined and that it is a proper ancestor of ρ: the definition of "first variable" guarantees that u ∉ V_set(ρ′), but for all proper descendants ρ″ of ρ′ we have u ∈ V_set(ρ″). Let ρ″ be the child of ρ′ on the path to ρ. We will show that there is a clause c containing u and containing a variable in V_I(ρ″). Part 1 then follows from the fact that V_I(ρ) contains V_I(ρ″). The existence of such a clause c is immediate from the definition of "first variable"; indeed c is the "first clause" of ρ′.

Part 2. W.h.p., the random formula Φ is such that, for all ρ, |Γ+_{H_Φ}(V_I(ρ))| ≤ 3k^3 αL. For Part 2, it is important that the set V_I(ρ) is connected in H_Φ; this follows from the construction of the coupling tree. We show (this is Lemma 51) that, w.h.p. over the choice of Φ, every connected set of variables V ⊆ V satisfies

|Γ+_{H_Φ}(V)| ≤ 3k^3 α max{|V|, k log n},   (1)

which establishes Part 2 since |V_I(ρ)| ≤ L. The proof of (1) is as follows. Let V be a connected set of variables and let Y be the set of neighbours of V in the factor graph of Φ, i.e., Y = {c ∈ C | var(c) ∩ V ≠ ∅}. Clearly |Γ+_{H_Φ}(V)| ≤ k|Y| and hence it suffices to show that |Y| ≤ 3k^2 α max{|V|, k log n}. There are two cases depending on the size of V.
Case 1: |V| ≥ k log n. Since V is a connected set of variables, there exists a set Y′ ⊆ Y such that |V|/k ≤ |Y′| ≤ |V| and V ∪ Y′ is connected in the factor graph of Φ. Hence, Y′ is a connected set of clauses and |Y′| ≥ log n. Let Z = Y \ Y′. If |Z| ≥ 2k^2 α|V| then we obtain a contradiction to Lemma 41, which holds w.h.p. Thus, w.h.p., |Z| ≤ 2k^2 α|V|, which implies |Y| = |Y′| + |Z| ≤ 3k^2 α|V|, as required.

Case 2: |V| < k log n. If |Γ+_{H_Φ}(V)| < k log n then we are finished. Otherwise, consider an arbitrary connected V′ ⊃ V such that |V′| = ⌈k log n⌉. By the argument of the previous case, the set Y′ of neighbours of V′ in the factor graph satisfies |Y′| ≤ 3k^2 α|V′| ≤ 3k^3 α log n. Thus, |Y| ≤ |Y′| ≤ 3k^3 α log n. This completes the proof of (1), and hence Part 2.

The linear program
Here we briefly list the constraints in the LP so that we can discuss its analysis. For a node ρ of the coupling tree, let C_I(ρ) be the set of clauses c ∈ C^Λ such that var(c) ⊆ V_I(ρ) ∪ V_set(ρ). For i ∈ {1, 2}, let N_i(ρ) be the number of assignments τ to V_I(ρ) \ V_set(ρ) such that every clause in C_I(ρ) is satisfied by τ ∪ A_i(ρ). It turns out (see Lemma 15) that N_i(ρ) ≠ 0 for i ∈ {1, 2}, so we define r(ρ) = N_1(ρ)/N_2(ρ).
The LP relies on two constants r_lower and r_upper. The algorithm that uses the LP will move these closer and closer together by binary search. For each node ρ of the coupling tree, we introduce two variables P_{1,ρ} and P_{2,ρ}. The constraints are as follows.

Constraint Set 0: For every node ρ of the coupling tree and every i ∈ {1, 2} we add the constraint 0 ≤ P_{i,ρ} ≤ 1.

Constraint Set 1: If ρ ∈ L then we add the constraint r_lower · P_{2,ρ} ≤ P_{1,ρ} · r(ρ) and the constraint P_{1,ρ} · r(ρ) ≤ r_upper · P_{2,ρ}.

Constraint Set 2: For the root ρ* of the coupling tree, we add the constraints P_{1,ρ*} = 1 and P_{2,ρ*} = 1. For every node ρ of the coupling tree that is not in L*, let u be the first variable of ρ. For each X ∈ {T, F} add the constraints P_{1,ρ} = P_{1,ρ_{u→X,u→T}} + P_{1,ρ_{u→X,u→F}} and P_{2,ρ} = P_{2,ρ_{u→T,u→X}} + P_{2,ρ_{u→F,u→X}}.

Constraint Set 3: For every node ρ of the coupling tree that is not in L*, every X ∈ {T, F}, and every i ∈ {1, 2}, let u be the first variable of ρ and add the constraint P_{i,ρ_{u→X,u→¬X}} ≤ (1/s) · P_{i,ρ}.
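As an illustration, the following sketch assembles these constraint sets for a given coupling tree, rendering each constraint as a readable string. The tree encoding and the string representation are our own scaffolding, not the paper's implementation; solving the resulting LP is left to any standard solver.

```python
def lp_constraints(tree, leaves, r_lower, r_upper, r, s, root):
    """Assemble the LP constraints for a coupling tree. `tree` maps each
    internal node to its four children indexed by (tau1, tau2); `leaves` is
    the set of leaf nodes; `r` maps each leaf rho to r(rho)."""
    cons = [f"P[1,{root}] == 1", f"P[2,{root}] == 1"]   # Constraint Set 2 (root)
    for rho in set(tree) | leaves:                       # Constraint Set 0
        for i in (1, 2):
            cons.append(f"0 <= P[{i},{rho}] <= 1")
    for rho in leaves:                                   # Constraint Set 1
        cons.append(f"{r_lower} * P[2,{rho}] <= {r[rho]} * P[1,{rho}]")
        cons.append(f"{r[rho]} * P[1,{rho}] <= {r_upper} * P[2,{rho}]")
    for rho, kids in tree.items():                       # Constraint Sets 2 and 3
        for X in ("T", "F"):
            cons.append(f"P[1,{rho}] == P[1,{kids[(X,'T')]}] + P[1,{kids[(X,'F')]}]")
            cons.append(f"P[2,{rho}] == P[2,{kids[('T',X)]}] + P[2,{kids[('F',X)]}]")
            notX = "F" if X == "T" else "T"
            for i in (1, 2):
                cons.append(f"P[{i},{kids[(X,notX)]}] <= P[{i},{rho}] / {s}")
    return cons
```

For a one-level tree (a root with four leaf children) this produces 2 root constraints, 10 box constraints, 8 leaf constraints, 4 child-sum constraints, and 4 disagreement constraints.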

Analysis of the linear program for a random formula and how it enables us to conclude Theorem 1
The key lemmas demonstrating the purpose of the linear program are as follows.
Lemma 24. For suitable values of r_lower and r_upper, there is a set of variables P = {P_{i,ρ}} that satisfies all constraints of the LP.

Lemma 34. Fix r_lower ≤ r_upper. W.h.p. over the choice of Φ, the following holds. If the LP has a solution P using r_lower and r_upper, then e^{−ε/(3n)} · r_lower ≤ |Ω^Λ_1|/|Ω^Λ_2| ≤ e^{ε/(3n)} · r_upper.
The full version proves Theorem 1 using these two lemmas. Here we just give the main idea. First, consider the sub-goal of estimating |Ω^Λ_1|/|Ω^Λ_2| given Φ and a prefix Λ of Λ*. We can do this with accuracy exp(±ε/n) using the linear program. The proof of Lemma 57 in the full version uses the Lovász local lemma to establish values for r_lower and r_upper that meet the conditions in Lemma 24. Then, by binary search we bring r_lower and r_upper closer together until we achieve the desired accuracy (by Lemma 34). The initial values of r_lower and r_upper guarantee (see the proof of Lemma 57 for details) that the LP is run at most O(log(n/ε)) times. Since we have already shown that the size of the LP is bounded by a polynomial in n/ε, the algorithm runs in polynomial time.
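The binary-search loop can be sketched as follows, treating LP feasibility as a black-box oracle. The oracle interface `lp_feasible(r_lower, r_upper)` and the geometric-midpoint choice are our assumptions for illustration; the paper's initial values and feasibility guarantees come from Lemmas 24 and 57.

```python
import math

def estimate_ratio(lp_feasible, r_lo, r_hi, eps, n):
    """Drive r_lower and r_upper together by binary search on a log scale,
    stopping once r_hi / r_lo <= e^{eps/(3n)}. Assumes the true ratio lies
    in [r_lo, r_hi] initially and that the LP is feasible for any interval
    containing it."""
    target = eps / (3 * n)
    while math.log(r_hi / r_lo) > target:
        mid = math.sqrt(r_lo * r_hi)       # geometric midpoint halves the log-range
        # Keep whichever half still admits a feasible LP solution.
        if lp_feasible(r_lo, mid):
            r_hi = mid
        else:
            r_lo = mid
    return r_lo, r_hi
```

Each iteration halves the logarithmic length of the interval, so the LP is invoked O(log(n/ε)) times, matching the bound quoted above.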
Now consider the proof of Theorem 1. Using standard self-reducibility, we can use the estimates that we have just established to obtain an accurate estimate (within exp(±ε)) of |Ω^{Λ*}|/|Ω|, which is the probability that a random satisfying assignment is consistent with Λ*.
To finish we need one last key ingredient: a method to estimate |Ω^{Λ*}|. Since all good clauses are satisfied by Λ*, the set C^{Λ*} of clauses of Φ^{Λ*} consists only of bad clauses. Now we need one more key lemma. Lemma 48 implies that C^{Λ*} can be divided into disjoint subsets where each subset of clauses contains O(log n) variables. The algorithm can then compute the number of satisfying assignments of each subset by brute force in time poly(n). Then |Ω^{Λ*}| is the product of these numbers.
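A brute-force component-by-component counter along these lines might look as follows. We count only over variables that occur in the remaining clauses; free variables, if any, would each contribute a factor of 2 and are omitted here. The encoding is the signed-integer one from earlier sketches, which is our convention.

```python
from itertools import product

def count_by_components(clauses):
    """Count satisfying assignments of a formula whose connected components
    (variables sharing a clause) are all small: brute-force each component
    and multiply the counts."""
    parent = {}                     # union-find over variables
    def find(v):
        while parent.setdefault(v, v) != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v
    for c in clauses:
        vs = [abs(l) for l in c]
        for v in vs[1:]:
            parent[find(v)] = find(vs[0])
    comps = {}
    for c in clauses:
        comps.setdefault(find(abs(c[0])), []).append(c)
    total = 1
    for comp in comps.values():
        vars_ = sorted({abs(l) for c in comp for l in c})
        idx = {v: j for j, v in enumerate(vars_)}
        cnt = sum(
            all(any(assn[idx[abs(l)]] == (l > 0) for l in c) for c in comp)
            for assn in product([False, True], repeat=len(vars_)))
        total *= cnt
    return total
```

Since each component contains O(log n) variables, the enumeration per component takes poly(n) time, as claimed above.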
This concludes the sketch of the proof of Theorem 1; the details are in the full version. In the rest of this short version, we briefly discuss the proofs of the remaining key lemmas, Lemmas 48, 34, and 24.
We start with the proof of Lemma 48. This lemma, which bounds the size of bad components, is one of the main technical achievements allowing us to extend Moitra's method to random CNFs with high density. Here we only have room for a very rough sketch. Recall that a bad component is a set S of variables that is connected in H_{Φ,bad}. Let HD(S) = V_0 ∩ S be the set of high-degree variables in S. We wish to show that w.h.p., over the choice of Φ, every bad component S has size at most 21600k log n. This follows from the following two lemmas, which give a contradiction for large bad components S. Note that V_bad = BC(V_0), where V_0 is the set of high-degree variables. We show (Lemma 43) that for every bad component S, we have S = BC(HD(S)). Thus, the process P can be viewed as a "local" process for identifying bad components.
Let S be a bad component. If S consists only of an isolated variable, it must be a high-degree variable and hence HD(S) = S (so we are finished). Otherwise, since a bad component is a connected component of variables in H_{Φ,bad}, the definition of H_{Φ,bad} ensures that the bad component has at least k/10 high-degree variables.
Note that |HD(S)| ≤ |V_0|. In Lemma 35 of the full version we use Poisson estimates for the degrees of the variables to show that, w.h.p., |V_0| ≤ n/2^{k/10}. The next step is to apply a counting argument to show that, w.h.p., for every set of variables Y such that 2 ≤ |Y| ≤ n/2^k, the number of clauses that contain at least k/10 variables from Y is at most (30/k)|Y|. This is Corollary 38 of the full version. We apply the corollary with Y = HD(S), so we find that there are at most (30/k)|HD(S)| clauses that contain at least k/10 variables from HD(S). Now, we run the process P starting with HD(S). Take Z to be the set of clauses that contain at least k/10 variables from HD(S) (so, from above, we have |Z| ≤ (30/k)|HD(S)| ≤ (30/k) · n/2^{k/10}). The next step is to show that, w.h.p., the number of clauses c such that var(c) ⊆ BC(HD(S)) is at most 2|Z| (which we have already shown to be at most (60/k)|HD(S)|). This analysis is contained in Corollary 45. It is essentially an analysis of the process P which follows easily from a lemma of Coja-Oghlan and Frieze [10, Lemma 2.4]. The high-probability guarantees are universal over Z (hence universal over S).
Since S = BC(HD(S)) and each variable in S is contained in some bad clause, we have |S| ≤ k · 2|Z| ≤ 60|HD(S)|.

We now turn to the proof of Lemma 34. There are two kinds of errors which cause solutions of the LP to differ from the ratio |Ω^Λ_1|/|Ω^Λ_2|. The first kind comes from so-called "ℓ-wrong assignments" and the second kind comes from the truncation of the coupling tree. To define these more precisely, we need some graph-theoretic notation.
Definition 25. Given a graph G and any positive integer k, let G^{≤k} be the graph with vertex set V(G) in which vertices u and v are adjacent iff there is a path from u to v in G of length at most k.
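For concreteness, G^{≤k} from Definition 25 can be computed with a depth-bounded BFS from each vertex; the adjacency-set encoding below is our own.

```python
from collections import deque

def graph_leq_k(adj, k):
    """Compute G^{<=k}: u and v are adjacent iff their distance in G is
    between 1 and k. `adj` maps each vertex to its neighbour set."""
    result = {}
    for u in adj:
        dist = {u: 0}
        queue = deque([u])
        while queue:
            w = queue.popleft()
            if dist[w] == k:        # do not explore beyond distance k
                continue
            for x in adj[w]:
                if x not in dist:
                    dist[x] = dist[w] + 1
                    queue.append(x)
        result[u] = {v for v, d in dist.items() if 1 <= d <= k}
    return result
```

For example, on the path 1-2-3-4, the graph G^{≤2} joins 1 to {2, 3} and 2 to {1, 3, 4}.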
The main combinatorial structure that we use is a set D(G_Φ), which is based on Alon's "2,3-tree" [5]. Similar structures were subsequently used in [30,21]. The main difference between our definition and previous ones is that we take into account whether clauses are connected via good variables.
An assignment σ is ℓ-wrong if there is a set T ∈ D(G_Φ) with |T| = ℓ and there is a size-ℓ/2 subset S of T ∩ C^Λ_good such that the restriction of σ to the marked variables in the clauses of S does not satisfy any clause in S. Otherwise σ is ℓ-correct.
The full version proves three bounds, (2), (3), and (4); the lemma follows easily from these. Combining (3) and (4) with (2), we obtain the result.

The main ingredient in the proof of (3) is Lemma 30, which shows that the fraction of assignments in Ω^Λ_i that are ℓ-wrong is at most (k∆)^{−9}. The main ingredient in the proof of (4) is showing that, w.h.p., for every ℓ-correct σ ∈ Ω^Λ_i,

∑_{ρ∈T : σ∈Ω^{A_i(ρ)∪Λ}} P_{i,ρ} ≤ (k∆)^{−8}.

This is handled in Lemmas 32 and 33.
To prove these lemmas (say for i = 1) we consider a sampling procedure for choosing a node ρ ∈ L* conditioned on some σ ∈ Ω^Λ_1. The probability that it reaches any node ρ ∈ L* with σ ∈ Ω^{A_1(ρ)∪Λ} is designed to be P_{1,ρ}, so the goal is to bound the probability that it reaches the set Υ_σ = {ρ ∈ T | σ ∈ Ω^{A_1(ρ)∪Λ}}. This is where the combinatorial structures that we have defined come in. We use F(ρ) to denote the set of clauses that "fail" in the coupling process, contributing variables to V_I(ρ). Lemma 28 shows that w.h.p., for every node ρ ∈ Υ_σ, there is a set T ⊆ F(ρ) containing the first clause c* such that T ∈ D(G_Φ), |T| = ℓ and |T ∩ C_bad| ≤ |T|/3. This implies that |T ∩ C^Λ_good| ≥ 2|T|/3. We therefore need to upper bound the probability that such a T is contained in F(ρ) when ρ is chosen from the sampling procedure. The set T has size ℓ, contains c*, and contains enough good clauses. It turns out that, since σ is ℓ-correct, many of these failed clauses in F(ρ) must have failed due to disagreements in the coupling. Since T ∈ D(G_Φ), these clauses do not share good variables. The constraints in Constraint Set 3 then imply that the probability of all of these simultaneous disagreements is small.
That concludes the proof, apart from proving the key Lemma 28. This again relies on properties of bad components, in particular on Lemma 50, which says that, w.h.p., for every connected set of clauses Y such that |var(Y)| ≥ 21600k log n, it holds that |Y ∩ C_bad| ≤ |Y|/12. This is somewhat similar to the issues that we discussed regarding the proof of Lemma 48; we defer the details to the full version.
We will next give an inductive definition of a function Q from nodes of the coupling tree to real numbers in [0, 1]. The way to think about this is as follows: we will implicitly define a probability distribution over paths from the root of the coupling tree to L*. For each node ρ, Q(ρ) will be the probability that ρ is included in a path drawn from this distribution.
Any such path starts at the root, so we define Q(ρ*) = 1. Once we have defined Q(ρ) for a node ρ that is not in L*, we can define Q(·) on the children of ρ as follows. Let u be the first variable of ρ and consider the four children ρ_{u→T,u→T}, ρ_{u→T,u→F}, ρ_{u→F,u→T}, ρ_{u→F,u→F}.