
Improved Parallel Derandomization via Finite Automata with Applications

Jeff Giliberti, University of Maryland, College Park, MD, USA
David G. Harris, University of Maryland, College Park, MD, USA
Abstract

A central approach to algorithmic derandomization is the construction of small-support probability distributions that “fool” randomized algorithms, often enabling efficient parallel (NC) implementations. An abstraction of this idea is fooling polynomial-space statistical tests computed via finite automata [Sivakumar STOC’02]; this encompasses a wide range of properties including k-wise independence and sums of random variables.

We present new parallel algorithms to fool finite-state automata, with significantly reduced processor complexity. Briefly, our approach is to iteratively sparsify distributions using a work-efficient lattice rounding routine and maintain accuracy by tracking an aggregate weighted error that is determined by the Lipschitz value of the statistical tests being fooled.

We illustrate with improved applications to the Gale-Berlekamp Switching Game and to approximate MAX-CUT via SDP rounding. These involve several further optimizations, such as truncating the state space of the automata and using FFT-based convolutions to compute transition probabilities efficiently.

Keywords and phrases:
Parallel Algorithms, Derandomization, MAX-CUT, Gale-Berlekamp Switching Game
Funding:
Jeff Giliberti: JG gratefully acknowledges financial support by the Fulbright U.S. Graduate Student Program, sponsored by the U.S. Department of State and the Italian-American Fulbright Commission. This content does not necessarily represent the views of the Program.
Copyright and License:
© Jeff Giliberti and David G. Harris; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Parallel algorithms; Theory of computation → Pseudorandomness and derandomization
Related Version:
Full Version: https://arxiv.org/abs/2411.18028
Acknowledgements:
Thanks to Mohsen Ghaffari and Christoph Grunau for explaining the works [19] and [20].
Editors:
Anne Benoit, Haim Kaplan, Sebastian Wild, and Grzegorz Herman

1 Introduction

A fundamental problem in the theory of computation is to derandomize existing randomized algorithms. Such randomized algorithms typically use a large number of independent random bits. One main genre of derandomization is the construction of a probability distribution which is much smaller (of polynomial instead of exponential size), to “fool” the randomized algorithm. That is, the behavior of relevant statistics should be similar when presented with fully-independent random bits vs. bits drawn from a small, carefully-constructed correlated distribution. This probability space can be searched exhaustively, and in parallel, to find a specific input with desired properties.

A simple and popular example of this technique comes from a probability space with k-wise independence, for constant k [2, 30, 32, 35]. This preserves basic statistical properties such as means and variances [7, 42]. There are many other constructions for more-advanced properties, e.g., near-k-wise independence [3, 5, 14, 17, 38], probability spaces fooling halfspaces or polytopes [33, 41, 23, 22], ordered branching programs [12, 15, 37, 27, 40] etc.

A significant abstraction of these methods is the construction of probability spaces that fool statistical properties (also called “tests”) computed by finite automata [10, 24, 31, 36, 39, 40, 43]. Such tests are ubiquitous in randomized algorithm analysis, encompassing k-wise independence, sums of random variables, and many other properties. For example, they are used in deterministic parallel algorithms for finding balanced independent sets in graphs [26], for applications of the Lovász Local Lemma in combinatorics [25], for covering and packing integer programs [6, 44], for undirected connectivity problems [40], and more.

1.1 Our contribution

We describe new parallel algorithms to fool automata, with significantly reduced processor complexity. The analysis is also simplified because we can cleanly separate out problem-specific optimizations from the general lattice discrepancy problems that are at the core of the algorithm. A summary of our algorithm performance is as follows:

Theorem 1 (Simplified).

Consider a probability space Ω on n binary random variables and a collection of ℓ statistical tests over Ω, each with η possible states and Lipschitz value λ.

There is a deterministic algorithm to generate a distribution D of size ε^{-2}·polylog(n, η, ℓ, 1/ε) to simultaneously “fool” all statistical tests to absolute weighted error λε. It uses Õ(nℓη/ε² + nℓη^ω) processors and polylogarithmic time, where ω is the exponent of matrix multiplication.

By way of comparison, the algorithm of [24] would require roughly O(n³ℓ²η⁴/ε²) processors to construct a distribution of size O(η²/ε²). Our full results are more general, easily allowing for non-binary alphabets or more complex automata types; see Theorem 5 for details. We emphasize that the algorithm is still typically more efficient even for applications with no particular Lipschitz bounds.

We illustrate this framework through two prototypical applications: the Gale-Berlekamp Switching Game and SDP rounding for approximate MAX-CUT.

Theorem 2 (Gale-Berlekamp Switching Game).

Given an n×n matrix A, there is a deterministic parallel algorithm using Õ(n^{3.5}) processors and polylogarithmic time to find x, y ∈ {−1,+1}^n satisfying ∑_{i,j} A_{i,j} x_i y_j ≥ (√(2/π) − o(1))·n^{3/2}.

It is interesting that the Gale-Berlekamp analysis revolves around anti-concentration bounds, which are precisely the opposite of discrepancy minimization. Our algorithm for this problem beats the n^{5+o(1)} complexity of the optimized algorithm of [24].

Theorem 3 (MAX-CUT Approximation).

Let ε > 0 be an arbitrary constant. Given an n-vertex, m-edge graph G = (V,E), there is a deterministic parallel algorithm using Õ(mn³) processors and polylogarithmic time to find an α(1−ε)-approximate MAX-CUT of G, where α ≈ 0.878 is the Goemans-Williamson approximation constant [21].

The SDP relaxation of MAX-CUT can be approximated in polylogarithmic time and near-linear work by [1], so we will be concerned with rounding the SDP solution. A rounding procedure was presented in [43], which however required a very large number of processors. We drastically improve it, getting closer to the sequential Õ(n³) runtime of [9].

The complexity bounds in Theorems 2 and 3 do not depend on fast matrix multiplication; the runtimes are valid even using naive multiplication (ω = 3). Although they are still not fully work-efficient, this makes progress toward practical deterministic parallel algorithms.

In a broader perspective, our construction differs in various aspects from classical derandomization via (sequential) PRGs. While our automata framework does not provide a PRG, it seems better suited to efficient parallel derandomization. It is known that, by connections to Boolean circuits [11], a PRG for logspace-bounded computation can be simulated on a PRAM in polylogarithmic time with a polynomial number of processors. However, relying on the equivalence between PRAMs and Boolean circuits leads to very high processor complexities (e.g., Nisan's logspace-bounded PRG uses at least Ω(n^{45}) processors [39]). Therefore, using known PRG constructions would either require high polynomial complexity (if logspace-bounded) or not parallelize well. In addition to providing fast parallelization, our construction returns a distribution with seed length O(log(1/ε) + log log(n, η, ℓ)), which is in line with known PRGs.

1.2 Technical approach

Our work follows the same general outline as previous algorithms for fooling automata, built from two main subroutines Reduce and Fool. The Fool algorithm (Section 5) recursively invokes Reduce (Section 4) to construct a fooling distribution, beginning with single-step automaton transitions and progressively scaling up to multi-step distributions. After log₂ n levels, the final distribution effectively fools n-step walks in the automata.

We introduce several novelties, such as a work-efficient Reduce subroutine and an analysis of Fool based on a different measure of error. Here, we attempt to provide some high-level intuition for the technical content of this paper.

Distribution error tracking.

Prior work on automata fooling [24, 36] considered a notion of unweighted absolute error, where the goal was to fool all pairs of (start, end) states. We replace this with an aggregated notion: the weight of a final state corresponds to the value of the associated final outcome (e.g., the final count for a counter automaton). The value of an intermediate state is then the expected weight of the final state under a random suffix. This is similar to a scheme used in the context of pseudorandom generators and branching programs [12, 15], which takes advantage of sub-maximal influence on the final expected value. The processor complexity is determined by the Lipschitz value of each state: how much the expected final weight changes with a single step of the automaton.

REDUCE subroutine.

This subroutine takes as input a distribution E = D_1 × D_2 over automata walks (drivestreams) with a weight function w; it outputs a refined distribution D with significantly smaller support while maintaining a weighted measure of closeness to E. This leverages the connection between automata fooling and lattice approximation noted by [36] and applies the work-efficient lattice approximation algorithm of [19]. Our key approach to reducing the processor complexity is to sparsify E – without materializing it – in log₂|E| steps.

In each step, we model the problem of rounding the ith bit of the elements in D as a lattice-discrepancy problem, which is then solved by invoking [19]. This step resembles the PRG of INW [28]; there, one replaces the concatenation of two independent random walks with correlated random walks, using the derandomized square instead of the algorithm of [19].

Further optimizations.

In both of these applications, and many others, the relevant automata are counters which track the running sum of certain statistics. Instead of keeping track of these exactly, we truncate the counters to within a few standard deviations of their means. This reduces the state space of the automata by roughly a square-root factor. Doing this while also preserving relevant Lipschitz properties of the automata is technically challenging and requires significant analysis. This optimization will be discussed in Section 6.

The approximate MAX-CUT application (Section 7.2) requires a few additional problem-specific optimizations and arguments. In particular, to simulate the SDP rounding procedure efficiently, we (i) work with discretized truncated Gaussians and quantized counters, (ii) use a pessimistic estimator with a better Lipschitz constant, and (iii) compute the transition matrices via FFTs to exploit their convolution structure. We believe that these optimizations may generalize to other (SDP) rounding procedures and other settings as well.

2 Preliminaries

For an input of size N, our goal is to develop deterministic PRAM algorithms with poly(N) processor complexity and polylog(N) runtime. Unless stated otherwise, we assume that all relevant algorithms run in deterministic polylogarithmic time as a function of their processor and input complexities. For processor complexity and other parameters, we use Õ notation: we say that f(x) ∈ Õ(g(x)) if f(x) ≤ g(x)·polylog(N, g(x)), where N is the size of all algorithm inputs. Throughout, we use log for the logarithm in base e and lg for base 2. We write ⌊x⌉ for rounding x to the nearest integer.

Throughout, proofs that are omitted (or sketched) appear in the full version.

2.1 Basic definitions for automata

The underlying probability space Ω is defined by drawing a sequence r = (r_0, …, r_{n−1}), where each r_t is independently drawn from a distribution Ω_t over an arbitrary alphabet. We consider an automaton F with a state space S. At each timestep t = 0, 1, …, n−1, the automaton in state s ∈ S receives an input r_t and transitions to state F(r_t, s).

For times t ≤ t′, we define Ω_{t,t′} = Ω_t × Ω_{t+1} × ⋯ × Ω_{t′−1}. In this context, we call h = t′ − t the horizon and the pair (t,h) the window. We refer to a vector r = (r_t, r_{t+1}, …, r_{t′−1}) ∈ Ω_{t,t′} as a drivestream. We define F(r,s) to be the result of transiting from time t to t′ under r. We write r ∈ Ω_{t,t′} to indicate that each r_j : j = t, …, t′−1 has non-zero probability in Ω_j.

For a distribution D on drivestreams, we denote the transition matrix as T_D, i.e. T_D(s,s′) is the probability of transiting from state s to s′ for a drivestream r ∼ D. For brevity, we also write T_{t,t′} := T_{Ω_{t,t′}} and, for any weight function w : S → ℝ, we also define

T_D(s,w) = ∑_{s′∈S} T_D(s,s′)·w(s′).

Throughout, we define η = |S| and σ = ∑_{t=0}^{n−1} |Ω_t|, where |Ω_t| is the size of Ω_t.
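To make these definitions concrete, here is a minimal sketch (Python; the automaton F, the distribution D of (probability, drivestream) pairs, and the weight function w are placeholder inputs rather than objects defined in this paper) that tabulates T_D(s,s′) by direct simulation and then evaluates T_D(s,w):

```python
from collections import defaultdict

def transition_matrix(F, D, states):
    """Tabulate T_D(s, s'): the probability of transiting from s to s' under a
    drivestream r drawn from D, given as a list of (probability, drivestream) pairs."""
    T = {s: defaultdict(float) for s in states}
    for p, r in D:
        for s in states:
            s_end = s
            for r_t in r:                  # walk the automaton along the drivestream
                s_end = F(r_t, s_end)
            T[s][s_end] += p
    return T

def expected_weight(T, w):
    """T_D(s, w) = sum over s' of T_D(s, s') * w(s')."""
    return {s: sum(p * w(s2) for s2, p in row.items()) for s, row in T.items()}
```

This brute-force tabulation costs roughly |D|·h·η automaton steps; the data structures of Section 3 are designed to avoid exactly this kind of materialization.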

Our goal is to find a polynomial-size distribution D on drivestreams that “fools” the automaton. That is, the behavior of the automaton given a drivestream r ∼ D should be similar to its behavior when given a drivestream r ∼ Ω. We will follow a strategy of [12] and measure error via a weight function W : S → ℝ over final states. It will be necessary to measure the “smoothness” of W as a function of the drivestream values. Formally, we consider two closely related notions:

Definition 4 (Lipschitz value/confusion of a weight function).

Let W : S → ℝ be a weight function for automaton F.

  • The Lipschitz value at state s and time t is defined by

    𝔏(s,t) = max_{r_1,r_2} |W(F(r_1,s)) − W(F(r_2,s))|,

    where the maximum is over r_1, r_2 ∈ Ω_{t,n} which differ in only the t-th coordinate.

  • The confusion at state s and time t is defined by

    ℭ(s,t) = max_{r_1,r_2 ∈ Ω_t} |T_{t+1,n}(F(r_1,s), W) − T_{t+1,n}(F(r_2,s), W)|.

Colloquially, these are the maximum change to the actual weight or the expected weight of the final state due to changing drivestream entry t. Note that ℭ(s,t) ≤ 𝔏(s,t).

We define the total variability 𝔙(s) of a starting state s ∈ S as:

𝔙(s) := ∑_{t=0}^{n−1} max_{r ∈ Ω_{0,t}} ℭ(F(r,s), t).

With this rather technical definition, we can state our algorithm concretely:

Theorem 5.

The algorithm Fool takes as input a parameter ε>0, and produces a distribution D with

|T_D(s,W) − T_Ω(s,W)| ≤ ε·𝔙(s)   for all states s.

The distribution D has size ε^{-2}·polylog(n, η, σ, 1/ε). The cost of Fool is Õ(nη/ε² + ησ) processors, plus the cost of computing the expectations T_{t,n}(s,W) for states s ∈ S and times t = 0, …, n (more details on this step later).

2.2 Lattice approximation

The problem of automata fooling is closely linked to lattice approximation, defined as follows:

Definition 6 (Lattice Approximation Problem (LAP)).

Given an m×n matrix A and a fractional vector u ∈ [0,1]^n, the objective is to compute an integral vector v ∈ {0,1}^n to minimize the discrepancies D_k = |∑_{j=1}^{n} A_{kj}(u_j − v_j)|.

Intuitively, this models the process of “rounding” each random bit (represented by u_j) to a static zero or one (represented by v_j). The work [19] provides a parallel algorithm with near-optimal discrepancy as well as complexity; we summarize it as follows:

Theorem 7 (Theorem 1.3 of [19]).

Suppose that A_{kj} ∈ [0,1] for all k, j. There is a deterministic parallel algorithm for the LAP with D_k ≤ O(√(μ_k log m) + log m) for all k, where μ_k = ∑_{j=1}^{n} A_{kj} u_j. The algorithm uses Õ(n + m + nnz(A)) processors, where nnz(A) denotes the number of non-zero entries in the matrix A.

It will be convenient to allow for the discrepancy matrix A to take arbitrary real values.

Proposition 8.

Let Δ_k = max_j |A_{kj}| for each row k. There is a deterministic parallel algorithm for the LAP where D_k ≤ O(√(Δ_k·μ̃_k·log m) + Δ_k·log m) ≤ O(Δ_k·(√(n log m) + log m)) for all k, where μ̃_k = ∑_{j=1}^{n} |A_{kj}|·u_j. The algorithm uses Õ(n + m + nnz(A)) processors.

Proof.

Construct a 2m×n matrix Ã, where for each row k = 0, …, m−1 of A, the matrix Ã has Ã_{2k,j} = max{0, A_{kj}/Δ_k} and Ã_{2k+1,j} = max{0, −A_{kj}/Δ_k}. Then apply Theorem 7 to Ã.
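The row-splitting in this proof is mechanical enough to state as code; the following is a sketch (numpy; `lattice_approx_01` is an assumed black box implementing Theorem 7, not an API from the paper):

```python
import numpy as np

def lap_general(A, u, lattice_approx_01):
    """Proposition 8: reduce a real-valued LAP instance (A, u) to one whose entries
    lie in [0, 1], then call the Theorem 7 solver (the assumed black box
    `lattice_approx_01`) to obtain an integral vector v in {0,1}^n."""
    m, n = A.shape
    Delta = np.abs(A).max(axis=1)                 # Delta_k = max_j |A_{kj}|
    Delta[Delta == 0] = 1.0                       # all-zero rows contribute nothing
    A_scaled = A / Delta[:, None]                 # entries now lie in [-1, 1]
    A_tilde = np.empty((2 * m, n))
    A_tilde[0::2] = np.maximum(0.0,  A_scaled)    # row 2k:   max{0,  A_kj / Delta_k}
    A_tilde[1::2] = np.maximum(0.0, -A_scaled)    # row 2k+1: max{0, -A_kj / Delta_k}
    return lattice_approx_01(A_tilde, u)
```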

3 Overview of algorithms and data structures

The subroutine Reduce (Section 4) is the technical core of the automata-fooling construction: given an input distribution E over drivestreams and a weight function w, it returns a distribution D which is close to E (measured in terms of w), but which has much smaller support. Importantly, through the use of appropriate data structures, the cost of Reduce may be significantly less than the total size of E itself. Later, the final algorithm Fool (Section 5) will be defined by repeated calls to Reduce over successively larger time-horizons.

Concretely, we store each distribution D over window (t,h) as an array of drivestreams D[0], …, D[ℓ−1] ∈ Ω_{t,t+h} with associated probabilities p_D(0), …, p_D(ℓ−1). Here ℓ = |D| is the size of D. By adding dummy zero entries, we can assume that |D| is a power of two.

For a bitstring b of length at most lg|D|, we denote by D[b] the induced distribution consisting of all the drivestreams in D whose indices start with b. So D[b] = D[b0] ∪ D[b1], where b0 and b1 refer to concatenating a 0 or 1 bit to b. Similarly, p_D(b) = ∑_{a ∈ D[b]} p_D(a).

We define the Prediction Problem for a distribution D and a weight function w : S → ℝ as follows: we need to produce a data structure 𝒬(D,w) which can answer the following two types of queries: (i) given any bitstring b, return the value p_D(b); (ii) given (b,s), where b is a bitstring and s is a state, return the value T_{D[b]}(s,w). Each query should take polylogarithmic time and processors. In either case, b can take on any length at most lg|D|.

Observation 9.

Let D be a distribution on window (t,h). The Prediction Problem for D and a weight function w can be solved with Õ(|D|·h·η) processors.
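One way to realize this bound, sketched below in Python (the prefix-dictionary layout is only for illustration and is not specified in the paper), is to simulate each drivestream once at the leaves and then combine children bottom-up, after which both query types become table lookups.

```python
def build_prediction(F, D, w, states):
    """Precompute p_D(b) and T_{D[b]}(s, w) for every index prefix b.
    D is a list of (probability, drivestream) pairs whose length is a power of two."""
    L = (len(D) - 1).bit_length()
    table = {}
    # Leaves: b is a full index, so D[b] is a single drivestream.
    for i, (p, r) in enumerate(D):
        b = format(i, "b").zfill(L) if L > 0 else ""
        final = {}
        for s in states:
            cur = s
            for r_t in r:
                cur = F(r_t, cur)
            final[s] = w(cur)
        table[b] = (p, final)
    # Internal nodes: probability-weighted average of the two children.
    for length in range(L - 1, -1, -1):
        for i in range(2 ** length):
            b = format(i, "b").zfill(length) if length > 0 else ""
            p0, t0 = table[b + "0"]
            p1, t1 = table[b + "1"]
            p = p0 + p1
            table[b] = (p, {s: ((p0 * t0[s] + p1 * t1[s]) / p if p > 0 else 0.0)
                            for s in states})
    return table   # table[b] = (p_D(b), { s: T_{D[b]}(s, w) })
```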

The most important case is for a Cartesian product, which will be used in Section 5. Critically, we can use the structure within a distribution to avoid materializing it explicitly.

Proposition 10.

Given distributions D_1, D_2 on windows (t_1,h) and (t_1+h,h) respectively, the Prediction Problem for the distribution E = D_1 × D_2 and any weight function w can be solved with Õ(ηh(|D_1| + |D_2|)) processors.

4 The REDUCE Algorithm

To reiterate, the goal of Reduce (Algorithm 1) is to take an input distribution E and weight function w, and produce a smaller distribution D which is close to it. Note that here w will not be the given weight function over final states W.

For intuition, consider the following process. Draw m elements D[0], …, D[m−1] randomly and independently with replacement from the support of the distribution E, wherein D[i] = v is selected with probability proportional to p_E(v). Then set p_D(i) = 1/m, so the process is unbiased. Via standard concentration bounds, appropriate choices of m ensure that T_D(s,w) ≈ T_E(s,w). The algorithm Reduce is based on derandomizing this process via a slowed-down simulation: to compute D, we iteratively compute distributions D_0, D_1, …, D_ℓ = D by fixing the i-th bit level of each entry in D_{i−1}.

Algorithm 1 Reduce(E,ε,w).

In order to measure distribution error, we introduce the following key definition:

Definition 11 (Sensitivity).

For a function w : S → ℝ and state s, the sensitivity α(s,w,E) is:

α(s,w,E) := max_{r ∈ E} w(F(r,s)) − min_{r ∈ E} w(F(r,s))

The following is our main result analyzing the algorithm:

Theorem 12.

The algorithm Reduce uses Õ((h+η)/ε²) processors, plus the cost of solving the Prediction Problem for E. The final distribution D is uniform with size O(log η · log²|E| / ε²). For a large enough constant C, the distribution D satisfies

|T_D(s,w) − T_E(s,w)| ≤ ε·α(s,w,E)   for each state s.

Proof.

Let us fix E, w; for brevity, we write α_s := α(s,w,E) for each state s.

Each multiset H_i contains m bitstrings H_i[0], …, H_i[m−1] of length i. Consider the distribution D_i obtained by drawing a bitstring b ∈ H_i uniformly at random and then drawing a drivestream from E[b]. In particular, D_0 = E, and the distribution returned by Reduce is precisely D_ℓ = D. We have the following equation for every state s:

T_{D_i}(s,w) = (1/m)·∑_{b ∈ H_i} T_{E[b]}(s,w).

Now, to analyze a given step i, observe that H_{i+1} is obtained by appending a bit v_b to each bitstring b ∈ H_i. We expose the choice of the next bit in D_i and D_{i+1} as follows:

T_{D_i}(s,w) = (1/m)·∑_{b ∈ H_i} [ (p_E(b1)/p_E(b))·T_{E[b1]}(s,w) + (p_E(b0)/p_E(b))·T_{E[b0]}(s,w) ],
T_{D_{i+1}}(s,w) = (1/m)·∑_{b ∈ H_i} [ v_b·T_{E[b1]}(s,w) + (1 − v_b)·T_{E[b0]}(s,w) ].

Thus we can calculate the difference between probabilities for Di and Di+1 as:

T_{D_{i+1}}(s,w) − T_{D_i}(s,w) = (1/m)·∑_{b ∈ H_i} (v_b − p_E(b1)/p_E(b))·(T_{E[b1]}(s,w) − T_{E[b0]}(s,w)). (1)

In light of Eq. (1), we apply Proposition 8, where each state s corresponds to a constraint row k with entries A_{kb} = T_{E[b1]}(s,w) − T_{E[b0]}(s,w), and with u_b = p_E(b1)/p_E(b) for all b. It is evident that, after solving the Prediction Problem for E, the values T_{E[bx]}(s,w), p_E(bx) for the Lattice Approximation Problem at Line 6 can be generated using O(ηm) processors. Furthermore, the maximum spread of the values w(F(r,s)) over r is at most α_s, so Δ_k = max_b |A_{kb}| ≤ α_s. Since the matrix has η rows and m columns, Proposition 8 gives:

|T_{D_{i+1}}(s,w) − T_{D_i}(s,w)| ≤ O( (α_s/√m)·√(log η) + (α_s·log η)/m )

By our choice of m, this is at most ε·α_s/ℓ. Over all iterations, this gives the desired bound

|T_D(s,w) − T_E(s,w)| ≤ ∑_{i=0}^{ℓ−1} |T_{D_{i+1}}(s,w) − T_{D_i}(s,w)| ≤ ε·α_s.
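Putting the pieces together, the loop can be sketched as follows (Python; `predict_p` and `predict_T` stand for the two Prediction Problem queries, `lattice_approx` for the Proposition 8 solver, and the constant C is illustrative; all are assumed interfaces rather than definitions from the paper):

```python
import math
import numpy as np

def reduce_distribution(E_size, predict_p, predict_T, states, eps, lattice_approx, C=4):
    """Sketch of Reduce: maintain m bitstrings H (a multiset of prefixes of indices
    into E) and fix one more bit per round by solving a lattice-approximation
    problem with one row per state, as in Eq. (1)."""
    L = int(round(math.log2(E_size)))
    m = max(1, int(math.ceil(C * math.log(len(states) + 2) * L * L / eps ** 2)))
    H = [""] * m                                    # level 0: D_0 = E
    for _ in range(L):
        # u_b = Pr[next bit is 1 | prefix b];  A[s][b] = T_{E[b1]}(s,w) - T_{E[b0]}(s,w)
        u = np.array([(predict_p(b + "1") / predict_p(b)) if predict_p(b) > 0 else 0.0
                      for b in H])
        A = np.array([[predict_T(b + "1", s) - predict_T(b + "0", s) for b in H]
                      for s in states])
        v = lattice_approx(A, u)                    # integral rounding of the next bit
        H = [b + ("1" if round(v_b) else "0") for b, v_b in zip(H, v)]
    return H   # the output D is uniform over the drivestreams { E[b] : b in H }
```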

5 The FOOL Algorithm

We build the automata-fooling distribution via the algorithm Fool (Algorithm 2).

Algorithm 2 Fool(ε) (Assume n is a power of two).

The algorithm first finds the “expected value” vectors V_t for each distribution Ω_{t,n}. This may take advantage of problem-specific automaton properties, and it may have some small error. We will also describe a few “generic” methods to calculate these probabilities exactly.

The main loop fools distributions on time horizons h = 1, 2, 4, 8, … in a bottom-up fashion, and merges them together using the Reduce procedure from the previous section. Specifically, at each level i, it executes Reduce in parallel to combine D_{i,t} and D_{i,t+h} to obtain D_{i+1,t}. Note that we never materialize the Cartesian product of D_{i,t} and D_{i,t+h}.
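Structurally, the main loop can be sketched as follows (Python; `omega_prediction`, `product_prediction`, and `reduce_distribution` are assumed stand-ins for the single-step distributions, Proposition 10, and Algorithm 1; the weight vectors passed to each Reduce call are omitted for brevity):

```python
import math

def fool(n, eps, omega_prediction, product_prediction, reduce_distribution):
    """Sketch of Fool: merge windows bottom-up over horizons h = 1, 2, 4, ..., n.
    D[t] holds the (prediction structure of the) current distribution for the
    window starting at time t; adjacent windows are combined and sparsified."""
    delta = eps / (20 * (1 + math.log2(n)))          # per-level error budget
    D = {t: omega_prediction(t) for t in range(n)}   # level 0: D_{0,t} = Omega_t
    h = 1
    while h < n:
        merged = {}
        for t in range(0, n, 2 * h):
            E = product_prediction(D[t], D[t + h])   # E = D_{i,t} x D_{i,t+h}, never materialized
            merged[t] = reduce_distribution(E, delta)
        D, h = merged, 2 * h
    return D[0]                                      # fools the full n-step walk
```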

Theorem 13.

The cost of Fool is Õ(nη/ε² + ησ) processors, plus the cost of computing the vectors V̂_t : t = 0, …, n. The final distribution D_{lg n, 0} has size O(log⁵(nησ/ε)/ε²).

Let us now proceed to analyze the error of the final distribution D. For purposes of analysis, we define the exact transition vectors V_t(s) = T_{t,n}(s,W) and the approximation error β = max_{s,t} |V_t(s) − V̂_t(s)|.

Theorem 14.

For any state s ∈ S, the final distribution D returned by Fool(ε,W) satisfies

|T_D(s,W) − T_Ω(s,W)| ≤ ε·𝔙(s) + 3βn.

Proof Sketch.

For any state s and times k ≥ t, let a_{s,t,k} be the maximum value of α(s′, V_{k+1}, Ω_k) over s′ = F(r,s) : r ∈ Ω_{t,k}. We show by induction on i that each D_{i,t} satisfies, for any s,

|T_{D_{i,t}}(s, V_{t+2^i}) − T_{t,t+2^i}(s, W)| ≤ 3(2^i − 1)·β + 2iδ·∑_{k=t}^{t+2^i−1} a_{s,t,k} (2)

The base case i = 0 is vacuous since D_{0,t} = Ω_t. The final case i = lg n, t = 0 establishes the claimed result, since δ = ε/(20(1 + lg n)) and 𝔙(s) = ∑_{t=0}^{n−1} a_{s,0,t}.

As we have mentioned, there can be problem-specific shortcuts to compute (or approximate) the transition matrices T and the resulting expected-value vectors V_t. There is the following generic method to compute them via matrix multiplication.

Proposition 15.

All vectors V_t = T_{t,n}(·,W) can be computed with Õ(nη^ω) processors, where ω is the exponent of any efficiently-parallelizable matrix-multiplication algorithm.

In particular, for the Coppersmith-Winograd algorithm [16], we have ω ≤ 2.38. (See [29] for further details on parallel implementation.)

Moreover, if we can compute all transition matrices T_{t,t+h} : h = 2^i, t = j·2^i, then we can compute all vectors V_t = T_{t,n}(·,W) exactly (for any W) with Õ(nη²) processors.
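To illustrate, the following sketch (numpy; dense η×η one-step matrices, n a power of two, with sequential Python standing in for the parallel schedule) first builds the dyadic window matrices T_{t,t+h} by doubling and then fills in every V_t = T_{t,n}(·,W) from coarse to fine levels, one matrix-vector product per time step; within each level all products are independent, which is what allows polylogarithmic depth.

```python
import numpy as np

def expectation_vectors(T1, W):
    """Compute V_t = T_{t,n}(., W) for all t = 0, ..., n, given the one-step
    transition matrices T1[t] = T_{t,t+1} and the final weight vector W."""
    n = len(T1)                                # assumed to be a power of two
    W = np.asarray(W, dtype=float)
    # Doubling: T[h][t] = T_{t,t+h} for h a power of two and t a multiple of h.
    T = {1: {t: np.asarray(T1[t], dtype=float) for t in range(n)}}
    h = 1
    while h < n:
        T[2 * h] = {t: T[h][t] @ T[h][t + h] for t in range(0, n, 2 * h)}
        h *= 2
    # Fill in V_t from coarse to fine: V_{t} = T_{t,t+h} V_{t+h}, where V_{t+h}
    # was already computed at a coarser level (or is the base case V_n = W).
    V = {n: W, 0: T[n][0] @ W}
    h = n
    while h > 1:
        h //= 2
        for t in range(h, n, 2 * h):           # t is an odd multiple of h
            V[t] = T[h][t] @ V[t + h]
    return [V[t] for t in range(n + 1)]
```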

Therefore, when dealing with multiple statistical tests, we have the following result.

Corollary 16.

Consider ℓ statistical tests i = 1, …, ℓ, each computed by its own automaton F_i on a state space S_i of size |S_i| = η_i, with its own weight function W_i. When combined into a single automaton, the resulting automaton F has the following properties:

  1. The total state space is η = ∑_i η_i.

  2. Each starting state s for automaton i has 𝔙(s) = 𝔙_i(s), where 𝔙_i denotes the value of 𝔙 for weight function W_i and automaton F_i.

  3. The exact vectors V_t can be calculated with Õ(n·∑_i η_i^ω) processors.

In particular, the Fool algorithm has processor complexity Õ(nη/ε² + ση + n·∑_i η_i^ω), and the resulting distribution D has |T_D(s,W) − T_Ω(s,W)| ≤ ε·𝔙_i(s) for each state s of automaton i.

6 Reducing the state space for counter automata

Given a weight function W, let us consider a statistic of the form

W(∑_t f_t(r_t))   for functions f_t : Ω_t → ℝ

We refer to such statistics as counters. They are ubiquitous in randomized algorithms; they are often used directly, and can also be combined together into higher-order statistics.

In the automata-fooling setting, it is easy to compute ∑_t f_t(r_t) by tracking all possible values for the running sum. However, this is often overkill: typically, the running sum is confined to a much smaller window, roughly the standard deviation around the mean. We can take advantage of this by constructing an automaton which only maintains the running sum within “typical” values close to the mean, along with a “reject” state for some exponentially-rare deviations. Let us define some relevant parameters for the counter.

μ_t = 𝔼_{r∼Ω_t}[f_t(r)],   M_t = max_{r∈Ω_t} |f_t(r) − μ_t|,   κ = ∑_{t=0}^{n−1} 𝕍ar_{r∼Ω_t}[f_t(r)],   M = max_{t∈[n]} M_t

Given some error parameter δ > 0, we choose a value B with B ≥ 100(1 + M + √κ)·log(n/δ). In this context, we refer to B as the span of the automaton.

Observation 17.

For a drivestream r drawn randomly from Ω, it holds with probability at least 1 − δ that |∑_{i=t}^{t′} (f_i(r_i) − μ_i)| < 0.15B for all times t ≤ t′.

Proof.

Apply Bernstein’s inequality and take a union bound over all pairs t, t′.

In light of Observation 17, our strategy will be to construct a truncated automaton F̃, which only stores the running sum within a window of ±B around the running mean. Define a_t = ⌊∑_{i=0}^{t} μ_i⌉ for each value t. The truncated automaton has state space S̃ = {−B, −B+1, …, B−1, B} ∪ {⊥}; given input r_t, it updates its state c to a new state c′ as:

c′ = c + f_t(r_t) − a_t + a_{t−1},   if c ≠ ⊥ and |c + f_t(r_t) − a_t + a_{t−1}| ≤ B
c′ = ⊥,   if c = ⊥ or |c + f_t(r_t) − a_t + a_{t−1}| > B

Each integer state c at time t corresponds to a running sum c + a_t. The state ⊥ represents a “reject” or “rare” state where the running sum has deviated too far from its mean. We define a related potential function ϕ : S̃ → [0,1] as

ϕ(c) = 1,   if |c| ≤ B/3
ϕ(c) = (2B/3 − |c|)/(B/3),   if B/3 < |c| < 2B/3
ϕ(c) = 0,   if |c| ≥ 2B/3 or c = ⊥

Intuitively, ϕ measures how close the current state c is to a reject state. It can be used to “damp” the weight function W and make a gradual transition to the reject state.
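The transition rule and damping function translate directly into code. A minimal sketch (Python; `REJECT` stands for the reject state ⊥, and the increment f_t(r_t) and the rounded running means a_t, a_{t−1} are passed in as integers as in the construction above):

```python
REJECT = None                                  # stands for the reject state

def truncated_step(c, f_val, a_t, a_prev, B):
    """One step of the truncated automaton: c is the current recentred running sum,
    f_val = f_t(r_t) is the new increment, a_t and a_prev are the rounded running
    means.  Once the recentred sum leaves [-B, B], the automaton stays rejected."""
    if c is REJECT:
        return REJECT
    c_new = c + f_val - a_t + a_prev
    return c_new if abs(c_new) <= B else REJECT

def phi(c, B):
    """Damping potential: 1 deep inside the window, 0 at or beyond 2B/3 (and at
    the reject state), with a linear ramp in between."""
    if c is REJECT or abs(c) >= 2 * B / 3:
        return 0.0
    if abs(c) <= B / 3:
        return 1.0
    return (2 * B / 3 - abs(c)) / (B / 3)
```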

For purposes of analysis, it is useful to compare F̃ with the “truthful” automaton F. The potential function ϕ can also be applied on the original state space S, where we use the convention that any state with |s| > B corresponds to the reject state of S̃.

Relevant properties of F̃, such as its transition matrix or its weight variability, can be derived from F up to a small error. Indeed, our application to the Gale-Berlekamp Switching Game will use precisely this approach. For the MAX-CUT application, we will need to combine multiple truncated automata; this will require a slightly different damping function and analysis.

Proposition 18.

Let W be a weight function for automaton F and define

W̃(c) = 0 if c = ⊥, and W̃(c) = ϕ(c)·W(c) if c ≠ ⊥;   Δ = max_{c∈S} |W(c)|,   Δ̃ = max_{c∈S : |c−μ_t| ≤ 2B} |W(c)|.

Let T, T̃ denote the transition matrices for automata F, F̃ respectively.

  (a) For any time t and state c ∈ S, there holds |T̃_{t,n}(c, W̃) − T_{t,n}(c, W̃)| ≤ δΔ.

  (b) For the starting state c = 0 and time t = 0, there holds |T̃_{t,n}(c, W̃) − T_{t,n}(c, W)| ≤ δΔ.

  (c) For any time t and state c, there holds

    ℭ̃(c,t) ≤ 𝔏(c,t) + 2Δδ + 6Δ̃·M_t/B,

    where ℭ̃ denotes the confusion with respect to automaton F̃ and weight function W̃, while 𝔏 denotes the Lipschitz value with respect to automaton F and weight function W.

Note that, in both these cases, we use the convention that if |c| > B, then c corresponds to the reject state of automaton F̃.

7 Applications

We present the derandomization of two basic problems: the Gale-Berlekamp Switching Game (Section 7.1) and approximate MAX-CUT via SDP rounding (Section 7.2). Through our new construction, we improve the deterministic processor complexity for both problems.

7.1 Gale-Berlekamp Switching Game

The Gale-Berlekamp Switching Game is a classic combinatorial problem: given an n×n matrix A with entries A_{ij} = ±1, we want to find vectors x, y ∈ {−1,+1}^n to maximize the imbalance I = ∑_{i,j} A_{i,j} x_i y_j. An elegant randomized algorithm of [4], based on the Central Limit Theorem, gives imbalance I ≥ (√(2/π) − o(1))·n^{3/2}. This was derandomized in [24], using automata-fooling with a processor complexity of n^{5+o(1)}.

Theorem 19.

There is a parallel deterministic algorithm to find x, y ∈ {−1,+1}^n with imbalance I ≥ (√(2/π) − o(1))·n^{3/2} using Õ(n^{3.5}) processors.

In order to show Theorem 19, following [8] and [4], we set x_i = 1 if ∑_j A_{ij} y_j > 0, and x_i = −1 otherwise. This gives

∑_{i,j} A_{i,j} x_i y_j = ∑_i x_i ∑_j A_{i,j} y_j = ∑_i |∑_j A_{i,j} y_j|.

As shown in [13], for y uniformly drawn from Ω = {−1,+1}^n, we have 𝔼[|∑_j A_{i,j} y_j|] ≥ √(2n/π) − o(√n) for each i. This is the statistical test we want to fool.

We will choose our drivestream values to be r_t = y_t, with Ω_t the uniform distribution on {−1,+1}. For each value i, we have a truncated counter automaton F̃_i to track ∑_j A_{ij} r_j. This uses the construction of Section 6 with δ = n^{−10}, where M = 1, κ = ∑_j A_{ij}² = n, and B = Õ(√n). This automaton uses the weight function

W̃(c) = 0 if c = ⊥, and W̃(c) = ϕ(c)·|c| if c ≠ ⊥.
Proposition 20.

Fix a row i and automaton F̃ = F̃_i, and correspondingly let F be the full “truthful” automaton. Let T, T̃ be the transition matrices for F, F̃ respectively.

  (a) For any time t and state s of F, we have |T_{t,n}(s, W̃) − T̃_{t,n}(s, W̃)| ≤ 2δn.

  (b) The final state z̃ = F̃(y, 0) for y ∼ Ω satisfies 𝔼[W̃(z̃)] ≥ √(2n/π) − o(√n).

  (c) The weight function W̃ for automaton F̃ has total variability O(n).

Proof.

We apply here Proposition 18 with W(c) = |c|. Clearly here 𝔏(c,t) = 2 and M_t ≤ 1 for all c, t. Also, we calculate Δ = n and Δ̃ ≤ Õ(√n). With Proposition 18(a), this proves (a). Similarly, by Proposition 18(b), we have

𝔼[W̃(z̃)] ≥ −δn + 𝔼_Ω[W(z)] ≥ −o(√n) + (√(2/π) − o(1))·√n,

where z is the final state for the automaton F.

Finally, for (c), we use the definition of total variability and Proposition 18(c) to get:

𝔙(c) = ∑_{t=0}^{n−1} max_{r∈Ω_{0,t}} ℭ̃(F̃(r,c), t) ≤ ∑_{t=0}^{n−1} (𝔏(c,t) + 2Δδ + 6Δ̃·M_t/B) = Õ(n).

The complexity now follows from Corollary 16 with ε = 1/(√n·log n), where we have n automata, each with state space of size η_i = Õ(√n). By applying Proposition 20, we get that for any sequence y drawn from the resulting distribution D and automaton F̃_i,

|T̃_D(0, W̃) − T_Ω(0, W)| ≤ |T̃_D(0, W̃) − T̃_Ω(0, W̃)| + |T̃_Ω(0, W̃) − T_Ω(0, W)|
 ≤ ε·𝔙(0) + o(√n) = o(√n).

Therefore, for any row i, the final state z̃ satisfies

𝔼_D[|∑_j A_{ij} y_j|] ≥ 𝔼_Ω[W_i(z̃)] − o(√n) ≥ (√(2/π) − O(ε))·√n.

By searching the space exhaustively, we can find a specific sequence y satisfying the desired imbalance bounds. This concludes the proof of Theorem 19.
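Concretely, the exhaustive search can be sketched as follows (numpy; `D_support` stands for the support of the constructed distribution D):

```python
import numpy as np

def best_imbalance(A, D_support):
    """Exhaustively scan the polynomial-size support of D: for each candidate y,
    set x_i = sign(sum_j A_ij y_j) and keep the pair with the largest imbalance."""
    best = None
    for y in D_support:
        y = np.asarray(y, dtype=float)
        row_sums = A @ y
        x = np.where(row_sums >= 0, 1.0, -1.0)
        imbalance = float(x @ row_sums)        # = sum_i |sum_j A_ij y_j|
        if best is None or imbalance > best[0]:
            best = (imbalance, x, y)
    return best
```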

7.2 Approximate MAX-CUT

The MAX-CUT problem for a graph G = (V,E) is to find a vertex set S ⊆ V to maximize the total weight of edges with exactly one endpoint in S. The seminal work of Goemans & Williamson [21] showed that MAX-CUT can be approximated to a factor of α ≈ 0.878 by rounding its semi-definite programming (SDP) formulation. Moreover, the integrality gap of the SDP is precisely α [18], and, assuming the Unique Games Conjecture, no better approximation ratio is possible for polynomial-time algorithms [34].

A sequential deterministic α-approximation with Õ(n³) runtime was shown in [9]. This relies on the method of conditional expectations, which is hard to parallelize. The work of [43] used the automata derandomization framework to get a parallel deterministic α(1−ε)-approximation algorithm. The main downside of this approach is the huge polynomial processor complexity (on the order of n^{100}). We give an improved analysis with our new automata-fooling construction and some other optimizations.

To review, note that MAX-CUT can be formulated as the following integer program:

max (1/2)·∑_{(i,j)=e∈E} w_e·(1 − v_i v_j)   s.t. v_i ∈ {−1,1}, i ∈ [n].

This can be relaxed to the following SDP:

max (1/2)·∑_{(i,j)=e∈E} w_e·(1 − v_i·v_j)   s.t. v_i ∈ [−1,1]^n, ‖v_i‖₂ = 1, i ∈ [n],

where · denotes the inner product in ℝ^n.

We can round a given SDP solution v by drawing independent standard Gaussian random variables X_0, …, X_{n−1} and constructing the cut S = {i : v_i·X ≥ 0}, which gives cut size W with

𝔼[W] = ∑_{(i,j)=e∈E} w_e·Pr[(i,j) is cut] = ∑_{(i,j)=e∈E} w_e·Pr[sgn(v_i·X) ≠ sgn(v_j·X)]
 = ∑_{(i,j)=e∈E} w_e·arccos(v_i·v_j)/π ≥ α·OPT,

where OPT denotes the size of the maximum cut. Following [43], we will derandomize this by fooling each statistical test Pr[sgn(v_i·X) ≠ sgn(v_j·X)].
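For reference, the randomized rounding being derandomized in the rest of this section is the standard hyperplane rounding; a minimal sketch (numpy; `v` is an array whose rows are the SDP vectors v_i):

```python
import numpy as np

def hyperplane_round(v, rng=None):
    """Goemans-Williamson rounding: draw a standard Gaussian vector X and output
    the cut S = { i : <v_i, X> >= 0 }; edge (i, j) is cut with probability
    arccos(<v_i, v_j>) / pi."""
    rng = rng or np.random.default_rng()
    X = rng.standard_normal(v.shape[1])
    return {i for i in range(v.shape[0]) if v[i] @ X >= 0}
```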

Theorem 21.

There is a deterministic parallel algorithm to find an α(1−ε)-approximate rounding of a MAX-CUT SDP solution with Õ(mn³/ε⁴) processors.

Via the algorithm of [1] to solve the SDP, this gives the following main result:

Corollary 22.

For an arbitrary constant ε > 0, we can find an α(1−ε)-approximate MAX-CUT solution using Õ(mn³) processors and polylogarithmic time.

The remainder of the section is devoted to showing Theorem 21. To avoid cumbersome calculations, we will show an algorithm for a (1 − O(ε))-approximation, assuming ε is smaller than any needed constants. For readability, Õ(·) suppresses any polylog(n/ε) terms.

As a starting point, consider the following process. We draw n independent standard Gaussian variables X_0, …, X_{n−1}. For each edge (i,j), we have a statistical test that reads these values in order and computes c_i = v_i·X and c_j = v_j·X. We then apply the weight function W_{ij}(c_i, c_j) as the indicator variable that sgn(c_i) ≠ sgn(c_j).

In order to apply our derandomization framework efficiently, we make a number of changes to this basic strategy: (i) we define the drivestream value r_t to be a discretized truncated Gaussian; (ii) we further quantize each term v_{ik}·r_k in the computation of the sum c_i = ∑_k v_{ik} r_k; (iii) we apply the optimization of Section 6 to reduce the state space of each counter automaton; and (iv) we modify the weight function to a pessimistic estimator with better Lipschitz properties. Let us fix two quantization parameters, for some constant C > 0:

γ = ε/(C·√(n·log(n/ε))),   R = C·√(log(n/ε))

Concretely, each drivestream value r_k is derived by truncating a standard Gaussian to within ±R and rounding to the nearest multiple of γ. Then, each term in the product v_{ik}·r_k is rounded to the nearest multiple of γ. Since this comes up frequently in the analysis, let us denote this “rounded” inner product by v∘r = γ·∑_k ⌊v_k r_k/γ⌉. Given this definition, we will form our cut set by setting S = {i : v_i∘r ≥ 0}.
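A small sketch of the two quantization steps just described (numpy; `gamma` and `R` are the parameters above, and the function names are illustrative):

```python
import numpy as np

def discretize_gaussian(x, R, gamma):
    """Map a standard Gaussian sample x to a drivestream value r_k: truncate to
    [-R, R], then round to the nearest multiple of gamma."""
    return gamma * np.rint(np.clip(x, -R, R) / gamma)

def rounded_inner_product(v_i, r, gamma):
    """v_i o r = gamma * sum_k round(v_ik * r_k / gamma); dividing by gamma gives
    the integer-valued counter tracked by the vertex automaton."""
    return gamma * float(np.sum(np.rint(v_i * r / gamma)))
```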

Lemma 23.

For an appropriately chosen C, there is a coupling between the random variables X and r such that, for any unit vector u ∈ ℝ^n, there holds Pr(|u·X − u∘r| > ε) ≤ (ε/n)^{10}.

Observe that the scaled sum (1/γ)·(v_i∘r) is an integer-valued counter. We can apply the method of Section 6 to construct a “vertex” automaton F̃_i that tracks the running sum within a window. We record a few parameters and observations about this automaton.

Proposition 24.

The automaton F̃_i can be implemented with the following parameters:

M_t ≤ Õ(|v_{it}|/γ),   M ≤ Õ(1/γ),   κ ≤ Õ(1/γ²),   δ = (ε/n)^{10},   B = Θ̃(1/γ).

From these vertex automata, we construct an edge automaton F̃_{ij} for each edge (i,j) ∈ E. The automaton F̃_{ij} keeps track of the joint states of F̃_i, F̃_j, i.e. the truncated running sums v_i∘r and v_j∘r. It uses the following weight function for the state s = (c_i, c_j):

W̃_{ij}(s) = 0,   if c_i = ⊥ or c_j = ⊥
W̃_{ij}(s) = 0,   if sgn(c_i) = sgn(c_j)
W̃_{ij}(s) = min{1, |c_i·c_j|/ε}·ϕ_i(c_i)·ϕ_j(c_j),   otherwise

Note that whenever W̃_{ij}(s) > 0, we have sgn(v_i∘r) ≠ sgn(v_j∘r), i.e. edge ij is cut.

Proposition 25.

For an edge ij, let F̃ = F̃_{ij}, and consider the “truthful” automaton F = F_{ij} (which uses the full untruncated automata for vertices i and j). Let T, T̃ denote the transition matrices for F, F̃ respectively.

  (a) For any time t and state s of F, there holds |T_{t,n}(s, W̃) − T̃_{t,n}(s, W̃)| ≤ 2δ.

  (b) The final automaton state z̃ = F̃(r, 0) for r ∼ Ω satisfies 𝔼[W̃(z̃)] ≥ arccos(v_i·v_j)/π − O(ε).

  (c) For drivestreams r_1, r_2 differing in coordinate t, the final states z_1 = F(r_1, 0), z_2 = F(r_2, 0) of automaton F have |W̃(z_1) − W̃(z_2)| ≤ Õ(ε^{−1}·(|v_{it}| + |v_{jt}|)).

  (d) The weight function W̃ has total variability Õ(√n/ε) for automaton F̃.

Proof.

The proof is very similar to that of Proposition 18; see the full version.

Next, we discuss how to approximate the transition matrices for the automata. In light of Proposition 25(a), we will compute the vectors V̂_t(s) = T_{t,n}(s, W̃) for the full automaton F_{ij}. This is much easier than computing the transition matrices for F̃_{ij}, since we do not need to track reject states. In particular, this achieves error β ≤ 2δ compared to V_t = T̃_{t,n}(s, W̃).

Proposition 26.

For each automaton F_{ij}, the vectors T_{t,n}(s, W̃) : t = 0, …, n can be computed with Õ(n³/ε⁴) processors.

Proof Sketch.

We will recursively compute the transition matrices T_{t,t+h} for h = 1, 2, 4, … and t divisible by h. The critical observation here is that, since F_{ij} is a pair of counters, we only need to keep track of the probability distribution on the difference pairs

(∑_{k=t}^{t+h−1} ⌊v_{ik} r_k/γ⌉, ∑_{k=t}^{t+h−1} ⌊v_{jk} r_k/γ⌉).

This is effectively a convolution, computable via an FFT.
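A sketch of this merge step (numpy; `P_left` and `P_right` are the joint increment distributions of the two rescaled counters over adjacent windows, stored as 2-D arrays indexed relative to per-window offsets, which simply add when windows are merged):

```python
import numpy as np

def merge_windows(P_left, P_right):
    """Joint increment distribution over the union of two adjacent windows:
    a full 2-D convolution of the per-window distributions, computed via FFT."""
    s0 = P_left.shape[0] + P_right.shape[0] - 1
    s1 = P_left.shape[1] + P_right.shape[1] - 1
    F = np.fft.rfft2(P_left, s=(s0, s1)) * np.fft.rfft2(P_right, s=(s0, s1))
    return np.fft.irfft2(F, s=(s0, s1))
```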

We are now ready to compute the total complexity. Each probability distribution Ω_t has size Õ(γ^{−1}) = Õ(√n/ε), thus σ = Õ(n^{1.5}/ε). The reduced total number of states of an edge-automaton F̃_{ij} is Õ(1/γ²) = Õ(n/ε²), and the weight function W̃_{ij} has total variability Õ(√n/ε). So we run Fool with parameter ε′ = ε/√n, and the fooling process has cost Õ(mn³/ε⁴ + mn^{2.5}/ε³). Consider the resulting distribution D. Recall that edge ij is cut whenever W̃_{ij}(s) > 0. For each F̃_{ij}, by Theorem 14, the resulting final state z satisfies

Pr_{r∼D}(edge ij is cut) ≥ 𝔼_{r∼D}[W̃_{ij}] ≥ 𝔼_{r∼Ω}[W̃_{ij}] − O(ε) − O(βn).

Here Proposition 25(a) gives β ≤ 2δ ≤ ε/n, and Proposition 25(b) gives 𝔼_{r∼Ω}[W̃_{ij}] ≥ arccos(v_i·v_j)/π − O(ε). Thus, overall, the edge ij is cut with probability at least arccos(v_i·v_j)/π − O(ε).

Summing over all edges, the total expected weight of the cut edges is at least

∑_{ij∈E} w_e·arccos(v_i·v_j)/π − ∑_{ij∈E} w_e·O(ε).

The first term is at least α·OPT. The second term is at most O(ε·OPT), since OPT ≥ ∑_e w_e/2. By searching D exhaustively, we can find a cut satisfying these bounds.

References

  • [1] Zeyuan Allen-Zhu, Yin Tat Lee, and Lorenzo Orecchia. Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. In Proc. 27th annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1824–1831, 2016. doi:10.1137/1.9781611974331.ch127.
  • [2] Noga Alon, László Babai, and Alon Itai. A fast and simple randomized parallel algorithm for the maximal independent set problem. Journal of Algorithms, 7(4):567–583, 1986. doi:10.1016/0196-6774(86)90019-2.
  • [3] Noga Alon, Oded Goldreich, Johan Håstad, and René Peralta. Simple constructions of almost k-wise independent random variables. Random Structures & Algorithms, 3(3):289–304, 1992. doi:10.1002/rsa.3240030308.
  • [4] Noga Alon and Joel H Spencer. The probabilistic method. John Wiley & Sons, 2015. URL: http://cs.nyu.edu/cs/faculty/spencer/nogabook/nogabook.html.
  • [5] Yossi Azar, Rajeev Motwani, and Joseph Naor. Approximating probability distributions using small sample spaces. Combinatorica, 18(2):151–171, 1998. doi:10.1007/PL00009813.
  • [6] Nikhil Bansal, Nitish Korula, Viswanath Nagarajan, and Aravind Srinivasan. Solving packing integer programs via randomized rounding with alterations. Theory of Computing, 8(24):533–565, 2012. doi:10.4086/toc.2012.v008a024.
  • [7] Mihir Bellare and John Rompel. Randomness-efficient oblivious sampling. In Proc. 35th annual Symposium on Foundations of Computer Science (FOCS), pages 276–287, 1994. doi:10.1109/SFCS.1994.365687.
  • [8] Bonnie Berger. The fourth moment method. SIAM Journal on Computing, 26(4):1188–1207, 1997. doi:10.1137/S0097539792240005.
  • [9] Ankur Bhargava and S Rao Kosaraju. Derandomization of dimensionality reduction and SDP based algorithms. In Workshop on Algorithms and Data Structures, pages 396–408. Springer, 2005. doi:10.1007/11534273_35.
  • [10] Manuel Blum and Oded Goldreich. Towards a computational theory of statistical tests. In Proc. 33rd annual Symposium on Foundations of Computer Science (FOCS), pages 406–416, 1992. doi:10.1109/SFCS.1992.267749.
  • [11] Allan Borodin. On relating time and space to size and depth. SIAM journal on computing, 6(4):733–744, 1977. doi:10.1137/0206054.
  • [12] Mark Braverman, Anup Rao, Ran Raz, and Amir Yehudayoff. Pseudorandom generators for regular branching programs. SIAM Journal on Computing, 43(3):973–986, 2014. doi:10.1137/120875673.
  • [13] Thomas A Brown and Joel H Spencer. Minimization of ±1 matrices under line shifts. In Colloquium Mathematicum, volume 1, pages 165–171, 1971.
  • [14] Suresh Chari, Pankaj Rohatgi, and Aravind Srinivasan. Improved algorithms via approximations of probability distributions. In Proc. 26th annual ACM Symposium on Theory of Computing (STOC), pages 584–592, 1994. doi:10.1145/195058.195411.
  • [15] Lijie Chen, William M. Hoza, Xin Lyu, Avishay Tal, and Hongxun Wu. Weighted pseudorandom generators via inverse analysis of random walks and shortcutting. In Proc. 64th annual Symposium on Foundations of Computer Science (FOCS), pages 1224–1239, 2023. doi:10.1109/FOCS57990.2023.00072.
  • [16] Don Coppersmith and Shmuel Winograd. Matrix multiplication via arithmetic progressions. In Proc. 19th annual ACM Symposium on Theory of Computing (STOC), pages 1–6, 1987. doi:10.1145/28395.28396.
  • [17] Guy Even, Oded Goldreich, Michael Luby, Noam Nisan, and Boban Veličkovic. Approximations of general independent distributions. In Proc. 24th annual ACM Symposium on Theory of Computing (STOC), pages 10–16, 1992. doi:10.1145/129712.129714.
  • [18] Uriel Feige and Gideon Schechtman. On the optimality of the random hyperplane rounding technique for MAX CUT. Random Structures & Algorithms, 20(3):403–440, 2002. doi:10.1002/rsa.10036.
  • [19] Mohsen Ghaffari and Christoph Grunau. Work-efficient parallel derandomization II: Optimal concentrations via bootstrapping. In Proc. 56th annual ACM Symposium on Theory of Computing (STOC), pages 1889–1900, 2024. doi:10.1145/3618260.3649668.
  • [20] Mohsen Ghaffari, Christoph Grunau, and Václav Rozhoň. Work-efficient parallel derandomization I: Chernoff-like concentrations via pairwise independence. In Proc. 64th annual Symposium on Foundations of Computer Science (FOCS), pages 1551–1562, 2023. doi:10.1109/FOCS57990.2023.00094.
  • [21] Michel X Goemans and David P Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995. doi:10.1145/227683.227684.
  • [22] Parikshit Gopalan, Raghu Meka, Omer Reingold, Luca Trevisan, and Salil Vadhan. Better pseudorandom generators from milder pseudorandom restrictions. In Proc. 53rd annual Symposium on Foundations of Computer Science (FOCS), pages 120–129, 2012. doi:10.1109/FOCS.2012.77.
  • [23] Parikshit Gopalan, Raghu Meka, Omer Reingold, and David Zuckerman. Pseudorandom generators for combinatorial shapes. In Proc. 43rd annual ACM Symposium on Theory of Computing (STOC), pages 253–262, 2011. doi:10.1145/1993636.1993671.
  • [24] David G Harris. Deterministic parallel algorithms for bilinear objective functions. Algorithmica, 81:1288–1318, 2019. doi:10.1007/s00453-018-0471-0.
  • [25] David G Harris. Deterministic algorithms for the Lovász local lemma: simpler, more general, and more parallel. Random Structures & Algorithms, 63(3):716–752, 2023. doi:10.1002/rsa.21152.
  • [26] David G Harris, Ehab Morsy, Gopal Pandurangan, Peter Robinson, and Aravind Srinivasan. Efficient computation of sparse structures. Random Structures & Algorithms, 49(2):322–344, 2016. doi:10.1002/rsa.20653.
  • [27] William M Hoza, Edward Pyne, and Salil Vadhan. Pseudorandom generators for unbounded-width permutation branching programs. In Proc. 12th Innovations in Theoretical Computer Science Conference (ITCS), volume 185, pages 7:1–7:20, 2021. doi:10.4230/LIPIcs.ITCS.2021.7.
  • [28] Russell Impagliazzo, Noam Nisan, and Avi Wigderson. Pseudorandomness for network algorithms. In Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, pages 356–364, 1994. doi:10.1145/195058.195190.
  • [29] Joseph JáJá. An Introduction to Parallel Algorithms. Addison-Wesley Publishing Company, 1992.
  • [30] Anatole Joffe. On a set of almost deterministic k-independent random variables. the Annals of Probability, 2(1):161–162, 1974.
  • [31] David R Karger and Daphne Koller. (De)randomized construction of small sample spaces in NC. Journal of Computer and System Sciences, 55(3):402–413, 1997. doi:10.1006/jcss.1997.1532.
  • [32] Howard Karloff and Yishay Mansour. On construction of k-wise independent random variables. In Proc. 26th annual ACM Symposium on Theory of Computing (STOC), pages 564–573, 1994. doi:10.1007/BF01196134.
  • [33] Zander Kelley and Raghu Meka. Random restrictions and PRGs for PTFs in Gaussian space. In Proc. 37th Computational Complexity Conference (CCC), 2022. doi:10.4230/LIPIcs.CCC.2022.21.
  • [34] Subhash Khot, Guy Kindler, Elchanan Mossel, and Ryan O’Donnell. Optimal inapproximability results for MAX-CUT and other 2-variable CSPs? SIAM Journal on Computing, 37(1):319–357, 2007. doi:10.1137/S0097539705447372.
  • [35] Michael Luby. A simple parallel algorithm for the maximal independent set problem. In Proc. 17th annual ACM Symposium on Theory of Computing (STOC), pages 1–10, 1985. doi:10.1145/22145.22146.
  • [36] Sanjeev Mahajan, Edgar A Ramos, and KV Subrahmanyam. Solving some discrepancy problems in NC. Algorithmica, 29(3):371–395, 2001. doi:10.1007/s004530010046.
  • [37] Raghu Meka, Omer Reingold, and Avishay Tal. Pseudorandom generators for width-3 branching programs. In Proc. 51st annual ACM Symposium on Theory of Computing (STOC), pages 626–637, 2019. doi:10.1145/3313276.3316319.
  • [38] Joseph Naor and Moni Naor. Small-bias probability spaces: Efficient constructions and applications. In Proc. 22nd annual ACM Symposium on Theory of Computing (STOC), pages 213–223, 1990. doi:10.1145/100216.100244.
  • [39] Noam Nisan. Pseudorandom generators for space-bounded computations. In Proc. 22nd annual ACM Symposium on Theory of Computing (STOC), pages 204–212, 1990. doi:10.1145/100216.100242.
  • [40] Noam Nisan. RL ⊆ SC. In Proc. 24th annual ACM Symposium on Theory of Computing (STOC), pages 619–623, 1992. doi:10.1145/129712.129772.
  • [41] Ryan O’Donnell, Rocco A Servedio, and Li-Yang Tan. Fooling Gaussian PTFs via local hyperconcentration. In Proc. 52nd annual ACM Symposium on Theory of Computing (STOC), pages 1170–1183, 2020. doi:10.1145/3357713.3384281.
  • [42] Jeanette P Schmidt, Alan Siegel, and Aravind Srinivasan. Chernoff–Hoeffding bounds for applications with limited independence. SIAM Journal on Discrete Mathematics, 8(2):223–250, 1995. doi:10.1137/S089548019223872X.
  • [43] D Sivakumar. Algorithmic derandomization via complexity theory. In Proc. 34th annual ACM Symposium on Theory of Computing (STOC), pages 619–626, 2002. doi:10.1145/509907.509996.
  • [44] Aravind Srinivasan. New approaches to covering and packing problems. In Proc. 12th annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 567–576, 2001. URL: http://dl.acm.org/citation.cfm?id=365411.365535.