A Polynomial-Time Algorithm for 1/3-Approximate Nash Equilibria in Bimatrix Games

Since the celebrated PPAD-completeness result for Nash equilibria in bimatrix games, a long line of research has focused on polynomial-time algorithms that compute $\varepsilon$-approximate Nash equilibria. Determining the best approximation guarantee achievable in polynomial time has been a fundamental and non-trivial pursuit toward settling the complexity of approximate equilibria. Despite a significant amount of effort, the algorithm of Tsaknakis and Spirakis, with an approximation guarantee of $(0.3393+\delta)$, has remained the state of the art over the last 15 years. In this paper, we propose a new refinement of the Tsaknakis-Spirakis algorithm, resulting in a polynomial-time algorithm that computes a $(\frac{1}{3}+\delta)$-Nash equilibrium, for any constant $\delta>0$. The main idea of our approach is to go beyond the use of convex combinations of primal and dual strategies, as defined in the optimization framework of Tsaknakis and Spirakis, and enrich the pool of strategies from which we build the strategy profiles that we output in certain bottleneck cases of the algorithm.


Introduction
The notion of Nash equilibrium has undoubtedly been a fundamental solution concept in strategic games, ever since the seminal result of Nash [33] on the existence of equilibria for all finite games. Nash's theorem, however, is only existential; it shows that such an equilibrium always exists, but it does not provide an efficient algorithm to find one. In fact, many years after the work of Nash, in a series of breakthrough results, it was proven that computing a Nash equilibrium is PPAD-complete [16], even for bimatrix games [10], which provides strong evidence that computing an equilibrium is an intractable problem. These negative results have naturally led to the study of approximate Nash equilibria. In an ε-approximate Nash equilibrium (ε-NE), no player can increase her payoff by more than ε through a unilateral change of her strategy. In contrast to exact Nash equilibria, the relaxation to ε-NE does admit subexponential algorithms. More precisely, the quasi-polynomial-time approximation scheme (QPTAS) of [27] can find an ε-NE in time $n^{O(\log n / \varepsilon^2)}$, for a game with n available pure strategies per player. One can then wonder whether the QPTAS could be improved to a PTAS or even an FPTAS. Unfortunately, this does not seem to be the case, as the result of Chen, Deng, and Teng [10] already ruled out the existence of an FPTAS, unless PPAD = P. Some years later, in another breakthrough result, Rubinstein [35] showed that, assuming the exponential-time hypothesis for PPAD, there exists a very small, yet unspecified, constant ε* such that finding an ε-NE requires quasi-polynomial time for every constant ε < ε*. This would rule out a PTAS too.
Although it seems unlikely that a polynomial-time algorithm exists for every ε > 0, it is still important to identify the best constant ε for which we can have an efficient algorithm. In fact, this has been one of the fundamental questions of algorithmic game theory, and it is still unresolved. Soon after the initial PPAD-hardness results of [10,16], there was a flurry of works along this direction. Kontogiannis, Panagopoulou, and Spirakis [23] derived a polynomial-time algorithm for ε = 3/4; Daskalakis, Mehta, and Papadimitriou [17,18] improved it to ε = 1/2 and ε ≈ 0.382; Bosse, Byrka, and Markakis [7] achieved ε = 0.364; and finally Tsaknakis and Spirakis [36] attained a bound of ε = 0.3393 + δ, for any constant δ > 0. Ever since this last work, however, progress on this front has stalled, and the result of Tsaknakis and Spirakis (referred to as the TS algorithm from now on) has remained the state of the art over the last 15 years. It is particularly puzzling that, so far, it has remained an open problem to improve the approximation even to 1/3 + δ (although it has been conjectured that such an approximation should be feasible). To make things worse, in the very recent work of [12], it was shown that the TS algorithm and its analysis are tight.
In order to beat the 0.3393-guarantee of the TS algorithm, it is instructive to first understand its bottleneck cases. At a high level, we can think of the algorithm as consisting of two phases: the Descent phase and the Strategy-construction phase. In the Descent phase, it performs "gradient descent" on the maximum regret among the two players, i.e., the maximum additional gain that a player can have by a unilateral deviation to another strategy. This process terminates at an approximate "stationary" point, i.e., a strategy profile such that no local change decreases the value of the maximum regret. When we reach a δ-stationary point for some small constant δ, the Strategy-construction phase begins. This phase performs a case analysis, based on certain relevant parameters of the game, and decides which strategy profile to output in each of the five cases that arise.
In doing so, the algorithm has at its disposal the δ-stationary profile, along with a "dual" strategy profile (produced by solving the dual of the linear program used in the Descent phase). A close inspection reveals that one of these two profiles suffices to guarantee a $(\frac{1}{3}+\delta)$-NE in three out of the five cases. In the remaining two cases, the algorithm outputs a convex combination of the stationary and the dual strategies, and this is where the bottleneck occurs, causing the algorithm to output a (0.3393 + δ)-NE.

Our contribution
We improve upon the state of the art and provide a polynomial-time algorithm for computing a $(\frac{1}{3}+\delta)$-NE in bimatrix games, for any constant δ > 0. More specifically, we suitably modify the TS algorithm by designing an improved Strategy-construction phase that handles the problematic cases of TS. Our main insights in doing so are as follows. Apart from convex combinations between primal (stationary) and dual strategies, we also consider best-response strategies to such convex combinations. Hence, we enrich the pool of strategies out of which we choose the profile to output in each case. As a result, in the cases where the δ-stationary point or the dual profile (or their combinations) do not have the desired guarantee, we have one of the players use a carefully chosen convex combination between our newly defined strategies and her dual strategy. Moreover, we produce a more refined case analysis, based on the values of some new auxiliary parameters (e.g., the quantities $v_r$, $t_r$, and $\hat{\mu}$, defined in Section 4). These parameters encode payoff differences or regrets of the players for using specific strategies, and they help us in two ways. First, they are used to obtain improved upper bounds on the maximum regret of the δ-stationary profile (Section 4.1). Second, their values greatly help us in decomposing our analysis into convenient subcases in order to establish the approximation guarantee.

Further related work
A different notion of approximation of NE is that of ε-well-supported NE (ε-WSNE). In an ε-WSNE, every player is required to place positive probability only on actions that are within ε of being best responses. Hence, ε-WSNE are more constrained than ε-NE, where the players may place positive probability on any strategy. After a series of papers on the topic [24,21], the currently best approximation is ε = 0.6528, due to [14]. Another line of research has focused on more structured classes of bimatrix games, such as: constant-rank games, where the matrix defined by the sum of the two payoff matrices has constant rank [1,22,31]; win-lose games, where the payoff for every pure action is either 0 or 1 [11,13,28]; sparse games, where there are only "a few" outcomes that yield a non-zero payoff for each player [9]; imitation games, where the payoff matrix for one of the players is the identity matrix [29,30,32]; random games, where the payoff entries are drawn from certain distributions [4,34]; and symmetric games, where the payoff matrix of one player is the transpose of the other [15,25]. In most of these classes, it has been possible to obtain improved approximation guarantees and a better understanding of how to construct approximate equilibria.
Concerning quasi-polynomial algorithms, in addition to the QPTAS of [27], three new QPTASs have been obtained, each containing the original result of [27] as a special case: [5] gave a refined, parameterized approximation scheme; [3] gave a QPTAS that applies to multi-player games as well; and [19] gave a more general approach for approximation schemes for the existential theory of the reals. More recently, further negative results for ε-NE were derived: [26] gave an unconditional lower bound based on the sum-of-squares hierarchy; [6] proved PPAD-hardness in the smoothed analysis setting; and [8,20,2] gave quasi-polynomial-time lower bounds for constrained ε-NE, under the exponential-time hypothesis.

Preliminaries
In what follows, let [n] := {1, 2, . . . , n}, and let $\Delta_n$ denote the (n − 1)-dimensional simplex. We focus on n × n bimatrix games, where n denotes the number of available pure strategies per player. Such games are defined by a pair $(R, C) \in [0, 1]^{n \times n}$ of two matrices: R and C are the payoff matrices of the row player and the column player, respectively. We follow the usual assumption in the relevant literature that the matrices are normalized, so that all entries are in [0, 1]. It is also assumed, without loss of generality, that both players have the same number of pure strategies, since otherwise one can add dummy strategies to equalize the number of rows and columns. The semantics of the payoff matrices are that when the row player picks a row i ∈ [n] and the column player picks a column j ∈ [n], they receive payoffs of $R_{ij}$ and $C_{ij}$, respectively.

A mixed strategy is a probability distribution over [n]. We use $x \in \Delta_n$ to denote a mixed strategy for the row player and $x_i$ to denote the probability the player assigns to the pure strategy i. For the column player, we use $y \in \Delta_n$ and $y_i$, respectively. If x and y are mixed strategies for the row and the column player, then we call (x, y) a (mixed) strategy profile. It is often also convenient to represent pure strategies as vectors. Hence, we will use the vector $e_i$, which has 1 at index i and zero elsewhere, to denote the i-th pure strategy, in other words, the distribution where a player assigns probability one to the pure strategy i.
Given a strategy profile (x, y), the expected payoff of the row player is $R(x, y) := x^T R y$, and the expected payoff of the column player is $C(x, y) := x^T C y$. Thus, for a pure strategy $e_i$, the term $R(e_i, y) := \sum_j R_{ij} y_j$ denotes the expected payoff of the row player when she plays the pure strategy i against strategy y of the column player. Similarly, $C(x, e_j)$ is the expected payoff of the column player when she plays the pure strategy j against x. We say that a pure strategy is a best-response strategy for a player if it maximizes her expected payoff against a chosen strategy of her opponent. So, under a strategy profile (x, y), the set of pure best responses for the row player is $B_r(y) := \{i \in [n] : R(e_i, y) \ge R(e_k, y) \text{ for all } k \in [n]\}$, and analogously $B_c(x)$ denotes the set of pure best responses for the column player against x. The regret of the row player at a profile (x, y) is $\mathrm{reg}_r(x, y) = \max_i R(e_i, y) - R(x, y)$, and the regret of the column player is $\mathrm{reg}_c(x, y) = \max_j C(x, e_j) - C(x, y)$. The strategy profile (x, y) is an ε-Nash equilibrium, or ε-NE, if the regret of both players is bounded by ε ∈ [0, 1]; formally, $\max\{\mathrm{reg}_r(x, y), \mathrm{reg}_c(x, y)\} \le \varepsilon$. If ε = 0, then the strategy profile (x, y) is an exact Nash equilibrium.
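To make these definitions concrete, the regrets at a given profile can be computed directly from the payoff matrices. The following is our own minimal sketch (not code from any of the cited works), assuming NumPy arrays R, C of shape (n, n) and mixed strategies x, y of length n:

```python
import numpy as np

def regrets(R, C, x, y):
    """Regrets of the row and the column player at the profile (x, y)."""
    reg_r = np.max(R @ y) - x @ R @ y    # max_i R(e_i, y) - R(x, y)
    reg_c = np.max(C.T @ x) - x @ C @ y  # max_j C(x, e_j) - C(x, y)
    return reg_r, reg_c

def is_eps_ne(R, C, x, y, eps):
    """(x, y) is an eps-NE iff neither player's regret exceeds eps."""
    return max(regrets(R, C, x, y)) <= eps
```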

The Tsaknakis-Spirakis algorithm
In this section we give a description of the algorithm of [36] and we highlight the bottleneck cases, where it fails to provide a $(\frac{1}{3}+\delta)$-approximation. In order to have a self-contained exposition, we also present some of the lemmas that are used in the analysis of [36], which are needed for our work as well.
The core of the algorithm is to consider the function $g(x, y) = \max\{\mathrm{reg}_r(x, y), \mathrm{reg}_c(x, y)\}$, i.e., the maximum regret among the two players. Clearly, if we arrive at a profile (x, y) such that g(x, y) ≤ ε, then (x, y) is an ε-Nash equilibrium. At a high level, one can think of TS as consisting of two phases: the Descent phase, and the Strategy-construction phase.
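In terms of the sketch from the preliminaries, g is simply the larger of the two regrets:

```python
def g(R, C, x, y):
    """The max-regret function minimized in the Descent phase;
    (x, y) is an eps-NE exactly when g(x, y) <= eps."""
    return max(regrets(R, C, x, y))
```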
Descent Phase. During this phase, TS performs "gradient descent" on the function g(x, y), until it reaches a "stationary" point, i.e., a strategy profile such that any local change does not decrease the value of g. More concretely, every iteration of the Descent phase performs a series of steps: given the current profile under consideration, it equalizes the regrets of the players, then it solves an appropriate linear program to identify a feasible direction, and finally depending on the solution of the LP, it either updates the strategy profile, or it decides that it has reached an approximate stationary point.
The first step runs the RegretEqualization procedure described below. This procedure is based on solving a single linear program to equalize the regrets of the two players, and most importantly, it guarantees that the maximum regret does not increase.
RegretEqualization(x_0, y_0)
Input: A strategy profile $(x_0, y_0)$.
Output: A strategy profile (x, y) such that $\mathrm{reg}_r(x, y) = \mathrm{reg}_c(x, y) \le g(x_0, y_0)$.
1. If $\mathrm{reg}_r(x_0, y_0) \ge \mathrm{reg}_c(x_0, y_0)$, keep $y_0$ fixed and solve the following linear program: minimize $\mathrm{reg}_r(x, y_0)$ subject to $\mathrm{reg}_r(x, y_0) \ge \mathrm{reg}_c(x, y_0)$ and $x \in \Delta_n$. Return $(x, y_0)$, where x is an optimal solution of the linear program.
2. If $\mathrm{reg}_r(x_0, y_0) < \mathrm{reg}_c(x_0, y_0)$, keep $x_0$ fixed and solve the following linear program: minimize $\mathrm{reg}_c(x_0, y)$ subject to $\mathrm{reg}_c(x_0, y) \ge \mathrm{reg}_r(x_0, y)$ and $y \in \Delta_n$. Return $(x_0, y)$, where y is an optimal solution of the linear program.
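For illustration, step 1 can be written down with an off-the-shelf LP solver. Note that, with $y_0$ fixed, $\max_i R(e_i, y_0)$ is a constant, so minimizing $\mathrm{reg}_r(x, y_0)$ amounts to maximizing $x^T R y_0$, and the constraint $\mathrm{reg}_r \ge \mathrm{reg}_c$ becomes one linear inequality per column j. This is our own sketch of that reformulation, not the authors' implementation (step 2 is symmetric):

```python
import numpy as np
from scipy.optimize import linprog

def equalize_regrets_step1(R, C, y0):
    """Solve: maximize x^T R y0  subject to, for every column j,
    x^T C e_j - x^T C y0 + x^T R y0 <= max_i R(e_i, y0), and x in the simplex."""
    n = R.shape[0]
    br_row = np.max(R @ y0)  # best-response payoff of the row player against y0
    # Row j of A_ub encodes the regret constraint for column j.
    A_ub = (C - (C @ y0)[:, None] + (R @ y0)[:, None]).T
    res = linprog(-(R @ y0), A_ub=A_ub, b_ub=np.full(n, br_row),
                  A_eq=np.ones((1, n)), b_eq=[1.0], bounds=(0, 1))
    return res.x, y0
```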
Given the output (x, y) of RegretEqualization, the next step is to either find a feasible direction to follow so as to decrease the maximum regret, or to decide that (x, y) is an approximate stationary point. This is enforced by solving a linear program, denoted Primal(x, y) (we refer to [36] for its exact formulation).
It is proved in [36] that the solution of Primal(x, y) guarantees one of the following: 1. it either identifies a strategy profile (x′, y′) such that the maximum regret strictly decreases by a constant fraction if we move from (x, y) towards (x′, y′); 2. or it decides that (x, y) is a δ-stationary point, which is the termination criterion of the descent. Putting everything together, the Descent phase of the TS algorithm starts from some arbitrary initial strategy profile and iterates the above steps; [36] shows that it reaches a δ-stationary point in polynomial time.
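The control flow of the Descent phase can thus be sketched as the loop below. Since the Primal(x, y) LP is not reproduced here, it is passed in as a black box, and the fixed step size is merely a placeholder for the step choice analyzed in [36]; this is an illustration of the structure only:

```python
def descent_phase(R, C, x, y, primal_lp, step=0.5):
    """Sketch of the Descent phase. `primal_lp` is a stand-in for
    Primal(x, y) of [36]: it returns a candidate profile (x', y') and a
    flag indicating whether (x, y) is already delta-stationary."""
    while True:
        x, y = equalize_regrets_step1(R, C, y)  # illustrative; TS equalizes both ways
        (xp, yp), stationary = primal_lp(R, C, x, y)
        if stationary:
            return x, y  # delta-stationary point: hand over to strategy construction
        # Otherwise, moving towards (x', y') decreases the max regret
        # by a constant fraction per iteration.
        x = (1 - step) * x + step * xp
        y = (1 - step) * y + step * yp
```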

Strategy-construction Phase. In this phase, the algorithm utilizes the dual linear program of Primal(x, y), in order to identify some alternative candidate strategies for the players.
Dual Linear Program: Dual(x, y). Its variables are $p_i \ge 0$ for $i \in B_r(y)$ and $q_j \ge 0$ for $j \in B_c(x)$, and we denote $P = \sum_{i \in B_r(y)} p_i$ and $Q = \sum_{j \in B_c(x)} q_j$ (we refer to [36] for the full formulation). Given the δ-stationary profile $(x_s, y_s)$ from the Descent phase, the algorithm solves Dual($x_s$, $y_s$) and computes the following from the optimal dual variables.
The dual strategy w for the row player, where $w_i = p_i / \sum_{j \in B_r(y_s)} p_j$ for $i \in B_r(y_s)$, and $w_i = 0$ elsewhere; note that by construction, w is a best-response strategy against $y_s$. The dual strategy z for the column player, where $z_i = q_i / \sum_{j \in B_c(x_s)} q_j$ for $i \in B_c(x_s)$, and $z_i = 0$ elsewhere; by construction, z is a best-response strategy against $x_s$. The dual variables $P, Q \in [0, 1]$, which are useful for the approximation analysis. In addition, we define the following two quantities λ and µ, which help in parameterizing the maximum regret bound. These quantities are equal to the payoff difference of a player between the dual and the primal strategies, when the other player uses her dual strategy: $\lambda := R(w, z) - R(x_s, z)$ and $\mu := C(w, z) - C(w, y_s)$. Fact. Obviously, λ ≤ 1 and µ ≤ 1, and furthermore, R(w, z) ≥ λ and C(w, z) ≥ µ. The algorithm then constructs and outputs a strategy profile via a five-case analysis based on the values of λ and µ.
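Given an optimal dual solution, this post-processing is straightforward to express; the sketch below is ours, with λ and µ computed according to the verbal definitions above (p and q are assumed to be arrays of length n supported on the best-response sets Br and Bc):

```python
def dual_post_processing(R, C, xs, ys, p, q, Br, Bc):
    """Build the dual strategies w, z from the optimal dual variables
    and compute the parameters lambda and mu."""
    n = R.shape[0]
    w = np.zeros(n); w[Br] = p[Br] / p[Br].sum()  # best response against y_s
    z = np.zeros(n); z[Bc] = q[Bc] / q[Bc].sum()  # best response against x_s
    lam = w @ R @ z - xs @ R @ z  # row player: dual vs. primal payoff, against z
    mu = w @ C @ z - w @ C @ ys   # column player: dual vs. primal payoff, against w
    return w, z, lam, mu
```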
We present below some important lemmas from [36] that are needed in our analysis too.
The first, and most important, lemma below shows how Primal(x s , y s ) and Dual(x s , y s ) can be used to bound the value of the maximum regret, g(x s , y s ).
▶ Lemma 4 (implied by [36]). Let $(x_s, y_s)$ be a δ-stationary point produced by the Descent phase, for a constant δ > 0. Let also w, z and P be derived from an optimal solution to Dual($x_s$, $y_s$), as described above. Then, for any strategy profile (x′, y′), it holds that
$$g(x_s, y_s) \le P \cdot \big(R(w, y') - R(x', y')\big) + (1 - P) \cdot \big(C(x', z) - C(x', y')\big) + \delta.$$
Lemma 4 plays a crucial role as it allows us to bound $g(x_s, y_s)$ in terms of λ, µ, and P, by making appropriate choices for x′ and y′. This is used both in the following lemma and in Lemma 11 of Section 4.
▶ Lemma 5 ([36]). Let $(x_s, y_s)$ be a δ-stationary point produced by the Descent phase, for a constant δ > 0, and let P be obtained from an optimal solution of Dual($x_s$, $y_s$). It holds that
$$g(x_s, y_s) \le \min\{P \cdot \lambda,\ (1 - P) \cdot \mu\} + \delta \le \frac{\lambda \cdot \mu}{\lambda + \mu} + \delta.$$
One may worry that the bound $\frac{\lambda \cdot \mu}{\lambda + \mu}$ is not well-defined when λ + µ = 0. However, as we explain below, this is not a concern.
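To see how the first inequality of Lemma 5 follows from Lemma 4, one can instantiate (x′, y′) appropriately; the following worked derivation (our own, using the definitions of λ and µ above) also explains the second inequality:

```latex
% Choosing (x', y') = (x_s, z) in Lemma 4:
g(x_s, y_s) \le P \cdot \big(R(w, z) - R(x_s, z)\big)
              + (1 - P) \cdot \big(C(x_s, z) - C(x_s, z)\big) + \delta
            = P \cdot \lambda + \delta.
% Choosing (x', y') = (w, y_s) in Lemma 4:
g(x_s, y_s) \le P \cdot \big(R(w, y_s) - R(w, y_s)\big)
              + (1 - P) \cdot \big(C(w, z) - C(w, y_s)\big) + \delta
            = (1 - P) \cdot \mu + \delta.
% Over P in [0, 1], \min\{P\lambda, (1-P)\mu\} is largest at P = \mu/(\lambda+\mu),
% where both terms equal \lambda\mu/(\lambda+\mu); hence the second inequality.
```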
The definitions of λ and µ, along with Lemma 5, can immediately be used to prove that Cases 1–3 of the Strategy-construction phase return a $(\frac{1}{3}+\delta)$-Nash equilibrium (Lemma 7); the proof considers each of these cases independently. Hence, the bottleneck of the TS algorithm comes from Cases 4 and 5. In fact, it was also recently shown in [12] that the analysis of these cases in [36] is tight, and therefore one needs to come up with a different construction in order to obtain an improvement.

Improved Strategy-construction Phase
In this section we replace Cases 4 and 5 of the original TS algorithm in order to bypass the bottleneck in the approximation. To do so, we utilize the δ-stationary point $(x_s, y_s)$, the dual strategies w, z, their convex combinations, and best-response strategies to such combinations. We then perform a more refined analysis and prove that in every case we can efficiently construct a tailored strategy profile that is a $(\frac{1}{3}+\delta)$-Nash equilibrium. We note that all missing proofs from this section can be found in the full version of our work.
Our new Strategy-construction phase works as follows.
Note that Cases 1–3 are identical to the Strategy-construction phase of the TS algorithm. Thus, by Lemma 7 they return a $(\frac{1}{3}+\delta)$-Nash equilibrium. The new part concerns Cases 4 and 5. The analysis in both cases is based on certain auxiliary parameters ($v_r$, $t_r$, and $\hat{\mu}$ for Case 4, and analogously for Case 5), which we define in the statement of the algorithm. These parameters encode payoff differences or regrets of the players for using specific strategies, and they help us decompose the problem into convenient subcases, so as to obtain better upper bounds on the maximum regret.
To prove the theorem, it suffices to analyze Case 4, where $\frac{1}{2} < \lambda \le \frac{2}{3} < \mu$, since Case 5 is symmetric to Case 4 and is analyzed in exactly the same way.
Intuition and Roadmap. The overall analysis in the sequel looks rather technical; therefore, we first provide some elaboration on the choices that the algorithm makes in Case 4. The first crucial component in the design of the new algorithm is that the upper bounds on the regret of the δ-stationary point $(x_s, y_s)$, obtained in Lemma 5, can be further refined based on the values of the parameters λ, µ, $\hat{\mu}$, $v_r$. This is precisely implemented in Section 4.1 with Lemmas 11, 12, and 13. Once this is done, we then try to answer the following question: whenever $(x_s, y_s)$ does not provide a $(\frac{1}{3}+\delta)$-approximation, which profiles can form alternative candidates for a better performance? One idea is to exploit the dual strategies w and z, as was also done in [36]. However, the profile (w, z) may not be a $(\frac{1}{3}+\delta)$-equilibrium either (in most cases). A next attempt then is to consider appropriate convex combinations of the primal and the dual strategy for each player, i.e., a combination of $x_s$ and w for the row player and of $y_s$ and z for the column player. Unfortunately, this again does not work in all cases. But one further step is to also take into consideration best-response strategies against such convex combinations. E.g., the strategy $\hat{w}$ defined in Case 4 is a best response to the equiprobable combination of $y_s$ and z. This completes our weaponry, and at the end, in all subcases of Case 4, we consider profiles where the row player uses a convex combination of w and $\hat{w}$, and the column player selects a combination between her primal and dual strategies, $y_s$ and z. Analogous profiles, with the roles of the players reversed, are constructed for Case 5 too. Finally, we also know that whenever $(x_s, y_s)$ does not attain a $(\frac{1}{3}+\delta)$-approximation, this restricts the relation between the parameters λ, µ, $\hat{\mu}$, and $v_r$, due to the lemmas of Section 4.1. This is exploitable for us in the sense that it allows us to construct the exact coefficients for the convex combinations that we use so as to achieve the desired approximation.
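In code, the pool of Case-4 candidate profiles described above looks roughly as follows (our own sketch; the mixing coefficients p and q come out of the subcase analysis, e.g., the choice of p in Case 4.1 below):

```python
def case4_profile(R, C, xs, ys, w, z, p, q):
    """Candidate profile for Case 4: the row player mixes the dual strategy w
    with w_hat, a best response to the 50/50 mix of y_s and z; the column
    player mixes her primal and dual strategies y_s and z."""
    n = R.shape[0]
    w_hat = np.eye(n)[np.argmax(R @ (0.5 * (ys + z)))]  # best response to the mix
    return p * w + (1 - p) * w_hat, q * ys + (1 - q) * z
```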
To proceed, we start with two helpful observations, which are used repeatedly for the analysis of Cases 4.1 and 4.2.
Proof. By the previous lemma we have $R(\hat{w}, z) \ge \lambda + v_r + t_r$.

Bounding the regret of δ-stationary points
In this subsection, we establish three important lemmas that provide different ways of bounding the maximum regret of any δ-stationary point. The first of these lemmas is an improvement over [36], as we add a third upper bound for the δ-stationary point, in addition to the two bounds stated in Lemma 5 of Section 3.
▶ Lemma 11. Let $(x_s, y_s)$ be a δ-stationary point with δ ≥ 0, and let P be obtained from an optimal solution of Dual($x_s$, $y_s$), as the sum of the dual variables: $P = \sum_{i \in B_r(y_s)} p_i$. It holds that
$$g(x_s, y_s) \le \min\{P \cdot \lambda,\ (1 - P) \cdot \mu,\ P \cdot v_r + (1 - P) \cdot \hat{\mu}\} + \delta.$$
Proof. By Lemma 5 it holds that $g(x_s, y_s) \le \min\{P \cdot \lambda, (1 - P) \cdot \mu\} + \delta$. So, it suffices to prove that $g(x_s, y_s) \le P \cdot v_r + (1 - P) \cdot \hat{\mu} + \delta$. This follows from Lemma 4 when we set $(x', y') = (\hat{w}, y_s)$. Indeed, in this case we have
$$g(x_s, y_s) \le P \cdot \big(R(w, y_s) - R(\hat{w}, y_s)\big) + (1 - P) \cdot \big(C(\hat{w}, z) - C(\hat{w}, y_s)\big) + \delta = P \cdot v_r + (1 - P) \cdot \hat{\mu} + \delta,$$
by the definitions of $v_r$ and $\hat{\mu}$. ◀

The remaining two lemmas help in attaining a more fine-grained analysis for upper bounding the regret of the players, under the restrictions on the values of λ and µ in Case 4.
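For reference, the bound of Lemma 11 is trivial to evaluate once the parameters are known; a hypothetical helper (our own):

```python
def stationary_regret_bound(P, lam, mu, v_r, mu_hat, delta):
    """Upper bound on g(x_s, y_s) given by Lemma 11."""
    return min(P * lam, (1 - P) * mu, P * v_r + (1 - P) * mu_hat) + delta
```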

Case 4.1 of the Improved Strategy-construction Phase
We now analyze the approximation we obtain when we fall into Case 4.1 of the algorithm. We establish that either the δ-stationary point has the desired approximation or, otherwise, this is achieved by having the row player use an appropriate convex combination of w and $\hat{w}$, while the column player plays the dual strategy z.

▶ Lemma. In Case 4.1, consider the strategy profile $(p \cdot w + (1 - p) \cdot \hat{w},\ z)$, where $p = 1 - \frac{\mu - \lambda}{2 \cdot (v_r + t_r)}$. Then the payoff of both the row and the column player is at least $\frac{\lambda + \mu}{2}$.
Proof. Note first that, under the assumptions of the lemma, and since µ > λ, the parameter p is a valid probability. For the row player, we have that her payoff is
$$R(p \cdot w + (1 - p) \cdot \hat{w}, z) = p \cdot R(w, z) + (1 - p) \cdot R(\hat{w}, z) \ge p \cdot \lambda + (1 - p) \cdot \lambda + (1 - p) \cdot (v_r + t_r)$$
(from Lemma 9), which equals $\lambda + (1 - p) \cdot (v_r + t_r) = \lambda + \frac{\mu - \lambda}{2} = \frac{\lambda + \mu}{2}$, by the choice of p. The analogous argument for the column player can be found in the full version.

Our result has some extra positive consequences for games with more than two players. In [7] it was shown that if we have an algorithm that finds an α-Nash equilibrium in a (k − 1)-player game, then in polynomial time we can compute a $\frac{1}{2 - \alpha}$-NE for any k-player game. Thus, our algorithm improves the state of the art for k-player normal-form games, for any k > 2. Namely, we get a (0.6 + δ)-NE for three-player games, a (5/7 + δ)-NE for four-player games, and so on.
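As a quick check of these lifted guarantees, the reduction of [7] can be iterated numerically, starting from our two-player bound α = 1/3 and ignoring the additive δ (our own arithmetic):

```python
alpha = 1 / 3                # the two-player guarantee obtained in this paper
for k in range(3, 7):
    alpha = 1 / (2 - alpha)  # the reduction of [7]: (k-1)-player -> k-player
    print(k, alpha)          # 3 -> 0.6, 4 -> 0.7142..., 5 -> 0.7777..., 6 -> 0.8181...
```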