
Extracting Dual Solutions via Primal Optimizers

Yair Carmon (Tel Aviv University, Israel), Arun Jambulapati (University of Michigan, Ann Arbor, MI, USA), Liam O’Carroll (Stanford University, CA, USA), Aaron Sidford (Stanford University, CA, USA)
Abstract

We provide a general method to convert a “primal” black-box algorithm for solving regularized convex-concave minimax optimization problems into an algorithm for solving the associated dual maximin optimization problem. Our method adds recursive regularization over a logarithmic number of rounds where each round consists of an approximate regularized primal optimization followed by the computation of a dual best response. We apply this result to obtain new state-of-the-art runtimes for solving matrix games in specific parameter regimes, obtain improved query complexity for solving the dual of the CVaR distributionally robust optimization (DRO) problem, and recover the optimal query complexity for finding a stationary point of a convex function.

Keywords and phrases:
Minimax optimization, black-box optimization, matrix games, distributionally robust optimization
Funding:
Yair Carmon: YC acknowledges support from the Israeli Science Foundation (ISF) grant no. 2486/21, and the Alon Fellowship.
Liam O’Carroll: LO acknowledges support from NSF Grant CCF-1955039.
Aaron Sidford: AS acknowledges support from a Microsoft Research Faculty Fellowship, NSF CAREER Grant CCF-1844855, NSF Grant CCF-1955039, and a PayPal research award.
Copyright and License:
© Yair Carmon, Arun Jambulapati, Liam O’Carroll, and Aaron Sidford; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Mathematical optimization
Related Version:
Full Version: http://arxiv.org/abs/2412.02949
Editor:
Raghu Meka

1 Introduction

We consider the foundational problem of efficiently solving convex-concave games. For nonempty, closed, convex constraint sets 𝒳 ⊆ ℝ^d and 𝒴 ⊆ ℝ^n and a differentiable convex-concave objective function ψ: ℝ^d × ℝ^n → ℝ (namely, ψ(·,y) is convex for any fixed y and ψ(x,·) is concave for any fixed x), we consider the following primal, minimax optimization problem (P) and its associated dual, maximin optimization problem (D):

minimize_{x∈𝒳} f(x) for f(x) ≔ max_{y∈𝒴} ψ(x,y), and (P)
maximize_{y∈𝒴} ϕ(y) for ϕ(y) ≔ min_{x∈𝒳} ψ(x,y). (D)

If additionally 𝒳 and 𝒴 are bounded (which we assume for simplicity in the introduction but generalize later), every pair of primal and dual optimizers x⋆ ∈ argmin_{x∈𝒳} f(x) and y⋆ ∈ argmax_{y∈𝒴} ϕ(y) satisfies the minimax principle: f(x⋆) = ϕ(y⋆) = ψ(x⋆, y⋆).

Convex-concave games are pervasive in algorithm design, machine learning, data analysis, and optimization. For example, the games induced by bilinear objectives, i.e., ψ(x,y) = x^⊤Ay + b^⊤x + c^⊤y, where 𝒳 and 𝒴 are either the simplex, Δ^k ≔ {x ∈ ℝ^k_{≥0} : ‖x‖₁ = 1}, or the Euclidean ball, B^k ≔ {x ∈ ℝ^k : ‖x‖₂ ≤ 1}, encompass zero-sum games, linear programming, hard-margin support vector machines (SVMs), and minimum enclosing/maximum inscribed ball [14, 2, 31, 10]. Additionally, the case when ψ(x,y) = Σ_{i=1}^n y_i f_i(x) for some functions f_i: ℝ^d → ℝ and 𝒴 a subset of the simplex encompasses a variety of distributionally robust optimization (DRO) problems [29, 5] and (for 𝒴 = Δ^n) the problem of minimizing the maximum loss [6, 8, 4].

In this paper, we study the following question:

Given only a black-box oracle which solves (regularized versions of) (P) to ϵ accuracy, and a black-box oracle for computing an exact dual best response y_x ∈ argmax_{y∈𝒴} ψ(x,y) to any primal point x ∈ 𝒳, can we extract an ϵ-optimal solution to (D)?

We develop a general dual-extraction framework which answers this question in the affirmative. We show that as long as these oracles can be implemented as cheaply as obtaining an ϵ-optimal point of (P), then our framework can obtain an ϵ-optimal point of (D) at the same cost as that of obtaining an ϵ-optimal point of (P), up to logarithmic factors. We then instantiate our framework to obtain new state-of-the-art results in the settings of bilinear matrix games and DRO. Finally, as evidence of its broader applicability, we show that our framework can be used to recover the optimal complexity for computing a stationary point of a smooth convex function.

In the remainder of the introduction we describe our results in greater detail (Section 1.1), give an overview of the dual extraction framework and its analysis (Section 1.2), discuss related work (Section 1.3), and provide a roadmap for the remainder of the paper (Section 1.4).

1.1 Our results

From primal algorithms to dual optimization

We give a general framework which obtains an ϵ-optimal solution to (D) via a sequence of calls to two black-box oracles: (i) an oracle for obtaining an ϵ-optimal point of a regularized version of (P), and (ii) an oracle for obtaining a dual best response y_x ∈ argmax_{y∈𝒴} ψ(x,y) for a given x ∈ 𝒳. In particular, we show it is always possible to obtain an ϵ-optimal point of (D) with at most a logarithmic number of calls to each of these oracles, where the regularized primal optimization oracle is always called to an accuracy of ϵ over a logarithmic factor. We also provide an alternate scheme (or, more specifically, choice of parameters) for applications where the cost of obtaining an ϵ-optimal point of the regularized primal problem decreases sufficiently as the regularization level increases. In such cases, e.g., in our stationary point application, it is possible to avoid even logarithmic factor increases in computational complexity for approximately solving (D) relative to the complexity of approximately solving (P).

Application 1: Bilinear matrix games

In this application, ψ(x,y) ≔ x^⊤Ay for a matrix A ∈ ℝ^{d×n}, 𝒴 is the simplex Δ^n, and 𝒳 is either the simplex Δ^d or the unit Euclidean ball B^d. Recently, [8] gave a new state-of-the-art runtime in certain parameter regimes of Õ(nd + n(d/ϵ)^{2/3} + d/ϵ²) for obtaining an expected ϵ-optimal point for the primal problem (P) in this setup. However, unlike previous algorithms for bilinear matrix games (see Section 1.3 for details), their algorithm does not return an ϵ-optimal solution for the dual (D), and their runtime is not symmetric in the dimensions n and d. As a result, it was unclear whether the same runtime is achievable for obtaining an ϵ-optimal solution of the dual (D). We resolve this question by applying our general framework to achieve an expected ϵ-optimal point of (D) with runtime Õ(nd + n(d/ϵ)^{2/3} + d/ϵ²). We then observe (see Corollary 22) that in the setting where 𝒳 = Δ^d, our result can equivalently be viewed as a new state-of-the-art runtime of Õ(nd + d(n/ϵ)^{2/3} + n/ϵ²) for obtaining an ϵ-optimal point of the primal (P), due to the symmetry of ψ and the constraint sets.

Application 2: CVaR at level 𝜶 DRO

In this application, ψ(x,y) ≔ Σ_{i=1}^n y_i f_i(x) for convex, bounded, and Lipschitz loss functions f_i: ℝ^d → ℝ, 𝒳 is a convex, compact set, and 𝒴 ≔ {y ∈ Δ^n : y ≤ 𝟏/(αn)} is the CVaR at level α uncertainty set for α ∈ [1/n, 1]. The primal (P) is a canonical and well-studied DRO problem, and corresponds to minimizing the average of the top α fraction of the losses. We consider this problem given access to a first-order oracle that, when queried at x ∈ ℝ^d and i ∈ [n], outputs (f_i(x), ∇f_i(x)). Ignoring dependencies other than α, the target accuracy ϵ > 0, and the number of losses n for brevity, [29] gave matching upper and lower bounds (up to logarithmic factors) of Õ(α^{−1}ϵ^{−2}) queries to obtain an expected ϵ-optimal point of the primal (P). However, the best known query complexity for obtaining an expected ϵ-optimal point of the dual (D) was Õ(nϵ^{−2}) prior to this paper (see Section 1.3 for details). Applying our general framework to this setting, we obtain an algorithm with a new state-of-the-art query complexity of Õ(α^{−1}ϵ^{−2} + n) for obtaining an expected ϵ-optimal point of the dual (D). In particular, note that this complexity is nearly linear in n when ϵ ≥ (αn)^{−1/2}.

Application 3: Obtaining stationary points of convex functions

In this application, we show that our framework yields an alternative optimal approach for computing an approximate critical point of a smooth convex function given a gradient oracle. Specifically, for γ > 0 and convex and β-smooth h: ℝ^n → ℝ, in Section 5 we give an algorithm which computes x ∈ ℝ^n such that ‖∇h(x)‖₂ ≤ γ using O(√(βΔ)/γ) gradient queries, where Δ ≔ h(x₀) − inf_{x∈ℝ^n} h(x) is the initial suboptimality. While this optimal complexity has been achieved before [24, 37, 15, 28, 27], the fact that we achieve it as a consequence of our general framework illustrates its broad applicability.

For this application, we instantiate our framework with ψ(x,y) ≔ ⟨x,y⟩ − h^*(y), where h^*: ℝ^n → ℝ denotes the convex conjugate of h. (For reasons discussed in Section 5, we actually first substitute an appropriately regularized version of h, call it f, for h before applying the framework, but the following discussion still holds with respect to f.) This objective function ψ is known as the Fenchel game and has been used in the past to recover classic convex optimization algorithms (e.g., the Frank-Wolfe algorithm and Nesterov’s accelerated methods) via a minimax framework [1, 43, 12, 23]. In the Fenchel game, a dual best response corresponds to a gradient evaluation:

argmax_{y∈ℝ^n} {⟨x,y⟩ − h^*(y)} = ∇h(x),

and we show that approximately optimal points for the dual objective (D) must have small norm. As a result, obtaining an approximately optimal dual point y as a best response to a primal point x yields a bound on the norm of y = ∇f(x). Furthermore, we note that in this setting, adding regularization to ψ with respect to an appropriate choice of distance-generating function (namely h^*) is equivalent to rescaling and recentering the primal function f, as well as the point at which a gradient is taken in the dual best response computation (cf. Lemma 14 in the full version). Thus, the properties of the Fenchel game extend naturally to appropriately regularized versions of ψ.

1.2 Overview of the framework and analysis

We now give an overview of the dual-extraction framework. Our framework applies generally under a set of assumptions given in Section 3.1 (cf. Definition 9), but for now we specialize to the assumptions given above, namely: (i) the constraint sets 𝒳 and 𝒴 are nonempty, compact, and convex; and (ii) ψ is differentiable and convex-concave. Throughout this section, let ‖·‖ denote any norm on ℝ^n and assume that the dual function ϕ is L-Lipschitz with respect to ‖·‖. (This is a weak assumption since we ensure at most a logarithmic dependence on L; see Remark 5.) Let r: ℝ^n → ℝ denote a differentiable distance-generating function (dgf) which is μ_r-strongly convex with respect to ‖·‖ for some μ_r > 0 (Section 2 gives the general setup for a distance-generating function, which also covers the case where dom r ≠ ℝ^n), and let V_u(v) ≔ r(v) − r(u) − ⟨∇r(u), v − u⟩ denote the associated Bregman divergence. For the sake of illustration, it may be helpful to consider the choices ‖·‖ ≔ ‖·‖₂, r(u) ≔ ½‖u‖₂², μ_r = 1, and V_u(v) = ½‖u − v‖₂² in the following, in which case relative strong convexity with respect to r is equivalent to the standard notion of strong convexity with respect to ‖·‖₂.

How should we obtain an ϵ-optimal point for (D) using the two oracles discussed previously, namely: (i) an oracle for approximately solving a regularized primal objective, and (ii) an oracle for computing a dual best response? We call (i) a dual-regularized primal optimization (DRPO) oracle and (ii) a dual-regularized best response (DRBR) oracle; their formal definitions are given in Section 3.1. Note that to solve (D), one cannot simply solve the primal problem (P) to high accuracy and then compute a dual best response. Consider ψ(x,y) = xy with 𝒳 = 𝒴 = [−1, 1]; clearly x⋆ = y⋆ = 0, but for any x arbitrarily close to x⋆, the dual best response is either −1 or 1.
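To make this failure mode concrete, here is a minimal numerical sketch of the example (the helper `best_response` is ours, not from the paper):

```python
import numpy as np

def best_response(x, ys=np.linspace(-1.0, 1.0, 2001)):
    """Exact dual best response for psi(x, y) = x * y over y in [-1, 1]."""
    return ys[np.argmax(x * ys)]

# x = +/-1e-9 is essentially primal-optimal (x_star = 0), yet the best
# response jumps to a boundary point, far from y_star = 0.
for x in [1e-9, -1e-9]:
    print(f"x = {x:+.0e} -> best response y = {best_response(x):+.1f}")
# x = +1e-09 -> best response y = +1.0
# x = -1e-09 -> best response y = -1.0
```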

The key observation underlying our framework is that if ψ(x,·) is strongly concave for a given x ∈ 𝒳, it is possible to upper bound the distance between the best response y_x ≔ argmax_{y∈𝒴} ψ(x,y) and the dual optimum y⋆ in terms of the primal suboptimality of x. Figure 1 illustrates why this should be the case when we subtract a quadratic regularizer in y (so that ψ(x,·) is strongly concave) from the preceding example of ψ(x,y) = xy. We generalize this intuition in the following lemma (replacing strong concavity with relative strong concavity and a distance bound with a divergence bound), which is itself generalized further and proven in Section 3:

Figure 1 (panels (a) and (b)): An example to give intuition behind Lemma 1. Here, ψ(x,y) = xy − 0.8y², (x⋆, y⋆) = (0, 0), x = 0.8, and y_x = 0.5. To see why it is possible to bound |y⋆ − y_x| in terms of the primal suboptimality f(x) − f(x⋆), note that by the strong concavity of ψ(x,·) and the fact that y_x is the maximizer of ψ(x,·) over 𝒴, we can upper bound |y⋆ − y_x| in terms of ψ(x, y_x) − ψ(x, y⋆) (the vertical drop over the green line) via a standard strong-concavity inequality. In turn, ψ(x, y_x) − ψ(x, y⋆) can be upper bounded by ψ(x, y_x) − ψ(x⋆, y⋆) = f(x) − f(x⋆) (the vertical drop over the green line plus the vertical drop over the red line) due to the fact that ψ(x⋆, y⋆) ≤ ψ(x, y⋆) by the optimality of x⋆.
Lemma 1 (Lemma 3 from the full version specialized).

For a given x ∈ 𝒳, suppose ψ(x,·) is μ-strongly concave relative to the dgf r for some μ > 0. Then y_x ≔ argmax_{y∈𝒴} ψ(x,y) satisfies

V_{y_x}(y⋆) ≤ (f(x) − f(x⋆))/μ.
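For intuition, the bound amounts to chaining the two inequalities described in the caption of Figure 1; in our notation (a sketch, not the full-version proof):

```latex
\begin{align*}
\mu\, V_{y_x}(y^\star)
  &\le \psi(x, y_x) - \psi(x, y^\star)
     && \text{($\psi(x,\cdot)$ $\mu$-strongly concave rel.\ $r$; $y_x$ maximizes it over $\mathcal{Y}$)} \\
  &\le \psi(x, y_x) - \psi(x^\star, y^\star)
     && \text{(since $\psi(x^\star, y^\star) = \min_{x'} \psi(x', y^\star) \le \psi(x, y^\star)$)} \\
  &= f(x) - f(x^\star)
     && \text{(as $f(x) = \psi(x, y_x)$ and $f(x^\star) = \psi(x^\star, y^\star)$).}
\end{align*}
```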

A first try

In particular, Lemma 1 suggests the following approach: define “dual-regularized” versions of ψ, ϕ, f as follows for λ > 0 and y₀ ∈ 𝒴:

ψ₁(x,y) ≔ ψ(x,y) − λV_{y₀}(y),
f₁(x) ≔ max_{y∈𝒴} ψ₁(x,y),
ϕ₁(y) ≔ min_{x∈𝒳} ψ₁(x,y).

(Here, the subscript 1 denotes one level of regularization and will be extended later.) For any x ∈ 𝒳, note that ψ₁(x,·) is λ-strongly concave relative to r, in which case Lemma 1 applied to ψ₁ yields

V_{y_x}(y₁⋆) ≤ (f₁(x) − f₁(x₁⋆))/λ, (1)

for y₁⋆ ∈ argmax_{y∈𝒴} ϕ₁(y), x₁⋆ ∈ argmin_{x∈𝒳} f₁(x), and y_x ≔ argmax_{y∈𝒴} ψ₁(x,y). Then note

ϕ(y₁⋆) ≥ ϕ₁(y₁⋆) ≥ ϕ₁(y⋆) = min_{x∈𝒳}{ψ(x,y⋆) − λV_{y₀}(y⋆)} = ϕ(y⋆) − λV_{y₀}(y⋆), (2)

where the first inequality follows since ϕ ≥ ϕ₁ pointwise, and the second since y₁⋆ maximizes ϕ₁. Then by the L-Lipschitzness of ϕ and the μ_r-strong convexity of r, it is straightforward to bound the suboptimality of y_x as

ϕ(y⋆) − ϕ(y_x) ≤ λV_{y₀}(y⋆) + L·√(2(f₁(x) − f₁(x₁⋆))/(μ_r λ)). (3)

Consequently, an ϵ-optimal point for (D) can be obtained via our oracles as follows: set λ ≔ ϵ/(2V_{y₀}(y⋆)), and use the DRPO oracle on the regularized primal problem to obtain x ∈ 𝒳 such that

f₁(x) − f₁(x₁⋆) ≤ ϵ³μ_r/(16L²V_{y₀}(y⋆)). (4)

Then the best response to x with respect to ψ₁, namely y_x ≔ argmax_{y∈𝒴} ψ₁(x,y), is ϵ-optimal by (3). However, a typical setting in our applications is V_{y₀}(y⋆) = Ω(1), μ_r = 1, and L ≥ 1, in which case ensuring (4) requires solving the regularized primal problem to O(ϵ³) error.
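As a quick check of this accuracy requirement (our arithmetic): substituting λ = ϵ/(2V_{y₀}(y⋆)) and the bound (4) into (3) makes each term at most ϵ/2,

```latex
\begin{align*}
\lambda V_{y_0}(y^\star) &= \frac{\epsilon}{2V_{y_0}(y^\star)} \cdot V_{y_0}(y^\star)
  = \frac{\epsilon}{2}, \\
L\sqrt{\frac{2\,(f_1(x) - f_1(x_1^\star))}{\mu_r \lambda}}
 &\le L\sqrt{\frac{2}{\mu_r}
      \cdot \frac{\epsilon^3 \mu_r}{16 L^2 V_{y_0}(y^\star)}
      \cdot \frac{2 V_{y_0}(y^\star)}{\epsilon}}
  = L\sqrt{\frac{\epsilon^2}{4L^2}} = \frac{\epsilon}{2},
\end{align*}
```

so the right-hand side of (3) is at most ϵ, certifying that y_x is ϵ-optimal.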

Recursive regularization and the dual-extraction framework

To lower the accuracy requirements, we apply dual regularization recursively. A key issue with the preceding argument is that it required a nontrivial bound on V_{y₀}(y⋆). However, it provided us with a nontrivial bound (1) on V_{y_x}(y₁⋆), the “level-one equivalent” of V_{y₀}(y⋆). This suggests solving f₁ to lower accuracy while still obtaining a bound on V_{y_x}(y₁⋆) due to (1), and then adding regularization centered at y_x with a larger value of λ. Indeed, our framework recursively repeats this process until the total regularization is large enough so that (a term similar to) the right-hand side of (3) can be bounded by ϵ, despite never needing to solve a regularized primal problem to high accuracy.

To more precisely describe our approach, let ψ₀ ≔ ψ, f₀ ≔ f, ϕ₀ ≔ ϕ. Over iterations k = 1, 2, …, K, our framework implicitly constructs a sequence of convex-concave games ψ_k: ℝ^d × ℝ^n → ℝ, along with corresponding primal and dual functions f_k: 𝒳 → ℝ and ϕ_k: 𝒴 → ℝ respectively, as follows:

ψ_k(x,y) ≔ ψ_{k−1}(x,y) − λ_{k−1}V_{y_{k−1}}(y),  f_k(x) ≔ max_{y∈𝒴} ψ_k(x,y),  ϕ_k(y) ≔ min_{x∈𝒳} ψ_k(x,y). (5)

Here, (λ_k > 0)_{k=0}^{K−1} is a dual-regularization schedule given as input to the framework, and (y_k ∈ 𝒴)_{k=0}^{K} is a sequence of dual-regularization “centers” generated by the algorithm, with y₀ given as input. For k ∈ {0} ∪ [K], it will be useful to let y_k⋆ denote a maximizer of ϕ_k over 𝒴 and x_k⋆ a minimizer of f_k over 𝒳, with y₀⋆ ≔ y⋆ and x₀⋆ ≔ x⋆ in particular.

Over the K rounds of recursive dual regularization, we aim to balance two goals:

  • On the one hand, we want λ_k to increase quickly so that ψ_k(x,·) is very strongly concave relative to r, thereby allowing us to apply Lemma 1 with a larger strong concavity constant.

  • On the other hand, we want to maintain the invariant that, roughly speaking, y_k⋆ is always ϵ/2-optimal for the original dual ϕ. Indeed, we were constrained to choose λ in (2) on the order of ϵ/V_{y₀}(y⋆) to ensure y₁⋆ is ϵ/2-optimal for ϕ. A similar “constraint” on the dual-regularization schedule (λ_k)_{k=0}^{K−1} appears when (2) is extended to additional levels of regularization. This prevents us from increasing λ_k too quickly.

In all the applications in this paper we choose λ_k ≔ 2λ_{k−1}, while λ₀ typically must remain on the order of ϵ/V_{y₀}(y⋆) due to the second point.

Pseudocode of the framework is given in Algorithm 1. Each successive dual-regularization center y_k is computed via the DRBR oracle (Line 5) as a best response to a primal point x_k obtained via the DRPO oracle (Line 4). In Section 3, we generalize Algorithm 1 (cf. Algorithm 2) in several ways: (i) we allow for stochasticity in the DRPO oracle; (ii) we allow for distance-generating functions r such that dom r ≠ ℝ^n; (iii) we give different but equivalent characterizations of x_k and y_k which facilitate the derivation of explicit expressions for the DRPO and DRBR oracles in applications.

Algorithm 1 Dual-extraction framework (Algorithm 2 specialized).
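The pseudocode body does not render in this version; the following minimal sketch reconstructs the loop from the surrounding description and the proof of Theorem 2 (the callables `drpo` and `drbr` are assumed interfaces for the oracles of Section 3.1, not code from the paper):

```python
def dual_extraction(drpo, drbr, y0, lambdas, epsilons, K):
    """Sketch of Algorithm 1 (deterministic oracles, dom r = R^n).

    drpo(centers, lams, eps): an eps-minimizer over X of
        f_k(x) = max_{y in Y} [ psi(x, y) - sum_j lams[j] * V_{centers[j]}(y) ]
    drbr(centers, lams, x):   the exact argmax_y of the same regularized
        objective at the point x.
    """
    ys = [y0]                                 # dual-regularization centers
    for k in range(1, K + 1):
        # Line 4: approximately solve the dual-regularized primal problem
        # f_k to accuracy epsilon_k.
        x_k = drpo(ys, lambdas[:k], epsilons[k - 1])
        # Line 5: best response to x_k under the same regularization; this
        # becomes the next regularization center y_k.
        ys.append(drbr(ys, lambdas[:k], x_k))
    return ys[-1]                             # y_K
```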

Analysis of Algorithm 1

Theorem 2 is our main result for Algorithm 1. We then instantiate Theorem 2 with two illustrative choices of parameters in Corollaries 4 and 6, and defer the proofs of the latter to their general versions in Section 3. All of the remarks below (Remarks 3, 5, and 7) are stated with reference to the specialized results in this section (Theorem 2 and Corollaries 4 and 6, respectively), but extend immediately to the corresponding general versions (Theorem 15 and Corollaries 16 and 17, respectively).

Theorem 2 (Theorem 15 specialized).

Algorithm 1 returns y_K satisfying

V_{y_K}(u) ≤ ϵ_K/Λ_K where Λ_k ≔ Σ_{j=0}^{k−1} λ_j for k ∈ [K] (6)

and u ∈ 𝒴 is a point with dual suboptimality bounded as

ϕ(y⋆) − ϕ(u) ≤ λ₀V_{y₀}(y⋆) + Σ_{k=1}^{K−1} (λ_k/Λ_k)·ϵ_k. (7)

If we additionally assume that ϕ is L-Lipschitz with respect to ‖·‖, we can directly bound the suboptimality of y_K as

ϕ(y⋆) − ϕ(y_K) ≤ λ₀V_{y₀}(y⋆) + Σ_{k=1}^{K−1} (λ_k/Λ_k)·ϵ_k + L·√(2ϵ_K/(μ_rΛ_K)). (8)
Proof.

We claim the first half of Theorem 2 holds with u ≔ y_K⋆. To see this, note that we can bound the suboptimality of y_K⋆ as

ϕ(y_K⋆) ≥^{(i)} ϕ_K(y_K⋆) ≥ ϕ_K(y_{K−1}⋆) = min_{x∈𝒳}{ψ_{K−1}(x, y_{K−1}⋆)} − λ_{K−1}V_{y_{K−1}}(y_{K−1}⋆)
 = ϕ_{K−1}(y_{K−1}⋆) − λ_{K−1}V_{y_{K−1}}(y_{K−1}⋆)
 ≥^{(ii)} ϕ₀(y₀⋆) − λ₀V_{y₀}(y₀⋆) − Σ_{k=1}^{K−1} λ_kV_{y_k}(y_k⋆)
 ≥^{(iii)} ϕ(y⋆) − λ₀V_{y₀}(y⋆) − Σ_{k=1}^{K−1} (λ_k/Λ_k)·ϵ_k,

where (i) follows since ϕ ≥ ϕ_K pointwise, (ii) follows from repeating the argument in the previous lines recursively (starting by lower bounding ϕ_{K−1}(y_{K−1}⋆), etc.), and (iii) uses Lemma 1 applied to ψ_k, which yields by Lines 4 and 5 in Algorithm 1:

V_{y_k}(y_k⋆) ≤ (f_k(x_k) − f_k(x_k⋆))/Λ_k ≤ ϵ_k/Λ_k,

since ψ_k(x,·) = ψ(x,·) − Σ_{j=0}^{k−1} λ_jV_{y_j}(·) is Λ_k-strongly concave relative to r. Thus, we have proven Equation 7, and Equation 6 follows again from Lemma 1 applied to ψ_K. Equation 8 then follows since the fact that r is μ_r-strongly convex with respect to ‖·‖ and Equation 6 imply

‖y_K − y_K⋆‖ ≤ √(2V_{y_K}(y_K⋆)/μ_r) ≤ √(2ϵ_K/(μ_rΛ_K)).

We give a remark regarding how to pick the parameters (λ_k)_{k=0}^{K−1} and (ϵ_k)_{k=1}^{K} when applying Theorem 2:

 Remark 3 (Picking the parameters for Theorem 2).

Equation 8 can be interpreted as follows: to ensure y_K is ϵ-optimal for ϕ, it suffices to choose the sequences (λ_k)_{k=0}^{K−1} and (ϵ_k)_{k=1}^{K} so that the right side of (8) is at most ϵ. The first term, λ₀V_{y₀}(y⋆), constrains λ₀ to be on the order of ϵ/V_{y₀}(y⋆). Skipping ahead, the third term, L√(2ϵ_K/(μ_rΛ_K)), is the reason we always choose λ_k ≔ 2λ_{k−1} in our applications, as this ensures Λ_K is large enough to handle this term with K only needing to be logarithmic in the problem parameters. Then the second term, Σ_{k=1}^{K−1} (λ_k/Λ_k)·ϵ_k, effectively constrains roughly Σ_{k=1}^{K−1} ϵ_k ≲ ϵ, as λ_k/Λ_k = O(1).

Corollary 4 (Corollary 16 specialized).

Suppose ϕ is L-Lipschitz with respect to ‖·‖, and let B > 0 be such that V_{y₀}(y⋆) ≤ B. Then for any ϵ > 0 and K ≔ max{⌈log₂(L²B/(μ_rϵ²))⌉, 1} + 10, the output of Algorithm 1 with dual-regularization and primal-accuracy schedules of

λ_k = 2^k·ϵ/(4B) for k ∈ {0} ∪ [K−1] and ϵ_k = ϵ/(4K) for k ∈ [K]

satisfies ϕ(y⋆) − ϕ(y_K) ≤ ϵ.
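To see how these schedules realize Remark 3 (a rough check of ours, ignoring the rounding in K): the choice of K gives Λ_K = (2^K − 1)ϵ/(4B) ≥ 128L²/(μ_rϵ), so the three terms of (8) are bounded by

```latex
\begin{gather*}
\lambda_0 V_{y_0}(y^\star) \le \frac{\epsilon}{4B}\cdot B = \frac{\epsilon}{4}, \qquad
\sum_{k=1}^{K-1} \frac{\lambda_k}{\Lambda_k}\,\epsilon_k
  \le 2\sum_{k=1}^{K-1} \frac{\epsilon}{4K} \le \frac{\epsilon}{2}
  \quad\Big(\text{as } \tfrac{\lambda_k}{\Lambda_k} = \tfrac{2^k}{2^k - 1} \le 2\Big), \\
L\sqrt{\frac{2\epsilon_K}{\mu_r \Lambda_K}}
  \le L\sqrt{\frac{2}{\mu_r}\cdot\frac{\epsilon}{4K}\cdot\frac{\mu_r \epsilon}{128\,L^2}}
  = \frac{\epsilon}{16\sqrt{K}} \le \frac{\epsilon}{16},
\end{gather*}
```

and ϵ/4 + ϵ/2 + ϵ/16 < ϵ.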

 Remark 5.

Corollary 4 achieves the stated goal of obtaining an ϵ-optimal point for (D) by running for a number of iterations which depends logarithmically on the problem parameters, and solving each dual-regularized primal subproblem to an accuracy of ϵ divided by a logarithmic factor. Note in particular the logarithmic dependence on the dual divergence bound B and the dual Lipschitz constant L, meaning these are weak assumptions. Furthermore, it is clear from the proof of Theorem 2 that ϕ only needs to be L-Lipschitz on a set containing y_K and y_K⋆.

Corollary 6 (Corollary 17 specialized).

Let B > 0 be such that V_{y₀}(y⋆) ≤ B. Then for any ϵ > 0 and K ∈ ℕ, the output of Algorithm 1 with dual-regularization and primal-accuracy schedules of

λ_k = 2^k·ϵ/(4B) for k ∈ {0} ∪ [K−1] and ϵ_k = ϵ/(8·1.5^k) for k ∈ [K]

satisfies

‖y_K − u‖ ≤ (1/1.5^K)·√(2B/μ_r),

where u ∈ 𝒴 is a point whose suboptimality is at most ϵ, i.e., ϕ(y⋆) − ϕ(u) ≤ ϵ.

 Remark 7.

Later calls to the DRPO oracle during the run of Algorithm 1 may be cheaper since there will be a significant amount of dual regularization at that point (namely, Λ_k = Σ_{j=0}^{k−1} λ_j is large). One can sometimes take advantage of this (in particular, if the cost of a DRPO oracle call scales inverse polynomially with the regularization) to design schedules that avoid even the typical additional multiplicative logarithmic cost of Corollary 4 over the cost of a single DRPO oracle call. In such cases, a choice of schedules similar to those of Corollary 6 is often appropriate. With this choice of schedules, later rounds require very high accuracy. However, if one can argue that the increasing dual regularization Λ_k makes the DRPO oracle call cheaper at a faster rate than the decreasing error ϵ_k makes it more expensive (as we do in Section 5), the total cost of applying the framework may collapse geometrically to the cost of a single DRPO oracle call made with target error approximately ϵ.

We purposely state Corollary 6 without the assumption that ϕ is Lipschitz because that is the form we will use in Section 5. However, it is straightforward to reformulate a version of Corollary 6 with the Lipschitz assumption; our focus here was to illustrate different possible choices of schedules.

1.3 Related work

Black-box reductions

Our main contribution can be viewed as a black-box reduction from (regularized) primal optimization to dual optimization. Similar black-box reductions exist in the optimization literature. For example, [3] develops reductions between various fundamental classes of optimization problems, e.g., strongly convex optimization and smooth optimization. In a similar vein, the line of work [30, 18, 7] reduces convex optimization to approximate proximal point computation (i.e., regularized minimization).

Bilinear matrix games

Consider the bilinear objective ψ(x,y) = x^⊤Ay where 𝒳 and 𝒴 are either the simplex, Δ^k ≔ {x ∈ ℝ^k_{≥0} : ‖x‖₁ = 1}, or the Euclidean ball, B^k ≔ {x ∈ ℝ^k : ‖x‖₂ ≤ 1}. State-of-the-art methods in regard to runtime for obtaining an approximately optimal primal and/or dual solution can be divided into second-order interior point methods [11, 42] and stochastic first-order methods [22, 10, 9, 8]; see Table 2 in [8] for a summary of the best known runtimes as well as other references. Of importance to this paper, all state-of-the-art algorithms other than that of [8] are either (i) primal-dual algorithms which return both an ϵ-optimal primal and dual solution simultaneously, and/or (ii) achieve runtimes which are symmetric in the primal dimension d and dual dimension n, meaning the cost of obtaining an ϵ-optimal dual solution is the same as that of obtaining an ϵ-optimal primal solution. The algorithm of [8], on the other hand, only returns an ϵ-optimal primal point and further has a runtime which is not symmetric in n and d (see the footnote on the first page of that paper). As a result, solving the dual by simply swapping the roles of the primal and dual variables may be more expensive than solving the primal. (In fact, swapping the variables in this way may not even always be possible without further modifications due to restrictions on the constraint sets.)

CVaR at level 𝜶 distributionally robust optimization (DRO)

The DRO objectives we study are of the form ψ(x,y) = Σ_{i=1}^n y_i f_i(x), where the functions f_i: ℝ^d → ℝ are convex, bounded, and Lipschitz, and 𝒴, known as the uncertainty set, is a subset of the simplex. This objective corresponds to a robust version of the empirical risk minimization (ERM) objective where instead of taking an average over the losses (namely, fixing y_i = 1/n), larger losses may be given more weight. In particular, in this paper we focus on a canonical DRO setting, CVaR at level α, where the uncertainty set is given by 𝒴 ≔ {y ∈ Δ^n : y ≤ 𝟏/(αn)} for a choice of α ∈ [1/n, 1]. CVaR DRO, along with its generalization f-divergence DRO, has been of significant interest over the past decade; see [29, 5, 13, 32, 16] and the references therein. [29] is the most relevant to this paper: omitting parameters other than α, the number of losses n, and the target accuracy ϵ > 0, they give matching upper and lower bounds (up to logarithmic factors) of Õ(α^{−1}ϵ^{−2}) first-order queries of the form (f_i(x), ∇f_i(x)) to obtain an expected ϵ-optimal point of the primal objective. Their upper bound is achieved by a stochastic gradient method where the gradient estimator is based on a multilevel Monte Carlo (MLMC) scheme [19, 20]. However, the best known complexity for obtaining an expected ϵ-optimal point of the dual of CVaR at level α is O(nϵ^{−2}) via a primal-dual method based on [33]; see also [13, 32] as well as [5, Appendix A.1], the last of which obtains complexity Õ(nϵ^{−2}) in the more general setting where the uncertainty set is an f-divergence ball.

Stationary point computation

For γ > 0, convex and β-smooth h: ℝ^n → ℝ with global minimum z⋆, and initialization point z₀, consider the problem of computing a point z such that ‖∇h(z)‖₂ ≤ γ. Two worst-case optimal gradient query complexities for this problem exist in the literature: O(√(β(h(z₀) − h(z⋆)))/γ) and O(√(β‖z₀ − z⋆‖₂/γ)). An algorithm (the OGM-G method) which achieves the former complexity was given in [24], and [37] pointed out that any algorithm which achieves the former complexity can also achieve the latter: run N iterations of any optimal gradient method for reducing the function value, followed by N iterations of a method which achieves the former complexity for reducing the gradient magnitude. In what may be of independent interest, we observe in Section 5.1 that a reduction in the opposite direction is also possible. More broadly, algorithms and frameworks for reducing the gradient magnitude of convex functions have been of much recent interest; further algorithms and related work for this problem include [25, 27, 26, 28, 15, 37, 21], with lower bounds given in [34, 35].

1.4 Paper organization

In Section 2, we go over notation and conventions for the rest of the paper. We give our general dual-extraction framework and its guarantees in Section 3. In Section 4, we apply our framework to bilinear matrix games and the CVaR at level α DRO problem. Finally, in Section 5 we give an optimal algorithm (in terms of query complexity) for computing an approximate stationary point of a convex and β-smooth function.

2 Notation and conventions

We defer standard notation and conventions to the full version, and only include paper-specific notation here.

For ψ: ℝ^d × ℝ^n → ℝ, we use the notation ψ(·,y): ℝ^d → ℝ for a fixed y ∈ ℝ^n to denote the map x ↦ ψ(x,y) (and define ψ(x,·) analogously). When we say ψ(·,y) satisfies a property, we mean it satisfies that property for any fixed y ∈ ℝ^n (and analogously for ψ(x,·)). We let [K] ≔ {1, 2, …, K}, Δ^n ≔ {x ∈ ℝ^n_{≥0} : ‖x‖₁ = 1}, and B_r^n(x̄) ≔ {x ∈ ℝ^n : ‖x − x̄‖₂ ≤ r}. In the latter two definitions, we may drop the superscript n if it is clear from context, the argument x̄ if it is 0, and the subscript r if it is 1. For y ∈ ℝ^n, we may use either the notation y_i or [y]_i to denote its i-th entry. 𝟏 denotes the all-ones vector. For a function f which depends on some inputs x₁, …, x_k, we write f ≤ poly(x₁, …, x_k) to denote the fact that f is uniformly bounded above by a polynomial in x₁, …, x_k as x₁, …, x_k vary. We use the notation f^* for the convex or Fenchel conjugate of f. For S ⊆ ℝ^n, we let 𝕀_S denote the infinite indicator of S, namely 𝕀_S(x) = 0 if x ∈ S and 𝕀_S(x) = ∞ if x ∉ S. For a function f: S → [−∞, ∞] initially defined on a strict subset S ⊊ ℝ^n, we may implicitly extend the domain of f to all of ℝ^n via its indicator as f + 𝕀_S without additional comment. For a function f: U → [−∞, ∞] with S ⊆ U ⊆ ℝ^n, we let f_S ≔ f + 𝕀_S denote the restriction of f to S. We note that f_S^* denotes the convex conjugate of f_S (and not f^* restricted to S).

Following [38, Sec. 6.4], we encapsulate the setup for a dgf as follows. See the full version for additional discussion of this definition.

Definition 8 (dgf setup).

We say (𝒰, 𝒫, ‖·‖, r) is a dgf setup over ℝ^n for closed and convex sets 𝒰 ⊆ 𝒫 ⊆ ℝ^n with 𝒰 ∩ int 𝒫 ≠ ∅ if: (i) the distance-generating function (dgf) r: 𝒫 → ℝ is convex and continuous over 𝒫, differentiable on int 𝒫, and μ_r-strongly convex with respect to the chosen norm ‖·‖ on 𝒰 ∩ int 𝒫 for some μ_r > 0; and (ii) either lim_{u→bd 𝒫} ‖∇r(u)‖₂ = ∞ or 𝒰 ⊆ int 𝒫.

For a given dgf setup, we define its induced Bregman divergence V_u^r(v) ≔ r(v) − r(u) − ⟨∇r(u), v − u⟩ for u ∈ int 𝒫, v ∈ 𝒫, and drop the superscript r when it is clear from context.

3 Dual-extraction framework

In this section, we provide our general dual-extraction framework and its guarantees. In Section 3.1, we give the general setup, oracle definitions, and assumptions with which we apply and analyze the framework. Section 3.2 contains the statement and guarantees of the framework and Section 3.3 in the full version contains the associated proofs.

3.1 Preliminaries

We bundle all of the inputs to our framework into what we call a dual-extraction setup, defined below. Recall that when we say ψ(x,·) satisfies a property, we mean it satisfies that property for any fixed x ∈ ℝ^d (and analogously for ψ(·,y)).

Definition 9 (Dual-extraction setup).

A dual-extraction setup is a tuple (ψ, 𝒳, 𝒴, 𝒰, 𝒫, ‖·‖, r) where:

  1. ψ(x,·) is differentiable;

  2. ψ(·,y) and ψ(x,·) are convex and concave, respectively;

  3. (𝒰, 𝒫, ‖·‖, r) is a dgf setup over ℝ^n per Definition 8;

  4. the constraint sets 𝒳 ⊆ ℝ^d and 𝒴 ⊆ ℝ^n are nonempty, closed, and convex with 𝒴 ⊆ 𝒰 and 𝒴 ∩ int 𝒫 ≠ ∅;

  5. 𝒳 is bounded or ψ(·,y) is strongly convex;

  6. 𝒴 is bounded or ψ(x,·) is strongly concave;

  7. over all p ∈ 𝒰 ∩ int 𝒫 and w ∈ ∂𝕀_𝒰(p), the map y ↦ ⟨w,y⟩ is constant over 𝒴. (In all of our applications, this map will in fact be constant over 𝒰.)

Assumption 1 is only used in the proofs of Lemma 3 in the full version (the general version of Lemma 1 from Section 1.2) and Corollary 13 in the full version (used to show the framework is well-defined when dom r ≠ ℝ^n). Assumptions 2, 5, and 6 ensure that the minimax optimization problem with objective ψ and constraint sets 𝒳 and 𝒴 satisfies the minimax principle; see below. Regarding Assumptions 3, 4, and 7, the fact that 𝒴 is potentially a strict subset of 𝒰, as well as the necessity of the technical Assumption 7, is discussed in Remark 4 in the full version. In particular, Assumption 7 is only used to derive an equivalent formulation of the framework to Algorithm 1 which often allows for easier instantiations in applications, but is not strictly necessary to obtain our guarantees.

While our main results are stated in the full generality of Definition 9, in our applications we only particularize to Definition 10 and Definition 11 introduced below.

Definition 10 (Unbounded setup).

A (ψ, 𝒳, 𝒴, r)-unbounded setup is a (ψ, 𝒳, 𝒴, ℝ^n, ℝ^n, ‖·‖₂, r)-dual-extraction setup.

In other words, in an unbounded setup we choose 𝒰 = 𝒫 = ℝ^n and the Euclidean norm, in which case the dgf r can be any differentiable function which is strongly convex with respect to ‖·‖₂. Note that Assumption 7 is trivial as ∂𝕀_𝒰(p) = {0} for all p ∈ ℝ^n.

Definition 11 (Simplex setup).

A (ψ, 𝒳, 𝒴)-simplex setup is a (ψ, 𝒳, 𝒴, Δ^n, ℝ^n_{≥0}, ‖·‖₁, r)-dual-extraction setup where r(u) ≔ Σ_{i=1}^n u_i ln u_i (with 0 ln 0 ≔ 0).

In other words, in a simplex setup we choose 𝒰 = Δ^n, 𝒫 = ℝ^n_{≥0}, we use the ℓ₁-norm, and the dgf is negative entropy when restricted to the simplex. It is a standard result, known as Pinsker’s inequality, that r is 1-strongly convex over Δ^n_{>0} with respect to ‖·‖₁, and the associated Bregman divergence is given by the Kullback-Leibler (KL) divergence V_u(w) = Σ_{i=1}^n w_i ln(w_i/u_i) for u ∈ Δ^n_{>0} and w ∈ Δ^n. We verify that Assumption 7 holds in Appendix A.1 in the full version.
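For concreteness, a small sketch of ours (not from the paper) of this dgf setup: the negative-entropy dgf, its gradient, and the induced Bregman divergence, which for simplex arguments reduces to the KL divergence.

```python
import numpy as np

def r(u):
    """Negative-entropy dgf, r(u) = sum_i u_i ln u_i (with 0 ln 0 = 0)."""
    u = np.asarray(u, dtype=float)
    return np.sum(np.where(u > 0, u * np.log(np.where(u > 0, u, 1.0)), 0.0))

def grad_r(u):
    """Gradient of r on the interior of the positive orthant: 1 + ln(u)."""
    return 1.0 + np.log(u)

def bregman(u, w):
    """V_u(w) = r(w) - r(u) - <grad r(u), w - u>; equals KL(w || u) when
    u lies in the interior of the simplex and w lies in the simplex."""
    return r(w) - r(u) - np.dot(grad_r(u), w - u)

u = np.full(4, 0.25)                  # uniform center
w = np.array([0.7, 0.1, 0.1, 0.1])
print(bregman(u, w))                  # matches sum_i w_i ln(w_i / u_i)
```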

Notation associated with a setup

Whenever we instantiate a dual-extraction setup (Definition 9), we use the following notation and oracles associated with that setup without additional comment. We define the associated primal f: 𝒳 → ℝ and dual ϕ: 𝒴 → ℝ functions, along with their corresponding primal and dual optimization problems, as they were introduced above in (P) and (D). We let x⋆ ∈ argmin_{x∈𝒳} f(x) and y⋆ ∈ argmax_{y∈𝒴} ϕ(y) denote arbitrary primal and dual optima. To facilitate the discussion of dual-regularized problems, we define f_{λ,q}: 𝒳 → ℝ as follows:

f_{λ,q}(x) ≔ max_{y∈𝒴}{ψ(x,y) − λV_q(y)} for λ > 0 and q ∈ 𝒰 ∩ int 𝒫.

The minimax principle

Assumptions 2, 5, and 6 in Definition 9 guarantee f(x⋆) = ψ(x⋆, y⋆) = ϕ(y⋆), which we refer to as the minimax principle. See, e.g., [39, 41] as well as Propositions 1.2 and 2.4 in [17, Ch. VI].

Oracle definitions

Our framework assumes black-box access to ψ, 𝒳, and 𝒴 via a dual-regularized primal optimization (DRPO) oracle and a dual-regularized best response (DRBR) oracle defined below. Note that we generalize the setting of Section 1.2 by allowing the DRPO oracle to return an expected ϵ-optimal point; this is used in our applications in Section 4.

Definition 12 (DRPO oracle).

A (q ∈ 𝒰 ∩ int 𝒫, λ > 0, ϵ_p > 0)-dual-regularized primal optimization oracle, DRPO(q, λ, ϵ_p), returns an expected ϵ_p-minimizer of f_{λ,q}, i.e., a point x ∈ 𝒳 such that 𝔼 f_{λ,q}(x) ≤ inf_{x′∈𝒳} f_{λ,q}(x′) + ϵ_p, where the expectation is over the internal randomness of the oracle.

Definition 13 (DRBR oracle).

A (q ∈ 𝒰 ∩ int 𝒫, λ > 0, x ∈ 𝒳)-dual-regularized best response oracle, DRBR(q, λ, x), returns argmax_{y∈𝒴}{ψ(x,y) − λV_q(y)}.

We also define a version of the DRPO oracle, called the DRPOSP oracle, which allows for a failure probability. We include this definition here due to its generality and broad applicability, but it is only used in Section 4.1, since the external result we cite to obtain an expected ϵ_p-minimizer of f_{λ,q} in that application has a failure probability. We also show in Appendix A.4 in the full version how to boost the success probability of a DRPOSP oracle.

Definition 14 (DRPOSP oracle).

A (q ∈ 𝒰 ∩ int 𝒫, λ > 0, ϵ_p > 0, δ ∈ [0,1))-dual-regularized primal optimization oracle with success probability, DRPOSP(q, λ, ϵ_p, δ), returns an expected ϵ_p-minimizer of f_{λ,q} with success probability at least 1 − δ, where the expectation and success probability are over the internal randomness of the oracle.

3.2 The framework and its guarantees

Algorithm 2 Dual-extraction framework.

We now state the general dual-extraction framework, Algorithm 2, and its guarantees, with proofs in the next section. As mentioned in Section 1.2, Algorithm 2 generalizes Algorithm 1 in three major ways: (i) we allow for stochasticity in the DRPO oracle; (ii) we allow for distance-generating functions r where dom r ≠ ℝ^n; and (iii) we give different but equivalent characterizations of x_k and y_k which often allow for easier instantiations of the framework.

Regarding (iii), consider the case where the DRPO oracle is deterministic and dom r = ℝ^n for the sake of discussion. Note that in this case, the definitions of x_k and y_k in Lines 4 and 5 of Algorithm 2 may seem different than those in Lines 4 and 5 of Algorithm 1 at first glance. In particular, x_k in Line 4 of Algorithm 2 is an ϵ_k-minimizer of x ↦ max_{y∈𝒴}{ψ(x,y) − Λ_kV_{q_k}(y)} over 𝒳, whereas x_k in Line 4 of Algorithm 1 is an ϵ_k-minimizer of x ↦ max_{y∈𝒴}{ψ(x,y) − Σ_{j=0}^{k−1} λ_jV_{y_j}(y)} over 𝒳. Similarly, y_k = argmax_{y∈𝒴}{ψ(x_k,y) − Λ_kV_{q_k}(y)} in Line 5 of Algorithm 2, whereas y_k = argmax_{y∈𝒴}{ψ(x_k,y) − Σ_{j=0}^{k−1} λ_jV_{y_j}(y)} in Line 5 of Algorithm 1. In fact, we show in Section 3.3 in the full version that these are equivalent; see Lemma 2 and Remark 4 in the full version. The potential advantage of the expressions in Algorithm 2 compared to those in Algorithm 1 is that they involve only a single regularization term.

Note also that Line 3 of Algorithm 2 gives two equivalent expressions for the iterate q_k; their equivalence is proven in Appendix A.2 in the full version. Also, note that Line 4 is the only potential source of randomness in Algorithm 2; in particular, y_k and q_{k+1} are deterministic upon conditioning on x_k. Finally, we show that Algorithm 2 is well-defined in Appendix A.3 in the full version; in particular, whenever a Bregman divergence V_u(w) is written in Algorithm 2, it is the case that u ∈ 𝒰 ∩ int 𝒫. For example, in the context of a simplex setup per Definition 11, this corresponds to u ∈ Δ^n_{>0}.

We now give the main guarantee for Algorithm 2. See Remark 3 for additional explanation.

Theorem 15 (Algorithm 2 guarantee).

With K calls to a DRPO oracle and K calls to a DRBR oracle, Algorithm 2 returns y_K satisfying

𝔼 V_{y_K}(u) ≤ ϵ_K/Λ_K,

where u ∈ 𝒴 is a point with expected suboptimality bounded as

ϕ(y⋆) − 𝔼 ϕ(u) ≤ λ₀V_{y₀}(y⋆) + Σ_{k=1}^{K−1} (λ_k/Λ_k)·ϵ_k.

If we additionally assume that ϕ is L-Lipschitz with respect to ‖·‖, the expected suboptimality of y_K can be directly bounded as

ϕ(y⋆) − 𝔼 ϕ(y_K) ≤ λ₀V_{y₀}(y⋆) + Σ_{k=1}^{K−1} (λ_k/Λ_k)·ϵ_k + L·√(2ϵ_K/(μ_rΛ_K)). (9)

We now particularize Theorem 15 using two exemplary choices of the dual-regularization and primal-accuracy schedules. See Remarks 5 and 7 for additional comments.

Corollary 16.

Suppose ϕ is L-Lipschitz with respect to ‖·‖, and let B > 0 be such that V_{y₀}(y⋆) ≤ B. Then for any ϵ > 0 and K ≔ max{⌈log₂(L²B/(μ_rϵ²))⌉, 1} + 10, the output of Algorithm 2 with dual-regularization and primal-accuracy schedules given by

λ_k = 2^k·ϵ/(4B) for k ∈ {0} ∪ [K−1] and ϵ_k = ϵ/(4K) for k ∈ [K]

satisfies ϕ(y⋆) − 𝔼 ϕ(y_K) ≤ ϵ.

Corollary 17.

Let B > 0 be such that V_{y₀}(y⋆) ≤ B. Then for any ϵ > 0 and K ∈ ℕ, the output of Algorithm 2 with dual-regularization and primal-accuracy schedules given by

λ_k = 2^k·ϵ/(4B) for k ∈ {0} ∪ [K−1] and ϵ_k = ϵ/(8·1.5^k) for k ∈ [K] (10)

satisfies

𝔼‖y_K − u‖ ≤ (1/1.5^K)·√(2B/μ_r),

where u ∈ 𝒴 is a point whose expected suboptimality is at most ϵ, i.e., ϕ(y⋆) − 𝔼 ϕ(u) ≤ ϵ.

4 Efficient maximin algorithms

In this section, we obtain new state-of-the-art runtimes for solving bilinear matrix games in certain parameter regimes (Section 4.1), as well as an improved query complexity for solving the dual of the CVaR at level α distributionally robust optimization (DRO) problem (Section 4.2). In each application, we apply Corollary 16 to compute an ϵ-optimal point for the dual problem at approximately the same cost as computing an ϵ-optimal point for the primal problem (up to logarithmic factors and the cost of representing a dual vector when it comes to CVaR at level α).

4.1 Bilinear matrix games

In this section, we instantiate ψ(x,y) ≔ x^⊤Ay for a matrix A ∈ ℝ^{d×n}. Given p, q ≥ 1, we write ‖A‖_{p→q} ≔ max_{v≠0} ‖Av‖_q/‖v‖_p, and use the notation A_{ij}, A_{i:}, and A_{:j} for the (i,j) entry, i-th row as a row vector, and j-th column as a column vector. We consider two setups:

Definition 18 (Matrix games ball setup).

In the matrix games ball setup, we set 𝒳 ≔ B^d (the unit Euclidean ball in ℝ^d), 𝒴 ≔ Δ^n, and fix a (ψ, 𝒳, 𝒴)-simplex setup (Definition 11). We assume ‖A‖_{1→2} = max_{i∈[n]} ‖A_{:i}‖₂ ≤ 1.

Definition 19 (Matrix games simplex setup).

In the matrix games simplex setup, we set 𝒳 ≔ Δ^d, 𝒴 ≔ Δ^n, and fix a (ψ, 𝒳, 𝒴)-simplex setup (Definition 11). We assume ‖A‖_{1→∞} = max_{i,j} |A_{ij}| ≤ 1.

Throughout Section 4.1, any theorem, statement, or equation which does not make reference to a specific choice of Definition 18 or 19 applies to both setups. Specializing the primal (P) and dual (D) to this application gives

minimize_{x∈𝒳} f(x) for f(x) ≔ max_{y∈Δ^n} x^⊤Ay, and (P-MG)
maximize_{y∈Δ^n} ϕ(y) for ϕ(y) ≔ min_{x∈𝒳} x^⊤Ay. (D-MG)

Regarding the assumptions on the norm of the matrix A in Definitions 18 and 19, note that we can equivalently write f(x) = max_{y∈Δ^n} Σ_{i=1}^n y_i f_i(x) with f_i(x) ≔ [A^⊤x]_i. Then the assumptions on the norm of A correspond to ensuring f_i is 1-Lipschitz with respect to the ℓ₂-norm in Definition 18 and the ℓ₁-norm in Definition 19 (which in turn implies f is 1-Lipschitz in the respective norms). This normalization is performed to simplify expressions, as in [8]. (In particular, [8] also considers the more general problem where each f_i can be any smooth, Lipschitz, convex function.)

Recently, [8, Cor. 8.2] achieved a state-of-the-art runtime in certain parameter regimes of Õ(nd + n(d/ϵ)^{2/3} + d/ϵ²) for obtaining an ϵ-optimal point for (P-MG). However, unlike previous algorithms for (P-MG) (see Section 1.3 for an extended discussion), their algorithm does not yield an ϵ-optimal point for (D-MG) with the same runtime.

Algorithm 3 Dual extraction for matrix games.

Our instantiation of the dual-extraction framework in Algorithm 3 and the accompanying guarantee Theorem 21 resolve this asymmetry between the complexity of obtaining a primal versus dual ϵ-optimal point by obtaining an ϵ-optimal point of (D-MG) with the same runtime of Õ(nd + n(d/ϵ)^{2/3} + d/ϵ²). At the end of Section 4.1, we observe that Theorem 21 also yields a new state-of-the-art runtime for the primal (P-MG) in the setting of Definition 19 due to the symmetry of the constraint sets and ψ.

Before giving the guarantee Theorem 21 for Algorithm 3, the following lemma provides a runtime bound for the DRPOSP oracle when the success probability is 9/10 (see Appendix B.1 in the full version for the proof). In particular, Lemma 20 shows that adding dual regularization to (P-MG) does not increase the complexity of obtaining an ϵ-optimal point over the guarantee of [8, Cor. 8.2] discussed above.

Lemma 20 (DRPOSP oracle for matrix games).

In the settings of Definitions 18 and 19, for any q ∈ Δ^n_{>0}, ϵ_p > 0, and λ > 0, there exists an algorithm which, with success probability at least 9/10, returns an expected ϵ_p-optimal point of f_{λ,q} with runtime Õ(nd + n(d/ϵ_p)^{2/3} + d/ϵ_p²). (Equivalently, per Definition 14, DRPOSP(q, λ, ϵ_p, 1/10) can be implemented with this runtime.)
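The DRBR oracle, in contrast, admits a simple closed form in these setups: first-order optimality for max_{y∈Δ^n}{⟨A^⊤x, y⟩ − λ·KL(y‖q)} over the simplex gives an entropy-regularized best response with weights proportional to q_i·exp([A^⊤x]_i/λ). A sketch of ours (not pseudocode from the paper):

```python
import numpy as np

def drbr_matrix_games(A, q, lam, x):
    """DRBR(q, lam, x) for psi(x, y) = x^T A y in a simplex setup, where
    V_q(y) = KL(y || q):
        argmax_{y in simplex} <A^T x, y> - lam * KL(y || q)
          = softmax of (A^T x) / lam, reweighted by q.
    """
    c = A.T @ x / lam
    logits = np.log(q) + c
    logits -= logits.max()          # stabilize the exponentials
    y = np.exp(logits)
    return y / y.sum()
```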

Now for the main guarantee (we defer the proof to the full version):

Theorem 21 (Guarantee for Algorithm 3).

In the settings of Definitions 18 and 19, given target error ϵ > 0 and with success probability at least 9/10, Algorithm 3 with dual-regularization and primal-accuracy schedules given by

λ_k = 2^k·ϵ/(4 ln n) for k ∈ {0} ∪ [K−1] and ϵ_k = ϵ/(4K) for k ∈ [K]

for K = max{⌈log₂(ln n/ϵ²)⌉, 1} + 10 returns an expected ϵ-optimal point for (D-MG), and can be implemented with runtime Õ(nd + n(d/ϵ)^{2/3} + d/ϵ²).

The primal perspective

As alluded to above, the guarantee of Theorem 21 also implies a new state-of-the-art runtime for the primal (P-MG) in the setting of Definition 19. This follows because in the matrix games simplex setup, (P-MG) and (D-MG) are symmetric in terms of their constraint sets, so we can obtain an expected ϵ-optimal point for (P-MG) via Theorem 21 by negating the objective and treating (P-MG) as if it were the dual problem. Formally (we defer the proof to the full version):

Corollary 22 (Guarantee for (P-MG) in the matrix games simplex setup).

In the setting of Definition 19, there exists an algorithm which, given target error ϵ > 0 and with success probability at least 9/10, returns an expected ϵ-optimal point for (P-MG) with runtime Õ(nd + d(n/ϵ)^{2/3} + n/ϵ²).

See the full version for a discussion of how this runtime compares to the prior art.

4.2 CVaR at level 𝜶 DRO

In this section, we instantiate ψ(x,y) ≔ Σ_{i=1}^n y_i f_i(x) for convex, bounded, and G-Lipschitz (with respect to the Euclidean norm) functions f_i: ℝ^d → ℝ. (Note that we do not require the functions f_i to be differentiable; here, it is important that Definition 9 only requires ψ(x,·) to be differentiable.) Given a compact, convex set 𝒳 and α ∈ [1/n, 1], the primal and dual problems for CVaR at level α are as follows (we explain the reason for the notation f̄ as opposed to f in the full version; in short, we apply the framework to a proxy objective):

minimize_{x∈𝒳} f̄(x) for f̄(x) ≔ max_{y∈Δ^n, y ≤ 𝟏/(αn)} Σ_{i=1}^n y_i f_i(x), and (P-CVaR)
maximize_{y∈Δ^n, y ≤ 𝟏/(αn)} ϕ(y) for ϕ(y) ≔ min_{x∈𝒳} Σ_{i=1}^n y_i f_i(x). (D-CVaR)

Our complexity model in this section counts computations of the form (f_i(x), ∇f_i(x)) for x ∈ 𝒳 and i ∈ [n]; we refer to one such evaluation as a single first-order query. Omitting the Lipschitz constant G and bounds on the range of the f_i’s and the size of 𝒳 for clarity, [29, Sec. 4] gave an algorithm which returns an expected ϵ-optimal point of (P-CVaR) with Õ(α^{−1}ϵ^{−2}) first-order queries, and also proved a matching lower bound up to logarithmic factors when n is sufficiently large. (To be precise, [29] gives an Õ(α^{−1}ϵ^{−2})-complexity high-probability bound in their Theorem 2; they do not state an Õ(α^{−1}ϵ^{−2})-complexity expected-suboptimality bound explicitly in a theorem, but they note in the text above Theorem 2 that such a bound follows immediately from Propositions 3 and 4 in their paper.) However, to the best of our knowledge, the best known complexity for obtaining an expected ϵ-optimal point of (D-CVaR) is Õ(nϵ^{−2}) via a primal-dual method based on [33]; see also [13, 32, 5]. In our main guarantee for this section, Theorem 24, we apply Algorithm 2 to obtain an expected ϵ-optimal point of (D-CVaR) with complexity Õ(α^{−1}ϵ^{−2} + n), which always improves upon or matches Õ(nϵ^{−2}) since α ∈ [1/n, 1].

Toward stating our main guarantee, we encapsulate the formal assumptions of [29, Sec. 2] in the following definition:

Definition 23 (CVaR at level α setup).

We assume 𝒳 is nonempty, closed, convex, and satisfies ‖x − y‖₂ ≤ R for all x, y ∈ 𝒳. We also assume, for all i ∈ [n], that f_i is convex, G-Lipschitz with respect to ‖·‖₂, and satisfies f_i(x) ∈ [0, M] for all x ∈ 𝒳.

We ultimately obtain the following guarantee via Algorithm 2. Note that the upper bound on ϵ in Theorem 24 is without loss of generality since if ϵ ≥ 4M, any feasible point is ϵ-optimal. We defer the proof to the full version.

Theorem 24 (Guarantee for (D-CVaR)).

In the setting of Definition 23 with target error ϵ ∈ (0, 4M) and α ∈ [1/n, 1], there exists an algorithm which computes an expected ϵ-optimal point of (D-CVaR) with complexity Õ(n + G²R²α^{−1}ϵ^{−2}).

5 Obtaining critical points of convex functions

In this section, our goal is to obtain an approximate critical point of a convex, β-smooth function h:n, given access to a gradient oracle for h. We show that our general framework yields an algorithm with the optimal query complexity for this problem. In Section 5.1, we give the formal problem definition and some important preliminaries. In Section 5.2, we give the setup for applying our main framework Algorithm 2 to this problem and a sketch of why the resulting algorithm works. In Section 5.3, we formally state the resulting algorithm for obtaining an approximate critical point of h and prove that it achieves the optimal rate using the guarantees associated with Algorithm 2.

5.1 Preliminaries for Section 5

Throughout Section 5, we fix ‖·‖ to be the standard Euclidean norm over ℝ^n. We assume h: ℝ^n → ℝ is convex, β-smooth with respect to ‖·‖, and that Δ ≔ h(x₀) − inf_{x∈ℝ^n} h(x) < ∞ for an arbitrary initialization point x₀ ∈ ℝ^n. We access h through a gradient oracle. For γ > 0, our goal will be to obtain a γ-critical point of h, i.e., a point x ∈ ℝ^n such that ‖∇h(x)‖ ≤ γ. Instead of operating on h itself, our algorithm will operate on a regularized version of h:

f(x) ≔ h(x) + (γ²/(16Δ))·‖x − x₀‖². (11)

This notation was chosen to mirror the notation of Section 3.1; f will be the primal function when we apply the framework. Let x_f⋆ denote the unique global minimizer of f. The following corollary of Lemma 13 in Appendix C in the full version summarizes the key properties of f:

Corollary 25 (Properties of the regularized function f).

We have

  1. ‖x_f⋆ − x₀‖ ≤ 4Δ/γ.

  2. If u ∈ ℝ^n is such that ‖∇f(u)‖ ≤ γ/4, then ‖∇h(u)‖ ≤ γ.

Proof.

This follows immediately from Lemma 13 in the full version with α ≔ γ²/(8Δ) and ν ≔ γ/4.

The second part of Corollary 25 says that to find a γ-critical point of h, it suffices to find a (γ/4)-critical point of f. Furthermore, a single query to ∇h clearly suffices to obtain ∇f at a point. As a result, we will focus on finding a (γ/4)-critical point of f. Corollary 25 may also be of independent interest, since it trivially allows one to achieve a gradient query complexity of O(√(βΔ)/γ) via any method which achieves query complexity O(√(β‖x₀ − x_h⋆‖/γ)) (for x_h⋆ defined as some minimizer of h over ℝ^n, assuming one exists); see Section 1.3.
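For intuition on item 1 (a one-line check of ours): comparing f at its minimizer and at x₀ gives

```latex
\frac{\gamma^2}{16\Delta}\,\|x_f^\star - x_0\|^2
  = f(x_f^\star) - h(x_f^\star)
  \le f(x_0) - h(x_f^\star)
  = h(x_0) - h(x_f^\star)
  \le \Delta
  \quad\Longrightarrow\quad
  \|x_f^\star - x_0\| \le \frac{4\Delta}{\gamma}.
```

The reduction mentioned above then follows by running the O(√(β‖x₀ − x⋆‖/γ))-type method on f with target error γ/4: its cost is O(√(β·(4Δ/γ)/(γ/4))) = O(√(βΔ)/γ) gradient queries, where f is O(β)-smooth since γ² ≤ 2βΔ implies γ²/(8Δ) ≤ β/4.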

The reason we perform this regularization before applying our framework is that it enables us to obtain a sufficiently tight bound on V_{y₀}(y⋆) (equivalently, a small enough value of B when we ultimately apply Corollary 17). It is possible to apply the framework more directly to h, but it is not clear how to do so in a way that achieves an optimal complexity.

Finally, we provide a notation guide for Section 5 in Table 1, which may be useful to reference as additional notation is introduced in Sections 5.2 and 5.3.

Table 1: Notation guide for Section 5.

Notation | Description | Section
‖·‖ | Euclidean norm | 5.1
h | Convex, β-smooth function |
γ | Target critical-point error for h |
x₀ | Arbitrary initialization point |
Δ | h(x₀) − inf_{x∈ℝ^n} h(x) < ∞ |
f(x) | h(x) + (γ²/(16Δ))‖x − x₀‖² |
x_f⋆ | The global minimizer of f |
ψ(x,y) | ⟨x,y⟩ − f^*(y) | 5.2
R | 5Δ/γ |
𝒳 | B_R^n(x₀) |
𝒴 | ℝ^n |
dgf r | f^* |
ϕ(y) | ⟨x₀,y⟩ − R‖y‖ − f^*(y) |
λ_k | 2^k/32 | 5.3
ϵ_k | Δ/(64·1.5^k) |
CGM | Fast composite gradient method oracle |

5.2 Instantiating the framework

For this application, we instantiate

ψ(x,y) ≔ ⟨x,y⟩ − f^*(y).

Recall that ψ is the Fenchel game [1, 43, 12, 23]; see Section 1.1 for a discussion of why it is a natural choice in this setting. For the rest of Section 5, we fix a (ψ, 𝒳 ≔ B_R^n(x₀), 𝒴 ≔ ℝ^n, f^*)-unbounded setup (Definition 10) with R ≔ 5Δ/γ. f^* is a valid choice for the dgf because f^* is differentiable and (β + γ²/(8Δ))^{−1}-strongly convex [38, Thm. 6.11]. The strong convexity of f^* also implies that Assumption 6 holds. Note that the associated primal function x ↦ max_{y∈ℝ^n} ψ(x,y) is precisely f^{**} = f (hence the choice of notation in (11)), and the dual function is given by

ϕ(y) = min_{x∈B_R^n(x₀)}{⟨x,y⟩ − f^*(y)} = ⟨x₀ − R·y/‖y‖, y⟩ − f^*(y) = ⟨x₀,y⟩ − R‖y‖ − f^*(y).

Next, the following lemma fulfills part of the outline given in Section 1.1 by showing that approximately optimal points for the dual objective (D) must have small norm. We defer the proof to the full version.

Lemma 26 (Bounding the norm by dual suboptimality).

If y ∈ ℝ^n is ϵ-optimal for (D) for some ϵ > 0, then ‖y‖ ≤ ϵγ/Δ.

We now derive the oracles of Definitions 12 and 13. Regarding Definition 12, for the rest of Section 5 we restrict DRPO(·) to denote a deterministic implementation of the DRPO oracle, since we can always obtain a deterministic implementation in this application. Then the following corollary is an immediate consequence of a more general lemma given in Appendix C in the full version which characterizes the properties of the Fenchel game with added dual regularization; see also Section 1.1.

Corollary 27.

The set of valid output points of DRPO(q ∈ ℝ^n, λ > 0, ϵ_p > 0) is precisely

argmin^{ϵ_p}_{x∈B_R^n(x₀)} (1+λ)·f((x + λ∇f^*(q))/(1+λ)), and
DRBR(q ∈ ℝ^n, λ > 0, x ∈ B_R^n(x₀)) = ∇f((x + λ∇f^*(q))/(1+λ)).
Proof.

Apply Lemma 14 in the full version with g ≔ f.
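To see where these expressions come from (a sketch of ours; here V_q is the Bregman divergence of the dgf f^*): expanding f_{λ,q}(x) = max_y{ψ(x,y) − λV_q(y)} and absorbing terms independent of x and y into a constant,

```latex
\begin{align*}
f_{\lambda,q}(x)
 &= \max_{y\in\mathbb{R}^n}\Big\{\langle x, y\rangle - f^*(y)
    - \lambda\big(f^*(y) - f^*(q) - \langle\nabla f^*(q),\, y - q\rangle\big)\Big\}\\
 &= \max_{y\in\mathbb{R}^n}\Big\{\langle x + \lambda\nabla f^*(q),\, y\rangle
    - (1+\lambda)\, f^*(y)\Big\} + \mathrm{const}\\
 &= (1+\lambda)\, f\!\Big(\frac{x + \lambda\nabla f^*(q)}{1+\lambda}\Big) + \mathrm{const},
\end{align*}
```

using f^{**} = f in the last step; the maximizing y is ∇f((x + λ∇f^*(q))/(1+λ)), which is exactly the DRBR expression.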

Taken together, Lemma 26 and Corollary 27 nearly immediately imply that Algorithm 2 can be applied to the above setup to obtain a (γ/4)-critical point of f (and therefore a γ-critical point of h). In particular, we will apply the schedules of Corollary 17 to certify that the output y_K of Algorithm 2 is close in distance to an ϵ-optimal point for (D) for an appropriate choice of ϵ > 0. Then Lemma 26 and a triangle inequality yield a bound on ‖y_K‖. Finally, since

y_K ≔ DRBR(q_K, Λ_K, x_K) = ∇f((x_K + Λ_K∇f^*(q_K))/(1+Λ_K))

by Corollary 27, we have that (x_K + Λ_K∇f^*(q_K))/(1+Λ_K) is an approximate critical point of f (and therefore of h). One may worry about the presence of ∇f^*(q_K) here and, more generally, the presence of ∇f^*(q) in the expressions for the oracles in Corollary 27. However, ∇f^*(·) never needs to be evaluated explicitly: per the alternate expression for q_k given in Line 3 of Algorithm 2, q_k was itself computed as the gradient of f at a point (recall the dgf is f^* and ∇f^* = (∇f)^{−1}), in which case ∇f^* simply undoes this operation by Lemma 16 in the full version.

We formalize this sketch and provide a complexity guarantee in the next section. We also reframe this sketch and treat the sequence of (x_k + Λ_k∇f^*(q_k))/(1+Λ_k) terms as our iterates (as opposed to the sequence of x_k’s), as this leads to a simpler statement and interpretation of the resulting algorithm.

5.3 The resulting algorithm and guarantee

We now formalize the sketch given at the end of the previous section, state the resulting algorithm, and provide a complexity guarantee. But first, we define a subroutine which will be used by the algorithm to implement the DRPO oracle:

Definition 28 (CGM oracle [40, 36]).

A (ζ > 0, w ∈ ℝ^n, ϵ > 0)-fast composite gradient method oracle, CGM(ζ, w, ϵ), returns an ϵ-minimizer of f over B_ζ^n(w), i.e., an element of argmin^ϵ_{x∈B_ζ^n(w)} f(x), using at most O(1 + √(βζ²/ϵ)) queries to ∇f.

For example, implementations with a small constant can be found in [40] or [36, Sec. 6.1.3]. The implementation of the CGM oracle falls under fast gradient methods for composite minimization: letting g denote a convex, β-smooth function and Ψ a “simple regularizer” (a quadratic in our case), the goal is to minimize g̃(x) ≔ g(x) + Ψ(x) with the same complexity as it takes to minimize g. The domain constraint can also be baked into the regularizer Ψ by adding an indicator.
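As an illustration of this style of oracle, here is a minimal sketch of ours (not the method of [40, 36]): accelerated projected gradient descent (FISTA with the ball indicator as the “simple regularizer”), whose O(βζ²/N²) rate matches the query count of Definition 28.

```python
import numpy as np

def cgm_ball(grad_f, beta, zeta, w, num_iters):
    """Illustrative CGM-style oracle: FISTA on a beta-smooth convex f over
    the ball B_zeta(w). After N iterations the suboptimality is
    O(beta * zeta^2 / N^2), so eps accuracy needs
    N = O(sqrt(beta * zeta^2 / eps)) gradient queries."""
    def project(x):  # Euclidean projection onto the ball B_zeta(w)
        d = x - w
        nrm = np.linalg.norm(d)
        return x if nrm <= zeta else w + (zeta / nrm) * d

    x_prev = np.copy(w)
    y = np.copy(w)
    t = 1.0
    for _ in range(num_iters):
        x = project(y - grad_f(y) / beta)              # projected gradient step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)    # momentum extrapolation
        x_prev, t = x, t_next
    return x_prev
```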

Our method for computing a γ-critical point of h is given in Algorithm 4, with the associated guarantee in Theorem 30. We note that the decision to introduce the equivalent notation z₀ for x₀ in Line 1 is aesthetic (to make Line 5 simpler to state and interpret). Furthermore, we state Algorithm 4 for general schedules (λ_k)_{k=0}^{K−1} and (ϵ_k)_{k=1}^{K} for clarity, but ultimately we will choose the schedules given in Theorem 30, which correspond to particularizing the schedules of Corollary 17 to this setting. With this choice of schedules, Λ_k ≈ 2^k and ϵ_k ∝ Δ/1.5^k, so that ϵ_k/(1+Λ_k) ≈ Δ/3^k. As a result, Algorithm 4 can be interpreted as optimizing f in a sequence of balls where the radius and target error are both decreasing geometrically, and the center is a convex combination of the past iterates. While we choose the iteration count K to be logarithmic in the problem parameters, we avoid multiplicative logarithmic factors in the total complexity because the ratio ζ²/ϵ in the complexity of the CGM oracle call (to borrow the notation of Definition 28) in Line 5 of Algorithm 4 is ≈ R²·3^k/(4^k·Δ) at the k-th iteration, meaning it is collapsing geometrically.
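Summing the per-round CGM costs under these schedules makes the geometric collapse explicit (our arithmetic, suppressing constants): with per-round radius ζ_k ≈ R/2^k and per-round target error ≈ Δ/3^k,

```latex
\sum_{k=1}^{K} \sqrt{\frac{\beta\,\zeta_k^2}{\epsilon_k/(1+\Lambda_k)}}
 \approx \sum_{k=1}^{K} \sqrt{\frac{\beta R^2\, 3^k}{4^k\, \Delta}}
 = \sqrt{\frac{\beta R^2}{\Delta}}\sum_{k=1}^{K}\Big(\frac{\sqrt{3}}{2}\Big)^{k}
 = O\!\Big(\sqrt{\frac{\beta R^2}{\Delta}}\Big)
 = O\!\Big(\frac{\sqrt{\beta\Delta}}{\gamma}\Big),
```

where the last step uses R = 5Δ/γ.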

Algorithm 4 Algorithm for obtaining a γ-critical point of h.
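The algorithm body is omitted in this excerpt; the following sketch reconstructs it from Lemma 29 and the discussion above (our reconstruction under those stated correspondences, not the paper’s verbatim pseudocode):

```python
def critical_point(cgm, x0, R, lambdas, epsilons, K):
    """Sketch of Algorithm 4 (reconstructed via Lemma 29): round k calls the
    CGM oracle of Definition 28 on f over a geometrically shrinking ball
    centered at a convex combination of past iterates; returns z_K."""
    z = [x0]                                  # z_0 := x_0 (Line 1)
    lam_sum = 0.0                             # Lambda_k = sum_{j<k} lambda_j
    weighted = 0.0 * x0                       # sum_{j<k} lambda_j * z_j
    for k in range(1, K + 1):
        lam_sum += lambdas[k - 1]
        weighted = weighted + lambdas[k - 1] * z[k - 1]
        center = (x0 + weighted) / (1.0 + lam_sum)   # convex comb. of z_0..z_{k-1}
        z_k = cgm(R / (1.0 + lam_sum),               # radius shrinks ~ 2^{-k}
                  center,
                  epsilons[k - 1] / (1.0 + lam_sum)) # error shrinks ~ 3^{-k}
        z.append(z_k)
    return z[-1]                              # z_K: gamma-critical for h
```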

Toward analyzing Algorithm 4, we first connect the sequence of iterates zk produced by Algorithm 4 to the sequence of iterates xk,yk,qk produced by Algorithm 2 with the same input parameters. Namely, we are formalizing the comment made at the end of Section 5.2 about reframing the sequence of iterates to achieve a more interpretable algorithm. We defer the proof to the full version.

Lemma 29 (Connecting Algorithm 4 to Algorithm 2).

Consider Algorithm 2 with input given by a (ψ, B_R^n(x₀), ℝ^n, f^*)-unbounded setup (Definition 10); y₀ ≔ ∇f(x₀); and K, (ϵ_k)_{k=1}^K, and (λ_k)_{k=0}^{K−1} as in Algorithm 4. Then, letting (z_k)_{k=0}^K denote the sequence of iterates generated by Algorithm 4, the following are valid sequences of iterates for Algorithm 2:

q_k = ∇f((1/Λ_k)·Σ_{j=0}^{k−1} λ_j z_j) for k ∈ [K], (12)
x_k = (1+Λ_k)·z_k − Σ_{j=0}^{k−1} λ_j z_j for k ∈ [K], and (13)
y_k = ∇f(z_k) for k ∈ {0} ∪ [K]. (14)

Having connected Algorithm 4 to Algorithm 2, we can apply the schedules given in Corollary 17 to show that Algorithm 4 returns a γ-critical point of h with an optimal complexity. We defer the proof to the full version.

Theorem 30 (Guarantee for Algorithm 4).

For any γ ∈ (0, √(2βΔ)) and with K = O(log(βΔ/γ²)), the output of Algorithm 4 with schedules given by

λ_k = 2^k/32 for k ∈ {0} ∪ [K−1] and ϵ_k = Δ/(64·1.5^k) for k ∈ [K] (15)

satisfies ‖∇h(z_K)‖ ≤ γ, and the algorithm makes at most O(√(βΔ)/γ) gradient queries to h. (The restriction on γ is without loss of generality since ‖∇h(x₀)‖ ≤ √(2βΔ) by smoothness; we add it because it simplifies the analysis.)

References

  • [1] Jacob D. Abernethy and Jun-Kun Wang. On Frank-Wolfe and equilibrium computation. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • [2] Ilan Adler. The equivalence of linear programs and zero-sum games. International Journal of Game Theory, 42:165–177, February 2013.
  • [3] Zeyuan Allen-Zhu and Elad Hazan. Optimal black-box reductions between optimization objectives. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 1614–1622, Red Hook, NY, USA, 2016. Curran Associates Inc.
  • [4] Hilal Asi, Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford. Stochastic bias-reduced gradient methods. In Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS ’21, Red Hook, NY, USA, 2024. Curran Associates Inc.
  • [5] Yair Carmon and Danielle Hausler. Distributionally robust optimization via ball oracle acceleration. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2024. Curran Associates Inc.
  • [6] Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford. Thinking inside the ball: Near-optimal minimization of the maximal loss. In Mikhail Belkin and Samory Kpotufe, editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 866–882. PMLR, 15–19 August 2021. URL: http://proceedings.mlr.press/v134/carmon21a.html.
  • [7] Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford. RECAPP: crafting a more efficient catalyst for convex optimization. In International Conference on Machine Learning, 2022.
  • [8] Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford. A Whole New Ball Game: A Primal Accelerated Method for Matrix Games and Minimizing the Maximum of Smooth Functions, pages 3685–3723. Society for Industrial and Applied Mathematics, 2024. doi:10.1137/1.9781611977912.130.
  • [9] Yair Carmon, Yujia Jin, Aaron Sidford, and Kevin Tian. Variance reduction for matrix games. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • [10] Kenneth L. Clarkson, Elad Hazan, and David P. Woodruff. Sublinear optimization for machine learning. J. ACM, 59(5), November 2012. doi:10.1145/2371656.2371658.
  • [11] Michael B. Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 938–942, New York, NY, USA, 2019. Association for Computing Machinery. doi:10.1145/3313276.3316303.
  • [12] Michael B. Cohen, Aaron Sidford, and Kevin Tian. Relative lipschitzness in extragradient methods and a direct recipe for acceleration. In James R. Lee, editor, 12th Innovations in Theoretical Computer Science Conference, ITCS 2021, January 6-8, 2021, Virtual Conference, volume 185 of LIPIcs, pages 62:1–62:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ITCS.2021.62.
  • [13] Sebastian Curi, Kfir Y. Levy, Stefanie Jegelka, and Andreas Krause. Adaptive sampling for stochastic risk-averse learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc.
  • [14] G. B. Dantzig. Linear programming and extensions, 1953.
  • [15] Jelena Diakonikolas and Puqian Wang. Potential function-based framework for minimizing gradients in convex and min-max optimization. SIAM Journal on Optimization, 32(3):1668–1697, 2022. doi:10.1137/21M1395302.
  • [16] John Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49, June 2021.
  • [17] I. Ekeland and Roger Temam. Convex analysis and variational problems. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1999.
  • [18] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pages 2540–2548. JMLR.org, 2015. URL: http://proceedings.mlr.press/v37/frostig15.html.
  • [19] Michael B. Giles. Multilevel Monte Carlo path simulation. Operations Research, 56(3):607–617, June 2008. doi:10.1287/OPRE.1070.0496.
  • [20] Michael B. Giles. Multilevel Monte Carlo methods. Acta Numerica, 24:259–328, May 2015. doi:10.1017/S096249291500001X.
  • [21] G. N. Grapiglia and Yurii Nesterov. Tensor methods for finding approximate stationary points of convex functions. Optimization Methods and Software, 37(2):605–638, 2022. doi:10.1080/10556788.2020.1818082.
  • [22] Michael D. Grigoriadis and Leonid G. Khachiyan. A sublinear-time randomized approximation algorithm for matrix games. Operations Research Letters, 18(2):53–58, 1995. doi:10.1016/0167-6377(95)00032-0.
  • [23] Yujia Jin, Aaron Sidford, and Kevin Tian. Sharper rates for separable minimax and finite sum optimization via primal-dual extragradient methods. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 4362–4415. PMLR, 02–05 July 2022. URL: https://proceedings.mlr.press/v178/jin22b.html.
  • [24] Donghwan Kim and Jeffrey A. Fessler. Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions. Journal of Optimization Theory and Applications, 188(1):192–219, January 2021. doi:10.1007/S10957-020-01770-2.
  • [25] Jaeyeon Kim, Asuman Ozdaglar, Chanwoo Park, and Ernest K. Ryu. Time-reversed dissipation induces duality between minimizing gradient norm and function value, 2023. arXiv:2305.06628.
  • [26] Jaeyeon Kim, Chanwoo Park, Asuman Ozdaglar, Jelena Diakonikolas, and Ernest K. Ryu. Mirror duality in convex optimization, 2024. arXiv:2311.17296.
  • [27] Guanghui Lan, Yuyuan Ouyang, and Zhe Zhang. Optimal and parameter-free gradient minimization methods for convex and nonconvex optimization, 2023. arXiv:2310.12139.
  • [28] Jongmin Lee, Chanwoo Park, and Ernest K. Ryu. A geometric structure of acceleration and its role in making gradients small fast. In Neural Information Processing Systems, 2021.
  • [29] Daniel Levy, Yair Carmon, John C Duchi, and Aaron Sidford. Large-scale methods for distributionally robust optimization. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 8847–8860. Curran Associates, Inc., 2020.
  • [30] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pages 3384–3392, Cambridge, MA, USA, 2015. MIT Press.
  • [31] M. Minsky and S. Papert. Perceptrons: An introduction to computational geometry, 1988.
  • [32] Hongseok Namkoong and John C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 2216–2224, Red Hook, NY, USA, 2016. Curran Associates Inc.
  • [33] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009. doi:10.1137/070704277.
  • [34] A. S. Nemirovsky. On optimality of Krylov’s information when solving linear operator equations. Journal of Complexity, 7(2):121–130, 1991. doi:10.1016/0885-064X(91)90001-E.
  • [35] A. S. Nemirovsky. Information-based complexity of linear operator equations. Journal of Complexity, 8(2):153–175, 1992. doi:10.1016/0885-064X(92)90013-2.
  • [36] Yurii Nesterov. Lectures on Convex Optimization. Springer Publishing Company, Incorporated, 2nd edition, 2018.
  • [37] Yurii Nesterov, Alexander Gasnikov, Sergey Guminov, and Pavel Dvurechensky. Primal-dual accelerated gradient methods with small-dimensional relaxation oracle. Optimization Methods and Software, 36(4):773–810, July 2021. doi:10.1080/10556788.2020.1731747.
  • [38] Francesco Orabona. A modern introduction to online learning, 2023. arXiv:1912.13213.
  • [39] Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8:171–176, 1958.
  • [40] Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization, 2008.
  • [41] J. v. Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, December 1928.
  • [42] Jan van den Brand, Yin Tat Lee, Yang P. Liu, Thatchaphol Saranurak, Aaron Sidford, Zhao Song, and Di Wang. Minimum cost flows, MDPs, and $\ell_1$-regression in nearly linear time for dense instances. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2021, pages 859–869, New York, NY, USA, 2021. Association for Computing Machinery.
  • [43] Jun-Kun Wang, Jacob Abernethy, and Kfir Y. Levy. No-regret dynamics in the fenchel game: a unified framework for algorithmic convex optimization. Mathematical Programming, 205(1-2):203–268, May 2024. doi:10.1007/S10107-023-01976-Y.