
Extracting Dual Solutions via Primal Optimizers

Yair Carmon (Tel Aviv University, Israel), Arun Jambulapati (University of Michigan, Ann Arbor, MI, USA), Liam O’Carroll (Stanford University, CA, USA), Aaron Sidford (Stanford University, CA, USA)
Abstract

We provide a general method to convert a “primal” black-box algorithm for solving regularized convex-concave minimax optimization problems into an algorithm for solving the associated dual maximin optimization problem. Our method adds recursive regularization over a logarithmic number of rounds where each round consists of an approximate regularized primal optimization followed by the computation of a dual best response. We apply this result to obtain new state-of-the-art runtimes for solving matrix games in specific parameter regimes, obtain improved query complexity for solving the dual of the CVaR distributionally robust optimization (DRO) problem, and recover the optimal query complexity for finding a stationary point of a convex function.

Keywords and phrases:
Minimax optimization, black-box optimization, matrix games, distributionally robust optimization
Funding:
Yair Carmon: YC acknowledges support from the Israeli Science Foundation (ISF) grant no. 2486/21, and the Alon Fellowship.
Liam O’Carroll: LO acknowledges support from NSF Grant CCF-1955039.
Aaron Sidford: AS acknowledges support from a Microsoft Research Faculty Fellowship, NSF CAREER Grant CCF-1844855, NSF Grant CCF-1955039, and a PayPal research award.
Copyright and License:
© Yair Carmon, Arun Jambulapati, Liam O’Carroll, and Aaron Sidford; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Mathematical optimization
Related Version:
Full Version: http://arxiv.org/abs/2412.02949
Editor:
Raghu Meka

1 Introduction

We consider the foundational problem of efficiently solving convex-concave games. For nonempty, closed, convex constraint sets 𝒳 ⊆ ℝ^d and 𝒴 ⊆ ℝ^n and a differentiable convex-concave objective function ψ: ℝ^d × ℝ^n → ℝ (namely, ψ(·,y) is convex for any fixed y and ψ(x,·) is concave for any fixed x), we consider the following primal, minimax optimization problem (P) and its associated dual, maximin optimization problem (D):

minimize_{x∈𝒳} f(x) for f(x) ≔ max_{y∈𝒴} ψ(x,y), and (P)
maximize_{y∈𝒴} ϕ(y) for ϕ(y) ≔ min_{x∈𝒳} ψ(x,y). (D)

If additionally 𝒳 and 𝒴 are bounded (which we assume for simplicity in the introduction but generalize later), every pair of primal and dual optimizers x⋆ ∈ argmin_{x∈𝒳} f(x) and y⋆ ∈ argmax_{y∈𝒴} ϕ(y) satisfies the minimax principle: f(x⋆) = ϕ(y⋆) = ψ(x⋆, y⋆).

Convex-concave games are pervasive in algorithm design, machine learning, data analysis, and optimization. For example, the games induced by bilinear objectives, i.e., ψ(x,y) = x^⊤Ay + b^⊤x + c^⊤y, where 𝒳 and 𝒴 are either the simplex, Δ^k ≔ {x ∈ ℝ^k_{≥0} : ‖x‖₁ = 1}, or the Euclidean ball, B^k ≔ {x ∈ ℝ^k : ‖x‖₂ ≤ 1}, encompass zero-sum games, linear programming, hard-margin support vector machines (SVMs), and minimum enclosing/maximum inscribed ball [14, 2, 31, 10]. Additionally, the case when ψ(x,y) = Σ_{i=1}^n y_i f_i(x) for some functions f_i: ℝ^d → ℝ and 𝒴 a subset of the simplex encompasses a variety of distributionally robust optimization (DRO) problems [29, 5] and (for 𝒴 = Δ^n) the problem of minimizing the maximum loss [6, 8, 4].

In this paper, we study the following question:

Given only a black-box oracle which solves (regularized versions of) (P) to ϵ accuracy, and a black-box oracle for computing an exact dual best response y_x ∈ argmax_{y∈𝒴} ψ(x,y) to any primal point x ∈ 𝒳, can we extract an ϵ-optimal solution to (D)?

We develop a general dual-extraction framework which answers this question in the affirmative. We show that as long as these oracles can be implemented as cheaply as obtaining an ϵ-optimal point of (P), then our framework can obtain an ϵ-optimal point of (D) at the same cost as that of obtaining an ϵ-optimal point of (P), up to logarithmic factors. We then instantiate our framework to obtain new state-of-the-art results in the settings of bilinear matrix games and DRO. Finally, as evidence of its broader applicability, we show that our framework can be used to recover the optimal complexity for computing a stationary point of a smooth convex function.

In the remainder of the introduction we describe our results in greater detail (Section 1.1), give an overview of the dual extraction framework and its analysis (Section 1.2), discuss related work (Section 1.3), and provide a roadmap for the remainder of the paper (Section 1.4).

1.1 Our results

From primal algorithms to dual optimization

We give a general framework which obtains an ϵ-optimal solution to (D) via a sequence of calls to two black-box oracles: (i) an oracle for obtaining an ϵ-optimal point of a regularized version of (P), and (ii) an oracle for obtaining a dual best response y_x ∈ argmax_{y∈𝒴} ψ(x,y) for a given x ∈ 𝒳. In particular, we show it is always possible to obtain an ϵ-optimal point of (D) with at most a logarithmic number of calls to each of these oracles, where the regularized primal optimization oracle is always called to an accuracy of ϵ over a logarithmic factor. We also provide an alternate scheme (or, more specifically, choice of parameters) for applications where the cost of obtaining an ϵ-optimal point of the regularized primal problem decreases sufficiently as the regularization level increases. In such cases, e.g., in our stationary point application, it is possible to avoid even logarithmic factor increases in computational complexity for approximately solving (D) relative to the complexity of approximately solving (P).

Application 1: Bilinear matrix games

In this application, ψ(x,y) ≔ x^⊤Ay for a matrix A ∈ ℝ^{d×n}, 𝒴 is the simplex Δ^n, and 𝒳 is either the simplex Δ^d or the unit Euclidean ball B^d. Recently, [8] gave a new state-of-the-art runtime in certain parameter regimes of Õ(nd + n(d/ϵ)^{2/3} + d/ϵ²) for obtaining an expected ϵ-optimal point for the primal problem (P) in this setup. However, unlike previous algorithms for bilinear matrix games (see Section 1.3 for details), their algorithm does not return an ϵ-optimal solution for the dual (D), and their runtime is not symmetric in the dimensions n and d. As a result, it was unclear whether the same runtime is achievable for obtaining an ϵ-optimal solution of the dual (D). We resolve this question by applying our general framework to achieve an expected ϵ-optimal point of (D) with runtime Õ(nd + n(d/ϵ)^{2/3} + d/ϵ²). We then observe (see Corollary 22) that in the setting where 𝒳 = Δ^d, our result can equivalently be viewed as a new state-of-the-art runtime of Õ(nd + d(n/ϵ)^{2/3} + n/ϵ²) for obtaining an ϵ-optimal point of the primal (P), due to the symmetry of ψ and the constraint sets.

Application 2: CVaR at level 𝜶 DRO

In this application, ψ(x,y) ≔ Σ_{i=1}^n y_i f_i(x) for convex, bounded, and Lipschitz loss functions f_i: ℝ^d → ℝ, 𝒳 is a convex, compact set, and 𝒴 ≔ {y ∈ Δ^n : y ≤ 𝟏/(αn)} is the CVaR at level α uncertainty set for α ∈ [1/n, 1]. The primal (P) is a canonical and well-studied DRO problem, and corresponds to minimizing the average of the top α fraction of the losses. We consider this problem given access to a first-order oracle that, when queried at x ∈ ℝ^d and i ∈ [n], outputs (f_i(x), ∇f_i(x)). Ignoring dependencies other than α, the target accuracy ϵ > 0, and the number of losses n for brevity, [29] gave matching upper and lower bounds (up to logarithmic factors) of Õ(α^{−1}ϵ^{−2}) queries to obtain an expected ϵ-optimal point of the primal (P). However, the best known query complexity for obtaining an expected ϵ-optimal point of the dual (D) was Õ(nϵ^{−2}) prior to this paper (see Section 1.3 for details). Applying our general framework to this setting, we obtain an algorithm with a new state-of-the-art query complexity of Õ(α^{−1}ϵ^{−2} + n) for obtaining an expected ϵ-optimal point of the dual (D). In particular, note that this complexity is nearly linear in n when ϵ ≥ (αn)^{−1/2}.

Application 3: Obtaining stationary points of convex functions

In this application, we show that our framework yields an alternative optimal approach for computing an approximate critical point of a smooth convex function given a gradient oracle. Specifically, for γ > 0 and convex and β-smooth h: ℝ^n → ℝ, in Section 5 we give an algorithm which computes x ∈ ℝ^n such that ‖∇h(x)‖₂ ≤ γ using O(√(βΔ)/γ) gradient queries, where Δ ≔ h(x₀) − inf_{x∈ℝ^n} h(x) is the initial suboptimality. While this optimal complexity has been achieved before [24, 37, 15, 28, 27], the fact that we achieve it as a consequence of our general framework illustrates its broad applicability.

For this application, we instantiate our framework with ψ(x,y) ≔ ⟨x,y⟩ − h^*(y), where h^*: ℝ^n → ℝ denotes the convex conjugate of h. (For reasons discussed in Section 5, we actually first substitute an appropriately regularized version of h, call it f, for h before applying the framework, but the following discussion still holds with respect to f.) This objective function ψ is known as the Fenchel game and has been used in the past to recover classic convex optimization algorithms (e.g., the Frank-Wolfe algorithm and Nesterov’s accelerated methods) via a minimax framework [1, 43, 12, 23]. In the Fenchel game, a dual best response corresponds to a gradient evaluation:

argmax_{y∈ℝ^n} {⟨x,y⟩ − h^*(y)} = ∇h(x),

and we show that approximately optimal points for the dual objective (D) must have small norm. As a result, obtaining an approximately optimal dual point y as a best response to a primal point x yields a bound on the norm of y = ∇f(x). Furthermore, we note that in this setting, adding regularization to ψ with respect to an appropriate choice of distance-generating function (namely h^*) is equivalent to rescaling and recentering the primal function f, as well as the point at which a gradient is taken in the dual best response computation (cf. Lemma 14 in the full version). Thus, the properties of the Fenchel game extend naturally to appropriately regularized versions of ψ.

1.2 Overview of the framework and analysis

We now give an overview of the dual-extraction framework. Our framework applies generally under a set of assumptions given in Section 3.1 (cf. Definition 9), but for now we specialize to the assumptions given above, namely: (i) the constraint sets 𝒳 and 𝒴 are nonempty, compact, and convex; and (ii) ψ is differentiable and convex-concave. Throughout this section, let ‖·‖ denote any norm on ℝ^n and assume that the dual function ϕ is L-Lipschitz with respect to ‖·‖. (This is a weak assumption since we ensure at most a logarithmic dependence on L; see Remark 5.) Let r: ℝ^n → ℝ denote a differentiable distance-generating function (dgf) which is μ_r-strongly convex with respect to ‖·‖ for some μ_r > 0 (Section 2 gives the general setup for a distance-generating function, which also covers the case where dom r ≠ ℝ^n), and let V_u(v) ≔ r(v) − r(u) − ⟨∇r(u), v − u⟩ denote the associated Bregman divergence. For the sake of illustration, it may be helpful to consider the choices ‖·‖ ≔ ‖·‖₂, r(u) ≔ ½‖u‖₂², μ_r = 1, and V_u(v) = ½‖u − v‖₂² in the following, in which case relative strong convexity with respect to r is equivalent to the standard notion of strong convexity with respect to ‖·‖₂.

How should we obtain an ϵ-optimal point for (D) using the two oracles discussed previously, namely: (i) an oracle for approximately solving a regularized primal objective, and (ii) an oracle for computing a dual best response? We call (i) a dual-regularized primal optimization (DRPO) oracle and (ii) a dual-regularized best response (DRBR) oracle; their formal definitions are given in Section 3.1. Note that to solve (D), one cannot simply solve the primal problem (P) to high accuracy and then compute a dual best response. Consider ψ(x,y) = xy with 𝒳 = 𝒴 = [−1, 1]; clearly x⋆ = y⋆ = 0, but for any x arbitrarily close to x⋆, the dual best response is either −1 or 1.
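To make this failure mode concrete, here is a minimal numerical sketch of the example (the helper `best_response` is ours, not from the paper):

```python
import numpy as np

def best_response(x, ys=np.linspace(-1.0, 1.0, 2001)):
    """Exact dual best response for psi(x, y) = x * y over y in [-1, 1]."""
    return ys[np.argmax(x * ys)]

# x = +/-1e-9 is essentially primal-optimal (x_star = 0), yet the best
# response jumps to a boundary point, far from y_star = 0.
for x in [1e-9, -1e-9]:
    print(f"x = {x:+.0e} -> best response y = {best_response(x):+.1f}")
# x = +1e-09 -> best response y = +1.0
# x = -1e-09 -> best response y = -1.0
```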

The key observation underlying our framework is that if ψ(x,·) is strongly concave for a given x ∈ 𝒳, it is possible to upper bound the distance between the best response y_x ≔ argmax_{y∈𝒴} ψ(x,y) and the dual optimum y⋆ in terms of the primal suboptimality of x. Figure 1 illustrates why this should be the case when we subtract a quadratic regularizer in y (so that ψ(x,·) is strongly concave) from the preceding example of ψ(x,y) = xy. We generalize this intuition in the following lemma (replacing strong concavity with relative strong concavity and a distance bound with a divergence bound), which is itself generalized further and proven in Section 3:

Figure 1 (panels (a) and (b)): An example to give intuition behind Lemma 1. Here, ψ(x,y) = xy − 0.8y², (x⋆, y⋆) = (0, 0), x = 0.8, and y_x = 0.5. To see why it is possible to bound |y⋆ − y_x| in terms of the primal suboptimality f(x) − f(x⋆), note that by the strong concavity of ψ(x,·) and the fact that y_x is the maximizer of ψ(x,·) over 𝒴, we can upper bound |y⋆ − y_x| in terms of ψ(x, y_x) − ψ(x, y⋆) (the vertical drop over the green line) via a standard strong-concavity inequality. In turn, ψ(x, y_x) − ψ(x, y⋆) can be upper bounded by ψ(x, y_x) − ψ(x⋆, y⋆) = f(x) − f(x⋆) (the vertical drop over the green line plus the vertical drop over the red line) due to the fact that ψ(x⋆, y⋆) ≤ ψ(x, y⋆) by the optimality of x⋆.
Lemma 1 (Lemma 3 from the full version specialized).

For a given x ∈ 𝒳, suppose ψ(x,·) is μ-strongly concave relative to the dgf r for some μ > 0. Then y_x ≔ argmax_{y∈𝒴} ψ(x,y) satisfies

V_{y_x}(y⋆) ≤ (f(x) − f(x⋆))/μ.
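For intuition, the bound amounts to chaining the two inequalities described in the caption of Figure 1; in our notation (a sketch, not the full-version proof):

```latex
\begin{align*}
\mu\, V_{y_x}(y^\star)
  &\le \psi(x, y_x) - \psi(x, y^\star)
     && \text{($\psi(x,\cdot)$ $\mu$-strongly concave rel.\ $r$; $y_x$ maximizes it over $\mathcal{Y}$)} \\
  &\le \psi(x, y_x) - \psi(x^\star, y^\star)
     && \text{(since $\psi(x^\star, y^\star) = \min_{x'} \psi(x', y^\star) \le \psi(x, y^\star)$)} \\
  &= f(x) - f(x^\star)
     && \text{(as $f(x) = \psi(x, y_x)$ and $f(x^\star) = \psi(x^\star, y^\star)$).}
\end{align*}
```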

A first try

In particular, Lemma 1 suggests the following approach: define “dual-regularized” versions of ψ, ϕ, f as follows for λ > 0 and y₀ ∈ 𝒴:

ψ₁(x,y) ≔ ψ(x,y) − λV_{y₀}(y),
f₁(x) ≔ max_{y∈𝒴} ψ₁(x,y),
ϕ₁(y) ≔ min_{x∈𝒳} ψ₁(x,y).

(Here, the subscript 1 denotes one level of regularization and will be extended later.) For any x ∈ 𝒳, note that ψ₁(x,·) is λ-strongly concave relative to r, in which case Lemma 1 applied to ψ₁ yields

V_{y_x}(y₁⋆) ≤ (f₁(x) − f₁(x₁⋆))/λ, (1)

for y₁⋆ ∈ argmax_{y∈𝒴} ϕ₁(y), x₁⋆ ∈ argmin_{x∈𝒳} f₁(x), and y_x ≔ argmax_{y∈𝒴} ψ₁(x,y). Then note

ϕ(y₁⋆) ≥ ϕ₁(y₁⋆) ≥ ϕ₁(y⋆) = min_{x∈𝒳}{ψ(x,y⋆) − λV_{y₀}(y⋆)} = ϕ(y⋆) − λV_{y₀}(y⋆), (2)

where the first inequality follows since ϕ ≥ ϕ₁ pointwise, and the second since y₁⋆ maximizes ϕ₁. Then by the L-Lipschitzness of ϕ and the μ_r-strong convexity of r, it is straightforward to bound the suboptimality of y_x as

ϕ(y⋆) − ϕ(y_x) ≤ λV_{y₀}(y⋆) + L·√(2(f₁(x) − f₁(x₁⋆))/(μ_r λ)). (3)

Consequently, an ϵ-optimal point for (D) can be obtained via our oracles as follows: set λ ≔ ϵ/(2V_{y₀}(y⋆)), and use the DRPO oracle on the regularized primal problem to obtain x ∈ 𝒳 such that

f₁(x) − f₁(x₁⋆) ≤ ϵ³μ_r/(16L²V_{y₀}(y⋆)). (4)

Then the best response to x with respect to ψ₁, namely y_x ≔ argmax_{y∈𝒴} ψ₁(x,y), is ϵ-optimal by (3). However, a typical setting in our applications is V_{y₀}(y⋆) = Ω(1), μ_r = 1, and L ≥ 1, in which case ensuring (4) requires solving the regularized primal problem to O(ϵ³) error.
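As a quick check of this accuracy requirement (our arithmetic): substituting λ = ϵ/(2V_{y₀}(y⋆)) and the bound (4) into (3) makes each term at most ϵ/2,

```latex
\begin{align*}
\lambda V_{y_0}(y^\star) &= \frac{\epsilon}{2V_{y_0}(y^\star)} \cdot V_{y_0}(y^\star)
  = \frac{\epsilon}{2}, \\
L\sqrt{\frac{2\,(f_1(x) - f_1(x_1^\star))}{\mu_r \lambda}}
 &\le L\sqrt{\frac{2}{\mu_r}
      \cdot \frac{\epsilon^3 \mu_r}{16 L^2 V_{y_0}(y^\star)}
      \cdot \frac{2 V_{y_0}(y^\star)}{\epsilon}}
  = L\sqrt{\frac{\epsilon^2}{4L^2}} = \frac{\epsilon}{2},
\end{align*}
```

so the right-hand side of (3) is at most ϵ, certifying that y_x is ϵ-optimal.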

Recursive regularization and the dual-extraction framework

To lower the accuracy requirements, we apply dual regularization recursively. A key issue with the preceding argument is that it required a nontrivial bound on V_{y₀}(y⋆). However, it provided us with a nontrivial bound (1) on V_{y_x}(y₁⋆), the “level-one equivalent” of V_{y₀}(y⋆). This suggests solving f₁ to lower accuracy while still obtaining a bound on V_{y_x}(y₁⋆) due to (1), and then adding regularization centered at y_x with a larger value of λ. Indeed, our framework recursively repeats this process until the total regularization is large enough so that (a term similar to) the right-hand side of (3) can be bounded by ϵ, despite never needing to solve a regularized primal problem to high accuracy.

To more precisely describe our approach, let ψ₀ ≔ ψ, f₀ ≔ f, ϕ₀ ≔ ϕ. Over iterations k = 1, 2, …, K, our framework implicitly constructs a sequence of convex-concave games ψ_k: ℝ^d × ℝ^n → ℝ, along with corresponding primal and dual functions f_k: 𝒳 → ℝ and ϕ_k: 𝒴 → ℝ respectively, as follows:

ψ_k(x,y) ≔ ψ_{k−1}(x,y) − λ_{k−1}V_{y_{k−1}}(y),  f_k(x) ≔ max_{y∈𝒴} ψ_k(x,y),  ϕ_k(y) ≔ min_{x∈𝒳} ψ_k(x,y). (5)

Here, (λ_k > 0)_{k=0}^{K−1} is a dual-regularization schedule given as input to the framework, and (y_k ∈ 𝒴)_{k=0}^{K} is a sequence of dual-regularization “centers” generated by the algorithm, with y₀ given as input. For k ∈ {0} ∪ [K], it will be useful to let y_k⋆ denote a maximizer of ϕ_k over 𝒴 and x_k⋆ a minimizer of f_k over 𝒳, with y₀⋆ ≔ y⋆ and x₀⋆ ≔ x⋆ in particular.

Over the K rounds of recursive dual regularization, we aim to balance two goals:

  • On the one hand, we want λ_k to increase quickly so that ψ_k(x,·) is very strongly concave relative to r, thereby allowing us to apply Lemma 1 with a larger strong concavity constant.

  • On the other hand, we want to maintain the invariant that, roughly speaking, y_k⋆ is always ϵ/2-optimal for the original dual ϕ. Indeed, we were constrained to choose λ in (2) on the order of ϵ/V_{y₀}(y⋆) to ensure y₁⋆ is ϵ/2-optimal for ϕ. A similar “constraint” on the dual-regularization schedule (λ_k)_{k=0}^{K−1} appears when (2) is extended to additional levels of regularization. This prevents us from increasing λ_k too quickly.

In all the applications in this paper we choose λ_k ≔ 2λ_{k−1}, while λ₀ typically must remain on the order of ϵ/V_{y₀}(y⋆) due to the second point.

Pseudocode of the framework is given in Algorithm 1. Each successive dual-regularization center y_k is computed via the DRBR oracle (Line 5) as a best response to a primal point x_k obtained via the DRPO oracle (Line 4). In Section 3, we generalize Algorithm 1 (cf. Algorithm 2) in several ways: (i) we allow for stochasticity in the DRPO oracle; (ii) we allow for distance-generating functions r such that dom r ≠ ℝ^n; (iii) we give different but equivalent characterizations of x_k and y_k which facilitate the derivation of explicit expressions for the DRPO and DRBR oracles in applications.

Algorithm 1 Dual-extraction framework (Algorithm 2 specialized).
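The pseudocode body does not render in this version; the following minimal sketch reconstructs the loop from the surrounding description and the proof of Theorem 2 (the callables `drpo` and `drbr` are assumed interfaces for the oracles of Section 3.1, not code from the paper):

```python
def dual_extraction(drpo, drbr, y0, lambdas, epsilons, K):
    """Sketch of Algorithm 1 (deterministic oracles, dom r = R^n).

    drpo(centers, lams, eps): an eps-minimizer over X of
        f_k(x) = max_{y in Y} [ psi(x, y) - sum_j lams[j] * V_{centers[j]}(y) ]
    drbr(centers, lams, x):   the exact argmax_y of the same regularized
        objective at the point x.
    """
    ys = [y0]                                 # dual-regularization centers
    for k in range(1, K + 1):
        # Line 4: approximately solve the dual-regularized primal problem
        # f_k to accuracy epsilon_k.
        x_k = drpo(ys, lambdas[:k], epsilons[k - 1])
        # Line 5: best response to x_k under the same regularization; this
        # becomes the next regularization center y_k.
        ys.append(drbr(ys, lambdas[:k], x_k))
    return ys[-1]                             # y_K
```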

Analysis of Algorithm 1

Theorem 2 is our main result for Algorithm 1. We then instantiate Theorem 2 with two illustrative choices of parameters in Corollaries 4 and 6, and defer the proofs of the latter to their general versions in Section 3. All of the remarks below (Remarks 3, 5, and 7) are stated with reference to the specialized results in this section (Theorem 2 and Corollaries 4 and 6, respectively), but extend immediately to the corresponding general versions (Theorem 15 and Corollaries 16 and 17, respectively).

Theorem 2 (Theorem 15 specialized).

Algorithm 1 returns y_K satisfying

V_{y_K}(u) ≤ ϵ_K/Λ_K where Λ_k ≔ Σ_{j=0}^{k−1} λ_j for k ∈ [K] (6)

and u ∈ 𝒴 is a point with dual suboptimality bounded as

ϕ(y⋆) − ϕ(u) ≤ λ₀V_{y₀}(y⋆) + Σ_{k=1}^{K−1} (λ_k/Λ_k)·ϵ_k. (7)

If we additionally assume that ϕ is L-Lipschitz with respect to ‖·‖, we can directly bound the suboptimality of y_K as

ϕ(y⋆) − ϕ(y_K) ≤ λ₀V_{y₀}(y⋆) + Σ_{k=1}^{K−1} (λ_k/Λ_k)·ϵ_k + L·√(2ϵ_K/(μ_rΛ_K)). (8)
Proof.

We claim the first half of Theorem 2 holds with u ≔ y_K⋆. To see this, note that we can bound the suboptimality of y_K⋆ as

ϕ(y_K⋆) ≥^{(i)} ϕ_K(y_K⋆) ≥ ϕ_K(y_{K−1}⋆) = min_{x∈𝒳}{ψ_{K−1}(x, y_{K−1}⋆)} − λ_{K−1}V_{y_{K−1}}(y_{K−1}⋆)
 = ϕ_{K−1}(y_{K−1}⋆) − λ_{K−1}V_{y_{K−1}}(y_{K−1}⋆)
 ≥^{(ii)} ϕ₀(y₀⋆) − λ₀V_{y₀}(y₀⋆) − Σ_{k=1}^{K−1} λ_kV_{y_k}(y_k⋆)
 ≥^{(iii)} ϕ(y⋆) − λ₀V_{y₀}(y⋆) − Σ_{k=1}^{K−1} (λ_k/Λ_k)·ϵ_k,

where (i) follows since ϕ ≥ ϕ_K pointwise, (ii) follows from repeating the argument in the previous lines recursively (starting by lower bounding ϕ_{K−1}(y_{K−1}⋆), etc.), and (iii) uses Lemma 1 applied to ψ_k, which yields by Lines 4 and 5 in Algorithm 1:

V_{y_k}(y_k⋆) ≤ (f_k(x_k) − f_k(x_k⋆))/Λ_k ≤ ϵ_k/Λ_k,

since ψ_k(x,·) = ψ(x,·) − Σ_{j=0}^{k−1} λ_jV_{y_j}(·) is Λ_k-strongly concave relative to r. Thus, we have proven Equation 7, and Equation 6 follows again from Lemma 1 applied to ψ_K. Equation 8 then follows since the fact that r is μ_r-strongly convex with respect to ‖·‖ and Equation 6 imply

‖y_K − y_K⋆‖ ≤ √(2V_{y_K}(y_K⋆)/μ_r) ≤ √(2ϵ_K/(μ_rΛ_K)).

We give a remark regarding how to pick the parameters (λ_k)_{k=0}^{K−1} and (ϵ_k)_{k=1}^{K} when applying Theorem 2:

 Remark 3 (Picking the parameters for Theorem 2).

Equation 8 can be interpreted as follows: to ensure y_K is ϵ-optimal for ϕ, it suffices to choose the sequences (λ_k)_{k=0}^{K−1} and (ϵ_k)_{k=1}^{K} so that the right side of (8) is at most ϵ. The first term, λ₀V_{y₀}(y⋆), constrains λ₀ to be on the order of ϵ/V_{y₀}(y⋆). Skipping ahead, the third term, L√(2ϵ_K/(μ_rΛ_K)), is the reason we always choose λ_k ≔ 2λ_{k−1} in our applications, as this ensures Λ_K is large enough to handle this term with K only needing to be logarithmic in the problem parameters. Then the second term, Σ_{k=1}^{K−1} (λ_k/Λ_k)·ϵ_k, effectively constrains roughly Σ_{k=1}^{K−1} ϵ_k ≲ ϵ, as λ_k/Λ_k = O(1).

Corollary 4 (Corollary 16 specialized).

Suppose ϕ is L-Lipschitz with respect to ‖·‖, and let B > 0 be such that V_{y₀}(y⋆) ≤ B. Then for any ϵ > 0 and K ≔ max{⌈log₂(L²B/(μ_rϵ²))⌉, 1} + 10, the output of Algorithm 1 with dual-regularization and primal-accuracy schedules of

λ_k = 2^k·ϵ/(4B) for k ∈ {0} ∪ [K−1] and ϵ_k = ϵ/(4K) for k ∈ [K]

satisfies ϕ(y⋆) − ϕ(y_K) ≤ ϵ.
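To see how these schedules realize Remark 3 (a rough check of ours, ignoring the rounding in K): the choice of K gives Λ_K = (2^K − 1)ϵ/(4B) ≥ 128L²/(μ_rϵ), so the three terms of (8) are bounded by

```latex
\begin{gather*}
\lambda_0 V_{y_0}(y^\star) \le \frac{\epsilon}{4B}\cdot B = \frac{\epsilon}{4}, \qquad
\sum_{k=1}^{K-1} \frac{\lambda_k}{\Lambda_k}\,\epsilon_k
  \le 2\sum_{k=1}^{K-1} \frac{\epsilon}{4K} \le \frac{\epsilon}{2}
  \quad\Big(\text{as } \tfrac{\lambda_k}{\Lambda_k} = \tfrac{2^k}{2^k - 1} \le 2\Big), \\
L\sqrt{\frac{2\epsilon_K}{\mu_r \Lambda_K}}
  \le L\sqrt{\frac{2}{\mu_r}\cdot\frac{\epsilon}{4K}\cdot\frac{\mu_r \epsilon}{128\,L^2}}
  = \frac{\epsilon}{16\sqrt{K}} \le \frac{\epsilon}{16},
\end{gather*}
```

and ϵ/4 + ϵ/2 + ϵ/16 < ϵ.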

 Remark 5.

Corollary 4 achieves the stated goal of obtaining an ϵ-optimal point for (D) by running for a number of iterations which depends logarithmically on the problem parameters, and solving each dual-regularized primal subproblem to an accuracy of ϵ divided by a logarithmic factor. Note in particular the logarithmic dependence on the dual divergence bound B and the dual Lipschitz constant L, meaning these are weak assumptions. Furthermore, it is clear from the proof of Theorem 2 that ϕ only needs to be L-Lipschitz on a set containing y_K and y_K⋆.

Corollary 6 (Corollary 17 specialized).

Let B > 0 be such that V_{y₀}(y⋆) ≤ B. Then for any ϵ > 0 and K ∈ ℕ, the output of Algorithm 1 with dual-regularization and primal-accuracy schedules of

λ_k = 2^k·ϵ/(4B) for k ∈ {0} ∪ [K−1] and ϵ_k = ϵ/(8·1.5^k) for k ∈ [K]

satisfies

‖y_K − u‖ ≤ (1/1.5^K)·√(2B/μ_r),

where u ∈ 𝒴 is a point whose suboptimality is at most ϵ, i.e., ϕ(y⋆) − ϕ(u) ≤ ϵ.

 Remark 7.

Later calls to the DRPO oracle during the run of Algorithm 1 may be cheaper since there will be a significant amount of dual regularization at that point (namely, Λ_k = Σ_{j=0}^{k−1} λ_j is large). One can sometimes take advantage of this (in particular, if the cost of a DRPO oracle call scales inverse polynomially with the regularization) to design schedules that avoid even the typical additional multiplicative logarithmic cost of Corollary 4 over the cost of a single DRPO oracle call. In such cases, a choice of schedules similar to those of Corollary 6 is often appropriate. With this choice of schedules, later rounds require very high accuracy. However, if one can argue that the increasing dual regularization Λ_k makes the DRPO oracle call cheaper at a faster rate than the decreasing error ϵ_k makes it more expensive (as we do in Section 5), the total cost of applying the framework may collapse geometrically to the cost of a single DRPO oracle call made with target error approximately ϵ.

We purposely state Corollary 6 without the assumption that ϕ is Lipschitz because that is the form we will use in Section 5. However, it is straightforward to reformulate a version of Corollary 6 with the Lipschitz assumption; our focus here was to illustrate different possible choices of schedules.

1.3 Related work

Black-box reductions

Our main contribution can be viewed as a black-box reduction from (regularized) primal optimization to dual optimization. Similar black-box reductions exist in the optimization literature. For example, [3] develops reductions between various fundamental classes of optimization problems, e.g., strongly convex optimization and smooth optimization. In a similar vein, the line of work [30, 18, 7] reduces convex optimization to approximate proximal point computation (i.e., regularized minimization).

Bilinear matrix games

Consider the bilinear objective ψ(x,y) = x^⊤Ay where 𝒳 and 𝒴 are either the simplex, Δ^k ≔ {x ∈ ℝ^k_{≥0} : ‖x‖₁ = 1}, or the Euclidean ball, B^k ≔ {x ∈ ℝ^k : ‖x‖₂ ≤ 1}. State-of-the-art methods in regard to runtime for obtaining an approximately optimal primal and/or dual solution can be divided into second-order interior point methods [11, 42] and stochastic first-order methods [22, 10, 9, 8]; see Table 2 in [8] for a summary of the best known runtimes as well as other references. Of importance to this paper, all state-of-the-art algorithms other than that of [8] are either (i) primal-dual algorithms which return both an ϵ-optimal primal and dual solution simultaneously, and/or (ii) achieve runtimes which are symmetric in the primal dimension d and dual dimension n, meaning the cost of obtaining an ϵ-optimal dual solution is the same as that of obtaining an ϵ-optimal primal solution. The algorithm of [8], on the other hand, only returns an ϵ-optimal primal point and further has a runtime which is not symmetric in n and d (see the footnote on the first page of that paper). As a result, solving the dual by simply swapping the roles of the primal and dual variables may be more expensive than solving the primal. (In fact, swapping the variables in this way may not even always be possible without further modifications due to restrictions on the constraint sets.)

CVaR at level 𝜶 distributionally robust optimization (DRO)

The DRO objectives we study are of the form ψ(x,y) = Σ_{i=1}^n y_i f_i(x), where the functions f_i: ℝ^d → ℝ are convex, bounded, and Lipschitz, and 𝒴, known as the uncertainty set, is a subset of the simplex. This objective corresponds to a robust version of the empirical risk minimization (ERM) objective where instead of taking an average over the losses (namely, fixing y_i = 1/n), larger losses may be given more weight. In particular, in this paper we focus on a canonical DRO setting, CVaR at level α, where the uncertainty set is given by 𝒴 ≔ {y ∈ Δ^n : y ≤ 𝟏/(αn)} for a choice of α ∈ [1/n, 1]. CVaR DRO, along with its generalization f-divergence DRO, has been of significant interest over the past decade; see [29, 5, 13, 32, 16] and the references therein. [29] is the most relevant to this paper: omitting parameters other than α, the number of losses n, and the target accuracy ϵ > 0, they give matching upper and lower bounds (up to logarithmic factors) of Õ(α^{−1}ϵ^{−2}) first-order queries of the form (f_i(x), ∇f_i(x)) to obtain an expected ϵ-optimal point of the primal objective. Their upper bound is achieved by a stochastic gradient method where the gradient estimator is based on a multilevel Monte Carlo (MLMC) scheme [19, 20]. However, the best known complexity for obtaining an expected ϵ-optimal point of the dual of CVaR at level α is O(nϵ^{−2}) via a primal-dual method based on [33]; see also [13, 32] as well as [5, Appendix A.1], the last of which obtains complexity Õ(nϵ^{−2}) in the more general setting where the uncertainty set is an f-divergence ball.

Stationary point computation

For γ > 0, convex and β-smooth h: ℝ^n → ℝ with global minimum z⋆, and initialization point z₀, consider the problem of computing a point z such that ‖∇h(z)‖₂ ≤ γ. Two worst-case optimal gradient query complexities for this problem exist in the literature: O(√(β(h(z₀) − h(z⋆)))/γ) and O(√(β‖z₀ − z⋆‖₂/γ)). An algorithm (the OGM-G method) which achieves the former complexity was given in [24], and [37] pointed out that any algorithm which achieves the former complexity can also achieve the latter: run N iterations of any optimal gradient method for reducing the function value, followed by N iterations of a method which achieves the former complexity for reducing the gradient magnitude. In what may be of independent interest, we observe in Section 5.1 that a reduction in the opposite direction is also possible. More broadly, algorithms and frameworks for reducing the gradient magnitude of convex functions have been of much recent interest; further algorithms and related work for this problem include [25, 27, 26, 28, 15, 37, 21], with lower bounds given in [34, 35].

1.4 Paper organization

In Section 2, we go over notation and conventions for the rest of the paper. We give our general dual-extraction framework and its guarantees in Section 3. In Section 4, we apply our framework to bilinear matrix games and the CVaR at level α DRO problem. Finally, in Section 5 we give an optimal algorithm (in terms of query complexity) for computing an approximate stationary point of a convex and β-smooth function.

2 Notation and conventions

We defer standard notation and conventions to the full version, and only include paper-specific notation here.

For ψ: ℝ^d × ℝ^n → ℝ, we use the notation ψ(·,y): ℝ^d → ℝ for a fixed y ∈ ℝ^n to denote the map x ↦ ψ(x,y) (and define ψ(x,·) analogously). When we say ψ(·,y) satisfies a property, we mean it satisfies that property for any fixed y ∈ ℝ^n (and analogously for ψ(x,·)). We let [K] ≔ {1, 2, …, K}, Δ^n ≔ {x ∈ ℝ^n_{≥0} : ‖x‖₁ = 1}, and B_r^n(x̄) ≔ {x ∈ ℝ^n : ‖x − x̄‖₂ ≤ r}. In the latter two definitions, we may drop the superscript n if it is clear from context, the argument x̄ if it is 0, and the subscript r if it is 1. For y ∈ ℝ^n, we may use either the notation y_i or [y]_i to denote its i-th entry. 𝟏 denotes the all-ones vector. For a function f which depends on some inputs x₁, …, x_k, we write f ≤ poly(x₁, …, x_k) to denote the fact that f is uniformly bounded above by a polynomial in x₁, …, x_k as x₁, …, x_k vary. We use the notation f^* for the convex or Fenchel conjugate of f. For S ⊆ ℝ^n, we let 𝕀_S denote the infinite indicator of S, namely 𝕀_S(x) = 0 if x ∈ S and 𝕀_S(x) = ∞ if x ∉ S. For a function f: S → [−∞, ∞] initially defined on a strict subset S ⊊ ℝ^n, we may implicitly extend the domain of f to all of ℝ^n via its indicator as f + 𝕀_S without additional comment. For a function f: U → [−∞, ∞] with S ⊆ U ⊆ ℝ^n, we let f_S ≔ f + 𝕀_S denote the restriction of f to S. We note that f_S^* denotes the convex conjugate of f_S (and not f^* restricted to S).

Following [38, Sec. 6.4], we encapsulate the setup for a dgf as follows. See the full version for additional discussion of this definition.

Definition 8 (dgf setup).

We say (𝒰, 𝒫, ‖·‖, r) is a dgf setup over ℝ^n for closed and convex sets 𝒰 ⊆ 𝒫 ⊆ ℝ^n with 𝒰 ∩ int 𝒫 ≠ ∅ if: (i) the distance-generating function (dgf) r: 𝒫 → ℝ is convex and continuous over 𝒫, differentiable on int 𝒫, and μ_r-strongly convex with respect to the chosen norm ‖·‖ on 𝒰 ∩ int 𝒫 for some μ_r > 0; and (ii) either lim_{u→bd 𝒫} ‖∇r(u)‖₂ = ∞ or 𝒰 ⊆ int 𝒫.

For a given dgf setup, we define its induced Bregman divergence V_u^r(v) ≔ r(v) − r(u) − ⟨∇r(u), v − u⟩ for u ∈ int 𝒫, v ∈ 𝒫, and drop the superscript r when it is clear from context.

3 Dual-extraction framework

In this section, we provide our general dual-extraction framework and its guarantees. In Section 3.1, we give the general setup, oracle definitions, and assumptions with which we apply and analyze the framework. Section 3.2 contains the statement and guarantees of the framework and Section 3.3 in the full version contains the associated proofs.

3.1 Preliminaries

We bundle all of the inputs to our framework into what we call a dual-extraction setup, defined below. Recall that when we say ψ(x,·) satisfies a property, we mean it satisfies that property for any fixed x ∈ ℝ^d (and analogously for ψ(·,y)).

Definition 9 (Dual-extraction setup).

A dual-extraction setup is a tuple (ψ, 𝒳, 𝒴, 𝒰, 𝒫, ‖·‖, r) where:

  1. ψ(x,·) is differentiable;

  2. ψ(·,y) and ψ(x,·) are convex and concave, respectively;

  3. (𝒰, 𝒫, ‖·‖, r) is a dgf setup over ℝ^n per Definition 8;

  4. the constraint sets 𝒳 ⊆ ℝ^d and 𝒴 ⊆ ℝ^n are nonempty, closed, and convex with 𝒴 ⊆ 𝒰 and 𝒴 ∩ int 𝒫 ≠ ∅;

  5. 𝒳 is bounded or ψ(·,y) is strongly convex;

  6. 𝒴 is bounded or ψ(x,·) is strongly concave;

  7. over all p ∈ 𝒰 ∩ int 𝒫 and w ∈ ∂𝕀_𝒰(p), the map y ↦ ⟨w,y⟩ is constant over 𝒴. (In all of our applications, this map will in fact be constant over 𝒰.)

Assumption 1 is only used in the proofs of Lemma 3 in the full version (the general version of Lemma 1 from Section 1.2) and Corollary 13 in the full version (used to show the framework is well-defined when dom r ≠ ℝ^n). Assumptions 2, 5, and 6 ensure that the minimax optimization problem with objective ψ and constraint sets 𝒳 and 𝒴 satisfies the minimax principle; see below. Regarding Assumptions 3, 4, and 7, the fact that 𝒴 is potentially a strict subset of 𝒰, as well as the necessity of the technical Assumption 7, is discussed in Remark 4 in the full version. In particular, Assumption 7 is only used to derive an equivalent formulation of the framework to Algorithm 1 which often allows for easier instantiations in applications, but is not strictly necessary to obtain our guarantees.

While our main results are stated in the full generality of Definition 9, in our applications we only particularize to Definition 10 and Definition 11 introduced below.

Definition 10 (Unbounded setup).

A (ψ, 𝒳, 𝒴, r)-unbounded setup is a (ψ, 𝒳, 𝒴, ℝ^n, ℝ^n, ‖·‖₂, r)-dual-extraction setup.

In other words, in an unbounded setup we choose 𝒰 = 𝒫 = ℝ^n and the Euclidean norm, in which case the dgf r can be any differentiable function which is strongly convex with respect to ‖·‖₂. Note that Assumption 7 is trivial as ∂𝕀_𝒰(p) = {0} for all p ∈ ℝ^n.

Definition 11 (Simplex setup).

A (ψ, 𝒳, 𝒴)-simplex setup is a (ψ, 𝒳, 𝒴, Δ^n, ℝ^n_{≥0}, ‖·‖₁, r)-dual-extraction setup where r(u) ≔ Σ_{i=1}^n u_i ln u_i (with 0 ln 0 ≔ 0).

In other words, in a simplex setup we choose 𝒰 = Δ^n, 𝒫 = ℝ^n_{≥0}, we use the ℓ₁-norm, and the dgf is negative entropy when restricted to the simplex. It is a standard result, known as Pinsker’s inequality, that r is 1-strongly convex over Δ^n_{>0} with respect to ‖·‖₁, and the associated Bregman divergence is given by the Kullback-Leibler (KL) divergence V_u(w) = Σ_{i=1}^n w_i ln(w_i/u_i) for u ∈ Δ^n_{>0} and w ∈ Δ^n. We verify that Assumption 7 holds in Appendix A.1 in the full version.
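For concreteness, a small sketch of ours (not from the paper) of this dgf setup: the negative-entropy dgf, its gradient, and the induced Bregman divergence, which for simplex arguments reduces to the KL divergence.

```python
import numpy as np

def r(u):
    """Negative-entropy dgf, r(u) = sum_i u_i ln u_i (with 0 ln 0 = 0)."""
    u = np.asarray(u, dtype=float)
    return np.sum(np.where(u > 0, u * np.log(np.where(u > 0, u, 1.0)), 0.0))

def grad_r(u):
    """Gradient of r on the interior of the positive orthant: 1 + ln(u)."""
    return 1.0 + np.log(u)

def bregman(u, w):
    """V_u(w) = r(w) - r(u) - <grad r(u), w - u>; equals KL(w || u) when
    u lies in the interior of the simplex and w lies in the simplex."""
    return r(w) - r(u) - np.dot(grad_r(u), w - u)

u = np.full(4, 0.25)                  # uniform center
w = np.array([0.7, 0.1, 0.1, 0.1])
print(bregman(u, w))                  # matches sum_i w_i ln(w_i / u_i)
```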

Notation associated with a setup

Whenever we instantiate a dual-extraction setup (Definition 9), we use the following notation and oracles associated with that setup without additional comment. We define the associated primal f: 𝒳 → ℝ and dual ϕ: 𝒴 → ℝ functions, along with their corresponding primal and dual optimization problems, as they were introduced above in (P) and (D). We let x⋆ ∈ argmin_{x∈𝒳} f(x) and y⋆ ∈ argmax_{y∈𝒴} ϕ(y) denote arbitrary primal and dual optima. To facilitate the discussion of dual-regularized problems, we define f_{λ,q}: 𝒳 → ℝ as follows:

f_{λ,q}(x) ≔ max_{y∈𝒴}{ψ(x,y) − λV_q(y)} for λ > 0 and q ∈ 𝒰 ∩ int 𝒫.

The minimax principle

Assumptions 2, 5, and 6 in Definition 9 guarantee f(x⋆) = ψ(x⋆, y⋆) = ϕ(y⋆), which we refer to as the minimax principle. See, e.g., [39, 41] as well as Propositions 1.2 and 2.4 in [17, Ch. VI].

Oracle definitions

Our framework assumes black-box access to ψ, 𝒳, and 𝒴 via a dual-regularized primal optimization (DRPO) oracle and a dual-regularized best response (DRBR) oracle defined below. Note that we generalize the setting of Section 1.2 by allowing the DRPO oracle to return an expected ϵ-optimal point; this is used in our applications in Section 4.

Definition 12 (DRPO oracle).

A (q ∈ 𝒰 ∩ int 𝒫, λ > 0, ϵ_p > 0)-dual-regularized primal optimization oracle, DRPO(q, λ, ϵ_p), returns an expected ϵ_p-minimizer of f_{λ,q}, i.e., a point x ∈ 𝒳 such that 𝔼 f_{λ,q}(x) ≤ inf_{x′∈𝒳} f_{λ,q}(x′) + ϵ_p, where the expectation is over the internal randomness of the oracle.

Definition 13 (DRBR oracle).

A (q ∈ 𝒰 ∩ int 𝒫, λ > 0, x ∈ 𝒳)-dual-regularized best response oracle, DRBR(q, λ, x), returns argmax_{y∈𝒴}{ψ(x,y) − λV_q(y)}.

We also define a version of the DRPO oracle, called the DRPOSP oracle, which allows for a failure probability. We include this definition here due to its generality and broad applicability, but it is only used in Section 4.1, since the external result we cite to obtain an expected ϵ_p-minimizer of f_{λ,q} in that application has a failure probability. We also show in Appendix A.4 in the full version how to boost the success probability of a DRPOSP oracle.

Definition 14 (DRPOSP oracle).

A (q ∈ 𝒰 ∩ int 𝒫, λ > 0, ϵ_p > 0, δ ∈ [0,1))-dual-regularized primal optimization oracle with success probability, DRPOSP(q, λ, ϵ_p, δ), returns an expected ϵ_p-minimizer of f_{λ,q} with success probability at least 1 − δ, where the expectation and success probability are over the internal randomness of the oracle.

3.2 The framework and its guarantees

Algorithm 2 Dual-extraction framework.

We now state the general dual-extraction framework, Algorithm 2, and its guarantees, with proofs in the next section. As mentioned in Section 1.2, Algorithm 2 generalizes Algorithm 1 in three major ways: (i) we allow for stochasticity in the DRPO oracle; (ii) we allow for distance-generating functions r where dom r ≠ ℝ^n; and (iii) we give different but equivalent characterizations of x_k and y_k which often allow for easier instantiations of the framework.

Regarding (iii), consider the case where the DRPO oracle is deterministic and dom r = ℝ^n for the sake of discussion. Note that in this case, the definitions of x_k and y_k in Lines 4 and 5 of Algorithm 2 may seem different than those in Lines 4 and 5 of Algorithm 1 at first glance. In particular, x_k in Line 4 of Algorithm 2 is an ϵ_k-minimizer of x ↦ max_{y∈𝒴}{ψ(x,y) − Λ_kV_{q_k}(y)} over 𝒳, whereas x_k in Line 4 of Algorithm 1 is an ϵ_k-minimizer of x ↦ max_{y∈𝒴}{ψ(x,y) − Σ_{j=0}^{k−1} λ_jV_{y_j}(y)} over 𝒳. Similarly, y_k = argmax_{y∈𝒴}{ψ(x_k,y) − Λ_kV_{q_k}(y)} in Line 5 of Algorithm 2, whereas y_k = argmax_{y∈𝒴}{ψ(x_k,y) − Σ_{j=0}^{k−1} λ_jV_{y_j}(y)} in Line 5 of Algorithm 1. In fact, we show in Section 3.3 in the full version that these are equivalent; see Lemma 2 and Remark 4 in the full version. The potential advantage of the expressions in Algorithm 2 compared to those in Algorithm 1 is that they involve only a single regularization term.

Note also that Line 3 of Algorithm 2 gives two equivalent expressions for the iterate q_k; their equivalence is proven in Appendix A.2 in the full version. Also, note that Line 4 is the only potential source of randomness in Algorithm 2; in particular, y_k and q_{k+1} are deterministic upon conditioning on x_k. Finally, we show that Algorithm 2 is well-defined in Appendix A.3 in the full version; in particular, whenever a Bregman divergence V_u(w) is written in Algorithm 2, it is the case that u ∈ 𝒰 ∩ int 𝒫. For example, in the context of a simplex setup per Definition 11, this corresponds to u ∈ Δ^n_{>0}.

We now give the main guarantee for Algorithm 2. See Remark 3 for additional explanation.

Theorem 15 (Algorithm 2 guarantee).

With K calls to a DRPO oracle and K calls to a DRBR oracle, Algorithm 2 returns y_K satisfying

𝔼 V_{y_K}(u) ≤ ϵ_K/Λ_K,

where u ∈ 𝒴 is a point with expected suboptimality bounded as

ϕ(y⋆) − 𝔼 ϕ(u) ≤ λ₀V_{y₀}(y⋆) + Σ_{k=1}^{K−1} (λ_k/Λ_k)·ϵ_k.

If we additionally assume that ϕ is L-Lipschitz with respect to ‖·‖, the expected suboptimality of y_K can be directly bounded as

ϕ(y⋆) − 𝔼 ϕ(y_K) ≤ λ₀V_{y₀}(y⋆) + Σ_{k=1}^{K−1} (λ_k/Λ_k)·ϵ_k + L·√(2ϵ_K/(μ_rΛ_K)). (9)

We now particularize Theorem 15 using two exemplary choices of the dual-regularization and primal-accuracy schedules. See Remarks 5 and 7 for additional comments.

Corollary 16.

Suppose ϕ is L-Lipschitz with respect to ‖·‖, and let B > 0 be such that V_{y₀}(y⋆) ≤ B. Then for any ϵ > 0 and K ≔ max{⌈log₂(L²B/(μ_rϵ²))⌉, 1} + 10, the output of Algorithm 2 with dual-regularization and primal-accuracy schedules given by

λ_k = 2^k·ϵ/(4B) for k ∈ {0} ∪ [K−1] and ϵ_k = ϵ/(4K) for k ∈ [K]

satisfies ϕ(y⋆) − 𝔼 ϕ(y_K) ≤ ϵ.

Corollary 17.

Let B > 0 be such that V_{y₀}(y⋆) ≤ B. Then for any ϵ > 0 and K ∈ ℕ, the output of Algorithm 2 with dual-regularization and primal-accuracy schedules given by

λ_k = 2^k·ϵ/(4B) for k ∈ {0} ∪ [K−1] and ϵ_k = ϵ/(8·1.5^k) for k ∈ [K] (10)

satisfies

𝔼‖y_K − u‖ ≤ (1/1.5^K)·√(2B/μ_r),

where u ∈ 𝒴 is a point whose expected suboptimality is at most ϵ, i.e., ϕ(y⋆) − 𝔼 ϕ(u) ≤ ϵ.

4 Efficient maximin algorithms

In this section, we obtain new state-of-the-art runtimes for solving bilinear matrix games in certain parameter regimes (Section 4.1), as well as an improved query complexity for solving the dual of the CVaR at level α distributionally robust optimization (DRO) problem (Section 4.2). In each application, we apply Corollary 16 to compute an ϵ-optimal point for the dual problem at approximately the same cost as computing an ϵ-optimal point for the primal problem (up to logarithmic factors and the cost of representing a dual vector when it comes to CVaR at level α).

4.1 Bilinear matrix games

In this section, we instantiate ψ(x,y) ≔ x^⊤Ay for a matrix A ∈ ℝ^{d×n}. Given p, q ≥ 1, we write ‖A‖_{p→q} ≔ max_{v≠0} ‖Av‖_q/‖v‖_p, and use the notation A_{ij}, A_{i:}, and A_{:j} for the (i,j) entry, i-th row as a row vector, and j-th column as a column vector. We consider two setups:

Definition 18 (Matrix games ball setup).

In the matrix games ball setup, we set 𝒳 ≔ B^d (the unit Euclidean ball in ℝ^d), 𝒴 ≔ Δ^n, and fix a (ψ, 𝒳, 𝒴)-simplex setup (Definition 11). We assume ‖A‖_{1→2} = max_{i∈[n]} ‖A_{:i}‖₂ ≤ 1.

Definition 19 (Matrix games simplex setup).

In the matrix games simplex setup, we set 𝒳 ≔ Δ^d, 𝒴 ≔ Δ^n, and fix a (ψ, 𝒳, 𝒴)-simplex setup (Definition 11). We assume ‖A‖_{1→∞} = max_{i,j} |A_{ij}| ≤ 1.

Throughout Section 4.1, any theorem, statement, or equation which does not make reference to a specific choice of Definition 18 or 19 applies to both setups. Specializing the primal (P) and dual (D) to this application gives

minimize_{x∈𝒳} f(x) for f(x) ≔ max_{y∈Δ^n} x^⊤Ay, and (P-MG)
maximize_{y∈Δ^n} ϕ(y) for ϕ(y) ≔ min_{x∈𝒳} x^⊤Ay. (D-MG)

Regarding the assumptions on the norm of the matrix A in Definitions 18 and 19, note that we can equivalently write f(x) = max_{y∈Δ^n} Σ_{i=1}^n y_i f_i(x) with f_i(x) ≔ [A^⊤x]_i. Then the assumptions on the norm of A correspond to ensuring f_i is 1-Lipschitz with respect to the ℓ₂-norm in Definition 18 and the ℓ₁-norm in Definition 19 (which in turn implies f is 1-Lipschitz in the respective norms). This normalization is performed to simplify expressions, as in [8]. (In particular, [8] also considers the more general problem where each f_i can be any smooth, Lipschitz, convex function.)

Recently, [8, Cor. 8.2] achieved a state-of-the-art runtime in certain parameter regimes of Õ(nd + n(d/ϵ)^{2/3} + d/ϵ²) for obtaining an ϵ-optimal point for (P-MG). However, unlike previous algorithms for (P-MG) (see Section 1.3 for an extended discussion), their algorithm does not yield an ϵ-optimal point for (D-MG) with the same runtime.

Algorithm 3 Dual extraction for matrix games.

Our instantiation of the dual-extraction framework in Algorithm 3 and the accompanying guarantee Theorem 21 resolve this asymmetry between the complexity of obtaining a primal versus dual ϵ-optimal point by obtaining an ϵ-optimal point of (D-MG) with the same runtime of Õ(nd + n(d/ϵ)^{2/3} + d/ϵ²). At the end of Section 4.1, we observe that Theorem 21 also yields a new state-of-the-art runtime for the primal (P-MG) in the setting of Definition 19 due to the symmetry of the constraint sets and ψ.

Before giving the guarantee Theorem 21 for Algorithm 3, the following lemma provides a runtime bound for the DRPOSP oracle when the success probability is 9/10 (see Appendix B.1 in the full version for the proof). In particular, Lemma 20 shows that adding dual regularization to (P-MG) does not increase the complexity of obtaining an ϵ-optimal point over the guarantee of [8, Cor. 8.2] discussed above.

Lemma 20 (DRPOSP oracle for matrix games).

In the settings of Definitions 18 and 19, for any q ∈ Δ^n_{>0}, ϵ_p > 0, and λ > 0, there exists an algorithm which, with success probability at least 9/10, returns an expected ϵ_p-optimal point of f_{λ,q} with runtime Õ(nd + n(d/ϵ_p)^{2/3} + d/ϵ_p²). (Equivalently, per Definition 14, DRPOSP(q, λ, ϵ_p, 1/10) can be implemented with this runtime.)
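The DRBR oracle, in contrast, admits a simple closed form in these setups: first-order optimality for max_{y∈Δ^n}{⟨A^⊤x, y⟩ − λ·KL(y‖q)} over the simplex gives an entropy-regularized best response with weights proportional to q_i·exp([A^⊤x]_i/λ). A sketch of ours (not pseudocode from the paper):

```python
import numpy as np

def drbr_matrix_games(A, q, lam, x):
    """DRBR(q, lam, x) for psi(x, y) = x^T A y in a simplex setup, where
    V_q(y) = KL(y || q):
        argmax_{y in simplex} <A^T x, y> - lam * KL(y || q)
          = softmax of (A^T x) / lam, reweighted by q.
    """
    c = A.T @ x / lam
    logits = np.log(q) + c
    logits -= logits.max()          # stabilize the exponentials
    y = np.exp(logits)
    return y / y.sum()
```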

Now for the main guarantee (we defer the proof to the full version):

Theorem 21 (Guarantee for Algorithm 3).

In the settings of Definitions 18 and 19, given target error ϵ > 0 and with success probability at least 9/10, Algorithm 3 with dual-regularization and primal-accuracy schedules given by

λ_k = 2^k·ϵ/(4 ln n) for k ∈ {0} ∪ [K−1] and ϵ_k = ϵ/(4K) for k ∈ [K]

for K = max{⌈log₂(ln n/ϵ²)⌉, 1} + 10 returns an expected ϵ-optimal point for (D-MG), and can be implemented with runtime Õ(nd + n(d/ϵ)^{2/3} + d/ϵ²).

The primal perspective

As alluded to above, the guarantee of Theorem 21 also implies a new state-of-the-art runtime for the primal (P-MG) in the setting of Definition 19. This follows because in the matrix games simplex setup, (P-MG) and (D-MG) are symmetric in terms of their constraint sets, so we can obtain an expected ϵ-optimal point for (P-MG) via Theorem 21 by negating the objective and treating (P-MG) as if it were the dual problem. Formally (we defer the proof to the full version):

Corollary 22 (Guarantee for (P-MG) in the matrix games simplex setup).

In the setting of Definition 19, there exists an algorithm which, given target error ϵ > 0 and with success probability at least 9/10, returns an expected ϵ-optimal point for (P-MG) with runtime Õ(nd + d(n/ϵ)^{2/3} + n/ϵ²).

See the full version for a discussion of how this runtime compares to the prior art.

4.2 CVaR at level 𝜶 DRO

In this section, we instantiate ψ(x,y) ≔ Σ_{i=1}^n y_i f_i(x) for convex, bounded, and G-Lipschitz (with respect to the Euclidean norm) functions f_i: ℝ^d → ℝ. (Note that we do not require the functions f_i to be differentiable; here, it is important that Definition 9 only requires ψ(x,·) to be differentiable.) Given a compact, convex set 𝒳 and α ∈ [1/n, 1], the primal and dual problems for CVaR at level α are as follows (we explain the reason for the notation f̄ as opposed to f in the full version; in short, we apply the framework to a proxy objective):

minimize_{x∈𝒳} f̄(x) for f̄(x) ≔ max_{y∈Δ^n, y ≤ 𝟏/(αn)} Σ_{i=1}^n y_i f_i(x), and (P-CVaR)
maximize_{y∈Δ^n, y ≤ 𝟏/(αn)} ϕ(y) for ϕ(y) ≔ min_{x∈𝒳} Σ_{i=1}^n y_i f_i(x). (D-CVaR)

Our complexity model in this section counts computations of the form (f_i(x), ∇f_i(x)) for x ∈ 𝒳 and i ∈ [n]; we refer to one such evaluation as a single first-order query. Omitting the Lipschitz constant G and bounds on the range of the f_i’s and the size of 𝒳 for clarity, [29, Sec. 4] gave an algorithm which returns an expected ϵ-optimal point of (P-CVaR) with Õ(α^{−1}ϵ^{−2}) first-order queries, and also proved a matching lower bound up to logarithmic factors when n is sufficiently large. (To be precise, [29] gives an Õ(α^{−1}ϵ^{−2})-complexity high-probability bound in their Theorem 2; they do not state an Õ(α^{−1}ϵ^{−2})-complexity expected-suboptimality bound explicitly in a theorem, but they note in the text above Theorem 2 that such a bound follows immediately from Propositions 3 and 4 in their paper.) However, to the best of our knowledge, the best known complexity for obtaining an expected ϵ-optimal point of (D-CVaR) is Õ(nϵ^{−2}) via a primal-dual method based on [33]; see also [13, 32, 5]. In our main guarantee for this section, Theorem 24, we apply Algorithm 2 to obtain an expected ϵ-optimal point of (D-CVaR) with complexity Õ(α^{−1}ϵ^{−2} + n), which always improves upon or matches Õ(nϵ^{−2}) since α ∈ [1/n, 1].

Toward stating our main guarantee, we encapsulate the formal assumptions of [29, Sec. 2] in the following definition:

Definition 23 (CVaR at level α setup).

We assume 𝒳 is nonempty, closed, convex, and satisfies ‖x − y‖₂ ≤ R for all x, y ∈ 𝒳. We also assume, for all i ∈ [n], that f_i is convex, G-Lipschitz with respect to ‖·‖₂, and satisfies f_i(x) ∈ [0, M] for all x ∈ 𝒳.

We ultimately obtain the following guarantee via Algorithm 2. Note that the upper bound on ϵ in Theorem 24 is without loss of generality since if ϵ ≥ 4M, any feasible point is ϵ-optimal. We defer the proof to the full version.

Theorem 24 (Guarantee for (D-CVaR)).

In the setting of Definition 23 with target error ϵ ∈ (0, 4M) and α ∈ [1/n, 1], there exists an algorithm which computes an expected ϵ-optimal point of (D-CVaR) with complexity Õ(n + G²R²α^{−1}ϵ^{−2}).

5 Obtaining critical points of convex functions

In this section, our goal is to obtain an approximate critical point of a convex, β-smooth function h:n, given access to a gradient oracle for h. We show that our general framework yields an algorithm with the optimal query complexity for this problem. In Section 5.1, we give the formal problem definition and some important preliminaries. In Section 5.2, we give the setup for applying our main framework Algorithm 2 to this problem and a sketch of why the resulting algorithm works. In Section 5.3, we formally state the resulting algorithm for obtaining an approximate critical point of h and prove that it achieves the optimal rate using the guarantees associated with Algorithm 2.

5.1 Preliminaries for Section 5

Throughout Section 5, we fix ‖·‖ to be the standard Euclidean norm over ℝ^n. We assume h: ℝ^n → ℝ is convex, β-smooth with respect to ‖·‖, and that Δ ≔ h(x₀) − inf_{x∈ℝ^n} h(x) < ∞ for an arbitrary initialization point x₀ ∈ ℝ^n. We access h through a gradient oracle. For γ > 0, our goal will be to obtain a γ-critical point of h, i.e., a point x ∈ ℝ^n such that ‖∇h(x)‖ ≤ γ. Instead of operating on h itself, our algorithm will operate on a regularized version of h:

f(x) ≔ h(x) + (γ²/(16Δ))·‖x − x₀‖². (11)

This notation was chosen to mirror the notation of Section 3.1; f will be the primal function when we apply the framework. Let x_f⋆ denote the unique global minimizer of f. The following corollary of Lemma 13 in Appendix C in the full version summarizes the key properties of f:

Corollary 25 (Properties of the regularized function f).

We have

  1. ‖x_f⋆ − x₀‖ ≤ 4Δ/γ.

  2. If u ∈ ℝ^n is such that ‖∇f(u)‖ ≤ γ/4, then ‖∇h(u)‖ ≤ γ.

Proof.

This follows immediately from Lemma 13 in the full version with α ≔ γ²/(8Δ) and ν ≔ γ/4.

The second part of Corollary 25 says that to find a γ-critical point of h, it suffices to find a (γ/4)-critical point of f. Furthermore, a single query to ∇h clearly suffices to obtain ∇f at a point. As a result, we will focus on finding a (γ/4)-critical point of f. Corollary 25 may also be of independent interest, since it trivially allows one to achieve a gradient query complexity of O(√(βΔ)/γ) via any method which achieves query complexity O(√(β‖x₀ − x_h⋆‖/γ)) (for x_h⋆ defined as some minimizer of h over ℝ^n, assuming one exists); see Section 1.3.
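For intuition on item 1 (a one-line check of ours): comparing f at its minimizer and at x₀ gives

```latex
\frac{\gamma^2}{16\Delta}\,\|x_f^\star - x_0\|^2
  = f(x_f^\star) - h(x_f^\star)
  \le f(x_0) - h(x_f^\star)
  = h(x_0) - h(x_f^\star)
  \le \Delta
  \quad\Longrightarrow\quad
  \|x_f^\star - x_0\| \le \frac{4\Delta}{\gamma}.
```

The reduction mentioned above then follows by running the O(√(β‖x₀ − x⋆‖/γ))-type method on f with target error γ/4: its cost is O(√(β·(4Δ/γ)/(γ/4))) = O(√(βΔ)/γ) gradient queries, where f is O(β)-smooth since γ² ≤ 2βΔ implies γ²/(8Δ) ≤ β/4.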

The reason we perform this regularization before applying our framework is that it enables us to obtain a sufficiently tight bound on V_{y₀}(y⋆) (equivalently, a small enough value of B when we ultimately apply Corollary 17). It is possible to apply the framework more directly to h, but it is not clear how to do so in a way that achieves an optimal complexity.

Finally, we provide a notation guide for Section 5 in Table 1, which may be useful to reference as additional notation is introduced in Sections 5.2 and 5.3.

Table 1: Notation guide for Section 5.

Notation | Description | Section
‖·‖ | Euclidean norm | 5.1
h | Convex, β-smooth function |
γ | Target critical-point error for h |
x₀ | Arbitrary initialization point |
Δ | h(x₀) − inf_{x∈ℝ^n} h(x) < ∞ |
f(x) | h(x) + (γ²/(16Δ))‖x − x₀‖² |
x_f⋆ | The global minimizer of f |
ψ(x,y) | ⟨x,y⟩ − f^*(y) | 5.2
R | 5Δ/γ |
𝒳 | B_R^n(x₀) |
𝒴 | ℝ^n |
dgf r | f^* |
ϕ(y) | ⟨x₀,y⟩ − R‖y‖ − f^*(y) |
λ_k | 2^k/32 | 5.3
ϵ_k | Δ/(64·1.5^k) |
CGM | Fast composite gradient method oracle |

5.2 Instantiating the framework

For this application, we instantiate

ψ(x,y) ≔ ⟨x,y⟩ − f^*(y).

Recall that ψ is the Fenchel game [1, 43, 12, 23]; see Section 1.1 for a discussion of why it is a natural choice in this setting. For the rest of Section 5, we fix a (ψ, 𝒳 ≔ B_R^n(x₀), 𝒴 ≔ ℝ^n, f^*)-unbounded setup (Definition 10) with R ≔ 5Δ/γ. f^* is a valid choice for the dgf because f^* is differentiable and (β + γ²/(8Δ))^{−1}-strongly convex [38, Thm. 6.11]. The strong convexity of f^* also implies that Assumption 6 holds. Note that the associated primal function x ↦ max_{y∈ℝ^n} ψ(x,y) is precisely f^{**} = f (hence the choice of notation in (11)), and the dual function is given by

ϕ(y) = min_{x∈B_R^n(x₀)}{⟨x,y⟩ − f^*(y)} = ⟨x₀ − R·y/‖y‖, y⟩ − f^*(y) = ⟨x₀,y⟩ − R‖y‖ − f^*(y).

Next, the following lemma fulfills part of the outline given in Section 1.1 by showing that approximately optimal points for the dual objective (D) must have small norm. We defer the proof to the full version.

Lemma 26 (Bounding the norm by dual suboptimality).

If y ∈ ℝ^n is ϵ-optimal for (D) for some ϵ > 0, then ‖y‖ ≤ ϵγ/Δ.

We now derive the oracles of Definitions 12 and 13. Regarding Definition 12, for the rest of Section 5 we restrict DRPO(·) to denote a deterministic implementation of the DRPO oracle, since we can always obtain a deterministic implementation in this application. Then the following corollary is an immediate consequence of a more general lemma given in Appendix C in the full version which characterizes the properties of the Fenchel game with added dual regularization; see also Section 1.1.

Corollary 27.

The set of valid output points of DRPO(q ∈ ℝ^n, λ > 0, ϵ_p > 0) is precisely

argmin^{ϵ_p}_{x∈B_R^n(x₀)} (1+λ)·f((x + λ∇f^*(q))/(1+λ)), and
DRBR(q ∈ ℝ^n, λ > 0, x ∈ B_R^n(x₀)) = ∇f((x + λ∇f^*(q))/(1+λ)).
Proof.

Apply Lemma 14 in the full version with g ≔ f.
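To see where these expressions come from (a sketch of ours; here V_q is the Bregman divergence of the dgf f^*): expanding f_{λ,q}(x) = max_y{ψ(x,y) − λV_q(y)} and absorbing terms independent of x and y into a constant,

```latex
\begin{align*}
f_{\lambda,q}(x)
 &= \max_{y\in\mathbb{R}^n}\Big\{\langle x, y\rangle - f^*(y)
    - \lambda\big(f^*(y) - f^*(q) - \langle\nabla f^*(q),\, y - q\rangle\big)\Big\}\\
 &= \max_{y\in\mathbb{R}^n}\Big\{\langle x + \lambda\nabla f^*(q),\, y\rangle
    - (1+\lambda)\, f^*(y)\Big\} + \mathrm{const}\\
 &= (1+\lambda)\, f\!\Big(\frac{x + \lambda\nabla f^*(q)}{1+\lambda}\Big) + \mathrm{const},
\end{align*}
```

using f^{**} = f in the last step; the maximizing y is ∇f((x + λ∇f^*(q))/(1+λ)), which is exactly the DRBR expression.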

Taken together, Lemma 26 and Corollary 27 nearly immediately imply that Algorithm 2 can be applied to the above setup to obtain a (γ/4)-critical point of f (and therefore a γ-critical point of h). In particular, we will apply the schedules of Corollary 17 to certify that the output y_K of Algorithm 2 is close in distance to an ϵ-optimal point for (D) for an appropriate choice of ϵ > 0. Then Lemma 26 and a triangle inequality yield a bound on ‖y_K‖. Finally, since

y_K ≔ DRBR(q_K, Λ_K, x_K) = ∇f((x_K + Λ_K∇f^*(q_K))/(1+Λ_K))

by Corollary 27, we have that (x_K + Λ_K∇f^*(q_K))/(1+Λ_K) is an approximate critical point of f (and therefore of h). One may worry about the presence of ∇f^*(q_K) here and, more generally, the presence of ∇f^*(q) in the expressions for the oracles in Corollary 27. However, ∇f^*(·) never needs to be evaluated explicitly: per the alternate expression for q_k given in Line 3 of Algorithm 2, q_k was itself computed as the gradient of f at a point (recall the dgf is f^* and ∇f^* = (∇f)^{−1}), in which case ∇f^* simply undoes this operation by Lemma 16 in the full version.

We formalize this sketch and provide a complexity guarantee in the next section. We also reframe this sketch and treat the sequence of (x_k + Λ_k∇f^*(q_k))/(1+Λ_k) terms as our iterates (as opposed to the sequence of x_k’s), as this leads to a simpler statement and interpretation of the resulting algorithm.

5.3 The resulting algorithm and guarantee

We now formalize the sketch given at the end of the previous section, state the resulting algorithm, and provide a complexity guarantee. But first, we define a subroutine which will be used by the algorithm to implement the DRPO oracle:

Definition 28 (CGM oracle [40, 36]).

A (ζ > 0, w ∈ ℝ^n, ϵ > 0)-fast composite gradient method oracle, CGM(ζ, w, ϵ), returns an ϵ-minimizer of f over B_ζ^n(w), i.e., an element of argmin^ϵ_{x∈B_ζ^n(w)} f(x), using at most O(1 + √(βζ²/ϵ)) queries to ∇f.

For example, implementations with a small constant can be found in [40] or [36, Sec. 6.1.3]. The implementation of the CGM oracle falls under fast gradient methods for composite minimization: letting g denote a convex, β-smooth function and Ψ a “simple regularizer” (a quadratic in our case), the goal is to minimize g̃(x) ≔ g(x) + Ψ(x) with the same complexity as it takes to minimize g. The domain constraint can also be baked into the regularizer Ψ by adding an indicator.
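As an illustration of this style of oracle, here is a minimal sketch of ours (not the method of [40, 36]): accelerated projected gradient descent (FISTA with the ball indicator as the “simple regularizer”), whose O(βζ²/N²) rate matches the query count of Definition 28.

```python
import numpy as np

def cgm_ball(grad_f, beta, zeta, w, num_iters):
    """Illustrative CGM-style oracle: FISTA on a beta-smooth convex f over
    the ball B_zeta(w). After N iterations the suboptimality is
    O(beta * zeta^2 / N^2), so eps accuracy needs
    N = O(sqrt(beta * zeta^2 / eps)) gradient queries."""
    def project(x):  # Euclidean projection onto the ball B_zeta(w)
        d = x - w
        nrm = np.linalg.norm(d)
        return x if nrm <= zeta else w + (zeta / nrm) * d

    x_prev = np.copy(w)
    y = np.copy(w)
    t = 1.0
    for _ in range(num_iters):
        x = project(y - grad_f(y) / beta)              # projected gradient step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)    # momentum extrapolation
        x_prev, t = x, t_next
    return x_prev
```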

Our method for computing a γ-critical point of h is given in Algorithm 4, with the associated guarantee in Theorem 30. We note that the decision to introduce the equivalent notation z₀ for x₀ in Line 1 is aesthetic (to make Line 5 simpler to state and interpret). Furthermore, we state Algorithm 4 for general schedules (λ_k)_{k=0}^{K−1} and (ϵ_k)_{k=1}^{K} for clarity, but ultimately we will choose the schedules given in Theorem 30, which correspond to particularizing the schedules of Corollary 17 to this setting. With this choice of schedules, Λ_k ≈ 2^k and ϵ_k ∝ Δ/1.5^k, so that ϵ_k/(1+Λ_k) ≈ Δ/3^k. As a result, Algorithm 4 can be interpreted as optimizing f in a sequence of balls where the radius and target error are both decreasing geometrically, and the center is a convex combination of the past iterates. While we choose the iteration count K to be logarithmic in the problem parameters, we avoid multiplicative logarithmic factors in the total complexity because the ratio ζ²/ϵ in the complexity of the CGM oracle call (to borrow the notation of Definition 28) in Line 5 of Algorithm 4 is ≈ R²·3^k/(4^k·Δ) at the k-th iteration, meaning it is collapsing geometrically.
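Summing the per-round CGM costs under these schedules makes the geometric collapse explicit (our arithmetic, suppressing constants): with per-round radius ζ_k ≈ R/2^k and per-round target error ≈ Δ/3^k,

```latex
\sum_{k=1}^{K} \sqrt{\frac{\beta\,\zeta_k^2}{\epsilon_k/(1+\Lambda_k)}}
 \approx \sum_{k=1}^{K} \sqrt{\frac{\beta R^2\, 3^k}{4^k\, \Delta}}
 = \sqrt{\frac{\beta R^2}{\Delta}}\sum_{k=1}^{K}\Big(\frac{\sqrt{3}}{2}\Big)^{k}
 = O\!\Big(\sqrt{\frac{\beta R^2}{\Delta}}\Big)
 = O\!\Big(\frac{\sqrt{\beta\Delta}}{\gamma}\Big),
```

where the last step uses R = 5Δ/γ.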

Algorithm 4 Algorithm for obtaining a γ-critical point of h.
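The algorithm body is omitted in this excerpt; the following sketch reconstructs it from Lemma 29 and the discussion above (our reconstruction under those stated correspondences, not the paper’s verbatim pseudocode):

```python
def critical_point(cgm, x0, R, lambdas, epsilons, K):
    """Sketch of Algorithm 4 (reconstructed via Lemma 29): round k calls the
    CGM oracle of Definition 28 on f over a geometrically shrinking ball
    centered at a convex combination of past iterates; returns z_K."""
    z = [x0]                                  # z_0 := x_0 (Line 1)
    lam_sum = 0.0                             # Lambda_k = sum_{j<k} lambda_j
    weighted = 0.0 * x0                       # sum_{j<k} lambda_j * z_j
    for k in range(1, K + 1):
        lam_sum += lambdas[k - 1]
        weighted = weighted + lambdas[k - 1] * z[k - 1]
        center = (x0 + weighted) / (1.0 + lam_sum)   # convex comb. of z_0..z_{k-1}
        z_k = cgm(R / (1.0 + lam_sum),               # radius shrinks ~ 2^{-k}
                  center,
                  epsilons[k - 1] / (1.0 + lam_sum)) # error shrinks ~ 3^{-k}
        z.append(z_k)
    return z[-1]                              # z_K: gamma-critical for h
```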

Toward analyzing Algorithm 4, we first connect the sequence of iterates zk produced by Algorithm 4 to the sequence of iterates xk,yk,qk produced by Algorithm 2 with the same input parameters. Namely, we are formalizing the comment made at the end of Section 5.2 about reframing the sequence of iterates to achieve a more interpretable algorithm. We defer the proof to the full version.

Lemma 29 (Connecting Algorithm 4 to Algorithm 2).

Consider Algorithm 2 with input given by a (ψ, B_R^n(x₀), ℝ^n, f^*)-unbounded setup (Definition 10); y₀ ≔ ∇f(x₀); and K, (ϵ_k)_{k=1}^K, and (λ_k)_{k=0}^{K−1} as in Algorithm 4. Then, letting (z_k)_{k=0}^K denote the sequence of iterates generated by Algorithm 4, the following are valid sequences of iterates for Algorithm 2:

q_k = ∇f((1/Λ_k)·Σ_{j=0}^{k−1} λ_j z_j) for k ∈ [K], (12)
x_k = (1+Λ_k)·z_k − Σ_{j=0}^{k−1} λ_j z_j for k ∈ [K], and (13)
y_k = ∇f(z_k) for k ∈ {0} ∪ [K]. (14)

Having connected Algorithm 4 to Algorithm 2, we can apply the schedules given in Corollary 17 to show that Algorithm 4 returns a γ-critical point of h with an optimal complexity. We defer the proof to the full version.

Theorem 30 (Guarantee for Algorithm 4).

For any γ ∈ (0, √(2βΔ)) and with K = O(log(βΔ/γ²)), the output of Algorithm 4 with schedules given by

λ_k = 2^k/32 for k ∈ {0} ∪ [K−1] and ϵ_k = Δ/(64·1.5^k) for k ∈ [K] (15)

satisfies ‖∇h(z_K)‖ ≤ γ, and the algorithm makes at most O(√(βΔ)/γ) gradient queries to h. (The restriction on γ is without loss of generality since ‖∇h(x₀)‖ ≤ √(2βΔ) by smoothness; we add it because it simplifies the analysis.)

References

  • [1] Jacob D. Abernethy and Jun-Kun Wang. On Frank-Wolfe and equilibrium computation. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • [2] Ilan Adler. The equivalence of linear programs and zero-sum games. International Journal of Game Theory, 42:165–177, February 2013.
  • [3] Zeyuan Allen-Zhu and Elad Hazan. Optimal black-box reductions between optimization objectives. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 1614–1622, Red Hook, NY, USA, 2016. Curran Associates Inc.
  • [4] Hilal Asi, Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford. Stochastic bias-reduced gradient methods. In Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS ’21, Red Hook, NY, USA, 2024. Curran Associates Inc.
  • [5] Yair Carmon and Danielle Hausler. Distributionally robust optimization via ball oracle acceleration. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2024. Curran Associates Inc.
  • [6] Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford. Thinking inside the ball: Near-optimal minimization of the maximal loss. In Mikhail Belkin and Samory Kpotufe, editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 866–882. PMLR, 15–19 August 2021. URL: http://proceedings.mlr.press/v134/carmon21a.html.
  • [7] Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford. RECAPP: crafting a more efficient catalyst for convex optimization. In International Conference on Machine Learning, 2022.
  • [8] Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford. A Whole New Ball Game: A Primal Accelerated Method for Matrix Games and Minimizing the Maximum of Smooth Functions, pages 3685–3723. Society for Industrial and Applied Mathematics, 2024. doi:10.1137/1.9781611977912.130.
  • [9] Yair Carmon, Yujia Jin, Aaron Sidford, and Kevin Tian. Variance reduction for matrix games. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • [10] Kenneth L. Clarkson, Elad Hazan, and David P. Woodruff. Sublinear optimization for machine learning. J. ACM, 59(5), November 2012. doi:10.1145/2371656.2371658.
  • [11] Michael B. Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 938–942, New York, NY, USA, 2019. Association for Computing Machinery. doi:10.1145/3313276.3316303.
  • [12] Michael B. Cohen, Aaron Sidford, and Kevin Tian. Relative lipschitzness in extragradient methods and a direct recipe for acceleration. In James R. Lee, editor, 12th Innovations in Theoretical Computer Science Conference, ITCS 2021, January 6-8, 2021, Virtual Conference, volume 185 of LIPIcs, pages 62:1–62:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ITCS.2021.62.
  • [13] Sebastian Curi, Kfir Y. Levy, Stefanie Jegelka, and Andreas Krause. Adaptive sampling for stochastic risk-averse learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc.
  • [14] G. B. Dantzig. Linear programming and extensions, 1953.
  • [15] Jelena Diakonikolas and Puqian Wang. Potential function-based framework for minimizing gradients in convex and min-max optimization. SIAM Journal on Optimization, 32(3):1668–1697, 2022. doi:10.1137/21M1395302.
  • [16] John Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49, June 2021.
  • [17] I. Ekeland and Roger Temam. Convex analysis and variational problems. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1999.
  • [18] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pages 2540–2548. JMLR.org, 2015. URL: http://proceedings.mlr.press/v37/frostig15.html.
  • [19] Michael B. Giles. Multilevel Monte Carlo path simulation. Operations Research, 56(3):607–617, June 2008. doi:10.1287/OPRE.1070.0496.
  • [20] Michael B. Giles. Multilevel Monte Carlo methods. Acta Numerica, 24:259–328, May 2015. doi:10.1017/S096249291500001X.
  • [21] G. N. Grapiglia and Yurii Nesterov. Tensor methods for finding approximate stationary points of convex functions. Optimization Methods and Software, 37(2):605–638, 2022. doi:10.1080/10556788.2020.1818082.
  • [22] Michael D. Grigoriadis and Leonid G. Khachiyan. A sublinear-time randomized approximation algorithm for matrix games. Operations Research Letters, 18(2):53–58, 1995. doi:10.1016/0167-6377(95)00032-0.
  • [23] Yujia Jin, Aaron Sidford, and Kevin Tian. Sharper rates for separable minimax and finite sum optimization via primal-dual extragradient methods. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 4362–4415. PMLR, 02–05 July 2022. URL: https://proceedings.mlr.press/v178/jin22b.html.
  • [24] Donghwan Kim and Jeffrey A. Fessler. Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions. Journal of Optimization Theory and Applications, 188(1):192–219, January 2021. doi:10.1007/S10957-020-01770-2.
  • [25] Jaeyeon Kim, Asuman Ozdaglar, Chanwoo Park, and Ernest K. Ryu. Time-reversed dissipation induces duality between minimizing gradient norm and function value, 2023. arXiv:2305.06628.
  • [26] Jaeyeon Kim, Chanwoo Park, Asuman Ozdaglar, Jelena Diakonikolas, and Ernest K. Ryu. Mirror duality in convex optimization, 2024. arXiv:2311.17296.
  • [27] Guanghui Lan, Yuyuan Ouyang, and Zhe Zhang. Optimal and parameter-free gradient minimization methods for convex and nonconvex optimization, 2023. arXiv:2310.12139.
  • [28] Jongmin Lee, Chanwoo Park, and Ernest K. Ryu. A geometric structure of acceleration and its role in making gradients small fast. In Neural Information Processing Systems, 2021.
  • [29] Daniel Levy, Yair Carmon, John C Duchi, and Aaron Sidford. Large-scale methods for distributionally robust optimization. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 8847–8860. Curran Associates, Inc., 2020.
  • [30] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pages 3384–3392, Cambridge, MA, USA, 2015. MIT Press.
  • [31] M. Minsky and S. Papert. Perceptrons: An introduction to computational geometry, 1988.
  • [32] Hongseok Namkoong and John C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 2216–2224, Red Hook, NY, USA, 2016. Curran Associates Inc.
  • [33] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009. doi:10.1137/070704277.
  • [34] A. S. Nemirovsky. On optimality of Krylov’s information when solving linear operator equations. Journal of Complexity, 7(2):121–130, 1991. doi:10.1016/0885-064X(91)90001-E.
  • [35] A. S. Nemirovsky. Information-based complexity of linear operator equations. Journal of Complexity, 8(2):153–175, 1992. doi:10.1016/0885-064X(92)90013-2.
  • [36] Yurii Nesterov. Lectures on Convex Optimization. Springer Publishing Company, Incorporated, 2nd edition, 2018.
  • [37] Yurii Nesterov, Alexander Gasnikov, Sergey Guminov, and Pavel Dvurechensky. Primal-dual accelerated gradient methods with small-dimensional relaxation oracle. Optimization Methods and Software, 36(4):773–810, July 2021. doi:10.1080/10556788.2020.1731747.
  • [38] Francesco Orabona. A modern introduction to online learning, 2023. arXiv:1912.13213.
  • [39] Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8:171–176, 1958.
  • [40] Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization, 2008.
  • [41] J. v. Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, December 1928.
  • [42] Jan van den Brand, Yin Tat Lee, Yang P. Liu, Thatchaphol Saranurak, Aaron Sidford, Zhao Song, and Di Wang. Minimum cost flows, MDPs, and $\ell_1$-regression in nearly linear time for dense instances. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2021, pages 859–869, New York, NY, USA, 2021. Association for Computing Machinery.
  • [43] Jun-Kun Wang, Jacob Abernethy, and Kfir Y. Levy. No-regret dynamics in the fenchel game: a unified framework for algorithmic convex optimization. Mathematical Programming, 205(1-2):203–268, May 2024. doi:10.1007/S10107-023-01976-Y.