ACM Other Conferences

10.1145/acmotherconferences

0000000

10.5555/0000000

Proceedings of the 8th Innovations in Theoretical Computer Science Conference (ITCS 2017)

ITCS 2017

10.4230/LIPIcs.ITCS.2017.3

Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent

Zeyuan Allen-Zhu (Author)

Lorenzo Orecchia (Author)

28 November 2017

3:1-3:22

First-order methods play a central role in large-scale machine learning. Even though many variations exist, each suited to a particular problem, almost all such methods fundamentally rely on two types of algorithmic steps: gradient descent, which yields primal progress, and mirror descent, which yields dual progress.

We observe that the performances of gradient and mirror descent are complementary, so that faster algorithms can be designed by "linearly coupling" the two. We show how to reconstruct Nesterov's accelerated gradient methods using linear coupling, which gives a cleaner interpretation than Nesterov's original proofs. We also discuss the power of linear coupling by extending it to many other settings that Nesterov's methods cannot apply to.

linear coupling, gradient descent, mirror descent, acceleration
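A minimal sketch of the linear coupling idea summarized in the abstract, under stated assumptions: the norm is Euclidean (so the mirror step reduces to a gradient step with its own step size), the objective is unconstrained and L-smooth, and the step sizes follow the schedule alpha = (k+2)/(2L), tau = 1/(alpha L), which is recalled from the paper and not stated in this record. The names linear_coupling, grad_f, A, X0 and the toy quadratic are hypothetical and chosen only for illustration.

```python
from typing import Callable, List

Vector = List[float]


def linear_coupling(grad: Callable[[Vector], Vector],
                    L: float, x0: Vector, T: int) -> Vector:
    """Hypothetical helper: couple a gradient step with a Euclidean mirror step."""
    y = list(x0)  # gradient-descent iterate (primal progress)
    z = list(x0)  # mirror-descent iterate (dual progress)
    for k in range(T):
        alpha = (k + 2) / (2.0 * L)  # mirror step size (assumed schedule)
        tau = 1.0 / (alpha * L)      # coupling weight, equals 2 / (k + 2)
        # Linear coupling: query the gradient at a convex combination of z and y.
        x = [tau * zi + (1.0 - tau) * yi for zi, yi in zip(z, y)]
        g = grad(x)
        # Gradient step with step size 1/L.
        y = [xi - gi / L for xi, gi in zip(x, g)]
        # Mirror step; with the Euclidean regularizer it is a plain update of z.
        z = [zi - alpha * gi for zi, gi in zip(z, g)]
    return y


# Toy objective (an assumption for illustration): f(x) = 0.5 * sum(A[i] * x[i]^2),
# minimized at x* = 0 with smoothness constant L = max(A).
A = [1.0, 100.0]


def f(x: Vector) -> float:
    return 0.5 * sum(a * xi * xi for a, xi in zip(A, x))


def grad_f(x: Vector) -> Vector:
    return [a * xi for a, xi in zip(A, x)]


L_SMOOTH = max(A)
X0 = [1.0, 1.0]
T = 50

y_T = linear_coupling(grad_f, L_SMOOTH, X0, T)
# Rate bound 2 * L * ||x0 - x*||^2 / T^2, here with x* = 0.
bound = 2.0 * L_SMOOTH * sum(xi * xi for xi in X0) / T ** 2
print(f(y_T), bound)
```

On this toy problem the returned point should satisfy f(y_T) <= 2 L ||x0 - x*||^2 / T^2, the O(1/T^2) rate associated with Nesterov's accelerated gradient method. The script prints both quantities so the comparison can be checked directly.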

