ACM Other Conferences

10.1145/acmotherconferences

0000000

10.5555/0000000

Proceedings of the 12th Innovations in Theoretical Computer Science Conference (ITCS 2021)

ITCS 2021

10.4230/LIPIcs.ITCS.2021.63

10003752.10003809.10003716.10011138.10011140

Theory of computation~Nonconvex optimization

500

Training (Overparametrized) Neural Networks in Near-Linear Time

Jan van den Brand (Author), KTH Royal Institute of Technology, Stockholm, Sweden, janvdb@kth.se

Binghui Peng (Author), Columbia University, New York, NY, USA, bp2601@columbia.edu

Zhao Song (Author), Princeton University and Institute for Advanced Study, NJ, USA, zhaos@ias.edu

Omri Weinstein (Author), Columbia University, New York, NY, USA, omri@cs.columbia.edu

04 02 2021

63:1 63:15

The slow convergence rate and pathological curvature issues of first-order gradient methods for training deep neural networks initiated an ongoing effort to develop faster second-order optimization algorithms beyond SGD, without compromising the generalization error. Despite their remarkable convergence rate (independent of the training batch size n), second-order algorithms incur a daunting slowdown in the cost per iteration (inverting the Hessian matrix of the loss function), which renders them impractical. Very recently, this computational overhead was mitigated by the works of [Zhang et al., 2019; Cai et al., 2019], yielding an O(mn²)-time second-order algorithm for training two-layer overparametrized neural networks of polynomial width m.

We show how to speed up the algorithm of [Cai et al., 2019], achieving an Õ(mn)-time backpropagation algorithm for training (mildly overparametrized) ReLU networks, which is near-linear in the dimension (mn) of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to reformulate the Gauss-Newton iteration as an 𝓁₂-regression problem, and then use a Fast-JL type dimension reduction to precondition the underlying Gram matrix in time independent of M, which allows us to find a sufficiently good approximate solution via first-order conjugate gradient. Our result provides a proof-of-concept that advanced machinery from randomized linear algebra - which led to recent breakthroughs in convex optimization (ERM, LPs, Regression) - can be carried over to the realm of deep learning as well.
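The pipeline outlined in the abstract (regression reformulation, sketch-based preconditioning, conjugate gradient) can be illustrated with a minimal NumPy sketch. This is not the authors' algorithm: a dense Gaussian sketch stands in for the Fast-JL transform, so the sketch does not achieve the stated running time, and the function name, its parameters, and the default sketch size of 4n rows are hypothetical choices made for illustration. Given a Jacobian J (n x p with p much larger than n) and a residual vector r, it approximately solves (J Jᵀ) g = r.

```python
import numpy as np

def sketched_gauss_newton_direction(J, r, sketch_rows=None, iters=50, tol=1e-10, seed=0):
    """Approximately solve (J J^T) g = r with a sketch-based preconditioner and CG.

    Hypothetical helper for illustration only; assumes J has full row rank.
    """
    n, p = J.shape
    rng = np.random.default_rng(seed)
    s = sketch_rows or 4 * n  # assumed sketch size, must be at least n

    # Dense Gaussian sketch: a simple stand-in for a Fast-JL transform (assumption).
    S = rng.standard_normal((s, p)) / np.sqrt(s)

    # QR of the sketched matrix S J^T (s x n); the n x n factor R is the preconditioner.
    _, R = np.linalg.qr(S @ J.T)

    def apply_M(z):
        # z -> R^{-T} (J J^T) R^{-1} z, using only matrix-vector products with J.
        u = np.linalg.solve(R, z)
        v = J @ (J.T @ u)
        return np.linalg.solve(R.T, v)

    # Conjugate gradient on the preconditioned system M z = R^{-T} r.
    rhs = np.linalg.solve(R.T, r)
    z = np.zeros(n)
    res = rhs.copy()          # residual rhs - M z for the starting point z = 0
    d = res.copy()
    rs = res @ res
    for _ in range(iters):
        if np.sqrt(rs) <= tol * np.linalg.norm(rhs):
            break
        Md = apply_M(d)
        alpha = rs / (d @ Md)
        z += alpha * d
        res -= alpha * Md
        rs_new = res @ res
        d = res + (rs_new / rs) * d
        rs = rs_new

    # Undo the change of variables g = R^{-1} z.
    return np.linalg.solve(R, z)
```

Because R comes from a sketch of Jᵀ, the preconditioned matrix is expected to be well conditioned, so conjugate gradient should need only a few iterations; the Gram matrix J Jᵀ is never formed explicitly.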

Deep learning theory Nonconvex optimization

Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148-4187, 2017.

Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast johnson-lindenstrauss transform. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing (STOC), pages 557-563, 2006.

Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in neural information processing systems (NeurIPS), pages 6155-6166, 2019.

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In ICML, volume 97, pages 242-252. PMLR, 2019. arXiv version: URL: https://arxiv.org/pdf/1811.03962.pdf.

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. In NeurIPS, pages 6673-6685, 2019. arXiv version: URL: https://arxiv.org/pdf/1810.12065.pdf.

Alexandr Andoni, Chengyu Lin, Ying Sheng, Peilin Zhong, and Ruiqi Zhong. Subspace embedding and linear regression with orlicz norm. In ICML, pages 224-233. PMLR, 2018. arXiv version: URL: https://arXiv.org/pdf/1806.06430.pdf.

Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer. Second order optimization made practical. arXiv preprint, 2020. URL: http://arxiv.org/abs/arXiv:2002.09018.

Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 322-332. PMLR, 2019. arXiv version: URL: https://arxiv.org/pdf/1901.08584.pdf.

Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems (NeurIPS), pages 8139-8148, 2019. arXiv version: URL: https://arxiv.org/pdf/1904.11955.pdf.

Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In ICML, volume 70, pages 253-262. PMLR, 2017. arXiv version: URL: https://arxiv.org/pdf/1804.09893.pdf.

Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. A universal sampling method for reconstructing signals with simple fourier transforms. In STOC. ACM, 2019. arXiv version: URL: https://arxiv.org/pdf/1812.08723.pdf. DOI: 10.1145/3313276.3316363

Ainesh Bakshi, Nadiia Chepurko, and David P Woodruff. Robust and sample optimal algorithms for psd low-rank approximation. arXiv preprint, 2019. URL: http://arxiv.org/abs/1912.04177.

Ainesh Bakshi, Rajesh Jayaram, and David P Woodruff. Learning two layer rectified neural networks in polynomial time. In COLT, volume 99, pages 195-268. PMLR, 2019. arXiv version: URL: https://arxiv.org/pdf/1811.01885.pdf.

Ainesh Bakshi and David Woodruff. Sublinear time low-rank approximation of distance matrices. In Advances in Neural Information Processing Systems (NeurIPS), pages 3782-3792, 2018. arXiv version: URL: https://arxiv.org/pdf/1809.06986.pdf.

Frank Ban, David P. Woodruff, and Richard Zhang. Regularized weighted low rank approximation. In NeurIPS, pages 4061-4071, 2020. arXiv version: URL: https://arxiv.org/pdf/1911.06958.pdf.

Sue Becker and Yann Le Cun. Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 connectionist models summer school, pages 29-37, 1988.

Alberto Bernacchia, Máté Lengyel, and Guillaume Hennequin. Exact natural gradient in deep linear networks and its application to the nonlinear case. In Advances in Neural Information Processing Systems (NIPS), pages 5941-5950, 2018.

Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical gauss-newton optimisation for deep learning. In International Conference on Machine Learning (ICML), pages 557-565, 2017.

Christos Boutsidis and David P Woodruff. Optimal cur matrix decompositions. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing (STOC), pages 353-362. ACM, 2014. arXiv version: URL: https://arxiv.org/pdf/1405.7910.pdf.

Christos Boutsidis, David P Woodruff, and Peilin Zhong. Optimal principal component analysis in distributed and streaming models. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing (STOC), pages 236-249, 2016.

Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231-357, 2015.

Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, and Dan Mikulincer. Network size and weights size for memorization with two-layers neural networks. arXiv preprint, 2020. URL: http://arxiv.org/abs/2006.02855.

Tianle Cai, Ruiqi Gao, Jikai Hou, Siyu Chen, Dong Wang, Di He, Zhihua Zhang, and Liwei Wang. A gram-gauss-newton method learning overparameterized deep neural networks for regression problems. arXiv preprint, 2019. URL: http://arxiv.org/abs/1905.11675.

Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing Conference (STOC), pages 81-90. ACM, 2013. arXiv version: URL: https://arxiv.org/pdf/1207.6365.pdf.

Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In STOC, pages 938-942. ACM, 2019. arXiv version: URL: https://arxiv.org/pdf/1810.07896.pdf.

Henry Cohn, Robert Kleinberg, Balazs Szegedy, and Christopher Umans. Group-theoretic algorithms for matrix multiplication. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 379-388. IEEE, 2005.

Amit Daniely. Memorizing gaussians with no over-parameterizaion via gradient decent on neural networks. arXiv preprint, 2020. URL: http://arxiv.org/abs/2003.12895.

Huaian Diao, Rajesh Jayaram, Zhao Song, Wen Sun, and David Woodruff. Optimal sketching for kronecker product regression and low rank approximation. In Advances in Neural Information Processing Systems (NeurIPS), pages 4739-4750, 2019. arXiv version: URL: https://arxiv.org/pdf/1909.13384.pdf.

Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13(Dec):3475-3506, 2012.

Petros Drineas, Michael W Mahoney, and Shan Muthukrishnan. Sampling algorithms for l2 regression and applications. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 1127-1136. Society for Industrial and Applied Mathematics, 2006.

Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In ICLR. OpenReview.net, 2019. arXiv version: URL: https://arxiv.org/pdf/1810.02054.pdf.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research (JMLR), 12(Jul):2121-2159, 2011.

François Le Gall and Florent Urrutia. Improved rectangular matrix multiplication using powers of the coppersmith-winograd tensor. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1029-1046. SIAM, 2018. arXiv version: URL: https://arxiv.org/pdf/1708.05622.pdf.

Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. Fast approximate natural gradient descent in a kronecker factored eigenbasis. In Advances in Neural Information Processing Systems (NIPS), pages 9550-9560, 2018.

Roger Grosse and James Martens. A kronecker-factored approximate fisher matrix for convolution layers. In International Conference on Machine Learning (ICML), pages 573-582, 2016.

Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning (ICML), pages 1842-1850, 2018.

Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In ICML, volume 48, pages 1225-1234. JMLR.org, 2016. arXiv version: URL: https://arxiv.org/pdf/1509.01240.pdf.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems (NIPS), pages 8571-8580, 2018.

Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. In ICLR. OpenReview.net, 2020. arXiv version: URL: https://arxiv.org/pdf/1909.12292.pdf.

Haotian Jiang, Tarun Kathuria, Yin Tat Lee, Swati Padmanabhan, and Zhao Song. A faster interior point method for semidefinite programming. In Manuscript, 2020.

Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. An improved cutting plane method for convex optimization, convex-concave games and its applications. In STOC, pages 944-953. ACM, 2020. arXiv version: URL: https://arxiv.org/pdf/2004.04250.pdf.

Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. Faster dynamic matrix inverse for faster lps. arXiv preprint, 2020. URL: http://arxiv.org/abs/2004.07470.

Jonathan A Kelner, Lorenzo Orecchia, Aaron Sidford, and Zeyuan Allen Zhu. A simple, combinatorial algorithm for solving sdd systems in nearly-linear time. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing (STOC), pages 911-920. ACM, 2013. arXiv version: URL: https://arxiv.org/pdf/1301.6628.pdf.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. arXiv version: URL: https://arxiv.org/pdf/1412.6980.pdf.

Kasper Green Larsen and Jelani Nelson. Optimality of the johnson-lindenstrauss lemma. In IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 633-638. IEEE, 2017. arXiv version: URL: https://arxiv.org/pdf/1609.02094.pdf.

François Le Gall. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th international symposium on symbolic and algebraic computation (ISSAC), pages 296-303. ACM, 2014.

Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization in the current matrix multiplication time. In COLT. https://arxiv.org/pdf/1905.04447, 2019.

Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In NeurIPS, pages 8168-8177, 2018. arXiv version: URL: https://arxiv.org/pdf/1808.01204.pdf.

Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in neural information processing systems (NIPS), pages 597-607, 2017. arXiv version: URL: https://arxiv.org/pdf/1705.09886.pdf.

Hang Liao, Barak A. Pearlmutter, Vamsi K. Potluru, and David P. Woodruff. Automatic differentiation of sketched regression. In AISTATS, 2020.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint, 2019. URL: http://arxiv.org/abs/1908.03265.

Yichao Lu, Paramveer Dhillon, Dean P Foster, and Lyle Ungar. Faster ridge regression via the subsampled randomized hadamard transform. In Advances in neural information processing systems, pages 369-377, 2013.

James Martens. Deep learning via hessian-free optimization. In ICML, volume 27, pages 735-742, 2010.

James Martens, Jimmy Ba, and Matthew Johnson. Kronecker-factored curvature approximations for recurrent neural networks. In ICLR, 2018.

James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning (ICML), pages 2408-2417, 2015.

Philipp Moritz, Robert Nishihara, and Michael Jordan. A linearly-convergent stochastic l-bfgs algorithm. In Artificial Intelligence and Statistics, pages 249-258, 2016.

Jelani Nelson and Huy L Nguyên. Osnap: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 117-126. IEEE, 2013. arXiv version: URL: https://arxiv.org/pdf/1211.1002.pdf.

Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 2019.

Mert Pilanci and Martin J Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205-245, 2017.

Eric Price, Zhao Song, and David P. Woodruff. Fast regression with an 𝓁_∞ guarantee. In International Colloquium on Automata, Languages, and Programming (ICALP), volume 80 of LIPIcs, pages 59:1-59:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017. arXiv version: URL: https://arxiv.org/pdf/1705.10723.pdf. DOI: 10.4230/LIPIcs.ICALP.2017.59

Ilya Razenshteyn, Zhao Song, and David P Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the 48th Annual Symposium on the Theory of Computing (STOC), 2016.

Vladimir Rokhlin and Mark Tygert. A fast randomized algorithm for overdetermined linear least-squares regression. Proceedings of the National Academy of Sciences, 105(36):13212-13217, 2008.

Tamás Sarlós. Improved approximation algorithms for large matrices via random projections. In Proceedings of 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2006.

Zhao Song. Matrix Theory : Optimization, Concentration and Algorithms. PhD thesis, The University of Texas at Austin, 2019.

Zhao Song, Ruosong Wang, Lin Yang, Hongyang Zhang, and Peilin Zhong. Efficient symmetric norm regression via linear sketching. In Advances in Neural Information Processing Systems (NeurIPS), pages 828-838, 2019. arXiv version: URL: https://arxiv.org/pdf/1910.01788.pdf.

Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2772-2789. SIAM, 2019. arXiv version: URL: https://arxiv.org/pdf/1704.08246.pdf.

Zhao Song and Xin Yang. Quadratic suffices for over-parametrization via matrix chernoff bound. arXiv preprint, 2019. URL: http://arxiv.org/abs/1906.03593.

Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing (STOC), pages 81-90. ACM, 2004.

Joel A Tropp. Improved analysis of the subsampled randomized hadamard transform. Advances in Adaptive Data Analysis, 3(01n02):115-126, 2011.

Pravin M Vaidya. A new algorithm for minimizing convex functions over convex sets. In 30th Annual Symposium on Foundations of Computer Science (FOCS), pages 338-343. IEEE, 1989.

Pravin M Vaidya. Speeding-up linear programming using fast matrix multiplication. In 30th Annual Symposium on Foundations of Computer Science (FOCS), pages 332-337. IEEE, 1989.

Ruosong Wang and David P Woodruff. Tight bounds for 𝓁_p oblivious subspace embeddings. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1825-1843. SIAM, 2019. arXiv version: URL: https://arxiv.org/pdf/1801.04414.pdf.

Virginia Vassilevska Williams. Multiplying matrices faster than coppersmith-winograd. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing (STOC), pages 887-898. ACM, 2012.

David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1-157, 2014.

David P. Woodruff and Amir Zandieh. Near input sparsity time kernel embeddings via adaptive sampling. In ICML, 2020.

David P Woodruff and Peilin Zhong. Distributed low rank approximation of implicit functions of a matrix. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 847-858. IEEE, 2016.

Xiaoxia Wu, Simon S Du, and Rachel Ward. Global convergence of adaptive gradient methods for an over-parameterized neural network. arXiv preprint, 2019. URL: http://arxiv.org/abs/1902.07111.

Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in neural information processing systems (NIPS), pages 5279-5288, 2017.

Guodong Zhang, James Martens, and Roger B Grosse. Fast convergence of natural gradient descent for over-parameterized neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 8080-8091, 2019.

Kai Zhong, Zhao Song, and Inderjit S Dhillon. Learning non-overlapping convolutional neural networks with multiple kernels. arXiv preprint, 2017. URL: http://arxiv.org/abs/1711.03440.

Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In ICML, volume 70, pages 4140-4149. PMLR, 2017. arXiv version: URL: https://arxiv.org/pdf/1706.03175.pdf.

<book-part-wrapper xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="2.0" xml:lang="en" content-type="research-article">

<collection-meta collection-type="book-series">

<collection-id collection-id-type="doi">10.1145/acmotherconferences</collection-id>

<title-group>

<title>ACM Other Conferences</title>

</title-group>

</collection-meta>

<book-meta>

<book-id book-id-type="acm-id">0000000</book-id>

<book-id book-id-type="doi">10.5555/0000000</book-id>

<book-title-group>

<book-title>Proceedings of the 12th Innovations in Theoretical Computer Science Conference (ITCS 2021)</book-title>

<alt-title alt-title-type="acronym">ITCS 2021</alt-title>

</book-title-group>

</book-meta>

<book-part book-part-type="chapter" xml:lang="en">

<book-part-meta>

<book-part-id book-part-id-type="doi">10.4230/LIPIcs.ITCS.2021.63</book-part-id>

<book-part-id book-part-id-type="article-no">63</book-part-id>

<subj-group subj-group-type="ccs2012">

<compound-subject>

<compound-subject-part content-type="code">10003752.10003809.10003716.10011138.10011140</compound-subject-part>

<compound-subject-part content-type="text">Theory of computation~Nonconvex optimization</compound-subject-part>

<compound-subject-part content-type="weight">500</compound-subject-part>

</compound-subject>

</subj-group>

<title-group>

<title>Training (Overparametrized) Neural Networks in Near-Linear Time</title>

</title-group>

<contrib-group>

<name>

<surname>van den Brand</surname>

<given-names>Jan</given-names>

</name>

<aff>KTH Royal Institute of Technology, Stockholm, Sweden</aff>

<email>janvdb@kth.se</email>

<role>Author</role>

</contrib>

<name>

<surname>Peng</surname>

<given-names>Binghui</given-names>

</name>

<aff>Columbia University, New York, NY, USA</aff>

<email>bp2601@columbia.edu</email>

<role>Author</role>

</contrib>

<name>

<surname>Song</surname>

<given-names>Zhao</given-names>

</name>

<aff>Princeton University and Institute for Advanced Study, NJ, USA</aff>

<email>zhaos@ias.edu</email>

<role>Author</role>

</contrib>

<name>

<surname>Weinstein</surname>

<given-names>Omri</given-names>

</name>

<aff>Columbia University, New York, NY, USA</aff>

<email>omri@cs.columbia.edu</email>

<role>Author</role>

</contrib>

</contrib-group>

<pub-date date-type="publication">

</pub-date>

<p>The slow convergence rate and pathological curvature issues of first-order gradient methods for training deep neural networks initiated an ongoing effort to develop faster second-order optimization algorithms beyond SGD, without compromising the generalization error. Despite their remarkable convergence rate (independent of the training batch size n), second-order algorithms incur a daunting slowdown in the cost per iteration (inverting the Hessian matrix of the loss function), which renders them impractical. Very recently, this computational overhead was mitigated by the works of [Zhang et al., 2019; Cai et al., 2019], yielding an O(mn²)-time second-order algorithm for training two-layer overparametrized neural networks of polynomial width m.</p>

<p>We show how to speed up the algorithm of [Cai et al., 2019], achieving an Õ(mn)-time backpropagation algorithm for training (mildly overparametrized) ReLU networks, which is near-linear in the dimension (mn) of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to reformulate the Gauss-Newton iteration as an 𝓁₂-regression problem, and then use a Fast-JL type dimension reduction to precondition the underlying Gram matrix in time independent of M, which allows us to find a sufficiently good approximate solution via first-order conjugate gradient. Our result provides a proof-of-concept that advanced machinery from randomized linear algebra - which led to recent breakthroughs in convex optimization (ERM, LPs, Regression) - can be carried over to the realm of deep learning as well.</p>

</abstract>

<kwd-group>

<kwd>Deep learning theory</kwd>

<kwd>Nonconvex optimization</kwd>

</kwd-group>

</book-part-meta>

<back>

<ref-list specific-use="unparsed">

<mixed-citation>Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148-4187, 2017.</mixed-citation>

</ref>

<mixed-citation>Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast johnson-lindenstrauss transform. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing (STOC), pages 557-563, 2006.</mixed-citation>

</ref>

<mixed-citation>Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in neural information processing systems (NeurIPS), pages 6155-6166, 2019.</mixed-citation>

</ref>

<mixed-citation>Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In ICML, volume 97, pages 242-252. PMLR, 2019. arXiv version: URL: https://arxiv.org/pdf/1811.03962.pdf.</mixed-citation>

</ref>

<mixed-citation>Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. In NeurIPS, pages 6673-6685, 2019. arXiv version: URL: https://arxiv.org/pdf/1810.12065.pdf.</mixed-citation>

</ref>

<mixed-citation>Alexandr Andoni, Chengyu Lin, Ying Sheng, Peilin Zhong, and Ruiqi Zhong. Subspace embedding and linear regression with orlicz norm. In ICML, pages 224-233. PMLR, 2018. arXiv version: URL: https://arXiv.org/pdf/1806.06430.pdf.</mixed-citation>

</ref>

<mixed-citation>Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer. Second order optimization made practical. arXiv preprint, 2020. URL: http://arxiv.org/abs/arXiv:2002.09018.</mixed-citation>

</ref>

<mixed-citation>Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 322-332. PMLR, 2019. arXiv version: URL: https://arxiv.org/pdf/1901.08584.pdf.</mixed-citation>

</ref>

<mixed-citation>Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems (NeurIPS), pages 8139-8148, 2019. arXiv version: URL: https://arxiv.org/pdf/1904.11955.pdf.</mixed-citation>

</ref>

<mixed-citation>Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In ICML, volume 70, pages 253-262. PMLR, 2017. arXiv version: URL: https://arxiv.org/pdf/1804.09893.pdf.</mixed-citation>

</ref>

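<mixed-citation>Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. A universal sampling method for reconstructing signals with simple fourier transforms. In STOC. ACM, 2019. arXiv version: URL: https://arxiv.org/pdf/1812.08723.pdf.

<pub-id pub-id-type="doi" xlink:href="10.1145/3313276.3316363">10.1145/3313276.3316363</pub-id>

</mixed-citation>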

</ref>

<mixed-citation>Ainesh Bakshi, Nadiia Chepurko, and David P Woodruff. Robust and sample optimal algorithms for psd low-rank approximation. arXiv preprint, 2019. URL: http://arxiv.org/abs/1912.04177.</mixed-citation>

</ref>

<mixed-citation>Ainesh Bakshi, Rajesh Jayaram, and David P Woodruff. Learning two layer rectified neural networks in polynomial time. In COLT, volume 99, pages 195-268. PMLR, 2019. arXiv version: URL: https://arxiv.org/pdf/1811.01885.pdf.</mixed-citation>

</ref>

<mixed-citation>Ainesh Bakshi and David Woodruff. Sublinear time low-rank approximation of distance matrices. In Advances in Neural Information Processing Systems (NeurIPS), pages 3782-3792, 2018. arXiv version: URL: https://arxiv.org/pdf/1809.06986.pdf.</mixed-citation>

</ref>

<mixed-citation>Frank Ban, David P. Woodruff, and Richard Zhang. Regularized weighted low rank approximation. In NeurIPS, pages 4061-4071, 2020. arXiv version: URL: https://arxiv.org/pdf/1911.06958.pdf.</mixed-citation>

</ref>

<mixed-citation>Sue Becker and Yann Le Cun. Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 connectionist models summer school, pages 29-37, 1988.</mixed-citation>

</ref>

<mixed-citation>Alberto Bernacchia, Máté Lengyel, and Guillaume Hennequin. Exact natural gradient in deep linear networks and its application to the nonlinear case. In Advances in Neural Information Processing Systems (NIPS), pages 5941-5950, 2018.</mixed-citation>

</ref>

<mixed-citation>Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical gauss-newton optimisation for deep learning. In International Conference on Machine Learning (ICML), pages 557-565, 2017.</mixed-citation>

</ref>

<mixed-citation>Christos Boutsidis and David P Woodruff. Optimal cur matrix decompositions. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing (STOC), pages 353-362. ACM, 2014. arXiv version: URL: https://arxiv.org/pdf/1405.7910.pdf.</mixed-citation>

</ref>

<mixed-citation>Christos Boutsidis, David P Woodruff, and Peilin Zhong. Optimal principal component analysis in distributed and streaming models. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing (STOC), pages 236-249, 2016.</mixed-citation>

</ref>

<mixed-citation>Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231-357, 2015.</mixed-citation>

</ref>

<mixed-citation>Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, and Dan Mikulincer. Network size and weights size for memorization with two-layers neural networks. arXiv preprint, 2020. URL: http://arxiv.org/abs/2006.02855.</mixed-citation>

</ref>

<mixed-citation>Tianle Cai, Ruiqi Gao, Jikai Hou, Siyu Chen, Dong Wang, Di He, Zhihua Zhang, and Liwei Wang. A gram-gauss-newton method learning overparameterized deep neural networks for regression problems. arXiv preprint, 2019. URL: http://arxiv.org/abs/1905.11675.</mixed-citation>

</ref>

<mixed-citation>Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing Conference (STOC), pages 81-90. ACM, 2013. arXiv version: URL: https://arxiv.org/pdf/1207.6365.pdf.</mixed-citation>

</ref>

<mixed-citation>Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In STOC, pages 938-942. ACM, 2019. arXiv version: URL: https://arxiv.org/pdf/1810.07896.pdf.</mixed-citation>

</ref>

<mixed-citation>Henry Cohn, Robert Kleinberg, Balazs Szegedy, and Christopher Umans. Group-theoretic algorithms for matrix multiplication. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 379-388. IEEE, 2005.</mixed-citation>

</ref>

<mixed-citation>Amit Daniely. Memorizing gaussians with no over-parameterizaion via gradient decent on neural networks. arXiv preprint, 2020. URL: http://arxiv.org/abs/2003.12895.</mixed-citation>

</ref>

<mixed-citation>Huaian Diao, Rajesh Jayaram, Zhao Song, Wen Sun, and David Woodruff. Optimal sketching for kronecker product regression and low rank approximation. In Advances in Neural Information Processing Systems (NeurIPS), pages 4739-4750, 2019. arXiv version: URL: https://arxiv.org/pdf/1909.13384.pdf.</mixed-citation>

</ref>

<mixed-citation>Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13(Dec):3475-3506, 2012.</mixed-citation>

</ref>

<mixed-citation>Petros Drineas, Michael W Mahoney, and Shan Muthukrishnan. Sampling algorithms for l2 regression and applications. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 1127-1136. Society for Industrial and Applied Mathematics, 2006.</mixed-citation>

</ref>

<mixed-citation>Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In ICLR. OpenReview.net, 2019. arXiv version: URL: https://arxiv.org/pdf/1810.02054.pdf.</mixed-citation>

</ref>

<mixed-citation>John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research (JMLR), 12(Jul):2121-2159, 2011.</mixed-citation>

</ref>

<mixed-citation>François Le Gall and Florent Urrutia. Improved rectangular matrix multiplication using powers of the Coppersmith-Winograd tensor. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1029-1046. SIAM, 2018. arXiv version: URL: https://arxiv.org/pdf/1708.05622.pdf.</mixed-citation>

</ref>

<mixed-citation>Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. Fast approximate natural gradient descent in a Kronecker factored eigenbasis. In Advances in Neural Information Processing Systems (NIPS), pages 9550-9560, 2018.</mixed-citation>

</ref>

<mixed-citation>Roger Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning (ICML), pages 573-582, 2016.</mixed-citation>

</ref>

<mixed-citation>Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning (ICML), pages 1842-1850, 2018.</mixed-citation>

</ref>

<mixed-citation>Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In ICML, volume 48, pages 1225-1234. JMLR.org, 2016. arXiv version: URL: https://arxiv.org/pdf/1509.01240.pdf.</mixed-citation>

</ref>

<mixed-citation>Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems (NIPS), pages 8571-8580, 2018.</mixed-citation>

</ref>

<mixed-citation>Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. In ICLR. OpenReview.net, 2020. arXiv version: URL: https://arxiv.org/pdf/1909.12292.pdf.</mixed-citation>

</ref>

<mixed-citation>Haotian Jiang, Tarun Kathuria, Yin Tat Lee, Swati Padmanabhan, and Zhao Song. A faster interior point method for semidefinite programming. Manuscript, 2020.</mixed-citation>

</ref>

<mixed-citation>Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. An improved cutting plane method for convex optimization, convex-concave games and its applications. In STOC, pages 944-953. ACM, 2020. arXiv version: URL: https://arxiv.org/pdf/2004.04250.pdf.</mixed-citation>

</ref>

<mixed-citation>Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. Faster dynamic matrix inverse for faster LPs. arXiv preprint, 2020. URL: http://arxiv.org/abs/2004.07470.</mixed-citation>

</ref>

<mixed-citation>Jonathan A Kelner, Lorenzo Orecchia, Aaron Sidford, and Zeyuan Allen Zhu. A simple, combinatorial algorithm for solving SDD systems in nearly-linear time. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing (STOC), pages 911-920. ACM, 2013. arXiv version: URL: https://arxiv.org/pdf/1301.6628.pdf.</mixed-citation>

</ref>

<mixed-citation>Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. arXiv version: URL: https://arxiv.org/pdf/1412.6980.pdf.</mixed-citation>

</ref>

<mixed-citation>Kasper Green Larsen and Jelani Nelson. Optimality of the Johnson-Lindenstrauss lemma. In IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 633-638. IEEE, 2017. arXiv version: URL: https://arxiv.org/pdf/1609.02094.pdf.</mixed-citation>

</ref>

<mixed-citation>François Le Gall. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th international symposium on symbolic and algebraic computation (ISSAC), pages 296-303. ACM, 2014.</mixed-citation>

</ref>

<mixed-citation>Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization in the current matrix multiplication time. In COLT, 2019. arXiv version: URL: https://arxiv.org/pdf/1905.04447.</mixed-citation>

</ref>

<mixed-citation>Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In NeurIPS, pages 8168-8177, 2018. arXiv version: URL: https://arxiv.org/pdf/1808.01204.pdf.</mixed-citation>

</ref>

<mixed-citation>Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in neural information processing systems (NIPS), pages 597-607, 2017. arXiv version: URL: https://arxiv.org/pdf/1705.09886.pdf.</mixed-citation>

</ref>

<mixed-citation>Hang Liao, Barak A. Pearlmutter, Vamsi K. Potluru, and David P. Woodruff. Automatic differentiation of sketched regression. In AISTATS, 2020.</mixed-citation>

</ref>

<mixed-citation>Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint, 2019. URL: http://arxiv.org/abs/1908.03265.</mixed-citation>

</ref>

<mixed-citation>Yichao Lu, Paramveer Dhillon, Dean P Foster, and Lyle Ungar. Faster ridge regression via the subsampled randomized Hadamard transform. In Advances in neural information processing systems, pages 369-377, 2013.</mixed-citation>

</ref>

<mixed-citation>James Martens. Deep learning via Hessian-free optimization. In ICML, volume 27, pages 735-742, 2010.</mixed-citation>

</ref>

<mixed-citation>James Martens, Jimmy Ba, and Matthew Johnson. Kronecker-factored curvature approximations for recurrent neural networks. In ICLR, 2018.</mixed-citation>

</ref>

<mixed-citation>James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International conference on machine learning (ICML), pages 2408-2417, 2015.</mixed-citation>

</ref>

<mixed-citation>Philipp Moritz, Robert Nishihara, and Michael Jordan. A linearly-convergent stochastic L-BFGS algorithm. In Artificial Intelligence and Statistics, pages 249-258, 2016.</mixed-citation>

</ref>

<mixed-citation>Jelani Nelson and Huy L Nguyên. Osnap: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 117-126. IEEE, 2013. arXiv version: URL: https://arxiv.org/pdf/1211.1002.pdf.</mixed-citation>

</ref>

<mixed-citation>Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 2019.</mixed-citation>

</ref>

<mixed-citation>Mert Pilanci and Martin J Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205-245, 2017.</mixed-citation>

</ref>

<mixed-citation>

<pub-id pub-id-type="doi" xlink:href="10.4230/LIPIcs.ICALP.2017.59">10.4230/LIPIcs.ICALP.2017.59</pub-id>

</mixed-citation>

</ref>

<mixed-citation>Ilya Razenshteyn, Zhao Song, and David P Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the 48th Annual Symposium on the Theory of Computing (STOC), 2016.</mixed-citation>

</ref>

<mixed-citation>Vladimir Rokhlin and Mark Tygert. A fast randomized algorithm for overdetermined linear least-squares regression. Proceedings of the National Academy of Sciences, 105(36):13212-13217, 2008.</mixed-citation>

</ref>

<mixed-citation>Tamás Sarlós. Improved approximation algorithms for large matrices via random projections. In Proceedings of 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2006.</mixed-citation>

</ref>

<mixed-citation>Zhao Song. Matrix Theory: Optimization, Concentration and Algorithms. PhD thesis, The University of Texas at Austin, 2019.</mixed-citation>

</ref>

<mixed-citation>Zhao Song, Ruosong Wang, Lin Yang, Hongyang Zhang, and Peilin Zhong. Efficient symmetric norm regression via linear sketching. In Advances in Neural Information Processing Systems (NeurIPS), pages 828-838, 2019. arXiv version: URL: https://arxiv.org/pdf/1910.01788.pdf.</mixed-citation>

</ref>

<mixed-citation>Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2772-2789. SIAM, 2019. arXiv version: URL: https://arxiv.org/pdf/1704.08246.pdf.</mixed-citation>

</ref>

<mixed-citation>Zhao Song and Xin Yang. Quadratic suffices for over-parametrization via matrix Chernoff bound. arXiv preprint, 2019. URL: http://arxiv.org/abs/1906.03593.</mixed-citation>

</ref>

<mixed-citation>Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing (STOC), pages 81-90. ACM, 2004.</mixed-citation>

</ref>

<mixed-citation>Joel A Tropp. Improved analysis of the subsampled randomized Hadamard transform. Advances in Adaptive Data Analysis, 3(01n02):115-126, 2011.</mixed-citation>

</ref>

<mixed-citation>Pravin M Vaidya. A new algorithm for minimizing convex functions over convex sets. In 30th Annual Symposium on Foundations of Computer Science (FOCS), pages 338-343. IEEE, 1989.</mixed-citation>

</ref>

<mixed-citation>Pravin M Vaidya. Speeding-up linear programming using fast matrix multiplication. In 30th Annual Symposium on Foundations of Computer Science (FOCS), pages 332-337. IEEE, 1989.</mixed-citation>

</ref>

<mixed-citation>Ruosong Wang and David P Woodruff. Tight bounds for 𝓁_p oblivious subspace embeddings. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1825-1843. SIAM, 2019. arXiv version: URL: https://arxiv.org/pdf/1801.04414.pdf.</mixed-citation>

</ref>

<mixed-citation>Virginia Vassilevska Williams. Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing (STOC), pages 887-898. ACM, 2012.</mixed-citation>

</ref>

<mixed-citation>David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1-157, 2014.</mixed-citation>

</ref>

<mixed-citation>David P. Woodruff and Amir Zandieh. Near input sparsity time kernel embeddings via adaptive sampling. In ICML, 2020.</mixed-citation>

</ref>

<mixed-citation>David P Woodruff and Peilin Zhong. Distributed low rank approximation of implicit functions of a matrix. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 847-858. IEEE, 2016.</mixed-citation>

</ref>

<mixed-citation>Xiaoxia Wu, Simon S Du, and Rachel Ward. Global convergence of adaptive gradient methods for an over-parameterized neural network. arXiv preprint, 2019. URL: http://arxiv.org/abs/1902.07111.</mixed-citation>

</ref>

<mixed-citation>Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in neural information processing systems (NIPS), pages 5279-5288, 2017.</mixed-citation>

</ref>

<mixed-citation>Guodong Zhang, James Martens, and Roger B Grosse. Fast convergence of natural gradient descent for over-parameterized neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 8080-8091, 2019.</mixed-citation>

</ref>

<mixed-citation>Kai Zhong, Zhao Song, and Inderjit S Dhillon. Learning non-overlapping convolutional neural networks with multiple kernels. arXiv preprint, 2017. URL: http://arxiv.org/abs/1711.03440.</mixed-citation>

</ref>

<mixed-citation>Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In ICML, volume 70, pages 4140-4149. PMLR, 2017. arXiv version: URL: https://arxiv.org/pdf/1706.03175.pdf.</mixed-citation>

</ref>

</ref-list>

</back>

</book-part>

</book-part-wrapper>