Training (Overparametrized) Neural Networks in Near-Linear Time

Authors Jan van den Brand, Binghui Peng, Zhao Song, Omri Weinstein




File

LIPIcs.ITCS.2021.63.pdf
  • Filesize: 0.57 MB
  • 15 pages

Author Details

Jan van den Brand
  • KTH Royal Institute of Technology, Stockholm, Sweden
Binghui Peng
  • Columbia University, New York, NY, USA
Zhao Song
  • Princeton University and Institute for Advanced Study, NJ, USA
Omri Weinstein
  • Columbia University, New York, NY, USA

Acknowledgements

The authors would like to thank David Woodruff for telling us about the tensor trick for computing kernel matrices and for helping us improve the presentation of the paper.

Cite As

Jan van den Brand, Binghui Peng, Zhao Song, and Omri Weinstein. Training (Overparametrized) Neural Networks in Near-Linear Time. In 12th Innovations in Theoretical Computer Science Conference (ITCS 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 185, pp. 63:1-63:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.ITCS.2021.63

Abstract

The slow convergence rate and pathological curvature issues of first-order gradient methods for training deep neural networks have initiated an ongoing effort to develop faster second-order optimization algorithms beyond SGD, without compromising the generalization error. Despite their remarkable convergence rate (independent of the training batch size n), second-order algorithms incur a daunting slowdown in the cost per iteration (inverting the Hessian matrix of the loss function), which renders them impractical. Very recently, this computational overhead was mitigated by the works of [Zhang et al., 2019; Cai et al., 2019], yielding an O(mn²)-time second-order algorithm for training two-layer overparametrized neural networks of polynomial width m. We show how to speed up the algorithm of [Cai et al., 2019], achieving an Õ(mn)-time backpropagation algorithm for training (mildly overparametrized) ReLU networks, which is near-linear in the dimension (mn) of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to reformulate the Gauss-Newton iteration as an 𝓁₂-regression problem, and then use a Fast-JL type dimension reduction to precondition the underlying Gram matrix in time independent of m, which allows us to find a sufficiently good approximate solution via first-order conjugate gradient. Our result provides a proof-of-concept that advanced machinery from randomized linear algebra, which led to recent breakthroughs in convex optimization (ERM, LPs, regression), can be carried over to the realm of deep learning as well.
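
The sketch-and-precondition idea named in the abstract can be illustrated with a small NumPy example. This is a minimal sketch under stated assumptions, not the algorithm from the paper: it uses a dense Gaussian sketch as a stand-in for the Fast-JL transform (so it does not achieve the stated running time), and the function name, parameters, and problem sizes are hypothetical. Given a Jacobian-like matrix J and a residual vector r, it approximately solves the Gram system (J Jᵀ) g = r by preconditioning with the QR factor of the sketched matrix and then running conjugate gradient.

```python
import numpy as np

def sketch_preconditioned_gram_solve(J, r, sketch_rows, iters=50, tol=1e-10, seed=0):
    """Approximately solve (J J^T) g = r via a sketched preconditioner and CG.

    Hypothetical helper for illustration; J has shape (n, p) with p >> n.
    """
    n, p = J.shape
    rng = np.random.default_rng(seed)

    # Assumption: dense Gaussian sketch instead of a Fast-JL transform.
    S = rng.standard_normal((sketch_rows, p)) / np.sqrt(sketch_rows)

    # Sketch the tall matrix J^T (p x n) down to (sketch_rows x n), then take
    # its QR factor. If S roughly preserves norms on the column space of J^T,
    # then R^T R approximates the Gram matrix J J^T.
    _, R = np.linalg.qr(S @ J.T)

    def apply_M(z):
        # M = R^{-T} (J J^T) R^{-1}; symmetric positive definite and close to
        # the identity when the sketch is accurate. (R is triangular, so a
        # dedicated triangular solver would be cheaper than a general solve.)
        u = np.linalg.solve(R, z)
        v = J @ (J.T @ u)
        return np.linalg.solve(R.T, v)

    # Solve M z = b with plain conjugate gradient, where b = R^{-T} r.
    b = np.linalg.solve(R.T, r)
    z = np.zeros(n)
    res = b.copy()          # residual b - M z for z = 0
    d = res.copy()          # search direction
    rs = res @ res
    for _ in range(iters):
        if np.sqrt(rs) <= tol * np.linalg.norm(b):
            break
        Md = apply_M(d)
        alpha = rs / (d @ Md)
        z += alpha * d
        res -= alpha * Md
        rs_new = res @ res
        d = res + (rs_new / rs) * d
        rs = rs_new

    # Undo the change of variables: g = R^{-1} z.
    return np.linalg.solve(R, z)

# Usage on synthetic data (sizes chosen only for illustration).
rng = np.random.default_rng(1)
J = rng.standard_normal((20, 500))
r = rng.standard_normal(20)
g = sketch_preconditioned_gram_solve(J, r, sketch_rows=200)
```

Because the preconditioned matrix is well conditioned whenever the sketch approximately preserves the geometry of the column space of Jᵀ, conjugate gradient needs only a few iterations, and each iteration costs a matrix-vector product with J and Jᵀ rather than an explicit inversion of the Gram matrix.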

Subject Classification

ACM Subject Classification
  • Theory of computation → Nonconvex optimization
Keywords
  • Deep learning theory
  • Nonconvex optimization


References

  1. Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148-4187, 2017.
  2. Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast johnson-lindenstrauss transform. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing (STOC), pages 557-563, 2006.
  3. Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in neural information processing systems (NeurIPS), pages 6155-6166, 2019.
  4. Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In ICML, volume 97, pages 242-252. PMLR, 2019. arXiv version: URL: https://arxiv.org/pdf/1811.03962.pdf.
  5. Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. In NeurIPS, pages 6673-6685, 2019. arXiv version: URL: https://arxiv.org/pdf/1810.12065.pdf.
  6. Alexandr Andoni, Chengyu Lin, Ying Sheng, Peilin Zhong, and Ruiqi Zhong. Subspace embedding and linear regression with orlicz norm. In ICML, pages 224-233. PMLR, 2018. arXiv version: URL: https://arXiv.org/pdf/1806.06430.pdf.
  7. Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer. Second order optimization made practical. arXiv preprint, 2020. URL: http://arxiv.org/abs/arXiv:2002.09018.
  8. Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 322-332. PMLR, 2019. arXiv version: URL: https://arxiv.org/pdf/1901.08584.pdf.
  9. Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems (NeurIPS), pages 8139-8148, 2019. arXiv version: URL: https://arxiv.org/pdf/1904.11955.pdf.
  10. Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In ICML, volume 70, pages 253-262. PMLR, 2017. arXiv version: URL: https://arxiv.org/pdf/1804.09893.pdf.
  11. Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. A universal sampling method for reconstructing signals with simple fourier transforms. In STOC. ACM, 2019. arXiv version: https://arxiv.org/pdf/1812.08723.pdf. URL: https://doi.org/10.1145/3313276.3316363.
  12. Ainesh Bakshi, Nadiia Chepurko, and David P Woodruff. Robust and sample optimal algorithms for psd low-rank approximation. arXiv preprint, 2019. URL: http://arxiv.org/abs/1912.04177.
  13. Ainesh Bakshi, Rajesh Jayaram, and David P Woodruff. Learning two layer rectified neural networks in polynomial time. In COLT, volume 99, pages 195-268. PMLR, 2019. arXiv version: URL: https://arxiv.org/pdf/1811.01885.pdf.
  14. Ainesh Bakshi and David Woodruff. Sublinear time low-rank approximation of distance matrices. In Advances in Neural Information Processing Systems (NeurIPS), pages 3782-3792, 2018. arXiv version: URL: https://arxiv.org/pdf/1809.06986.pdf.
  15. Frank Ban, David P. Woodruff, and Richard Zhang. Regularized weighted low rank approximation. In NeurIPS, pages 4061-4071, 2020. arXiv version: URL: https://arxiv.org/pdf/1911.06958.pdf.
  16. Sue Becker and Yann Le Cun. Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 connectionist models summer school, pages 29-37, 1988.
  17. Alberto Bernacchia, Máté Lengyel, and Guillaume Hennequin. Exact natural gradient in deep linear networks and its application to the nonlinear case. In Advances in Neural Information Processing Systems (NIPS), pages 5941-5950, 2018.
  18. Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical gauss-newton optimisation for deep learning. In International Conference on Machine Learning (ICML), pages 557-565, 2017.
  19. Christos Boutsidis and David P Woodruff. Optimal cur matrix decompositions. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing (STOC), pages 353-362. ACM, 2014. arXiv version: URL: https://arxiv.org/pdf/1405.7910.pdf.
  20. Christos Boutsidis, David P Woodruff, and Peilin Zhong. Optimal principal component analysis in distributed and streaming models. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing (STOC), pages 236-249, 2016.
  21. Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231-357, 2015.
  22. Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, and Dan Mikulincer. Network size and weights size for memorization with two-layers neural networks. arXiv preprint, 2020. URL: http://arxiv.org/abs/2006.02855.
  23. Tianle Cai, Ruiqi Gao, Jikai Hou, Siyu Chen, Dong Wang, Di He, Zhihua Zhang, and Liwei Wang. A gram-gauss-newton method learning overparameterized deep neural networks for regression problems. arXiv preprint, 2019. URL: http://arxiv.org/abs/1905.11675.
  24. Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing Conference (STOC), pages 81-90. ACM, 2013. arXiv version: URL: https://arxiv.org/pdf/1207.6365.pdf.
  25. Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In STOC, pages 938-942. ACM, 2019. arXiv version: URL: https://arxiv.org/pdf/1810.07896.pdf.
  26. Henry Cohn, Robert Kleinberg, Balazs Szegedy, and Christopher Umans. Group-theoretic algorithms for matrix multiplication. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 379-388. IEEE, 2005.
  27. Amit Daniely. Memorizing gaussians with no over-parameterizaion via gradient decent on neural networks. arXiv preprint, 2020. URL: http://arxiv.org/abs/2003.12895.
  28. Huaian Diao, Rajesh Jayaram, Zhao Song, Wen Sun, and David Woodruff. Optimal sketching for kronecker product regression and low rank approximation. In Advances in Neural Information Processing Systems (NeurIPS), pages 4739-4750, 2019. arXiv version: URL: https://arxiv.org/pdf/1909.13384.pdf.
  29. Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13(Dec):3475-3506, 2012.
  30. Petros Drineas, Michael W Mahoney, and Shan Muthukrishnan. Sampling algorithms for l2 regression and applications. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 1127-1136. Society for Industrial and Applied Mathematics, 2006.
  31. Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In ICLR. OpenReview.net, 2019. arXiv version: URL: https://arxiv.org/pdf/1810.02054.pdf.
  32. John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research (JMLR), 12(Jul):2121-2159, 2011.
  33. François Le Gall and Florent Urrutia. Improved rectangular matrix multiplication using powers of the coppersmith-winograd tensor. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1029-1046. SIAM, 2018. arXiv version: URL: https://arxiv.org/pdf/1708.05622.pdf.
  34. Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. Fast approximate natural gradient descent in a kronecker factored eigenbasis. In Advances in Neural Information Processing Systems (NIPS), pages 9550-9560, 2018.
  35. Roger Grosse and James Martens. A kronecker-factored approximate fisher matrix for convolution layers. In International Conference on Machine Learning (ICML), pages 573-582, 2016.
  36. Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning (ICML), pages 1842-1850, 2018.
  37. Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In ICML, volume 48, pages 1225-1234. JMLR.org, 2016. arXiv version: URL: https://arxiv.org/pdf/1509.01240.pdf.
  38. Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems (NIPS), pages 8571-8580, 2018.
  39. Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. In ICLR. OpenReview.net, 2020. arXiv version: URL: https://arxiv.org/pdf/1909.12292.pdf.
  40. Haotian Jiang, Tarun Kathuria, Yin Tat Lee, Swati Padmanabhan, and Zhao Song. A faster interior point method for semidefinite programming. Manuscript, 2020.
  41. Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. An improved cutting plane method for convex optimization, convex-concave games and its applications. In STOC, pages 944-953. ACM, 2020. arXiv version: URL: https://arxiv.org/pdf/2004.04250.pdf.
  42. Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. Faster dynamic matrix inverse for faster lps. arXiv preprint, 2020. URL: http://arxiv.org/abs/2004.07470.
  43. Jonathan A Kelner, Lorenzo Orecchia, Aaron Sidford, and Zeyuan Allen Zhu. A simple, combinatorial algorithm for solving sdd systems in nearly-linear time. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing (STOC), pages 911-920. ACM, 2013. arXiv version: URL: https://arxiv.org/pdf/1301.6628.pdf.
  44. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. arXiv version: URL: https://arxiv.org/pdf/1412.6980.pdf.
  45. Kasper Green Larsen and Jelani Nelson. Optimality of the johnson-lindenstrauss lemma. In IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 633-638. IEEE, 2017. arXiv version: URL: https://arxiv.org/pdf/1609.02094.pdf.
  46. François Le Gall. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th international symposium on symbolic and algebraic computation (ISSAC), pages 296-303. ACM, 2014.
  47. Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization in the current matrix multiplication time. In COLT, 2019. arXiv version: URL: https://arxiv.org/pdf/1905.04447.
  48. Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In NeurIPS, pages 8168-8177, 2018. arXiv version: URL: https://arxiv.org/pdf/1808.01204.pdf.
  49. Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in neural information processing systems (NIPS), pages 597-607, 2017. arXiv version: URL: https://arxiv.org/pdf/1705.09886.pdf.
  50. Hang Liao, Barak A. Pearlmutter, Vamsi K. Potluru, and David P. Woodruff. Automatic differentiation of sketched regression. In AISTATS, 2020.
  51. Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint, 2019. URL: http://arxiv.org/abs/1908.03265.
  52. Yichao Lu, Paramveer Dhillon, Dean P Foster, and Lyle Ungar. Faster ridge regression via the subsampled randomized hadamard transform. In Advances in neural information processing systems, pages 369-377, 2013.
  53. James Martens. Deep learning via hessian-free optimization. In ICML, volume 27, pages 735-742, 2010.
  54. James Martens, Jimmy Ba, and Matthew Johnson. Kronecker-factored curvature approximations for recurrent neural networks. In ICLR, 2018.
  55. James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning (ICML), pages 2408-2417, 2015.
  56. Philipp Moritz, Robert Nishihara, and Michael Jordan. A linearly-convergent stochastic l-bfgs algorithm. In Artificial Intelligence and Statistics, pages 249-258, 2016.
  57. Jelani Nelson and Huy L Nguyên. Osnap: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 117-126. IEEE, 2013. arXiv version: URL: https://arxiv.org/pdf/1211.1002.pdf.
  58. Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 2019.
  59. Mert Pilanci and Martin J Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205-245, 2017.
  60. Eric Price, Zhao Song, and David P. Woodruff. Fast regression with an 𝓁_∞ guarantee. In International Colloquium on Automata, Languages, and Programming (ICALP), volume 80 of LIPIcs, pages 59:1-59:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017. arXiv version: https://arxiv.org/pdf/1705.10723.pdf. URL: https://doi.org/10.4230/LIPIcs.ICALP.2017.59.
  61. Ilya Razenshteyn, Zhao Song, and David P Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the 48th Annual Symposium on the Theory of Computing (STOC), 2016.
  62. Vladimir Rokhlin and Mark Tygert. A fast randomized algorithm for overdetermined linear least-squares regression. Proceedings of the National Academy of Sciences, 105(36):13212-13217, 2008.
  63. Tamás Sarlós. Improved approximation algorithms for large matrices via random projections. In Proceedings of 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2006.
  64. Zhao Song. Matrix Theory: Optimization, Concentration and Algorithms. PhD thesis, The University of Texas at Austin, 2019.
  65. Zhao Song, Ruosong Wang, Lin Yang, Hongyang Zhang, and Peilin Zhong. Efficient symmetric norm regression via linear sketching. In Advances in Neural Information Processing Systems (NeurIPS), pages 828-838, 2019. arXiv version: URL: https://arxiv.org/pdf/1910.01788.pdf.
  66. Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2772-2789. SIAM, 2019. arXiv version: URL: https://arxiv.org/pdf/1704.08246.pdf.
  67. Zhao Song and Xin Yang. Quadratic suffices for over-parametrization via matrix chernoff bound. arXiv preprint, 2019. URL: http://arxiv.org/abs/1906.03593.
  68. Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing (STOC), pages 81-90. ACM, 2004.
  69. Joel A Tropp. Improved analysis of the subsampled randomized hadamard transform. Advances in Adaptive Data Analysis, 3(01n02):115-126, 2011.
  70. Pravin M Vaidya. A new algorithm for minimizing convex functions over convex sets. In 30th Annual Symposium on Foundations of Computer Science (FOCS), pages 338-343. IEEE, 1989.
  71. Pravin M Vaidya. Speeding-up linear programming using fast matrix multiplication. In 30th Annual Symposium on Foundations of Computer Science (FOCS), pages 332-337. IEEE, 1989.
  72. Ruosong Wang and David P Woodruff. Tight bounds for 𝓁_p oblivious subspace embeddings. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1825-1843. SIAM, 2019. arXiv version: URL: https://arxiv.org/pdf/1801.04414.pdf.
  73. Virginia Vassilevska Williams. Multiplying matrices faster than coppersmith-winograd. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing (STOC), pages 887-898. ACM, 2012.
  74. David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1-157, 2014.
  75. David P. Woodruff and Amir Zandieh. Near input sparsity time kernel embeddings via adaptive sampling. In ICML, 2020.
  76. David P Woodruff and Peilin Zhong. Distributed low rank approximation of implicit functions of a matrix. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 847-858. IEEE, 2016.
  77. Xiaoxia Wu, Simon S Du, and Rachel Ward. Global convergence of adaptive gradient methods for an over-parameterized neural network. arXiv preprint, 2019. URL: http://arxiv.org/abs/1902.07111.
  78. Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in neural information processing systems (NIPS), pages 5279-5288, 2017.
  79. Guodong Zhang, James Martens, and Roger B Grosse. Fast convergence of natural gradient descent for over-parameterized neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 8080-8091, 2019.
  80. Kai Zhong, Zhao Song, and Inderjit S Dhillon. Learning non-overlapping convolutional neural networks with multiple kernels. arXiv preprint, 2017. URL: http://arxiv.org/abs/1711.03440.
  81. Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In ICML, volume 70, pages 4140-4149. PMLR, 2017. arXiv version: URL: https://arxiv.org/pdf/1706.03175.pdf.