5 Search Results for "van den Brand, Jan"


Document
Training Multi-Layer Over-Parametrized Neural Network in Subquadratic Time

Authors: Zhao Song, Lichen Zhang, and Ruizhe Zhang

Published in: LIPIcs, Volume 287, 15th Innovations in Theoretical Computer Science Conference (ITCS 2024)


Abstract
We consider the problem of training a multi-layer over-parametrized neural network to minimize the empirical risk induced by a loss function. In the typical setting of over-parametrization, the network width m is much larger than the data dimension d and the number of training samples n (m = poly(n,d)), which induces a prohibitively large weight matrix W ∈ ℝ^{m×m} per layer. Naively, one has to pay O(m²) time to read the weight matrix and evaluate the neural network function in both forward and backward computation. In this work, we show how to reduce the training cost per iteration. Specifically, we propose a framework that incurs the m² cost only in the initialization phase and achieves a truly subquadratic cost per iteration in terms of m, i.e., m^{2-Ω(1)} per iteration. Our result has implications beyond standard over-parametrization theory, as it can be viewed as designing an efficient data structure on top of a pre-trained large model to further speed up the fine-tuning process, a core procedure for deploying large language models (LLMs).
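
To make the cost accounting concrete, the following Python fragment is a minimal toy sketch (an assumed setup, not the paper's actual data structure): the dense m × m weight matrix is formed only once at initialization, after which the layer is kept implicitly as W0 + U V^T and a training step touches only the low-rank factors, costing O(mk) instead of O(m²). The rank k and the frozen-base-plus-low-rank layout are illustrative assumptions; the paper's truly subquadratic forward and backward passes additionally rely on sparsity-aware data structures that are not shown here.

import numpy as np

# Toy sketch (not the paper's data structure): pay the O(m^2) cost once at
# initialization, then keep the weight matrix implicitly as W0 + U @ V.T and
# only ever update the low-rank factors U, V per training iteration.
rng = np.random.default_rng(0)
m, k = 1024, 8                                   # width m; assumed update rank k

W0 = rng.standard_normal((m, m)) / np.sqrt(m)    # one-time O(m^2) initialization
U = np.zeros((m, k))
V = np.zeros((m, k))

def layer(x):
    # Evaluate (W0 + U V^T) x without forming the updated matrix explicitly.
    # (W0 @ x is still dense here; the paper also sparsifies this evaluation.)
    return W0 @ x + U @ (V.T @ x)

def step(grad_u, grad_v, lr=1e-2):
    # One training iteration only touches the O(mk) factors, never W0.
    global U, V
    U -= lr * grad_u
    V -= lr * grad_v

y = layer(rng.standard_normal(m))
step(rng.standard_normal((m, k)), rng.standard_normal((m, k)))
print(y.shape)                                   # (1024,)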

Cite as

Zhao Song, Lichen Zhang, and Ruizhe Zhang. Training Multi-Layer Over-Parametrized Neural Network in Subquadratic Time. In 15th Innovations in Theoretical Computer Science Conference (ITCS 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 287, pp. 93:1-93:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)



@InProceedings{song_et_al:LIPIcs.ITCS.2024.93,
  author =	{Song, Zhao and Zhang, Lichen and Zhang, Ruizhe},
  title =	{{Training Multi-Layer Over-Parametrized Neural Network in Subquadratic Time}},
  booktitle =	{15th Innovations in Theoretical Computer Science Conference (ITCS 2024)},
  pages =	{93:1--93:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-309-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{287},
  editor =	{Guruswami, Venkatesan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ITCS.2024.93},
  URN =		{urn:nbn:de:0030-drops-196212},
  doi =		{10.4230/LIPIcs.ITCS.2024.93},
  annote =	{Keywords: Deep learning theory, Nonconvex optimization}
}
Document
Comparing Bottom-Up with Top-Down Parsing Architectures for the Syntax Definition Formalism from a Disambiguation Standpoint

Authors: Jurgen J. Vinju

Published in: OASIcs, Volume 109, Eelco Visser Commemorative Symposium (EVCS 2023)


Abstract
Context-free general parsing and disambiguation algorithms are threaded throughout the research and engineering career of Eelco Visser. Both our Ph.D. theses featured the study of "disambiguation." Disambiguation is the declarative definition of choices among different parse trees, derived using the same context-free grammar, for the same input sentence. This essay highlights the differences between syntactic disambiguation for context-free general parsing in a top-down architecture and a bottom-up architecture. The differences between top-down and bottom-up are mainly observed as practical aspects of the software architecture and software implementation. Eventually, the concept of data-dependent context-free grammar brings all engineering perspectives of disambiguation back into a conceptual (declarative) framework independent of the parsing architecture. The novelty in this essay is the juxtaposition of three general parsing architectures from a disambiguation point of view: SGLR, SGLL, and DDGLL. It also motivates design decisions in the parsing architectures for SDF{1,2} and Rascal with previously unpublished detail. The essay falls short of a literature review and a tool evaluation since it does not investigate the disambiguation methods of the many other parser generator tools that exist. The fact that only the implementation algorithms are different between the compared parsing architectures, while the syntax definition formalisms have practically the same formal semantics for historical reasons, nicely "isolates the variable" of interest. We hope this essay lives up to the enormous enthusiasm, curiosity, and drive for perfection in syntax definition and parsing that Eelco always radiated. We dearly miss him.
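
As a concrete illustration of "disambiguation" as a declarative choice among parse trees for the same sentence, the following Python toy (not SDF or Rascal semantics) enumerates both parse trees of 1-2-3 under the ambiguous grammar E ::= E "-" E | num and then applies a left-associativity rule, stated independently of any parsing architecture, to select one of them. Practical generalized parsers apply such rules to a shared parse forest rather than to an explicitly enumerated list of trees; the enumeration here is purely for clarity.

from dataclasses import dataclass

# Toy grammar E ::= E "-" E | num and a declarative left-associativity filter.

@dataclass(frozen=True)
class Node:
    op: str                  # "-" for the binary rule, "num" for a leaf
    kids: tuple = ()
    text: str = ""

def parses(tokens):
    # Enumerate every parse tree of a flat token list like ['1', '-', '2', '-', '3'].
    if len(tokens) == 1:
        return [Node("num", text=tokens[0])]
    trees = []
    for i in range(1, len(tokens), 2):            # each '-' can be the topmost operator
        for left in parses(tokens[:i]):
            for right in parses(tokens[i + 1:]):
                trees.append(Node("-", (left, right)))
    return trees

def left_assoc(tree):
    # Disambiguation rule: reject any '-' node whose right child is another '-'.
    if tree.op == "num":
        return True
    left, right = tree.kids
    return right.op != "-" and left_assoc(left) and left_assoc(right)

forest = parses(["1", "-", "2", "-", "3"])
print(len(forest), "trees before filtering")      # 2
print([t for t in forest if left_assoc(t)])       # only the left-associative tree remains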

Cite as

Jurgen J. Vinju. Comparing Bottom-Up with Top-Down Parsing Architectures for the Syntax Definition Formalism from a Disambiguation Standpoint. In Eelco Visser Commemorative Symposium (EVCS 2023). Open Access Series in Informatics (OASIcs), Volume 109, pp. 31:1-31:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)



@InProceedings{vinju:OASIcs.EVCS.2023.31,
  author =	{Vinju, Jurgen J.},
  title =	{{Comparing Bottom-Up with Top-Down Parsing Architectures for the Syntax Definition Formalism from a Disambiguation Standpoint}},
  booktitle =	{Eelco Visser Commemorative Symposium (EVCS 2023)},
  pages =	{31:1--31:16},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-267-9},
  ISSN =	{2190-6807},
  year =	{2023},
  volume =	{109},
  editor =	{L\"{a}mmel, Ralf and Mosses, Peter D. and Steimann, Friedrich},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.EVCS.2023.31},
  URN =		{urn:nbn:de:0030-drops-178018},
  doi =		{10.4230/OASIcs.EVCS.2023.31},
  annote =	{Keywords: parser generation, context-free grammars, GLR, GLL, algorithms, disambiguation}
}
Document
Track A: Algorithms, Complexity and Games
Fully-Dynamic Graph Sparsifiers Against an Adaptive Adversary

Authors: Aaron Bernstein, Jan van den Brand, Maximilian Probst Gutenberg, Danupon Nanongkai, Thatchaphol Saranurak, Aaron Sidford, and He Sun

Published in: LIPIcs, Volume 229, 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022)


Abstract
Designing efficient dynamic graph algorithms against an adaptive adversary is a major goal in the field of dynamic graph algorithms and has witnessed many exciting recent developments in, e.g., dynamic matching (Wajc STOC'20) and decremental shortest paths (Chuzhoy and Khanna STOC'19). Compared to other graph primitives (e.g. spanning trees and matchings), designing such algorithms for graph spanners and (more broadly) graph sparsifiers poses a unique challenge, since no fast deterministic algorithm is known for the static computation and there is no known way to adjust the output slowly (known as "small recourse/replacements"). This paper presents the first non-trivial efficient adaptive algorithms for maintaining many sparsifiers against an adaptive adversary. Specifically, we present algorithms that maintain 1) a polylog(n)-spanner of size Õ(n) in polylog(n) amortized update time, 2) an O(k)-approximate cut sparsifier of size Õ(n) in Õ(n^{1/k}) amortized update time, and 3) a polylog(n)-approximate spectral sparsifier in polylog(n) amortized update time. Our bounds are the first non-trivial ones even when only the recourse is concerned. Our results hold even against a stronger adversary, who can access the random bits previously used by the algorithms, and the amortized update time of all algorithms can be made worst-case by paying sub-polynomial factors. Our spanner result resolves an open question by Ahmed et al. (2019), and our results and techniques imply additional improvements over existing results, including (i) answering open questions about decremental single-source shortest paths by Chuzhoy and Khanna (STOC'19) and Gutenberg and Wulff-Nilsen (SODA'20), implying a nearly-quadratic time algorithm for approximating minimum-cost unit-capacity flow, and (ii) de-amortizing a result of Abraham et al. (FOCS'16) for dynamic spectral sparsifiers. Our results are based on two novel techniques. The first technique is a generic black-box reduction that allows us to assume that the graph is initially an expander with almost uniform degree and, more importantly, stays an almost uniform-degree expander while undergoing only edge deletions. The second technique is called proactive resampling: here we constantly re-sample parts of the input graph so that, independent of an adversary's computational power, a desired structure of the underlying graph can always be maintained. Despite its simplicity, the analysis of this sampling scheme is far from trivial, because the adversary can potentially create dependencies between the random choices used by the algorithm. We believe these two techniques could be useful for developing other adaptive algorithms.
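
The proactive resampling idea can be hinted at with a heavily simplified, hypothetical Python sketch (this is not the paper's spanner or sparsifier algorithm, and the class name, parameters, and refresh policy are illustrative assumptions): on every adversarial deletion, besides removing the deleted edge from the maintained random subset, a few unrelated slots are also re-sampled with fresh randomness, so an adversary who has observed earlier random choices cannot gradually bias what remains.

import random

# Hypothetical toy sketch of proactive resampling (not the paper's algorithm):
# maintain a random subset of a dynamic edge set and, on every deletion, also
# re-sample a few unrelated slots with fresh randomness.

class ProactiveSample:
    def __init__(self, edges, size, refresh=2, seed=0):
        self.rng = random.Random(seed)
        self.edges = set(edges)
        self.size = size
        self.refresh = refresh
        self.sample = set(self.rng.sample(sorted(self.edges), size))

    def _refill(self):
        # Top the sample back up from edges that are not currently sampled.
        pool = sorted(self.edges - self.sample)
        while len(self.sample) < self.size and pool:
            self.sample.add(pool.pop(self.rng.randrange(len(pool))))

    def delete(self, edge):
        self.edges.discard(edge)
        self.sample.discard(edge)
        # Proactive step: also evict a few unrelated sampled edges, so randomness
        # already observed by an adversary loses its value over time.
        victims = self.rng.sample(sorted(self.sample), min(self.refresh, len(self.sample)))
        self.sample -= set(victims)
        self._refill()

edges = [(i, j) for i in range(10) for j in range(i + 1, 10)]   # toy complete graph K_10
s = ProactiveSample(edges, size=12)
s.delete((0, 1))
print(len(s.sample), "edges currently sampled")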

Cite as

Aaron Bernstein, Jan van den Brand, Maximilian Probst Gutenberg, Danupon Nanongkai, Thatchaphol Saranurak, Aaron Sidford, and He Sun. Fully-Dynamic Graph Sparsifiers Against an Adaptive Adversary. In 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 229, pp. 20:1-20:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)



@InProceedings{bernstein_et_al:LIPIcs.ICALP.2022.20,
  author =	{Bernstein, Aaron and van den Brand, Jan and Probst Gutenberg, Maximilian and Nanongkai, Danupon and Saranurak, Thatchaphol and Sidford, Aaron and Sun, He},
  title =	{{Fully-Dynamic Graph Sparsifiers Against an Adaptive Adversary}},
  booktitle =	{49th International Colloquium on Automata, Languages, and Programming (ICALP 2022)},
  pages =	{20:1--20:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-235-8},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{229},
  editor =	{Boja\'{n}czyk, Miko{\l}aj and Merelli, Emanuela and Woodruff, David P.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2022.20},
  URN =		{urn:nbn:de:0030-drops-163611},
  doi =		{10.4230/LIPIcs.ICALP.2022.20},
  annote =	{Keywords: dynamic graph algorithm, adaptive adversary, spanner, sparsifier}
}
Document
Training (Overparametrized) Neural Networks in Near-Linear Time

Authors: Jan van den Brand, Binghui Peng, Zhao Song, and Omri Weinstein

Published in: LIPIcs, Volume 185, 12th Innovations in Theoretical Computer Science Conference (ITCS 2021)


Abstract
The slow convergence rate and pathological curvature issues of first-order gradient methods for training deep neural networks initiated an ongoing effort to develop faster second-order optimization algorithms beyond SGD, without compromising the generalization error. Despite their remarkable convergence rate (independent of the training batch size n), second-order algorithms incur a daunting slowdown in the cost per iteration (inverting the Hessian matrix of the loss function), which renders them impractical. Very recently, this computational overhead was mitigated by the works of [Zhang et al., 2019; Cai et al., 2019], yielding an O(mn²)-time second-order algorithm for training two-layer overparametrized neural networks of polynomial width m. We show how to speed up the algorithm of [Cai et al., 2019], achieving an Õ(mn)-time backpropagation algorithm for training (mildly overparametrized) ReLU networks, which is near-linear in the dimension (mn) of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to reformulate the Gauss-Newton iteration as an 𝓁₂-regression problem, and then use a Fast-JL type dimension reduction to precondition the underlying Gram matrix in time independent of m, allowing us to find a sufficiently good approximate solution via first-order conjugate gradient. Our result provides a proof-of-concept that advanced machinery from randomized linear algebra - which led to recent breakthroughs in convex optimization (ERM, LPs, Regression) - can be carried over to the realm of deep learning as well.
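
The core computational step described above (reformulating the Gauss-Newton iteration as an ℓ₂-regression, preconditioning it with a randomized sketch, and solving with conjugate gradient) can be illustrated numerically. The following Python/NumPy fragment is an assumed toy setup: a dense Gaussian sketch stands in for a Fast-JL transform, and the "Jacobian" is a synthetic ill-conditioned matrix rather than one coming from a neural network. Because J R^{-1} is well-conditioned with high probability, conjugate gradient needs only a few iterations regardless of how ill-conditioned J itself is; a Fast-JL transform would achieve the same preconditioning effect while being much cheaper to apply than the dense sketch used here.

import numpy as np

# Toy sketch-and-precondition l2-regression: min_x ||J x - r||_2.
rng = np.random.default_rng(0)
n, p = 2000, 100                                     # tall system: n rows, p unknowns
J = rng.standard_normal((n, p)) * rng.uniform(0.1, 10.0, p)   # ill-conditioned columns
r = rng.standard_normal(n)

# Build a preconditioner from a small sketch: QR of S @ J, with S a Gaussian
# sketch standing in for a Fast-JL transform.
s = 4 * p
S = rng.standard_normal((s, n)) / np.sqrt(s)
_, R = np.linalg.qr(S @ J)                           # R captures J's column geometry

def cg(apply_A, b, iters=50, tol=1e-10):
    # Plain conjugate gradient for a symmetric positive definite system A y = b.
    y = np.zeros_like(b)
    res = b - apply_A(y)
    d = res.copy()
    for _ in range(iters):
        Ad = apply_A(d)
        alpha = (res @ res) / (d @ Ad)
        y += alpha * d
        res_new = res - alpha * Ad
        if np.linalg.norm(res_new) < tol:
            break
        d = res_new + ((res_new @ res_new) / (res @ res)) * d
        res = res_new
    return y

# Solve the preconditioned normal equations (R^-T J^T J R^-1) y = R^-T J^T r,
# then recover x = R^-1 y.
Rinv = np.linalg.inv(R)
apply_A = lambda y: Rinv.T @ (J.T @ (J @ (Rinv @ y)))
y = cg(apply_A, Rinv.T @ (J.T @ r))
x = Rinv @ y
print(np.linalg.norm(J.T @ (J @ x - r)))             # near-zero normal-equation residual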

Cite as

Jan van den Brand, Binghui Peng, Zhao Song, and Omri Weinstein. Training (Overparametrized) Neural Networks in Near-Linear Time. In 12th Innovations in Theoretical Computer Science Conference (ITCS 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 185, pp. 63:1-63:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)



@InProceedings{vandenbrand_et_al:LIPIcs.ITCS.2021.63,
  author =	{van den Brand, Jan and Peng, Binghui and Song, Zhao and Weinstein, Omri},
  title =	{{Training (Overparametrized) Neural Networks in Near-Linear Time}},
  booktitle =	{12th Innovations in Theoretical Computer Science Conference (ITCS 2021)},
  pages =	{63:1--63:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-177-1},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{185},
  editor =	{Lee, James R.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ITCS.2021.63},
  URN =		{urn:nbn:de:0030-drops-136025},
  doi =		{10.4230/LIPIcs.ITCS.2021.63},
  annote =	{Keywords: Deep learning theory, Nonconvex optimization}
}
Document
Invited Talk
Convex Optimization and Dynamic Data Structure (Invited Talk)

Authors: Yin Tat Lee

Published in: LIPIcs, Volume 182, 40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2020)


Abstract
In the last three years, there have been many breakthroughs in optimization, such as nearly quadratic time algorithms for bipartite matching and linear programming algorithms that are as fast as solving Ax = b. All of these algorithms are based on a careful combination of optimization techniques and dynamic data structures. In this talk, we will explain the framework underlying all the recent breakthroughs. Joint work with Jan van den Brand, Michael B. Cohen, Sally Dong, Haotian Jiang, Tarun Kathuria, Danupon Nanongkai, Swati Padmanabhan, Richard Peng, Thatchaphol Saranurak, Aaron Sidford, Zhao Song, Di Wang, Sam Chiu-wai Wong, Guanghao Ye, Qiuyi Zhang.

Cite as

Yin Tat Lee. Convex Optimization and Dynamic Data Structure (Invited Talk). In 40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 182, p. 3:1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)



@InProceedings{lee:LIPIcs.FSTTCS.2020.3,
  author =	{Lee, Yin Tat},
  title =	{{Convex Optimization and Dynamic Data Structure}},
  booktitle =	{40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2020)},
  pages =	{3:1--3:1},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-174-0},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{182},
  editor =	{Saxena, Nitin and Simon, Sunil},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.FSTTCS.2020.3},
  URN =		{urn:nbn:de:0030-drops-132440},
  doi =		{10.4230/LIPIcs.FSTTCS.2020.3},
  annote =	{Keywords: Convex Optimization, Dynamic Data Structure}
}
  • Refine by Author
  • 2 Song, Zhao
  • 2 van den Brand, Jan
  • 1 Bernstein, Aaron
  • 1 Lee, Yin Tat
  • 1 Nanongkai, Danupon

  • Refine by Classification
  • 2 Theory of computation → Nonconvex optimization
  • 1 Mathematics of computing → Mathematical optimization
  • 1 Software and its engineering → Syntax
  • 1 Theory of computation → Dynamic graph algorithms
  • 1 Theory of computation → Machine learning theory

  • Refine by Keyword
  • 2 Deep learning theory
  • 2 Nonconvex optimization
  • 1 Convex Optimization
  • 1 Dynamic Data Structure
  • 1 GLL

  • Refine by Type
  • 5 document

  • Refine by Publication Year
  • 1 2020
  • 1 2021
  • 1 2022
  • 1 2023
  • 1 2024
