Tight Hardness Results for Training Depth-2 ReLU Networks

Authors: Surbhi Goel, Adam Klivans, Pasin Manurangsi, and Daniel Reichman




File

LIPIcs.ITCS.2021.22.pdf
  • Filesize: 0.54 MB
  • 14 pages

Document Identifiers
  • DOI: 10.4230/LIPIcs.ITCS.2021.22

Author Details

Surbhi Goel
  • Microsoft Research, New York, NY, USA
Adam Klivans
  • The University of Texas at Austin, TX, USA
Pasin Manurangsi
  • Google Research, Mountain View, CA, USA
Daniel Reichman
  • Worcester Polytechnic Institute, MA, USA

Acknowledgements

We would like to thank Amit Daniely, Amir Globerson, Meena Jagadeesan and Cameron Musco for interesting discussions.

Cite As

Surbhi Goel, Adam Klivans, Pasin Manurangsi, and Daniel Reichman. Tight Hardness Results for Training Depth-2 ReLU Networks. In 12th Innovations in Theoretical Computer Science Conference (ITCS 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 185, pp. 22:1-22:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.ITCS.2021.22

Abstract

We prove several hardness results for training depth-2 neural networks with the ReLU activation function; these networks are simply weighted sums (that may include negative coefficients) of ReLUs. Our goal is to output a depth-2 neural network that minimizes the square loss with respect to a given training set. We prove that this problem is NP-hard already for a network with a single ReLU. We also prove NP-hardness of outputting a weighted sum of k ReLUs (for k > 1) that minimizes the squared error, even in the realizable setting (i.e., when the labels are consistent with an unknown depth-2 ReLU network). We also obtain lower bounds on the running time in terms of the desired additive error ε. To obtain our lower bounds, we use the Gap Exponential Time Hypothesis (Gap-ETH) as well as a new hypothesis regarding the hardness of approximating the well-known Densest k-Subgraph problem in subexponential time (these hypotheses are used separately in proving different lower bounds). For example, we prove that under reasonable hardness assumptions, any proper learning algorithm for finding the best-fitting ReLU must run in time exponential in 1/ε². Together with previous work on improperly learning a ReLU [Surbhi Goel et al., 2017], this implies the first separation between proper and improper algorithms for learning a ReLU. We also study the problem of properly learning a depth-2 network of ReLUs with bounded weights, giving new (worst-case) upper bounds on the running time needed to learn such networks in both the realizable and agnostic settings. Our upper bounds on the running time essentially match our lower bounds in terms of the dependency on ε.
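
To make the training objective concrete, here is a minimal sketch (our illustration, not code from the paper; names such as square_loss are hypothetical) that evaluates the empirical square loss of a depth-2 network f(x) = Σ_i a_i · ReLU(⟨w_i, x⟩), i.e., a weighted sum of k ReLUs whose output coefficients may be negative, on a training set:

    import numpy as np

    def relu(z):
        # ReLU activation: max(0, z), applied elementwise.
        return np.maximum(z, 0.0)

    def square_loss(X, y, W, a):
        # Empirical square loss of f(x) = sum_i a[i] * relu(<W[i], x>)
        # on the training set (X, y).
        #   X: (m, n) array of m examples in R^n
        #   y: (m,) array of labels
        #   W: (k, n) array of hidden-unit weight vectors
        #   a: (k,) array of output weights (may be negative)
        preds = relu(X @ W.T) @ a          # predictions, shape (m,)
        return float(np.sum((preds - y) ** 2))

    # Tiny usage example with k = 2 ReLUs in n = 3 dimensions.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))
    W = rng.normal(size=(2, 3))
    a = np.array([1.0, -0.5])
    y = relu(X @ W.T) @ a                  # realizable labels, consistent with the network
    print(square_loss(X, y, W, a))         # 0.0 (up to floating-point error)

In the realizable setting the minimum of this objective is zero, as in the toy example above; the hardness results concern finding (approximately) optimal W and a, not evaluating the loss.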

Subject Classification

ACM Subject Classification
  • Theory of computation → Machine learning theory
  • Theory of computation → Problems, reductions and completeness
Keywords
  • ReLU
  • Learning Algorithm
  • Running Time Lower Bound


References

  1. Sarah R. Allen, Ryan O'Donnell, and David Witmer. How to refute a random CSP. In FOCS, pages 689-708, 2015. URL: https://doi.org/10.1109/FOCS.2015.48.
  2. Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, pages 6155-6166, 2019.
  3. Benny Applebaum, Boaz Barak, and David Xiao. On basing lower-bounds for learning on worst-case assumptions. In FOCS, pages 211-220, 2008. URL: https://doi.org/10.1109/FOCS.2008.35.
  4. Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations, 2018. URL: https://openreview.net/forum?id=B1J_rgWRW.
  5. Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629-681, 2017.
  6. Ainesh Bakshi, Rajesh Jayaram, and David P. Woodruff. Learning two layer rectified neural networks in polynomial time. In Conference on Learning Theory, pages 195-268, 2019.
  7. Aditya Bhaskara, Moses Charikar, Aravindan Vijayaraghavan, Venkatesan Guruswami, and Yuan Zhou. Polynomial integrality gaps for strong SDP relaxations of densest k-subgraph. In SODA, pages 388-405, 2012.
  8. Avrim Blum and Ronald L. Rivest. Training a 3-node neural network is NP-complete. In Advances in Neural Information Processing Systems, pages 494-501, 1989.
  9. Digvijay Boob, Santanu S. Dey, and Guanghui Lan. Complexity of training ReLU neural network. arXiv preprint, 2018. URL: http://arxiv.org/abs/1809.10787.
  10. Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with Gaussian inputs. arXiv preprint, 2017. URL: http://arxiv.org/abs/1702.07966.
  11. Parinya Chalermsook, Marek Cygan, Guy Kortsarz, Bundit Laekhanukit, Pasin Manurangsi, Danupon Nanongkai, and Luca Trevisan. From Gap-ETH to FPT-inapproximability: Clique, dominating set, and more. In FOCS, pages 743-754, 2017. URL: https://doi.org/10.1109/FOCS.2017.74.
  12. Eden Chlamtác, Pasin Manurangsi, Dana Moshkovitz, and Aravindan Vijayaraghavan. Approximation algorithms for label cover and the log-density threshold. In SODA, pages 900-919, 2017. URL: https://doi.org/10.1137/1.9781611974782.57.
  13. George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314, 1989.
  14. Santanu S. Dey, Guanyi Wang, and Yao Xie. An approximation algorithm for training one-node ReLU neural network. arXiv preprint, 2018. URL: http://arxiv.org/abs/1810.03592.
  15. Ilias Diakonikolas, Daniel Kane, and Pasin Manurangsi. Nearly tight bounds for robust proper learning of halfspaces with a margin. In Advances in Neural Information Processing Systems, pages 10473-10484, 2019.
  16. Ilias Diakonikolas, Daniel M. Kane, and Pasin Manurangsi. Nearly tight bounds for robust proper learning of halfspaces with a margin. CoRR, abs/1908.11335, 2019. URL: http://arxiv.org/abs/1908.11335.
  17. Irit Dinur. Mildly exponential reduction from gap 3SAT to polynomial-gap label-cover. Electronic Colloquium on Computational Complexity (ECCC), 23:128, 2016. URL: http://eccc.hpi-web.de/report/2016/128.
  18. Irit Dinur, Prahladh Harsha, and Guy Kindler. Polynomially low error PCPs with polyloglog n queries via modular composition. In STOC, pages 267-276, 2015. URL: https://doi.org/10.1145/2746539.2746630.
  19. Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint, 2018. URL: http://arxiv.org/abs/1811.03804.
  20. Uriel Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4):634-652, 1998. URL: https://doi.org/10.1145/285055.285059.
  21. Surbhi Goel, Varun Kanade, Adam R. Klivans, and Justin Thaler. Reliably learning the ReLU in polynomial time. In COLT, pages 1004-1042, 2017. URL: http://proceedings.mlr.press/v65/goel17a.html.
  22. Surbhi Goel, Sushrut Karmalkar, and Adam Klivans. Time/accuracy tradeoffs for learning a ReLU with respect to Gaussian marginals. In Advances in Neural Information Processing Systems, pages 8582-8591, 2019.
  23. Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.
  24. Stephen Judd. On the complexity of loading shallow neural networks. Journal of Complexity, 4:177-192, 1988.
  25. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
  26. Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855-863, 2014.
  27. Carsten Lund and Mihalis Yannakakis. On the hardness of approximating minimization problems. J. ACM, 41(5):960-981, 1994. URL: https://doi.org/10.1145/185675.306789.
  28. Pasin Manurangsi. On approximating projection games. Master's thesis, Massachusetts Institute of Technology, January 2015.
  29. Pasin Manurangsi and Prasad Raghavendra. A birthday repetition theorem and complexity of approximating dense CSPs. In ICALP, pages 78:1-78:15, 2017. URL: https://doi.org/10.4230/LIPIcs.ICALP.2017.78.
  30. Pasin Manurangsi and Daniel Reichman. The computational complexity of training ReLU(s). CoRR, abs/1810.04207, 2018. URL: http://arxiv.org/abs/1810.04207.
  31. Nimrod Megiddo. On the complexity of polyhedral separability. Discrete & Computational Geometry, 3(4):325-337, 1988.
  32. Xingyuan Pan and Vivek Srikumar. Expressiveness of rectifier networks. In International Conference on Machine Learning, pages 2427-2435, 2016.
  33. Erez Petrank. The hardness of approximation: Gap location. Computational Complexity, 4:133-157, 1994. URL: https://doi.org/10.1007/BF01202286.
  34. Tim Roughgarden. Beyond the worst-case analysis of algorithms, 2020.
  35. Grant Schoenebeck. Linear level Lasserre lower bounds for certain k-CSPs. In FOCS, pages 593-602, 2008. URL: https://doi.org/10.1109/FOCS.2008.74.
  36. Rocco A. Servedio and Li-Yang Tan. What circuit classes can be learned with non-trivial savings? In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017.
  37. Kirill Simonov, Fedor Fomin, Petr Golovach, and Fahad Panolan. Refined complexity of PCA with outliers. In International Conference on Machine Learning, pages 5818-5826, 2019.
  38. Le Song, Santosh Vempala, John Wilmes, and Bo Xie. On the complexity of learning neural networks. In Advances in Neural Information Processing Systems, pages 5514-5522, 2017.
  39. Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems 23, pages 2199-2207. Curran Associates, Inc., 2010. URL: http://papers.nips.cc/paper/3894-smoothness-low-noise-and-fast-rates.
  40. Madhur Tulsiani. CSP gaps and reductions in the Lasserre hierarchy. In STOC, pages 303-312, 2009. URL: https://doi.org/10.1145/1536414.1536457.
  41. Santosh Vempala and John Wilmes. Polynomial convergence of gradient descent for training one-hidden-layer neural networks. In Conference on Learning Theory, pages 3115-3117, 2019.
  42. Van H. Vu. On the infeasibility of training neural networks with small mean-squared error. IEEE Transactions on Information Theory, 44(7):2892-2900, 1998.
  43. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint, 2016. URL: http://arxiv.org/abs/1611.03530.