Tight Hardness Results for Training Depth-2 ReLU Networks

Authors Surbhi Goel, Adam Klivans, Pasin Manurangsi, Daniel Reichman

Thumbnail PDF


  • Filesize: 0.54 MB
  • 14 pages

Document Identifiers

Author Details

Surbhi Goel
  • Microsoft Research, New York, NY, USA
Adam Klivans
  • The University of Texas at Austin, TX, USA
Pasin Manurangsi
  • Google Research, Mountain View, CA, USA
Daniel Reichman
  • Worcester Polytechnic Institute, MA, USA


We would like to thank Amit Daniely, Amir Globerson, Meena Jagadeesan and Cameron Musco for interesting discussions.

Cite AsGet BibTex

Surbhi Goel, Adam Klivans, Pasin Manurangsi, and Daniel Reichman. Tight Hardness Results for Training Depth-2 ReLU Networks. In 12th Innovations in Theoretical Computer Science Conference (ITCS 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 185, pp. 22:1-22:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


We prove several hardness results for training depth-2 neural networks with the ReLU activation function; these networks are simply weighted sums (that may include negative coefficients) of ReLUs. Our goal is to output a depth-2 neural network that minimizes the square loss with respect to a given training set. We prove that this problem is NP-hard already for a network with a single ReLU. We also prove NP-hardness for outputting a weighted sum of k ReLUs minimizing the squared error (for k > 1) even in the realizable setting (i.e., when the labels are consistent with an unknown depth-2 ReLU network). We are also able to obtain lower bounds on the running time in terms of the desired additive error ε. To obtain our lower bounds, we use the Gap Exponential Time Hypothesis (Gap-ETH) as well as a new hypothesis regarding the hardness of approximating the well known Densest κ-Subgraph problem in subexponential time (these hypotheses are used separately in proving different lower bounds). For example, we prove that under reasonable hardness assumptions, any proper learning algorithm for finding the best fitting ReLU must run in time exponential in 1/ε². Together with a previous work regarding improperly learning a ReLU [Surbhi Goel et al., 2017], this implies the first separation between proper and improper algorithms for learning a ReLU. We also study the problem of properly learning a depth-2 network of ReLUs with bounded weights giving new (worst-case) upper bounds on the running time needed to learn such networks both in the realizable and agnostic settings. Our upper bounds on the running time essentially matches our lower bounds in terms of the dependency on ε.

Subject Classification

ACM Subject Classification
  • Theory of computation → Machine learning theory
  • Theory of computation → Problems, reductions and completeness
  • ReLU
  • Learning Algorithm
  • Running Time Lower Bound


  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    PDF Downloads


  1. Sarah R. Allen, Ryan O'Donnell, and David Witmer. How to refute a random CSP. In FOCS, pages 689-708, 2015. URL: https://doi.org/10.1109/FOCS.2015.48.
  2. Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in neural information processing systems, pages 6155-6166, 2019. Google Scholar
  3. Benny Applebaum, Boaz Barak, and David Xiao. On basing lower-bounds for learning on worst-case assumptions. In FOCS, pages 211-220, 2008. URL: https://doi.org/10.1109/FOCS.2008.35.
  4. Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations, 2018. URL: https://openreview.net/forum?id=B1J_rgWRW.
  5. Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629-681, 2017. Google Scholar
  6. Ainesh Bakshi, Rajesh Jayaram, and David P Woodruff. Learning two layer rectified neural networks in polynomial time. In Conference on Learning Theory, pages 195-268, 2019. Google Scholar
  7. Aditya Bhaskara, Moses Charikar, Aravindan Vijayaraghavan, Venkatesan Guruswami, and Yuan Zhou. Polynomial integrality gaps for strong SDP relaxations of densest k-subgraph. In SODA, pages 388-405, 2012. Google Scholar
  8. Avrim Blum and Ronald L Rivest. Training a 3-node neural network is NP-complete. In Advances in neural information processing systems, pages 494-501, 1989. Google Scholar
  9. Digvijay Boob, Santanu S Dey, and Guanghui Lan. Complexity of training relu neural network. arXiv preprint, 2018. URL: http://arxiv.org/abs/1809.10787.
  10. Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. arXiv preprint, 2017. URL: http://arxiv.org/abs/1702.07966.
  11. Parinya Chalermsook, Marek Cygan, Guy Kortsarz, Bundit Laekhanukit, Pasin Manurangsi, Danupon Nanongkai, and Luca Trevisan. From Gap-ETH to FPT-inapproximability: Clique, dominating set, and more. In FOCS, pages 743-754, 2017. URL: https://doi.org/10.1109/FOCS.2017.74.
  12. Eden Chlamtác, Pasin Manurangsi, Dana Moshkovitz, and Aravindan Vijayaraghavan. Approximation algorithms for label cover and the log-density threshold. In SODA, pages 900-919, 2017. URL: https://doi.org/10.1137/1.9781611974782.57.
  13. George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303-314, 1989. Google Scholar
  14. Santanu S Dey, Guanyi Wang, and Yao Xie. An approximation algorithm for training one-node relu neural network. arXiv preprint, 2018. URL: http://arxiv.org/abs/1810.03592.
  15. Ilias Diakonikolas, Daniel Kane, and Pasin Manurangsi. Nearly tight bounds for robust proper learning of halfspaces with a margin. In Advances in Neural Information Processing Systems, pages 10473-10484, 2019. Google Scholar
  16. Ilias Diakonikolas, Daniel M. Kane, and Pasin Manurangsi. Nearly tight bounds for robust proper learning of halfspaces with a margin. CoRR, abs/1908.11335, 2019. URL: http://arxiv.org/abs/1908.11335.
  17. Irit Dinur. Mildly exponential reduction from gap 3SAT to polynomial-gap label-cover. Electronic Colloquium on Computational Complexity (ECCC), 23:128, 2016. URL: http://eccc.hpi-web.de/report/2016/128.
  18. Irit Dinur, Prahladh Harsha, and Guy Kindler. Polynomially low error PCPs with polyloglog n queries via modular composition. In STOC, pages 267-276, 2015. URL: https://doi.org/10.1145/2746539.2746630.
  19. Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint, 2018. URL: http://arxiv.org/abs/1811.03804.
  20. Uriel Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4):634-652, 1998. URL: https://doi.org/10.1145/285055.285059.
  21. Surbhi Goel, Varun Kanade, Adam R. Klivans, and Justin Thaler. Reliably learning the relu in polynomial time. In COLT, pages 1004-1042, 2017. URL: http://proceedings.mlr.press/v65/goel17a.html.
  22. Surbhi Goel, Sushrut Karmalkar, and Adam Klivans. Time/accuracy tradeoffs for learning a relu with respect to gaussian marginals. In Advances in Neural Information Processing Systems, pages 8582-8591, 2019. Google Scholar
  23. Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359-366, 1989. Google Scholar
  24. Stephen Judd. On the complexity of loading shallow neural networks. Journal of Complexity, 4:177-192, 1988. Google Scholar
  25. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015. Google Scholar
  26. Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855-863, 2014. Google Scholar
  27. Carsten Lund and Mihalis Yannakakis. On the hardness of approximating minimization problems. J. ACM, 41(5):960-981, 1994. URL: https://doi.org/10.1145/185675.306789.
  28. Pasin Manurangsi. On approximating projection games. Master’s thesis, Massachusetts Institute of Technology, January 2015. Google Scholar
  29. Pasin Manurangsi and Prasad Raghavendra. A birthday repetition theorem and complexity of approximating dense csps. In 44th International Colloquium on Automata, Languages, and Programming, ICALP 2017, July 10-14, 2017, Warsaw, Poland, pages 78:1-78:15, 2017. URL: https://doi.org/10.4230/LIPIcs.ICALP.2017.78.
  30. Pasin Manurangsi and Daniel Reichman. The computational complexity of training relu(s). CoRR, abs/1810.04207, 2018. URL: http://arxiv.org/abs/1810.04207.
  31. Nimrod Megiddo. On the complexity of polyhedral separability. Discrete & Computational Geometry, 3(4):325-337, 1988. Google Scholar
  32. Xingyuan Pan and Vivek Srikumar. Expressiveness of rectifier networks. In International Conference on Machine Learning, pages 2427-2435, 2016. Google Scholar
  33. Erez Petrank. The hardness of approximation: Gap location. Computational Complexity, 4:133-157, 1994. URL: https://doi.org/10.1007/BF01202286.
  34. Tim Roughgarden. Beyond the worst-case analysis of algorithms, 2020. Google Scholar
  35. Grant Schoenebeck. Linear level lasserre lower bounds for certain k-CSPs. In FOCS, pages 593-602, 2008. URL: https://doi.org/10.1109/FOCS.2008.74.
  36. Rocco A Servedio and Li-Yang Tan. What circuit classes can be learned with non-trivial savings? In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017. Google Scholar
  37. Kirill Simonov, Fedor Fomin, Petr Golovach, and Fahad Panolan. Refined complexity of PCA with outliers. In International Conference on Machine Learning, pages 5818-5826, 2019. Google Scholar
  38. Le Song, Santosh Vempala, John Wilmes, and Bo Xie. On the complexity of learning neural networks. In Advances in Neural Information Processing Systems, pages 5514-5522, 2017. Google Scholar
  39. Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In John D. Lafferty, Christopher K. I. Williams, John Shawe-Taylor, Richard S. Zemel, and Aron Culotta, editors, Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010, Vancouver, British Columbia, Canada, pages 2199-2207. Curran Associates, Inc., 2010. URL: http://papers.nips.cc/paper/3894-smoothness-low-noise-and-fast-rates.
  40. Madhur Tulsiani. CSP gaps and reductions in the lasserre hierarchy. In STOC, pages 303-312, 2009. URL: https://doi.org/10.1145/1536414.1536457.
  41. Santosh Vempala and John Wilmes. Polynomial convergence of gradient descent for training one-hidden-layer neural networks. In Conference on Learning Theory, pages 3115-3117, 2019. Google Scholar
  42. Van H Vu. On the infeasibility of training neural networks with small mean-squared error. IEEE Transactions on Information Theory, 44(7):2892-2900, 1998. Google Scholar
  43. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint, 2016. URL: http://arxiv.org/abs/1611.03530.