Adaptive Sketches for Robust Regression with Importance Sampling

Authors: Sepideh Mahabadi, David P. Woodruff, Samson Zhou




File
  • LIPIcs.APPROX-RANDOM.2022.31.pdf (0.9 MB, 21 pages)

Document Identifiers
  • DOI: 10.4230/LIPIcs.APPROX/RANDOM.2022.31

Author Details

Sepideh Mahabadi
  • Microsoft Research, Redmond, WA, USA
David P. Woodruff
  • Carnegie Mellon University, Pittsburgh, PA, USA
Samson Zhou
  • Carnegie Mellon University, Pittsburgh, PA, USA

Cite As

Sepideh Mahabadi, David P. Woodruff, and Samson Zhou. Adaptive Sketches for Robust Regression with Importance Sampling. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 245, pp. 31:1-31:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.APPROX/RANDOM.2022.31

Abstract

We introduce data structures for solving robust regression through stochastic gradient descent (SGD) by sampling gradients with probability proportional to their norm, i.e., importance sampling. Although SGD is widely used for large-scale machine learning, it is well known to possibly suffer slow convergence rates due to the high variance of uniform sampling. On the other hand, importance sampling can significantly decrease the variance, but it is usually difficult to implement because computing the sampling probabilities requires additional passes over the data, in which case standard gradient descent (GD) could be used instead. In this paper, we introduce an algorithm that approximately samples T gradients of dimension d from nearly the optimal importance sampling distribution for a robust regression problem over n rows. Thus our algorithm effectively runs T steps of SGD with importance sampling while using sublinear space and making only a single pass over the data. Our techniques also extend to performing importance sampling for second-order optimization.
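
To make the sampling scheme concrete, the Python/NumPy sketch below shows SGD with importance sampling for a robust (Huber) regression objective: each step samples a row with probability proportional to its current gradient norm and reweights the update so the gradient estimate stays unbiased. This is an illustrative sketch, not the paper's algorithm: it recomputes the exact distribution with a full pass over the data at every step, which is precisely the cost the paper's one-pass, sublinear-space data structures are designed to avoid, and the Huber loss, step size, and the helper names huber_grad and importance_sampling_sgd are hypothetical choices made for the example.

import numpy as np

def huber_grad(r, delta=1.0):
    # Derivative of the Huber loss with respect to the residual r (elementwise).
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

def importance_sampling_sgd(A, b, T=1000, eta=0.1, delta=1.0, seed=0):
    # A: (n, d) design matrix, b: (n,) responses.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(T):
        # Per-row gradients of the Huber objective and their norms (a full pass per step;
        # the paper's sketches approximate this distribution in a single pass overall).
        grads = huber_grad(A @ x - b, delta)[:, None] * A   # shape (n, d)
        norms = np.linalg.norm(grads, axis=1)
        total = norms.sum()
        p = norms / total if total > 0 else np.full(n, 1.0 / n)
        i = rng.choice(n, p=p)
        # Reweight by 1/(n * p_i) so the sampled gradient is an unbiased
        # estimate of the average gradient over all n rows.
        x -= eta * grads[i] / (n * p[i])
    return x

# Example usage on synthetic data with a few corrupted responses.
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 10))
x_true = rng.standard_normal(10)
b = A @ x_true + 0.01 * rng.standard_normal(500)
b[:10] += 50.0   # outliers that the robust objective should resist
x_hat = importance_sampling_sgd(A, b, T=2000, eta=0.05)
print(np.linalg.norm(x_hat - x_true))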

Subject Classification

ACM Subject Classification
  • Theory of computation → Streaming, sublinear and near linear time algorithms
Keywords
  • Streaming algorithms
  • stochastic optimization
  • importance sampling
