Adaptive Sketches for Robust Regression with Importance Sampling

Mahabadi, Sepideh; Woodruff, David P.; Zhou, Samson

doi:10.4230/LIPIcs.APPROX/RANDOM.2022.31

Abstract

We introduce data structures for solving robust regression through stochastic gradient descent (SGD) by sampling gradients with probability proportional to their norm, i.e., importance sampling. Although SGD is widely used for large scale machine learning, it is well-known for possibly experiencing slow convergence rates due to the high variance from uniform sampling. On the other hand, importance sampling can significantly decrease the variance but is usually difficult to implement because computing the sampling probabilities requires additional passes over the data, in which case standard gradient descent (GD) could be used instead. In this paper, we introduce an algorithm that approximately samples T gradients of dimension d from nearly the optimal importance sampling distribution for a robust regression problem over n rows. Thus our algorithm effectively runs T steps of SGD with importance sampling while using sublinear space and just making a single pass over the data. Our techniques also extend to performing importance sampling for second-order optimization.

Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137-147, 1999.
Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming algorithms via precision sampling. In IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS, pages 363-372, 2011.
Jaroslaw Blasiok, Vladimir Braverman, Stephen R. Chestnut, Robert Krauthgamer, and Lin F. Yang. Streaming symmetric norms via measure concentration. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC, pages 716-729, 2017.
Michael Bosse, Gabriel Agamennoni, and Igor Gilitschenski. Robust estimation and applications in robotics. Found. Trends Robotics, 4(4):225-269, 2016.
Vladimir Braverman, Petros Drineas, Cameron Musco, Christopher Musco, Jalaj Upadhyay, David P. Woodruff, and Samson Zhou. Near optimal linear algebra in the online and sliding window models. In 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS, pages 517-528, 2020.
Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231-357, 2015.
Moses Charikar, Kevin C. Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3-15, 2004.
Kenneth L. Clarkson, Ruosong Wang, and David P. Woodruff. Dimensionality reduction for tukey regression. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML, pages 1262-1271, 2019.
Kenneth L. Clarkson and David P. Woodruff. Sketching for M-estimators: A unified approach to robust regression. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 921-939, 2015.
Hadi Daneshmand, Aurélien Lucchi, and Thomas Hofmann. Starting small - learning with adaptive sample sizes. In Proceedings of the 33nd International Conference on Machine Learning, ICML, pages 1463-1471, 2016.
Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems, pages 1646-1654, 2014.
Antoine Guitton and William W Symes. Robust and stable velocity analysis using the huber function. In SEG Technical Program Expanded Abstracts 1999, pages 1166-1169. Society of Exploration Geophysicists, 1999.
Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics, pages 492-518. Springer, 1992.
Piotr Indyk and David P. Woodruff. Optimal approximations of the frequency moments of data streams. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing, pages 202-208, 2005.
Rajesh Jayaram and David P. Woodruff. Perfect lp sampling in a data stream. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS, pages 544-555, 2018.
Rajesh Jayaram, David P. Woodruff, and Samson Zhou. Truly perfect samplers for data streams and sliding windows. In PODS: International Conference on Management of Data, pages 29-40, 2022.
Yifei Jiang, Yi Li, Yiming Sun, Jiaxin Wang, and David P. Woodruff. Single pass entrywise-transformed low rank approximation. In Proceedings of the 38th International Conference on Machine Learning, ICML, pages 4982-4991, 2021.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, Proceedings., pages 315-323, 2013.
Tyler B. Johnson and Carlos Guestrin. Training deep models faster with robust, approximate importance sampling. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS, pages 7276-7286, 2018.
Hossein Jowhari, Mert Saglam, and Gábor Tardos. Tight bounds for lp samplers, finding duplicates in streams, and related problems. In Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS, pages 49-58, 2011.
Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning with importance sampling. In Proceedings of the 35th International Conference on Machine Learning, ICML, pages 2530-2539, 2018.
Sepideh Mahabadi, Ilya P. Razenshteyn, David P. Woodruff, and Samson Zhou. Non-adaptive adaptive sampling on turnstile streams. In Proccedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC, pages 1251-1264, 2020.
Raghu Meka. Cs289ml: Algorithmic machine learning notes, 2017. URL: https://raghumeka.github.io/CS289ML/gdnotes.pdf.
Morteza Monemizadeh and David P. Woodruff. 1-pass relative-error l_p-sampling with applications. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 1143-1160, 2010.
Deanna Needell, Nathan Srebro, and Rachel Ward. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. Math. Program., 155(1-2):549-573, 2016.
Arkadi Nemirovski, Anatoli B. Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.
Arkadi Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization, 1983.
Noam Nisan. Pseudorandom generators for space-bounded computation. Combinatorica, 12(4):449-461, 1992.
Eric Price, Sandeep Silwal, and Samson Zhou. Hardness and algorithms for robust and sparse optimization. In International Conference on Machine Learning, ICML, pages 17926-17944, 2022.
Xun Qian, Peter Richtárik, Robert M. Gower, Alibek Sailanbayev, Nicolas Loizou, and Egor Shulgin. SGD with arbitrary sampling: General analysis and improved rates. In Proceedings of the 36th International Conference on Machine Learning, ICML, volume 97, pages 5200-5209, 2019.
Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alexander J. Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems, pages 2647-2655, 2015.
Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400-407, 1951.
Nicolas Le Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems., pages 2672-2680, 2012.
Farnood Salehi, Patrick Thiran, and L. Elisa Celis. Coordinate descent with bandit sampling. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS, pages 9267-9277, 2018.
Sebastian U. Stich, Anant Raj, and Martin Jaggi. Safe adaptive importance sampling. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, pages 4381-4391, 2017.
Mikkel Thorup and Yin Zhang. Tabulation based 4-universal hashing with applications to second moment estimation. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 615-624. SIAM, 2004.
Murad Tukan, Xuan Wu, Samson Zhou, Vladimir Braverman, and Dan Feldman. New coresets for projective clustering and applications. In International Conference on Artificial Intelligence and Statistics, AISTATS, 2022.
Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49-67, 2006.
Zhengyou Zhang. Parameter estimation techniques: a tutorial with application to conic fitting. Image Vis. Comput., 15(1):59-76, 1997.
Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the 32nd International Conference on Machine Learning ICML, pages 1-9, 2015.

Adaptive Sketches for Robust Regression with Importance Sampling

Authors Sepideh Mahabadi, David P. Woodruff, Samson Zhou

File

Document Identifiers

Author Details

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Adaptive Sketches for Robust Regression with Importance Sampling

Authors Sepideh Mahabadi, David P. Woodruff, Samson Zhou

File

Document Identifiers

Author Details

Funding

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

Related Versions

References