Optimal Sub-Gaussian Mean Estimation in Very High Dimensions
We address the problem of mean estimation in very high dimensions, in the high-probability regime parameterized by the failure probability δ. For a distribution with covariance Σ, define its "effective dimension" as d_eff = Tr(Σ)/λ_max(Σ). For the regime where d_eff = ω(log^2(1/δ)), we give the first algorithm whose sample complexity is optimal to within a 1+o(1) factor. The algorithm has a surprisingly simple structure: 1) re-center the samples using a known sub-Gaussian estimator, 2) carefully choose an easy-to-compute positive integer t and remove the t samples farthest from the origin, and 3) return the sample mean of the remaining samples. The core of the analysis relies on a novel vector Bernstein-type tail bound, showing that, under general conditions, the sample mean of a bounded high-dimensional distribution is highly concentrated around a spherical shell.
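The three steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several pieces are assumptions: the names `effective_dimension`, `rough_center` and `trimmed_mean_sketch` are hypothetical; `rough_center` (a coordinate-wise median of bucket means) is only a stand-in for the "known sub-Gaussian estimator" of step 1; the trimming count `t` is supplied by the caller, because the abstract does not state the rule for choosing it; and adding the center back in the last line is an assumption about how the result is expressed in the original coordinates.

```python
import numpy as np


def effective_dimension(cov):
    """Return d_eff = Tr(cov) / lambda_max(cov) for a symmetric PSD matrix."""
    cov = np.asarray(cov, dtype=float)
    # eigvalsh returns eigenvalues in ascending order, so the last is the largest.
    return np.trace(cov) / np.linalg.eigvalsh(cov)[-1]


def rough_center(samples, num_buckets):
    """Stand-in for step 1: coordinate-wise median of bucket means.

    Assumption: this is NOT the sub-Gaussian estimator the paper uses;
    it only plays the role of a preliminary center in this sketch.
    """
    buckets = np.array_split(samples, num_buckets)
    bucket_means = np.stack([b.mean(axis=0) for b in buckets])
    return np.median(bucket_means, axis=0)


def trimmed_mean_sketch(samples, t, num_buckets=10):
    """Re-center, drop the t samples farthest from the origin, average the rest.

    Assumption: t is chosen by the caller; the paper's rule for t is not
    reproduced here. num_buckets should not exceed the number of samples.
    """
    samples = np.asarray(samples, dtype=float)
    n = samples.shape[0]
    if not 0 <= t < n:
        raise ValueError("t must satisfy 0 <= t < number of samples")
    center = rough_center(samples, num_buckets)
    centered = samples - center                  # step 1: re-center
    norms = np.linalg.norm(centered, axis=1)     # distance from the new origin
    keep = np.argsort(norms)[: n - t]            # step 2: remove the t farthest
    return center + centered[keep].mean(axis=0)  # step 3: mean, shifted back
```

For example, `trimmed_mean_sketch(np.random.default_rng(0).standard_normal((1000, 200)), t=5)` returns a length-200 estimate, and `effective_dimension` applied to an identity covariance returns the ambient dimension. With `t=0` the sketch reduces to the plain sample mean.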
High-dimensional mean estimation
Mathematics of computing~Nonparametric statistics
Mathematics of computing~Multivariate statistics
Theory of computation~Sample complexity and generalization bounds
Theory of computation~Streaming, sublinear and near linear time algorithms
98:1-98:21
Regular Paper
We thank Avi Wigderson for insightful discussions on geometric intuitions for high-dimensional inequalities.
Jasper C.H. Lee
University of Wisconsin-Madison, WI, USA
Supported in part by the generous funding of a Croucher Fellowship for Postdoctoral Research and by NSF award DMS-2023239. Part of this work was done during Jasper's visit to the Simons Institute for the Theory of Computing.
Paul Valiant
Purdue University, West Lafayette, IN, USA
Supported in part by NSF awards CCF-2127806 and IIS-1562657. Part of this work was done at the Institute for Advanced Study, partially supported by NSF award DMS-1926686, and indirectly supported by NSF award CCF-1900460.
10.4230/LIPIcs.ITCS.2022.98
Jasper C.H. Lee and Paul Valiant
Creative Commons Attribution 4.0 International license
https://creativecommons.org/licenses/by/4.0/legalcode