Optimal Sub-Gaussian Mean Estimation in Very High Dimensions

Lee, Jasper C.H.; Valiant, Paul

doi:10.4230/LIPIcs.ITCS.2022.98

Abstract

We address the problem of mean estimation in very high dimensions, in the high probability regime parameterized by failure probability δ. For a distribution with covariance Σ, let its "effective dimension" be d_eff = Tr(Σ)/λ_max(Σ). For the regime where d_eff = ω(log²(1/δ)), we show the first algorithm whose sample complexity is optimal to within a 1+o(1) factor. The algorithm has a surprisingly simple structure: 1) re-center the samples using a known sub-Gaussian estimator, 2) carefully choose an easy-to-compute positive integer t and then remove the t samples farthest from the origin, and 3) return the sample mean of the remaining samples. The core of the analysis relies on a novel vector Bernstein-type tail bound, showing that under general conditions, the sample mean of a bounded high-dimensional distribution is highly concentrated around a spherical shell.
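
The three steps described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation. The abstract does not say which sub-Gaussian estimator is used for re-centering, nor how t is chosen, so both are taken here as inputs: `center` and `t` are hypothetical parameters, and the function names are made up for this sketch. Shifting the trimmed mean back to the original coordinates is also an assumption about how step 3 is meant.

```python
import math
from typing import List, Sequence

Vector = List[float]


def effective_dimension(trace: float, lambda_max: float) -> float:
    """d_eff = Tr(Sigma) / lambda_max(Sigma), as defined in the abstract."""
    return trace / lambda_max


def trimmed_mean_sketch(samples: Sequence[Sequence[float]],
                        center: Sequence[float],
                        t: int) -> Vector:
    """Re-center, drop the t samples farthest from the origin, average the rest.

    `center` stands in for the output of some sub-Gaussian estimator and
    `t` for the paper's choice of trimming count; both are assumptions
    supplied by the caller, not computed here.
    """
    n = len(samples)
    if not 0 <= t < n:
        raise ValueError("t must satisfy 0 <= t < number of samples")

    # Step 1: re-center every sample around the initial estimate.
    shifted = [[x - c for x, c in zip(s, center)] for s in samples]

    # Step 2: sort by Euclidean norm and discard the t largest.
    shifted.sort(key=lambda v: math.sqrt(sum(x * x for x in v)))
    kept = shifted[:n - t]

    # Step 3: average the remaining samples, then shift back
    # to the original coordinates (an assumption of this sketch).
    dim = len(center)
    mean_shifted = [sum(v[j] for v in kept) / len(kept) for j in range(dim)]
    return [m + c for m, c in zip(mean_shifted, center)]
```

For example, with samples `[[0, 0], [2, 0], [0, 2], [100, 100]]`, `center=[0, 0]` and `t=1`, the far point `[100, 100]` is discarded and the result is the mean of the other three points.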

Cite As

Jasper C.H. Lee and Paul Valiant. Optimal Sub-Gaussian Mean Estimation in Very High Dimensions. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 215, pp. 98:1-98:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022) https://doi.org/10.4230/LIPIcs.ITCS.2022.98

Author Details

Jasper C.H. Lee

University of Wisconsin-Madison, WI, USA

Paul Valiant

Purdue University, West Lafayette, IN, USA

Funding

Lee, Jasper C.H.: Supported in part by the generous funding of a Croucher Fellowship for Postdoctoral Research and by NSF award DMS-2023239. Part of this work was done during Jasper’s visit at the Simons Institute for the Theory of Computing.
Valiant, Paul: Supported in part by NSF awards CCF-2127806 and IIS-1562657. Part of this work was done at the Institute for Advanced Study, partially supported by NSF award DMS-1926686, and indirectly supported by NSF award CCF-1900460.

Acknowledgements

We thank Avi Wigderson for insightful discussions on geometric intuitions for high-dimensional inequalities.

