Optimal Sub-Gaussian Mean Estimation in Very High Dimensions

Authors Jasper C.H. Lee, Paul Valiant




Author Details

Jasper C.H. Lee
  • University of Wisconsin-Madison, WI, USA
Paul Valiant
  • Purdue University, West Lafayette, IN, USA

Acknowledgements

We thank Avi Wigderson for insightful discussions on geometric intuitions for high-dimensional inequalities.

Cite As

Jasper C.H. Lee and Paul Valiant. Optimal Sub-Gaussian Mean Estimation in Very High Dimensions. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 215, pp. 98:1-98:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.ITCS.2022.98

Abstract

We address the problem of mean estimation in very high dimensions, in the high-probability regime parameterized by failure probability δ. For a distribution with covariance Σ, let its "effective dimension" be d_eff = Tr(Σ)/λ_max(Σ). For the regime where d_eff = ω(log^2(1/δ)), we show the first algorithm whose sample complexity is optimal to within a 1+o(1) factor. The algorithm has a surprisingly simple structure: 1) re-center the samples using a known sub-Gaussian estimator, 2) carefully choose an easy-to-compute positive integer t and then remove the t samples farthest from the origin, and 3) return the sample mean of the remaining samples. The core of the analysis relies on a novel vector Bernstein-type tail bound, showing that under general conditions, the sample mean of a bounded high-dimensional distribution is highly concentrated around a spherical shell.
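
The three-step structure described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify which sub-Gaussian estimator is used in step 1 or how t is chosen, so the sketch substitutes a coordinate-wise median of block means for step 1 (an assumption made here for simplicity) and takes t as a caller-supplied parameter. The function names `effective_dimension`, `initial_estimate`, and `trimmed_mean_estimate`, and the `num_blocks` parameter, are hypothetical.

```python
import numpy as np


def effective_dimension(cov):
    """Return d_eff = Tr(Sigma) / lambda_max(Sigma) for a symmetric PSD matrix."""
    cov = np.asarray(cov, dtype=float)
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    return np.trace(cov) / np.linalg.eigvalsh(cov)[-1]


def initial_estimate(samples, num_blocks=8):
    """Hypothetical stand-in for step 1: coordinate-wise median of block means.

    The paper calls for a known sub-Gaussian estimator; this simpler
    estimator is an assumption used only to keep the sketch short.
    """
    blocks = np.array_split(samples, num_blocks)
    block_means = np.stack([block.mean(axis=0) for block in blocks])
    return np.median(block_means, axis=0)


def trimmed_mean_estimate(samples, t, num_blocks=8):
    """Re-center, drop the t samples farthest from the origin, average the rest.

    The choice of t is left to the caller; the paper's rule for choosing it
    is not reproduced here.
    """
    samples = np.asarray(samples, dtype=float)
    n = samples.shape[0]
    if not 0 <= t < n:
        raise ValueError("t must satisfy 0 <= t < number of samples")

    # Step 1: re-center the samples around a preliminary estimate.
    center = initial_estimate(samples, num_blocks)
    centered = samples - center

    # Step 2: remove the t re-centered samples with the largest Euclidean norm.
    norms = np.linalg.norm(centered, axis=1)
    keep = np.argsort(norms)[: n - t]

    # Step 3: average the remaining samples and undo the re-centering shift.
    return center + centered[keep].mean(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.standard_normal((2000, 500))  # identity covariance, true mean 0
    data[0] += 1e4                           # one gross outlier
    estimate = trimmed_mean_estimate(data, t=5)
    print("d_eff of identity covariance:", effective_dimension(np.eye(500)))
    print("error of trimmed estimate:", np.linalg.norm(estimate))
    print("error of plain sample mean:", np.linalg.norm(data.mean(axis=0)))
```

With t = 0 the function reduces to the plain sample mean, which makes the effect of the trimming step easy to compare on the same data.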

Subject Classification

ACM Subject Classification
  • Mathematics of computing → Nonparametric statistics
  • Mathematics of computing → Multivariate statistics
  • Theory of computation → Sample complexity and generalization bounds
  • Theory of computation → Streaming, sublinear and near linear time algorithms
Keywords
  • High-dimensional mean estimation

