Learning Discrete Distributions from Untrusted Batches
We consider the problem of learning a discrete distribution in the presence of an epsilon fraction of malicious data sources. Specifically, we consider the setting where there is some underlying distribution, p, and each data source provides a batch of >= k samples, with the guarantee that at least a (1 - epsilon) fraction of the sources draw their samples from a distribution with total variation distance at most eta from p. We make no assumptions on the data provided by the remaining epsilon fraction of sources; this data can even be chosen as an adversarial function of the (1 - epsilon) fraction of "good" batches. We provide two algorithms: one with runtime exponential in the support size, n, but polynomial in k, 1/epsilon, and 1/eta, that takes O((n + k)/epsilon^2) batches and recovers p to error O(eta + epsilon/sqrt(k)). This recovery accuracy is information-theoretically optimal, up to constant factors, even given an infinite number of data sources. Our second algorithm applies to the eta = 0 setting and also achieves an O(epsilon/sqrt(k)) recovery guarantee, though it runs in poly((nk)^k) time. This second algorithm, which approximates a certain tensor via a rank-1 tensor minimizing l_1 distance, is surprising in light of the hardness of many low-rank tensor approximation problems, and may be of independent interest.
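As a hedged illustration (not from the paper), the data model described above can be simulated to see why naive pooling of all batches fails: an epsilon fraction of adversarial batches can shift the pooled empirical distribution by roughly epsilon in total variation, whereas the abstract's guarantee scales as epsilon/sqrt(k). All parameter values and the particular adversarial strategy below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 10, 100          # support size, samples per batch
eps = 0.1               # fraction of adversarial batches
m = 500                 # number of batches

p = rng.dirichlet(np.ones(n))        # underlying distribution (eta = 0 case)

# Good batches: k i.i.d. samples from p, summarized as empirical counts.
good = rng.multinomial(k, p, size=int(m * (1 - eps)))

# Adversarial batches: here, all mass on symbol 0 (any strategy is allowed,
# even one chosen as a function of the good batches).
bad = np.zeros((m - len(good), n), dtype=int)
bad[:, 0] = k

counts = np.vstack([good, bad])

# Naive estimator: pool all samples, ignoring the corruption.
q = counts.sum(axis=0) / counts.sum()
tv_naive = 0.5 * np.abs(q - p).sum()

# The adversary moves the pooled estimate by Theta(eps) in TV distance,
# while the paper's algorithms recover p to error O(eps / sqrt(k)).
print(f"naive TV error: {tv_naive:.3f}  (eps/sqrt(k) = {eps / np.sqrt(k):.3f})")
```

The gap between the naive error (order epsilon) and the epsilon/sqrt(k) target is exactly what exploiting the batch structure buys: within a good batch, k samples come from (nearly) the same distribution, which a per-sample corruption model cannot offer.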
robust statistics
information-theoretic optimality
47:1-47:20
Regular Paper
Mingda Qiao
Gregory Valiant
10.4230/LIPIcs.ITCS.2018.47
Creative Commons Attribution 3.0 Unported license
https://creativecommons.org/licenses/by/3.0/legalcode