Learning Discrete Distributions from Untrusted Batches

Qiao, Mingda; Valiant, Gregory

doi:10.4230/LIPIcs.ITCS.2018.47

Cite As

Mingda Qiao and Gregory Valiant. Learning Discrete Distributions from Untrusted Batches. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 94, pp. 47:1-47:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018) https://doi.org/10.4230/LIPIcs.ITCS.2018.47

Abstract

We consider the problem of learning a discrete distribution in the presence of an epsilon fraction of malicious data sources. Specifically, we consider the setting where there is some underlying distribution, p, and each data source provides a batch of at least k samples, with the guarantee that at least a (1 - epsilon) fraction of the sources draw their samples from a distribution with total variation distance at most eta from p. We make no assumptions on the data provided by the remaining epsilon fraction of sources; this data can even be chosen as an adversarial function of the (1 - epsilon) fraction of "good" batches. We provide two algorithms. The first has runtime exponential in the support size, n, but polynomial in k, 1/epsilon, and 1/eta; it takes O((n + k)/epsilon^2) batches and recovers p to error O(eta + epsilon/sqrt(k)). This recovery accuracy is information-theoretically optimal, up to constant factors, even given an infinite number of data sources. Our second algorithm applies to the eta = 0 setting and also achieves an O(epsilon/sqrt(k)) recovery guarantee, though it runs in poly((nk)^k) time. This second algorithm, which approximates a certain tensor via a rank-1 tensor minimizing l_1 distance, is surprising in light of the hardness of many low-rank tensor approximation problems, and may be of independent interest.
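
To make the data model concrete, the following is a minimal Python sketch of the batch setting with eta = 0. It is not either of the paper's algorithms. It only simulates the setup and measures how far a naive pooled estimate lands from p, compared with the epsilon/sqrt(k) scale quoted in the abstract. The function names, the parameter values, and the fixed adversary (every bad batch repeats symbol 0) are assumptions chosen for illustration; the adversary in the abstract may be far stronger, since it can depend on the good batches.

```python
import random
from collections import Counter


def tv_distance(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def make_batches(p, k, num_batches, epsilon, rng):
    """Return num_batches batches of k samples each.

    An epsilon fraction of the batches are adversarial. The adversary used
    here is a hypothetical, non-adaptive one: every bad batch is k copies
    of symbol 0. Good batches draw k i.i.d. samples from p (eta = 0).
    """
    n = len(p)
    num_bad = int(epsilon * num_batches)
    batches = [[0] * k for _ in range(num_bad)]
    for _ in range(num_batches - num_bad):
        batches.append(rng.choices(range(n), weights=p, k=k))
    return batches


def pooled_estimate(batches, n):
    """Naive baseline: empirical distribution of all samples pooled together."""
    counts = Counter(x for batch in batches for x in batch)
    total = sum(counts.values())
    return [counts[i] / total for i in range(n)]


rng = random.Random(0)
n, k, num_batches, epsilon = 4, 100, 2000, 0.1  # illustrative values
p = [0.25] * n

batches = make_batches(p, k, num_batches, epsilon, rng)
naive_error = tv_distance(p, pooled_estimate(batches, n))
target_scale = epsilon / k ** 0.5  # the epsilon/sqrt(k) rate, constants ignored

print(f"naive pooled error: {naive_error:.4f}")
print(f"epsilon/sqrt(k):    {target_scale:.4f}")
```

Under these assumptions the pooled estimate is roughly 0.9 p plus 0.1 times a point mass on symbol 0, so its error stays near epsilon times the distance between p and that point mass, no matter how large k is. The guarantee in the abstract, by contrast, shrinks as the batch size k grows.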
