Learning Discrete Distributions from Untrusted Batches

Authors Mingda Qiao, Gregory Valiant


Cite As

Mingda Qiao and Gregory Valiant. Learning Discrete Distributions from Untrusted Batches. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 94, pp. 47:1-47:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


We consider the problem of learning a discrete distribution in the presence of an epsilon fraction of malicious data sources. Specifically, we consider the setting where there is some underlying distribution, p, and each data source provides a batch of >= k samples, with the guarantee that at least a (1 - epsilon) fraction of the sources draw their samples from a distribution with total variation distance at most eta from p. We make no assumptions on the data provided by the remaining epsilon fraction of sources; this data can even be chosen as an adversarial function of the (1 - epsilon) fraction of "good" batches. We provide two algorithms. The first has runtime exponential in the support size, n, but polynomial in k, 1/epsilon and 1/eta; it takes O((n + k)/epsilon^2) batches and recovers p to error O(eta + epsilon/sqrt(k)). This recovery accuracy is information-theoretically optimal, to constant factors, even given an infinite number of data sources. Our second algorithm applies to the eta = 0 setting and also achieves an O(epsilon/sqrt(k)) recovery guarantee, though it runs in poly((nk)^k) time. This second algorithm, which approximates a certain tensor via a rank-1 tensor minimizing l_1 distance, is surprising in light of the hardness of many low-rank tensor approximation problems, and may be of independent interest.
Keywords

  • robust statistics
  • information-theoretic optimality
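
To make the data model in the abstract concrete, the following is a minimal simulation sketch for the eta = 0 case. It is not either of the paper's algorithms. It only generates batches under the stated model and measures how far a naive estimator, which pools all samples and takes empirical frequencies, lands from p. The function names, the fixed adversary that fills every bad batch with a single symbol, and the parameter values are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def total_variation(p, q):
    """Total variation distance between two distributions given as lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def draw_batches(p, num_batches, k, epsilon, rng):
    """Simulate the batch model with eta = 0.

    A (1 - epsilon) fraction of batches hold k i.i.d. samples from p.
    The remaining batches come from a hypothetical adversary that
    reports only the last symbol (an assumption for illustration).
    """
    n = len(p)
    num_bad = int(epsilon * num_batches)
    batches = []
    for _ in range(num_batches - num_bad):
        batches.append(rng.choices(range(n), weights=p, k=k))
    for _ in range(num_bad):
        batches.append([n - 1] * k)
    return batches


def pooled_estimate(batches, n):
    """Naive baseline: empirical frequencies over all pooled samples."""
    counts = Counter(x for batch in batches for x in batch)
    total = sum(counts.values())
    return [counts[i] / total for i in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.25, 0.25, 0.25, 0.25]   # underlying distribution, n = 4
    k, epsilon = 10, 0.2           # batch size and corrupted fraction
    batches = draw_batches(p, num_batches=1000, k=k, epsilon=epsilon, rng=rng)
    err = total_variation(p, pooled_estimate(batches, len(p)))
    print("pooled-estimate TV error:", err)
    print("epsilon / sqrt(k):       ", epsilon / k ** 0.5)
```

With this adversary the pooled estimate concentrates around the mixture of p and a point mass, so its error is close to epsilon times the total variation distance between the two, about 0.15 here. That does not shrink as k grows, whereas the O(epsilon/sqrt(k)) guarantee in the abstract does. A real adversary may choose its batches far less conspicuously than this one.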



