
Distribution Testing with a Confused Collector

Authors: Renato Ferreira Pinto Jr., Nathaniel Harms



File

LIPIcs.ITCS.2024.47.pdf
  • Filesize: 0.78 MB
  • 14 pages

Author Details

Renato Ferreira Pinto Jr.
  • University of Waterloo, Canada
Nathaniel Harms
  • EPFL, Lausanne, Switzerland

Acknowledgements

We thank Eric Blais for helpful discussions and comments on the presentation of this article, and Maryam Aliakbarpour for references on testing with imperfect information. We thank the anonymous reviewers for their comments and references to related work.

Cite As

Renato Ferreira Pinto Jr. and Nathaniel Harms. Distribution Testing with a Confused Collector. In 15th Innovations in Theoretical Computer Science Conference (ITCS 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 287, pp. 47:1-47:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.ITCS.2024.47

Abstract

We are interested in testing properties of distributions with systematically mislabeled samples. Our goal is to make decisions about unknown probability distributions, using a sample that has been collected by a confused collector, such as a machine-learning classifier that has not learned to distinguish all elements of the domain. The confused collector holds an unknown clustering of the domain and an input distribution μ, and provides two oracles: a sample oracle which produces a sample from μ that has been labeled according to the clustering; and a label-query oracle which returns the label of a query point x according to the clustering. Our first set of results shows that identity, uniformity, and equivalence of distributions can be tested efficiently, under the earth-mover distance, with remarkably weak conditions on the confused collector, even when the unknown clustering is adversarial. This requires defining a variant of the distribution testing task (inspired by the recent testable learning framework of Rubinfeld & Vasilyan), where the algorithm should test a joint property of the distribution and its clustering. As an example, we get efficient testers when the distribution tester is allowed to reject if it detects that the confused collector clustering is "far" from being a decision tree. The second set of results shows that we can sometimes do significantly better when the clustering is random instead of adversarial. For certain one-dimensional random clusterings, we show that uniformity can be tested under the TV distance using Õ((√n)/(ρ^{3/2} ε²)) samples and zero queries, where ρ ∈ (0,1] controls the "resolution" of the clustering. We improve this to O((√n)/(ρ ε²)) when queries are allowed.
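To make the access model in the abstract concrete, here is a minimal toy sketch of a confused collector over the domain {0, ..., n-1}. The class name `ConfusedCollector`, its method names, and the representation of the clustering as a list of integer labels are all assumptions made for illustration; they do not come from the paper. The sketch also assumes that the sample oracle reveals only the cluster label of each drawn point, which is one reading of "labeled according to the clustering".

```python
import random


class ConfusedCollector:
    """Toy model (hypothetical names): a hidden clustering plus a distribution mu."""

    def __init__(self, mu, clustering, seed=0):
        # mu[x] is the probability of domain point x; clustering[x] is its cluster label.
        if len(mu) != len(clustering):
            raise ValueError("mu and clustering must cover the same domain")
        self._mu = list(mu)
        self._labels = list(clustering)
        self._rng = random.Random(seed)

    def sample(self, m):
        """Sample oracle: draw m points from mu, return only their cluster labels."""
        domain = range(len(self._mu))
        points = self._rng.choices(domain, weights=self._mu, k=m)
        return [self._labels[x] for x in points]

    def label(self, x):
        """Label-query oracle: return the cluster label of the query point x."""
        return self._labels[x]


# Usage: uniform mu on 8 points, confused into three interval clusters.
n = 8
collector = ConfusedCollector(mu=[1 / n] * n, clustering=[0, 0, 0, 1, 1, 2, 2, 2])
labels = collector.sample(5)    # five cluster labels, never the raw points
boundary = collector.label(3)   # label of domain point 3
```

A tester in this model never sees the raw points of the sample, only their labels, and can learn about the clustering itself only through calls such as `label(x)`.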

Subject Classification

ACM Subject Classification
  • Theory of computation → Streaming, sublinear and near linear time algorithms
  • Theory of computation → Randomness, geometry and discrete structures
  • Theory of computation → Computational geometry
  • Theory of computation → Machine learning theory
  • Theory of computation → Probabilistic computation
  • Mathematics of computing → Probability and statistics
Keywords
  • Distribution testing
  • property testing
  • uniformity testing
  • identity testing
  • earth-mover distance
  • sublinear algorithms

References

  1. Jayadev Acharya, Clément Canonne, Cody Freitag, and Himanshu Tyagi. Test without trust: Optimal locally private distribution testing. In Proceedings, International Conference on Artificial Intelligence and Statistics (AISTATS), pages 2067-2076. PMLR, 2019.
  2. Jayadev Acharya, Clément L Canonne, Cody Freitag, Ziteng Sun, and Himanshu Tyagi. Inference under information constraints iii: Local privacy constraints. IEEE Journal on Selected Areas in Information Theory, 2(1):253-267, 2021.
  3. Jayadev Acharya, Clément L Canonne, and Himanshu Tyagi. Inference under information constraints: Lower bounds from chi-square contraction. In Proceedings, Conference on Learning Theory (COLT), pages 3-17. PMLR, 2019.
  4. Jayadev Acharya, Clément L Canonne, and Himanshu Tyagi. Inference under information constraints ii: Communication constraints and shared randomness. IEEE Transactions on Information Theory, 66(12):7856-7877, 2020.
  5. Maryam Aliakbarpour, Themis Gouleakis, John Peebles, Ronitt Rubinfeld, and Anak Yodpinyanee. Towards testing monotonicity of distributions over general posets. In Proceedings, Conference on Learning Theory (COLT), pages 34-82. PMLR, 2019.
  6. Maryam Aliakbarpour, Ravi Kumar, and Ronitt Rubinfeld. Testing mixtures of discrete distributions. In Proceedings, Conference on Learning Theory (COLT), pages 83-114. PMLR, 2019.
  7. Maryam Aliakbarpour and Sandeep Silwal. Testing properties of multiple distributions with few samples. In Proceedings, Innovations in Theoretical Computer Science (ITCS), 2020.
  8. Tugkan Batu, Ravi Kumar, and Ronitt Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 381-390, 2004.
  9. Eric Blais, Clément L Canonne, and Tom Gur. Distribution testing lower bounds via reductions from communication complexity. ACM Transactions on Computation Theory, 11(2):1-37, 2019.
  10. Clément Canonne. Topics and techniques in distribution testing: A biased but representative sample. Foundations and Trends in Communications and Information Theory, 19(6):1032-1198, 2022.
  11. Clément L Canonne. Are few bins enough: Testing histogram distributions. In Proceedings, ACM Symposium on Principles of Database Systems (PODS), pages 455-463, 2016.
  12. Clément L Canonne. A survey on distribution testing: Your data is big. But is it blue? Theory of Computing, pages 1-100, 2020.
  13. Clément L Canonne, Ilias Diakonikolas, Themis Gouleakis, and Ronitt Rubinfeld. Testing shape restrictions of discrete distributions. Theory of Computing Systems, 62(1):4-62, 2018.
  14. Clément L Canonne, Ilias Diakonikolas, Daniel Kane, and Sihan Liu. Nearly-tight bounds for testing histogram distributions. Proceedings, Advances in Neural Information Processing Systems (NeurIPS), 35:31599-31611, 2022.
  15. Clément L Canonne and Karl Wimmer. Testing data binnings. In Proceedings of APPROX/RANDOM. Schloss Dagstuhl-Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing, 2020.
  16. Clément L Canonne and Karl Wimmer. Identity testing under label mismatch. In Proceedings, International Symposium on Algorithms and Computation (ISAAC). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2021.
  17. Siu-On Chan, Ilias Diakonikolas, Paul Valiant, and Gregory Valiant. Optimal algorithms for testing closeness of discrete distributions. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1193-1203. SIAM, 2014.
  18. Constantinos Daskalakis, Themis Gouleakis, Christos Tzamos, and Manolis Zampetakis. Efficient statistics, in high dimensions, from truncated samples. In Proceedings, IEEE Symposium on Foundations of Computer Science (FOCS), pages 639-649. IEEE, 2018.
  19. Anindya De, Shivam Nadimpalli, and Rocco A Servedio. Testing convex truncation. In Proceedings, ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 4050-4082. SIAM, 2023.
  20. Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Collision-based testers are optimal for uniformity and closeness. Chicago Journal of Theoretical Computer Science, 1:1-21, 2019.
  21. Ilias Diakonikolas and Daniel M Kane. A new approach for testing properties of discrete distributions. In Proceedings, IEEE Symposium on Foundations of Computer Science (FOCS), pages 685-694. IEEE, 2016.
  22. Khanh Do Ba, Huy L Nguyen, Huy N Nguyen, and Ronitt Rubinfeld. Sublinear time algorithms for earth mover’s distance. Theory of Computing Systems, 48:428-442, 2011.
  23. Renato Ferreira Pinto Jr. and Nathaniel Harms. Distribution testing under the parity trace, 2023. arXiv:2304.01374. URL: https://arxiv.org/abs/2304.01374.
  24. Renato Ferreira Pinto Jr. and Nathaniel Harms. Distribution testing with a confused collector. arXiv, 2023. URL: https://arxiv.org/abs/2311.1424.
  25. Dimitris Fotakis, Alkis Kalavasis, Vasilis Kontonis, and Christos Tzamos. Efficient algorithms for learning from coarse labels. In Proceedings, Conference on Learning Theory (COLT), pages 2060-2079. PMLR, 2021.
  26. Marco Gaboardi and Ryan Rogers. Local private hypothesis testing: Chi-square tests. In Proceedings, International Conference on Machine Learning (ICML), pages 1626-1635. PMLR, 2018.
  27. Oded Goldreich. The uniform distribution is complete with respect to testing identity to a fixed distribution. In Computational Complexity and Property Testing: On the Interplay Between Randomness and Computation. Springer, 2020. ECCC TR16-015. URL: https://doi.org/10.1007/978-3-030-43662-9_10.
  28. Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. In Studies in Complexity and Cryptography. Miscellanea on the Interplay between Randomness and Computation, pages 68-75. Springer, 2011. URL: https://doi.org/10.1007/978-3-642-22670-0_9.
  29. Oded Goldreich and Dana Ron. Testing distributions of huge objects. In Proceedings, Innovations in Theoretical Computer Science (ITCS). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2022.
  30. Piotr Indyk, Reut Levi, and Ronitt Rubinfeld. Approximating and testing k-histogram distributions in sub-linear time. In Proceedings, ACM Symposium on Principles of Database Systems (PODS), pages 15-22, 2012.
  31. Yin Tat Lee, Aaron Sidford, and Santosh S Vempala. Efficient convex optimization with oracles. In Building Bridges II: Mathematics of László Lovász, pages 317-335. Springer, 2020.
  32. Reut Levi, Dana Ron, and Ronitt Rubinfeld. Testing properties of collections of distributions. Theory of Computing, 9(1):295-347, 2013.
  33. Reut Levi, Dana Ron, and Ronitt Rubinfeld. Testing similar means. SIAM Journal on Discrete Mathematics, 28(4):1699-1724, 2014.
  34. Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750-4755, 2008.
  35. Ronitt Rubinfeld and Arsen Vasilyan. Testing distributional assumptions of learning algorithms. In Proceedings, ACM Symposium on Theory of Computing (STOC). ACM, 2023.
  36. Or Sheffet. Locally private hypothesis testing. In Proceedings, International Conference on Machine Learning (ICML), pages 4605-4614. PMLR, 2018.
  37. Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing. SIAM Journal on Computing, 46(1):429-455, 2017.