Distribution Testing with a Confused Collector

Ferreira Pinto Jr., Renato; Harms, Nathaniel

doi:10.4230/LIPIcs.ITCS.2024.47

Author Details

Renato Ferreira Pinto Jr.

University of Waterloo, Canada

Nathaniel Harms

EPFL, Lausanne, Switzerland

Cite As

Renato Ferreira Pinto Jr. and Nathaniel Harms. Distribution Testing with a Confused Collector. In 15th Innovations in Theoretical Computer Science Conference (ITCS 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 287, pp. 47:1-47:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.ITCS.2024.47

Abstract

We are interested in testing properties of distributions with systematically mislabeled samples. Our goal is to make decisions about unknown probability distributions, using a sample that has been collected by a confused collector, such as a machine-learning classifier that has not learned to distinguish all elements of the domain. The confused collector holds an unknown clustering of the domain and an input distribution μ, and provides two oracles: a sample oracle which produces a sample from μ that has been labeled according to the clustering; and a label-query oracle which returns the label of a query point x according to the clustering. Our first set of results shows that identity, uniformity, and equivalence of distributions can be tested efficiently, under the earth-mover distance, with remarkably weak conditions on the confused collector, even when the unknown clustering is adversarial. This requires defining a variant of the distribution testing task (inspired by the recent testable learning framework of Rubinfeld & Vasilyan), where the algorithm should test a joint property of the distribution and its clustering. As an example, we get efficient testers when the distribution tester is allowed to reject if it detects that the confused collector clustering is "far" from being a decision tree. The second set of results shows that we can sometimes do significantly better when the clustering is random instead of adversarial. For certain one-dimensional random clusterings, we show that uniformity can be tested under the TV distance using Õ((√n)/(ρ^{3/2} ε²)) samples and zero queries, where ρ ∈ (0,1] controls the "resolution" of the clustering. We improve this to O((√n)/(ρ ε²)) when queries are allowed.
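The two oracles described in the abstract can be pictured with a small toy model. The following Python sketch is only an illustration under stated assumptions: the class name `ConfusedCollector`, the finite domain, the dictionary representation of the clustering, and the example values are all hypothetical and are not taken from the paper.

```python
import random
from typing import Dict, Hashable, Optional, Sequence


class ConfusedCollector:
    """Toy model (hypothetical): hides a clustering of a finite domain and a distribution mu."""

    def __init__(
        self,
        domain: Sequence[Hashable],
        mu: Sequence[float],
        clustering: Dict[Hashable, Hashable],
        seed: Optional[int] = None,
    ) -> None:
        if len(domain) != len(mu):
            raise ValueError("domain and mu must have the same length")
        if abs(sum(mu) - 1.0) > 1e-9:
            raise ValueError("mu must sum to 1")
        if any(x not in clustering for x in domain):
            raise ValueError("clustering must label every domain element")
        self._domain = list(domain)
        self._mu = list(mu)
        self._clustering = dict(clustering)
        self._rng = random.Random(seed)

    def sample(self) -> Hashable:
        """Sample oracle: draw x from mu, reveal only the label of its cluster."""
        x = self._rng.choices(self._domain, weights=self._mu, k=1)[0]
        return self._clustering[x]

    def label_query(self, x: Hashable) -> Hashable:
        """Label-query oracle: return the cluster label of a chosen point x."""
        return self._clustering[x]


# Example (assumed values): uniform mu on {0,...,5}; the collector cannot
# distinguish 0 from 1, nor 2, 3, 4 from each other.
domain = list(range(6))
mu = [1 / 6] * 6
clustering = {0: "a", 1: "a", 2: "b", 3: "b", 4: "b", 5: "c"}
collector = ConfusedCollector(domain, mu, clustering, seed=0)

labels = [collector.sample() for _ in range(1000)]  # labeled sample, no raw points
label_of_three = collector.label_query(3)           # deterministic lookup
```

In this sketch the tester never sees the drawn point itself, only its label, which is what makes the samples systematically mislabeled; the label-query oracle lets the tester probe the clustering at points of its own choosing.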

Subject Classification

ACM Subject Classification

Theory of computation → Streaming, sublinear and near linear time algorithms
Theory of computation → Randomness, geometry and discrete structures
Theory of computation → Computational geometry
Theory of computation → Machine learning theory
Theory of computation → Probabilistic computation
Mathematics of computing → Probability and statistics

Keywords

Distribution testing
property testing
uniformity testing
identity testing
earth-mover distance
sublinear algorithms

References

Jayadev Acharya, Clément Canonne, Cody Freitag, and Himanshu Tyagi. Test without trust: Optimal locally private distribution testing. In Proceedings, International Conference on Artificial Intelligence and Statistics (AISTATS), pages 2067-2076. PMLR, 2019.
Jayadev Acharya, Clément L Canonne, Cody Freitag, Ziteng Sun, and Himanshu Tyagi. Inference under information constraints iii: Local privacy constraints. IEEE Journal on Selected Areas in Information Theory, 2(1):253-267, 2021.
Jayadev Acharya, Clément L Canonne, and Himanshu Tyagi. Inference under information constraints: Lower bounds from chi-square contraction. In Proceedings, Conference on Learning Theory (COLT), pages 3-17. PMLR, 2019.
Jayadev Acharya, Clément L Canonne, and Himanshu Tyagi. Inference under information constraints ii: Communication constraints and shared randomness. IEEE Transactions on Information Theory, 66(12):7856-7877, 2020.
Maryam Aliakbarpour, Themis Gouleakis, John Peebles, Ronitt Rubinfeld, and Anak Yodpinyanee. Towards testing monotonicity of distributions over general posets. In Proceedings, Conference on Learning Theory (COLT), pages 34-82. PMLR, 2019.
Maryam Aliakbarpour, Ravi Kumar, and Ronitt Rubinfeld. Testing mixtures of discrete distributions. In Proceedings, Conference on Learning Theory (COLT), pages 83-114. PMLR, 2019.
Maryam Aliakbarpour and Sandeep Silwal. Testing properties of multiple distributions with few samples. In Proceedings, Innovations in Theoretical Computer Science (ITCS), 2020.
Tugkan Batu, Ravi Kumar, and Ronitt Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 381-390, 2004.
Eric Blais, Clément L Canonne, and Tom Gur. Distribution testing lower bounds via reductions from communication complexity. ACM Transactions on Computation Theory, 11(2):1-37, 2019.
Clément Canonne. Topics and techniques in distribution testing: A biased but representative sample. Foundations and Trends in Communications and Information Theory, 19(6):1032-1198, 2022.
Clément L Canonne. Are few bins enough: Testing histogram distributions. In Proceedings, ACM Symposium on Principles of Database Systems (PODS), pages 455-463, 2016.
Clément L Canonne. A survey on distribution testing: Your data is big. But is it blue? Theory of Computing, pages 1-100, 2020.
Clément L Canonne, Ilias Diakonikolas, Themis Gouleakis, and Ronitt Rubinfeld. Testing shape restrictions of discrete distributions. Theory of Computing Systems, 62(1):4-62, 2018.
Clément L Canonne, Ilias Diakonikolas, Daniel Kane, and Sihan Liu. Nearly-tight bounds for testing histogram distributions. Proceedings, Advances in Neural Information Processing Systems (NeurIPS), 35:31599-31611, 2022.
Clément L Canonne and Karl Wimmer. Testing data binnings. In Proceedings of APPROX/RANDOM. Schloss Dagstuhl-Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing, 2020.
Clément L Canonne and Karl Wimmer. Identity testing under label mismatch. In Proceedings, International Symposium on Algorithms and Computation (ISAAC). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2021.
Siu-On Chan, Ilias Diakonikolas, Paul Valiant, and Gregory Valiant. Optimal algorithms for testing closeness of discrete distributions. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1193-1203. SIAM, 2014.
Constantinos Daskalakis, Themis Gouleakis, Christos Tzamos, and Manolis Zampetakis. Efficient statistics, in high dimensions, from truncated samples. In Proceedings, IEEE Symposium on Foundations of Computer Science (FOCS), pages 639-649. IEEE, 2018.
Anindya De, Shivam Nadimpalli, and Rocco A Servedio. Testing convex truncation. In Proceedings, ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 4050-4082. SIAM, 2023.
Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Collision-based testers are optimal for uniformity and closeness. Chicago Journal of Theoretical Computer Science, 1:1-21, 2019.
Ilias Diakonikolas and Daniel M Kane. A new approach for testing properties of discrete distributions. In Proceedings, IEEE Symposium on Foundations of Computer Science (FOCS), pages 685-694. IEEE, 2016.
Khanh Do Ba, Huy L Nguyen, Huy N Nguyen, and Ronitt Rubinfeld. Sublinear time algorithms for earth mover’s distance. Theory of Computing Systems, 48:428-442, 2011.
Renato Ferreira Pinto Jr. and Nathaniel Harms. Distribution testing under the parity trace, 2023. arXiv:2304.01374. URL: https://arxiv.org/abs/2304.01374.
Renato Ferreira Pinto Jr. and Nathaniel Harms. Distribution testing with a confused collector. arXiv, 2023. URL: https://arxiv.org/abs/2311.1424.
Dimitris Fotakis, Alkis Kalavasis, Vasilis Kontonis, and Christos Tzamos. Efficient algorithms for learning from coarse labels. In Proceedings, Conference on Learning Theory (COLT), pages 2060-2079. PMLR, 2021.
Marco Gaboardi and Ryan Rogers. Local private hypothesis testing: Chi-square tests. In Proceedings, International Conference on Machine Learning (ICML), pages 1626-1635. PMLR, 2018.
Oded Goldreich. The uniform distribution is complete with respect to testing identity to a fixed distribution. In Computational Complexity and Property Testing: On the Interplay Between Randomness and Computation. Springer, 2020. ECCC TR16-015. URL: https://doi.org/10.1007/978-3-030-43662-9_10.
Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. In Studies in Complexity and Cryptography. Miscellanea on the Interplay between Randomness and Computation, pages 68-75. Springer, 2011. URL: https://doi.org/10.1007/978-3-642-22670-0_9.
Oded Goldreich and Dana Ron. Testing distributions of huge objects. In Proceedings, Innovations in Theoretical Computer Science (ITCS). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2022.
Piotr Indyk, Reut Levi, and Ronitt Rubinfeld. Approximating and testing k-histogram distributions in sub-linear time. In Proceedings, ACM Symposium on Principles of Database Systems (PODS), pages 15-22, 2012.
Yin Tat Lee, Aaron Sidford, and Santosh S Vempala. Efficient convex optimization with oracles. In Building Bridges II: Mathematics of László Lovász, pages 317-335. Springer, 2020.
Reut Levi, Dana Ron, and Ronitt Rubinfeld. Testing properties of collections of distributions. Theory of Computing, 9(1):295-347, 2013.
Reut Levi, Dana Ron, and Ronitt Rubinfeld. Testing similar means. SIAM Journal on Discrete Mathematics, 28(4):1699-1724, 2014.
Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750-4755, 2008.
Ronitt Rubinfeld and Arsen Vasilyan. Testing distributional assumptions of learning algorithms. In Proceedings, ACM Symposium on Theory of Computing (STOC). ACM, 2023.
Or Sheffet. Locally private hypothesis testing. In Proceedings, International Conference on Machine Learning (ICML), pages 4605-4614. PMLR, 2018.
Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing. SIAM Journal on Computing, 46(1):429-455, 2017.
