Testing Data Binnings

Canonne, Clément L.; Wimmer, Karl

doi:10.4230/LIPIcs.APPROX/RANDOM.2020.24

File

Author Details

Clément L. Canonne

IBM Research, Almaden, CA, USA

Karl Wimmer

Duquesne University, Pittsburgh, PA, USA

Cite As Get BibTex

Clément L. Canonne and Karl Wimmer. Testing Data Binnings. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 176, pp. 24:1-24:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020) https://doi.org/10.4230/LIPIcs.APPROX/RANDOM.2020.24

Abstract

Motivated by the question of data quantization and "binning," we revisit the problem of identity testing of discrete probability distributions. Identity testing (a.k.a. one-sample testing), a fundamental and by now well-understood problem in distribution testing, asks, given a reference distribution (model) 𝐪 and samples from an unknown distribution 𝐩, both over [n] = {1,2,… ,n}, whether 𝐩 equals 𝐪, or is significantly different from it. In this paper, we introduce the related question of identity up to binning, where the reference distribution 𝐪 is over k ≪ n elements: the question is then whether there exists a suitable binning of the domain [n] into k intervals such that, once "binned," 𝐩 is equal to 𝐪. We provide nearly tight upper and lower bounds on the sample complexity of this new question, showing both a quantitative and qualitative difference with the vanilla identity testing one, and answering an open question of Canonne [Clément L. Canonne, 2019]. Finally, we discuss several extensions and related research directions.

Subject Classification

ACM Subject Classification

Theory of computation → Machine learning theory
Theory of computation → Streaming, sublinear and near linear time algorithms

Keywords

property testing
distribution testing
identity testing
hypothesis testing

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

PDF Downloads

0

Metadata Views

References

Sivaraman Balakrishnan and Larry Wasserman. Hypothesis testing for high-dimensional multinomials: a selective review. Ann. Appl. Stat., 12(2):727-749, 2018. URL: https://doi.org/10.1214/18-AOAS1155SF.
Tuğkan Batu, Eldar Fischer, Lance Fortnow, Ravi Kumar, Ronitt Rubinfeld, and Patrick White. Testing random variables for independence and identity. In 42nd IEEE Symposium on Foundations of Computer Science (Las Vegas, NV, 2001), pages 442-451. IEEE Computer Soc., Los Alamitos, CA, 2001.
Tuğkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White. Testing that distributions are close. In 41st Annual Symposium on Foundations of Computer Science (Redondo Beach, CA, 2000), pages 259-269. IEEE Comput. Soc. Press, Los Alamitos, CA, 2000. URL: https://doi.org/10.1109/SFCS.2000.892113.
Eric Blais, Clément L. Canonne, and Tom Gur. Distribution testing lower bounds via reductions from communication complexity. ACM Trans. Comput. Theory, 11(2):Art. 6, 37, 2019. URL: https://doi.org/10.1145/3305270.
Clément L. Canonne. A survey on distribution testing: Your data is big. but is it blue? Electronic Colloquium on Computational Complexity (ECCC), 22:63, 2015. To appear as a graduate student survey in Theory of Computing. URL: http://eccc.hpi-web.de/report/2015/063.
Clément L. Canonne. Problem 96: Identity testing up to coarsenings, 2019. Asked at the 3rd Workshop on Local Algorithms (WOLA 2019). URL: https://en.wikipedia.org/w/index.php?title=Plagiarism&oldid=5139350.
Siu-On Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Learning mixtures of structured distributions over discrete domains. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1380-1394. SIAM, Philadelphia, PA, 2012.
David Conlon, Jacob Fox, and Benny Sudakov. Hypergraph Ramsey numbers. J. Amer. Math. Soc., 23(1):247-266, 2010. URL: https://doi.org/10.1090/S0894-0347-09-00645-6.
Luc Devroye and Gábor Lugosi. Combinatorial methods in density estimation. Springer Series in Statistics. Springer-Verlag, New York, 2001. URL: https://doi.org/10.1007/978-1-4613-0125-7.
Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Sample-optimal identity testing with high probability. In 45th International Colloquium on Automata, Languages, and Programming, volume 107 of LIPIcs. Leibniz Int. Proc. Inform., pages Art. No. 41, 14. Schloss Dagstuhl. Leibniz-Zent. Inform., Wadern, 2018.
Ilias Diakonikolas and Daniel M. Kane. A new approach for testing properties of discrete distributions. In 57th Annual IEEE Symposium on Foundations of Computer Science - FOCS 2016, pages 685-694. IEEE Computer Soc., Los Alamitos, CA, 2016.
Ilias Diakonikolas, Daniel M. Kane, and Vladimir Nikishkin. Optimal algorithms and lower bounds for testing closeness of structured distributions. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science - FOCS 2015, pages 1183-1202. IEEE Computer Soc., Los Alamitos, CA, 2015. URL: https://doi.org/10.1109/FOCS.2015.76.
Ilias Diakonikolas, Daniel M. Kane, and Vladimir Nikishkin. Testing identity of structured distributions. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1841-1854. SIAM, Philadelphia, PA, 2015. URL: https://doi.org/10.1137/1.9781611973730.123.
Oded Goldreich. The uniform distribution is complete with respect to testing identity to a fixed distribution. In Computational Complexity and Property Testing, volume 12050 of Lecture Notes in Computer Science, pages 152-172. Springer, 2020.
Oded Goldreich, Shafi Goldwasser, and Dana Ron. Property testing and its connection to learning and approximation. JACM, 45(4):653-750, July 1998.
Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. Technical Report TR00-020, Electronic Colloquium on Computational Complexity (ECCC), 2000.
Andy Nguyen. Solving cyclic longest common subsequence in quadratic time. CoRR, abs/1208.0396, 2012. URL: http://arxiv.org/abs/1208.0396.
Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Trans. Inform. Theory, 54(10):4750-4755, 2008. URL: https://doi.org/10.1109/TIT.2008.928987.
Ronitt Rubinfeld and Madhu Sudan. Robust characterization of polynomials with applications to program testing. SIAM Journal on Computing, 25(2):252-271, 1996.
Kazuhiro Suzuki, Dongvu Tonien, Kaoru Kurosawa, and Koji Toyota. Birthday paradox for multi-collisions. In Information security and cryptology - ICISC 2006, volume 4296 of Lecture Notes in Comput. Sci., pages 29-40. Springer, Berlin, 2006. URL: https://doi.org/10.1007/11927587_5.
Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing. SIAM J. Comput., 46(1):429-455, 2017. URL: https://doi.org/10.1137/151002526.

Testing Data Binnings

Authors Clément L. Canonne , Karl Wimmer

File

Document Identifiers

Author Details

Cite As Get BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Thanks for your feedback!

Could not send message

Testing Data Binnings

Authors Clément L. Canonne , Karl Wimmer

File

Document Identifiers

Author Details

Cite As Get BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

Related Versions

References

Thanks for your feedback!

Could not send message