Motivated by the question of data quantization and "binning," we revisit the problem of identity testing of discrete probability distributions. Identity testing (a.k.a. one-sample testing), a fundamental and by now well-understood problem in distribution testing, asks, given a reference distribution (model) 𝐪 and samples from an unknown distribution 𝐩, both over [n] = {1,2,… ,n}, whether 𝐩 equals 𝐪, or is significantly different from it. In this paper, we introduce the related question of identity up to binning, where the reference distribution 𝐪 is over k ≪ n elements: the question is then whether there exists a suitable binning of the domain [n] into k intervals such that, once "binned," 𝐩 is equal to 𝐪. We provide nearly tight upper and lower bounds on the sample complexity of this new question, showing both a quantitative and qualitative difference with the vanilla identity testing one, and answering an open question of Canonne [Clément L. Canonne, 2019]. Finally, we discuss several extensions and related research directions.
@InProceedings{canonne_et_al:LIPIcs.APPROX/RANDOM.2020.24, author = {Canonne, Cl\'{e}ment L. and Wimmer, Karl}, title = {{Testing Data Binnings}}, booktitle = {Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020)}, pages = {24:1--24:13}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-164-1}, ISSN = {1868-8969}, year = {2020}, volume = {176}, editor = {Byrka, Jaros{\l}aw and Meka, Raghu}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX/RANDOM.2020.24}, URN = {urn:nbn:de:0030-drops-126277}, doi = {10.4230/LIPIcs.APPROX/RANDOM.2020.24}, annote = {Keywords: property testing, distribution testing, identity testing, hypothesis testing} }
Feedback for Dagstuhl Publishing