Testing Data Binnings

Authors Clément L. Canonne , Karl Wimmer



PDF
Thumbnail PDF

File

LIPIcs.APPROX-RANDOM.2020.24.pdf
  • Filesize: 0.6 MB
  • 13 pages

Document Identifiers

Author Details

Clément L. Canonne
  • IBM Research, Almaden, CA, USA
Karl Wimmer
  • Duquesne University, Pittsburgh, PA, USA

Cite As Get BibTex

Clément L. Canonne and Karl Wimmer. Testing Data Binnings. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 176, pp. 24:1-24:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020) https://doi.org/10.4230/LIPIcs.APPROX/RANDOM.2020.24

Abstract

Motivated by the question of data quantization and "binning," we revisit the problem of identity testing of discrete probability distributions. Identity testing (a.k.a. one-sample testing), a fundamental and by now well-understood problem in distribution testing, asks, given a reference distribution (model) 𝐪 and samples from an unknown distribution 𝐩, both over [n] = {1,2,… ,n}, whether 𝐩 equals 𝐪, or is significantly different from it.
In this paper, we introduce the related question of identity up to binning, where the reference distribution 𝐪 is over k ≪ n elements: the question is then whether there exists a suitable binning of the domain [n] into k intervals such that, once "binned," 𝐩 is equal to 𝐪. We provide nearly tight upper and lower bounds on the sample complexity of this new question, showing both a quantitative and qualitative difference with the vanilla identity testing one, and answering an open question of Canonne [Clément L. Canonne, 2019]. Finally, we discuss several extensions and related research directions.

Subject Classification

ACM Subject Classification
  • Theory of computation → Machine learning theory
  • Theory of computation → Streaming, sublinear and near linear time algorithms
Keywords
  • property testing
  • distribution testing
  • identity testing
  • hypothesis testing

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Sivaraman Balakrishnan and Larry Wasserman. Hypothesis testing for high-dimensional multinomials: a selective review. Ann. Appl. Stat., 12(2):727-749, 2018. URL: https://doi.org/10.1214/18-AOAS1155SF.
  2. Tuğkan Batu, Eldar Fischer, Lance Fortnow, Ravi Kumar, Ronitt Rubinfeld, and Patrick White. Testing random variables for independence and identity. In 42nd IEEE Symposium on Foundations of Computer Science (Las Vegas, NV, 2001), pages 442-451. IEEE Computer Soc., Los Alamitos, CA, 2001. Google Scholar
  3. Tuğkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White. Testing that distributions are close. In 41st Annual Symposium on Foundations of Computer Science (Redondo Beach, CA, 2000), pages 259-269. IEEE Comput. Soc. Press, Los Alamitos, CA, 2000. URL: https://doi.org/10.1109/SFCS.2000.892113.
  4. Eric Blais, Clément L. Canonne, and Tom Gur. Distribution testing lower bounds via reductions from communication complexity. ACM Trans. Comput. Theory, 11(2):Art. 6, 37, 2019. URL: https://doi.org/10.1145/3305270.
  5. Clément L. Canonne. A survey on distribution testing: Your data is big. but is it blue? Electronic Colloquium on Computational Complexity (ECCC), 22:63, 2015. To appear as a graduate student survey in Theory of Computing. URL: http://eccc.hpi-web.de/report/2015/063.
  6. Clément L. Canonne. Problem 96: Identity testing up to coarsenings, 2019. Asked at the 3rd Workshop on Local Algorithms (WOLA 2019). URL: https://en.wikipedia.org/w/index.php?title=Plagiarism&oldid=5139350.
  7. Siu-On Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Learning mixtures of structured distributions over discrete domains. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1380-1394. SIAM, Philadelphia, PA, 2012. Google Scholar
  8. David Conlon, Jacob Fox, and Benny Sudakov. Hypergraph Ramsey numbers. J. Amer. Math. Soc., 23(1):247-266, 2010. URL: https://doi.org/10.1090/S0894-0347-09-00645-6.
  9. Luc Devroye and Gábor Lugosi. Combinatorial methods in density estimation. Springer Series in Statistics. Springer-Verlag, New York, 2001. URL: https://doi.org/10.1007/978-1-4613-0125-7.
  10. Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Sample-optimal identity testing with high probability. In 45th International Colloquium on Automata, Languages, and Programming, volume 107 of LIPIcs. Leibniz Int. Proc. Inform., pages Art. No. 41, 14. Schloss Dagstuhl. Leibniz-Zent. Inform., Wadern, 2018. Google Scholar
  11. Ilias Diakonikolas and Daniel M. Kane. A new approach for testing properties of discrete distributions. In 57th Annual IEEE Symposium on Foundations of Computer Science - FOCS 2016, pages 685-694. IEEE Computer Soc., Los Alamitos, CA, 2016. Google Scholar
  12. Ilias Diakonikolas, Daniel M. Kane, and Vladimir Nikishkin. Optimal algorithms and lower bounds for testing closeness of structured distributions. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science - FOCS 2015, pages 1183-1202. IEEE Computer Soc., Los Alamitos, CA, 2015. URL: https://doi.org/10.1109/FOCS.2015.76.
  13. Ilias Diakonikolas, Daniel M. Kane, and Vladimir Nikishkin. Testing identity of structured distributions. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1841-1854. SIAM, Philadelphia, PA, 2015. URL: https://doi.org/10.1137/1.9781611973730.123.
  14. Oded Goldreich. The uniform distribution is complete with respect to testing identity to a fixed distribution. In Computational Complexity and Property Testing, volume 12050 of Lecture Notes in Computer Science, pages 152-172. Springer, 2020. Google Scholar
  15. Oded Goldreich, Shafi Goldwasser, and Dana Ron. Property testing and its connection to learning and approximation. JACM, 45(4):653-750, July 1998. Google Scholar
  16. Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. Technical Report TR00-020, Electronic Colloquium on Computational Complexity (ECCC), 2000. Google Scholar
  17. Andy Nguyen. Solving cyclic longest common subsequence in quadratic time. CoRR, abs/1208.0396, 2012. URL: http://arxiv.org/abs/1208.0396.
  18. Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Trans. Inform. Theory, 54(10):4750-4755, 2008. URL: https://doi.org/10.1109/TIT.2008.928987.
  19. Ronitt Rubinfeld and Madhu Sudan. Robust characterization of polynomials with applications to program testing. SIAM Journal on Computing, 25(2):252-271, 1996. Google Scholar
  20. Kazuhiro Suzuki, Dongvu Tonien, Kaoru Kurosawa, and Koji Toyota. Birthday paradox for multi-collisions. In Information security and cryptology - ICISC 2006, volume 4296 of Lecture Notes in Comput. Sci., pages 29-40. Springer, Berlin, 2006. URL: https://doi.org/10.1007/11927587_5.
  21. Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing. SIAM J. Comput., 46(1):429-455, 2017. URL: https://doi.org/10.1137/151002526.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail