Support Size Estimation: The Power of Conditioning

Authors: Diptarka Chakraborty, Gunjan Kumar, Kuldeep S. Meel




Author Details

Diptarka Chakraborty
  • National University of Singapore, Singapore
Gunjan Kumar
  • National University of Singapore, Singapore
Kuldeep S. Meel
  • National University of Singapore, Singapore

Acknowledgements

The authors would like to thank anonymous reviewers for their useful suggestions and comments on an earlier version of this paper.

Cite As

Diptarka Chakraborty, Gunjan Kumar, and Kuldeep S. Meel. Support Size Estimation: The Power of Conditioning. In 48th International Symposium on Mathematical Foundations of Computer Science (MFCS 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 272, pp. 33:1-33:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.MFCS.2023.33

Abstract

We consider the problem of estimating the support size of a distribution D. Our investigations are pursued through the lens of distribution testing and seek to understand the power of conditional sampling (denoted as COND), wherein one is allowed to query the given distribution conditioned on an arbitrary subset S. The primary contribution of this work is to introduce a new approach to lower bounds for the COND model that relies on using powerful tools from information theory and communication complexity. Our approach allows us to obtain surprisingly strong lower bounds for the COND model and its extensions.
  • We bridge the longstanding gap between the upper bound O(log log n + 1/ε²) and the lower bound Ω(√(log log n)) for the COND model by providing a nearly matching lower bound. Surprisingly, we show that even if we get to know the actual probabilities along with COND samples, Ω(log log n + 1/(ε² log(1/ε))) queries are still necessary.
  • We obtain the first non-trivial lower bound for the COND model equipped with an additional oracle that reveals the actual as well as the conditional probabilities of the samples (to the best of our knowledge, this subsumes all of the models previously studied): in particular, we demonstrate that Ω(log log log n + 1/(ε² log(1/ε))) queries are necessary.
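To make the COND access model concrete, the following is a minimal Python sketch of a conditional-sampling oracle. It is an illustration only and is not taken from the paper: the function name `cond_sample`, the dictionary representation of D, and the choice to raise an error when the queried subset has zero probability mass are all assumptions made for this sketch.

```python
import random


def cond_sample(dist, subset, rng=random):
    """Return one sample from `dist` conditioned on `subset`.

    `dist` maps each element to its probability under D (hypothetical
    representation). A tester in the COND model only sees the returned
    sample, never `dist` itself.
    """
    # Keep only the elements of the queried subset that lie in the support.
    items = [x for x in subset if dist.get(x, 0.0) > 0.0]
    if not items:
        # Assumption of this sketch: a query on a zero-mass set is an error.
        raise ValueError("subset has zero probability under dist")
    # Sampling proportionally to D(x) over S is the same as sampling from
    # D conditioned on S, i.e. with probability D(x) / D(S).
    weights = [dist[x] for x in items]
    return rng.choices(items, weights=weights, k=1)[0]
```

For example, with `dist = {1: 0.5, 2: 0.5, 3: 0.0}`, the query `cond_sample(dist, {2, 3})` always returns 2, because 3 lies outside the support. Queries of this kind reveal which elements of S carry mass, which is the information a support-size estimator needs.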

Subject Classification

ACM Subject Classification
  • Theory of computation → Streaming, sublinear and near linear time algorithms
Keywords
  • Support-size estimation
  • Distribution testing
  • Conditional sampling
  • Lower bound
