Testing Distributions of Huge Objects

We initiate a study of a new model of property testing that is a hybrid of testing properties of distributions and testing properties of strings. Specifically, the new model refers to testing properties of distributions, but these are distributions over huge objects (i.e., very long strings). Accordingly, the model accounts for the total number of local probes into these objects (resp., queries to the strings) as well as for the distance between objects (resp., strings). In particular, the distance between distributions is defined as the earth mover’s distance with respect to the relative Hamming distance between strings.

We study the query complexity of testing in this new model, focusing on three directions. First, we try to relate the query complexity of testing properties in the new model to the sample complexity of testing these properties in the standard distribution testing model. Second, we consider the complexity of testing properties that arise naturally in the new model (e.g., distributions that capture random variations of fixed strings). Third, we consider the complexity of testing properties that were extensively studied in the standard distribution testing model. Two such cases are uniform distributions and pairs of identical distributions, for which we obtain the following results.

 - Testing whether a distribution over n-bit long strings is uniform on some set of size m can be done with query complexity Õ(m/ε³), where ε > (log₂m)/n is the proximity parameter.
 - Testing whether two distributions over n-bit long strings that have support size at most m are identical can be done with query complexity Õ(m^{2/3}/ε³).

Both upper bounds are quite tight; that is, for ε = Ω(1), the first task requires Ω(m^c) queries for any c < 1 and n = ω(log m), whereas the second task requires Ω(m^{2/3}) queries.
Note that the query complexity of the first task is higher than the sample complexity of the corresponding task in the standard distribution testing model, whereas in the case of the second task the bounds almost match.

Keywords: Property Testing, Distributions
Subject classification: Theory of computation → Streaming, sublinear and near linear time algorithms
Pages: 78:1-78:19
Category: Regular Paper
Related version: https://eccc.weizmann.ac.il/report/2021/133/
Acknowledgements: We are grateful to Avi Wigderson for a discussion that started this research project.
DOI: 10.4230/LIPIcs.ITCS.2022.78

Authors:
Oded Goldreich, Department of Computer Science, Weizmann Institute of Science, Israel. https://orcid.org/0000-0002-4329-135X. Partially supported by the Israel Science Foundation (grant No. 1041/18); received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 819702).
Dana Ron, School of Electrical Engineering, Tel Aviv University, Israel. https://orcid.org/0000-0001-6576-7200. Partially supported by the Israel Science Foundation (grant No. 1041/18).

References:
Tugkan Batu. Testing properties of distributions. PhD thesis, Computer Science department, Cornell University, 2001.
Tugkan Batu and Clément L. Canonne. Generalized uniformity testing. In Proceedings of the Fifty-Eighth Annual Symposium on Foundations of Computer Science (FOCS), pages 880-889, 2017.
Tugkan Batu, Lance Fortnow, Eldar Fischer, Ravi Kumar, Ronitt Rubinfeld, and Patrick White. Testing random variables for independence and identity. In Proceedings of the Forty-Second Annual Symposium on Foundations of Computer Science (FOCS), pages 442-451, 2001.
Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White. Testing that distributions are close. In Proceedings of the Forty-First Annual Symposium on Foundations of Computer Science (FOCS), pages 259-269, 2000.
Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White.
Testing closeness of discrete distributions. Journal of the ACM, 60(1):4:1-4:25, 2013. This is a long version of [Tugkan Batu et al., 2000].
Clément L. Canonne. A Survey on Distribution Testing: Your Data is Big. But is it Blue? Number 9 in Graduate Surveys. Theory of Computing Library, 2020. URL: https://doi.org/10.4086/toc.gs.2020.009.
Ilias Diakonikolas, Daniel Kane, and Alistair Stewart. Sharp bounds for generalized uniformity testing. Technical Report TR17-132, Electronic Colloquium on Computational Complexity (ECCC), 2017.
Oded Goldreich. Introduction to Property Testing. Cambridge University Press, 2017.
Oded Goldreich and Dana Ron. Lower bounds on the complexity of testing grained distributions. Technical Report TR21-129, Electronic Colloquium on Computational Complexity (ECCC), 2021.
Oded Goldreich and Dana Ron. Testing distributions of huge objects. Technical Report TR21-133, Electronic Colloquium on Computational Complexity (ECCC), 2021.
Sofya Raskhodnikova, Dana Ron, Amir Shpilka, and Adam Smith. Strong lower bounds for approximating distribution support size and the distinct elements problem. SIAM Journal on Computing, 39(3):813-842, 2009.
Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Forty-Third Annual ACM Symposium on the Theory of Computing (STOC), pages 685-694, 2011.
Gregory Valiant and Paul Valiant. Estimating the unseen: Improved estimators for entropy and other properties. Journal of the ACM, 64(6), 2017.
Paul Valiant. Testing symmetric properties of distributions. SIAM Journal on Computing, 40(6):1927-1968, 2011.

Oded Goldreich and Dana Ron. Creative Commons Attribution 4.0 International license, https://creativecommons.org/licenses/by/4.0/legalcode. 2022-01-25.
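
Illustration of the distance measure. The abstract defines the distance between distributions as the earth mover's distance with respect to the relative Hamming distance between strings. The following is a minimal sketch of that definition, not code from the paper, restricted to a special case chosen here for simplicity: both distributions are uniform over equally long lists of strings, so that an optimal transport plan can be taken to be a perfect matching and brute force over permutations suffices for tiny supports. The function names `relative_hamming` and `emd_uniform` are hypothetical.

```python
from itertools import permutations


def relative_hamming(x: str, y: str) -> float:
    """Fraction of positions on which two equal-length strings differ."""
    if len(x) != len(y) or not x:
        raise ValueError("strings must be non-empty and of equal length")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def emd_uniform(support_p: list[str], support_q: list[str]) -> float:
    """Earth mover's distance between the uniform distributions on two
    equal-size lists of strings, using relative Hamming distance as the
    ground metric.

    Assumption (ours, for illustration): uniform weights on m points on
    each side, so the minimum over couplings is attained by a perfect
    matching. The search below is exponential in m and only meant for
    very small examples.
    """
    m = len(support_p)
    if m == 0 or m != len(support_q):
        raise ValueError("supports must be non-empty and of equal size")
    best = min(
        sum(relative_hamming(x, support_q[j]) for x, j in zip(support_p, perm))
        for perm in permutations(range(m))
    )
    return best / m


# Two distributions that differ in a single bit of a single string.
p = ["0000", "1111"]
q = ["0001", "1111"]
print(emd_uniform(p, q))  # 0.125
```

In this example the total variation distance between the two distributions is 1/2, since half of the probability mass sits on different strings, whereas the earth mover's distance is only 1/8, because the differing strings are at relative Hamming distance 1/4. This gap is what lets a tester in this model get by with few queries per sampled string.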