Sample-Optimal Identity Testing with High Probability

Diakonikolas, Ilias; Gouleakis, Themis; Peebles, John; Price, Eric

doi:10.4230/LIPIcs.ICALP.2018.41

File

LIPIcs.ICALP.2018.41.pdf

Filesize: 0.48 MB
14 pages

Document Identifiers

DOI: 10.4230/LIPIcs.ICALP.2018.41
URN: urn:nbn:de:0030-drops-90459

Author Details

Ilias Diakonikolas

USC, Los Angeles, CA, USA

Themis Gouleakis

CSAIL,MIT, Cambridge, MA, USA

John Peebles

CSAIL,MIT, Cambridge, MA, USA

Eric Price

UT Austin, Austin, TX, USA

Cite AsGet BibTex

Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Sample-Optimal Identity Testing with High Probability. In 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 107, pp. 41:1-41:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.ICALP.2018.41

Abstract

We study the problem of testing identity against a given distribution with a focus on the high confidence regime. More precisely, given samples from an unknown distribution p over n elements, an explicitly given distribution q, and parameters 0< epsilon, delta < 1, we wish to distinguish, with probability at least 1-delta, whether the distributions are identical versus epsilon-far in total variation distance. Most prior work focused on the case that delta = Omega(1), for which the sample complexity of identity testing is known to be Theta(sqrt{n}/epsilon^2). Given such an algorithm, one can achieve arbitrarily small values of delta via black-box amplification, which multiplies the required number of samples by Theta(log(1/delta)). We show that black-box amplification is suboptimal for any delta = o(1), and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is Theta((1/epsilon^2) (sqrt{n log(1/delta)} + log(1/delta))) for any n, epsilon, and delta. For the special case of uniformity testing, where the given distribution is the uniform distribution U_n over the domain, our new tester is surprisingly simple: to test whether p = U_n versus d_{TV} (p, U_n) >= epsilon, we simply threshold d_{TV}({p^}, U_n), where {p^} is the empirical probability distribution. The fact that this simple "plug-in" estimator is sample-optimal is surprising, even in the constant delta case. Indeed, it was believed that such a tester would not attain sublinear sample complexity even for constant values of epsilon and delta. An important contribution of this work lies in the analysis techniques that we introduce in this context. First, we exploit an underlying strong convexity property to bound from below the expectation gap in the completeness and soundness cases. Second, we give a new, fast method for obtaining provably correct empirical estimates of the true worst-case failure probability for a broad class of uniformity testing statistics over all possible input distributions - including all previously studied statistics for this problem. We believe that our novel analysis techniques will be useful for other distribution testing problems as well.

Subject Classification

ACM Subject Classification

Mathematics of computing → Probability and statistics

Keywords

distribution testing
property testing
sample complexity

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

PDF Downloads

0

Metadata Views

References

J. Acharya, C. Daskalakis, and G. Kamath. Optimal testing for properties of distributions. In Advances in Neural Information Processing Systems (NIPS), pages 3591-3599, 2015.
S. Balakrishnan and L. A. Wasserman. Hypothesis testing for densities and high-dimensional multinomials: Sharp local minimax rates. CoRR, abs/1706.10003, 2017. URL: http://arxiv.org/abs/1706.10003.
T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White. Testing random variables for independence and identity. In Proc. 42nd IEEE Symposium on Foundations of Computer Science, pages 442-451, 2001.
T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing that distributions are close. In IEEE Symposium on Foundations of Computer Science, pages 259-269, 2000. URL: https://citeseer.ist.psu.edu/batu00testing.html.
T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing closeness of discrete distributions. J. ACM, 60(1):4, 2013.
C. L. Canonne. A survey on distribution testing: Your data is big. but is it blue? Electronic Colloquium on Computational Complexity (ECCC), 22:63, 2015.
S. Chan, I. Diakonikolas, P. Valiant, and G. Valiant. Optimal algorithms for testing closeness of discrete distributions. In SODA, pages 1193-1203, 2014.
L. Devroye and G. Lugosi. Combinatorial methods in density estimation. Springer Series in Statistics, Springer, 2001.
I. Diakonikolas, T. Gouleakis, J. Peebles, and E. Price. Collision-based testers are optimal for uniformity and closeness. Electronic Colloquium on Computational Complexity (ECCC), 23:178, 2016.
I. Diakonikolas and D. M. Kane. A new approach for testing properties of discrete distributions. In FOCS, pages 685-694, 2016. Full version available at abs/1601.05557.
I. Diakonikolas, D. M. Kane, and V. Nikishkin. Optimal algorithms and lower bounds for testing closeness of structured distributions. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, pages 1183-1202, 2015.
I. Diakonikolas, D. M. Kane, and V. Nikishkin. Testing identity of structured distributions. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, pages 1841-1854, 2015.
O. Goldreich. The uniform distribution is complete with respect to testing identity to a fixed distribution. ECCC, 23, 2016.
O. Goldreich. Commentary on two works related to testing uniformity of distributions, 2017. URL: http://www.wisdom.weizmann.ac.il/~oded/MC/229.html.
O. Goldreich and D. Ron. On testing expansion in bounded-degree graphs. Technical Report TR00-020, Electronic Colloquium on Computational Complexity, 2000.
D. Huang and S. Meyn. Generalized error exponents for small sample universal hypothesis testing. IEEE Trans. Inf. Theor., 59(12):8157-8181, dec 2013.
E. L. Lehmann and J. P. Romano. Testing statistical hypotheses. Springer Texts in Statistics. Springer, 2005.
J. Neyman and E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694-706):289-337, 1933. URL: http://dx.doi.org/10.1098/rsta.1933.0009.
L. Paninski. A coincidence-based test for uniformity given very sparsely-sampled discrete data. IEEE Transactions on Information Theory, 54:4750-4755, 2008.
R. Rubinfeld. Taming big probability distributions. XRDS, 19(1):24-28, 2012.
R. Rubinfeld. Taming probability distributions over big domains. Talk given at STOC'14 Workshop on Efficient Distribution Estimation, 2014. Available at http://www.iliasdiakonikolas.org/stoc14-workshop/rubinfeld.pdf.
G. Valiant and P. Valiant. An automatic inequality prover and instance optimal identity testing. In FOCS, 2014.
A. W. van der Vaart and J. A. Wellner. Weak convergence and empirical processes. Springer Series in Statistics. Springer-Verlag, New York, 1996. With applications to statistics.
Y. Wu and P. Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702-3720, June 2016.

Sample-Optimal Identity Testing with High Probability

Authors Ilias Diakonikolas, Themis Gouleakis, John Peebles, Eric Price

File

Document Identifiers

Author Details

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Thanks for your feedback!

Could not send message

Sample-Optimal Identity Testing with High Probability

Authors Ilias Diakonikolas, Themis Gouleakis, John Peebles, Eric Price

File

Document Identifiers

Author Details

Funding

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

Related Versions

References

Thanks for your feedback!

Could not send message