Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik GmbH Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik GmbH scholarly article en Diakonikolas, Ilias; Gouleakis, Themis; Peebles, John; Price, Eric License
when quoting this document, please refer to the following
URN: urn:nbn:de:0030-drops-90459

; ; ;

Sample-Optimal Identity Testing with High Probability



We study the problem of testing identity against a given distribution with a focus on the high confidence regime. More precisely, given samples from an unknown distribution p over n elements, an explicitly given distribution q, and parameters 0< epsilon, delta < 1, we wish to distinguish, with probability at least 1-delta, whether the distributions are identical versus epsilon-far in total variation distance. Most prior work focused on the case that delta = Omega(1), for which the sample complexity of identity testing is known to be Theta(sqrt{n}/epsilon^2). Given such an algorithm, one can achieve arbitrarily small values of delta via black-box amplification, which multiplies the required number of samples by Theta(log(1/delta)). We show that black-box amplification is suboptimal for any delta = o(1), and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is Theta((1/epsilon^2) (sqrt{n log(1/delta)} + log(1/delta))) for any n, epsilon, and delta. For the special case of uniformity testing, where the given distribution is the uniform distribution U_n over the domain, our new tester is surprisingly simple: to test whether p = U_n versus d_{TV} (p, U_n) >= epsilon, we simply threshold d_{TV}({p^}, U_n), where {p^} is the empirical probability distribution. The fact that this simple "plug-in" estimator is sample-optimal is surprising, even in the constant delta case. Indeed, it was believed that such a tester would not attain sublinear sample complexity even for constant values of epsilon and delta. An important contribution of this work lies in the analysis techniques that we introduce in this context. First, we exploit an underlying strong convexity property to bound from below the expectation gap in the completeness and soundness cases. Second, we give a new, fast method for obtaining provably correct empirical estimates of the true worst-case failure probability for a broad class of uniformity testing statistics over all possible input distributions - including all previously studied statistics for this problem. We believe that our novel analysis techniques will be useful for other distribution testing problems as well.

BibTeX - Entry

  author =	{Ilias Diakonikolas and Themis Gouleakis and John Peebles and Eric Price},
  title =	{{Sample-Optimal Identity Testing with High Probability}},
  booktitle =	{45th International Colloquium on Automata, Languages, and  Programming (ICALP 2018)},
  pages =	{41:1--41:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-076-7},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{107},
  editor =	{Ioannis Chatzigiannakis and Christos Kaklamanis and D{\'a}niel Marx and Donald Sannella},
  publisher =	{Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{},
  URN =		{urn:nbn:de:0030-drops-90459},
  doi =		{10.4230/LIPIcs.ICALP.2018.41},
  annote =	{Keywords: distribution testing, property testing, sample complexity}

Keywords: distribution testing, property testing, sample complexity
Seminar: 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018)
Issue date: 2018
Date of publication: 2018

DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI