Near-Optimal Closeness Testing of Discrete Histogram Distributions

Authors Ilias Diakonikolas, Daniel M. Kane, Vladimir Nikishkin




File

LIPIcs.ICALP.2017.8.pdf
  • Filesize: 0.52 MB
  • 15 pages

Cite As

Ilias Diakonikolas, Daniel M. Kane, and Vladimir Nikishkin. Near-Optimal Closeness Testing of Discrete Histogram Distributions. In 44th International Colloquium on Automata, Languages, and Programming (ICALP 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 80, pp. 8:1-8:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017) https://doi.org/10.4230/LIPIcs.ICALP.2017.8

Abstract

We investigate the problem of testing the equivalence between two discrete histograms. A k-histogram over [n] is a probability distribution that is piecewise constant over some set of k intervals over [n]. Histograms have been extensively studied in computer science and statistics. Given a set of samples from two k-histogram distributions p, q over [n], we want to distinguish (with high probability) between the cases that p = q and ||p - q||_1 >= epsilon. The main contribution of this paper is a new algorithm for this testing problem and a nearly matching information-theoretic lower bound. Specifically, the sample complexity of our algorithm matches our lower bound up to a logarithmic factor, improving on previous work by polynomial factors in the relevant parameters. Our algorithmic approach applies in a more general setting and yields improved sample upper bounds for testing closeness of other structured distributions as well.
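To make the definitions in the abstract concrete, the following minimal Python sketch builds two k-histograms over [n] and computes their L1 distance exactly. It only illustrates the objects being tested; it is not the paper's testing algorithm, which sees samples rather than the distributions themselves. The function names (k_histogram, l1_distance, count_pieces), the 0-based indexing of [n], and the encoding of intervals by their right endpoints are all assumptions made for this illustration.

```python
from fractions import Fraction


def k_histogram(n, breakpoints, masses):
    """Return a length-n probability vector that is constant on each interval.

    breakpoints: increasing exclusive right endpoints of the intervals; the
                 last one must equal n (hypothetical encoding for this sketch).
    masses:      total probability mass of each interval; must sum to 1.
    """
    if not breakpoints or len(breakpoints) != len(masses) or breakpoints[-1] != n:
        raise ValueError("breakpoints and masses must align and end at n")
    if sum(masses) != 1:
        raise ValueError("interval masses must sum to 1")
    p, left = [], 0
    for right, mass in zip(breakpoints, masses):
        width = right - left
        if width <= 0:
            raise ValueError("breakpoints must be strictly increasing")
        # Spread the interval's mass evenly over its points.
        p.extend([Fraction(mass) / width] * width)
        left = right
    return p


def l1_distance(p, q):
    """Return ||p - q||_1 for two vectors over the same domain."""
    return sum(abs(a - b) for a, b in zip(p, q))


def count_pieces(p):
    """Return the smallest k for which p is a k-histogram."""
    return 1 + sum(1 for a, b in zip(p, p[1:]) if a != b)


half = Fraction(1, 2)
p = k_histogram(8, [4, 8], [half, half])  # uniform over 8 points
q = k_histogram(8, [2, 8], [half, half])  # half the mass on the first 2 points
print(l1_distance(p, q))  # prints 1/2
```

Here p and q are both 2-histograms over a domain of size 8 with ||p - q||_1 = 1/2, so for any epsilon <= 1/2 a closeness tester should report that they differ. Exact rational arithmetic is used only so that the distance is computed without floating-point error.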

Subject Classification

Keywords
  • distribution testing
  • histograms
  • closeness testing
