Hidden Words Statistics for Large Patterns

Authors Svante Janson , Wojciech Szpankowski



PDF
Thumbnail PDF

File

LIPIcs.AofA.2020.17.pdf
  • Filesize: 476 kB
  • 15 pages

Document Identifiers

Author Details

Svante Janson
  • Department of Mathematics, Uppsala University, PO Box 480, SE-751 06 Uppsala, Sweden
Wojciech Szpankowski
  • Center for Science of Information, Department of Computer Science, Purdue University, West Lafayette, IN, USA

Cite AsGet BibTex

Svante Janson and Wojciech Szpankowski. Hidden Words Statistics for Large Patterns. In 31st International Conference on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms (AofA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 159, pp. 17:1-17:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.AofA.2020.17

Abstract

We study here the so called subsequence pattern matching also known as hidden pattern matching in which one searches for a given pattern w of length m as a subsequence in a random text of length n. The quantity of interest is the number of occurrences of w as a subsequence (i.e., occurring in not necessarily consecutive text locations). This problem finds many applications from intrusion detection, to trace reconstruction, to deletion channel, and to DNA-based storage systems. In all of these applications, the pattern w is of variable length. To the best of our knowledge this problem was only tackled for a fixed length m=O(1) [P. Flajolet et al., 2006]. In our main result Theorem 5 we prove that for m=o(n^{1/3}) the number of subsequence occurrences is normally distributed. In addition, in Theorem 6 we show that under some constraints on the structure of w the asymptotic normality can be extended to m=o(√n). For a special pattern w consisting of the same symbol, we indicate that for m=o(n) the distribution of number of subsequences is either asymptotically normal or asymptotically log normal. We conjecture that this dichotomy is true for all patterns. We use Hoeffding’s projection method for U-statistics to prove our findings.

Subject Classification

ACM Subject Classification
  • Mathematics of computing → Probability and statistics
Keywords
  • Hidden pattern matching
  • subsequences
  • probability
  • U-statistics
  • projection method

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. E. Bender and F. Kochman. The distribution of subword counts is usually normal. European J. Combin., 14:265-275, 1993. Google Scholar
  2. J. Bourdon and B. Vallée. Generalized pattern matching statistics. In Mathematics and Computer Science II (Versailles, 2002), Trends. Math., pages 249-265. Birkhäuser, 2002. Google Scholar
  3. S. Diggavi and M. Grossglauser. Information transmission over finite buffer channels. IEEE Trans. Information Theory, 52:1226-1237, 2006. Google Scholar
  4. R. L. Dobrushin. Shannon’s theorem for channels with synchronization errors. Prob. Info. Trans., pages 18-36, 1967. Google Scholar
  5. M. Drmota, K. Viswanathan, and W. Szpankowski. Mutual information for a deletion channel. In IEEE International Symposium on Information Theory, 2012. Google Scholar
  6. P. Flajolet, W. Szpankowski, and B Vallée. Hidden word statistics. J. ACM, 53(1):147-183, 2006. URL: https://doi.org/10.1145/1120582.1120586.
  7. Allan Gut. Probability: A Graduate Course. Springer, New York, 2013. Google Scholar
  8. R. Gwadera, M. Atallah, and W. Szpankowski. Reliable detection of episodes in event sequences. In 3rd IEEE Conf. on Data Mining, pages 67-74. IEEE Computer Soc., 2003. Google Scholar
  9. W. Hoeffding. A class of statistics with asymptotically normal distribution. Ann. Mat. Statistics, 19:293-325, 1984. Google Scholar
  10. N. Holden and R. Lyones. Lower bounds for trace reconstruction, 2018. URL: http://arxiv.org/abs/1808.02336.
  11. P. Jacquet and W. Szpankowski. Analytic Pattern Matching: From DNA to Tiwitter. Cambridge University Press, 2015. Google Scholar
  12. S. Janson, B. Nakamura, and D. Zeilberger. On the asymptotic statistics of the number of occurrences of multiple permutation patterns. J. Comb., 6:117-143, 2015. Google Scholar
  13. A. Kalai, M. Mitzenmacher, and M. Sudan. Tight asymptotic bounds for the deletion channel with small deletion probabilities. In IEEE International Symposium on Information Theory, 2010. Google Scholar
  14. Y. Kanoria and A. Montanari. On the deletion channel with small deletion probability. In IEEE International Symposium on Information Theory, 2010. Google Scholar
  15. A. McGregor, E. Price, and S. Vorotnikova. Trace reconstruction revisisted. In European Symposium on Algorithms, pages 689-700, 2014. Google Scholar
  16. M. Mitzenmacher. A survey of results for deletion channels and related synchronization channels. Probab. Surveys, pages 1-33, 2009. Google Scholar
  17. R. Venkataramanan, S. Tatikonda, and K. Ramchandran. Achievable rates for channels with deletions and insertions. In IEEE International Symposium on Information Theory, 2011. Google Scholar
  18. Y.Peres and A. Zhai. Average-case reconstruction for the deletion channel: subpolynomially many traces suffice. In FOCS. IEEE Computer Society Press, 2017. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail