# Hidden Words Statistics for Large Patterns

## File

LIPIcs.AofA.2020.17.pdf
• Filesize: 476 kB
• 15 pages

## Cite As

Svante Janson and Wojciech Szpankowski. Hidden Words Statistics for Large Patterns. In 31st International Conference on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms (AofA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 159, pp. 17:1-17:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.AofA.2020.17

## Abstract

We study subsequence pattern matching, also known as hidden pattern matching, in which one searches for a given pattern w of length m as a subsequence in a random text of length n. The quantity of interest is the number of occurrences of w as a subsequence (i.e., occurring in not necessarily consecutive text locations). This problem has many applications, ranging from intrusion detection and trace reconstruction to the deletion channel and DNA-based storage systems. In all of these applications, the pattern w is of variable length. To the best of our knowledge, this problem has been tackled only for a fixed length m=O(1) [P. Flajolet et al., 2006]. In our main result, Theorem 5, we prove that for m=o(n^{1/3}) the number of subsequence occurrences is asymptotically normally distributed. In addition, in Theorem 6 we show that under some constraints on the structure of w the asymptotic normality can be extended to m=o(√n). For a special pattern w consisting of the same symbol, we indicate that for m=o(n) the distribution of the number of subsequence occurrences is either asymptotically normal or asymptotically log-normal. We conjecture that this dichotomy is true for all patterns. We use Hoeffding’s projection method for U-statistics to prove our findings.
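To make the quantity of interest concrete, the following is a minimal sketch of how the number of occurrences of w as a subsequence of a text could be counted with a standard dynamic program. It is an illustration only, not a method taken from the paper; the function name `count_subsequence_occurrences` and its parameters are assumptions introduced here.

```python
def count_subsequence_occurrences(text: str, pattern: str) -> int:
    """Count index tuples i_1 < ... < i_m with text[i_j] == pattern[j] for all j.

    Hypothetical helper for illustration; runs in O(n * m) time and O(m) space.
    """
    m = len(pattern)
    # ways[j] = number of ways to embed the first j pattern symbols
    # into the part of the text read so far.
    ways = [0] * (m + 1)
    ways[0] = 1  # the empty prefix embeds in exactly one way
    for ch in text:
        # Scan j from right to left so the current text symbol
        # extends only embeddings that ended strictly before it.
        for j in range(m, 0, -1):
            if pattern[j - 1] == ch:
                ways[j] += ways[j - 1]
    return ways[m]


if __name__ == "__main__":
    # "ab" occurs at positions (0,1), (0,3), (2,3) of "abab".
    print(count_subsequence_occurrences("abab", "ab"))  # 3
```

For a pattern consisting of a single repeated symbol, the count reduces to a binomial coefficient: if the symbol appears k times in the text, the pattern of length m occurs C(k, m) times, which is the special case singled out in the abstract.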

## Subject Classification

##### ACM Subject Classification
• Mathematics of computing → Probability and statistics
##### Keywords
• Hidden pattern matching
• subsequences
• probability
• U-statistics
• projection method

## References

1. E. Bender and F. Kochman. The distribution of subword counts is usually normal. European J. Combin., 14:265-275, 1993.
2. J. Bourdon and B. Vallée. Generalized pattern matching statistics. In Mathematics and Computer Science II (Versailles, 2002), Trends Math., pages 249-265. Birkhäuser, 2002.
3. S. Diggavi and M. Grossglauser. Information transmission over finite buffer channels. IEEE Trans. Information Theory, 52:1226-1237, 2006.
4. R. L. Dobrushin. Shannon’s theorem for channels with synchronization errors. Prob. Info. Trans., pages 18-36, 1967.
5. M. Drmota, K. Viswanathan, and W. Szpankowski. Mutual information for a deletion channel. In IEEE International Symposium on Information Theory, 2012.
6. P. Flajolet, W. Szpankowski, and B. Vallée. Hidden word statistics. J. ACM, 53(1):147-183, 2006. URL: https://doi.org/10.1145/1120582.1120586.
7. Allan Gut. Probability: A Graduate Course. Springer, New York, 2013.
8. R. Gwadera, M. Atallah, and W. Szpankowski. Reliable detection of episodes in event sequences. In 3rd IEEE Conf. on Data Mining, pages 67-74. IEEE Computer Soc., 2003.
9. W. Hoeffding. A class of statistics with asymptotically normal distribution. Ann. Math. Statistics, 19:293-325, 1948.
10. N. Holden and R. Lyons. Lower bounds for trace reconstruction, 2018. URL: http://arxiv.org/abs/1808.02336.
11. P. Jacquet and W. Szpankowski. Analytic Pattern Matching: From DNA to Twitter. Cambridge University Press, 2015.
12. S. Janson, B. Nakamura, and D. Zeilberger. On the asymptotic statistics of the number of occurrences of multiple permutation patterns. J. Comb., 6:117-143, 2015.
13. A. Kalai, M. Mitzenmacher, and M. Sudan. Tight asymptotic bounds for the deletion channel with small deletion probabilities. In IEEE International Symposium on Information Theory, 2010.
14. Y. Kanoria and A. Montanari. On the deletion channel with small deletion probability. In IEEE International Symposium on Information Theory, 2010.
15. A. McGregor, E. Price, and S. Vorotnikova. Trace reconstruction revisited. In European Symposium on Algorithms, pages 689-700, 2014.
16. M. Mitzenmacher. A survey of results for deletion channels and related synchronization channels. Probab. Surveys, pages 1-33, 2009.
17. R. Venkataramanan, S. Tatikonda, and K. Ramchandran. Achievable rates for channels with deletions and insertions. In IEEE International Symposium on Information Theory, 2011.
18. Y. Peres and A. Zhai. Average-case reconstruction for the deletion channel: subpolynomially many traces suffice. In FOCS. IEEE Computer Society Press, 2017.