Counting Distinct Patterns in Internal Dictionary Matching

Charalampopoulos, Panagiotis; Kociumaka, Tomasz; Mohamed, Manal; Radoszewski, Jakub; Rytter, Wojciech; Straszyński, Juliusz; Waleń, Tomasz; Zuba, Wiktor

doi:10.4230/LIPIcs.CPM.2020.8

Abstract

We consider the problem of preprocessing a text T of length n and a dictionary 𝒟 in order to be able to efficiently answer queries CountDistinct(i,j), that is, given i and j return the number of patterns from 𝒟 that occur in the fragment T[i..j]. The dictionary is internal in the sense that each pattern in 𝒟 is given as a fragment of T. This way, the dictionary takes space proportional to the number of patterns d=|𝒟| rather than their total length, which could be Θ(n⋅ d). An 𝒪̃(n+d)-size data structure that answers CountDistinct(i,j) queries 𝒪(log n)-approximately in 𝒪̃(1) time was recently proposed in a work that introduced internal dictionary matching [ISAAC 2019]. Here we present an 𝒪̃(n+d)-size data structure that answers CountDistinct(i,j) queries 2-approximately in 𝒪̃(1) time. Using range queries, for any m, we give an 𝒪̃(min(nd/m,n²/m²)+d)-size data structure that answers CountDistinct(i,j) queries exactly in 𝒪̃(m) time. We also consider the special case when the dictionary consists of all square factors of the string. We design an 𝒪(n log² n)-size data structure that allows us to count distinct squares in a text fragment T[i..j] in 𝒪(log n) time.

Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6):333-340, 1975. URL: https://doi.org/10.1145/360825.360855.
Amihood Amir, Gad M. Landau, Moshe Lewenstein, and Dina Sokol. Dynamic text and static pattern matching. ACM Transactions on Algorithms, 3(2):19, 2007. URL: https://doi.org/10.1145/1240233.1240242.
Hideo Bannai, Tomohiro I, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. The "runs" theorem. SIAM Journal on Computing, 46(5):1501-1514, 2017. URL: https://doi.org/10.1137/15M1011032.
Timothy M. Chan and Mihai Pătraşcu. Counting inversions, offline orthogonal range counting, and related problems. In 21st Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, pages 161-173. SIAM, 2010. URL: https://doi.org/10.1137/1.9781611973075.15.
Panagiotis Charalampopoulos, Tomasz Kociumaka, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Internal dictionary matching. In 30th International Symposium on Algorithms and Computation, ISAAC 2019, volume 149 of LIPIcs, pages 22:1-22:17. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2019. URL: https://doi.org/10.4230/LIPIcs.ISAAC.2019.22.
Maxime Crochemore. An optimal algorithm for computing the repetitions in a word. Information Processing Letters, 12(5):244-250, 1981. URL: https://doi.org/10.1016/0020-0190(81)90024-7.
Maxime Crochemore, Costas S. Iliopoulos, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Extracting powers and periods in a word from its runs structure. Theoretical Computer Science, 521:29-41, 2014. URL: https://doi.org/10.1016/j.tcs.2013.11.018.
Antoine Deza, Frantisek Franek, and Adrien Thierry. How many double squares can a string contain? Discrete Applied Mathematics, 180:52-69, 2015. URL: https://doi.org/10.1016/j.dam.2014.08.016.
Nathan J. Fine and Herbert S. Wilf. Uniqueness theorems for periodic functions. Proceedings of the American Mathematical Society, 16(1):109-114, 1965. URL: https://doi.org/10.2307/2034009.
Aviezri S. Fraenkel and Jamie Simpson. How many squares can a string contain? Journal of Combinatorial Theory, Series A, 82(1):112-120, 1998. URL: https://doi.org/10.1006/jcta.1997.2843.
Amy Glen and Jamie Simpson. The total run length of a word. Theoretical Computer Science, 501:41-48, 2013. URL: https://doi.org/10.1016/j.tcs.2013.06.004.
Dan Gusfield and Jens Stoye. Linear time algorithms for finding and representing all the tandem repeats in a string. Journal of Computer and System Sciences, 69(4):525-546, 2004. URL: https://doi.org/10.1016/j.jcss.2004.03.004.
Monika Henzinger, Sebastian Krinninger, Danupon Nanongkai, and Thatchaphol Saranurak. Unifying and strengthening hardness for dynamic problems via the online matrix-vector multiplication conjecture. In 47th Annual ACM on Symposium on Theory of Computing, STOC 2015, pages 21-30. ACM, 2015. URL: https://doi.org/10.1145/2746539.2746609.
Haim Kaplan, Natan Rubin, Micha Sharir, and Elad Verbin. Efficient colored orthogonal range counting. SIAM Journal on Computing, 38(3):982-1011, 2008. URL: https://doi.org/10.1137/070684483.
Orgad Keller, Tsvi Kopelowitz, Shir Landau Feibish, and Moshe Lewenstein. Generalized substring compression. Theoretical Computer Science, 525:42-54, 2014. URL: https://doi.org/10.1016/j.tcs.2013.10.010.
Tomasz Kociumaka. Minimal suffix and rotation of a substring in optimal time. In 27th Annual Symposium on Combinatorial Pattern Matching, CPM 2016, volume 54 of LIPIcs, pages 28:1-28:12. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2016. URL: https://doi.org/10.4230/LIPIcs.CPM.2016.28.
Tomasz Kociumaka. Efficient Data Structures for Internal Queries in Texts. PhD thesis, University of Warsaw, 2018. URL: https://mimuw.edu.pl/~kociumaka/files/phd.pdf.
Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Internal pattern matching queries in a text and applications. In 26th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, pages 532-551. SIAM, 2015. URL: https://doi.org/10.1137/1.9781611973730.36.
Roman M. Kolpakov and Gregory Kucherov. Finding maximal repetitions in a word in linear time. In 40th Annual Symposium on Foundations of Computer Science, FOCS 1999, pages 596-604. IEEE Computer Society, 1999. URL: https://doi.org/10.1109/SFFCS.1999.814634.
Mikhail Rubinchik and Arseny M. Shur. Counting palindromes in substrings. In 24th International Symposium on String Processing and Information Retrieval, SPIRE 2017, volume 10508 of Lecture Notes in Computer Science, pages 290-303. Springer, 2017. URL: https://doi.org/10.1007/978-3-319-67428-5_25.
Milan Ružić. Constructing efficient dictionaries in close to sorting time. In Automata, Languages and Programming, ICALP 2008, Part I, volume 5125 of Lecture Notes in Computer Science, pages 84-95. Springer, 2008. URL: https://doi.org/10.1007/978-3-540-70575-8_8.
Jens Stoye and Dan Gusfield. Simple and flexible detection of contiguous repeats using a suffix tree. Theoretical Computer Science, 270(1-2):843-856, 2002. URL: https://doi.org/10.1016/S0304-3975(01)00121-9.
Mikkel Thorup. Space efficient dynamic stabbing with fast queries. In 35th Annual ACM Symposium on Theory of Computing, STOC 2003, pages 649-658. ACM, 2003. URL: https://doi.org/10.1145/780542.780636.
Dan E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(N). Information Processing Letters, 17(2):81-84, 1983. URL: https://doi.org/10.1016/0020-0190(83)90075-3.

Counting Distinct Patterns in Internal Dictionary Matching

Authors Panagiotis Charalampopoulos , Tomasz Kociumaka , Manal Mohamed , Jakub Radoszewski , Wojciech Rytter , Juliusz Straszyński , Tomasz Waleń , Wiktor Zuba

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message

Counting Distinct Patterns in Internal Dictionary Matching

Authors Panagiotis Charalampopoulos , Tomasz Kociumaka , Manal Mohamed , Jakub Radoszewski , Wojciech Rytter , Juliusz Straszyński , Tomasz Waleń , Wiktor Zuba

File

Document Identifiers

Related Versions

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

References

Thanks for your feedback!

Could not send message