Internal Dictionary Matching

Charalampopoulos, Panagiotis; Kociumaka, Tomasz; Mohamed, Manal; Radoszewski, Jakub; Rytter, Wojciech; Waleń, Tomasz

doi:10.4230/LIPIcs.ISAAC.2019.22

Abstract

We introduce data structures answering queries concerning the occurrences of patterns from a given dictionary D in fragments of a given string T of length n. The dictionary is internal in the sense that each pattern in D is given as a fragment of T. This way, D takes space proportional to the number of patterns d=|D| rather than their total length, which could be Theta(n * d).
In particular, we consider the following types of queries: reporting and counting all occurrences of patterns from D in a fragment T[i..j] (operations Report(i,j) and Count(i,j) below, as well as operation Exists(i,j) that returns true iff Count(i,j)>0) and reporting distinct patterns from D that occur in T[i..j] (operation ReportDistinct(i,j)). We show how to construct, in O((n+d) log^{O(1)} n) time, a data structure that answers each of these queries in time O(log^{O(1)} n+|output|) - see the table below for specific time and space complexities.
Query               | Preprocessing time                       | Space        | Query time
Exists(i,j)         | O(n+d)                                   | O(n)         | O(1)
Report(i,j)         | O(n+d)                                   | O(n+d)       | O(1+|output|)
ReportDistinct(i,j) | O(n log n+d)                             | O(n+d)       | O(log n+|output|)
Count(i,j)          | O({n log n}/{log log n} + d log^{3/2} n) | O(n+d log n) | O({log^2n}/{log log n}) 
The case of counting patterns is much more involved and needs a combination of a locally consistent parsing with orthogonal range searching. Reporting distinct patterns, on the other hand, uses the structure of maximal repetitions in strings. Finally, we provide tight - up to subpolynomial factors - upper and lower bounds for the case of a dynamic dictionary.

Alfred V. Aho and Margaret J. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 18(6):333-340, 1975. URL: https://doi.org/10.1145/360825.360855.
Amihood Amir, Martin Farach, Zvi Galil, Raffaele Giancarlo, and Kunsoo Park. Dynamic Dictionary Matching. Journal of Computer and System Sciences, 49(2):208-222, 1994. URL: https://doi.org/10.1016/S0022-0000(05)80047-9.
Amihood Amir, Martin Farach, Ramana M. Idury, Johannes A. La Poutré, and Alejandro A. Schäffer. Improved Dynamic Dictionary Matching. Information and Computation, 119(2):258-282, 1995. URL: https://doi.org/10.1006/inco.1995.1090.
Amihood Amir, Gad M. Landau, Moshe Lewenstein, and Dina Sokol. Dynamic text and static pattern matching. ACM Transactions on Algorithms, 3(2):19, 2007. URL: https://doi.org/10.1145/1240233.1240242.
Hideo Bannai, Tomohiro I, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. The "Runs" Theorem. SIAM Journal on Computing, 46(5):1501-1514, 2017. URL: https://doi.org/10.1137/15M1011032.
Hideo Bannai, Shunsuke Inenaga, and Dominik Köppl. Computing All Distinct Squares in Linear Time for Integer Alphabets. In Juha Kärkkäinen, Jakub Radoszewski, and Wojciech Rytter, editors, 28th Annual Symposium on Combinatorial Pattern Matching, CPM 2017, volume 78 of LIPIcs, pages 22:1-22:18. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2017. URL: https://doi.org/10.4230/LIPIcs.CPM.2017.22.
Michael A. Bender and Martin Farach-Colton. The Level Ancestor Problem simplified. Theoretical Computer Science, 321(1):5-12, 2004. URL: https://doi.org/10.1016/j.tcs.2003.05.002.
Michael A. Bender, Martin Farach-Colton, Giridhar Pemmasani, Steven Skiena, and Pavel Sumazin. Lowest common ancestors in trees and directed acyclic graphs. Journal of Algorithms, 57(2):75-94, 2005. URL: https://doi.org/10.1016/j.jalgor.2005.08.001.
Ho-Leung Chan, Wing-Kai Hon, Tak Wah Lam, and Kunihiko Sadakane. Dynamic dictionary matching and compressed suffix trees. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2005, pages 13-22. SIAM, 2005. URL: http://dl.acm.org/citation.cfm?id=1070432.1070436.
Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don't cares. In László Babai, editor, Proceedings of the 36th Annual ACM Symposium on Theory of Computing, STOC 2004, pages 91-100. ACM, 2004. URL: https://doi.org/10.1145/1007352.1007374.
Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on Strings. Cambridge University Press, 2007. URL: https://doi.org/10.1017/cbo9780511546853.
Maxime Crochemore, Costas S. Iliopoulos, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Extracting powers and periods in a word from its runs structure. Theoretical Computer Science, 521:29-41, 2014. URL: https://doi.org/10.1016/j.tcs.2013.11.018.
Martin Farach-Colton, Paolo Ferragina, and S. Muthukrishnan. On the Sorting-complexity of Suffix Tree Construction. Journal of the ACM, 47(6):987-1011, November 2000. URL: https://doi.org/10.1145/355541.355547.
Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a Sparse Table with O(1) Worst Case Access Time. Journal of the ACM, 31(3):538-544, 1984. URL: https://doi.org/10.1145/828.1884.
Harold N. Gabow and Robert Endre Tarjan. A Linear-Time Algorithm for a Special Case of Disjoint Set Union. Journal of Computer and System Sciences, 30(2):209-221, 1985. URL: https://doi.org/10.1016/0022-0000(85)90014-5.
Richard Groult, Élise Prieur, and Gwénaël Richomme. Counting distinct palindromes in a word in linear time. Information Processing Letters, 110(20):908-912, 2010. URL: https://doi.org/10.1016/j.ipl.2010.07.018.
Dov Harel and Robert Endre Tarjan. Fast Algorithms for Finding Nearest Common Ancestors. SIAM Journal on Computing, 13(2):338-355, 1984. URL: https://doi.org/10.1137/0213024.
Monika Henzinger, Sebastian Krinninger, Danupon Nanongkai, and Thatchaphol Saranurak. Unifying and Strengthening Hardness for Dynamic Problems via the Online Matrix-Vector Multiplication Conjecture. In Rocco A. Servedio and Ronitt Rubinfeld, editors, Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, pages 21-30. ACM, 2015. URL: https://doi.org/10.1145/2746539.2746609.
Tomohiro I. Longest Common Extensions with Recompression. In Juha Kärkkäinen, Jakub Radoszewski, and Wojciech Rytter, editors, 28th Annual Symposium on Combinatorial Pattern Matching, CPM 2017, volume 78 of LIPIcs, pages 18:1-18:15. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2017. URL: https://doi.org/10.4230/LIPIcs.CPM.2017.18.
Artur Jeż. Faster Fully Compressed Pattern Matching by Recompression. ACM Transactions on Algorithms, 11(3):20:1-20:43, 2015. URL: https://doi.org/10.1145/2631920.
Artur Jeż. Recompression: A Simple and Powerful Technique for Word Equations. Journal of the ACM, 63(1):4:1-4:51, 2016. URL: https://doi.org/10.1145/2743014.
Orgad Keller, Tsvi Kopelowitz, Shir Landau Feibish, and Moshe Lewenstein. Generalized substring compression. Theoretical Computer Science, 525:42-54, 2014. URL: https://doi.org/10.1016/j.tcs.2013.10.010.
Tomasz Kociumaka. Efficient Data Structures for Internal Queries in Texts. PhD thesis, University of Warsaw, 2018. URL: https://mimuw.edu.pl/~kociumaka/files/phd.pdf.
Tomasz Kociumaka, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. A Linear Time Algorithm for Seeds Computation, 2019. URL: http://arxiv.org/abs/1107.2422.
Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Internal Pattern Matching Queries in a Text and Applications. In Piotr Indyk, editor, Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, pages 532-551. SIAM, 2015. URL: https://doi.org/10.1137/1.9781611973730.36.
Roman M. Kolpakov and Gregory Kucherov. Finding Maximal Repetitions in a Word in Linear Time. In 40th Annual Symposium on Foundations of Computer Science, FOCS 1999, pages 596-604. IEEE Computer Society, 1999. URL: https://doi.org/10.1109/SFFCS.1999.814634.
J. Ian Munro, Yakov Nekrich, and Jeffrey Scott Vitter. Fast construction of wavelet trees. Theoretical Computer Science, 638:91-97, 2016. URL: https://doi.org/10.1016/j.tcs.2015.11.011.
S. Muthukrishnan. Efficient algorithms for document retrieval problems. In David Eppstein, editor, Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002, pages 657-666. SIAM, 2002. URL: http://dl.acm.org/citation.cfm?id=545381.545469.
Mihai Pătraşcu. Unifying the Landscape of Cell-Probe Lower Bounds. SIAM Journal on Computing, 40(3):827-847, 2011. URL: https://doi.org/10.1137/09075336X.
Mikhail Rubinchik and Arseny M. Shur. Counting Palindromes in Substrings. In Gabriele Fici, Marinella Sciortino, and Rossano Venturini, editors, String Processing and Information Retrieval - 24th International Symposium, SPIRE 2017, volume 10508 of Lecture Notes in Computer Science, pages 290-303. Springer, 2017. URL: https://doi.org/10.1007/978-3-319-67428-5_25.

Internal Dictionary Matching

Authors Panagiotis Charalampopoulos , Tomasz Kociumaka , Manal Mohamed , Jakub Radoszewski , Wojciech Rytter , Tomasz Waleń

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Acknowledgements

References

Thanks for your feedback!

Could not send message

Internal Dictionary Matching

Authors Panagiotis Charalampopoulos , Tomasz Kociumaka , Manal Mohamed , Jakub Radoszewski , Wojciech Rytter , Tomasz Waleń

File

Document Identifiers

Related Versions

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

Acknowledgements

References

Thanks for your feedback!

Could not send message