Internal Pattern Matching in Small Space and Applications

Authors: Gabriel Bathie, Panagiotis Charalampopoulos, Tatiana Starikovskaya



Author Details

Gabriel Bathie
  • DIENS, École normale supérieure de Paris, PSL Research University, France
  • LaBRI, Université de Bordeaux, France
Panagiotis Charalampopoulos
  • Birkbeck, University of London, UK
Tatiana Starikovskaya
  • DIENS, École normale supérieure de Paris, PSL Research University, France

Acknowledgements

We would like to dedicate this work to our dear friend and colleague Paweł Gawrychowski on the occasion of his 40th birthday. We thank Solon Pissis for helpful suggestions.

Cite As

Gabriel Bathie, Panagiotis Charalampopoulos, and Tatiana Starikovskaya. Internal Pattern Matching in Small Space and Applications. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 4:1-4:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.CPM.2024.4

Abstract

In this work, we consider pattern matching variants in small space, that is, in the read-only setting, where we want to bound the space usage on top of storing the strings. Our main contribution is a space-time trade-off for the Internal Pattern Matching (IPM) problem, where the goal is to construct a data structure over a string S of length n that answers queries of the following type: compute the occurrences of a fragment P of S inside another fragment T of S, provided that |T| < 2|P|. For any τ ∈ [1 .. n/log² n], we present a nearly optimal Õ(n/τ)-size data structure that can be built in Õ(n) time using Õ(n/τ) extra space, and answers IPM queries in O(τ + log n log³ log n) time. IPM queries have been identified as a crucial primitive operation for the analysis of algorithms on strings. In particular, the complexities of several recent algorithms for approximate pattern matching are expressed in terms of the number of calls to a small set of primitive operations that include IPM queries; our data structure allows us to port these results to the small-space setting. We further showcase the applicability of our IPM data structure by using it to obtain space-time trade-offs for the longest common substring and circular pattern matching problems in the asymmetric streaming setting.
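
To make the query type concrete, the following is a minimal brute-force sketch of an IPM query. It is not the data structure from the paper: it uses no preprocessing and takes O(|P|·|T|) time per query, so it only illustrates the input and output of the problem. The function names, the half-open index convention, and the error handling are assumptions made for this illustration.

```python
def ipm_query_naive(s, p_start, p_end, t_start, t_end):
    """Brute-force IPM query (illustrative baseline only).

    P = s[p_start:p_end] and T = s[t_start:t_end] are fragments of s,
    given as half-open ranges (an assumption of this sketch).
    Returns the starting positions in s of all occurrences of P
    that lie entirely inside T.
    """
    m = p_end - p_start          # |P|
    t_len = t_end - t_start      # |T|
    if m <= 0:
        raise ValueError("P must be non-empty")
    if not t_len < 2 * m:
        raise ValueError("IPM queries require |T| < 2|P|")
    pattern = s[p_start:p_end]
    # Try every alignment of P that fits inside T.
    return [k for k in range(t_start, t_end - m + 1)
            if s[k:k + m] == pattern]


def is_arithmetic_progression(positions):
    """Check whether consecutive positions share one common difference."""
    if len(positions) < 3:
        return True
    d = positions[1] - positions[0]
    return all(b - a == d for a, b in zip(positions, positions[1:]))


if __name__ == "__main__":
    s = "abababab"
    # P = "abab" (s[0:4]), T = "abababa" (s[0:7]); |T| = 7 < 8 = 2|P|.
    occ = ipm_query_naive(s, 0, 4, 0, 7)
    print(occ, is_arithmetic_progression(occ))
```

The restriction |T| < 2|P| is what keeps the answer compact: it is a standard fact that, under this condition, the occurrences of P in T form an arithmetic progression, so the output can be described by a constant number of integers. The helper `is_arithmetic_progression` only checks this property on a given output; it does not prove it.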

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • internal pattern matching
  • longest common substring
  • small-space algorithms
