Breaking a Barrier in Constructing Compact Indexes for Parameterized Pattern Matching

Authors Kento Iseri, Tomohiro I, Diptarama Hendrian, Dominik Köppl, Ryo Yoshinaka, Ayumi Shinohara



Author Details

Kento Iseri
  • Kyushu Institute of Technology, Japan
Tomohiro I
  • Kyushu Institute of Technology, Japan
Diptarama Hendrian
  • Tokyo Medical and Dental University, Japan
Dominik Köppl
  • University of Yamanashi, Japan
Ryo Yoshinaka
  • Tohoku University, Sendai, Japan
Ayumi Shinohara
  • Tohoku University, Sendai, Japan

Cite As

Kento Iseri, Tomohiro I, Diptarama Hendrian, Dominik Köppl, Ryo Yoshinaka, and Ayumi Shinohara. Breaking a Barrier in Constructing Compact Indexes for Parameterized Pattern Matching. In 51st International Colloquium on Automata, Languages, and Programming (ICALP 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 297, pp. 89:1-89:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.ICALP.2024.89

Abstract

A parameterized string (p-string) is a string over an alphabet (Σ_s ∪ Σ_p), where Σ_s and Σ_p are disjoint alphabets for static symbols (s-symbols) and for parameter symbols (p-symbols), respectively. Two p-strings x and y are said to parameterized match (p-match) if and only if x can be transformed into y by applying a bijection on Σ_p to every occurrence of p-symbols in x. The indexing problem for p-matching is to preprocess a p-string T of length n so that we can efficiently find the occurrences of substrings of T that p-match a given pattern. Let σ_s and σ_p be the numbers of distinct s-symbols and p-symbols, respectively, that appear in T, and let σ = σ_s + σ_p. Extending the Burrows-Wheeler Transform (BWT) based index for exact string pattern matching, Ganguly et al. [SODA 2017] proposed parameterized BWTs (pBWTs) to design the first compact index for p-matching, and posed an open problem on how to construct the pBWT-based index in compact space, i.e., in O(n lg |Σ_s ∪ Σ_p|) bits of space. Hashimoto et al. [SPIRE 2022] showed how to construct the pBWT for T, under the assumption that Σ_s ∪ Σ_p = [0..O(σ)], in O(n lg σ) bits of space and O(n (σ_p lg n)/(lg lg n)) time in an online manner while reading the symbols of T from right to left. In this paper, we refine Hashimoto et al.’s algorithm to work in O(n lg σ) bits of space and O(n (lg σ_p lg n)/(lg lg n)) time under the more general assumption that Σ_s ∪ Σ_p = [0..n^{O(1)}]. Our result has an immediate application to constructing parameterized suffix arrays in O(n (lg σ_p lg n)/(lg lg n)) time and O(n lg σ) bits of working space. We also show that our data structure can support backward search, a core procedure of BWT-based indexes, at any stage of the online construction, making it the first compact index for p-matching that can be constructed in compact space and even in an online manner.
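To make the definition of p-matching concrete, the following is a minimal Python sketch that checks the definition directly and then scans a text naively. It is not the pBWT-based index described in the abstract: the function names `p_match` and `p_occurrences`, and the convention of passing Σ_s as a Python set `static_symbols`, are assumptions made only for this illustration.

```python
def p_match(x, y, static_symbols):
    """Check whether p-strings x and y p-match.

    static_symbols plays the role of Sigma_s (hypothetical parameter name);
    every other symbol is treated as a p-symbol from Sigma_p.
    """
    if len(x) != len(y):
        return False
    fwd, bwd = {}, {}  # partial bijection on p-symbols and its inverse
    for a, b in zip(x, y):
        if a in static_symbols or b in static_symbols:
            # s-symbols must coincide exactly at the same position.
            if a != b:
                return False
            continue
        # p-symbols must be mapped consistently and injectively.
        if fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
            return False
    return True


def p_occurrences(text, pattern, static_symbols):
    """Naive scan: start positions of substrings of text that p-match pattern."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if p_match(text[i:i + m], pattern, static_symbols)]


if __name__ == "__main__":
    sigma_s = {"A", "B"}  # uppercase letters are s-symbols in this toy example
    print(p_match("xAyx", "yAzy", sigma_s))
    print(p_occurrences("xyAzzAyxA", "abA", sigma_s))
```

The scan takes time proportional to the product of the text and pattern lengths, which is exactly the cost an index for p-matching is meant to avoid; the sketch only illustrates what counts as an occurrence.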

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • Index for parameterized pattern matching
  • Parameterized Burrows-Wheeler Transform
  • Online construction

References

  1. Brenda S. Baker. A theory of parameterized pattern matching: algorithms and applications. In S. Rao Kosaraju, David S. Johnson, and Alok Aggarwal, editors, Proc. 25th Annual ACM Symposium on Theory of Computing (STOC), pages 71-80. ACM, 1993. URL: https://doi.org/10.1145/167088.167115.
  2. Brenda S. Baker. Parameterized pattern matching: Algorithms and applications. Journal of Computer and System Sciences, 52(1):28-42, 1996. URL: https://doi.org/10.1006/jcss.1996.0003.
  3. Brenda S. Baker. Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM J. Comput., 26(5):1343-1362, 1997. URL: https://doi.org/10.1137/S0097539793246707.
  4. Richard Beal and Donald A. Adjeroh. p-suffix sorting as arithmetic coding. J. Discrete Algorithms, 16:151-169, 2012. URL: https://doi.org/10.1016/j.jda.2012.05.001.
  5. Jason W. Bentley, Daniel Gibney, and Sharma V. Thankachan. On the complexity of BWT-runs minimization via alphabet reordering. In Proc. ESA, volume 173 of LIPIcs, pages 15:1-15:13, 2020. URL: https://doi.org/10.4230/LIPIcs.ESA.2020.15.
  6. Francisco Claude, Gonzalo Navarro, and Alberto Ordóñez Pereira. The wavelet matrix: An efficient wavelet tree for large alphabets. Inf. Syst., 47:15-32, 2015. URL: https://doi.org/10.1016/j.is.2014.06.002.
  7. Richard Cole and Ramesh Hariharan. Faster suffix tree construction with missing suffix links. SIAM J. Comput., 33(1):26-42, 2003. URL: https://doi.org/10.1137/S0097539701424465.
  8. Satoshi Deguchi, Fumihito Higashijima, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Parameterized suffix arrays for binary strings. In Proc. Prague Stringology Conference (PSC) 2008, pages 84-94, 2008. URL: http://www.stringology.org/event/2008/p08.html.
  9. Diptarama, Takashi Katsura, Yuhei Otomo, Kazuyuki Narisawa, and Ayumi Shinohara. Position heaps for parameterized strings. In Juha Kärkkäinen, Jakub Radoszewski, and Wojciech Rytter, editors, Proc. 28th Annual Symposium on Combinatorial Pattern Matching (CPM) 2017, volume 78 of LIPIcs, pages 8:1-8:13. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017. URL: https://doi.org/10.4230/LIPIcs.CPM.2017.8.
  10. Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proc. 41st Annual Symposium on Foundations of Computer Science (FOCS) 2000, pages 390-398, 2000. URL: https://doi.org/10.1109/SFCS.2000.892127.
  11. Noriki Fujisato, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Right-to-left online construction of parameterized position heaps. In Jan Holub and Jan Zdárek, editors, Proc. Prague Stringology Conference (PSC) 2018, pages 91-102. Czech Technical University in Prague, Faculty of Information Technology, Department of Theoretical Computer Science, 2018. URL: http://www.stringology.org/event/2018/p09.html.
  12. Noriki Fujisato, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Direct linear time construction of parameterized suffix and LCP arrays for constant alphabets. In Nieves R. Brisaboa and Simon J. Puglisi, editors, Proc. 26th International Symposium on String Processing and Information Retrieval (SPIRE) 2019, volume 11811 of Lecture Notes in Computer Science, pages 382-391. Springer, 2019. URL: https://doi.org/10.1007/978-3-030-32686-9_27.
  13. Noriki Fujisato, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. The parameterized position heap of a trie. In Pinar Heggernes, editor, Proc. 11th International Conference on Algorithms and Complexity (CIAC) 2019, volume 11485 of Lecture Notes in Computer Science, pages 237-248. Springer, 2019. URL: https://doi.org/10.1007/978-3-030-17402-6_20.
  14. Noriki Fujisato, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. The parameterized suffix tray. In Tiziana Calamoneri and Federico Corò, editors, Proc. 12th International Conference on Algorithms and Complexity (CIAC) 2021, volume 12701 of Lecture Notes in Computer Science, pages 258-270. Springer, 2021. URL: https://doi.org/10.1007/978-3-030-75242-2_18.
  15. Travis Gagie, Giovanni Manzini, and Rossano Venturini. An encoding for order-preserving matching. In Proc. 25th Annual European Symposium on Algorithms (ESA) 2017, pages 38:1-38:15, 2017. URL: https://doi.org/10.4230/LIPIcs.ESA.2017.38.
  16. Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. pBWT: Achieving succinct data structures for parameterized pattern matching and related problems. In Proc. 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) 2017, pages 397-407, 2017. URL: https://doi.org/10.1137/1.9781611974782.25.
  17. Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Structural pattern matching - succinctly. In Proc. 28th International Symposium on Algorithms and Computation (ISAAC) 2017, pages 35:1-35:13, 2017. URL: https://doi.org/10.4230/LIPIcs.ISAAC.2017.35.
  18. Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Fully functional parameterized suffix trees in compact space. In Mikolaj Bojanczyk, Emanuela Merelli, and David P. Woodruff, editors, Proc. 49th International Colloquium on Automata, Languages, and Programming, (ICALP) 2022, volume 229 of LIPIcs, pages 65:1-65:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022. URL: https://doi.org/10.4230/LIPIcs.ICALP.2022.65.
  19. Daiki Hashimoto, Diptarama Hendrian, Dominik Köppl, Ryo Yoshinaka, and Ayumi Shinohara. Computing the parameterized Burrows-Wheeler transform online. In Diego Arroyuelo and Barbara Poblete, editors, Proc. 29th International Symposium on String Processing and Information Retrieval (SPIRE) 2022, volume 13617 of Lecture Notes in Computer Science, pages 70-85. Springer, 2022. URL: https://doi.org/10.1007/978-3-031-20643-6_6.
  20. Tomohiro I, Satoshi Deguchi, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Lightweight parameterized suffix array construction. In Proc. 20th International Workshop on Combinatorial Algorithms (IWOCA) 2009, pages 312-323, 2009. URL: https://doi.org/10.1007/978-3-642-10217-2_31.
  21. Tomohiro I and Dominik Köppl. Load-balancing succinct B trees, 2021. arXiv:2104.08751. URL: https://doi.org/10.48550/arXiv.2104.08751.
  22. Robert W. Irving and Lorna Love. The suffix binary search tree and suffix AVL tree. J. Discrete Algorithms, 1(5-6):387-408, 2003. URL: https://doi.org/10.1016/S1570-8667(03)00034-0.
  23. Sung-Hwan Kim and Hwan-Gue Cho. A compact index for cartesian tree matching. In Pawel Gawrychowski and Tatiana Starikovskaya, editors, Proc. 32nd Annual Symposium on Combinatorial Pattern Matching (CPM) 2021, volume 191 of LIPIcs, pages 18:1-18:19. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPIcs.CPM.2021.18.
  24. Sung-Hwan Kim and Hwan-Gue Cho. Simpler FM-index for parameterized string matching. Inf. Process. Lett., 165:106026, 2021. URL: https://doi.org/10.1016/j.ipl.2020.106026.
  25. S. Rao Kosaraju. Faster algorithms for the construction of parameterized suffix trees (preliminary version). In Proc. 36th Annual Symposium on Foundations of Computer Science (FOCS), pages 631-637. IEEE Computer Society, 1995. URL: https://doi.org/10.1109/SFCS.1995.492664.
  26. Taehyung Lee, Joong Chae Na, and Kunsoo Park. On-line construction of parameterized suffix trees for large alphabets. Inf. Process. Lett., 111(5):201-207, 2011. URL: https://doi.org/10.1016/j.ipl.2010.11.017.
  27. Yoshiaki Matsuoka, Takahiro Aoki, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Generalized pattern matching and periodicity under substring consistent equivalence relations. Theor. Comput. Sci., 656:225-233, 2016. URL: https://doi.org/10.1016/j.tcs.2016.02.017.
  28. Juan Mendivelso, Sharma V. Thankachan, and Yoan J. Pinzón. A brief history of parameterized matching problems. Discret. Appl. Math., 274:103-115, 2020. URL: https://doi.org/10.1016/j.dam.2018.07.017.
  29. J. Ian Munro and Yakov Nekrich. Compressed data structures for dynamic sequences. In Proc. 23rd Annual European Symposium on Algorithms (ESA) 2015, pages 891-902, 2015. URL: https://doi.org/10.1007/978-3-662-48350-3_74.
  30. Shinya Nagashita and Tomohiro I. PalFM-Index: FM-index for palindrome pattern matching. In Laurent Bulteau and Zsuzsanna Lipták, editors, Proc. 34th Annual Symposium on Combinatorial Pattern Matching (CPM) 2023, volume 259 of LIPIcs, pages 23:1-23:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023. URL: https://doi.org/10.4230/LIPIcs.CPM.2023.23.
  31. Katsuhito Nakashima, Noriki Fujisato, Diptarama Hendrian, Yuto Nakashima, Ryo Yoshinaka, Shunsuke Inenaga, Hideo Bannai, Ayumi Shinohara, and Masayuki Takeda. Parameterized DAWGs: Efficient constructions and bidirectional pattern searches. Theor. Comput. Sci., 933:21-42, 2022. URL: https://doi.org/10.1016/j.tcs.2022.09.008.
  32. Gonzalo Navarro. Wavelet trees for all. J. Discrete Algorithms, 25:2-20, 2014. URL: https://doi.org/10.1016/j.jda.2013.07.004.
  33. Eric M. Osterkamp and Dominik Köppl. Extending the parameterized Burrows-Wheeler transform. In Proc. DCC, pages 143-152, March 2024.
  34. Mihai Patrascu and Mikkel Thorup. Dynamic integer sets with optimal rank, select, and predecessor search. In 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21, 2014, pages 166-175, 2014. URL: https://doi.org/10.1109/FOCS.2014.26.