An Invertible Transform for Efficient String Matching in Labeled Digraphs

Authors Abhinav Nellore , Austin Nguyen , Reid F. Thompson



PDF
Thumbnail PDF

File

LIPIcs.CPM.2021.20.pdf
  • Filesize: 1.55 MB
  • 14 pages

Document Identifiers

Author Details

Abhinav Nellore
  • Oregon Health & Science University, Portland, Oregon 97239, USA
Austin Nguyen
  • Oregon Health & Science University, Portland, Oregon 97239, USA
Reid F. Thompson
  • Oregon Health & Science University, Portland, Oregon 97239, USA
  • VA Portland Healthcare System, Portland, Oregon 97239, USA

Acknowledgements

We thank Rachel Ward, Ben Langmead, Chris Wilks, and the anonymous reviewers for the 32nd Annual Symposium on Combinatorial Pattern Matching, all of whose feedback improved this paper considerably.

Cite As Get BibTex

Abhinav Nellore, Austin Nguyen, and Reid F. Thompson. An Invertible Transform for Efficient String Matching in Labeled Digraphs. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 20:1-20:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021) https://doi.org/10.4230/LIPIcs.CPM.2021.20

Abstract

Let G = (V, E) be a digraph where each vertex is unlabeled, each edge is labeled by a character in some alphabet Ω, and any two edges with both the same head and the same tail have different labels. The powerset construction gives a transform of G into a weakly connected digraph G' = (V', E') that enables solving the decision problem of whether there is a walk in G matching an arbitrarily long query string q in time linear in |q| and independent of |E| and |V|. We show G is uniquely determined by G' when for every v_𝓁 ∈ V, there is some distinct string s_𝓁 on Ω such that v_𝓁 is the origin of a closed walk in G matching s_𝓁, and no other walk in G matches s_𝓁 unless it starts and ends at v_𝓁. We then exploit this invertibility condition to strategically alter any G so its transform G' enables retrieval of all t terminal vertices of walks in the unaltered G matching q in O(|q| + t log |V|) time. We conclude by proposing two defining properties of a class of transforms that includes the Burrows-Wheeler transform and the transform presented here.

Subject Classification

ACM Subject Classification
  • Mathematics of computing → Graph algorithms
  • Mathematics of computing → Combinatorics on words
Keywords
  • pattern matching
  • string matching
  • Burrows-Wheeler transform
  • labeled graphs

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Hideo Bannai, Travis Gagie, and I Tomohiro. Refining the r-index. Theoretical Computer Science, 812:96-108, 2020. Google Scholar
  2. Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de bruijn graphs. In International workshop on algorithms in bioinformatics, pages 225-235. Springer, 2012. Google Scholar
  3. Michael Burrows and David J Wheeler. A block-sorting lossless data compression algorithm. Technical report, Systems Research Center, 1994. Google Scholar
  4. Hyewon Choi and Bernd Burgstaller. Non-blocking parallel subset construction on shared-memory multicore architectures. In Proceedings of the Eleventh Australasian Symposium on Parallel and Distributed Computing-Volume 140, pages 13-20, 2013. Google Scholar
  5. Nicola Cotumaccio and Nicola Prezza. On indexing and compressing finite automata. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2585-2599. SIAM, 2021. Google Scholar
  6. Massimo Equi, Veli Mäkinen, and Alexandru I Tomescu. Conditional indexing lower bounds through self-reducibility. arXiv preprint, 2020. URL: http://arxiv.org/abs/2002.00629.
  7. Massimo Equi, Veli Mäkinen, and Alexandru I Tomescu. Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless seth fails. In International Conference on Current Trends in Theory and Practice of Informatics, pages 608-622. Springer, 2021. Google Scholar
  8. Tuvi Etzion. An algorithm for generating shift-register cycles. Theoretical computer science, 44:209-224, 1986. Google Scholar
  9. Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S Muthukrishnan. Compressing and indexing labeled trees, with applications. Journal of the ACM (JACM), 57(1):4, 2009. Google Scholar
  10. Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pages 390-398. IEEE, 2000. Google Scholar
  11. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM (JACM), 52(4):552-581, 2005. Google Scholar
  12. Daniel Gabric, Štěpán Holub, and Jeffrey Shallit. Generalized de bruijn words and the state complexity of conjugate sets. In International Conference on Descriptional Complexity of Formal Systems, pages 137-146. Springer, 2019. Google Scholar
  13. Daniel Gabric, Štěpán Holub, and Jeffrey Shallit. Maximal state complexity and generalized de bruijn words. Information and Computation, page 104689, 2021. Google Scholar
  14. Travis Gagie. r-indexing wheeler graphs. arXiv preprint, 2021. URL: http://arxiv.org/abs/2101.12341.
  15. Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for bwt-based data structures. Theoretical computer science, 698:67-78, 2017. Google Scholar
  16. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Optimal-time text indexing in bwt-runs bounded space. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1459-1477. SIAM, 2018. Google Scholar
  17. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in bwt-runs bounded space. Journal of the ACM (JACM), 67(1):1-54, 2020. Google Scholar
  18. Erik Garrison, Jouni Sirén, Adam M Novak, Glenn Hickey, Jordan M Eizenga, Eric T Dawson, William Jones, Shilpa Garg, Charles Markello, Michael F Lin, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature biotechnology, 2018. Google Scholar
  19. Farhad Hemmati and Daniel J Costello. An algebraic construction for q-ary shift register sequences. IEEE Transactions on Computers, 100(12):1192-1195, 1978. Google Scholar
  20. John E Hopcroft, Rajeev Motwani, and Jeffrey D Ullman. Introduction to automata theory, languages, and computation. Acm Sigact News, 32(1):60-65, 2001. Google Scholar
  21. Dominik Kempa. Optimal construction of compressed indexes for highly repetitive texts. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1344-1357. SIAM, 2019. Google Scholar
  22. Alan Kuhnle, Taher Mun, Christina Boucher, Travis Gagie, Ben Langmead, and Giovanni Manzini. Efficient construction of a complete index for pan-genomics read alignment. Journal of Computational Biology, 27(4):500-513, 2020. Google Scholar
  23. Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with bowtie 2. Nature methods, 9(4):357, 2012. Google Scholar
  24. Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome biology, 10(3):R25, 2009. Google Scholar
  25. Abraham Lempel. m-ary closed sequences. Journal of Combinatorial Theory, Series A, 10(3):253-258, 1971. Google Scholar
  26. Heng Li. Aligning sequence reads, clone sequences and assembly contigs with bwa-mem. arXiv preprint, 2013. URL: http://arxiv.org/abs/1303.3997.
  27. Heng Li and Richard Durbin. Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics, 25(14):1754-1760, 2009. Google Scholar
  28. Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. An extension of the burrows-wheeler transform. Theoretical Computer Science, 387(3):298-312, 2007. Google Scholar
  29. Gonzalo Navarro and Nicola Prezza. Universal compressed text indexing. Theoretical Computer Science, 762:41-50, 2019. Google Scholar
  30. Adam M Novak, Erik Garrison, and Benedict Paten. A graph extension of the positional burrows-wheeler transform and its applications. In International Workshop on Algorithms in Bioinformatics, pages 246-256. Springer, 2016. Google Scholar
  31. Nicola Prezza. On locating paths in compressed tries. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 744-760. SIAM, 2021. Google Scholar
  32. Nicola Prezza. Subpath queries on compressed graphs: A survey. Algorithms, 14(1):14, 2021. Google Scholar
  33. Michael O Rabin and Dana Scott. Finite automata and their decision problems. IBM journal of research and development, 3(2):114-125, 1959. Google Scholar
  34. Yan Shao, Yanbing Liu, and Jianlong Tan. Accelerating dfa construction by parallelizing subset construction. In International Conference on Trustworthy Computing and Services, pages 16-24. Springer, 2014. Google Scholar
  35. Jouni Sirén. Indexing variation graphs. In 2017 Proceedings of the ninteenth workshop on algorithm engineering and experiments (ALENEX), pages 13-27. SIAM, 2017. Google Scholar
  36. Jouni Sirén, Erik Garrison, Adam M Novak, Benedict Paten, and Richard Durbin. Haplotype-aware graph indexes. Bioinformatics, 36(2):400-407, 2020. Google Scholar
  37. Jouni Sirén, Niko Välimäki, and Veli Mäkinen. Indexing graphs for path queries with applications in genome research. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(2):375-388, 2014. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail