Finding Maximal Exact Matches in Graphs

Authors Nicola Rizzo , Manuel Cáceres , Veli Mäkinen



PDF
Thumbnail PDF

File

LIPIcs.WABI.2023.10.pdf
  • Filesize: 0.79 MB
  • 17 pages

Document Identifiers

Author Details

Nicola Rizzo
  • Department of Computer Science, University of Helsinki, Finland
Manuel Cáceres
  • Department of Computer Science, University of Helsinki, Finland
Veli Mäkinen
  • Department of Computer Science, University of Helsinki, Finland

Acknowledgements

We thank an anonymous reviewer for pointing out an anomaly that lead us to correct the choice of a parameter value in our experiments.

Cite As Get BibTex

Nicola Rizzo, Manuel Cáceres, and Veli Mäkinen. Finding Maximal Exact Matches in Graphs. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 10:1-10:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.WABI.2023.10

Abstract

We study the problem of finding maximal exact matches (MEMs) between a query string Q and a labeled graph G. MEMs are an important class of seeds, often used in seed-chain-extend type of practical alignment methods because of their strong connections to classical metrics. A principled way to speed up chaining is to limit the number of MEMs by considering only MEMs of length at least κ (κ-MEMs). However, on arbitrary input graphs, the problem of finding MEMs cannot be solved in truly sub-quadratic time under SETH (Equi et al., ICALP 2019) even on acyclic graphs. 
In this paper we show an O(n⋅ L ⋅ d^{L-1} + m + M_{κ,L})-time algorithm finding all κ-MEMs between Q and G spanning exactly L nodes in G, where n is the total length of node labels, d is the maximum degree of a node in G, m = |Q|, and M_{κ,L} is the number of output MEMs. We use this algorithm to develop a κ-MEM finding solution on indexable Elastic Founder Graphs (Equi et al., Algorithmica 2022) running in time O(nH² + m + M_κ), where H is the maximum number of nodes in a block, and M_κ is the total number of κ-MEMs.
Our results generalize to the analysis of multiple query strings (MEMs between G and any of the strings). Additionally, we provide some preliminary experimental results showing that the number of graph MEMs is an order of magnitude smaller than the number of string MEMs of the corresponding concatenated collection.

Subject Classification

ACM Subject Classification
  • Theory of computation → Graph algorithms analysis
  • Theory of computation → Pattern matching
  • Theory of computation → Sorting and searching
  • Applied computing → Genomics
Keywords
  • Sequence to graph alignment
  • bidirectional BWT
  • r-index
  • suffix tree
  • founder graphs

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Mohamed Ibrahim Abouelhoda. A chaining algorithm for mapping cdna sequences to multiple genomic sequences. In Nivio Ziviani and Ricardo A. Baeza-Yates, editors, String Processing and Information Retrieval, 14th International Symposium, SPIRE 2007, Santiago, Chile, October 29-31, 2007, Proceedings, volume 4726 of Lecture Notes in Computer Science, pages 1-13. Springer, 2007. URL: https://doi.org/10.1007/978-3-540-75530-2_1.
  2. Yuma Arakawa, Gonzalo Navarro, and Kunihiko Sadakane. Bi-directional r-indexes. In Hideo Bannai and Jan Holub, editors, 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022, June 27-29, 2022, Prague, Czech Republic, volume 223 of LIPIcs, pages 11:1-11:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022. URL: https://doi.org/10.4230/LIPIcs.CPM.2022.11.
  3. Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). SIAM J. Comput., 47(3):1087-1097, 2018. URL: https://doi.org/10.1137/15M1053128.
  4. Djamal Belazzougui, Fabio Cunial, Juha Kärkkäinen, and Veli Mäkinen. Linear-time string indexing and analysis in small space. ACM Trans. Algorithms, 16(2):17:1-17:54, 2020. URL: https://doi.org/10.1145/3381417.
  5. Karl Bringmann and Marvin Künnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In Venkatesan Guruswami, editor, IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 79-97. IEEE Computer Society, 2015. URL: https://doi.org/10.1109/FOCS.2015.15.
  6. M. Burrows and D. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994. Google Scholar
  7. Manuel Caceres. Parameterized algorithms for string matching to DAGs: Funnels and beyond. arXiv preprint, 2022. To appear in the proceedings of CPM 2023. URL: https://arxiv.org/abs/2212.07870.
  8. Ghanshyam Chandra and Chirag Jain. Sequence to graph alignment using gap-sensitive co-linear chaining. In Haixu Tang, editor, Research in Computational Molecular Biology, pages 58-73, Cham, 2023. Springer Nature Switzerland. Google Scholar
  9. David Clark. Compact pat trees. PhD thesis, University of Waterloo, 1997. Google Scholar
  10. The Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics, 19(1):118-135, October 2016. URL: https://doi.org/10.1093/bib/bbw089.
  11. Nicola Cotumaccio. Graphs can be succinctly indexed for pattern matching in O(| E | ^ 2 + | V | ^5/2) time. In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A. Storer, editors, Data Compression Conference, DCC 2022, Snowbird, UT, USA, March 22-25, 2022, pages 272-281. IEEE, 2022. URL: https://doi.org/10.1109/DCC52660.2022.00035.
  12. Nicola Cotumaccio and Nicola Prezza. On indexing and compressing finite automata. In Dániel Marx, editor, Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, Virtual Conference, January 10-13, 2021, pages 2585-2599. SIAM, 2021. URL: https://doi.org/10.1137/1.9781611976465.153.
  13. Rene De La Briandais. File searching using variable length keys. In Papers Presented at the the March 3-5, 1959, Western Joint Computer Conference, IRE-AIEE-ACM ’59 (Western), pages 295-298, New York, NY, USA, 1959. Association for Computing Machinery. URL: https://doi.org/10.1145/1457838.1457895.
  14. Massimo Equi, Veli Mäkinen, and Alexandru I. Tomescu. Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless SETH fails. In Tomás Bures, Riccardo Dondi, Johann Gamper, Giovanna Guerrini, Tomasz Jurdzinski, Claus Pahl, Florian Sikora, and Prudence W. H. Wong, editors, SOFSEM 2021: Theory and Practice of Computer Science - 47th International Conference on Current Trends in Theory and Practice of Computer Science, SOFSEM 2021, Bolzano-Bozen, Italy, January 25-29, 2021, Proceedings, volume 12607 of Lecture Notes in Computer Science, pages 608-622. Springer, 2021. URL: https://doi.org/10.1007/978-3-030-67731-2_44.
  15. Massimo Equi, Veli Mäkinen, Alexandru I Tomescu, and Roberto Grossi. On the complexity of string matching for graphs. ACM Transactions on Algorithms, 19(3):1-25, 2023. Google Scholar
  16. Massimo Equi, Tuukka Norri, Jarno Alanko, Bastien Cazaux, Alexandru I. Tomescu, and Veli Mäkinen. Algorithms and complexity on indexing founder graphs. Algorithmica, 85(6):1586-1623, 2023. URL: https://doi.org/10.1007/s00453-022-01007-w.
  17. Martin Farach. Optimal suffix tree construction with large alphabets. In Proceedings 38th Annual Symposium on Foundations of Computer Science, pages 137-143. IEEE, 1997. Google Scholar
  18. Paolo Ferragina and Roberto Grossi. The String B-tree: A new data structure for string search in external memory and its applications. J. ACM, 46(2):236-280, 1999. URL: https://doi.org/10.1145/301970.301973.
  19. Johannes Fischer and Volker Heun. Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput., 40(2):465-492, 2011. URL: https://doi.org/10.1137/090779759.
  20. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in bwt-runs bounded space. J. ACM, 67(1):2:1-2:54, 2020. URL: https://doi.org/10.1145/3375890.
  21. Adrián Goga, Andrej Baláz, Alessia Petescia, and Travis Gagie. MARIA: Multiple-alignment r-index with aggregation. CoRR, abs/2209.09218, 2022. URL: https://doi.org/10.48550/arXiv.2209.09218.
  22. Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997. URL: https://doi.org/10.1017/cbo9780511574931.
  23. Guy Jacobson. Space-efficient static trees and graphs. In 30th Annual Symposium on Foundations of Computer Science, Research Triangle Park, North Carolina, USA, 30 October - 1 November 1989, pages 549-554. IEEE Computer Society, 1989. URL: https://doi.org/10.1109/SFCS.1989.63533.
  24. Chirag Jain, Daniel Gibney, and Sharma V. Thankachan. Co-linear chaining with overlaps and gap costs. In Itsik Pe'er, editor, Research in Computational Molecular Biology - 26th Annual International Conference, RECOMB 2022, San Diego, CA, USA, May 22-25, 2022, Proceedings, volume 13278 of Lecture Notes in Computer Science, pages 246-262. Springer, 2022. URL: https://doi.org/10.1007/978-3-031-04749-7_15.
  25. Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint, 2013. URL: https://arxiv.org/abs/1303.3997.
  26. Heng Li, Xiaowen Feng, and Chong Chu. The design and construction of reference pangenome graphs with minigraph. Genome Biology, 21:1-19, 2020. Google Scholar
  27. Jun Ma, Manuel Cáceres, Leena Salmela, Veli Mäkinen, and Alexandru I. Tomescu. Chaining for accurate alignment of erroneous long reads to acyclic variation graphs. bioRxiv, 2022. URL: https://doi.org/10.1101/2022.01.07.475257.
  28. Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, 2015. URL: https://doi.org/10.1017/CBO9781139940023.
  29. Veli Mäkinen, Bastien Cazaux, Massimo Equi, Tuukka Norri, and Alexandru I. Tomescu. Linear time construction of indexable founder block graphs. In Carl Kingsford and Nadia Pisanti, editors, 20th International Workshop on Algorithms in Bioinformatics, WABI 2020, September 7-9, 2020, Pisa, Italy (Virtual Conference), volume 172 of LIPIcs, pages 7:1-7:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. URL: https://doi.org/10.4230/LIPIcs.WABI.2020.7.
  30. Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol., 17(3):281-308, 2010. URL: https://doi.org/10.1089/cmb.2009.0169.
  31. Veli Mäkinen and Kristoffer Sahlin. Chaining with overlaps revisited. In Inge Li Gørtz and Oren Weimann, editors, 31st Annual Symposium on Combinatorial Pattern Matching, CPM 2020, June 17-19, 2020, Copenhagen, Denmark, volume 161 of LIPIcs, pages 25:1-25:12. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. URL: https://doi.org/10.4230/LIPIcs.CPM.2020.25.
  32. Guillaume Marçais, Arthur L. Delcher, Adam M. Phillippy, Rachel Coston, Steven L. Salzberg, and Aleksey V. Zimin. Mummer4: A fast and versatile genome alignment system. PLoS Comput. Biol., 14(1), 2018. URL: https://doi.org/10.1371/journal.pcbi.1005944.
  33. S. Muthukrishnan. Efficient algorithms for document retrieval problems. In David Eppstein, editor, Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 6-8, 2002, San Francisco, CA, USA, pages 657-666. ACM/SIAM, 2002. URL: http://dl.acm.org/citation.cfm?id=545381.545469.
  34. Saul B Needleman and Christian D Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443-453, 1970. Google Scholar
  35. Takaaki Nishimoto, Shunsuke Kanda, and Yasuo Tabei. An Optimal-Time RLBWT Construction in BWT-Runs Bounded Space. In Mikołaj Bojańczyk, Emanuela Merelli, and David P. Woodruff, editors, 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022), volume 229 of Leibniz International Proceedings in Informatics (LIPIcs), pages 99:1-99:20, Dagstuhl, Germany, 2022. Schloss Dagstuhl - Leibniz-Zentrum für Informatik. URL: https://doi.org/10.4230/LIPIcs.ICALP.2022.99.
  36. Enno Ohlebusch, Simon Gog, and Adrian Kügel. Computing matching statistics and maximal exact matches on compressed full-text indexes. In Edgar Chávez and Stefano Lonardi, editors, String Processing and Information Retrieval - 17th International Symposium, SPIRE 2010, Los Cabos, Mexico, October 11-13, 2010. Proceedings, volume 6393 of Lecture Notes in Computer Science, pages 347-358. Springer, 2010. URL: https://doi.org/10.1007/978-3-642-16321-0_36.
  37. Mikko Rautiainen and Tobias Marschall. Graphaligner: rapid and versatile sequence-to-graph alignment. Genome biology, 21(1):1-28, 2020. Google Scholar
  38. Nicola Rizzo, Manuel Cáceres, and Veli Mäkinen. Chaining of maximal exact matches in graphs. Manuscript in submission. URL: https://doi.org/10.48550/arXiv.2302.01748.
  39. Nicola Rizzo, Massimo Equi, Tuukka Norri, and Veli Mäkinen. Elastic founder graphs improved and enhanced. CoRR, abs/2303.05336, 2023. Submitted manuscript. URL: https://doi.org/10.48550/arXiv.2303.05336.
  40. Nicola Rizzo and Veli Mäkinen. Indexable elastic founder graphs of minimum height. In Hideo Bannai and Jan Holub, editors, 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022, June 27-29, 2022, Prague, Czech Republic, volume 223 of LIPIcs, pages 19:1-19:19. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022. URL: https://doi.org/10.4230/LIPIcs.CPM.2022.19.
  41. Nicola Rizzo, Alexandru I. Tomescu, and Alberto Policriti. Solving string problems on graphs using the labeled direct product. Algorithmica, 84(10):3008-3033, 2022. URL: https://doi.org/10.1007/s00453-022-00989-x.
  42. Massimiliano Rossi, Marco Oliva, Paola Bonizzoni, Ben Langmead, Travis Gagie, and Christina Boucher. Finding maximal exact matches using the r-index. J. Comput. Biol., 29(2):188-194, 2022. URL: https://doi.org/10.1089/cmb.2021.0445.
  43. Thomas Schnattinger, Enno Ohlebusch, and Simon Gog. Bidirectional search in a string with wavelet trees and bidirectional matching statistics. Inf. Comput., 213:13-22, 2012. URL: https://doi.org/10.1016/j.ic.2011.03.007.
  44. Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, 1995. URL: https://doi.org/10.1007/BF01206331.
  45. Michaël Vyverman, Bernard De Baets, Veerle Fack, and Peter Dawyndt. essamem: finding maximal exact matches using enhanced sparse suffix arrays. Bioinform., 29(6):802-804, 2013. URL: https://doi.org/10.1093/bioinformatics/btt042.
  46. Robert A Wagner and Michael J Fischer. The string-to-string correction problem. Journal of the ACM, 21(1):168-173, 1974. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail