A Unifying Taxonomy of Pattern Matching in Degenerate Strings and Founder Graphs

Authors: Rocco Ascone, Giulia Bernardini, Alessio Conte, Massimo Equi, Esteban Gabory, Roberto Grossi, Nadia Pisanti



Author Details

Rocco Ascone
  • University of Trieste, Italy
Giulia Bernardini
  • University of Trieste, Italy
Alessio Conte
  • University of Pisa, Italy
Massimo Equi
  • University of Helsinki, Finland
Esteban Gabory
  • CWI, Amsterdam, The Netherlands
Roberto Grossi
  • University of Pisa, Italy
Nadia Pisanti
  • University of Pisa, Italy

Cite As

Rocco Ascone, Giulia Bernardini, Alessio Conte, Massimo Equi, Esteban Gabory, Roberto Grossi, and Nadia Pisanti. A Unifying Taxonomy of Pattern Matching in Degenerate Strings and Founder Graphs. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 14:1-14:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.WABI.2024.14

Abstract

Elastic Degenerate (ED) strings and Elastic Founder (EF) graphs are two versions of acyclic components of pangenomes. Both ED strings and EF graphs (which we collectively name variable strings) extend the well-known notion of indeterminate string. Recent work has extensively investigated algorithmic tasks over these structures, and over several other variable-string notions that they generalise. Among such tasks, the basic operation of matching a pattern against a text, which can serve as a toolkit for many pangenomic data analyses using these data structures, deserves special attention. In this paper we: (1) highlight a clear taxonomy within both ED strings and EF graphs ranging through variable strings of all types, from the linear string up to the most general one; (2) investigate the problem PvarT(X,Y) of matching a solid or variable pattern of type X against a variable text of type Y; (3) using as a reference the quadratic conditional lower bounds that are known for PvarT(solid,ED) and PvarT(solid,EF), for all possible types of variable strings X and Y we either prove the quadratic conditional lower bound for PvarT(X,Y), or provide non-trivial, often sub-quadratic, upper bounds, also exploiting the above-mentioned taxonomy.
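
To make the problem PvarT(solid,ED) concrete, the following is a minimal illustrative sketch of matching a solid pattern against an ED text. It is not one of the algorithms from the paper. It assumes an ED text is represented as a list of segments, each segment being a list of alternative strings (possibly including the empty string), and that an occurrence may start inside an alternative of one segment, must use whole alternatives of the intermediate segments, and may end inside an alternative of a later segment. The function name `ed_match` and the data layout are hypothetical choices made for this example only.

```python
def ed_match(pattern, ed_text):
    """Naive check: does `pattern` occur in some string spelled by `ed_text`?

    `ed_text` is assumed to be a list of segments; each segment is a list of
    alternative strings (the empty string is allowed as an alternative).
    """
    m = len(pattern)
    if m == 0:
        return True
    # active holds lengths j (0 < j < m) such that pattern[:j] can be matched
    # ending exactly at the boundary before the current segment.
    active = set()
    for segment in ed_text:
        next_active = set()
        for s in segment:
            # (a) the whole pattern lies inside a single alternative
            if pattern in s:
                return True
            # (b) extend partial matches that reached this segment
            for j in active:
                rest = pattern[j:]
                if s.startswith(rest):
                    return True  # the pattern ends inside (or at the end of) s
                if rest.startswith(s):
                    next_active.add(j + len(s))  # s is consumed entirely
            # (c) start a new partial match: a non-empty suffix of s that is
            # a proper prefix of the pattern
            for k in range(1, min(len(s), m - 1) + 1):
                if s.endswith(pattern[:k]):
                    next_active.add(k)
        active = next_active
    return False
```

For instance, with the hypothetical text `[["A", "AC"], ["G", "", "TT"], ["CA"]]`, the pattern `ACCA` is found by choosing `AC`, the empty string, and `CA`, whereas `GG` has no occurrence. The sketch favours clarity over speed: it rescans every alternative for every active prefix length, so it says nothing about the upper and lower bounds discussed in the abstract.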

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
  • Theory of computation → Problems, reductions and completeness
  • Applied computing → Molecular sequence analysis
  • Applied computing → Computational genomics
Keywords
  • Pangenomics
  • pattern matching
  • degenerate string
  • founder graph
  • fine-grained complexity
