Parameterized Algorithms for String Matching to DAGs: Funnels and Beyond

Author Manuel Cáceres



File

LIPIcs.CPM.2023.7.pdf
  • Filesize: 0.78 MB
  • 19 pages

Author Details

Manuel Cáceres
  • Department of Computer Science, University of Helsinki, Finland

Acknowledgements

I am very grateful to Alexandru I. Tomescu for initial discussions on funnel algorithms, to Veli Mäkinen for discussions on applying KMP on DAGs, to Massimo Equi and Nicola Rizzo for the useful discussions, and to the anonymous reviewers for their useful comments.

Cite As

Manuel Cáceres. Parameterized Algorithms for String Matching to DAGs: Funnels and Beyond. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 7:1-7:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.CPM.2023.7

Abstract

The problem of String Matching to Labeled Graphs (SMLG) asks to find all the paths in a labeled graph G = (V, E) whose spellings match that of an input string S ∈ Σ^m. SMLG can be solved in quadratic O(m|E|) time [Amir et al., JALG 2000], which was proven to be optimal by a recent lower bound conditioned on SETH [Equi et al., ICALP 2019]. The lower bound states that no strongly subquadratic time algorithm exists, even if restricted to directed acyclic graphs (DAGs).
In this work we present the first parameterized algorithms for SMLG on DAGs. Our parameters capture the topological structure of G. All our results are derived from a generalization of the Knuth-Morris-Pratt algorithm [Park and Kim, CPM 1995] optimized to work in time proportional to the number of prefix-incomparable matches.
To obtain the parameterization in the topological structure of G, we first study a special class of DAGs called funnels [Millani et al., JCO 2020] and generalize them to k-funnels and the class ST_k. We present several novel characterizations and algorithmic contributions on both funnels and their generalizations.
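
For orientation, the quadratic baseline mentioned above can be illustrated with a level-by-level dynamic program. This is only a minimal sketch under simplifying assumptions, not the algorithm of Amir et al. nor the parameterized algorithms of this paper: every vertex is assumed to carry a single-character label, a match is a path whose vertex labels concatenate to S, and the name `match_ends` and the toy graph are hypothetical.

```python
def match_ends(labels, edges, s):
    """Return the set of vertices at which some path spelling `s` ends.

    labels: sequence where labels[v] is the one-character label of vertex v
            (single-character labels are an assumption of this sketch).
    edges:  iterable of directed edges (u, v).
    s:      the pattern string.
    """
    n = len(labels)
    if not s:
        return set(range(n))

    # In-neighbor lists, so each round can look one step backwards.
    preds = [[] for _ in range(n)]
    for u, v in edges:
        preds[v].append(u)

    # Round 0: vertices whose label matches the first character.
    current = {v for v in range(n) if labels[v] == s[0]}

    # Round i: v survives if its label is s[i] and some in-neighbor
    # ended a match of s[:i] in the previous round.
    for ch in s[1:]:
        current = {
            v
            for v in range(n)
            if labels[v] == ch and any(u in current for u in preds[v])
        }
    return current


# Toy DAG: 0(a) -> 1(b) -> 2(c) -> 4(a) and 0(a) -> 3(b) -> 2(c).
toy_edges = [(0, 1), (0, 3), (1, 2), (3, 2), (2, 4)]
print(sorted(match_ends("abcba", toy_edges, "abc")))  # [2]
```

Each of the m rounds scans every vertex and every in-edge once, so the sketch runs in O(m(|V| + |E|)) time, in line with the quadratic bound quoted in the abstract.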

Subject Classification

ACM Subject Classification
  • Theory of computation → Parameterized complexity and exact algorithms
  • Theory of computation → Pattern matching
  • Mathematics of computing → Graph algorithms
Keywords
  • string matching
  • parameterized algorithms
  • FPT inside P
  • string algorithms
  • graph algorithms
  • directed acyclic graphs
  • labeled graphs
  • funnels

References

  1. Amir Abboud, Virginia Vassilevska Williams, and Joshua Wang. Approximation and fixed parameter subquadratic algorithms for radius and diameter in sparse graphs. In Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2016), pages 377-391. SIAM, 2016.
  2. Alfred V Aho and John E Hopcroft. The design and analysis of computer algorithms. Pearson Education India, 1974.
  3. Alfred V Aho, John E Hopcroft, and Jeffrey D Ullman. On finding lowest common ancestors in trees. SIAM Journal on Computing, 5(1):115-132, 1976.
  4. Tatsuya Akutsu. A linear time pattern matching algorithm between a string and a tree. In Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching (CPM 1993), pages 1-10. Springer, 1993.
  5. Jarno Alanko, Giovanna D'Agostino, Alberto Policriti, and Nicola Prezza. Regular languages meet prefix sorting. In Proceedings of the 31st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2020), pages 911-930. SIAM, 2020.
  6. Jarno Alanko, Giovanna D'Agostino, Alberto Policriti, and Nicola Prezza. Wheeler languages. Information and Computation, 281:104820, 2021.
  7. Stephen Alstrup, Cyril Gavoille, Haim Kaplan, and Theis Rauhe. Nearest common ancestors: A survey and a new algorithm for a distributed environment. Theory of Computing Systems, 37(3):441-456, 2004.
  8. Amihood Amir, Moshe Lewenstein, and Noa Lewenstein. Pattern matching in hypertext. Journal of Algorithms, 35(1):82-99, 2000.
  9. Arne Andersson, Torben Hagerup, Stefan Nilsson, and Rajeev Raman. Sorting in linear time? Journal of Computer and System Sciences, 57(1):74-93, 1998.
  10. Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan Reutter, and Domagoj Vrgoč. Foundations of modern query languages for graph databases. ACM Computing Surveys, 50(5):1-40, 2017.
  11. Renzo Angles and Claudio Gutierrez. Survey of graph database models. ACM Computing Surveys, 40(1):1-39, 2008.
  12. Arturs Backurs and Piotr Indyk. Which regular expression patterns are hard to match? In Proceedings of the 57th IEEE Annual Symposium on Foundations of Computer Science (FOCS 2016), pages 457-466. IEEE, 2016.
  13. Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). SIAM Journal on Computing, 47(3):1087-1097, 2018. URL: https://doi.org/10.1137/15M1053128.
  14. Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al. Modern information retrieval, volume 463. ACM Press, New York, 1999.
  15. Ricardo Baeza-Yates and Alejandro Salinger. Fast intersection algorithms for sorted sequences. In Algorithms and Applications, pages 45-61. Springer, 2010.
  16. Pablo Barceló Baeza. Querying graph databases. In Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2013), pages 175-188, 2013.
  17. Michael A Bender and Martin Farach-Colton. The LCA problem revisited. In Proceedings of the 4th Latin American Symposium on Theoretical Informatics (LATIN 2000), pages 88-94. Springer, 2000.
  18. Michael A Bender, Martin Farach-Colton, Giridhar Pemmasani, Steven Skiena, and Pavel Sumazin. Lowest common ancestors in trees and directed acyclic graphs. Journal of Algorithms, 57(2):75-94, 2005.
  19. Jon Louis Bentley and Andrew Chi-Chih Yao. An almost optimal algorithm for unbounded searching. Information Processing Letters, 5(SLAC-PUB-1679), 1976.
  20. Omer Berkman and Uzi Vishkin. Recursive star-tree parallel data structure. SIAM Journal on Computing, 22(2):221-242, 1993.
  21. Giulia Bernardini, Pawel Gawrychowski, Nadia Pisanti, Solon P Pissis, and Giovanna Rosone. Even faster elastic-degenerate string matching via fast matrix multiplication. In Proceedings of the 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019), volume 132, pages 1-15. Schloss Dagstuhl-Leibniz Center for Informatics, 2019.
  22. Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de Bruijn graphs. In Proceedings of the 12th International Workshop on Algorithms in Bioinformatics (WABI 2012), pages 225-235. Springer, 2012.
  23. Manuel Cáceres, Massimo Cairo, Brendan Mumey, Romeo Rizzi, and Alexandru I Tomescu. A linear-time parameterized algorithm for computing the width of a DAG. In Proceedings of the 47th International Workshop on Graph-Theoretic Concepts in Computer Science (WG 2021), pages 257-269. Springer, 2021.
  24. Manuel Cáceres, Massimo Cairo, Brendan Mumey, Romeo Rizzi, and Alexandru I Tomescu. Minimum path cover in parameterized linear time. arXiv preprint arXiv:2211.09659, 2022.
  25. Manuel Cáceres, Massimo Cairo, Brendan Mumey, Romeo Rizzi, and Alexandru I Tomescu. Sparsifying, shrinking and splicing for minimum path cover in parameterized linear time. In Proceedings of the 33rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2022), pages 359-376. SIAM, 2022.
  26. Manuel Cáceres, Brendan Mumey, Edin Husic, Romeo Rizzi, Massimo Cairo, Kristoffer Sahlin, and Alexandru I Tomescu. Safety in multi-assembly via paths appearing in all path covers of a DAG. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 19(6):3673-3684, 2021.
  27. Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics, 19(1):118-135, 2018.
  28. Jeff Conklin. Hypertext: An introduction and survey. Computer, 20(9):17-41, 1987.
  29. Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT Press, 2022.
  30. Nicola Cotumaccio. Graphs can be succinctly indexed for pattern matching in O(|E|^2 + |V|^{5/2}) time. In Proceedings of the 32nd Data Compression Conference (DCC 2022), pages 272-281. IEEE, 2022.
  31. Nicola Cotumaccio and Nicola Prezza. On indexing and compressing finite automata. In Proceedings of the 32nd ACM-SIAM Symposium on Discrete Algorithms (SODA 2021), pages 2585-2599. SIAM, 2021.
  32. Marek Cygan, Fedor V Fomin, Łukasz Kowalik, Daniel Lokshtanov, Dániel Marx, Marcin Pilipczuk, Michał Pilipczuk, and Saket Saurabh. Parameterized algorithms, volume 5. Springer, 2015.
  33. Moshe Dubiner, Zvi Galil, and Edith Magen. Faster tree pattern matching. Journal of the ACM, 41(2):205-213, 1994.
  34. Massimo Equi, Roberto Grossi, Veli Mäkinen, and Alexandru I Tomescu. On the complexity of string matching for graphs. In Proceedings of the 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
  35. Massimo Equi, Veli Mäkinen, and Alexandru I Tomescu. Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless SETH fails. In Proceedings of the 47th International Conference on Current Trends in Theory and Practice of Informatics (SOFSEM 2021), pages 608-622. Springer, 2021.
  36. Massimo Equi, Veli Mäkinen, Alexandru I Tomescu, and Roberto Grossi. On the complexity of string matching for graphs. ACM Transactions on Algorithms, 2023.
  37. Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and Senthilmurugan Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2005), pages 184-193. IEEE, 2005.
  38. Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS 2000), pages 390-398. IEEE, 2000.
  39. Johannes Fischer and Volker Heun. Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE. In Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching (CPM 2006), pages 36-48. Springer, 2006.
  40. Fedor V Fomin, Daniel Lokshtanov, Saket Saurabh, Michał Pilipczuk, and Marcin Wrochna. Fully polynomial-time parameterized computations for graphs and matrices of low treewidth. ACM Transactions on Algorithms, 14(3):1-45, 2018.
  41. Harold N Gabow and Robert Endre Tarjan. A linear-time algorithm for a special case of disjoint set union. Journal of Computer and System Sciences, 30(2):209-221, 1985.
  42. Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for BWT-based data structures. Theoretical Computer Science, 698:67-78, 2017.
  43. Archontia C Giannopoulou, George B Mertzios, and Rolf Niedermeier. Polynomial fixed-parameter algorithms: A case study for longest path on interval graphs. Theoretical Computer Science, 689:67-95, 2017.
  44. Daniel Gibney and Sharma V. Thankachan. On the hardness and inapproximability of recognizing Wheeler graphs. In Proceedings of the 27th Annual European Symposium on Algorithms (ESA 2019), volume 144, pages 51:1-51:16, 2019.
  45. Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the 32nd Annual ACM Symposium on Theory of Computing (STOC 2000), pages 397-406, 2000.
  46. Ming Gu, Martin Farach, and Richard Beigel. An efficient algorithm for dynamic text indexing. In Proceedings of the 5th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 1994), pages 697-704, 1994.
  47. Yijie Han. Deterministic sorting in O(n log log n) time and linear space. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC 2002), pages 602-608, 2002.
  48. Yijie Han and Xiaojun Shen. Conservative algorithms for parallel and sequential integer sorting. In Proceedings of the 1st International Computing and Combinatorics Conference (COCOON 1995), pages 324-333. Springer, 1995.
  49. Tzvika Hartman, Avinatan Hassidim, Haim Kaplan, Danny Raz, and Michal Segalov. How to split a flow? In 2012 Proceedings IEEE INFOCOM, pages 828-836. IEEE, 2012.
  50. Benjamin Grant Jackson. Parallel methods for short read assembly. PhD thesis, Iowa State University, 2009.
  51. Guy Jacobson. Space-efficient static trees and graphs. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science (FOCS 1989), pages 549-554. IEEE Computer Society, 1989.
  52. Chirag Jain, Haowen Zhang, Yu Gao, and Srinivas Aluru. On the complexity of sequence-to-graph alignment. Journal of Computational Biology, 27(4):640-654, 2020.
  53. Arthur B Kahn. Topological sorting of large networks. Communications of the ACM, 5(11):558-562, 1962.
  54. Richard M Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85-103. Springer, 1972.
  55. John D Kececioglu and Eugene W Myers. Combinatorial algorithms for DNA sequence assembly. Algorithmica, 13(1):7-51, 1995.
  56. Shahbaz Khan, Milla Kortelainen, Manuel Cáceres, Lucia Williams, and Alexandru I Tomescu. Improving RNA assembly via safety and completeness in flow decompositions. Journal of Computational Biology, 2022.
  57. Shahbaz Khan, Milla Kortelainen, Manuel Cáceres, Lucia Williams, and Alexandru I Tomescu. Safety and completeness in flow decompositions for RNA assembly. In Proceedings of the 26th International Conference on Research in Computational Molecular Biology (RECOMB 2022), pages 177-192. Springer, 2022.
  58. Carl Kingsford, Michael C Schatz, and Mihai Pop. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics, 11(1):1-11, 2010.
  59. David Kirkpatrick and Stefan Reisch. Upper bounds for sorting integers on random access machines. Theoretical Computer Science, 28(3):263-276, 1983.
  60. Donald E Knuth, James H Morris, Jr, and Vaughan R Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323-350, 1977.
  61. Tomohiro Koana, Viatcheslav Korenwein, André Nichterlein, Rolf Niedermeier, and Philipp Zschoche. Data reduction for maximum matching on real-world graphs: Theory and experiments. Journal of Experimental Algorithmics, 26:1-30, 2021.
  62. Jonas Lehmann. The computational complexity of worst case flows in unreliable flow networks. B.S. thesis, Institute for Theoretical Computer Science, University of Lübeck, 2017.
  63. Carsten Lund and Mihalis Yannakakis. The approximation of maximum subgraph problems. In Proceedings of the 20th International Colloquium on Automata, Languages, and Programming (ICALP 1993), pages 40-51. Springer, 1993.
  64. Jun Ma, Manuel Cáceres, Leena Salmela, Veli Mäkinen, and Alexandru I Tomescu. Chaining for accurate alignment of erroneous long reads to acyclic variation graphs. bioRxiv, 2022.
  65. Thomas Magnanti, R Ahuja, and J Orlin. Network flows: theory, algorithms, and applications. Prentice Hall, Upper Saddle River, NJ, 1993.
  66. Veli Mäkinen, Alexandru I Tomescu, Anna Kuosmanen, Topi Paavilainen, Travis Gagie, and Rayan Chikhi. Sparse dynamic programming on DAGs with small width. ACM Transactions on Algorithms, 15(2):1-21, 2019.
  67. Udi Manber and Sun Wu. Approximate string matching with arbitrary costs for text and hypertext. In Advances In Structural And Syntactic Pattern Recognition, pages 22-33. World Scientific, 1992.
  68. Edward M McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262-272, 1976.
  69. Paul Medvedev, Konstantinos Georgiou, Gene Myers, and Michael Brudno. Computability of models for sequence assembly. In Proceedings of the 7th International Workshop on Algorithms in Bioinformatics (WABI 2007), pages 289-301. Springer, 2007.
  70. Marcelo Garlet Millani, Hendrik Molter, Rolf Niedermeier, and Manuel Sorge. Efficient algorithms for measuring the funnel-likeness of DAGs. Journal of Combinatorial Optimization, 39(1):216-245, 2020.
  71. J Ian Munro and Venkatesh Raman. Succinct representation of balanced parentheses and static trees. SIAM Journal on Computing, 31(3):762-776, 2001.
  72. Gonzalo Navarro. Improved approximate pattern matching on hypertext. Theoretical Computer Science, 237(1-2):455-463, 2000.
  73. Jakob Nielsen. Hypertext and hypermedia. Academic Press Professional, Inc., 1990.
  74. Kunsoo Park and Dong Kyue Kim. String matching in hypertext. In Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching (CPM 1995), pages 318-329. Springer, 1995.
  75. Jorge Pérez, Marcelo Arenas, and Claudio Gutierrez. Semantics and complexity of SPARQL. ACM Transactions on Database Systems, 34(3):1-45, 2009.
  76. Maurice Pollack. The maximum capacity through a network. Operations Research, 8(5):733-736, 1960.
  77. Mikko Rautiainen and Tobias Marschall. Aligning sequences to general graphs in O(V + mE) time. bioRxiv, page 216127, 2017.
  78. Nicola Rizzo, Alexandru I Tomescu, and Alberto Policriti. Solving string problems on graphs using the labeled direct product. Algorithmica, pages 1-26, 2022.
  79. Baruch Schieber and Uzi Vishkin. On finding lowest common ancestors: Simplification and parallelization. SIAM Journal on Computing, 17(6):1253-1262, 1988.
  80. Markus Schulze. A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. Social Choice and Welfare, 36(2):267-303, 2011.
  81. Robert Sedgewick and Kevin Wayne. Algorithms (4th edn). Addison-Wesley, 2011.
  82. Nachum Shacham. Multicast routing of hierarchical data. In [Conference Record] SUPERCOMM/ICC'92 Discovering a New World of Communications, pages 1217-1221. IEEE, 1992.
  83. Jouni Sirén, Niko Välimäki, and Veli Mäkinen. Indexing graphs for path queries with applications in genome research. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(2):375-388, 2014.
  84. Robert E Tarjan. Edge-disjoint spanning trees and depth-first search. Acta Informatica, 6(2):171-185, 1976.
  85. Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, 1995.
  86. Ehsan Ullah, Kyongbum Lee, and Soha Hassoun. An algorithm for identifying dominant-edge metabolic pathways. In 2009 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). Digest of Technical Papers, pages 144-150. IEEE, 2009.
  87. Peter van Emde Boas. Preserving order in a forest in less than logarithmic time. In Proceedings of the 16th Annual Symposium on Foundations of Computer Science (FOCS 1975), pages 75-84. IEEE, 1975.
  88. Benedicte Vatinlen, Fabrice Chauvet, Philippe Chrétienne, and Philippe Mahey. Simple bounds and greedy algorithms for decomposing a flow into a minimal set of paths. European Journal of Operational Research, 185(3):1390-1401, 2008.
  89. Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory (SWAT 1973), pages 1-11. IEEE, 1973.
  90. Dan E Willard. Log-logarithmic worst-case range queries are possible in space Θ(N). Information Processing Letters, 17(2):81-84, 1983.