Longest Common Substring with Gaps and Related Problems

Authors: Aranya Banerjee, Daniel Gibney, Sharma V. Thankachan



Author Details

Aranya Banerjee
  • Georgia Institute of Technology, Atlanta, GA, USA
Daniel Gibney
  • University of Texas at Dallas, Richardson, TX, USA
Sharma V. Thankachan
  • North Carolina State University, Raleigh, NC, USA

Cite As

Aranya Banerjee, Daniel Gibney, and Sharma V. Thankachan. Longest Common Substring with Gaps and Related Problems. In 32nd Annual European Symposium on Algorithms (ESA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 308, pp. 16:1-16:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024) https://doi.org/10.4230/LIPIcs.ESA.2024.16

Abstract

The longest common substring (also known as longest common factor) and longest common subsequence problems are two well-studied classical string problems. The former is solvable in optimal 𝒪(n) time for two strings of length m and n with m ≤ n, and the latter is solvable in 𝒪(nm) time, which is conditionally optimal under the Strong Exponential Time Hypothesis. In this work, we study the problem of longest common factor with gaps, that is, finding a set of at most k matching substrings obeying precedence conditions with maximum total length. For k = 1, this is equivalent to the longest common factor problem, and for k = m, this is equivalent to the longest common subsequence problem. Our work demonstrates that, for constant k, this problem can be solved in strongly subquadratic time, i.e., nm^{1 - Θ(1)}. Motivated by co-linear chaining applications in Computational Biology, we further demonstrate that the longest common factor with gaps results can be extended to the case where the matches are restricted to maximal exact matches (MEMs). To further demonstrate the applicability of our techniques, we show that a similar approach can be used for a restricted version of the episode matching problem where one seeks an ordered set of at most k matches whose concatenation equals a query pattern P and the length of the substring of T containing the matches is minimized. These solutions all run in strongly subquadratic time for constant k.
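To make the problem statement above concrete, the following is a minimal sketch of a straightforward dynamic program for the longest common factor with gaps. It is a naive 𝒪(k·n·m)-time baseline written only for illustration, not the strongly subquadratic algorithm of the paper, and it ignores the MEM-restricted variant. The function name `lcf_with_gaps` and the table names `run`, `best`, and `prev_best` are assumptions introduced here.

```python
def lcf_with_gaps(s: str, t: str, k: int) -> int:
    """Maximum total length of at most k matching substrings of s and t
    that appear in the same left-to-right order in both strings.

    Naive O(k * len(s) * len(t)) baseline for illustration only.
    """
    m, n = len(s), len(t)
    # prev_best[i][j]: optimum for s[:i], t[:j] using one block fewer
    # than the current round (all zeros before the first round).
    prev_best = [[0] * (n + 1) for _ in range(m + 1)]

    # More than min(m, n) blocks can never help, so cap the rounds.
    for _ in range(min(k, m, n)):
        # run[i][j]: best total when the last block ends exactly at
        # s[i-1] and t[j-1]; stays 0 when those characters differ.
        run = [[0] * (n + 1) for _ in range(m + 1)]
        # best[i][j]: optimum for s[:i], t[:j] with the current budget.
        best = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if s[i - 1] == t[j - 1]:
                    # Either extend the current block diagonally, or
                    # start a new block after a solution with one
                    # block fewer.
                    run[i][j] = 1 + max(run[i - 1][j - 1],
                                        prev_best[i - 1][j - 1])
                best[i][j] = max(best[i - 1][j], best[i][j - 1], run[i][j])
        prev_best = best

    return prev_best[m][n]


if __name__ == "__main__":
    # k = 1 behaves like longest common substring, large k like LCS.
    print(lcf_with_gaps("abcxyz", "abcqxyz", 1))
    print(lcf_with_gaps("abcxyz", "abcqxyz", 2))
    print(lcf_with_gaps("abcde", "ace", 3))
```

With k = 1 the recurrence collapses to the textbook longest common substring table, and once k reaches the length of the shorter string it returns the longest common subsequence length, matching the two boundary cases named in the abstract.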

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • Pattern Matching
  • Longest Common Subsequence
  • Episode Matching

