Faster Algorithms for Longest Common Substring

Charalampopoulos, Panagiotis; Kociumaka, Tomasz; Pissis, Solon P.; Radoszewski, Jakub

doi:10.4230/LIPIcs.ESA.2021.30

Abstract

In the classic longest common substring (LCS) problem, we are given two strings S and T, each of length at most n, over an alphabet of size σ, and we are asked to find a longest string occurring as a fragment of both S and T. Weiner, in his seminal paper that introduced the suffix tree, presented an 𝒪(n log σ)-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an 𝒪(n)-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in 𝒪(n log σ/log n) space and read in 𝒪(n log σ/log n) time. We show that, in this model, we can compute an LCS in time 𝒪(n log σ / √{log n}), which is sublinear in n if σ = 2^{o(√{log n})} (in particular, if σ = 𝒪(1)), using optimal space 𝒪(n log σ/log n).
We then lift our ideas to the problem of computing a k-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of S that occurs in T with at most k mismatches. Flouri et al. showed how to compute a 1-mismatch LCS in 𝒪(n log n) time [IPL 2015]. Thankachan et al. extended this result to computing a k-mismatch LCS in 𝒪(n log^k n) time for k = 𝒪(1) [J. Comput. Biol. 2016]. We show an 𝒪(n log^{k-1/2} n)-time algorithm, for any constant integer k > 0 and irrespective of the alphabet size, using 𝒪(n) space as the previous approaches. We thus notably break through the well-known n log^k n barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with k errors.

Amir Abboud, Richard Ryan Williams, and Huacheng Yu. More applications of the polynomial method to algorithm design. In Piotr Indyk, editor, 26th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, pages 218-230. SIAM, 2015. URL: https://doi.org/10.1137/1.9781611973730.17.
Amihood Amir and Itai Boneh. Locally maximal common factors as a tool for efficient dynamic string algorithms. In Gonzalo Navarro, David Sankoff, and Binhai Zhu, editors, Annual Symposium on Combinatorial Pattern Matching, CPM 2018, July 2-4, 2018 - Qingdao, China, volume 105 of LIPIcs, pages 11:1-11:13. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018. URL: https://doi.org/10.4230/LIPIcs.CPM.2018.11.
Amihood Amir, Panagiotis Charalampopoulos, Solon P. Pissis, and Jakub Radoszewski. Dynamic and internal longest common substring. Algorithmica, 82(12):3707-3743, 2020. URL: https://doi.org/10.1007/s00453-020-00744-0.
Lorraine A. K. Ayad, Carl Barton, Panagiotis Charalampopoulos, Costas S. Iliopoulos, and Solon P. Pissis. Longest common prefixes with k-errors and applications. In Travis Gagie, Alistair Moffat, Gonzalo Navarro, and Ernesto Cuadros-Vargas, editors, String Processing and Information Retrieval - 25th International Symposium, SPIRE 2018, Lima, Peru, October 9-11, 2018, Proceedings, volume 11147 of Lecture Notes in Computer Science, pages 27-41. Springer, 2018. URL: https://doi.org/10.1007/978-3-030-00479-8_3.
Maxim A. Babenko, Paweł Gawrychowski, Tomasz Kociumaka, and Tatiana Starikovskaya. Wavelet trees meet suffix trees. In Piotr Indyk, editor, Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, pages 572-591. SIAM, 2015. URL: https://doi.org/10.1137/1.9781611973730.39.
Maxim A. Babenko and Tatiana Starikovskaya. Computing the longest common substring with one mismatch. Problems of Information Transmission, 47(1):28-33, 2011. URL: https://doi.org/10.1134/S0032946011010030.
Ricardo A. Baeza-Yates. Improved string searching. Software: Practice and Experience, 19(3):257-271, 1989. URL: https://doi.org/10.1002/spe.4380190305.
Hideo Bannai, Tomohiro I, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. The "runs" theorem. SIAM Journal on Computing, 46(5):1501-1514, 2017. URL: https://doi.org/10.1137/15M1011032.
Djamal Belazzougui. Worst case efficient single and multiple string matching in the RAM model. In Costas S. Iliopoulos and William F. Smyth, editors, Combinatorial Algorithms - 21st International Workshop, IWOCA 2010, London, UK, July 26-28, 2010, Revised Selected Papers, volume 6460 of Lecture Notes in Computer Science, pages 90-102. Springer, 2010. URL: https://doi.org/10.1007/978-3-642-19222-7_10.
Djamal Belazzougui. Improved space-time tradeoffs for approximate full-text indexing with one edit error. Algorithmica, 72(3):791-817, 2015. URL: https://doi.org/10.1007/s00453-014-9873-9.
Oren Ben-Kiki, Philip Bille, Dany Breslauer, Leszek Gasieniec, Roberto Grossi, and Oren Weimann. Optimal packed string matching. In Supratik Chakraborty and Amit Kumar, editors, IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2011, December 12-14, 2011, Mumbai, India, volume 13 of LIPIcs, pages 423-432. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2011. URL: https://doi.org/10.4230/LIPIcs.FSTTCS.2011.423.
Oren Ben-Kiki, Philip Bille, Dany Breslauer, Leszek Gasieniec, Roberto Grossi, and Oren Weimann. Towards optimal packed string matching. Theoretical Computer Science, 525:111-129, 2014. URL: https://doi.org/10.1016/j.tcs.2013.06.013.
Stav Ben-Nun, Shay Golan, Tomasz Kociumaka, and Matan Kraus. Time-space tradeoffs for finding a long common substring. In Inge Li Gørtz and Oren Weimann, editors, 31st Annual Symposium on Combinatorial Pattern Matching, CPM 2020, June 17-19, 2020, Copenhagen, Denmark, volume 161 of LIPIcs, pages 5:1-5:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. URL: https://doi.org/10.4230/LIPIcs.CPM.2020.5.
Philip Bille. Fast searching in packed strings. In Gregory Kucherov and Esko Ukkonen, editors, Combinatorial Pattern Matching, 20th Annual Symposium, CPM 2009, Lille, France, June 22-24, 2009, Proceedings, volume 5577 of Lecture Notes in Computer Science, pages 116-126. Springer, 2009. URL: https://doi.org/10.1007/978-3-642-02441-2_11.
Philip Bille. Fast searching in packed strings. Journal of Discrete Algorithms, 9(1):49-56, 2011. URL: https://doi.org/10.1016/j.jda.2010.09.003.
Philip Bille, Inge Li Gørtz, and Frederik Rye Skjoldjensen. Deterministic indexing for packed strings. In Juha Kärkkäinen, Jakub Radoszewski, and Wojciech Rytter, editors, 28th Annual Symposium on Combinatorial Pattern Matching, CPM 2017, July 4-6, 2017, Warsaw, Poland, volume 78 of LIPIcs, pages 6:1-6:11. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017. URL: https://doi.org/10.4230/LIPIcs.CPM.2017.6.
Dany Breslauer, Leszek Gasieniec, and Roberto Grossi. Constant-time word-size string matching. In Juha Kärkkäinen and Jens Stoye, editors, Combinatorial Pattern Matching - 23rd Annual Symposium, CPM 2012, Helsinki, Finland, July 3-5, 2012. Proceedings, volume 7354 of Lecture Notes in Computer Science, pages 83-96. Springer, 2012. URL: https://doi.org/10.1007/978-3-642-31265-6_7.
Karl Bringmann, Philip Wellnitz, and Marvin Künnemann. Few matches or almost periodicity: Faster pattern matching with mismatches in compressed texts. In Timothy M. Chan, editor, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 1126-1145. SIAM, 2019. URL: https://doi.org/10.1137/1.9781611975482.69.
Gerth Stølting Brodal and Christian N. S. Pedersen. Finding maximal quasiperiodicities in strings. In Raffaele Giancarlo and David Sankoff, editors, Combinatorial Pattern Matching, 11th Annual Symposium, CPM 2000, Montreal, Canada, June 21-23, 2000, Proceedings, volume 1848 of Lecture Notes in Computer Science, pages 397-411. Springer, 2000. URL: https://doi.org/10.1007/3-540-45123-4_33.
Mark R. Brown and Robert Endre Tarjan. A fast merging algorithm. Journal of the ACM, 26(2):211-226, 1979. URL: https://doi.org/10.1145/322123.322127.
Domenico Cantone and Simone Faro. Pattern matching with swaps for short patterns in linear time. In Mogens Nielsen, Antonín Kucera, Peter Bro Miltersen, Catuscia Palamidessi, Petr Tuma, and Frank D. Valencia, editors, SOFSEM 2009: Theory and Practice of Computer Science, 35th Conference on Current Trends in Theory and Practice of Computer Science, Spindleruv Mlýn, Czech Republic, January 24-30, 2009. Proceedings, volume 5404 of Lecture Notes in Computer Science, pages 255-266. Springer, 2009. URL: https://doi.org/10.1007/978-3-540-95891-8_25.
Ho-Leung Chan, Tak Wah Lam, Wing-Kin Sung, Siu-Lung Tam, and Swee-Seong Wong. Compressed indexes for approximate string matching. Algorithmica, 58(2):263-281, 2010. URL: https://doi.org/10.1007/s00453-008-9263-2.
Ho-Leung Chan, Tak Wah Lam, Wing-Kin Sung, Siu-Lung Tam, and Swee-Seong Wong. A linear size index for approximate pattern matching. Journal of Discrete Algorithms, 9(4):358-364, 2011. URL: https://doi.org/10.1016/j.jda.2011.04.004.
Panagiotis Charalampopoulos, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Linear-time algorithm for long LCF with k mismatches. In Gonzalo Navarro, David Sankoff, and Binhai Zhu, editors, Annual Symposium on Combinatorial Pattern Matching, CPM 2018, July 2-4, 2018 - Qingdao, China, volume 105 of LIPIcs, pages 23:1-23:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018. URL: https://doi.org/10.4230/LIPIcs.CPM.2018.23.
Panagiotis Charalampopoulos, Paweł Gawrychowski, and Karol Pokorski. Dynamic longest common substring in polylogarithmic time. In Artur Czumaj, Anuj Dawar, and Emanuela Merelli, editors, 47th International Colloquium on Automata, Languages, and Programming, ICALP 2020, July 8-11, 2020, Saarbrücken, Germany (Virtual Conference), volume 168 of LIPIcs, pages 27:1-27:19. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. URL: https://doi.org/10.4230/LIPIcs.ICALP.2020.27.
Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, and Wiktor Zuba. Circular pattern matching with k mismatches. Journal of Computer and System Sciences, 115:73-85, 2021. URL: https://doi.org/10.1016/j.jcss.2020.07.003.
Archie L. Cobbs. Fast approximate matching using suffix trees. In Zvi Galil and Esko Ukkonen, editors, Combinatorial Pattern Matching, 6th Annual Symposium, CPM 95, Espoo, Finland, July 5-7, 1995, Proceedings, volume 937 of Lecture Notes in Computer Science, pages 41-54. Springer, 1995. URL: https://doi.org/10.1007/3-540-60044-2_33.
Vincent Cohen-Addad, Laurent Feuilloley, and Tatiana Starikovskaya. Lower bounds for text indexing with mismatches and differences. In Timothy M. Chan, editor, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 1146-1164. SIAM, 2019. URL: https://doi.org/10.1137/1.9781611975482.70.
Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don't cares. In László Babai, editor, Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 91-100. ACM, 2004. URL: https://doi.org/10.1145/1007352.1007374.
Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. Cambridge University Press, 2007.
Maxime Crochemore, Costas S. Iliopoulos, Manal Mohamed, and Marie-France Sagot. Longest repeats with a block of k don't cares. Theoretical Computer Science, 362(1-3):248-254, 2006. URL: https://doi.org/10.1016/j.tcs.2006.06.029.
Jonas Ellert and Johannes Fischer. Linear time runs over general ordered alphabets. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference), volume 198 of LIPIcs, pages 63:1-63:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPIcs.ICALP.2021.63.
Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS '97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137-143. IEEE Computer Society, 1997. URL: https://doi.org/10.1109/SFCS.1997.646102.
Nathan J. Fine and Herbert S. Wilf. Uniqueness theorems for periodic functions. Proceedings of the American Mathematical Society, 16(1):109-114, 1965. URL: http://www.jstor.org/stable/2034009.
Tomás Flouri, Emanuele Giaquinta, Kassian Kobert, and Esko Ukkonen. Longest common substrings with k mismatches. Information Processing Letters, 115(6-8):643-647, 2015. URL: https://doi.org/10.1016/j.ipl.2015.03.006.
Kimmo Fredriksson. Faster string matching with super-alphabets. In Alberto H. F. Laender and Arlindo L. Oliveira, editors, String Processing and Information Retrieval, 9th International Symposium, SPIRE 2002, Lisbon, Portugal, September 11-13, 2002, Proceedings, volume 2476 of Lecture Notes in Computer Science, pages 44-57. Springer, 2002. URL: https://doi.org/10.1007/3-540-45735-6_5.
Kimmo Fredriksson. Shift-or string matching with super-alphabets. Information Processing Letters, 87(4):201-204, 2003. URL: https://doi.org/10.1016/S0020-0190(03)00296-5.
François Le Gall and Saeed Seddighin. Quantum meets fine-grained complexity: Sublinear time quantum algorithms for string problems. CoRR, abs/2010.12122, 2020. URL: http://arxiv.org/abs/2010.12122.
Emanuele Giaquinta, Szymon Grabowski, and Kimmo Fredriksson. Approximate pattern matching with k-mismatches in packed text. Information Processing Letters, 113(19-21):693-697, 2013. URL: https://doi.org/10.1016/j.ipl.2013.07.002.
Szymon Grabowski. A note on the longest common substring with k-mismatches problem. Information Processing Letters, 115(6-8):640-642, 2015. URL: https://doi.org/10.1016/j.ipl.2015.03.003.
Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 12-14, 2003, Baltimore, Maryland, USA, pages 841-850. ACM/SIAM, 2003. URL: http://dl.acm.org/citation.cfm?id=644108.644250.
Yijie Han. Deterministic sorting in O(n log log n) time and linear space. Journal of Algorithms, 50(1):96-105, 2004. URL: https://doi.org/10.1016/j.jalgor.2003.09.001.
Lucas Chi Kwong Hui. Color set size problem with application to string matching. In Alberto Apostolico, Maxime Crochemore, Zvi Galil, and Udi Manber, editors, Combinatorial Pattern Matching, Third Annual Symposium, CPM 92, Tucson, Arizona, USA, April 29 - May 1, 1992, Proceedings, volume 644 of Lecture Notes in Computer Science, pages 230-243. Springer, 1992. URL: https://doi.org/10.1007/3-540-56024-6_19.
Trinh N. D. Huynh, Wing-Kai Hon, Tak Wah Lam, and Wing-Kin Sung. Approximate string matching using compressed suffix arrays. Theoretical Computer Science, 352(1-3):240-249, 2006. URL: https://doi.org/10.1016/j.tcs.2005.11.022.
Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-SAT. Journal of Computer and System Sciences, 62(2):367-375, 2001. URL: https://doi.org/10.1006/jcss.2000.1727.
Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have strongly exponential complexity? Journal of Computer and System Sciences, 63(4):512-530, 2001. URL: https://doi.org/10.1006/jcss.2001.1774.
Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 756-767. ACM, 2019. URL: https://doi.org/10.1145/3313276.3316368.
Shmuel Tomi Klein and Miri Ben-Nissan. Accelerating Boyer Moore searches on binary texts. In Jan Holub and Jan Zdárek, editors, Implementation and Application of Automata, 12th International Conference, CIAA 2007, Prague, Czech Republic, July 16-18, 2007, Revised Selected Papers, volume 4783 of Lecture Notes in Computer Science, pages 130-143. Springer, 2007. URL: https://doi.org/10.1007/978-3-540-76336-9_14.
Tomasz Kociumaka, Jakub Radoszewski, and Tatiana Starikovskaya. Longest common substring with approximately k mismatches. Algorithmica, 81(6):2633-2652, 2019. URL: https://doi.org/10.1007/s00453-019-00548-x.
Tomasz Kociumaka, Tatiana Starikovskaya, and Hjalte Wedel Vildhøj. Sublinear space algorithms for the longest common substring problem. In Algorithms - ESA 2014 - 22th Annual European Symposium, Wroclaw, Poland, September 8-10, 2014. Proceedings, pages 605-617, 2014. URL: https://doi.org/10.1007/978-3-662-44777-2_50.
Roman M. Kolpakov and Gregory Kucherov. Finding maximal repetitions in a word in linear time. In 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA, pages 596-604. IEEE Computer Society, 1999. URL: https://doi.org/10.1109/SFFCS.1999.814634.
Gad M. Landau and Uzi Vishkin. Fast string matching with k differences. Journal of Computer and System Sciences, 37(1):63-78, 1988. URL: https://doi.org/10.1016/0022-0000(88)90045-1.
J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Text indexing and searching in sublinear time. In Inge Li Gørtz and Oren Weimann, editors, 31st Annual Symposium on Combinatorial Pattern Matching, CPM 2020, June 17-19, 2020, Copenhagen, Denmark, volume 161 of LIPIcs, pages 24:1-24:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. URL: https://doi.org/10.4230/LIPIcs.CPM.2020.24.
J. Ian Munro, Yakov Nekrich, and Jeffrey Scott Vitter. Fast construction of wavelet trees. Theoretical Computer Science, 638:91-97, 2016. URL: https://doi.org/10.1016/j.tcs.2015.11.011.
Gonzalo Navarro and Mathieu Raffinot. A bit-parallel approach to suffix automata: Fast extended string matching. In Martin Farach-Colton, editor, Combinatorial Pattern Matching, 9th Annual Symposium, CPM 98, Piscataway, New Jersey, USA, July 20-22, 1998, Proceedings, volume 1448 of Lecture Notes in Computer Science, pages 14-33. Springer, 1998. URL: https://doi.org/10.1007/BFb0030778.
Tatiana Starikovskaya and Hjalte Wedel Vildhøj. Time-space trade-offs for the longest common substring problem. In Johannes Fischer and Peter Sanders, editors, Combinatorial Pattern Matching, 24th Annual Symposium, CPM 2013, Bad Herrenalb, Germany, June 17-19, 2013. Proceedings, volume 7922 of Lecture Notes in Computer Science, pages 223-234. Springer, 2013. URL: https://doi.org/10.1007/978-3-642-38905-4_22.
Jorma Tarhio and Hannu Peltola. String matching in the DNA alphabet. Software: Practice and Experience, 27(7):851-861, 1997.
Sharma V. Thankachan, Chaitanya Aluru, Sriram P. Chockalingam, and Srinivas Aluru. Algorithmic framework for approximate matching under bounded edits with applications to sequence analysis. In Benjamin J. Raphael, editor, Research in Computational Molecular Biology - 22nd Annual International Conference, RECOMB 2018, Paris, France, April 21-24, 2018, Proceedings, volume 10812 of Lecture Notes in Computer Science, pages 211-224. Springer, 2018. URL: https://doi.org/10.1007/978-3-319-89929-9_14.
Sharma V. Thankachan, Alberto Apostolico, and Srinivas Aluru. A provably efficient algorithm for the k-mismatch average common substring problem. Journal of Computational Biology, 23(6):472-482, 2016. URL: https://doi.org/10.1089/cmb.2015.0235.
Dekel Tsur. Fast index for approximate string matching. Journal of Discrete Algorithms, 8(4):339-345, 2010. URL: https://doi.org/10.1016/j.jda.2010.08.002.
Esko Ukkonen. Approximate string-matching over suffix trees. In Alberto Apostolico, Maxime Crochemore, Zvi Galil, and Udi Manber, editors, Combinatorial Pattern Matching, 4th Annual Symposium, CPM 93, Padova, Italy, June 2-4, 1993, Proceedings, volume 684 of Lecture Notes in Computer Science, pages 228-242. Springer, 1993. URL: https://doi.org/10.1007/BFb0029808.
Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1-11. IEEE Computer Society, 1973. URL: https://doi.org/10.1109/SWAT.1973.13.

Faster Algorithms for Longest Common Substring

Authors Panagiotis Charalampopoulos , Tomasz Kociumaka , Solon P. Pissis , Jakub Radoszewski

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message

Faster Algorithms for Longest Common Substring

Authors Panagiotis Charalampopoulos , Tomasz Kociumaka , Solon P. Pissis , Jakub Radoszewski

File

Document Identifiers

Related Versions

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

References

Thanks for your feedback!

Could not send message