Time-Space Tradeoffs for Finding a Long Common Substring

Ben-Nun, Stav; Golan, Shay; Kociumaka, Tomasz; Kraus, Matan

doi:10.4230/LIPIcs.CPM.2020.5

Abstract

We consider the problem of finding, given two documents of total length n, a longest string occurring as a substring of both documents. This problem, known as the Longest Common Substring (LCS) problem, has a classic 𝒪(n)-time solution dating back to the discovery of suffix trees (Weiner, 1973) and their efficient construction for integer alphabets (Farach-Colton, 1997). However, these solutions require Θ(n) space, which is prohibitive in many applications. To address this issue, Starikovskaya and Vildhøj (CPM 2013) showed that for n^{2/3} ≤ s ≤ n, the LCS problem can be solved in 𝒪(s) space and 𝒪̃(n²/s) time. Kociumaka et al. (ESA 2014) generalized this tradeoff to 1 ≤ s ≤ n, thus providing a smooth time-space tradeoff from constant to linear space. In this paper, we obtain a significant speed-up for instances where the length L of the sought LCS is large. For 1 ≤ s ≤ n, we show that the LCS problem can be solved in 𝒪(s) space and 𝒪̃(n²/(L⋅s) +n) time. The result is based on techniques originating from the LCS with Mismatches problem (Flouri et al., 2015; Charalampopoulos et al., CPM 2018), on space-efficient locally consistent parsing (Birenzwige et al., SODA 2020), and on the structure of maximal repetitions (runs) in the input documents.

Amir Abboud, Richard Ryan Williams, and Huacheng Yu. More applications of the polynomial method to algorithm design. In Piotr Indyk, editor, 26th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, pages 218-230. SIAM, 2015. URL: https://doi.org/10.1137/1.9781611973730.17.
Mai Alzamel, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszynski, Tomasz Walen, and Wiktor Zuba. Quasi-linear-time algorithm for longest common circular factor. In 30th Annual Symposium on Combinatorial Pattern Matching, CPM 2019, June 18-20, 2019, Pisa, Italy, pages 25:1-25:14, 2019. URL: https://doi.org/10.4230/LIPIcs.CPM.2019.25.
Amihood Amir, Panagiotis Charalampopoulos, Costas S. Iliopoulos, Solon P. Pissis, and Jakub Radoszewski. Longest common factor after one edit operation. In Gabriele Fici, Marinella Sciortino, and Rossano Venturini, editors, 24th International Symposium on String Processing and Information Retrieval, SPIRE 2017, volume 10508 of LNCS, pages 14-26. Springer, 2017. URL: https://doi.org/10.1007/978-3-319-67428-5_2.
Amihood Amir, Panagiotis Charalampopoulos, Solon P. Pissis, and Jakub Radoszewski. Longest common substring made fully dynamic. In 27th Annual European Symposium on Algorithms, ESA 2019, pages 6:1-6:17, 2019. URL: https://doi.org/10.4230/LIPIcs.ESA.2019.6.
Lorraine A. K. Ayad, Carl Barton, Panagiotis Charalampopoulos, Costas S. Iliopoulos, and Solon P. Pissis. Longest common prefixes with k-errors and applications. In Travis Gagie, Alistair Moffat, Gonzalo Navarro, and Ernesto Cuadros-Vargas, editors, 25th International Symposium on String Processing and Information Retrieval, SPIRE 2018, volume 11147 of LNCS, pages 27-41. Springer, 2018. URL: https://doi.org/10.1007/978-3-030-00479-8_3.
Maxim A. Babenko and Tatiana Starikovskaya. Computing the longest common substring with one mismatch. Problems of Information Transmission, 47(1):28-33, 2011. URL: https://doi.org/10.1134/S0032946011010030.
Hideo Bannai, Tomohiro I, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. The "runs" theorem. SIAM J. Comput., 46(5):1501-1514, 2017. URL: https://doi.org/10.1137/15M1011032.
Philip Bille, Johannes Fischer, Inge Li Gørtz, Tsvi Kopelowitz, Benjamin Sach, and Hjalte Wedel Vildhøj. Sparse text indexing in small space. ACM Transactions on Algorithms, 12(3):39:1-39:19, 2016. URL: https://doi.org/10.1145/2836166.
Or Birenzwige, Shay Golan, and Ely Porat. Locally consistent parsing for text indexing in small space. In 31st Annual ACM-SIAM Symposium on Discrete Algorithms, pages 607-626. Society for Industrial and Applied Mathematics, 2020. URL: https://doi.org/10.1137/1.9781611975994.37.
Rene De La Briandais. File searching using variable length keys. In Western Joint Computer Conference, pages 295-298. ACM Press, 1959. URL: https://doi.org/10.1145/1457838.1457895.
Panagiotis Charalampopoulos, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Linear-time algorithm for long LCF with k mismatches. In Gonzalo Navarro, David Sankoff, and Binhai Zhu, editors, 29th Annual Symposium on Combinatorial Pattern Matching, CPM 2018, volume 105 of LIPIcs, pages 23:1-23:16. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2018. URL: https://doi.org/10.4230/LIPIcs.CPM.2018.23.
Panagiotis Charalampopoulos, Paweł Gawrychowski, and Karol Pokorski. Dynamic longest common substring in polylogarithmic time. In 45th International Colloquium on Automata, Languages, and Programming, ICALP 2020, LIPIcs. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
Maxime Crochemore, Costas S. Iliopoulos, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. The maximal number of cubic runs in a word. J. Comput. Syst. Sci., 78(6):1828-1836, 2012. URL: https://doi.org/10.1016/j.jcss.2011.12.005.
Maxime Crochemore, Costas S. Iliopoulos, Manal Mohamed, and Marie-France Sagot. Longest repeats with a block of k don't cares. Theoretical Computer Science, 362(1-3):248-254, 2006. URL: https://doi.org/10.1016/j.tcs.2006.06.029.
Maxime Crochemore and Dominique Perrin. Two-way string matching. Journal of the ACM, 38(3):651-675, 1991. URL: https://doi.org/10.1145/116825.116845.
Jean-Pierre Duval. Factorizing words over an ordered alphabet. Journal of Algorithms, 4(4):363-381, 1983. URL: https://doi.org/10.1016/0196-6774(83)90017-2.
Martin Farach-Colton. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS 1997, pages 137-143, 1997. URL: https://doi.org/10.1109/SFCS.1997.646102.
Nathan J. Fine and Herbert S. Wilf. Uniqueness theorems for periodic functions. Proceedings of the American Mathematical Society, 16(1):109-114, 1965. URL: https://doi.org/10.1090/S0002-9939-1965-0174934-9.
Johannes Fischer, Travis Gagie, Paweł Gawrychowski, and Tomasz Kociumaka. Approximating LZ77 via small-space multiple-pattern matching. In Nikhil Bansal and Irene Finocchi, editors, 23rd Annual European Symposium on Algorithms, ESA 2015, volume 9294 of LNCS, pages 533-544. Springer, 2015. URL: https://doi.org/10.1007/978-3-662-48350-3_45.
Tomás Flouri, Emanuele Giaquinta, Kassian Kobert, and Esko Ukkonen. Longest common substrings with k mismatches. Information Processing Letters, 115(6-8):643-647, 2015. URL: https://doi.org/10.1016/j.ipl.2015.03.006.
Edward Fredkin. Trie memory. Communications of the ACM, 3(9):490-499, 1960. URL: https://doi.org/10.1145/367390.367400.
Mitsuru Funakoshi, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Faster queries for longest substring palindrome after block edit. In 30th Annual Symposium on Combinatorial Pattern Matching, CPM 2019, pages 27:1-27:13, 2019. URL: https://doi.org/10.4230/LIPIcs.CPM.2019.27.
Zvi Galil and Joel I. Seiferas. Time-space-optimal string matching. Journal of Computer and System Sciences, 26(3):280-294, 1983. URL: https://doi.org/10.1016/0022-0000(83)90002-8.
Pawel Gawrychowski and Tomasz Kociumaka. Sparse suffix tree construction in optimal time and space. In Philip N. Klein, editor, 28th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, pages 425-439. SIAM, 2017. URL: https://doi.org/10.1137/1.9781611974782.27.
Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378-407, 2005. URL: https://doi.org/10.1137/S0097539702402354.
Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997. URL: https://doi.org/10.1017/cbo9780511574931.
Lucas Chi Kwong Hui. Color set size problem with applications to string matching. In Alberto Apostolico, Maxime Crochemore, Zvi Galil, and Udi Manber, editors, 3rd Annual Symposium on Combinatorial Pattern Matching, CPM 1992, volume 644 of LNCS, pages 230-243. Springer, 1992. URL: https://doi.org/10.1007/3-540-56024-6_19.
Tomohiro I, Juha Kärkkäinen, and Dominik Kempa. Faster sparse suffix sorting. In Ernst W. Mayr and Natacha Portier, editors, 31st International Symposium on Theoretical Aspects of Computer Science, STACS 2014, volume 25 of LIPIcs, pages 386-396. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2014. URL: https://doi.org/10.4230/LIPIcs.STACS.2014.386.
Juha Kärkkäinen and Esko Ukkonen. Sparse suffix trees. In Jin-yi Cai and C. K. Wong, editors, 2nd Annual International Conference on Computing and Combinatorics, COCOON 1996, volume 1090 of LNCS, pages 219-230. Springer, 1996. URL: https://doi.org/10.1007/3-540-61332-3_155.
Tomasz Kociumaka, Jakub Radoszewski, and Tatiana Starikovskaya. Longest common substring with approximately k mismatches. Algorithmica, 81(6):2633-2652, 2019. URL: https://doi.org/10.1007/s00453-019-00548-x.
Tomasz Kociumaka, Tatiana Starikovskaya, and Hjalte Wedel Vildhøj. Sublinear space algorithms for the longest common substring problem. In Andreas S. Schulz and Dorothea Wagner, editors, 22th Annual European Symposium on Algorithms, ESA 2014, volume 8737 of LNCS, pages 605-617. Springer, 2014. URL: https://doi.org/10.1007/978-3-662-44777-2_50.
Roman M. Kolpakov and Gregory Kucherov. Finding maximal repetitions in a word in linear time. In 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA, pages 596-604, 1999. URL: https://doi.org/10.1109/SFFCS.1999.814634.
Mamoru Maekawa. A √n algorithm for mutual exclusion in decentralized systems. ACM Transactions on Computer Systems, 3(2):145-159, 1985. URL: https://doi.org/10.1145/214438.214445.
Donald R. Morrison. PATRICIA - practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM, 15(4):514-534, 1968. URL: https://doi.org/10.1145/321479.321481.
Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1):2, 2007. URL: https://doi.org/10.1145/1216370.1216372.
Tatiana Starikovskaya and Hjalte Wedel Vildhøj. Time-space trade-offs for the longest common substring problem. In Johannes Fischer and Peter Sanders, editors, 24th Annual Symposium on Combinatorial Pattern Matching,CPM 2013, volume 7922 of LNCS, pages 223-234. Springer, 2013. URL: https://doi.org/10.1007/978-3-642-38905-4_22.
Sharma V. Thankachan, Chaitanya Aluru, Sriram P. Chockalingam, and Srinivas Aluru. Algorithmic framework for approximate matching under bounded edits with applications to sequence analysis. In Benjamin J. Raphael, editor, 22nd Annual International Conference on Research in Computational Molecular Biology, RECOMB 2018, volume 10812 of LNCS, pages 211-224. Springer, 2018. URL: https://doi.org/10.1007/978-3-319-89929-9_14.
Sharma V. Thankachan, Alberto Apostolico, and Srinivas Aluru. A provably efficient algorithm for the k-mismatch average common substring problem. Journal of Computational Biology, 23(6):472-482, 2016. URL: https://doi.org/10.1089/cmb.2015.0235.
Derrien Thomas, Estellé Jordi, Marco Sola Santiago, Knowles David G., Raineri Emanuele, Guigó Roderic, and Ribeca Paolo. Fast computation and applications of genome mappability. PLOS ONE, 7:1-16, January 2012. URL: https://doi.org/10.1371/journal.pone.0030377.
Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, SWAT 1973, pages 1-11, Washington, DC, USA, 1973. IEEE Computer Society. URL: https://doi.org/10.1109/SWAT.1973.13.

Time-Space Tradeoffs for Finding a Long Common Substring

Authors Stav Ben-Nun, Shay Golan , Tomasz Kociumaka , Matan Kraus

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message

Time-Space Tradeoffs for Finding a Long Common Substring

Authors Stav Ben-Nun, Shay Golan , Tomasz Kociumaka , Matan Kraus

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

References

Thanks for your feedback!

Could not send message