eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-08-31
30:1
30:17
10.4230/LIPIcs.ESA.2021.30
article
Faster Algorithms for Longest Common Substring
Charalampopoulos, Panagiotis
1
https://orcid.org/0000-0002-6024-1557
Kociumaka, Tomasz
2
https://orcid.org/0000-0002-2477-1702
Pissis, Solon P.
3
4
https://orcid.org/0000-0002-1445-1932
Radoszewski, Jakub
5
6
https://orcid.org/0000-0002-0067-6401
The Interdisciplinary Center Herzliya, Israel
University of California, Berkeley, CA, USA
CWI, Amsterdam, The Netherlands
Vrije Universiteit, Amsterdam, The Netherlands
Institute of Informatics, University of Warsaw, Poland
Samsung R&D, Warsaw, Poland
In the classic longest common substring (LCS) problem, we are given two strings S and T, each of length at most n, over an alphabet of size σ, and we are asked to find a longest string occurring as a fragment of both S and T. Weiner, in his seminal paper that introduced the suffix tree, presented an 𝒪(n log σ)-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an 𝒪(n)-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in 𝒪(n log σ/log n) space and read in 𝒪(n log σ/log n) time. We show that, in this model, we can compute an LCS in time 𝒪(n log σ / √{log n}), which is sublinear in n if σ = 2^{o(√{log n})} (in particular, if σ = 𝒪(1)), using optimal space 𝒪(n log σ/log n).
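To make the problem statement concrete, the following is a minimal sketch of the textbook quadratic-time dynamic program for LCS. It is not the algorithm of this paper, nor the suffix-tree approach of Weiner or Farach; it only illustrates what is being computed. The function name `longest_common_substring` is an assumption chosen for illustration.

```python
def longest_common_substring(s: str, t: str) -> str:
    """Return one longest string occurring as a fragment of both s and t.

    Textbook dynamic program: cur[j] is the length of the longest common
    suffix of s[:i] and t[:j]. Runs in time proportional to len(s) * len(t),
    far slower than the bounds discussed in the abstract.
    """
    best_len, best_end = 0, 0          # best fragment is s[best_end - best_len:best_end]
    prev = [0] * (len(t) + 1)          # DP row for the previous prefix of s
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1   # extend the common suffix by one character
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return s[best_end - best_len:best_end]
```

For example, `longest_common_substring("banana", "ananas")` returns `"anana"`.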
We then lift our ideas to the problem of computing a k-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of S that occurs in T with at most k mismatches. Flouri et al. showed how to compute a 1-mismatch LCS in 𝒪(n log n) time [IPL 2015]. Thankachan et al. extended this result to computing a k-mismatch LCS in 𝒪(n log^k n) time for k = 𝒪(1) [J. Comput. Biol. 2016]. We show an 𝒪(n log^{k-1/2} n)-time algorithm, for any constant integer k > 0 and irrespective of the alphabet size, using 𝒪(n) space as the previous approaches. We thus notably break through the well-known n log^k n barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with k errors.
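The k-mismatch variant can likewise be illustrated with a naive quadratic-time baseline. The sketch below slides a window along every alignment (diagonal) of the two strings and keeps at most k mismatches inside the window. It is not the algorithm of this paper and does not use heavy-path decomposition; the function name `k_mismatch_lcs` and the returned triple are assumptions made for illustration.

```python
def k_mismatch_lcs(s: str, t: str, k: int) -> tuple[int, int, int]:
    """Return (length, start_in_s, start_in_t) of a longest pair of
    equal-length fragments of s and t that differ in at most k positions.

    Naive baseline: for each relative shift d of s against t, run a
    two-pointer sliding window over the overlapping part. Total time is
    proportional to len(s) * len(t).
    """
    n, m = len(s), len(t)
    best = (0, 0, 0)
    for d in range(-(m - 1), n):           # d = (position in s) - (position in t)
        i, j = max(d, 0), max(-d, 0)       # first aligned positions on this diagonal
        overlap = min(n - i, m - j)
        left = 0                           # window covers offsets left..right
        mismatches = 0
        for right in range(overlap):
            if s[i + right] != t[j + right]:
                mismatches += 1
            while mismatches > k:          # shrink from the left until valid again
                if s[i + left] != t[j + left]:
                    mismatches -= 1
                left += 1
            length = right - left + 1
            if length > best[0]:
                best = (length, i + left, j + left)
    return best
```

With k = 0 this reduces to the exact LCS length; for instance, `k_mismatch_lcs("abcde", "abxde", 1)` reports length 5, whereas with k = 0 the reported length is 2.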
https://drops.dagstuhl.de/storage/00lipics/lipics-vol204-esa2021/LIPIcs.ESA.2021.30/LIPIcs.ESA.2021.30.pdf
longest common substring
k mismatches
wavelet tree