Schloss Dagstuhl - Leibniz-Zentrum fΓΌr Informatik GmbH Schloss Dagstuhl - Leibniz-Zentrum fΓΌr Informatik GmbH scholarly article en Charalampopoulos, Panagiotis; Kociumaka, Tomasz; Pissis, Solon P.; Radoszewski, Jakub https://www.dagstuhl.de/lipics License: Creative Commons Attribution 4.0 license (CC BY 4.0)
when quoting this document, please refer to the following
DOI:
URN: urn:nbn:de:0030-drops-146114
URL:

; ; ;

Faster Algorithms for Longest Common Substring

pdf-format:


Abstract

In the classic longest common substring (LCS) problem, we are given two strings S and T, each of length at most n, over an alphabet of size Οƒ, and we are asked to find a longest string occurring as a fragment of both S and T. Weiner, in his seminal paper that introduced the suffix tree, presented an π’ͺ(n log Οƒ)-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an π’ͺ(n)-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in π’ͺ(n log Οƒ/log n) space and read in π’ͺ(n log Οƒ/log n) time. We show that, in this model, we can compute an LCS in time π’ͺ(n log Οƒ / √{log n}), which is sublinear in n if Οƒ = 2^{o(√{log n})} (in particular, if Οƒ = π’ͺ(1)), using optimal space π’ͺ(n log Οƒ/log n).
We then lift our ideas to the problem of computing a k-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of S that occurs in T with at most k mismatches. Flouri et al. showed how to compute a 1-mismatch LCS in π’ͺ(n log n) time [IPL 2015]. Thankachan et al. extended this result to computing a k-mismatch LCS in π’ͺ(n log^k n) time for k = π’ͺ(1) [J. Comput. Biol. 2016]. We show an π’ͺ(n log^{k-1/2} n)-time algorithm, for any constant integer k > 0 and irrespective of the alphabet size, using π’ͺ(n) space as the previous approaches. We thus notably break through the well-known n log^k n barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with k errors.

BibTeX - Entry

@InProceedings{charalampopoulos_et_al:LIPIcs.ESA.2021.30,
  author =	{Charalampopoulos, Panagiotis and Kociumaka, Tomasz and Pissis, Solon P. and Radoszewski, Jakub},
  title =	{{Faster Algorithms for Longest Common Substring}},
  booktitle =	{29th Annual European Symposium on Algorithms (ESA 2021)},
  pages =	{30:1--30:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-204-4},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{204},
  editor =	{Mutzel, Petra and Pagh, Rasmus and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2021/14611},
  URN =		{urn:nbn:de:0030-drops-146114},
  doi =		{10.4230/LIPIcs.ESA.2021.30},
  annote =	{Keywords: longest common substring, k mismatches, wavelet tree}
}

Keywords: longest common substring, k mismatches, wavelet tree
Seminar: 29th Annual European Symposium on Algorithms (ESA 2021)
Issue date: 2021
Date of publication: 31.08.2021


DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI