Longest Common Substring with Approximately k Mismatches

In the longest common substring problem, we are given two strings of length n and must find a substring of maximal length that occurs in both strings. It is well known that the problem can be solved in linear time, but the solution is not robust and can vary greatly when the input strings are changed even by one character. To circumvent this, Leimeister and Morgenstern introduced the problem of the longest common substring with k mismatches. Lately, this problem has received a lot of attention in the literature. In this paper, we first show a conditional lower bound based on the Strong Exponential Time Hypothesis (SETH), implying that there is little hope of improving existing solutions. We then introduce a new but closely related problem of the longest common substring with approximately k mismatches and use locality-sensitive hashing to show that it admits a solution with strongly subquadratic running time. We also apply these results to obtain a strongly subquadratic-time 2-approximation algorithm for the longest common substring with k mismatches problem and show conditional hardness of improving its approximation ratio.


Introduction
Understanding how similar two strings are and what they share in common is a central task in stringology. The significance of this task is witnessed by the 50,000+ citations of the paper introducing BLAST [3], a heuristic algorithmic tool for comparing biological sequences. This task can be formalised in many different ways, from the longest common substring problem to the edit distance problem. The longest common substring problem can be solved in optimal linear time and space, while the best known algorithms for the edit distance problem require $n^{2-o(1)}$ time, which makes the longest common substring problem an attractive choice for many practical applications. On the other hand, the longest common substring problem is not robust and its solution can vary greatly when the input strings are changed even by one character. To overcome this issue, recently a new problem has been introduced called the longest common substring with k mismatches. In this paper, we continue this line of research.

Related work
Let us start with a precise statement of the longest common substring problem.
Problem 1 (LCS). Given two strings $T_1, T_2$ of length $n$, find a maximum-length substring of $T_1$ that occurs in $T_2$.
The suffix tree of $T_1$ and $T_2$, a data structure containing all suffixes of $T_1$ and $T_2$, allows solving this problem in linear time and space [36,17,21], which is optimal as any algorithm needs $\Omega(n)$ time to read and $\Omega(n)$ space to store the strings. However, if we only account for "additional" space, that is, the space the algorithm uses apart from the space required to store the input, then the suffix tree-based solution is not optimal and has been improved in a series of publications [5,26,32].
The major disadvantage of the longest common substring problem is that its solution is not robust. Consider, for example, two pairs of strings: $a^{2m+1}$, $a^{2m}b$ and $a^m b a^m$, $a^{2m}b$. The longest common substring of the first pair of strings is almost twice as long as the longest common substring of the second pair of strings, although we changed only one character. This makes the longest common substring unsuitable to be used as a measure of similarity of two strings: intuitively, changing one character must not change the measure of similarity much. To overcome this issue, it is natural to allow the substring to occur in $T_1$ and $T_2$ not exactly but with a small number of mismatches.
Problem 2 (LCS with k Mismatches). Given two strings $T_1, T_2$ of length $n$ and an integer $k$, find a maximum-length substring of $T_1$ that occurs in $T_2$ with at most $k$ mismatches.
The problem can be solved in quadratic time and space by a dynamic-programming algorithm, but more efficient solutions have also been shown. The longest common substring with one mismatch problem was first considered in [6], where an $O(n^2)$-time and $O(n)$-space solution was given. This result was further improved by Flouri et al. [14], who showed an $O(n \log n)$-time and $O(n)$-space solution.
For a general value of $k$, the problem was first considered by Leimeister and Morgenstern [29], who suggested a greedy heuristic algorithm. Flouri et al. [14] showed that LCS with k Mismatches admits a quadratic-time algorithm which takes constant (additional) space. Grabowski [16] presented two output-dependent algorithms with running times $O(n((k + 1)(\ell_0 + 1))^k)$ and $O(n^2 k/\ell_k)$, where $\ell_0$ is the length of the longest common substring of $T_1$ and $T_2$ and $\ell_k$ is the length of the longest common substring with $k$ mismatches of $T_1$ and $T_2$. Thankachan et al. [35] gave an $O(n)$-space, $O(n \log^k n)$-time solution for $k = O(1)$. Very recently, Charalampopoulos et al. [10] extended the underlying techniques and developed an $O(n)$-time algorithm for the case of $\ell_k = \Omega(\log^{2k+2} n)$. Finally, Abboud et al. [1] applied the polynomial method to develop a $k^{1.5} n^2 / 2^{\Omega(\sqrt{(\log n)/k})}$-time randomised solution to the problem. In fact, their algorithm was developed for a more general problem of computing the longest common substring with $k$ edits, but it can be adapted to LCS with k Mismatches as well. The problem of computing the longest common substring with $k$ edits was also considered in [34], where an $O(n \log^k n)$-time solution was given for constant $k$.
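To make the quadratic-time, constant-additional-space bound concrete, the following minimal Python sketch slides a window containing at most k mismatches along every diagonal of the alignment of the two strings. It is an illustration under our own assumptions: the function name `lcs_k_mismatches` and the returned triple are hypothetical, and the code is not claimed to reproduce the algorithm of Flouri et al. [14].

```python
def lcs_k_mismatches(t1, t2, k):
    """Return (length, i, j): t1[i:i+length] and t2[j:j+length] differ in
    at most k positions, and length is the maximum possible."""
    n1, n2 = len(t1), len(t2)
    best = (0, 0, 0)
    # A diagonal d aligns t1[i] with t2[i + d].
    for d in range(-(n1 - 1), n2):
        i = max(0, -d)           # first aligned position in t1
        j = i + d                # first aligned position in t2
        length = min(n1 - i, n2 - j)
        left = 0                 # left end of the current window
        mismatches = 0
        for right in range(length):
            if t1[i + right] != t2[j + right]:
                mismatches += 1
            # Shrink the window until it holds at most k mismatches again.
            while mismatches > k:
                if t1[i + left] != t2[j + left]:
                    mismatches -= 1
                left += 1
            if right - left + 1 > best[0]:
                best = (right - left + 1, i + left, j + left)
    return best
```

Each diagonal is scanned once with two pointers, so the total time is quadratic and only a constant number of counters is kept besides the input.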

Our contribution
Our contribution is as follows. In Section 2, we show that the existence of a strongly subquadratic-time algorithm for LCS with k Mismatches on strings over a binary alphabet for $k = \Omega(\log n)$ refutes the Strong Exponential Time Hypothesis (SETH) of Impagliazzo, Paturi, and Zane [23,24]; see also [11, Chapter 14]:
Hypothesis (SETH). For every $\delta > 0$, there exists an integer $q$ such that SAT on $q$-CNF formulas with $m$ clauses and $n$ variables cannot be solved in $m^{O(1)} 2^{(1-\delta)n}$ time.
This conditional lower bound implies that there is little hope to improve existing solutions to LCS with k Mismatches. To this end, we introduce a new problem, inspired by the work of Andoni and Indyk [4].
Problem 3 (LCS with Approximately k Mismatches). Two strings $T_1, T_2$ of length $n$, an integer $k$, and a constant $\varepsilon > 0$ are given. If $\ell_k$ is the length of the longest common substring with $k$ mismatches of $T_1$ and $T_2$, return a substring of $T_1$ of length at least $\ell_k$ that occurs in $T_2$ with at most $(1 + \varepsilon) \cdot k$ mismatches.
Let $d_H(S_1, S_2)$ denote the Hamming distance between equal-length strings $S_1$ and $S_2$, that is, the number of mismatches between them. Then we are to find the substrings $S_1$ and $S_2$ of $T_1$ and $T_2$, respectively, of length at least $\ell_k$ such that $d_H(S_1, S_2) \le (1 + \varepsilon) \cdot k$.
Although the problem statement is not standard, it makes perfect sense from the practical point of view. It is also more robust than the LCS with k Mismatches problem, as for most applications it is not important whether a returned substring occurs in T 1 and T 2 with, for example, 10 or 12 mismatches. The result is also important from the theoretical point of view as it improves our understanding of the big picture of string comparison. In their work, Andoni and Indyk used the technique of locality-sensitive hashing to develop a space-efficient randomised index for a variant of the approximate pattern matching problem. We extend their work with new ideas in the construction and the analysis to develop a randomised subquadratic-time solution to Problem 3. This result is presented in Section 3.
In Section 4, we consider approximation algorithms for the length of the LCS with k Mismatches. By applying previous techniques, we show a strongly subquadratic-time 2-approximation algorithm and show that no strongly subquadratic-time (2 − ε)-approximation algorithm exists for any ε > 0 unless SETH fails.
Finally, in Section 5 we show a strongly subcubic-time solution for LCS with k Mismatches for all $k$ by reducing it (for arbitrary alphabet size) to Binary Jumbled Indexing. Namely, we show that LCS with k Mismatches for all $k = 1, \ldots, n$ can be solved in $O(n^{2.859})$ expected time or in $O(n^{2.864})$ deterministic time, improving upon naive computation performed for every $k$ separately.

LCS with k Mismatches is SETH-hard
Recall that the Hamming distance of two strings U and V of the same length, denoted as d H (U, V ), is simply the number of mismatches. Our proof is based on conditional hardness of the following problem.
(b) If the set $A$ does not contain two orthogonal vectors, then all the solutions for the LCS with k Mismatches problem for $T_1$ and $T_2$ have length smaller than $\ell' = (7q + 14)d$.
Proof. (a) Assume that $U_i$ and $U_j$ are a pair of orthogonal vectors. $T_1$ contains a substring $H^q \mu(U_i) H^q$ and $T_2$ contains a substring $H^q \tau(U_j) H^q$. Both substrings have length $\ell$ and, by Observation 1, their Hamming distance is exactly $k = d$.
(b) Assume to the contrary that there are indices $a$ and $b$ for which the substrings $S_1 = T_1[a, a + \ell' - 1]$ and $S_2 = T_2[b, b + \ell' - 1]$ have at most $k$ mismatches. First, let us note that $7 \mid a - b$. Indeed, otherwise $S_1$ would contain at least $\lfloor (\ell' - 3)/7 \rfloor = (q + 2)k - 1 \ge k + 1$ substrings of the form 1000 which, by Observation 2, would not be aligned with substrings 1000 in $S_2$. Hence, they would account for more than $k$ mismatches between $S_1$ and $S_2$.
Let us call all the substrings of $T_1$ and $T_2$ that come from the 3-character prefixes of $\mu(0)$, $\mu(1)$, $\tau(0)$, $\tau(1)$, and $\gamma$ the core substrings, with core substrings that come from $\gamma$ being gadget core substrings. We have already established that the core substrings of $S_1$ and $S_2$ are aligned. Moreover, $S_1$ and $S_2$ contain at least $\lfloor (\ell' - 2)/7 \rfloor = (q + 2)k - 1$ core substrings each. Amongst every $(q + 2)k - 1$ consecutive core substrings in $S_1$, some $k$ consecutive ones must come from $\mu(U_i)$ for some index $i$; a symmetric property holds for $S_2$ and $\tau(U_j)$. Moreover, as only the gadget core substrings in $S_1$ and $S_2$ can match exactly, at most $k$ core substrings that are contained in $S_1$ and $S_2$ can be non-gadget. Hence, $S_1$ and $S_2$ contain exactly $k$ non-gadget core substrings each. If they were not aligned, they would have produced more than $k$ mismatches in total with the gadget core substrings.

LCS with Approximately k Mismatches
In this section, we prove the following theorem.

Overview of the proof
The classic solution to the longest common substring problem is based on two observations. The first observation is that the longest common substring of $T_1$ and $T_2$ is in fact the longest common prefix of some suffix of $T_1$ and some suffix of $T_2$. The second observation is that the maximal length of the longest common prefix of a fixed suffix $S$ of $T_1$ and suffixes of $T_2$ is reached by one of the two suffixes of $T_2$ that are closest to $S$ in the lexicographic order. This suggests the following algorithm: First, build a suffix tree of $T_1$ and $T_2$, which contains all suffixes of $T_1$ and $T_2$ ordered lexicographically. Second, for each suffix $S$ of $T_1$, compute the longest common prefix of $S$ and the two suffixes of $T_2$ closest to $S$ in the lexicographic order, one from the left and one from the right. The problem of computing the longest common prefix has been extensively studied in the literature and a number of very efficient deterministic and randomised solutions exist [7,8,12,22,19]; for example, one can use a Lowest Common Ancestor (LCA) data structure, which can be constructed in linear time and space and answers longest common prefix queries in $O(1)$ time [12,19].
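The two observations can be illustrated with a short Python sketch. It is deliberately naive: the suffix tree and the LCA structure are replaced by an explicitly sorted list of suffixes and a character-by-character comparison, so it does not achieve the linear running time discussed above. The function names are ours and purely illustrative.

```python
from bisect import bisect_left


def common_prefix_len(a, b):
    """Length of the longest common prefix of a and b (naive scan)."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def longest_common_substring(t1, t2):
    # Suffixes of t2 in lexicographic order (stand-in for the suffix tree).
    sorted_t2 = sorted(t2[j:] for j in range(len(t2)))
    best = ""
    for i in range(len(t1)):
        s = t1[i:]
        pos = bisect_left(sorted_t2, s)
        # Only the lexicographic predecessor and successor of s among the
        # suffixes of t2 need to be inspected.
        for neighbour in sorted_t2[max(0, pos - 1):pos + 1]:
            length = common_prefix_len(s, neighbour)
            if length > len(best):
                best = s[:length]
    return best
```

For instance, `longest_common_substring("xabcdy", "zabcdw")` inspects at most two neighbours per suffix of the first string and returns `"abcd"`.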
Our solution to the longest common substring with approximately k mismatches problem is somewhat similar. Instead of the lexicographic order, we will consider $\Theta(n^{1/(1+\varepsilon)})$ different orderings on the suffixes of $T_1$ and $T_2$. To define these orderings, we will use the locality-sensitive hashing technique, which was initially introduced for the needs of computational geometry [18] and later adapted for substrings with Hamming distance [4]. In more detail, we will choose $\Theta(n^{1/(1+\varepsilon)})$ hash functions, where each function can be considered as a projection of a string of length $n$ onto a random subset of its positions. By choosing the size of the subset appropriately, we will be able to guarantee that the hash function is locality-sensitive: For any two strings at the Hamming distance at most $k$, the values of the hash functions on them will be equal with reasonably high probability, while the values of the hash functions on any pair of strings at the Hamming distance bigger than $(1 + \varepsilon) \cdot k$ will be equal with low probability. For each hash function, we will sort the suffixes of $T_1$ and $T_2$ by the lexicographic order on their hash values. As a corollary of the locality-sensitive property, if two suffixes of $T_1$ and $T_2$ have a long common prefix with at most $k$ mismatches, they are likely to be close to each other in at least one of the orderings.
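The locality-sensitive behaviour of random projections can be observed with a small experiment. The Python sketch below is only an illustration under our own assumptions (the helper names, the padding character and the seeded random generator are hypothetical choices); it draws independent projections, whereas the construction described later shares randomness between hash functions.

```python
import random


def make_projection(n, size, rng):
    """Draw `size` positions from range(n) with repetition, sorted."""
    return sorted(rng.randrange(n) for _ in range(size))


def project(s, positions, pad="$"):
    """Hash value of s: its characters at the chosen positions.
    Positions past the end of s read a gap-filling character."""
    return "".join(s[i] if i < len(s) else pad for i in positions)


def collision_rate(a, b, size, trials, seed=0):
    """Fraction of random projections on which a and b collide."""
    rng = random.Random(seed)
    n = max(len(a), len(b))
    hits = 0
    for _ in range(trials):
        positions = make_projection(n, size, rng)
        hits += project(a, positions) == project(b, positions)
    return hits / trials
```

Two strings of length n at Hamming distance d collide on one such projection with probability (1 - d/n) raised to the projection size, so `collision_rate` decreases as the distance grows; identical strings always collide and strings differing everywhere never do.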
However, we will not be able to compute the longest common prefix with $(1 + \varepsilon)k$ mismatches for all candidate pairs of suffixes exactly (the best data structure, based on the kangaroo method [28,15], has query time $\Theta((1 + \varepsilon)k)$, which is $\Theta(n)$ in the worst case). We will use this method for only one pair of suffixes chosen at random from a carefully preselected set of candidate pairs. For other candidate pairs, we will use LCPk queries. In an LCPk query, we are given two suffixes $S_1, S_2$ of $T_1$ and $T_2$, respectively, and must output any integer $\ell$ such that $\mathrm{LCP}_k(S_1, S_2) \le \ell \le \mathrm{LCP}_{(1+\varepsilon)k}(S_1, S_2)$, where $\mathrm{LCP}_k$ and $\mathrm{LCP}_{(1+\varepsilon)k}$ denote the longest common prefix with at most $k$ and at most $(1 + \varepsilon)k$ mismatches, respectively. In Section 3.2, we show the following lemma based on the sketching techniques by Kushilevitz et al. [27]:
Lemma 2. For given $k$ and $\varepsilon$, after $O(n \log^3 n)$-time and $O(n \log^2 n)$-space preprocessing of strings $T_1, T_2$, any LCPk query can be answered in $O(\log^2 n)$ time. With probability at least $1 - 1/n^3$, the preprocessing produces a data structure that correctly answers all LCPk queries.
The key idea is to compute sketches for all power-of-two length substrings of T 1 and T 2 . The sketches will have logarithmic length (so that we will be able to compare them very fast) and the Hamming distance between them will be roughly proportional to the Hamming distance between the original substrings. Once the sketches are computed, we use binary search to answer LCPk queries in polylogarithmic time.

Proof of Lemma 2
During the preprocessing stage, we compute sketches [27] of all substrings of the strings $T_1$ and $T_2$ of lengths $\ell = 1, 2, 4, \ldots, 2^{\lfloor \log n \rfloor}$, which can be defined in the following way. Without loss of generality, assume that the alphabet is $\Sigma = \{0, 1, \ldots, p - 1\}$, where $p$ is a prime number. For a fixed $\ell$, choose $\lambda = \lceil 3 \ln n / \gamma^2 \rceil$ vectors $r^i_\ell$ of length $\ell$, where $\gamma$ is a constant to be defined later, such that the values $r^i_\ell[j]$ across $i = 1, 2, \ldots, \lambda$ and $j = 1, 2, \ldots, \ell$ are independent and identically distributed so that for every $a \in \Sigma$:
For a string $X$ of length $\ell$, we define the sketch $sk(X)$ to be a vector of length $\lambda$, where $sk(X)[i] = r^i_\ell \cdot X \pmod{p}$. For each $i = 1, 2, \ldots, \lambda$, we compute the inner product of $r^i_\ell$ with all length-$\ell$ substrings of $T_1$ and $T_2$ in $O(n \log n)$ time by running the Fast Fourier Transform (FFT) algorithm in the field $\mathbb{Z}_p$ [13]. As a result, we obtain the sketches of each length-$\ell$ substring of $T_1$ and $T_2$. We repeat this step for all specified values of $\ell$. One instance of the FFT algorithm takes $O(n \log n)$ time, and we run an instance for each $i = 1, 2, \ldots, \lambda$ and for each $\ell = 1, 2, 4, \ldots, 2^{\lfloor \log n \rfloor}$, which takes $O(n \log^3 n)$ time in total. The sketches occupy $O(n \log^2 n)$ space. Each string $S$ can be decomposed uniquely as
Lemma 3 (see [27]). Let $S_1, S_2$ be strings of the same length. For each $i = 1, \ldots, \lambda$:
Proof. We use a different interpretation of $r^i_\ell$ that defines the same distribution. We start with the zero vector and sample positions with probability
We set $\Delta = \frac{\delta_1 + \delta_2}{2} \cdot \lambda$ and $\gamma = \frac{\delta_2 - \delta_1}{2}$. Observe that ) k is an increasing function of $k$ bounded from above by $e^{1/2}$. Consequently, if $\varepsilon$ is a constant, then $\gamma$ is a constant as well.
Lemma 4. For all strings $S_1$ and $S_2$ of the same length, the following claims hold with probability at least $1 - n^{-6}$:
Proof. Let $\chi_i$ be an indicator random variable that is equal to one if and only if sk The claim follows immediately from Lemma 3 and the following Chernoff-Hoeffding bounds [20, Theorem 1]. For $\lambda$ independently and identically distributed binary variables $\chi_1, \chi_2, \ldots, \chi_\lambda$, we have , so we obtain that the error probability is at most $e^{-2\lambda\gamma^2} \le n^{-6}$. If $d_H(S_1, S_2) \le k$, Lemma 3 asserts that $\mu \le \delta_1$. By the first of the above inequalities, we have that $d_H(sk(S_1), sk(S_2)) \le \Delta$ with probability at least $1 - n^{-6}$. Hence, if $d_H(sk(S_1), sk(S_2)) > \Delta$, then $d_H(S_1, S_2) > k$ with the same probability.
Suppose we wish to answer an LCPk query on two suffixes $S_1, S_2$. It suffices to find the longest prefixes of $S_1, S_2$ such that the Hamming distance between their sketches is at most $\Delta$. As mentioned above, these prefixes can be represented uniquely as a concatenation of strings of power-of-two lengths $\ell_1 > \ell_2 > \cdots > \ell_g$. To compute $\ell_1$, we initialise it with the biggest power of two not exceeding $n$ and compute the Hamming distance between the sketches of the corresponding substrings. If it does not exceed $\Delta$, we have found $\ell_1$; otherwise, we divide $\ell_1$ by two and continue. Suppose that we already know $\ell_1, \ell_2, \ldots, \ell_i$ and the sketches sk , respectively. Consequently, the query procedure takes $O(\log^2 n)$ time. It errs on at least one query with probability at most $n^{-3}$ (Lemma 4 is only applied for pairs of same-length substrings of $T_1$ and $T_2$, so we estimate error probability by the union bound). This completes the proof of Lemma 2.

Proof of Theorem 2
We start by preprocessing T 1 and T 2 as described in Lemma 2. In the main phase of the algorithm, we construct a family H of hash functions based on four parameters m, s, t, w ∈ Z to be specified later.
Let $\Pi$ be the set of all projections of strings of length $n$ onto a single position, i.e. the value $\pi_i(S)$ of the $i$-th projection on a string $S$ is simply its $i$-th character $S[i]$. More generally, for a string $S$ of length $n$ and a function $h = (\pi_{a_1}, \ldots, \pi_{a_q}) \in \Pi^q$, we define $h(S)$ as $S[a_{p_1}] S[a_{p_2}] \cdots S[a_{p_q}]$, where $p$ is a permutation such that $a_{p_1} \le \cdots \le a_{p_q}$. If $|S| < n$, we define $h(S) := h(S \cdot \$^{n-|S|})$, where $\$ \notin \Sigma$ is a special gap-filling character.
Each hash function $h \in H$ is going to be a uniformly random element of $\Pi^{mt}$; however, the individual hash functions are not chosen independently in order to ensure faster running time for the algorithm. Nevertheless, $H$ will be composed of $s$ independent subfamilies $H_i$, each of size $\binom{w}{t}$. To construct $H_i$, we choose $w$ functions $u_{i,1}, \ldots, u_{i,w} \in \Pi^m$ independently and uniformly at random. Each hash function $h \in H_i$ is defined as an unordered $t$-tuple of distinct functions $u_{i,r}$. Formally,
Consider the set of all suffixes $S_1, S_2, \ldots, S_{2n}$ of $T_1$ and $T_2$. For each $h \in H$, we define an ordering $\prec_h$ of the suffixes $S_1, \ldots, S_{2n}$ according to the lexicographic order of the values $h(S_j)$ of the hash function and, in case of ties, according to the lengths $|S_j|$. To construct it, we build a compact trie on strings $h(S_1), h(S_2), \ldots, h(S_{2n})$.
Let us defer the proof of the theorem until we complete the description of the algorithm and derive Theorem 2. We preprocess functions $u_{i,r}$ and build a trie on $h(S_1), \ldots, h(S_{2n})$ for each $h \in H_i$. We then augment the trie with an LCA data structure, which can be done in linear time and space [12,19]. The latter can be used to find in constant time the longest common prefix of any two strings $h(S_j)$ and $h(S_{j'})$.
Consider a function $h \in H$ and a positive integer $\ell \le n$. We define $h_{[\ell]}$ so that In other words, if $h$ is a projection onto positions from a multiset $P$, then h We define the family of collisions $C^H_\ell$ as a set of triples $(S, S', h)$ such that $S$ and $S'$ are suffixes of $T_1$ and $T_2$, respectively, both of length at least $\ell$, and $h \in H$ is such that the suffixes collide on $h_{[\ell]}$, that is, . Note that the families of collisions are nested: $C^H_0 \supseteq \cdots \supseteq C^H_\ell \supseteq C^H_{\ell+1} \supseteq \cdots \supseteq C^H_n$. For a fixed function $h$, we define the $\ell$-neighbourhood of $S$ as the set of suffixes $S'$ of $T_2$ such that $(S, S', h) \in C^H_\ell$. We observe that the $\ell$-neighbourhood of $S$ forms a contiguous range in the sequence of suffixes of $T_2$ ordered according to $\prec_h$, and this range can be identified in $O(\log n)$ time using binary search and LCA queries on the trie constructed for $h$. Consequently, an $O(n|H|)$-space representation of $C^H_\ell$, with one range for every $\ell$-neighbourhood of each suffix $S$, can be constructed in $O(n|H| \log n)$ time.
In the algorithm, we find the largest $\ell$ such that $|C^H_\ell| \ge 2n|H|$; using a binary search, this takes $O(n|H| \log^2 n)$ time. For each $(S, S', h) \in C^H_{\ell+1}$, we compute the longest common prefix with approximately $k$ mismatches LCPk$(S, S')$ (Lemma 2). Additionally, we pick a single element $(S, S', h) \in C^H_\ell$ uniformly at random and compute the longest common prefix with at most $(1 + \varepsilon)k$ mismatches $\mathrm{LCP}_{(1+\varepsilon)k}(S, S')$ naively in $O(n)$ time. The longest of the retrieved prefixes is returned as an answer.
See Algorithm 1 for pseudocode. We will now proceed to the analysis of complexity and correctness of the algorithm.

Complexity and correctness
To ensure the complexity bounds and correctness of the algorithm, we must carefully choose the parameters $s$, $t$, $w$, and $m$. Let $p_1 = 1 - k/n$, $p_2 = 1 - (1 + \varepsilon) \cdot k/n$, and $\rho = \log p_1 / \log p_2$. The intuition behind these values is that if $S$ and $S'$ are two strings of length $n$ and $d_H(S, S') \le k$, then $p_1$ is a lower bound for the probability of $S[i] = S'[i]$ for a uniformly random position $i$. On the other hand, $p_2$ is an upper bound for the same probability if $d_H(S, S') \ge (1 + \varepsilon) \cdot k$. Based on these values, we define $t = \lceil \sqrt{\log n} \rceil$, $m = \lceil \frac{1}{t} \log_{p_2} \frac{1}{n} \rceil$, $w = t^2 + \lceil p_1^{-m} \rceil$, and $s = \Theta(t!)$.
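A small numeric sketch may help to see how these parameters interact. It rests on an assumption about the intended formulas: we read t as the ceiling of the square root of log n (base 2) and m as the ceiling of (1/t) times the base-p_2 logarithm of 1/n, which is consistent with the later claims that mt is at least that logarithm and that w is n^{o(1)}. The function name and the returned dictionary are hypothetical.

```python
import math


def lsh_parameters(n, k, eps):
    """Compute p1, p2, rho, t, m, w under the reading stated above.
    Requires 1 <= k and (1 + eps) * k < n so that 0 < p2 < p1 < 1."""
    if not (k >= 1 and (1 + eps) * k < n):
        raise ValueError("need 1 <= k and (1 + eps) * k < n")
    p1 = 1 - k / n
    p2 = 1 - (1 + eps) * k / n
    rho = math.log(p1) / math.log(p2)
    t = math.ceil(math.sqrt(math.log2(n)))       # assumed reading of t
    m = math.ceil(math.log(1 / n, p2) / t)       # assumed reading of m
    w = t * t + math.ceil(p1 ** (-m))
    return {"p1": p1, "p2": p2, "rho": rho, "t": t, "m": m, "w": w}


params = lsh_parameters(n=1024, k=10, eps=1.0)
# A projection onto m * t random positions makes two strings at distance
# at least (1 + eps) * k collide with probability at most p2 ** (m * t).
far_collision_bound = params["p2"] ** (params["m"] * params["t"])
```

With these choices the collision bound for far strings is at most 1/n, while rho stays below 1, which is what drives the subquadratic number of hash functions.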

Complexity
To show the complexity of the algorithm, we will start with a simple observation and a more involved fact. Similarly, Moreover, $p_1 > p_2$ yields $\log p_1 > \log p_2$ and therefore $\rho = \frac{\log p_1}{\log p_2} < 1$. Consequently, $w = O(\log n) + 2^{O(\sqrt{\log n})} = n^{o(1)}$, which concludes the proof.

Correctness
First, let us focus on two suffixes which yield the longest common substring with exactly $k$ mismatches.
Lemma 7. Let $S$ and $S'$ be suffixes of $T_1$ and $T_2$, respectively, that maximise $\mathrm{LCP}_k(S, S')$, i.e., such that $\mathrm{LCP}_k(S, S') = \ell_k$. For each $i \in \{1, \ldots, s\}$, with probability $\Omega(1/t!)$ there exists $h \in H_i$ such that Hence, , where the latter is true because $w \ge t^2$ and $w \ge 2$. Consequently, $\mu = \Omega(1/t!)$.
As a corollary, we can choose a constant in the number of steps $s = \Theta(t!)$ so that $(S, S', h) \in C^H_{\ell_k}$ for some $h \in H$ holds with probability at least $\frac{3}{4}$. If additionally $\ell_k > \ell$, then $(S, S', h) \in C^H_{\ell+1}$, so LCPk$(S, S')$ will be called and with high probability will return a substring of length at least $\ell_k$. Otherwise, $|C^H_{\ell_k}| \ge 2n|H|$ and we claim that a uniformly random $(S, S', h) \in C^H_\ell$ satisfies $\mathrm{LCP}_{(1+\varepsilon)k}(S, S') \ge \ell \ge \ell_k$ with probability at least $\frac{1}{2}$. To prove this, we first introduce a family $B^H$ of bad collisions: triples $(S, S', h)$ which belong to $C^H_\ell$ for some $\ell > \mathrm{LCP}_{(1+\varepsilon)k}(S, S')$, and bound its expected size.
Proof. Let us bound the probability that $(S, S', h) \in B^H$ for fixed suffixes $S$ and $S'$ (of $T_1$ and $T_2$, respectively) and fixed $h = (u_{i,r_1}, \ldots, u_{i,r_t})$. Equivalently, we shall bound $\Pr[(S, S', h) \in C^H_\ell]$ for $\ell = \mathrm{LCP}_{(1+\varepsilon)k}(S, S') + 1$. If $|S| < \ell$ or $|S'| < \ell$, the probability is 0 by the definition of $C^H_\ell$. Otherwise, we observe that $d_H(S[1, \ell], S'[1, \ell]) > (1 + \varepsilon)k$ and that $h$ can be considered (due to its marginal distribution) as a projection onto $mt$ uniformly random positions. Therefore, where the last inequality follows from the definition of $m$, which yields $mt \ge \log_{p_2} \frac{1}{n}$. In total, we have $n^2 |H|$ possible triples $(S, S', h)$, so by linearity of expectation, we conclude that the expected number of bad collisions is at most $\frac{1}{n} \cdot n^2 |H| = n|H|$.
Corollary 4. Let $(S, S', h)$ be a uniformly random element of $C^H_\ell$, where $\ell$ is a random variable which always satisfies $|C^H_\ell| \ge 2n|H|$. We have $\Pr[(S, S', h) \in B^H] \le \frac{1}{2}$.
Proof. More formally, we shall prove that $\Pr[(S, S', h) \in B^H \mid (S, S', h) \in C^H_\ell] \le \frac{1}{2}$ holds for a uniformly random triple $(S, S', h)$. Indeed:
Below, we combine the previous results to prove that with constant probability Algorithm 1 correctly solves the LCS with Approximately k Mismatches problem.
Note that we can reduce the error probability to an arbitrarily small constant δ > 0: it suffices to repeat the algorithm a constant number of times and among the resulting pairs, choose the longest substrings successfully verified to be at Hamming distance at most (1 + ε)k; verification can be implemented naively in O(n) time.
Corollary 5. With non-zero constant probability, Algorithm 1 succeeds: it reports a substring of $T_1$ and a substring of $T_2$ at Hamming distance at most $(1 + \varepsilon)k$, both of length at least $\ell_k$, where $\ell_k$ is the length of the longest common substring with $k$ mismatches.
Proof. We will prove that the algorithm succeeds conditioned on the following events:
• the preprocessing of Lemma 2 succeeds,
• the preprocessing of Theorem 3 succeeds for each function $u_{i,r}$,
• $C^H_{\ell_k}$ contains $(S, S', h)$ such that $\mathrm{LCP}_k(S, S') = \ell_k$ (see Lemma 7),
• the randomly chosen $(S, S', h) \in C^H_\ell$ does not belong to $B^H$ (see Corollary 4).
This assumption holds with probability $\Omega(1)$, because the probability of the complementary event can be bounded as follows using the union bound applied on top of Lemma 2, Theorem 3, Lemma 7, and Corollary 4:
Successful preprocessing of functions $u_{i,r}$ guarantees that the value $\ell$ and the families $C^H_\ell$ and $C^H_{\ell+1}$ have been computed correctly. If $\ell_k > \ell$, then $C^H_{\ell+1}$ contains $(S, S', h)$ such that $\mathrm{LCP}_k(S, S') = \ell_k$. The correctness of LCPk queries asserts that LCPk$(S, S') \ge \ell_k$, so the algorithm considers prefixes of $S$ and $S'$ of length at least $\ell_k$ as candidates for the resulting substrings. If $\ell_k \le \ell$, on the other hand, then the randomly chosen $(S, S', h) \in C^H_\ell$ satisfies $\mathrm{LCP}_{(1+\varepsilon)k}(S, S') \ge \ell \ge \ell_k$, so the algorithm considers prefixes of $S$ and $S'$ of length at least $\ell \ge \ell_k$. In either case, a pair of substrings of length at least $\ell_k$ and at Hamming distance at most $(1 + \varepsilon)k$ is among the considered candidates. The resulting substrings also satisfy these conditions, because we return the longest candidates and the correctness of LCPk queries asserts that no substrings at distance more than $(1 + \varepsilon)k$ are considered.

Proof of Theorem 3
Recall that each $h \in H$ is a $t$-tuple of functions $u_{i,r}$, i.e. $h = (u_{i,r_1}, u_{i,r_2}, \ldots, u_{i,r_t})$, where $1 \le i \le s$ and $1 \le r_1 < r_2 < \cdots < r_t \le w$. We will show a preprocessing of functions $u_{i,r}$ after which we will be able to compute the longest common prefix of any two strings $u_{i,r}(S_j)$, $u_{i,r}(S_{j'})$ in $O(1)$ time. As a result, we will be able to compute the longest common prefix of $h(S_j)$, $h(S_{j'})$ in $O(t)$ time. It also follows that we will be able to compare any two strings $h(S_j)$, $h(S_{j'})$ in $O(t)$ time as the order $\prec_h$ is defined by the character following the longest common prefix (or by the lengths $|S_j|$ and $|S_{j'}|$ if $h(S_j) = h(S_{j'})$). Therefore, we can sort strings
It remains to explain how we preprocess individual functions $u_{i,r}$. For each function, it suffices to build a trie on strings $u_{i,r}(S_1), u_{i,r}(S_2), \ldots, u_{i,r}(S_{2n})$ and to augment it with an LCA data structure [12,19]. We will consider two different methods for constructing the trie with time dependent on $m$. No matter what the value of $m$ is, one of these methods will have $O(n^{4/3} \log^{4/3} n)$ running time. Let $u_{i,r}$ be a projection onto a multiset $P$ of positions $1 \le a_1 \le a_2 \le \cdots \le a_m \le n$ and denote $T = T_1 \$^n T_2 \$^n$.
Proof. Without loss of generality assume that $\sqrt{m}$ is an integer. Let us partition $P$ into subsets $B_1, \ldots, B_{\sqrt{m}}$, where
Now $u_{i,r}$ can be represented as a $\sqrt{m}$-tuple of projections $b_1, b_2, \ldots, b_{\sqrt{m}}$ onto the subsets $B_1, B_2, \ldots, B_{\sqrt{m}}$, respectively. We will build the trie by layers to avoid space overhead. Suppose that we have built the trie for a function $(b_1, b_2, \ldots, b_{\ell-1})$ and we want to extend it to the trie for $(b_1, b_2, \ldots, b_{\ell-1}, b_\ell)$.
Let $p$ be a prime of value $\Omega(n^5)$. With error probability inverse polynomial in $n$, we can find such $p$ in $O(\log^{O(1)} n)$ time; see [33,2]. We choose a uniformly random $r \in \mathbb{F}_p$ and create a vector $\chi$ of length $n$. We initialise $\chi$ as a zero vector and for each position $a_{\ell,q} \in B_\ell$, we increase $\chi[a_{\ell,q}]$ by $r^q$. We then run the FFT algorithm for $\chi$ and $T$ in the field $\mathbb{Z}_p$ [13]. The output of the FFT algorithm contains the inner products of $\chi$ and all suffixes $S_1, S_2, \ldots, S_{2n}$. The inner product of $\chi$ and a suffix $S_j$ is the Karp-Rabin fingerprint [25] $\varphi_{\ell,j}$ of $b_\ell(S_j)$, where If the fingerprints of $b_\ell(S_j)$ and $b_\ell(S_{j'})$ are equal, then $b_\ell(S_j)$ and $b_\ell(S_{j'})$ are equal with probability at least $1 - 1/n^4$, and otherwise they differ (for a proof, see e.g. [30]).
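The fingerprint computation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the inner products are computed naively rather than with the FFT, offsets are 0-based, characters are mapped to integers with `ord`, positions beyond the end of the text contribute nothing, and the function name as well as the concrete values of r and p in the usage example are hypothetical.

```python
def block_fingerprints(text, block_offsets, r, p):
    """For every suffix text[j:], return the inner product (mod p) of the
    vector chi with that suffix, where chi[a] accumulates r**q for the
    q-th offset a of the block (q = 1, 2, ...)."""
    chi = [0] * len(text)
    for q, a in enumerate(block_offsets, start=1):
        chi[a] = (chi[a] + pow(r, q, p)) % p
    support = [(a, c) for a, c in enumerate(chi) if c]
    fingerprints = []
    for j in range(len(text)):
        total = 0
        for a, c in support:
            if j + a < len(text):      # naive stand-in for the FFT step
                total += c * ord(text[j + a])
        fingerprints.append(total % p)
    return fingerprints


# Suffixes starting at 0 and 3 agree on offsets 0 and 2 ("a" and "c").
fps = block_fingerprints("abcabc$$$", [0, 2], r=3, p=1_000_003)
same_projection = fps[0] == fps[3]
```

Suffixes with equal projections always receive equal fingerprints; the converse holds only with high probability over the choice of r, which is why the text bounds the error by a union bound.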
For a fixed leaf of the trie for $(b_1, b_2, \ldots, b_{\ell-1})$, we first sort all the suffixes that end in it by fingerprints $\varphi_{\ell,j}$. Second, we lexicographically sort the strings $b_\ell(S_j)$ with distinct fingerprints. For this, we need to be able to compare $b_\ell(S_j)$ and $b_\ell(S_{j'})$ and to find the first character where they differ. We compare $b_\ell(S_j)$ and $b_\ell(S_{j'})$ character-by-character in $O(\sqrt{m})$ time. We then extend the leaf of the trie for $(b_1, b_2, \ldots, b_{\ell-1})$ with a trie on strings $b_\ell(S_j)$ that can be built by imitating its depth-first traversal. By the union bound, the error probability is at most $\frac{1}{n^4} \cdot n^2 \sqrt{m} \le \frac{1}{n}$. We now analyse the complexity of the algorithm. For each of the $\sqrt{m}$ layers, the FFT algorithm takes $O(n \log n)$ time. The sort by fingerprints takes $O(n \log n)$ time per layer, or $O(\sqrt{m}\, n \log n)$ time in total. We finally need to estimate the total number of character-by-character comparisons in all the layers. We claim that it can be upper bounded by $O(n \log n)$.
The reason for that is as follows: if we consider the resulting trie for $u_{i,r}(S_1), \ldots, u_{i,r}(S_{2n})$, it has size $O(n)$. Imagine that the layers cut this trie into a number of smaller tries. The total size of these tries is still $O(n)$, and we build each of these tries using character-by-character comparisons. For a trie of size $x$, we need $O(x \log x)$ comparisons, which in total is $O(n \log n)$. Therefore, the character-by-character comparisons take $O(\sqrt{m}\, n \log n)$ time in total.
The second method builds the trie using the algorithm described in the first paragraph of this section: we only need to give a method for computing the longest common prefix of $u_{i,r}(S_j)$ and $u_{i,r}(S_{j'})$ (or, equivalently, the first position where $u_{i,r}(S_j)$ and $u_{i,r}(S_{j'})$ differ). The following lemma shows that this query can be answered in $O(n \log n/m)$ time, which gives $O(n^2 \log^2 n/m)$ time complexity of the trie construction.
Lemma 10 (see [4]). After $O(n)$-time and space preprocessing, the first position where two strings $u_{i,r}(S_j)$ and $u_{i,r}(S_{j'})$ differ can be found in $O(n \log n/m)$ time correctly with error probability at most $1/n^3$.
Proof. For m = O(log n) the conclusion is trivial. Assume otherwise. We start by building the suffix tree for the string T which takes O(n) time and space [36,17]. Furthermore, we augment the suffix tree with an LCA data structure in O(n) time [12,19].
Let $\ell = \lceil 3n \ln n/m \rceil$. We can find the first $\ell$ positions $q_1 < q_2 < \cdots < q_\ell$ where $S_j$ and $S_{j'}$ differ in $O(\ell) = O(n \log n/m)$ time using the kangaroo method [28,15]. We set $q_r = \infty$ if a given position does not exist. The idea of the kangaroo method is as follows. We can find $q_1$ by one query to the LCA data structure in $O(1)$ time. After removing the first $q_1$ positions of $S_j$ and $S_{j'}$, we obtain suffixes $S_{j+q_1}$, $S_{j'+q_1}$ and find $q_2$ by another query to the LCA data structure, and so on. If at least one of the positions $q_1, q_2, \ldots, q_\ell$ belongs to $P$, then we return the first such position as an answer, and otherwise we say that $u_{i,r}(S_j) = u_{i,r}(S_{j'})$. The multiset $P$ can be stored as an array of multiplicities so that testing if an element belongs to it can be done in constant time.
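The kangaroo jumps described above can be sketched as follows. The sketch is illustrative only: a naive character scan stands in for the constant-time LCA query, positions are 0-based, and the function names and the representation of P as a Python set are our own hypothetical choices.

```python
def common_prefix_len(a, b, start):
    """Naive stand-in for a constant-time LCA-based query: length of the
    longest common prefix of a[start:] and b[start:]."""
    n = min(len(a), len(b))
    pos = start
    while pos < n and a[pos] == b[pos]:
        pos += 1
    return pos - start


def first_mismatches(a, b, limit):
    """Return the first `limit` positions where a and b differ."""
    positions = []
    pos = 0
    n = min(len(a), len(b))
    while len(positions) < limit and pos < n:
        pos += common_prefix_len(a, b, pos)   # jump over the matching run
        if pos < n:
            positions.append(pos)             # record the mismatch
            pos += 1                          # and hop past it
    return positions


def first_projected_difference(a, b, projected, limit):
    """First mismatch position that belongs to the projected set, or None
    if none of the first `limit` mismatches is projected."""
    for q in first_mismatches(a, b, limit):
        if q in projected:
            return q
    return None
```

Each recorded mismatch costs one longest-common-prefix query, so with constant-time queries the first `limit` mismatches are found in time proportional to `limit`.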
Let us show that if $p$ is the first position where $u_{i,r}(S_j)$ and $u_{i,r}(S_{j'})$ differ, then $p$ belongs to $\{q_1, q_2, \ldots, q_\ell\}$ with high probability. Because $q_1 < q_2 < \cdots < q_\ell$ are the first $\ell$ positions where $S_j$ and $S_{j'}$ differ, it suffices to show that at least one of these positions belongs to $P$. We rely on the fact that positions of $P$ are independent and uniformly random elements of $[1, n]$. Consequently, we have $\Pr[q_1, \ldots, q_\ell \notin P] = (1 - \ell/n)^m \le (1 - 3 \ln n / m)^m \le e^{-3 \ln n} = 1/n^3$.
By Lemmas 9 and 10, the trie on strings $u_{i,r}(S_1), \ldots, u_{i,r}(S_{2n})$ can be built in $O(\min\{\sqrt{m}, n \log n / m\} \cdot n \log n) = O(n^{4/3} \log^{4/3} n)$ time and $O(n)$ space correctly with high probability, which implies Theorem 3 as explained in the beginning of this section.
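The final bound follows from balancing the two terms under the minimum. A short worked derivation (a sketch of the standard balancing argument, using only the quantities stated above):

```latex
\[
  \sqrt{m} = \frac{n \log n}{m}
  \iff m^{3/2} = n \log n
  \iff m = (n \log n)^{2/3}.
\]
% For any m, at least one of the two terms is at most (n log n)^{1/3}:
% if \sqrt{m} > (n \log n)^{1/3}, then m > (n \log n)^{2/3},
% and hence n \log n / m < (n \log n)^{1/3}.
\[
  \min\Bigl\{\sqrt{m},\ \frac{n \log n}{m}\Bigr\} \le (n \log n)^{1/3}
  \quad\Longrightarrow\quad
  O\bigl((n \log n)^{1/3} \cdot n \log n\bigr)
  = O\bigl(n^{4/3} \log^{4/3} n\bigr).
\]
```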

Approximate LCS with k Mismatches
In this section, we consider an approximate variant of the LCS with k Mismatches problem, defined as follows.
Problem 5 (Approximate LCS with k Mismatches). Two strings $T_1$, $T_2$ of length $n$, an integer $k$, and a constant $z > 1$ are given. If $\ell_k$ is the length of the longest common substring with $k$ mismatches of $T_1$ and $T_2$, return a substring of $T_1$ of length at least $\ell_k / z$ that occurs in $T_2$ with at most $k$ mismatches.
Proof. (a) The algorithm of Theorem 2 for $\varepsilon = 1$ computes a pair of substrings of length at least $\ell_k$ of $T_1$ and $T_2$ that have Hamming distance at most $2k$. Either the first halves or the second halves of the strings have Hamming distance at most $k$.
(b) We use the gap that exists in Lemma 1 for $q > 1$. Assume that there is such an algorithm for some $\varepsilon$ and $\delta$. We will run it for strings $T_1$ and $T_2$ from that lemma. Let $q = \lceil 3/\varepsilon \rceil - 2$; then $\ell / \ell' \ge 2 - \varepsilon$. If the Orthogonal Vectors problem has a solution, by Lemma 1(a), the algorithm produces a longest common substring of length at least $\ell / (2 - \varepsilon) \ge \ell'$. Otherwise, by Lemma 1(b), its result has length smaller than $\ell'$. Hence the conjectured approximation algorithm could be used to solve the Orthogonal Vectors problem.
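The halving step in part (a) of the proof above can be illustrated with a small sketch. The function names are hypothetical, and the handling of odd lengths (two halves of length $\lceil L/2 \rceil$ sharing the middle character) is an assumption of this sketch; the proof itself does not spell it out.

```python
def hamming(x, y):
    """Number of positions where equal-length strings x and y differ."""
    return sum(c != d for c, d in zip(x, y))


def better_half(x, y, k):
    """Given aligned strings with at most 2k mismatches, return a pair
    of aligned halves with at most k mismatches.

    For odd lengths the halves overlap in one character, so the two
    halves together contain at most 2k + 1 counted mismatches and
    cannot both contain k + 1 or more.
    """
    assert len(x) == len(y) and hamming(x, y) <= 2 * k
    h = (len(x) + 1) // 2                   # ceil(len / 2)
    first = (x[:h], y[:h])
    second = (x[len(x) - h:], y[len(y) - h:])
    return first if hamming(*first) <= k else second
```

For example, `better_half("aaaabb", "bbbabb", 2)` skips the first halves, which differ in three positions, and returns the matching second halves.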

LCS with k Mismatches for all k
The following problem has received considerable attention in recent years; see [9] and the references therein.
Problem 6 (Binary Jumbled Indexing). Construct a data structure over a binary string $S$ of length $n$ that, given positive integers $\ell$ and $q$, can determine whether there is a substring of $S$ of length $\ell$ containing exactly $q$ ones.
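A naive sketch of such an index is given below. It assumes a quadratic-time construction, not the subquadratic one referred to in this section, and the class and attribute names are illustrative. For every length it stores the minimal and maximal number of ones over all windows of that length; a query is then a range check, which is valid because sliding a window by one position changes its number of ones by at most one.

```python
class BinaryJumbledIndex:
    """Naive binary jumbled index over a string of '0' and '1'."""

    def __init__(self, s):
        n = len(s)
        # prefix[i] = number of ones in s[:i]
        prefix = [0]
        for ch in s:
            prefix.append(prefix[-1] + (ch == "1"))
        self.min_ones = [0] * (n + 1)
        self.max_ones = [0] * (n + 1)
        for length in range(1, n + 1):
            counts = [prefix[i + length] - prefix[i]
                      for i in range(n - length + 1)]
            self.min_ones[length] = min(counts)
            self.max_ones[length] = max(counts)

    def query(self, length, ones):
        """Is there a substring of the given length with exactly `ones` ones?"""
        if not 1 <= length < len(self.min_ones):
            return False
        return self.min_ones[length] <= ones <= self.max_ones[length]
```

The index occupies two arrays of length $n + 1$, and each query is answered with two comparisons.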
A simple combinatorial argument shows that it suffices to compute the minimal and maximal number of ones in a substring of $S$ of length $\ell$, as for every intermediate number of ones a substring of $S$ of this length exists as well. As a result, the Binary Jumbled Indexing problem can be solved in linear space and with constant-time queries. It turns out that the index can also be constructed in strongly subquadratic time.
Proof. Note that, equivalently, we can compute, for all $\ell = 1, \ldots, n$, the minimal Hamming distance between substrings of length $\ell$ in $T_1$ and $T_2$.
Let $M$ be an $n \times n$ Boolean matrix such that $M[i, j] = 0$ if and only if $T_1[i] = T_2[j]$. We construct $2n - 1$ binary strings corresponding to the diagonals of $M$: the string number $p$, for $p \in \{-(n-1), \ldots, n-1\}$, corresponds to the diagonal $\{M[i, j] : j - i = p\}$. For each of the strings, we construct the jumbled index using Lemma 11.
Each diagonal corresponds to one of the possible alignments of $T_1$ and $T_2$. In the jumbled index we compute, in particular, for each value of $\ell$ what is the minimal number of 1s (which correspond to mismatches between the corresponding positions in $T_1$ and $T_2$) in a string of length $\ell$. To compute the global minimum for a given $\ell$, we only need to take the minimum across all the jumbled indexes.
By Lemma 11, all the jumbled indexes can be constructed in $O(n^{2.859})$ expected time or in $O(n^{2.864})$ time deterministically.
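The reduction in the proof above can be sketched as follows. This is a naive cubic-time illustration that replaces the jumbled-index construction of Lemma 11 with brute-force window scans; the function names and 0-based indexing are assumptions of the sketch. It builds each diagonal of $M$ as a list of mismatch bits, computes the minimal number of ones per window length, and takes the minimum across diagonals.

```python
def min_mismatches_per_length(t1, t2):
    """best[l] = minimal Hamming distance between a length-l substring
    of t1 and a length-l substring of t2 (best[0] = 0)."""
    n = len(t1)
    assert len(t2) == n
    best = [0] + [float("inf")] * n
    for p in range(-(n - 1), n):            # the 2n - 1 diagonals j - i = p
        diag = [int(t1[i] != t2[i + p])
                for i in range(max(0, -p), min(n, n - p))]
        prefix = [0]
        for bit in diag:
            prefix.append(prefix[-1] + bit)
        for length in range(1, len(diag) + 1):
            fewest = min(prefix[i + length] - prefix[i]
                         for i in range(len(diag) - length + 1))
            best[length] = min(best[length], fewest)
    return best


def lcs_for_all_k(t1, t2):
    """For every k = 0..n, the length of the longest common substring
    with at most k mismatches."""
    best = min_mismatches_per_length(t1, t2)
    n = len(t1)
    return [max(l for l in range(n + 1) if best[l] <= k)
            for k in range(n + 1)]
```

For `t1 = "abcd"` and `t2 = "abxd"`, the main diagonal is `[0, 0, 1, 0]`, so the answer is 2 for `k = 0` and 4 for every `k >= 1`.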