Maximal Common Subsequence Algorithms

A common subsequence of two strings is maximal if inserting any character into the subsequence can no longer yield a common subsequence of the two strings. The present article proposes a (sub)linearithmic-time, linear-space algorithm for finding a maximal common subsequence of two strings and also proposes a linear-time algorithm for determining if a common subsequence of two strings is maximal.

2012 ACM Subject Classification: Theory of computation → Pattern matching


Introduction
A subsequence of a string of characters is obtained from the string by deleting any number of not necessarily contiguous characters at any position. A common subsequence of two strings can be thought of as a pattern common to the strings. A common subsequence is maximal if inserting any character into the subsequence can no longer yield a common subsequence. Hence, any common subsequence can be found as a subsequence of some maximal common subsequence. The present article considers the problem of finding a maximal common subsequence of two strings, both of length O(n), over an alphabet set of O(n) characters for some positive integer n, and also considers the problem of determining if a given common subsequence of the two strings is maximal.
A longest one among the maximal common subsequences is called a longest common subsequence (an LCS). It is well known that the dynamic programming algorithm of Wagner and Fischer [10] finds an LCS of two O(n)-length strings in O(n^2) time and O(n^2) space. Moreover, the divide-and-conquer version developed by Hirschberg [6] reduces the required space to O(n) without increasing the asymptotic execution time. On the other hand, Abboud et al. [1] revealed that, for any positive constant ε, there exist no O(n^{2−ε})-time algorithms for computing the LCS length, unless the strong exponential time hypothesis (SETH) [7,8] is false. This immediately implies that, under the assumption of SETH, neither can an LCS be found nor can it be determined whether a common subsequence is an LCS in O(n^{2−ε}) time.

Problems of finding a conditional LCS have also been considered. The constrained LCS (CLCS) problem [9,3] (also called the SEQ-IC-LCS problem [2]) and the restricted LCS (RLCS) problem [5] (also called the SEQ-EC-LCS problem [2]) are such problems. Given a common subsequence P regarded as essentially "relevant" (resp. "irrelevant") to the relationship between the two strings, the CLCS (resp. RLCS) problem consists of finding a longest one among the common subsequences that have (resp. do not have) P as a subsequence and was shown to be solvable in O(n^3) time [3] (resp. [5,2]). By definition, the CLCS found is maximal. In contrast, the RLCS found is not necessarily maximal and, unless maximal, the RLCS might not be very informative in certain applications because it is just obtained from some common subsequence, which has the "irrelevant" P perfectly as a subsequence, only by deleting a single character.
The reason why it takes at least almost quadratic time to find an LCS or a conditional LCS as a pattern common to the two strings is the condition that the pattern to be found should have maximum length. Possibly for an analogous reason, the best asymptotic running time known for finding a shortest maximal common subsequence of two strings remains cubic [4]. The present article shows that, ignoring such conditions on the length of the maximal common subsequence to be found, we can find a maximal common subsequence much faster, by proposing an O(n log log n)-time, O(n)-space algorithm. This algorithm can also be used to find a constrained maximal common subsequence, which hence has P as a subsequence, in the same asymptotic time and space, where P is an arbitrary common subsequence given as a "relevant" pattern. It is also shown that we can determine whether any given common subsequence, such as an RLCS, is maximal even faster, by proposing an O(n)-time algorithm.
This article is organized as follows. Section 2 defines notations and terminology used in this article. Section 3 proposes an O(n log log n)-time, O(n)-space algorithm that finds a maximal common subsequence of two strings of length O(n). Section 4 modifies the above algorithm so as to output a maximal common subsequence having a given common subsequence as a subsequence in the same asymptotic time and space. Section 5 proposes an O(n)-time algorithm that determines if a given common subsequence is maximal. Section 6 concludes this article.

Preliminaries
For any sequences S and S′, let S • S′ denote the concatenation of S followed by S′. Let ε denote the empty sequence. For any sequence S, let |S| denote the length of S. For any index i with 1 ≤ i ≤ |S|, let S[i] denote the element of S at position i. A subsequence of S is a sequence obtained from S by deleting elements at any positions, i.e., a sequence S[i_1] • S[i_2] • ⋯ • S[i_m] for some indices 1 ≤ i_1 < i_2 < ⋯ < i_m ≤ |S|. For any sequences S and S′, we say that S contains S′ if S′ is a subsequence of S. For any indices i′ and i with 0 ≤ i′ ≤ i ≤ |S|, let S(i′, i] denote the contiguous subsequence of S consisting of all elements at positions between i′ + 1 and i, i.e., S(i′, i] = S[i′ + 1] • S[i′ + 2] • ⋯ • S[i].

Let Σ = {c_1, c_2, …, c_{|Σ|}} be an alphabet set of |Σ| characters, which are totally ordered. A string is a sequence of characters over Σ. For any strings X and Y, a common subsequence of X and Y is a subsequence of X that is also a subsequence of Y. We say that X and Y are disjoint if they have no non-empty common subsequences. We say that a common subsequence W of X and Y is maximal if inserting any character into W can no longer yield a common subsequence of X and Y.
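The notation above maps directly onto ordinary string slicing. The following is a minimal sketch in Python, in which the helper names `segment`, `is_subsequence`, and `are_disjoint` are illustrative assumptions rather than names used in this article. With 1-based positions, S(i′, i] corresponds to the 0-based slice `s[i_prime:i]`, and two strings are disjoint exactly when they share no character.

```python
def segment(s, i_prime, i):
    """Return S(i', i], the elements at 1-based positions i'+1 .. i."""
    assert 0 <= i_prime <= i <= len(s)
    return s[i_prime:i]


def is_subsequence(t, s):
    """True if s contains t, i.e. t is obtained from s by deleting elements."""
    it = iter(s)
    # "ch in it" consumes the iterator up to the first match, so the
    # characters of t are matched from left to right in s.
    return all(ch in it for ch in t)


def are_disjoint(x, y):
    """True if x and y have no non-empty common subsequence.

    A single shared character is already a non-empty common subsequence,
    so disjointness is the same as having no character in common.
    """
    return not (set(x) & set(y))
```

For instance, `segment("abcde", 1, 3)` is the string consisting of positions 2 and 3 of "abcde".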

Algorithm for finding a maximal common subsequence
This section proposes an O(n log log n)-time algorithm that outputs, for any strings X and Y of length O(n) with |Σ| = O(n) given as input, a maximal common subsequence of X and Y .
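For technical reasons, we assume without loss of generality that X[1] = Y[1] = c_1 and X[|X|] = Y[|Y|] = c_{|Σ|}, and that neither c_1 nor c_{|Σ|} appears at any other position of X or Y (as in the example of Figure 1, where c_1 = ˆ and c_{|Σ|} = $).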

Y. Sakai 1:3
We also assume that the array I (resp. J) of arrays I_c (resp. J_c) for all characters c in Σ is available, where I_c (resp. J_c) is an appropriate data structure supporting queries of the following index, indicating the nearest occurrence of a specific character c in X (resp. Y) from a specific position i (resp. j).

Definition 1. For any character c in Σ and any index i with 0 ≤ i ≤ |X|, let I^≺_c(i) (resp. I^≻_c(i)) denote the least (resp. greatest) index such that c does not appear in X(I^≺_c(i), i] (resp. X(i, I^≻_c(i)]). Define index J^≺_c(j) (resp. J^≻_c(j)) analogously with respect to Y.
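To make Definition 1 concrete, the following Python sketch answers both queries with a sorted list of occurrence positions per character and a binary search. This is a simplification that supports O(log n)-time queries and is not the y-fast trie; the names `build_occurrences`, `pred_index`, and `succ_index` are hypothetical. Positions are 1-based, so I^≺_c(i) is the last occurrence of c at or before position i (or 0 if there is none), and I^≻_c(i) is one less than the first occurrence of c after position i (or |X| if there is none).

```python
from bisect import bisect_right
from collections import defaultdict


def build_occurrences(s):
    """Map each character to the ascending list of its 1-based positions in s."""
    occ = defaultdict(list)
    for pos, ch in enumerate(s, start=1):
        occ[ch].append(pos)
    return occ


def pred_index(occ, c, i):
    """Least index i' such that c does not appear in s(i', i]."""
    positions = occ.get(c, [])
    r = bisect_right(positions, i)  # number of occurrences at positions <= i
    return positions[r - 1] if r else 0


def succ_index(occ, c, i, n):
    """Greatest index i' such that c does not appear in s(i, i'], where n = |s|."""
    positions = occ.get(c, [])
    r = bisect_right(positions, i)  # positions[r] is the first occurrence > i
    return positions[r] - 1 if r < len(positions) else n
```

For X = abcab, the occurrence list of a is [1, 4], so the predecessor query for a at position 3 returns 1 and the successor query for a at position 1 returns 3.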
In what follows, we adopt as data structure I_c (resp. J_c) the y-fast trie [11] maintaining all indices i (resp. j) with X[i] = c (resp. Y[j] = c), because array I (resp. J) is then constructible in O(n log log n) time and O(n) space and supports O(log log n)-time queries of any index introduced above. However, in an implementation for practical use, if n is not very large, then, due to the constant factors hidden in big-O notation, it might be more suitable to adopt as I_c (resp. J_c) the array consisting of the same indices as the y-fast trie in ascending order, which supports O(log n)-time queries based on a binary search of the array. Furthermore, if |Σ| is a small constant, then we can adopt as I_c (resp. J_c) the table of indices I^≺_c(i) and I^≻_c(i) (resp. J^≺_c(j) and J^≻_c(j)) for all indices i (resp. j), which supports O(1)-time queries.

We design the proposed algorithm based on the property of a common subsequence W stated in Lemma 2, which is naturally derived from the fact that W is not maximal if and only if inserting some character between some prefix and the remaining suffix of W still yields a common subsequence of X and Y.

The proposed algorithm solves the problem using string variable W, which is initially set to c_1 • c_{|Σ|} and is eventually updated to a maximal common subsequence of X and Y. For any index k with 0 ≤ k ≤ |W|, let X_k and Y_k be the substrings in Lemma 2. The algorithm updates W by iteratively replacing it by W(0, k] • c • W(k, |W|], where k is the least index such that X_k and Y_k are not disjoint and c is a certain character appearing in both X_k and Y_k, until X_k and Y_k become disjoint for all indices k with 0 ≤ k ≤ |W|. Note that the resulting string W is a maximal common subsequence of X and Y due to Lemma 2. The algorithm adopts as c the character that appears both in the shortest possible suffix of X_k or Y_k and in the entire string of the other. As shown later, this choice is crucial to executing the algorithm in O(n log log n) time.
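The iterative insertion described above can be illustrated by a deliberately naive Python sketch. It is not Algorithm findMCS and makes no attempt at the O(n log log n) bound: it recomputes X_k and Y_k from scratch after every insertion and picks the smallest common character, an arbitrary choice that differs from the rule used by the proposed algorithm. The names `prefix_ends`, `suffix_starts`, `middle_parts`, and `find_mcs_naive`, and the optional `start` parameter (an initial common subsequence, which is only ever extended), are assumptions made for this illustration. The sentinel characters c_1 and c_{|Σ|} are not needed here because the sketch may start from the empty string.

```python
def prefix_ends(s, w):
    """ends[k] = length of the shortest prefix of s containing w[:k]."""
    ends, i = [0], 0
    for ch in w:
        i = s.index(ch, i) + 1  # leftmost match; raises ValueError if w is not a subsequence
        ends.append(i)
    return ends


def suffix_starts(s, w):
    """starts[k] = 0-based start of the shortest suffix of s containing w[k:]."""
    starts, i = [len(s)], len(s)
    for ch in reversed(w):
        i = s.rindex(ch, 0, i)  # rightmost match before the previous one
        starts.append(i)
    return starts[::-1]


def middle_parts(s, w):
    """The substrings called X_k in Lemma 2 (with s in the role of X), for k = 0 .. |w|."""
    return [s[a:b] for a, b in zip(prefix_ends(s, w), suffix_starts(s, w))]


def find_mcs_naive(x, y, start=""):
    """Extend the common subsequence `start` to a maximal one by repeated insertion."""
    w = start
    while True:
        for k, (xk, yk) in enumerate(zip(middle_parts(x, w), middle_parts(y, w))):
            common = set(xk) & set(yk)
            if common:
                # Least k with non-disjoint X_k and Y_k; insert one shared character there.
                w = w[:k] + min(common) + w[k:]
                break
        else:
            return w  # all X_k and Y_k are disjoint, so w is maximal by Lemma 2
```

For X = abc and Y = acb the sketch returns ab, and starting from the common subsequence c it returns ac, since characters are only inserted and never deleted.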
In order to execute the above, the algorithm maintains a sequence variable Ŵ = (i_1, j_1) • […], where X(i_k, i_{k+1}] and Y(j_k, j_{k+1}] are respectively certain prefixes of X_k and Y_k such that they are disjoint if and only if X_k and Y_k are disjoint. The character to be inserted at the position between W(0, k] and W(k, |W|] is searched for by iteratively updating Ŵ by replacing (i_{k+1}, j_{k+1}) by (i_{k+1} − 1, j_{k+1} − 1). If i_{k+1} becomes i_k or j_{k+1} becomes j_k, then, since X_k and Y_k are disjoint, the algorithm updates Ŵ by replacing Ŵ[k + 1] by (i′, j′) and then updates k to k + 1, where i′ (resp. j′) is the index such that X(0, i′] (resp. Y(0, j′]) is the shortest prefix of X (resp. Y) […]. Otherwise, since Y[j_{k+1}] appears in X(i_k, i_{k+1}], the algorithm updates W and Ŵ in a symmetric manner with respect to Y[j_{k+1}].

A pseudocode of the proposed algorithm is given as Algorithm findMCS in Algorithm 1, where we assume that, by an O(n log log n)-time preprocessing, arrays I and J are available as data structures supporting O(log log n)-time queries of any of indices I^≺_c(i), I^≻_c(i), J^≺_c(j), and J^≻_c(j). In this pseudocode, variables i′, j′, i, and j are respectively used to represent indices i_k, j_k, i_{k+1}, and j_{k+1}, where (i_k, j_k) = Ŵ[k] and (i_{k+1}, j_{k+1}) = Ŵ[k + 1]. A concrete example of how this algorithm works is presented in Figure 1.
As mentioned earlier, the following condition holds at any execution of line 7 of this algorithm and also at the last execution of line 4.

Definition 3. For any string W, any index pair sequence Ŵ = (i_1, j_1) • […] X(i […]

The following simple lemma (Lemma 5) plays a key role in estimating the execution time of the algorithm. This lemma claims, for example, that the situation in Figure 1, where any solid line other than the leftmost and rightmost ones shares at least one of its endpoints with a unique dotted line, is inevitable.

Note that i′ = i_k and j′ = j_k due to Lemma 4. Since i = i′ or j = j′, Ŵ[k + 1] is obtained from (i_{k+1}, j_{k+1}) by executing either line 15 or line 18 […] executing line 8 iteratively min(i_{k+1} − i_k, j_{k+1} − j_k) times. This implies that the execution time of the algorithm is O([…] = h_{k+1} hold. Therefore, from Lemma 5, i_{k+1} + 1 = g_{k+1} or j_{k+1} + 1 = h_{k+1} holds and hence we have that min(i […]

The algorithm uses variables W, Ŵ, k, i′, j′, i, and j, together with data structures I and J, which all require O(n) space.

Algorithm for finding a constrained maximal common subsequence
This section modifies Algorithm findMCS so as to output, for any common subsequence P of X and Y given as an additional input string, a maximal common subsequence of X and Y that contains P in O(n log log n) time and O(n) space, where we assume the same condition of X and Y as in Section 3 and also assume that P […]

[…] Ŵ ← ε; for each index k from |P| − 1 down to 1, […]
5: while […] do the same as lines 4 through 19 of Algorithm findMCS.

[…] instead of (1, 1) […]. Since lines 4 through 19 of the original algorithm delete no characters from W, the modified algorithm eventually outputs a maximal common subsequence of X and Y that contains P in O(n log log n) time after initialization of (W, Ŵ, 1) to (P, P̂, 1). For any index k with 1 ≤ k ≤ |P|, let i_{k+1} (resp. j_{k+1}) be the greatest index such that […] Then, Definition 3 immediately suggests that P̂ can be set to (1, 1) […]. Thus, we have Algorithm findCMCS presented in Algorithm 2 as an O(n log log n)-time algorithm for finding a maximal common subsequence of X and Y containing P.

[…] for each index i from i′_k + 1 to i′_{k+1}, where i′_0 = 0,
27: i_{X[i]} ← i;
28: if j_{X[i]} > j_k, then
29: output "not maximal" and halt;
30: for each index j from j′_k + 1 to j′_{k+1}, where j′_0 = 0,
31: […]

[…] where j′_0 = 0. The algorithm uses index variable i_c (resp. j_c) for any character c in Σ. Let I (resp. J) denote the array consisting of variables i_c (resp. j_c) for all characters c in Σ. For any index k with −1 ≤ k ≤ |W|, let C_I(k) (resp. C_J(k)) denote the condition that, for any character c in Σ, X(i_c, i′_{k+1}] (resp. Y(j_c, j′_{k+1}]) is the longest suffix of X(0, i′_{k+1}] (resp. Y(0, j′_{k+1}]) in which c does not appear.
After computing indices i_k, j_k, i′_k, and j′_k for all indices k with 0 ≤ k ≤ |W| by lines 1 through 21 of the algorithm, lines 22 through 24 initialize variables i_c and j_c so that C_I(−1) and C_J(−1) hold. Then, for any index k from 0 to |W|, lines 25 through 33 check if X_k and Y_k are disjoint as follows. Since either k = 0 or X_{k−1} and Y_{k−1} are disjoint, X′_k and Y′_k are disjoint. Therefore, it suffices to check if X″_k and Y′_k are disjoint and to check if X_k and Y″_k are disjoint. Lines 27 through 29 update I so as to satisfy C_I(k) by iteratively executing line 27 and also check if X″_k and Y′_k are disjoint by iteratively executing line 28 using array J satisfying C_J(k − 1). If X″_k and Y′_k are not disjoint, then, since j_{X[i]} > j_k holds for some index i with i′_k + 1 ≤ i ≤ i′_{k+1} due to C_J(k − 1), line 29 outputs message "not maximal" and terminates the algorithm; otherwise, line 29 is never executed, also due to C_J(k − 1), and hence lines 30 through 33 are executed. Lines 30 through 33 update array J so as to satisfy C_J(k) and check if X_k and Y″_k are disjoint using array I satisfying C_I(k) in a similar manner. Thus, the algorithm works correctly.
It is easy to verify that the algorithm runs in O(n) time.
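The linear-time idea can be sketched in Python as follows. This is an illustration under stated assumptions rather than a transcription of Algorithm determineIfMCS: dictionaries stand in for the arrays of |Σ| indices, the names `prefix_ends`, `suffix_starts`, and `is_maximal_linear` are hypothetical, and W is assumed to be a common subsequence of X and Y (otherwise `str.index` raises ValueError). Each position of X and Y is scanned once; a scanned character is reported only if it lies inside the current X_k (resp. Y_k) and its last recorded occurrence in the other string lies inside the current Y_k (resp. X_k). The explicit guards `i > px[k]` and `j > py[k]` are this sketch's own way of excluding the position matched by W itself.

```python
def prefix_ends(s, w):
    """ends[k] = length of the shortest prefix of s containing w[:k]."""
    ends, i = [0], 0
    for ch in w:
        i = s.index(ch, i) + 1
        ends.append(i)
    return ends


def suffix_starts(s, w):
    """starts[k] = 0-based start of the shortest suffix of s containing w[k:]."""
    starts, i = [len(s)], len(s)
    for ch in reversed(w):
        i = s.rindex(ch, 0, i)
        starts.append(i)
    return starts[::-1]


def is_maximal_linear(x, y, w):
    """True if the common subsequence w of x and y is maximal (Lemma 2 criterion)."""
    px, sx = prefix_ends(x, w), suffix_starts(x, w)  # X_k = x(px[k], sx[k]], 1-based
    py, sy = prefix_ends(y, w), suffix_starts(y, w)  # Y_k = y(py[k], sy[k]], 1-based
    last_x, last_y = {}, {}  # last 1-based position of each character scanned so far
    hx = hy = 0              # how far x and y have been scanned
    for k in range(len(w) + 1):
        # Newly uncovered part of X_k against the already scanned part of Y_k.
        for i in range(hx + 1, sx[k] + 1):
            c = x[i - 1]
            last_x[c] = i
            if i > px[k] and last_y.get(c, 0) > py[k]:
                return False
        hx = sx[k]
        # Newly uncovered part of Y_k against the whole of X_k.
        for j in range(hy + 1, sy[k] + 1):
            c = y[j - 1]
            last_y[c] = j
            if j > py[k] and last_x.get(c, 0) > px[k]:
                return False
        hy = sy[k]
    return True
```

Since sx and sy are non-decreasing in k, the two inner loops together visit every position of X and Y at most once, which gives O(|X| + |Y| + |W|) expected time with hash-based dictionaries.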

Conclusion
The present article proposed an O(n log log n)-time, O(n)-space algorithm that finds a maximal common subsequence of two O(n)-length strings over a totally ordered alphabet set of O(n) characters, where n is an arbitrary positive integer and a common subsequence is maximal if inserting any character into it can no longer yield a common subsequence. It was also shown that, without increasing the asymptotic time and space complexities, this algorithm can be used to find a constrained maximal common subsequence, which contains a common subsequence given arbitrarily as a "relevant" pattern, after an appropriate initialization of some variables. Furthermore, an O(n)-time algorithm that determines if a given common subsequence is maximal was also proposed.

There remain some questions to be solved, which are related to the problems considered in the present article. Our algorithms run much faster than those proposed so far (and also than all possible algorithms under SETH) for the LCS-related problems corresponding to ours. One reason for this difference is that any common subsequence is certainly a subsequence of some maximal common subsequence but is not necessarily a subsequence of any LCS. This fact naturally poses the question of whether we can find a restricted maximal common subsequence, which does not contain a common subsequence given as an "irrelevant" pattern, in O(n log log n) time and O(n) space, because some restricted non-maximal common subsequences are not necessarily subsequences of any restricted maximal common subsequence. The gap between the asymptotic execution times of the proposed algorithms for finding a maximal common subsequence and for determining if a given common subsequence is maximal immediately poses another natural question: whether we can find a maximal common subsequence in O(n) time.

CPM 2018

Lemma 2. For any common subsequence W of X and Y, W is maximal if and only if X_k and Y_k are disjoint for any index k with 0 ≤ k ≤ |W|, where X_k (resp. Y_k) is the remaining substring obtained from X (resp. Y) by deleting both the shortest prefix containing W(0, k] and the shortest suffix containing W(k, |W|].

Proof. The lemma follows from the fact that, for any index k with 0 ≤ k ≤ |W| and any character c in Σ, W(0, k] • c • W(k, |W|] is a common subsequence of X and Y if and only if c appears in both X_k and Y_k.
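As a sanity check of Lemma 2, the following Python sketch compares the criterion of the lemma with the definition of maximality applied literally, i.e. trying every insertion of every shared character. Both functions are illustrative and quadratic or worse; the names `middle`, `is_maximal_by_definition`, and `is_maximal_by_lemma2` are assumptions, and W is assumed to be a common subsequence of X and Y.

```python
def is_subsequence(t, s):
    """True if t is a subsequence of s."""
    it = iter(s)
    return all(ch in it for ch in t)


def middle(s, w, k):
    """s with the shortest prefix containing w[:k] and the shortest suffix containing w[k:] deleted."""
    a = 0
    for ch in w[:k]:
        a = s.index(ch, a) + 1      # leftmost embedding of w[:k]
    b = len(s)
    for ch in reversed(w[k:]):
        b = s.rindex(ch, 0, b)      # rightmost embedding of w[k:]
    return s[a:b]


def is_maximal_by_definition(x, y, w):
    """Try inserting every character shared by x and y at every position of w."""
    shared = set(x) & set(y)
    for k in range(len(w) + 1):
        for c in shared:
            v = w[:k] + c + w[k:]
            if is_subsequence(v, x) and is_subsequence(v, y):
                return False
    return True


def is_maximal_by_lemma2(x, y, w):
    """W is maximal iff X_k and Y_k share no character for every k."""
    return all(
        set(middle(x, w, k)).isdisjoint(middle(y, w, k))
        for k in range(len(w) + 1)
    )
```

On the strings of Figure 1 (with an ASCII caret standing in for ˆ), both functions agree that W = ˆdcebfag$ is maximal.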

Figure 1. A maximal common subsequence W = ˆdcebfag$ of X = ˆdccefebcccfbbfbhagbh$ and Y = ˆddacegagaabefdacggiai$ with c_1 = ˆ and c_{|Σ|} = $, which is output by Algorithm findMCS. Lines 5 through 18 of the algorithm are executed fifteen times and, for each number t with 1 ≤ t ≤ 15, the t-th innermost pair of arrows (one solid and the other dotted, which are of the same length) indicates which index pairs (i, j) are considered by line 7 throughout the t-th iteration of lines 5 through 18, where the dotted arrow is chosen so as to show that the sum of the lengths of all dotted arrows is at most 2(|X| + |Y|). Each dashed line between X[i] and Y[j] indicates that Ŵ is replaced by Ŵ(0, k] • (i − 1, j − 1) • Ŵ(k, |Ŵ|] by either line 15 or line 18. Each solid line between X[i′] and Y[j′], other than the leftmost one, indicates that Ŵ[k + 1] is set to index pair (i′, j′) by line 11.

Lemma 5. At least one of I^≻_{W[k+1]}(i′) = i_{k+1} or J^≻_{W[k+1]}(j′) = j_{k+1} holds at any execution of line 11 in Algorithm findMCS, where i_{k+1} and j_{k+1} are the indices in Definition 3.

Proof. Since X(i′, i_{k+1}] and Y(j′, j_{k+1}] are disjoint due to Lemma 4, W[k + 1] does not appear in at least one of X(i′, i_{k+1}] or Y(j′, j_{k+1}].

Theorem 6. For any strings X and Y of length O(n) with |Σ| = O(n), Algorithm findMCS outputs a maximal common subsequence of X and Y in O(n log log n) time and O(n) space.

Proof. Since C(W, Ŵ, |W|) holds at the last execution of line 4 of the algorithm due to Lemma 4, it follows from Lemma 2 that W output by the algorithm is a maximal common subsequence of X and Y. The execution time of the algorithm is estimated as follows. Let V be the eventual string W output by line 19. For any index k with 0 ≤ k ≤ |V|, let g_k (resp. h_k) denote the least index such that X(0, g_k] (resp. Y(0, h_k]) contains V(0, k]. Let k be an arbitrary index with 1 ≤ k ≤ |V| and consider W and Ŵ just before execution of line 11. Let i_k, j_k, i_{k+1}, and j_{k+1} be the indices in Definition 3. Let (i′, j′) = Ŵ[k] and let (i, j) = Ŵ[k + 1].

Theorem 7. For any strings X and Y of length O(n) with |Σ| = O(n) and any common subsequence P of X and Y, Algorithm findCMCS outputs a maximal common subsequence of X and Y containing P in O(n log log n) time and O(n) space.

Proof. It is easy to verify by induction that, for any index k with 1 ≤ k ≤ |P| − 1, X(i, |X|] (resp. Y(j, |Y|]) at execution of line 9 of the algorithm is the shortest suffix of X (resp. Y) that contains P(k, |P|]. Therefore, W, Ŵ, and k just after execution of line 12 satisfy C(W, Ŵ, k). Since lines 1 through 12 are executed in O(n) time, the theorem can be proven in a way similar to the proof of Theorem 6.

Algorithm for determining if a common subsequence is maximal
This section proposes an O(n)-time algorithm that determines, for any strings X and Y of length O(n) with |Σ| = O(n) and any common subsequence W of X and Y given as input, whether W is maximal or not.

The proposed algorithm is based on Lemma 2. Using an array of |Σ| bits, each being used to indicate if a distinct character in Σ appears in Y_k, we can determine if X_k and Y_k are disjoint in O(|X_k| + |Y_k|) time for any index k with 0 ≤ k ≤ |W|, where X_k and Y_k are the substrings of X and Y in Lemma 2, respectively. However, this naive approach provides only an O(n^2)-time algorithm, because both |X_k| and |Y_k| can be Θ(n) for all indices k and |W| can also be Θ(n).

In order to reduce this execution time to O(n), the algorithm exploits the fact that if X_{k−1} and Y_{k−1} are disjoint, then the prefix X′_k of X_k overlapping X_{k−1} and the prefix Y′_k of Y_k overlapping Y_{k−1} are also disjoint; otherwise, W is not maximal due to Lemma 2, where X_{−1} = X(0, 0] and Y_{−1} = Y(0, 0]. From this fact, if X_{k−1} and Y_{k−1} are disjoint, then whether X_k and Y_k are disjoint can be determined only by checking if X″_k and Y′_k are disjoint as well as checking if X_k and Y″_k are disjoint, where X″_k (resp. Y″_k) is the remaining suffix of X_k (resp. Y_k) after deleting prefix X′_k (resp. Y′_k). Note however that, as long as the array of |Σ| bits is used, it still takes O(|X_k| + |Y_k|) time to perform these two checks. The algorithm reduces this execution time to O(|X″_k| + |Y″_k|) by using, instead of the bit array, a pair of arrays of |Σ| indices. Each index in one (resp. the other) of the arrays in the pair is used to represent the last position at which a distinct character in Σ appears in the prefix of Y (resp. X) having Y′_k (resp. X_k) as a suffix. This index array allows the algorithm to determine if any character in X″_k (resp. Y″_k) appears in Y′_k (resp. X_k) in O(1) time. Furthermore, since the prefix of Y (resp. X) having Y′_k (resp. X_k) as a suffix is the concatenation of the prefix of Y (resp. X) having Y′_{k−1} (resp. X_{k−1}) as a suffix followed by Y″_{k−1} (resp. X″_k), for each k, this index array can be updated appropriately in O(|Y″_{k−1}|) (resp. O(|X″_k|)) time. We show that Algorithm determineIfMCS presented in Algorithm 3 works as the proposed algorithm.

Algorithm 3: Algorithm determineIfMCS

Theorem 8. For any strings X and Y of length O(n) with |Σ| = O(n) and any common subsequence W of X and Y, Algorithm determineIfMCS outputs message "not maximal" if W is not a maximal common subsequence of X and Y, or outputs message "maximal" otherwise, in O(n) time.

Proof. For any index k with 0 ≤ k ≤ |W|, let X_k and Y_k be the strings in Lemma 2 and let indices i_k, i′_{k+1}, j_k, and j′_{k+1} be such that X […]