Abstract 1 Introduction 2 Preliminaries 3 Algorithm 4 Concluding remarks References

Linear-Space LCS Enumeration for Two Strings

Yoshifumi Sakai ORCID Graduate School of Agricultural Science, Tohoku University, Sendai, Japan
Abstract

Suppose we want to seek the longest common subsequences (LCSs) of two strings as informative patterns that explain the relationship between the strings. The dynamic programming algorithm gives us a table from which all LCSs can be extracted by traceback. However, the need for quadratic space to hold this table can be an obstacle when dealing with long strings. A question that naturally arises in this situation would be whether it is possible to exhaustively search for all LCSs one by one in a time-efficient manner using only a space linear in the LCS length, where we treat read-only memory for storing the strings as excluded from the space consumed. As a part of the answer to this question, we propose an O(L)-space algorithm that outputs all distinct LCSs of the strings one by one each in O(n2logL) time, where the strings are both of length n and L is the LCS length of the strings.

Keywords and phrases:
algorithms, longest common subsequence, enumeration
Funding:
Yoshifumi Sakai: This work was supported by JSPS KAKENHI Grant Number JP23K10975.
Copyright and License:
[Uncaptioned image] © Yoshifumi Sakai; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Pattern matching
Editors:
Paola Bonizzoni and Veli Mäkinen

1 Introduction

Comparing two strings to find common patterns that would be most informative of their relationships is a fundamental task in parsing them. The longest common subsequences (LCSs) are regarded as such common patterns, and the problem of efficiently finding one of them has long been studied [2, 3, 5, 6, 7, 8, 9, 10, 11, 13]. Here, a subsequence of a string is the sequence obtained from the string by deleting any number of elements at any position that is not necessarily contiguous. Hence, an LCS of two strings is a common subsequence obtained by deleting the least possible number of elements from the strings. For example, if X=𝚊𝚌𝚍𝚍𝚊𝚍𝚊𝚌𝚋𝚌𝚋 and Y=𝚌𝚊𝚌𝚌𝚋𝚊𝚊𝚍𝚌𝚊𝚍 are the target strings, then 𝚌𝚊𝚌𝚌𝚋 is one of the seven distinct LCSs of X and Y. The six other LCSs are 𝚌𝚊𝚌𝚋𝚌, 𝚊𝚌𝚌𝚋𝚌, 𝚊𝚌𝚊𝚊𝚌, 𝚊𝚌𝚊𝚍𝚌, 𝚊𝚌𝚊𝚍𝚊, and 𝚊𝚌𝚍𝚊𝚍.

Given a pair of strings X and Y both of length n and a string Z of length less than n, it is easy to determine whether Z is a common subsequence of X and Y in O(n) time. In comparison, it is hard to determine whether Z is an LCS of X and Y. The reason is that we need to know the LCS length in advance. It was revealed [1, 4] that unless the strong exponential time hypothesis (SETH) does not hold, no algorithm can determine the LCS length of X and Y in O(n2ε) time for any positive constant ε. Therefore, we cannot find any LCS in O(n2ε) time under SETH.

On the other hand, as is well known, finding an arbitrary LCS of X and Y is possible in O(n2) time and O(n2) space by the dynamic programming (DP) algorithm [13]. The space of size O(n2) consumed to store the LCS lengths for all pairs of a prefix of X and a prefix of Y can be treated as constituting a particular directed acyclic graph (DAG) such that any path from the source to the sink represents an LCS of X and Y. This DAG is also useful in applications other than seeking a single arbitrary LCS because each LCS is represented by at least one of the paths from the source to the sink. If one considers only the problem of finding a single arbitrary LCS without intending a data structure representing all LCSs, Hirschberg [7] showed that O(n) space is sufficient to solve the problem in the same O(n2) time as the DP algorithm. Excluding the read-only memory storing X and Y, his algorithm with a slight modification performs only in O(L(X,Y)) space, where L(X,Y) denotes the LCS length of X and Y.

The advantage of O(L(X,Y))-space algorithms for finding an arbitrary LCS is that they perform without reserving O(n2) space, the size of which becomes pronounced when n is large. Thus, a natural question that would come to mind is whether it is possible to design O(L(X,Y))-space algorithms that are as accessible to all LCSs as the DP algorithm in addition to the above advantage. As a part of the answer to this question, this article shows that an O(L(X,Y))-space algorithm is capable of enumerating all distinct LCSs with a delay time not significantly inferior to the running time of Hirschberg’s O(L(X,Y))-space algorithm [7] for finding a single arbitrary LCS. More precisely, the algorithm proposed in this article outputs each LCS in O(n2logL(X,Y)) time after the previous LCS is output, if any, which is inferior by a logarithmic factor of L(X,Y). Enumeration is done by introducing a DAG, called the all-LCS graph, with each path from the source to the sink corresponding to a distinct LCS and vice versa. Since the size of the all-LCS graph is O(n2), which exceeds O(L(X,Y)), we design the proposed algorithm to somehow perform a depth-first search without explicitly constructing it.

2 Preliminaries

For any sequences S and S, let SS denote the concatenation of S followed by S. For any sequence S, let |S| denote the length of S, i.e., the number of elements in S. For any index i with 1i|S|, let S[i] denote the ith element of S, so that S=S[1]S[2]S[|S|]. A subsequence of S is obtained from S by deleting zero or more elements at any positions not necessarily contiguous, i.e., the sequence S[i1]S[i2]S[i] for some length with 0|S| and some indices i1,i2,,i with 1i1<i2<<i|S|. For any indices i and i with 1ii+1|S|+1, let S[i:i] denote the contiguous subsequence S[i]S[i+1]S[i] of S, where S[i:i] with i=i+1 represents the empty subsequence. If i=1 (resp. i=|S|), then S[i:i] is called a prefix (resp. suffix) of S.

A string is a sequence of characters over an alphabet set. For any strings X and Y, a common subsequence of X and Y is a subsequence of X that is a subsequence of Y. Let L(X,Y) denote the maximum of |Z| over all common subsequences Z of X and Y. Any common subsequence Z of X and Y with |Z|=L(X,Y) is called a longest common subsequence (an LCS) of X and Y. The following lemma gives a typical recursive expression for the LCS length.

Lemma 1 (e.g., [13]).

For any strings X and Y, if at least one of X and Y is empty, then L(X,Y)=0; otherwise, if x=y, then L(X,Y)=L(X,Y)+1; otherwise, L(X,Y)=max(L(X,Y),L(X,Y)), where X=Xx and Y=Yy. The same also holds for the case where X=xX and Y=yY.

Let any algorithm that takes certain information, specified later, of an arbitrary pair of non-empty strings X and Y as input and uses only O(L(X,Y)) space to output all distinct LCSs of X and Y, each represented by a specific form, one by one be called a linear-space LCS enumeration algorithm. We assume that the information of X and Y given to the algorithm consists of |X|, |Y|, and O(1)-time access to check whether X[i]=Y[j] or not for any pair of indices i and j with 1i|X| and 1j|Y| with no space consumption. The worst-case time between outputting one LCS and the next LCS is called the delay time of the algorithm. This article aims to propose a linear-space LCS enumeration algorithm with O(|X||Y|logL(X,Y)) delay time.

Let any index pair (i,j) such that 1i|X|, 1j|Y|, and X[i]=Y[j] be called a match. For convenience, we sometimes treat X[0] and Y[0] (resp. X[|X|+1] and Y[|Y|+1]) as virtual elements of X and Y, respectively, both of which are identical to a virtual character, say (resp. $), that appears in neither X nor Y, so that (0,0) (resp. (|X|+1,|Y|+1)) can be treated as a virtual match. We use src and snk to denote (0,0) and (|X|+1,|Y|+1), respectively. For any match w, let iw and jw denote the indices such that (iw,jw)=w. Furthermore, let the diagonal coordinate of w be jwiw, by treating any match as a grid point on a two-dimensional plane. For any matches w and w, let w<w (resp. ww) mean that both iw<iw and jw<jw (resp. iwiw and jwjw). For any non-virtual match w, we call the character that is identical to both X[iw] and Y[jw] the character corresponding to w. Similarly, for any sequence W of non-virtual matches, we call the concatenation of the characters corresponding to W[k] for all indices k from 1 to |W| in this order the string corresponding to W.

Let us call any sequence W of L(X,Y) non-virtual matches with W[1]<W[2]<<W[L(X,Y)] an LCS-match sequence, because the string corresponding to W is an LCS of X and Y. For convenience, we sometimes treat src(=(0,0)) and snk(=(|X|+1,|Y|+1)) as virtual elements W[0] and W[L(X,Y)+1] of any LCS-match sequence W, respectively. Although any LCS Z of X and Y has at least one LCS-match sequence W with Z as the corresponding string, two or more such LCS-match sequences W may exist. Hence, the problem of enumerating LCSs is not equivalent to the problem of enumerating LCS-match sequences. As the representative of LCS-match sequences W having the same corresponding strings, we consider the one such that for any index k with 1kL(X,Y), X[1:iW[k]] and Y[1:jW[k]] are the shortest prefixes of X and Y with the string corresponding to W[1:k] as a subsequence, respectively. We call any such LCS-match sequence front-leaning. Since any LCS Z of X and Y has exactly one front-leaning LCS-match sequence W with Z as the corresponding string, the problem of enumerating LCSs is equivalent to the problem of enumerating front-leaning LCS-match sequences. Based on this observarion, we design the proposed linear-space LCS enumeration algorithm to output all distinct front-leaning LCS-match sequences one by one instead of outputting the corresponding LCSs.

3 Algorithm

This section proposes a linear-space LCS enumeration algorithm that outputs all distinct front-leaning LCS-match sequences of X and Y each in O(|X||Y|logL(X,Y)) time.

Our approach to designing the algorithm considers a particular directed acyclic graph (DAG), which we call the all-LCS graph, such that each path from vertex src(=(0,0)) to vertex snk(=(|X|+1,|Y|+1)) represents a distinct front-leaning LCS-match sequence and vice versa, where each vertex in the graph is a different particular match. Relying on this DAG, we can enumerate LCSs by enumerating all paths from src to snk in a straightforward manner using depth-first search (DFS). The all-LCS graph we introduce is of size O(|X||Y|), which exceeds the size of space allowed for any linear-space LCS enumeration algorithm. Our proposed algorithm overcomes this situation by simulating DFS on the all-LCS graph without explicitly constructing it. We define the all-LCS graph in Section 3.1 and propose a linear-space LCS enumeration algorithm by presenting how to simulate the DFS in Section 3.2.

3.1 All-LCS graph

Conceptually speaking, the all-LCS graph is the union of the path from src to snk that passes through vertices W[0],W[1],,W[L(X,Y)+1] in this order over all front-leaning LCS-match sequences W, where we recall that W[0]=src and W[L(X,Y)+1]=snk. The correctness of the definition can be verified because for any front-leaning LCS-match sequences that share an element, exchanging their prefixes ending at the element also yields front-leaning LCS-match sequences. However, since we need to use the all-LCS graph without knowing what front-leaning LCS-match sequences exist, the current definition is not specific enough for our situation. To address this issue, we redefine the all-LCS graph as inductively defined.

For simplicity, for any match w, let L(w) and L+(w) denote L(X[1:iw],Y[1:jw]) and L(X[iw:|X|],Y[jw:|Y|]), respectively. The key idea to redefine the all-LCS graph is to focus only on particular matches introduced below.

Definition 2 (valid match).

Let any match w such that L(w)+L+(w)=L(X,Y)+1 be called L(w)-valid, or simply valid.

Definition 3 (follower).

For any valid match u, let any (L(u)+1)-valid match v that is the only match w with u<wv be called a u-follower. Let Fu denote the sequence of all u-followers such that iFu[1]>iFu[2]>>iFu[|Fu|] (and hence jFu[1]<jFu[2]<<jFu[|Fu|]). Adopting the order of elements in Fu, let the fth u-follower be Fu[f] and let the next u-follower of Fu[f] be Fu[f+1], unless f=|Fu|.

Definition 2 is conceived from the fact that for any LCS-match sequence W and any index k with 0kL(X,Y)+1, both L(W[k])=k and L+(W[k])=L(X,Y)+1k. On the other hand, Definition 3 comes from the observation that for any front-leaning LCS-match sequence W and any index k with 0kL(X,Y), W[k] is k-valid, W[k+1] is (k+1)-valid, and W[k+1] is the only match w such that W[k]<wW[k+1]. Sequence Fu introduced in Definition 3 makes sense because for any distinct u-followers v and v, if at least iviv or jvjv, then both iv>iv and jv<jv; otherwise, both iv<iv and jv>jv, due to condition that v (resp. v) is the only match w with u<wv (resp. u<wv). Our inductive definition of the all-LCS graph is as follows (see Figure 1).

Definition 4 (all-LCS graph).

The all-LCS graph is defined inductively to be the DAG such that src(=(0,0)) is a vertex and for any vertex u, all u-followers are vertices and any u-follower v is connected with u by an edge from u to v.

Refer to caption
Figure 1: The all-LCS graph for X=𝚊𝚌𝚍𝚍𝚊𝚍𝚊𝚌𝚋𝚌𝚋 and Y=𝚌𝚊𝚌𝚌𝚋𝚊𝚊𝚍𝚌𝚊𝚍, having seven LCSs, 𝚌𝚊𝚌𝚌𝚋, 𝚌𝚊𝚌𝚋𝚌, 𝚊𝚌𝚌𝚋𝚌, 𝚊𝚌𝚊𝚊𝚌, 𝚊𝚌𝚊𝚍𝚌, 𝚊𝚌𝚊𝚍𝚊, and 𝚊𝚌𝚍𝚊𝚍, where vertices and edges of the graph are indicated by solid bullets and lines, respectively, matches other than the vertices are indicated by open bullets, and all k-valid matches for each index k other than 0 and L(X,Y)+1(=6) are indicated by a dotted polygonal line connecting them from the lowermost-leftmost one to the uppermost-rightmost one in ascending order of the diagonal coordinate.

The following lemma claims that Definition 4 defines the all-LCS graph well.

Lemma 5.

The all-LCS graph has a path from vertex src to vertex snk that passes through vertices src,w1,w2,,w,snk in this order if and only if w1w2w is a front-leaning LCS-match sequence.

Proof.

Let W be an arbitrary front-leaning LCS-match sequence, so that for any index k with 1k|W|+1, W[k] is a W[k1]-follower. Since src is a vertex of the all-LCS graph, induction proves that for any index k with 1k|W|+1, the all-LCS graph has an edge from W[k1] to W[k], completing the “if” part of the lemma.

Consider the path of the all-LCS graph in the lemma. Since src is 0-valid, W[k] is a W[k1] follower for any index k with 1k+1, and snk is (L(X,Y)+1)-valid, we have that =L(X,Y). Induction proves that for any index k with 1k, X[1:iwk] and Y[1:jwk] are the shortest prefixes of X and Y with the string corresponding to w1w2wk as a subsequence, respectively, completing the proof of the “only if” part.

The size of the all-LCS graph is O(|X||Y|) as stated in the following lemma.

Lemma 6.

The number of edges in the all-LCS graph is at most 2|X||Y|+1.

Proof.

The number of edges in the all-LCS graph is equal to the sum of |Fu| over all vertices u in the graph, where Fu is the sequence of all u-followers introduced in Definition 3. For any vertex u and any index f with 1f|Fu|/2, it follows from iFu[1]>iFu[2]>>iFu[|Fu|]>iu (resp. ju<jFu[1]<jFu[2]<<jFu[|Fu|]) that iu+f<iu+((|Fu|+1)f)iFu[f] (resp. ju+fjFu[f]). On the other hand, since Fu[f] is a u-follower, Fu[f] is the only match w with u<wFu[f]. Thus, (iu+f,ju+f) is not a match. This implies that each (iu+f,ju+f) is a distinct non-match index pair with 1iu+f|X| and 1ju+f|Y| because (iu,ju) is a match and (iu+1,ju+1),(iu+1,ju+2),,(iu+f1,ju+f1) include no match. In addition, each vertex u having an odd number of u-followers other than src is a distinct match with 1iu|X| and 1ju|Y|. Therefore, the lemma holds because there exist |X||Y| index pairs (i,j) with 1i|X| and 1j|Y|.

 Remark 7.

If an O(|X||Y|) space is available, then as an implementation of the all-LCS graph, it is not difficult to design an O(|X||Y|)-time algorithm that constructs the two-dimensional array G such that each element G[i,j] with 0i|X|+1 and 0j|Y|+1 is the sequence of all (i,j)-followers in ascending order of the diagonal coordinate, if (i,j) is a vertex of the all-LCS graph, or the empty sequence, otherwise. Once G is available, we can enumerate LCSs with O(L(X,Y)) delay time by finding each distinct path in the all-LCS graph from src to snk in O(L(X,Y)) time by performing DFS.

3.2 Proposed linear-space LCS enumeration algorithm

As mentioned earlier, the linear-space LCS enumeration algorithm we propose simulates DFS on the all-LCS graph without explicitly constructing it, because the algorithm can use only O(L(X,Y)) space while the size of the all-LCS graph is O(|X||Y|).

3.2.1 Algorithm overview

Refer to caption
Figure 2: Valid matches with the third (resp. fourth) output front-leaning LCS-match sequences W (resp. W), indicated by dashed (resp. solid) polygonal line connecting from src=(0,0) to snk=(12,12), for the same X and Y as Figure 1, where the branching match of W (resp. W) is indicated by a dashed (resp. solid) line connecting from W[κ1] (resp. W[κ1]) for the branching position κ of W (resp. W) to it.
Algorithm 1 enumLCS.

The proposed algorithm outputs each distinct front-leaning LCS-match sequence W in lexicographical order of the sequence f1f2fL(X,Y), where W[k] is the fkth W[k1]-follower (see Definition 3). Due to this output order setting, the following index and match immediately provide the subsequent lemma, based on which we design the proposed algorithm.

Definition 8 (branching position and match).

For any front-leaning LCS-match W, let the greatest index k with 1kL(X,Y) such that W[k] is not the last W[k1]-follower be called the branching position of W, if any. In addition, let the next W[κ]-follower of W[κ] be called the branching match of W, if the branching position κ of W exists.

Lemma 9.

For any front-leaning LCS-match sequence W, W is the last output one if and only if W has no branching position. If W is the first output one, then for any index k with 1kL(X,Y), W[k] is the first W[k1]-follower. Otherwise, W[1:κ1]=W[1:κ1], W[κ] is the branching match of W, and for any index k with κ+1kL(X,Y), W[k] is the first W[k1]-follower, where W is the last output front-leaning LCS-match sequence before W and κ is the branching position of W.

Example 10.

Consider the third and fourth output front-leaning LCS-match sequences as W and W, respectively, for the same X and Y as Figure 1 (see Figure 2). Since W is (1,2)(2,3)(8,4)(9,5)(10,9) with FW[0]=F(0,0)=(2,1)(1,2), FW[1]=F(1,2)=(2,3), FW[2]=F(2,3)=(8,4)(5,6)(3,8), FW[3]=F(8,4)=(9,5), FW[4]=F(9,5)=(10,9), and FW[5]=F(10,9)=(12,12), the branching position of W is 3 and the branching match of W is (5,6). Thus, it follows from Lemma 9 that W[1:2]=W[1:2]=(1,2)(2,3) and W[3]=(5,6). In addition, W[4]=(7,7) due to FW[3]=F(5,6)=(7,7)(6,8), and W[5]=(8,9) due to FW[4]=F(7,7)=(8,9). Hence, W is determined as (1,2)(2,3)(5,6)(7,7)(8,9).

The proposed algorithm denoted enumLCS obtains each distinct front-leaning LCS-match sequence W using variables 𝖶 and κ. These variables are to maintain W and the existing branching position of W, respectively. After initializing κ to 0, for each front-leaning LCS-match sequence W according to our output order setting, the algorithm executes procedure updateW to set 𝖶[κ+1:L(X,Y)] to W[κ+1:L(X,Y)], outputs 𝖶 as W, and executes procedure findBranch to set κ and W[κ] to the branching position and match of W, respectively, if existing. If the branching position of W does not exist, then κ is set to a dummy index, indicating that W is the last output one. Due to Lemma 9, induction proves that the algorithm outputs all distinct front-leaning LCS-match sequences of X and Y one after another. A pseudo-code of enumLCS is given as Algorithm 1.

3.2.2 Space-efficient search for locally valid matches

Before presenting the details of procedures updateW and findBranch, we generalize the validity of matches to the local case and develop an algorithm that generates the stream of all the “locally” k-valid matches in ascending order of the diagonal coordinate. The reason is that both updateW and findBranch are designed to simulate this algorithm.

For any valid matches u and v with u<v and any match w with u<w<v, let Lu(w) denote L(X[iu+1:iw],Y[ju+1:jw]) and let Lv+(w) denote L(X[iw:iv1],Y[jw:jv1]), so that L(w)=L(0,0)(w) and L+(w)=L(|X|+1,|Y|+1)+(w). We define locally valid matches as follows.

Definition 11 (locally valid match).

For any valid matches u and v with u<v, let any match with u<w<v and Lu(w)+Lv+(w)=L(v)L(u) be called (u,v)-locally (L(u)+Lu(w))-valid, or simply (u,v)-locally valid.

Note that for any valid matches u, v, and w with u<w<v, w is not necessarily (u,v)-locally valid. For example, u=(5,6), v=(12,12), and w=(6,11) for the same X and Y as Figure 2 are valid matches with u<w<v but w is not (u,v)-locally valid. In contrast, any (u,v)-locally k-valid match w is k-valid, because L(w)=k and L+(w)=L(X,Y)+1k due to L(w)L(u)+Lu(w)=k, L+(w)Lv+(w)+L+(v)=L(X,Y)+1k, and L(w)+L+(w)L(X,Y)+1.

The number of (u,v)-locally k-valid matches is O((iviu)+(jvju)), which may exceed O(L(v)L(u)). However, the stream of all (u,v)-locally k-valid matches in ascending order of the diagonal coordinate can be generated by executing the O(L(v)L(u))-space algorithm genStream presented as Algorithm 2.

Algorithm 2 genStream(u,v,l,m,k).
Table 1: Table of 𝗃, 𝖨, 𝗂, 𝖩, and the diagonal coordinate 𝗃𝗂 of (𝗂,𝗃) just before the rth execution of line 2 of genStream(u,v,l,m,k) implemented as Algorithm 2, excluding the last execution to terminate the algorithm, for the same X and Y as Figure 2, u=(5,6) with l=3, v=(12,12) with m=6, and k=5, where the last column indicates the (u,v)-locally k-valid matches (𝗂,𝗃) output just after the rth execution of line 3.
r 𝗃 𝖨 𝗂 𝖩 𝗃𝗂 output matches
1 7 7 11 empty 4
2 8 6 11 empty 3
3 9 68 11 empty 2
4 9 68 10 9 1 (10,9)
5 9 68 9 9 0
6 9 68 8 9 1 (8,9)
7 10 67 8 9 2
8 10 67 7 107 3 (7,10)
9 11 67 7 107 4
Lemma 12.

For any indices l, k, and m with l<k<m and any pair of an l-valid match u and an m-valid match v with u<v, genStream(u,v,l,m,k) outputs all (u,v)-locally k-valid matches one by one in ascending order of the diagonal coordinate in O((iviu)(jvju)) time and O(ml) space.

Proof.

For any index i with iujiv and any index j with jujjv, let Lu(i,j) and Lv+(i,j) denote L(X[iu+1:i],Y[ju+1:j]) and L(X[i:iv1],Y[j:jv1]), respectively, so that for any match w with u<w<v, Lu(w)=Lu(iw,jw) and Lv+(w)=Lv(iw,jw).

To find all (u,v)-locally k-valid matches, genStream maintains two index variables 𝗃 and 𝗂 with ju+1𝗃jv and iu𝗂iv1. After initializing 𝗃 and 𝗂 to ju+1 and iv1, respectively, the algorithm repeatedly updates 𝗃 and 𝗂 by either increasing 𝗃 by one or decreasing 𝗂 by one and outputs (𝗂,𝗃) as a distinct (u,v)-locally k-valid match, if Lu(𝗂,𝗃)=kl, Lv+(𝗂,𝗃)=mk, and X[𝗂]=Y[𝗃], until 𝗃=jv or 𝗂=iu. Hence, the (u,v)-locally k-valid matches output by the algorithm are in ascending order of the diagonal coordinate (see Table 1).

At each iteration of the algorithm, at least Lu(𝗂1,𝗃)<kl or Lv+(𝗂,𝗃+1)<mk. This is because otherwise there exist matches w and w+ with u<w<w+<v, Lu(w)kl, and Lv+(w+)mk, which implies that L(X,Y)L(w)+L+(w+)(L(u)+Lu(w))+(Lv+(w+)+L+(v))(l+(kl))+((mk)+(L(X,Y)+1m))=L(X,Y)+1, a contradiction. Therefore, if Lu(𝗂1,𝗃)<kl, then any match (i,𝗃) with iu+1i𝗂1 is not (u,v)-locally k-valid due to Lu(i,𝗃)Lu(𝗂1,𝗃); otherwise, any match (𝗂,j) with 𝗃+1jjv1 is not (u,v)-locally k-valid due to Lv+(𝗂,j)Lv+(𝗂,𝗃+1). Based on this fact, in each iteration of the algorithm, if Lu(𝗂1,𝗃)<kl, then 𝗃 is increased by one; otherwise, 𝗂 is decreased by one.

The algorithm accomplishes the above by maintaining a sequence variable 𝖨 of at most ml indices so that |𝖨|=Lu(iv1,𝗃) and for any index p with 1p|𝖨|, 𝖨[p] is the least index i with iu+1iiv1 such that Lu(i,𝗃)=p. Analogously, a sequence variable 𝖩 of at most ml indices is maintained so that |𝖩|=Lv+(𝗂,ju+1) and for any index p with 1p|𝖩|, 𝖩[p] is the greatest index j with ju+1jjv1 such that Lv+(𝗂,j)=p. The algorithm executes procedure incJ (lines 9 through 13 of Algorithm 2) to increase 𝗃 by one and update 𝖨 in a straightforward way based on Lemma 1 in O(iviu) time and executes procedure decI (lines 14 through 18) to decrease 𝗂 by one and update 𝖩 in O(jvju) time symmetrically.

It follows from Lu(𝗂,𝗃)+Lv+(𝗂,𝗃)ml that both Lu(𝗂,𝗃)=kl and Lv+(𝗂,𝗃)=mk hold if and only if all of |𝖨|kl, |𝖩|mk, 𝗂𝖨[kl], and 𝗃𝖩[mk] hold. On the other hand, Lu(𝗂1,𝗃)<kl if and only if |𝖨|<kl or 𝗂𝖨[kl]. Thus, all distinct (u,v)-locally k-valid matches are output by the algorithm successfully in ascending order of the diagonal coordinate. Since incJ is executed at most jvju times and decI is executed at most iviu times, the algorithm runs in O((iviu)(jvju)) time and O(ml) space.

3.2.3 Ideas for speeding up the naive method

In what follows, for simplicity, we use κ to represent the index maintained by enumLCS, when 𝖶 is output as W by line 6 of Algorithm 1, where W is an arbitrary front-leaning LCS-match sequence. Hence, if W is the first output one, then κ=0; otherwise, κ is the branching position of the last output front-leaning LCS-match sequence before W. When mentioning genStream(u,v,l,m,k), it is often the case that l and m are obvious from u and v, respectively. In such cases, for the sake of readability, l and m are omitted from the specification, and we denote it by genStream(u,v,k).

A naive O(L(X,Y))-space procedure completes the task of updateW and findBranch simultaneously in O(L(X,Y)|X||Y|) time as follows. For each index k from 1 to L(X,Y) in this order, by simulating genStream(W[k1],snk,k) in O(|X||Y|) time, we can obtain the existing next W[k1]-follower of W[k] as a candidate of the branching match of W, if kκ, or otherwise, we can obtain both the first and existing second W[k1]-followers of W[k1] as W[k] and a candidate of the branching match, respectively. The existing last found candidate is the branching match of W, based on which the branching position κ of W can also be determined.

We reduce the execution time of the naive procedure to O(|X||Y|logL(X,Y)) by considering an approximation W of W that can be obtained faster than W and transformed into W easily. The definition of W is as follows.

Definition 13 (approximation W of W).

For any front-leaning LCS-match sequence W, let W denote the sequence such that W[1:κ]=W[1:κ] and for any index k with κ+1kL(X,Y), W[k] is the (W[κ],snk)-locally k-valid match with the least diagonal coordinate.

Example 14.

If W is the forth output front-leaning LCS-match sequence for the same X and Y as Figure 2, which is (1,2)(2,3)(5,6)(7,7)(8,9) with κ=3 (see Example 10), then W is (1,2)(2,3)(5,6)(7,7)(10,9). Note that W[5] is not identical to W[5] because the ((5,6),(12,12))-locally 5-valid match with the least diagonal coordinate is not (8,9) but (10,9) (see Table 1).

The difference between W and W that allows us to obtain W faster than W is that each element W[k] of W with κ+1kL(X,Y) is defined dependent on W[k1] while the corresponding element W[k] of W is defined independent of W[k1]. That is, unlike the case of W, where each element W[k] of W[κ+1:L(X,Y)] must be determined in ascending order of k from κ+1 to L(X,Y), each element W[k] of W[κ+1:L(X,Y)] can be determined in any order. Procedure updateW obtains W[κ+1:L(X,Y)] by recursively decomposing the problem of determining W[l+1:m1] into the problems of determining W[l+1:k1] and determining W[k+1:m1], which is done by simulating genStream(W[l],W[m],k) until the first match is output to determine W[k]. The following lemma guarantees that our approach works by claiming that W[l]<W[l+1]<<W[m] because this implies that W[k] is (W[l],W[m])-locally k-valid.

Lemma 15.

For any front-leaning LCS-match sequence W, W is an LCS-match sequence.

Proof.

Since W[1]<W[2]<<W[κ+1], it suffices to show that for any index k with κ+2kL(X,Y), W[k1]<W[k]. Since W[k1] is (W[κ],snk)-locally (k1)-valid, there exists a (W[κ],snk)-locally k-valid match w with W[k1]<w. It follows from jW[k]iW[k]jwiw and neither W[k]<w nor w<W[k] that iwiW[k]. This implies that iW[k1]<iW[k] because iW[k1]<iw. Symmetrically, there exists a (W[κ],snk)-locally (k1)-valid match w with w<W[k], which satisfies that jW[k1]jw<jW[k]. Thus, W[k1]<W[k].

The conversion from W to W is possible in O(|X|iW[κ]) time by determining W[k] using W[k1] and W[k] for each index k from κ+1 to L(X,Y) in this order based on the following Lemma.

Lemma 16.

For any front-leaning LCS-match sequence W and any index k with κ+1kL(X,Y), W[k] is the match (i,jW[k]) with the least index i that is greater than iW[k1].

Proof.

It follows from Lemma 15 that W[κ]=W[κ]<W[κ+1]. Due to definition of W[k], there exists no k-valid match w with W[κ]<w<(|X|+1,|Y|+1) such that jw<jW[k]. Therefore, if W[k1]<W[k], then the match (i,jW[k]) in the lemma is W[k], i.e., the first W[k1]-follower. In addition, iiW[k] because W[k] is a match. This implies from W[k]<W[k+1] due to Lemma 15 that if (i,jW[k])=W[k], then W[k]<W[k+1]. Thus, induction proves the lemma.

Table 2: Table of sequences 𝖩k maintained by genStreamk for the same X, Y, and W as Figure 2, where each underline indicates the prefix of 𝖩1 that represents 𝖩k with iW[k1]+1𝗂iW[k]1.
𝗂 𝖩5 𝖩4 𝖩3 𝖩2 𝖩1
1 1110842
2 111084 1110841
3 11108 11108 11108¯2
4 11108 11108 11108¯2
5 11107 11107 111072
6 118 1184 1184 118¯42
7 107 1074 1074 10742
8 9 9 954 954 954
9 9 9 95 95 95
10 9 9 94 94 94
11 empty empty 5 5 5

The reason for splitting the naive procedure into updateW and findBranch is that once W is available, we can efficiently determine the existing branching match of it as follows. As long as searching only for the branching match, iterative updates of (𝗂,𝗃) in execution of genStream(W[k1],snk,k) can start from (iW[k]1,jW[k]+1) rather than from (|X|,jW[k1]+1), if 𝖨 and 𝖩 are appropriately maintained. Let genStreamk denote execution of the modified genStream as above and let 𝖩k denote 𝖩 maintained in genStreamk. A key observation for improving time efficiency is that 𝖩1 can be used as 𝖩k for any index k because 𝖩k is a prefix of 𝖩1 (see Table 2). This immediately implies that adopting 𝖩1 as 𝖩 maintained in genStreamk instead of 𝖩k and doing genStreamk for each k from L(X,Y) to 1 in descending order allow genStreamk to initialize 𝖩 not from scratch but staring from the resulting 𝖩 after doing genStreamk+1, unless k=L(X,Y). Another crucial observation is that only the first element 𝖨[1] of 𝖨 is sufficient to be maintained in each genStreamk and update of 𝖨[1] can be done by scanning X[iW[k1]+1:iW[k]1] instead of X[iW[k1]+1:|X|]. Based on these observations, we design findBranch as a modification of genStream(src,snk,1).

3.2.4 Procedures updateW and findBranch

Based on Lemmas 15 and 16, updateW obtains W[κ+1:L(X,Y)] and transform it to W[κ+1:L(X,Y)] for each front-leaning LCS-match sequence W. Algorithm 3 presents a pseudo-code of updateW, in which recursive decomposition of the problem of determining W[l+1:m1] is implemented as a sequential process. The following lemma assures that this procedure works well.

Algorithm 3 updateW.
Lemma 17.

For any front-leaning LCS-match sequence W, if W[κ] is given as 𝖶[κ], then updateW determines W[κ+1:L(X,Y)] as 𝖶[κ+1:L(X,Y)] in O(|X||Y|logL(X,Y)) time and O(L(X,Y)) space.

Proof.

Lines 1 through 8 of updateW implemented as Algorithm 3 determines W[k], i.e., the (W[κ],snk)-locally k-valid match having the least diagonal coordinate, as 𝖶[k] for each index k with κ+1kL(X,Y). This can be verified because it follows from Lemma 15 that W[k] is the (W[l],W[m])-locally k-valid match with the least diagonal coordinate, implying that W[k] can be obtained as the first match output by genStream(W[l],W[m],k) due to Lemma 12. Execution of lines 1 through 8 completes in O((|X|iW[κ])(|Y|jW[κ])log(L(X,Y)κ)) time because the number of iterations of lines 2 through 8 is O(log(L(X,Y)κ)) and it follows from Lemma 12 that each iteration executes in O((|X|iW[κ])(|Y|jW[κ])) time. Once W[κ+1:L(X,Y)] is available as 𝖶[κ+1:L(X,Y)], for each index k from κ+1 to L(X,Y) in this order, lines 10 through 12 determine W[k] as 𝖶[k] from W[k1] and W[k] in O(iW[k]iW[k1]) time based on Lemma 16.

Algorithm 4 findBranch.

As mentioned in Section 3.2.3, we adopt a modification of genStream(src,snk,1) as findBranch to determine the existing branching match of W in O(|X||Y|) time. A pseudo-code of findBranch is presented as Algorithm 4. Note that incJ in findBranch (lines 12 through 15) maintains only the first element 𝖨[1] of 𝖨 maintained by genStream(W[k1],snk,k) in a straightforward way. In contrast, decI in findBranch (lines 16 through 20) is the same as decI in genStream(src,snk,1), which implies that 𝖩 maintained by genStream(W[k1],snk,k) is a prefix of 𝖩 maintained by findBranch. The following lemma claims that findBranch works successfully.

Lemma 18.

For any front-leaning LCS-match sequence W given as 𝖶, findBranch determines the branching position and match of W as κ and 𝖶[κ], respectively, if existing, or sets κ to a dummy index, otherwise, in O(|X||Y|) time and O(L(X,Y)) space.

Proof.

After initialization of variables 𝗂, 𝖨, 𝖩, and κ by line 1 of findBranch implemented as Algorithm 4, for each index k from L(X,Y) to 1 in this order, lines 3 through 11 search for the next W[k1]-follower of W[k]. In this process for k, 𝗂 is initialized to iW[k]1 by line 3 and 𝗃 is initialized to jw[k]+1 by line 4. Since 𝖨[1] and 𝖩 are also updated appropriately by lines 3 and 4, respectively, lines 5 through 11 simulate execution of genStream(W[k1],snk,k) until the (W[k1],snk)-locally k-valid match (𝗂,𝗃) with 𝗂<iW[k] and 𝗃>jW[k] having the least diagonal coordinate is found. If found, then line 7 terminates the execution of findBranch after setting κ to k as the branching position of W and setting 𝖶[k] to (𝖨[1],𝗃), instead of (𝗂,𝗃), as the branching match of W due to an argument similar to the proof of Lemma 16.

For any index k with 1kL(X,Y), the process for k implemented by lines 2 though 11 executes incJ O(|Y|jW[k]) times each in O(iW[k]iW[k1]) time. On the other hand, for each index 𝗂 with 0𝗂|X|, decI is executed at most once in O(|Y|) time throughout the execution of findBranch. Thus, findBranch runs in O(|X||Y|) time and O(L(X,Y)) space.

It immediately follows from Lemmas 17 and 18 that the following theorem holds.

Theorem 19.

Algorithm enumLCS with procedures updateW and findBranch works as a linear-space LCS enumeration algorithm that outputs all distinct front-leaning LCS-match sequences of X and Y each in O(|X||Y|logL(X,Y)) time.

4 Concluding remarks

This article proposed an algorithm that takes strings X and Y as input and uses O(L(X,Y)) space, excluding the space for storing X and Y, to enumerate all distinct LCSs of X and Y each in O(|X||Y|logL(X,Y)) time. The all-LCS graph, of size O(|X||Y|), introduced in Section 3.1 allows us to determine all distinct LCSs each in O(L(X,Y)) time. Given this, whether we can remove the logarithmic factor of L(X,Y) from the delay time achieved by the proposed algorithm is a natural question arising from a space-delay tradeoff perspective. The author recently claimed in [12] that adopting a variant of the linear-space LCS-finding algorithm of Hirschberg [7] as an alternative of our O(|X||Y|logL(X,Y))-time procedure updateW can resolve this question in the affirmative.

References

  • [1] Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for lcs and other sequence similarity measures. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 59–78. IEEE, 2015. doi:10.1109/FOCS.2015.14.
  • [2] Alberto Apostolico. Improving the worst-case performance of the hunt-szymanski strategy for the longest common subsequence of two strings. Information Processing Letters, 23(2):63–69, 1986. doi:10.1016/0020-0190(86)90044-X.
  • [3] Alberto Apostolico and Concettina Guerra. The longest common subsequence problem revisited. Algorithmica, 2:315–336, 1987. doi:10.1007/BF01840365.
  • [4] Karl Bringmann and Marvin Künnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 79–97. IEEE, 2015. doi:10.1109/FOCS.2015.15.
  • [5] Yo-lun Chin, Chung-kwong Poon, et al. A fast algorithm for computing longest common subsequences of small alphabet size. University of Hong Kong, Department of Computer Science, 1990.
  • [6] Jun-Yi Guo and Frank K Hwang. An almost-linear time and linear space algorithm for the longest common subsequence problem. Information processing letters, 94(3):131–135, 2005. doi:10.1016/j.ipl.2005.01.002.
  • [7] Daniel S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Communications of the ACM, 18(6):341–343, 1975. doi:10.1145/360825.360861.
  • [8] James W Hunt and Thomas G Szymanski. A fast algorithm for computing longest common subsequences. Communications of the ACM, 20(5):350–353, 1977. doi:10.1145/359581.359603.
  • [9] Costas S Iliopoulos and M Sohel Rahman. A new efficient algorithm for computing the longest common subsequence. Theory of Computing Systems, 45(2):355–371, 2009. doi:10.1007/s00224-008-9101-6.
  • [10] William J Masek and Michael S Paterson. A faster algorithm computing string edit distances. Journal of Computer and System sciences, 20(1):18–31, 1980. doi:10.1016/0022-0000(80)90002-1.
  • [11] Narao Nakatsu, Yahiko Kambayashi, and Shuzo Yajima. A longest common subsequence algorithm suitable for similar text strings. Acta Informatica, 18:171–179, 1982. doi:10.1007/BF00264437.
  • [12] Yoshifumi Sakai. Linear-space LCS enumeration with quadratic-time delay for two strings. arXiv preprint arXiv:2504.05742, 2025. doi:10.48550/arXiv.2504.05742.
  • [13] Robert A Wagner and Michael J Fischer. The string-to-string correction problem. Journal of the ACM (JACM), 21(1):168–173, 1974. doi:10.1145/321796.321811.