String Inference from Longest-Common-Prefix Array

The suffix array, perhaps the most important data structure in modern string processing, is often augmented with the longest common prefix (LCP) array, which stores the lengths of the LCPs for lexicographically adjacent suffixes of a string. Together the two arrays are roughly equivalent to the suffix tree, with the LCP array representing the tree shape. In order to better understand the combinatorics of LCP arrays, we consider the problem of inferring a string from an LCP array, i.e., determining whether a given array of integers is a valid LCP array, and if it is, reconstructing some string or all strings with that LCP array. There are recent studies of inferring a string from a suffix tree shape, but using significantly more information (in the form of suffix links) than is available in the LCP array. We provide two main results. (1) We describe two algorithms for inferring strings from an LCP array when we allow a generalized form of LCP array defined for a multiset of cyclic strings: a linear time algorithm for a binary alphabet and a general algorithm with polynomial time complexity for a constant alphabet size. (2) We prove that determining whether a given integer array is a valid LCP array is NP-complete when we require more restricted forms of LCP array defined for a single cyclic or non-cyclic string or a multiset of non-cyclic strings. The result holds whether or not the alphabet is restricted to be binary. In combination, the two results show that the generalized form of LCP array for a multiset of cyclic strings is fundamentally different from the other, more restricted forms.


Introduction
For a string X of n symbols, the suffix array (SA) [22] contains pointers to the suffixes of X, sorted in lexicographical order. The suffix array is often augmented with a second array, the longest common prefix (LCP) array, storing the length of the longest common prefix between lexicographically adjacent suffixes; i.e., LCP[i] is the length of the LCP of suffixes X[SA[i]..n) and X[SA[i−1]..n). The two arrays are closely connected to the suffix tree [31], the compacted trie of all the string's suffixes: the entries of SA correspond to the leaves of the suffix tree, and the LCP array entries tell the string depths of the lowest common ancestors of adjacent leaves, defining the shape of the tree (see Fig. 2 in the appendix). For decades these data structures have been central to string processing; see [4] for a history and an overview, and [1,3,15,29,25] for further details on myriad applications.
Given both the suffix and the LCP array, the corresponding string is unique up to renaming of the characters and is easy to reconstruct: zeros in the LCP array tell where the first character changes in the lexicographical list of the suffixes, and the suffix array tells how to permute those first characters to obtain the string. Given just the suffix array, we can easily reconstruct a corresponding string where all characters are different, and it is not difficult to characterize strings with a given suffix array [5,27,21]. In essence, the suffix array determines a set of positions in the LCP array that must be zero. Specifically, for any i let j and k be integers such that SA[j] = SA[i − 1] + 1 and SA[k] = SA[i] + 1. Then, if k < j, we must have LCP[i] = 0. For any other position, we can freely and independently decide whether the value is zero or not, and as described above, the zero positions together with the suffix array determine the string.
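The reconstruction from both arrays can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `string_from_sa_and_lcp` is hypothetical, the arrays are 0-indexed Python lists for an open-ended string, `lcp[0]` is a placeholder that is ignored, and characters are named a, b, c, ... in sorted order (so the alphabet is assumed to have at most 26 symbols).

```python
def string_from_sa_and_lcp(sa, lcp):
    """Rebuild a string, up to renaming of characters, from SA and LCP.

    sa[i] is the start of the i-th smallest suffix; lcp[i] (for i >= 1) is
    the LCP of the suffixes starting at sa[i-1] and sa[i]; lcp[0] is ignored.
    """
    n = len(sa)
    text = [None] * n
    rank = 0  # rank of the current first character in sorted order
    for i in range(n):
        if i > 0 and lcp[i] == 0:
            rank += 1  # a zero marks a change of the first character
        text[sa[i]] = chr(ord("a") + rank)
    return "".join(text)


# Suffix and LCP arrays of "banana" (suffixes a, ana, anana, banana, na, nana).
sa = [5, 3, 1, 0, 4, 2]
lcp = [0, 1, 3, 0, 0, 2]
print(string_from_sa_and_lcp(sa, lcp))  # bacaca, i.e. banana with n renamed to c
```

The output differs from the original string only in the names of the characters, which matches the uniqueness statement above.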
In this paper, we consider the problem of similarly reconstructing strings from an LCP array without the suffix array. As mentioned above, the LCP array determines the shape of the suffix tree, i.e., the suffix tree without edge or leaf labels. Notice that the LCP array specifies the label lengths for internal edges but not for leaf edges; knowing the latter would allow trivial inference of the suffix array. String inference from the suffix tree shape has recently been considered by three different sets of authors [19,6,30]. However, all of them assume that the suffix tree is augmented with significant additional information, namely suffix links (see Fig. 2), which makes the task much easier. Indeed, our new algorithms essentially reconstruct suffix links from the LCP array. According to Cazaux and Rivals [6], the case without suffix links was considered but not solved in [26]. We are also aware that others have considered it but without success [2].
To fully define the problem, we have to specify what kind of strings we are trying to infer. Often suffix trees and suffix arrays are defined for terminated strings that are assumed to end with a special symbol $ that is different from and lexicographically smaller than any other symbol. The alternative is an open-ended string where no assumption is made on the last symbol. For suffix and LCP arrays the only change from omitting the terminator symbol is dropping the first element (which is always zero in the LCP array), but the suffix tree can change considerably because some suffixes can be prefixes of other suffixes and thus are not represented by a leaf (see Fig. 3). Inferring open-ended strings from a suffix tree (with suffix links) is studied by Starikovskaya and Vildhøj [30], who show that any string can be extended with additional characters without changing the suffix tree shape (thus the term open-ended). However, such an extension can change the suffix and LCP arrays a great deal (see Fig. 4), i.e., with the arrays a string is never truly open-ended but has at least an implicit terminator.
To get rid of even an implicit terminator, we consider a third type of strings, cyclic strings, where we use rotations in place of suffixes (see Figs. 5-7). For a terminated string, replacing suffixes with rotations causes no changes to the suffix/rotation array or the LCP array. Thus any integer array that is a valid LCP array for a terminated string is always a valid LCP array for a cyclic string too, but the opposite is not true. For example, the LCP array for the cyclic string aababa is (2, 1, 3, 0, 2), which is not a valid LCP array for any non-cyclic string. In this sense, the cyclic string case is strictly more general. An even more striking example is a non-primitive string, such as abab, that has two or more identical rotations. For reasons explained below, instead of rotations we use cyclic suffixes, which are infinite repetitions of rotations. Thus the LCP array for the cyclic string abab is (ω, 0, ω), where ω denotes the positions of two adjacent identical cyclic suffixes.
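The two examples can be checked with a brute-force sketch. The function name `cyclic_lcp_array` is hypothetical, the input is a list of cyclic strings (a single string is a one-element list), and ω is represented by `math.inf`. The sketch relies on the Fine and Wilf periodicity lemma: two cyclic suffixes with periods p and q that agree on their first p + q characters are identical, so comparing prefixes of twice the maximum string length suffices. It takes quadratic time and is meant only for small examples.

```python
from math import inf


def cyclic_lcp_array(words):
    """LCP array of the sorted cyclic suffixes of a list of cyclic strings."""
    bound = 2 * max(len(w) for w in words)

    def prefix(w, p):
        # First `bound` characters of the cyclic suffix of w starting at p.
        return "".join(w[(p + t) % len(w)] for t in range(bound))

    suffixes = sorted(prefix(w, p) for w in words for p in range(len(w)))
    lcp = []
    for x, y in zip(suffixes, suffixes[1:]):
        k = 0
        while k < bound and x[k] == y[k]:
            k += 1
        # Agreement on the whole prefix means the infinite suffixes are equal.
        lcp.append(inf if k == bound else k)
    return lcp


print(cyclic_lcp_array(["aababa"]))  # [2, 1, 3, 0, 2]
print(cyclic_lcp_array(["abab"]))    # [inf, 0, inf]
```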
Finally, we may have a joint suffix array for a collection of strings, where we have all suffixes of all strings in lexicographical order, and the corresponding LCP array. In the terminated version, each string is terminated with a distinct terminator symbol. If we have an LCP array for a collection of open-ended strings, adding the terminator symbols simply prepends one zero for each terminator. The LCP array for a collection of terminated strings is identical to the LCP array of the concatenation of the strings. Thus the generalization from single strings to string sets does not add to the set of valid LCP arrays for terminated strings, but it does for cyclic strings. For example, the LCP array for the string set {aa, b} is (ω, 0), which is not a valid LCP array for any single string. For multiple cyclic strings, it is important to use cyclic suffixes instead of rotations because the result can be different (e.g., the set {ab, aba}).
Now we are ready to formally define the problem of String Inference from LCP Array (SILA). In the decision version, we are given an array of integers (and possibly ω's) and asked if the array is a valid LCP array of some string. If the answer is yes, the reporting version may also output some such string, and possibly a characterization of all such strings. Different variants are identified by a prefix: S for a string set; T, O, or C for terminated, open-ended or cyclic; and B for a binary alphabet (where terminators are not counted). For example, BCSSILA stands for Binary Cyclic String Set Inference from LCP Array. As discussed above, and summarized in the following result (with a proof in the appendix), the non-cyclic variants are essentially equivalent, but the cyclic variants are more general.
Our Contribution. Our first result is a linear time algorithm for BCSSILA. For a valid LCP array the algorithm outputs a string, which is the Burrows-Wheeler transform (BWT) of the solution string set. This relies on a generalization of the BWT for multisets of cyclic strings developed in [23,20]. There can be more than one multiset of strings with the same BWT, but the class of such string collections is simple and well characterized in [20]. The algorithm also outputs a set of substring swaps such that applying any combination of the swaps on the BWT produces another BWT of a solution, and any BWT of a solution can be produced by such a combination of swaps. Thus we have a complete characterization of all solutions. The number of swaps can be linear and thus the number of distinct solutions can be exponential. We also present an algorithm for CSSILA, i.e., without a restriction on the alphabet size, that has a polynomial time complexity for any constant alphabet size.
Our second result is a proof, by a reduction from 3-SAT, that (the decision version of) BCSILA, and thus CSILA, is NP-complete. Therefore, even though the BCSSILA algorithm produces a characterization of all solutions, it is NP-hard to determine whether one of the solutions is a single string. Furthermore, we modify the reduction to prove that BTSILA is NP-complete too. By Proposition 1, this shows that all variants of SILA mentioned above except (B)CSSILA are NP-complete. Since CSSILA is in P for constant alphabet sizes, this leaves the complexity of CSSILA for larger alphabets as an open problem.
Related Work. String inference from partial information is a classic problem in string processing, dating back some 40 years to the work of Simon [28], where reconstructing a string from a set of its subsequences is considered. Since then, string inference from a variety of data structures has received a considerable amount of attention, with authors considering border arrays [12,11,10], parameterized border arrays [18], the Lyndon factorization [24], suffix arrays [5,21], KMP failure tables [11,13], prefix tables [7], cover arrays [9], and directed acyclic word graphs [5]. The motivation for studying most string inference problems is to gain a deeper understanding of the combinatorics of the data structures involved, in order to design more efficient algorithms for their construction and use.
A (somewhat tangentially) related result to ours is due to He et al. [16], who prove that it is NP-hard to infer a string from the longest-previous-factor (LPF) array. It is well known that LPF is a permutation of LCP [8], but otherwise it is a quite different data structure. For example, it is in no way concerned with lexicographical ordering. Like our NP-hardness proof, He et al.'s reduction is from 3-SAT, but the details of each reduction appear to be very different. Moreover, their construction requires an unbounded alphabet while our construction works for a binary alphabet and thus for any alphabet.
To the best of our knowledge, all of the previous string inference problems aim at obtaining a single non-cyclic string from some data structure, and we are the first to consider the generalizations to cyclic strings and to string sets. As our results show, this makes a crucial difference. As explained in the next section, the generalizations arise naturally from the generalized BWT introduced in [23], which also played a central role in another recent result on the combinatorics of LCP arrays [20].

Basic notions
Let v be a string of length n and let v̄ be obtained from v by sorting its characters. The standard permutation [14,17] of v is the mapping Ψ_v that corresponds to the stable sorting of the characters. Let C = {c_i}_{i=1}^s be the disjoint cycle decomposition of Ψ_v. We define the inverse Burrows-Wheeler transform IBWT as the mapping from v into a multiset of cyclic strings W = {{w_i}}_{i=1}^s such that for any i ∈ [1.
Example 1. For v = bbaabaaa, we have IBWT(v) = {{aab, aab, ab}} as illustrated in the following table (showing v and Ψ_v) and figure (showing the cycles of Ψ_v as a graph). The character subscripts are provided to make it easier to ensure stability.
The elements of W are primitive cyclic strings. Cyclic means that all rotations of a string are considered equal. For example, aab, aba and baa are all equal. A string is primitive if it is not a concatenation of multiple copies of the same string. For example, aab is primitive but aabaab is not. For any alphabet Σ, the mapping IBWT is a bijection between the set Σ* of all (non-cyclic) strings and the multisets of primitive cyclic strings over Σ [23].
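The cycle structure behind IBWT can be illustrated with a short sketch that reproduces Example 1. The function name `ibwt` is hypothetical, and the orientation of the permutation is an assumption: here `psi[i]` is the position in v of the i-th character of the sorted string, and each cyclic string is spelled by reading the sorted string along a cycle. The inverse orientation has the same cycles, so the number and lengths of the strings do not depend on this choice.

```python
def ibwt(v):
    """Multiset of cyclic strings (as a list) obtained from the cycles of the
    stable-sort permutation of v."""
    n = len(v)
    # psi[i] = position in v of the i-th character of sorted(v);
    # Python's sorted() is stable, so equal characters keep their order.
    psi = sorted(range(n), key=lambda j: v[j])
    sorted_v = sorted(v)
    seen = [False] * n
    words = []
    for start in range(n):
        if seen[start]:
            continue
        word = []
        i = start
        while not seen[i]:  # follow one cycle of the permutation
            seen[i] = True
            word.append(sorted_v[i])
            i = psi[i]
        words.append("".join(word))
    return words


print(ibwt("bbaabaaa"))  # ['aab', 'aab', 'ab']
```

Because every cycle is started at its smallest index, each cyclic string is reported as its lexicographically smallest rotation.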
The set of positions of W is defined as the set of integer pairs pos(W) := {⟨i, p⟩ : i ∈ [1..s], p ∈ [0..|w_i|)}. For a position ⟨i, p⟩ ∈ pos(W) we define a cyclic suffix W_{⟨i,p⟩} as the infinite string that starts at ⟨i, p⟩, i.e., W_{⟨i,p⟩} = w_i[p] w_i[(p + 1) mod |w_i|] w_i[(p + 2) mod |w_i|] .... The multiset of all cyclic suffixes of W is defined as suf(W) := {{W_{⟨i,p⟩} : ⟨i, p⟩ ∈ pos(W)}}. We say that a string x occurs at position ⟨i, p⟩ in W if x is a prefix of the suffix W_{⟨i,p⟩}.
The (cyclic) suffix array of a multiset of strings W is defined as an array SA_W[j] = ⟨i_j, p_j⟩, where ⟨i_j, p_j⟩ ∈ pos(W) for all j ∈ [0..n) and W_{⟨i_{j−1},p_{j−1}⟩} ≤ W_{⟨i_j,p_j⟩} for all j ∈ [1..n). The Burrows-Wheeler transform (BWT) is a mapping from W into the string v defined as v[j] = w_i[(p − 1) mod |w_i|], where ⟨i, p⟩ = SA_W[j], i.e., v[j] is the character preceding the beginning of the suffix W_{SA_W[j]}. The BWT is the inverse of IBWT [23,20].
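As a minimal sketch of these two definitions, the following brute-force code sorts the cyclic suffixes of a small multiset and reads off the preceding characters. The function name `cyclic_sa_and_bwt` is hypothetical, strings are indexed from 0 rather than 1, and cyclic suffixes are compared through prefixes of twice the maximum string length, which is enough to order any two distinct cyclic suffixes (by the Fine and Wilf periodicity lemma).

```python
def cyclic_sa_and_bwt(words):
    """Cyclic suffix array (list of (i, p) pairs) and BWT of a list of
    primitive cyclic strings."""
    bound = 2 * max(len(w) for w in words)

    def key(position):
        i, p = position
        w = words[i]
        return "".join(w[(p + t) % len(w)] for t in range(bound))

    positions = [(i, p) for i, w in enumerate(words) for p in range(len(w))]
    sa = sorted(positions, key=key)
    # v[j] is the character cyclically preceding the j-th smallest suffix.
    v = "".join(words[i][(p - 1) % len(words[i])] for i, p in sa)
    return sa, v


sa, v = cyclic_sa_and_bwt(["aab", "aab", "ab"])
print(v)  # bbaabaaa, the string of Example 1
```

Identical cyclic suffixes (here coming from the two copies of aab) may appear in either order in the suffix array, but they have the same preceding character, so the BWT is unaffected.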
The longest-common-prefix array LCP_W[1..n) is defined as LCP_W[j] = lcp(W_{SA_W[j−1]}, W_{SA_W[j]}) for 0 < j < n, where lcp(x, y) is the length of the longest common prefix between the strings x and y.

Intervals.
Many algorithms on suffix arrays and LCP arrays are based on iterating over specific types of array intervals. Next, we define these intervals and establish their key properties. For proofs and further details, we refer to [1,25]. Let v ∈ {a, b}^n and W = IBWT(v). Let SA = SA_W be the suffix array and LCP = LCP_W the LCP array of W. Note that from now on, we will assume a binary alphabet.
In other words, in the suffix array the x-interval SA[i..j) consists of all suffixes of W with x as a prefix. Thus the size j − i of the interval is the number of occurrences of x in W, which we will denote by n_x.
Lemma 2. Every nonempty x-interval is an ℓ-interval for some (unique) ℓ ≥ |x|. Every ℓ-interval is an x-interval for some string x of length ℓ.
Thus the ℓ-intervals represent the set of all distinct x-intervals.This and the fact that the total number of ℓ-intervals is O(n) are the basis of many efficient algorithms for suffix arrays, see e.g., [1,25].
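To make the notion of an x-interval concrete, here is a brute-force sketch that finds the interval for a given string x in the sorted list of cyclic suffixes. The function name `x_interval` is hypothetical, intervals are half-open and 0-indexed, and `None` is returned for an empty interval; none of these conventions are taken from the text.

```python
def x_interval(words, x):
    """Half-open interval [i, j) of sorted cyclic suffixes having x as a
    prefix, or None if x does not occur in the list of cyclic strings."""
    bound = max(2 * max(len(w) for w in words), len(x))
    suffixes = sorted(
        "".join(w[(p + t) % len(w)] for t in range(bound))
        for w in words
        for p in range(len(w))
    )
    # The suffixes are sorted, so the matching ranks are contiguous.
    hits = [r for r, s in enumerate(suffixes) if s.startswith(x)]
    if not hits:
        return None
    return hits[0], hits[-1] + 1


W = ["aab", "aab", "ab"]
i, j = x_interval(W, "a")
print((i, j), j - i)  # (0, 5) 5, so n_a = 5
```

For this W, the interval sizes give n_aa = 2 and n_ba = 3, which are the numbers of a's and b's in the first five characters of BWT(W) = bbaabaaa.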

Algorithm for BCSSILA
We are now ready to describe the algorithm for string inference from an LCP array. Given an LCP array LCP[1..n), our goal is to construct a string v ∈ {a, b}^n such that LCP = LCP_{IBWT(v)}. At first, we assume that such a string v exists, and consider later what happens if the input is not a valid LCP array.
Let RMQ_LCP[i..j) denote the range minimum query over the LCP array that returns the position of the minimum element in LCP[i..j), i.e., RMQ_LCP[i..j) = arg min_{k∈[i..j)} LCP[k]. The LCP array is preprocessed in linear time so that any RMQ can be answered in constant time (see for instance [25]). Then any x-interval can be split into two subintervals as shown in the following result. This approach makes it easy to recursively enumerate all ℓ-intervals. We will also keep track of ax- and bx-intervals together with any x-interval, even if we do not know x precisely. From the intervals we can determine the numbers of occurrences, n_{ax} and n_{bx}, which are useful in the inference of v:
Lemma 4. Let [i..j) be the x-interval. Then v[i..j) contains exactly n_{ax} a's and n_{bx} b's.
In particular, when either n_{ax} or n_{bx} drops to zero, we have fully determined v[i..j) for the x-interval [i..j). In such a case, the LCP array intervals have to satisfy the following property.
Lemma 5. Let [i_y..j_y) be the y-interval for y ∈ {x, ax, bx}. If n_{ax} = j_{ax} − i_{ax} = 0, then LCP[i_{bx} + 1..j_{bx}) = 1 + LCP[i_x + 1..j_x), where 1 + A, for an array A, denotes adding one to all elements of A. Symmetrically, if n_{bx} = 0, then LCP[i_{ax} + 1..j_{ax}) = 1 + LCP[i_x + 1..j_x).
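The range minimum queries used by the algorithm can be prototyped with a sparse table. Note that this is a substitute technique, not the structure assumed in the text: preprocessing takes O(n log n) time rather than linear time, although queries are still answered in constant time. The class name `SparseTableRMQ` is hypothetical, indices are 0-based, and returning the leftmost minimum on ties is a choice made here.

```python
class SparseTableRMQ:
    """Sparse table answering arg-min queries over a fixed list of values."""

    def __init__(self, values):
        self.values = values
        n = len(values)
        # table[k][i] = position of the leftmost minimum in values[i..i+2^k)
        self.table = [list(range(n))]
        span = 1
        while 2 * span <= n:
            prev = self.table[-1]
            row = []
            for i in range(n - 2 * span + 1):
                a, b = prev[i], prev[i + span]
                row.append(a if values[a] <= values[b] else b)
            self.table.append(row)
            span *= 2

    def argmin(self, i, j):
        """Position of the leftmost minimum in values[i..j); requires i < j."""
        level = (j - i).bit_length() - 1
        a = self.table[level][i]
        b = self.table[level][j - (1 << level)]
        return a if self.values[a] <= self.values[b] else b


rmq = SparseTableRMQ([1, 2, 0, 2, 1])
print(rmq.argmin(0, 5))  # 2, the position of the zero
```

Entries equal to ω can be stored as `math.inf`, which compares correctly with integers.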
Algorithm 1: Infer BWT from an LCP array
Input: an array LCP[1..n) of integers and ω's
Output: a string v ∈ {a, b}^n such that LCP_{IBWT(v)} = LCP together with a set S of swap intervals, or false if there is no such string v
1 S := ∅;
2 preprocess LCP for RMQs;
The main procedure is given in Algorithm 1. The main work is done in the recursive procedure InferInterval given in Algorithm 2. The procedure gets as input the x-, ax- and bx-intervals for some (unknown) string x, splits the x-interval into xya- and xyb-subintervals based on Lemma 3, and tries to split the ax- and bx-intervals similarly. If all subintervals are nonempty, the algorithm processes the two subinterval triples recursively (lines 28 and 29).
When trying to split the ax-interval, the result may be, for example, that the axya-interval is empty. In this case, we do not need to recurse on the xya-interval since the corresponding part of v must be all b's. The algorithm recognizes the emptiness of the axya- or axyb-interval by the fact that m_{ax} > m_x + 1, but the problem is to decide which is the empty one. In most cases, this can be determined by comparing the sizes of the different subintervals or even the actual LCP-intervals (see Lemma 5).
There is one case where the algorithm is unable to determine the empty subintervals. In that case, either the axya- and bxyb-intervals are empty or the axyb- and bxya-intervals are empty, but there is no way of deciding between the two cases. It turns out that both are valid choices. The algorithm sets v according to one choice (line 8) but records the alternative choice by adding the interval to the set S. In such a case, the string xy is called a swap core and the xy-interval (equal to the x-interval) is called a swap interval.
For each swap interval [i..j), the algorithm sets v[i..k) = aa...a and v[k..j) = bb...b, where k = (i + j)/2, but swapping the two halves would be an equally good choice. Therefore, if the output of the algorithm contains s swap intervals, it represents a set of 2^s distinct strings. The following lemma shows that the swaps indeed do not affect the LCP array (with the proof in the appendix).
Lemma 6. Let v ∈ {a, b}^n, W = IBWT(v), SA = SA_W and LCP = LCP_W. Let x be a string that occurs in W and satisfies: (1)
Algorithm 2: InferInterval
Input: (nonempty) x-, ax- and bx-intervals
Output: Set v[i_x..j_x) and add the swap intervals within
Theorem 1. Algorithm 1 computes in linear time a representation of the set of all strings v ∈ {a, b}* such that LCP_{IBWT(v)} is the input array, or returns false if no such string exists.
Proof. Since the algorithm verifies its result (lines 9 and 10), it will return false if the input is not a valid LCP array. Given a valid LCP array, Algorithm 2 sets all elements of v since it recurses on any subinterval that it does not set. All the choices made by the algorithm are forced by the lemmas in this and the previous section. The swap intervals record all alternatives in the cases where the content of v could not be fully determined, and all of those alternatives have the same LCP array by Lemma 6. It is also easy to see that the algorithm runs in linear time. □
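The effect of swaps can be checked by brute force on a small hand-picked instance. In the sketch below, the string bababa with swap intervals [1..3) and [3..5) is an illustrative input chosen for this example, not the output of Algorithm 1, and all function names are hypothetical. The code expands the 2^s variants, inverts each one through the cycles of the stable-sort permutation, and compares the resulting LCP arrays (with ω represented by `math.inf`).

```python
from itertools import product
from math import inf


def ibwt(v):
    """Cyclic strings spelled by the cycles of the stable sort of v."""
    psi = sorted(range(len(v)), key=lambda j: v[j])  # stable
    sorted_v, seen, words = sorted(v), [False] * len(v), []
    for start in range(len(v)):
        word, i = [], start
        while not seen[i]:
            seen[i] = True
            word.append(sorted_v[i])
            i = psi[i]
        if word:
            words.append("".join(word))
    return words


def cyclic_lcp_array(words):
    """LCP array of the sorted cyclic suffixes (brute force)."""
    bound = 2 * max(len(w) for w in words)
    sufs = sorted("".join(w[(p + t) % len(w)] for t in range(bound))
                  for w in words for p in range(len(w)))
    lcp = []
    for x, y in zip(sufs, sufs[1:]):
        k = 0
        while k < bound and x[k] == y[k]:
            k += 1
        lcp.append(inf if k == bound else k)
    return lcp


def swap_variants(v, swaps):
    """All strings obtained by swapping the halves of any subset of swaps."""
    variants = []
    for flips in product([False, True], repeat=len(swaps)):
        chars = list(v)
        for (i, j), flip in zip(swaps, flips):
            if flip:
                k = (i + j) // 2
                chars[i:j] = chars[k:j] + chars[i:k]
        variants.append("".join(chars))
    return variants


variants = swap_variants("bababa", [(1, 3), (3, 5)])
print(variants)  # ['bababa', 'babbaa', 'bbaaba', 'bbabaa']
print({tuple(cyclic_lcp_array(ibwt(v))) for v in variants})  # {(1, 2, 0, 2, 1)}
```

All four variants yield the same LCP array, although their inverse transforms differ: two of them give a single cyclic string and two give a pair of strings.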

Coupling Constrained Eulerian Cycle
We will now set out to prove the NP-completeness of the single string inference problems BCSILA and BTSILA. The proofs are done by a reduction from 3-SAT via an intermediate problem called Coupling Constrained Eulerian Cycle (CCEC) described in this section. Consider a directed graph G of degree two, i.e., every vertex in G has exactly two incoming and two outgoing edges. If G is connected, it is Eulerian. An Eulerian cycle can pass through each vertex in two possible ways, which we call the straight state and the crossing state of the vertex. We consider each vertex to be a switch that can be flipped between these two states. The combination of vertex states is called the graph state. For a given graph state, the paths in the graph form, in general, a collection of cycles. The Eulerian cycle problem can then be stated as finding a graph state such that there is only a single cycle; we call such a graph state Eulerian.
In the Coupling Constrained Eulerian Cycle (CCEC) problem, we are given a graph as described above, an initial graph state, and a partitioning of the set of vertices. If we flip a vertex state, we must simultaneously flip the states of all the vertices in the same partition, i.e., the vertices in a partition are coupled. A graph state that is achievable from the initial state by a set of such partition flips is called a feasible state. The CCEC problem is to determine if there exists a feasible graph state that is Eulerian.
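A tiny exhaustive solver illustrates the problem definition. Everything about the encoding is an assumption made for this sketch: each vertex stores a pair of incoming and a pair of outgoing edge ids, the straight state (False) connects incoming slot s to outgoing slot s, and the crossing state (True) connects it to the other slot. The function names are hypothetical, and the search over all subsets of partition flips is exponential, so this is a specification aid rather than an algorithm.

```python
from itertools import product


def count_cycles(vertices, state):
    """Number of cycles formed by the edges under the given graph state.

    vertices[u] = ((in0, in1), (out0, out1)); state[u] is False for the
    straight state and True for the crossing state.
    """
    successor = {}
    for u, (ins, outs) in vertices.items():
        for slot in (0, 1):
            successor[ins[slot]] = outs[slot ^ state[u]]
    seen, cycles = set(), 0
    for edge in successor:
        if edge in seen:
            continue
        cycles += 1
        while edge not in seen:
            seen.add(edge)
            edge = successor[edge]
    return cycles


def has_feasible_eulerian_state(vertices, initial, partitions):
    """Try every combination of partition flips (exponential search)."""
    for flips in product([False, True], repeat=len(partitions)):
        state = dict(initial)
        for part, flip in zip(partitions, flips):
            if flip:
                for u in part:
                    state[u] = not state[u]
        if count_cycles(vertices, state) == 1:
            return True
    return False


# Two vertices joined by edges 0, 1 (u to w) and 2, 3 (w to u).
graph = {"u": ((2, 3), (0, 1)), "w": ((0, 1), (2, 3))}
start = {"u": False, "w": False}
print(has_feasible_eulerian_state(graph, start, [["u"], ["w"]]))  # True
print(has_feasible_eulerian_state(graph, start, [["u", "w"]]))    # False
```

With independent vertices, flipping only one of them merges the two cycles into one. When the two vertices are coupled, they can only be flipped together, and both reachable states have two cycles, so the coupling constraint alone makes the instance infeasible.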
Proof. The proof is by reduction from 3-SAT. To obtain a CCEC graph from a 3-CNF formula, a gadget of five vertices is constructed from each clause and these gadgets are connected by a cycle. In each gadget, three of the vertices are labeled by the literals of the corresponding clause; the other two are called free vertices. See Fig. 1 for an illustration.
Each labeled vertex is in a straight state if the labeling literal is false and in a crossing state if the literal is true; their initial state corresponds to some arbitrary truth assignment to the variables. For each variable x_i, there is a vertex partition consisting of all vertices labeled by x_i or ¬x_i, so that flipping this partition corresponds to changing the truth value of x_i. Each free vertex forms a singleton partition and has an arbitrary initial state. Thus a graph state is feasible iff the labeled vertex states correspond to some truth assignment.
If a clause is false for a given truth assignment, the labeled vertices in the corresponding gadget are all in a straight state. This separates a part of the gadget from the main cycle and thus the graph state is not Eulerian. If a clause is true, at least one of the labeled vertices in the gadget is in a crossing state. Then we can always choose the state of the free vertices so that the full gadget is connected to the main cycle. Thus there exists a feasible Eulerian graph state iff there exists a truth assignment to the variables that satisfies all clauses.
□
For purposes that will become clear later, we modify the above construction by adding some extra components to the graph without changing the validity of the reduction. Specifically, for each variable x_i in the 3-CNF formula we add an extra gadget to the main cycle. The vertices in the gadget are treated similarly to the other vertices in the graph: they belong to the partition with the other vertices labeled by x_i or ¬x_i, and the initial state is determined by the truth value of the labeling literal. It is easy to see that the gadget will be fully connected to the main cycle whether x_i is true or false. Thus the extra gadgets have no effect on the existence of an Eulerian cycle. Finally, we insert into the main cycle a single vertex labeled y with a self-loop and forming a singleton partition.

BCSILA to CCEC
The next step is to establish a connection between the BCSILA and CCEC problems by showing a reduction from BCSILA to CCEC. Although the direction of the reduction is opposite to what we want, this construction plays a key role in the analysis of the main construction described in the next section. Given a BCSILA instance (an integer array), we use Algorithm 1 to produce a representation of a set V of strings. The problem is then to decide if there exists v ∈ V such that IBWT(v) is a single (cyclic) string. We will write V as a string with brackets marking the swaps. For example, V = b[ab][ab]a = {bababa, babbaa, bbaaba, bbabaa}. In Example 1, we saw that the inverse BWT of a string v ∈ V can be represented as a graph G_v where the vertices are labeled by positions in v and there is an edge between vertices i and j if, for some character c ∈ {a, b} and some integer k, v̄[i] = c is the kth occurrence of c in v̄ and v[j] = c is the kth occurrence of c in v. Such an edge (i, j) is labeled by c_k. Note that v̄ is the same for all v ∈ V; we will denote it by V̄. We form a generalized graph G_V as the union of the graphs G_v, v ∈ V (see Fig. 11 for an example).
Consider a_k (the kth a) in V̄, say at position i. If a_k is outside any swap region in V, say at position j, there is a single edge (i, j) in G_V labeled by a_k. If a_k is within a swap region in V, it has two possible positions in the strings v ∈ V, say j and j′. That same pair of positions are also the possible positions of some b, say b_{k′} at position i′ in V̄. Then G_V has two edges, (i, j) and (i, j′), labeled with a_k and two edges, (i′, j) and (i′, j′), labeled with b_{k′}. The positions/vertices j and j′ are called a swap pair.
To obtain a CCEC graph from G_V, we make two modifications to G_V. First, we merge each swap pair into a single vertex. Each merged vertex now has two incoming and two outgoing edges and all other vertices have one incoming and one outgoing edge. Second, we remove all vertices with one incoming and one outgoing edge by concatenating their incoming and outgoing edges (see Fig. 11).
The initial state of the vertices in the CCEC graph is set so that its cycles correspond to the cycles in G_v for some v ∈ V. Two vertices in the CCEC graph belong to the same partition if their labels belong to the same swap interval in V. Then we have a one-to-one correspondence between swaps in V and partition flips in the CCEC graph. If this CCEC instance has a solution, the Eulerian cycle spells a single string realizing the input LCP array. If the CCEC instance has no solution, the original BCSILA problem has no solution either.
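The decision problem behind this reduction can be stated very compactly as an exhaustive search: among all strings represented by V, is there one whose stable-sort permutation has a single cycle? The sketch below enumerates the 2^s swap combinations directly instead of building the CCEC graph, so it is exponential and only illustrates what is being decided. The function names are hypothetical, and V = b[ab][ab]a is encoded as the string bababa with swap intervals [1..3) and [3..5).

```python
from itertools import product


def num_cycles(v):
    """Number of cycles of the stable-sort permutation of v, i.e. the number
    of cyclic strings in the inverse transform of v."""
    psi = sorted(range(len(v)), key=lambda j: v[j])  # sorted() is stable
    seen = [False] * len(v)
    count = 0
    for start in range(len(v)):
        if seen[start]:
            continue
        count += 1
        i = start
        while not seen[i]:
            seen[i] = True
            i = psi[i]
    return count


def single_string_variants(v, swaps):
    """Swap combinations whose inverse transform is one cyclic string."""
    result = []
    for flips in product([False, True], repeat=len(swaps)):
        chars = list(v)
        for (i, j), flip in zip(swaps, flips):
            if flip:
                k = (i + j) // 2
                chars[i:j] = chars[k:j] + chars[i:k]
        candidate = "".join(chars)
        if num_cycles(candidate) == 1:
            result.append(candidate)
    return result


print(single_string_variants("bababa", [(1, 3), (3, 5)]))  # ['babbaa', 'bbaaba']
```

For this example, exactly the two variants in which one swap is applied and the other is not correspond to a single cyclic string.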

BCSILA is NP-Complete
We are now ready to show that BCSILA is NP-complete using the reduction chain 3-SAT → CCEC → BCSILA. The first step was described in Section 4, and we will next describe the second. The latter reduction is not a general reduction from an arbitrary CCEC instance but works only for a CCEC instance obtained by the first reduction (including the extra gadgets).
The above BCSILA to CCEC reduction transforms each pair of swapped positions into a vertex and each swap interval into a vertex partition. Our construction creates a BCSILA instance such that the resulting BWT has the necessary swaps to produce the CCEC instance vertices and partitions. However, the BWT also has some unwanted swaps producing spurious vertices, but we will show that these spurious vertices do not invalidate the reduction.
Starting from a CCEC instance, we construct a set of cyclic strings and obtain the BCSILA instance as the LCP array of that string set. The construction associates two strings with each vertex, and the cyclic strings are formed by concatenating the vertex strings according to the cycles in the graph in its initial state. The two passes of the cycles through a vertex must use different strings but it does not matter which pass uses which string.
Let n be the number of vertices in the CCEC graph and let m be the number of vertex partitions. We number the vertices from 1 to n and the partitions from 1 to m. The biggest partition number is assigned to the partition with the vertex y, the second biggest to the partition corresponding to the variable x_1, the third biggest to the variable x_2, and so on. The three biggest vertex numbers are assigned to the vertices labeled x_1 in the extra gadget for the variable x_1, the next three biggest to the extra gadget vertices labeled x_2, and so on. Within each extra gadget, the biggest number is assigned to the middle one of the three vertices. The strings associated with a vertex are ba^{k}ba^{m+2h} and bba^{k}bba^{m+2h−1}, where k is the partition number and h is the vertex number. This completes the description of the transformation from a CCEC instance to a BCSILA instance.
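The two vertex strings are easy to generate explicitly. The helper below is a sketch with a hypothetical name and hypothetical parameter names; it only spells out the two patterns given above for a partition number k, a vertex number h and m partitions.

```python
def vertex_strings(k, h, m):
    """The two strings associated with vertex number h in partition k,
    where m is the total number of partitions."""
    first = "b" + "a" * k + "b" + "a" * (m + 2 * h)
    second = "bb" + "a" * k + "bb" + "a" * (m + 2 * h - 1)
    return first, second


print(vertex_strings(k=2, h=1, m=3))  # ('baabaaaaa', 'bbaabbaaaa')
```

The runs of a's at the ends have lengths m + 2h and m + 2h − 1, so every vertex string ends in a run of a distinct length greater than m, and the longest run overall, of length m + 2n, belongs to the vertex numbered n.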
Let us now analyze the transformation by changing the BCSILA instance back to a CCEC instance using the construction of the preceding section. Specifically, we will analyze the swaps in the BWT produced from the LCP array. Let W be the set of cyclic strings constructed from the CCEC instance, and let V be the BWT with swaps constructed from LCP_W. An interval [i..j) in V is a swap interval if and only if (1) [i..j) is an x-interval for a string x such that either occ(axa) = occ(bxb) = occ(x)/2 or occ(axb) = occ(bxa) = occ(x)/2, where occ(y) is the number of occurrences of y in W, and (2) is a swap interval, the string x is called its swap core. Our goal is to identify all swap cores.
Let us first consider strings of the form x = ba^{k}b. If k > m, occ(x) ≤ 1 and x cannot be a swap core. For k ∈ [1..m], x is always a swap core and corresponds to the CCEC partition numbered k. Let v = BWT(W) and let V′ be v together with the swaps for cores of the form ba^{k}b. It is easy to verify that a CCEC instance constructed from V′ as described in the previous section is identical to the original CCEC instance. Thus, if there were no other swap cores, we would have a perfect reduction.
Unfortunately, there are other swap cores. A systematic examination of all strings in Appendix F shows that the other swap cores must be of the following forms: ba^{m+2n−1}, a^{m+2n−1}b, a^{m}ba^{m}, a^{m}bba^{m}, a^{k}ba^{h}, a^{k}bba^{h}, a^{k}ba^{i}ba^{h} and a^{k}bba^{i}bba^{h}. Furthermore, it shows that each such swap core has exactly two occurrences, which means that the values k and/or h have to be sufficiently large. Each extra swap core adds a free vertex that is connected to the graph by making two existing edges pass through the new vertex. Because of the way we chose to assign the biggest partition and vertex numbers, all the additional connections are within the extra gadgets, which does not change the existence of an Eulerian cycle. This completes the proof.
Theorem 3. BCSILA is NP-complete.

BTSILA is NP-Complete
We will now show that BTSILA is NP-complete by modifying the above reduction for BCSILA to include a single terminator symbol $ in the strings. The modification is applied to the set W of cyclic strings derived from the CCEC instance such that LCP_W is the BCSILA instance. Specifically, we replace the (unique) occurrence of a^{m+2n}, which is the longest consecutive run of a's, with a^{m+2n+1}$a^{m+2n} to obtain W_$ and LCP_{W_$}. We will show that LCP_{W_$} is a yes-instance of CSILA iff LCP_W is a yes-instance of BCSILA. Furthermore, if a cyclic string u is a solution to the CSILA instance, i.e., LCP_u = LCP_{W_$}, then LCP_v = LCP_{W_$}, where v is the rotation of u ending with $ interpreted as a terminated string. Thus LCP_{W_$} is a yes-instance of BTSILA iff it is a yes-instance of CSILA iff LCP_W is a yes-instance of BCSILA.
In general, adding even a single occurrence of a third symbol complicates the inference of the BWT from the LCP array and means that the set of equivalent BWTs can no longer be described by a set of swaps. Consider how the operation of the procedure InferInterval (Algorithm 2) changes. First, it gets an extra $x-interval as an input in addition to the x-, ax- and bx-intervals. Second, the x-interval may be split into three subintervals, xy$-, xya- and xyb-intervals, instead of two (which happens when the LCP interval contains two identical minima). This leads to many more combinations to consider, and some of those combinations are more complicated.
Fortunately, in our case, having the single $ surrounded by the two longest runs of a's simplifies things, and we will describe a modification of InferInterval to handle this case. Every call to InferInterval belongs to one of the following three types: (1) the x-interval is split into two and the $x-interval is empty, (2) the x-interval is split into two and the $x-interval is non-empty, and (3) the x-interval is split into three. The first case needs no modification at all. The other two cases mean that either $x or x$ occurs in the produced string set, and since this property is not affected by swaps (or the three-way permutations described below), one of them occurs in every produced string set including W_$. Since x must occur at least twice, one of the latter two cases happens iff x = a^k for some k ∈ [0..m + 2n]. Although in general InferInterval cannot always know x, it is easy to keep track of x when x = a^k.
When InferInterval is called with x = a^k for k ≤ m + 2n − 2, the x-interval and the ax-interval are always split into three, the bx-interval is split into two, and there is a $x-interval of size one. In general, we might not know whether the two subintervals of the bx-interval are bx$- and bxa-, or bx$- and bxb-, or bxa- and bxb-intervals. However, since the x$- and ax$-intervals both have size one, there can be no bx$-interval, and thus all the subintervals can be uniquely determined and recursed on. When x = a^{m+2n−1}, the x-interval has size five and is split into three with the middle part (the xa-interval) having size three. The ax-interval has size three and is split into three. In this case too, only one combination of subintervals is possible.
When x = a^{m+2n}, the x-interval has size three and is split into three, and the $x-, ax- and bx-intervals have size one. Therefore, the x-interval in the BWT contains some permutation of the three characters and all permutations are valid. This three-way permutation adds to the variation provided by the swaps in other parts of the BWT. A more careful analysis shows that the BWT x-interval of
- $ab or $ba implies an occurrence of $x$, which is only possible if x$ is a separate string;
- ba$ implies an occurrence of axa, which is only possible if a single a is a separate string;
- a$b implies occurrences of ax$ and $xa, which is only possible if ax$ is a separate string;
- ab$ implies an occurrence of ax$xb; and
- b$a implies an occurrence of bx$xa.
A single string solution is only possible in the last two cases, and any such solution corresponds to a solution for the BCSILA instance LCP_W (obtained by replacing ax$x or x$ax with x). Hence LCP_{W_$} is a yes-instance of CSILA, and thus of BTSILA, if and only if LCP_W is a yes-instance of BCSILA, which proves the following result.

Theorem 4. BTSILA is NP-complete.

Algorithm for CSSILA
In all of the above, we have assumed a binary alphabet (excluding the single symbol $). In this section, we consider the CSSILA problem (i.e., Cyclic String Set Inference from LCP Array) without a restriction on the alphabet size.
Let L[1..n) be an instance of the CSSILA problem, i.e., an array of integers (and possibly ω's). Let σ − 1 be the number of zeroes in L, and Σ an alphabet of size σ. As with the binary BCSSILA problem, we describe an algorithm that outputs a representation of the set W_L = {w ∈ Σ^n : LCP_{IBWT(w)} = L}; in this case the representation is an automaton that accepts W_L. We show the following result.
Fig. 5. SA, LCP and ST for cyclic string aababa$. Notice that SA, LCP and suffix tree shape are the same as in Fig. 2.
Fig. 6. SA, LCP and ST for cyclic string aababa.
Fig. 7. SA, LCP and ST for cyclic string aabaab.Because the string is non-primitive (concatenation of multiple copies of the same string), some of its cyclic suffixes are identical.The LCP of identical suffixes is ω and they share a leaf in the suffix tree.

B Reductions from BTSILA
Proof (of Proposition 1). By the discussion in the introduction, an array of n integers is
- a yes-instance of BTSILA iff it has a leading zero and is a yes-instance of BOSILA with the leading zero removed,
- a yes-instance of BTSILA iff it has a leading zero and at most one other zero, and is a yes-instance of TSILA,
- a yes-instance of TSILA iff it has a leading zero and is a yes-instance of OSILA with the leading zero removed,
- a yes-instance of TSILA iff it is a yes-instance of TSSILA,
- a yes-instance of TSSILA iff it has one or more leading zeros and is a yes-instance of OSSILA with the leading zeros removed,
- a yes-instance of BTSILA iff it has a leading zero and at most one other zero, and is a yes-instance of BTSSILA, and
- a yes-instance of BTSSILA iff it has one or more leading zeros and at most one other zero, and is a yes-instance of BOSSILA with the leading zeros removed.
In all cases, there is a simple linear or at most quadratic time reduction.⊓ ⊔

C Algorithm for BCSSILA: A Proof and an Example
Proof (of Lemma 6). Consider first how it is swapped from one side of the interval [i_x..j_x) to the other side. Now we use Lemma 1 to determine how a suffix at SA[i] changes with the swap. If i belongs to a cycle that never visits [i_x..j_x), i.e., the suffix does not contain x, there is no change. Suppose then that the cycle starting at i first reaches [i_x..j_x) after k steps, and w.l.o.g. assume that it reaches specifically the xa-interval, i.e.
Then for some string y of length k, the suffix at i changes from yxa... into yxb.... Note also that yx cannot contain x except at the end. Now consider two adjacent suffixes. If both are of the form yxa..., they both change to yxb.... The parts after x may change a lot but the LCP of the two suffixes remains the same because LCP[i_xa + 1..j_xa) = LCP[i_xb + 1..j_xb). In all other cases (one or both do not contain x or the parts before x differ), the LCP is determined in the unchanged part of the suffixes. Thus LCP_{IBWT(v′)} = LCP. ⊓ ⊔

The following example illustrates the operation of the algorithm. When processing InferInterval([1..3), [0..1), [4..5)) (see Figure 8 (3)), we find that both the ax- and the bx-interval have size 1. In such a case, we always have a swap interval. Here we set v[1..3) = ab and add [1..3) into S.
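The invariance stated in Lemma 6 can be checked experimentally on the instance of Fig. 11. The following Python sketch is only an illustration under stated assumptions: it takes IBWT to be the usual stable inverse of the extended BWT (the cycles of the permutation spell the cyclic strings), uses 0-based positions, represents ω by math.inf, and models a swap as exchanging the two halves of a swap interval. The names lcp_of_ibwt and apply_swaps are hypothetical helpers introduced here.

```python
import math
from itertools import product


def lcp_of_ibwt(w):
    """LCP array of the cyclic string multiset IBWT(w), by brute force."""
    n = len(w)
    # Stable sort of positions by character: psi[r] is the position that
    # follows r in its cycle, and first[r] is the first character at rank r.
    psi = sorted(range(n), key=lambda i: w[i])
    first = [w[i] for i in psi]

    def suffix(r, length):
        out = []
        for _ in range(length):
            out.append(first[r])
            r = psi[r]
        return out

    limit = 2 * n  # periodic suffixes agreeing this far are identical
    lcp = []
    for r in range(1, n):
        a, b = suffix(r - 1, limit), suffix(r, limit)
        k = 0
        while k < limit and a[k] == b[k]:
            k += 1
        lcp.append(math.inf if k == limit else k)
    return lcp


def apply_swaps(v, intervals, flags):
    """Exchange the two halves of every interval whose flag is set."""
    chars = list(v)
    for (i, j), flip in zip(intervals, flags):
        if flip:
            k = (i + j) // 2
            chars[i:j] = chars[k:j] + chars[i:k]
    return "".join(chars)


# Data taken from the caption of Fig. 11.
v = "babaabbbaaabaa"
L = [2, 5, 1, 4, 3, 4, 2, 0, 3, 2, 5, 3, 1]
swap_intervals = [(1, 3), (3, 7), (10, 12)]

variants = [apply_swaps(v, swap_intervals, flags)
            for flags in product([False, True], repeat=3)]
```

Under these assumptions every one of the eight variants has the same LCP array L, even though the swaps change how the positions are grouped into cycles.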

D Algorithm for CSSILA
In this section we present the algorithm solving the CSSILA problem for alphabets of any size. Let Σ = {a_1, a_2, ..., a_σ} be an alphabet and L[1..n) be an LCP array containing σ − 1 zeroes. We try to reconstruct the set of strings W_L = {w ∈ Σ^n : LCP_{IBWT(w)} = L}. The resulting set W_L is represented as an acyclic deterministic finite automaton A_L accepting all strings w ∈ W_L. Such a representation allows us to perform efficient W_L membership tests, enumerate all its members, and efficiently find the lexicographic predecessor and successor for any w ∈ W_L.
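To make the definition of W_L concrete, the following Python sketch computes LCP_{IBWT(w)} for a given w by brute force. It assumes that IBWT is the usual stable inverse of the extended BWT, whose permutation cycles spell the cyclic strings, and it represents ω by math.inf. The function name lcp_of_ibwt and the 0-based positions are choices of this sketch, not notation from the text.

```python
import math


def lcp_of_ibwt(w):
    """Return [LCP[1], ..., LCP[n-1]] for the string multiset IBWT(w)."""
    n = len(w)
    # psi[r] is the rank of the suffix obtained by dropping the first
    # character of the suffix with rank r (stable sort of w by character).
    psi = sorted(range(n), key=lambda i: w[i])
    first = [w[i] for i in psi]

    def suffix(r, length):
        # First `length` characters of the infinite cyclic suffix of rank r.
        out = []
        for _ in range(length):
            out.append(first[r])
            r = psi[r]
        return out

    # Every suffix is periodic with period at most n, so two suffixes that
    # agree on 2n characters are identical and their LCP is omega.
    limit = 2 * n
    lcp = []
    for r in range(1, n):
        a, b = suffix(r - 1, limit), suffix(r, limit)
        k = 0
        while k < limit and a[k] == b[k]:
            k += 1
        lcp.append(math.inf if k == limit else k)
    return lcp
```

With this helper, membership of w in W_L is simply the test lcp_of_ibwt(w) == L. For instance, w = abccab yields [1, 0, 1, 0, 2] and w = babbbaa yields [1, 4, 0, 2, 1, 3], the arrays used in Figures 10 and 9.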
The recursive iteration of intervals in the binary case does not work for larger alphabets, because we can no longer uniquely match intervals. Instead, the algorithm iterates from left to right, and for that we need a different characterization of W_L.
For any c ∈ Σ and any w ∈ W_L, consider two consecutive occurrences of c in w (i.e., there are no other occurrences of c between them but there may be other characters). Say they occur at positions h and k, and are the i-th and (i + 1)-th occurrence of c in w. Then we must have that L[i_c + i] = 1 + min{L[j] : h < j ≤ k}, where i_c is the starting position of the c-interval. We call this the pair constraint. The following lemma shows how to characterize W_L using pair constraints.

Lemma 7. For any w ∈ Σ^n, w ∈ W_L if and only if every pair of consecutive occurrences satisfies the pair constraint.
Proof. Let V = IBWT(w). Consider a pair of consecutive occurrences at positions h and k in w, which are the i-th and (i + 1)-th occurrence of c in w.
, where i_c is the starting position of the c-interval.
For any suffix array SA and the corresponding LCP array LCP, and any two positions h and k with h < k, min{LCP[j] : h < j ≤ k} is the length of the longest common prefix of the suffixes SA[h] and SA[k]. Thus if L = LCP_V, the pair constraint must hold. This proves the "only if" part.
The "if" part is proven by contradiction. Suppose that all the pair constraints hold in otherwise we swap the roles of L and LCP_V and pick the smallest value in LCP_V that differs from L. Let c ∈ Σ be the character such that d is in the c-interval [i_c, j_c), and let i = d − i_c. Let h and k be the positions of the i-th and (i + 1)-th occurrences of c in w. Since the pair constraints hold for both L and LCP_V, we must have being the smallest wrong value. This completes the "if" part.

⊓ ⊔
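Lemma 7 turns membership in W_L into a purely local test on w and L. The sketch below is a minimal Python illustration, assuming the pair constraint has the form L[i_c + i] = 1 + min{L[j] : h < j ≤ k} with 0-based positions in w and 1-based occurrence numbers, which is how the proof above uses it. It also checks explicitly that each character occurs exactly |L|_c times, and it ignores ω entries. The names in_W_L and brute_force_W_L are hypothetical.

```python
from itertools import product


def in_W_L(w, L, alphabet):
    """Check the pair constraints of w against L (given as L[1..n))."""
    n = len(L) + 1
    if len(w) != n:
        return False
    padded = [None] + list(L)  # padded[d] is L[d]; index 0 is unused
    # The c-intervals start at 0 and at every zero of L.
    starts = [0] + [d for d in range(1, n) if padded[d] == 0]
    if len(starts) != len(alphabet):
        return False
    ends = starts[1:] + [n]
    for c, i_c, j_c in zip(alphabet, starts, ends):
        positions = [pos for pos, ch in enumerate(w) if ch == c]
        if len(positions) != j_c - i_c:
            return False
        for i in range(1, len(positions)):
            h, k = positions[i - 1], positions[i]
            # Assumed pair constraint for the i-th and (i+1)-th occurrence.
            if padded[i_c + i] != 1 + min(padded[h + 1:k + 1]):
                return False
    return True


def brute_force_W_L(L, alphabet):
    """Enumerate W_L by testing all strings; exponential, for tiny n only."""
    n = len(L) + 1
    candidates = ("".join(t) for t in product(alphabet, repeat=n))
    return [w for w in candidates if in_W_L(w, L, alphabet)]
```

For L = [1, 0, 1, 0, 2] over {a, b, c} the enumeration returns the eight strings listed in the caption of Fig. 10, and for L = [1, 4, 0, 2, 1, 3] over {a, b} it returns babbbaa and bbabbaa as in Fig. 9.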
Recall that for a string w ∈ Σ* and c ∈ Σ, |w|_c denotes the number of occurrences of c in w. We extend this notion to LCP arrays. Namely, |L|_c denotes the number of occurrences of c in any string w such that LCP_w = L. We split the LCP array L into σ so-called character arrays as follows. For any c ∈ Σ, let [i_c, j_c) be the c-interval and let L_c[1..j_c − i_c) = L[i_c + 1..j_c) − 1 (where A − 1 means subtracting one from each element of A). Notice that the c-intervals can be determined solely based on the occurrences of zeroes in L, and thus we can extend the above definitions to cases where L is not a valid LCP array. For a technical reason, to avoid a number of special cases to be checked (e.g. for empty character subsequences or boundary cases), we set for all c ∈ Σ. This gives us a trivial match for the beginning and end of each character sequence with the global sequence L.
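The split into character arrays is easy to compute directly from the zeroes of L. The following Python sketch illustrates it for L given as the list L[1..n); the function name character_arrays is hypothetical, and the sentinel values mentioned at the end of the paragraph above are deliberately left out.

```python
def character_arrays(L):
    """Return a list of ((i_c, j_c), L_c) pairs, one per character.

    L is the list L[1..n); the c-intervals are delimited by its zeroes,
    and L_c is L[i_c + 1..j_c) with one subtracted from every element.
    """
    n = len(L) + 1
    padded = [None] + list(L)  # padded[d] is L[d]; index 0 is unused
    starts = [0] + [d for d in range(1, n) if padded[d] == 0]
    ends = starts[1:] + [n]
    result = []
    for i_c, j_c in zip(starts, ends):
        L_c = [padded[d] - 1 for d in range(i_c + 1, j_c)]
        result.append(((i_c, j_c), L_c))
    return result
```

The length j_c − i_c of each interval is |L|_c, so for L = [1, 0, 1, 0, 2] every one of the three characters must occur exactly twice.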
To be able to construct the set W_L iteratively we define a notion of (prefix) consistency of a string s ∈ Σ^k (k ≤ n) with an LCP array L when s is considered to be a prefix of some string in In other words, it is a pair condition on the pair consisting of the last occurrence of c in s and the next occurrence of c after the end of s. Since we do not know the location of the next occurrence, we only verify that nothing in L[0..k] violates the condition. Therefore, we have the inequality in place of the equality in the condition.
Definition 3. Let s ∈ Σ^k for k ≤ n. We say that s is prefix consistent with L if
1. the pair constraint holds for every pair of consecutive occurrences in s, and
2. the partial pair constraint holds for each c ∈ Σ such that |s|_c < |L|_c.
From the definition and Lemma 7, we immediately get the following.
The following is easy to verify.
Lemma 8. A string s violates a partial pair constraint if and only if b(s) contains −1.
The significance of the vectors p(s) and b(s) is shown by the following lemma.
Lemma 9. Let s ∈ Σ^k, k < n, be a string prefix consistent with L. Given p(s) and b(s) (but not s), and c ∈ Σ, we can determine whether sc is prefix consistent with L and compute p(sc) and b(sc) in O(σ) time.
Proof. Let us first look at updating the vectors. Let s ∈ Σ^k be a string consistent with an LCP array L and a_i ∈ Σ. Given p(s) = (|s|_{a_1}, |s|_{a_2}, ..., |s|_{a_σ}) we have p(s · a_i) = (|s|_{a_1}, ..., |s|_{a_i} + 1, ..., |s|_{a_σ}). By definition of b and consistency of s with L we have: because we look for the minimal value over the singleton interval L[|s| + 1..|s| + 2), and for c = a_i we have according to the relation of L[|s| + 1] to the minimal value in L[ℓ_c(s) + 1..|s| + 2). Now consider prefix consistency. The extension of s with a_i adds one new pair of consecutive occurrences of a_i's, which satisfies the pair constraint if and only if b_{a_i}(s) = 0. The partial pair constraints of s · a_i can be checked using Lemma 8.
The computation of b(s · a_i) requires the verification of a separate condition for each b_c(s · a_i), c ∈ Σ, hence it can be done in O(σ) time. On the other hand, the computation of p(s · a_i) can be done in constant time.

⊓ ⊔
The structure of the automaton A_L produced by the algorithm is as follows. Each state v of A_L corresponds to a unique pair (p_v, b_v) and represents the set of strings and for each s ∈ S_{v_1}, sc is consistent with L (where A · c denotes appending a character c to each element of the set A). In such a case (p_{v_2}, b_{v_2}) are given by the equations (1) and (2).
Note that if b(s · c) contains −1 for a string s consistent with L, then the state v representing s cannot have an outgoing transition labelled with c. Therefore, for any s consistent with L, b(s) can be represented as a bit vector (i.e., it contains only binary values).
Observe that the empty string ε and all single characters c ∈ Σ are consistent with L. Hence, we can construct the set W_L and the automaton A_L by iterative extension of strings consistent with L. To construct A_L we iterate through sets of states corresponding to strings of length k = 1, ..., n − 1, i.e.
and for each state v ∈ P_k we check the existence of a transition v → v_1. All states corresponding to the sets of strings of length k + 1 consistent with L form the set P_{k+1}.
Observe that for any w ∈ W_L we have p(w) = p(L) and b(w) = (0, 0, ..., 0). Now we are ready to discuss the time and space complexity of our solution. The number of states of A_L is bounded by the number of all possible pairs (p, b) of Parikh vectors and bit vectors. The number of all possible bit vectors is bounded by 2^σ, and the number of all possible Parikh vectors reaches its maximum when the numbers of occurrences of all characters are equal. Moreover, we need O(σ) space to store each state of A_L. Therefore, the space complexity of the presented algorithm is O(σ 2^σ (n/σ + 1)^σ). To construct the automaton A_L returned by the algorithm we need to check, for each state v ∈ A_L, up to σ possible transitions. Validation of a single transition requires O(σ) time. This, together with the bound on the number of all states, gives us the time complexity O(σ^2 2^σ (n/σ + 1)^σ). The above discussion constitutes a proof of Theorem 5 in Section 8.
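The layered construction can be sketched in Python as follows. This is a minimal illustration under several assumptions rather than the exact procedure analysed above: the pair constraint is taken to be L[i_c + i] = 1 + min{L[j] : h < j ≤ k}; a component b_c is 0 when the running minimum since the last c already equals its target and 1 when it is still larger; a transition is rejected outright where the text would record −1; characters without an open pair get b_c = 0 instead of sentinel values; and ω entries are not handled. Because of this encoding, the states need not coincide one-for-one with those drawn in the figures. The names build_automaton and accepted_strings are hypothetical.

```python
def build_automaton(L, alphabet):
    """Build a layered DFA whose states are (Parikh vector, bit vector)."""
    n = len(L) + 1
    padded = [None] + list(L)  # padded[d] is L[d]; index 0 is unused
    starts = [0] + [d for d in range(1, n) if padded[d] == 0]
    if len(starts) != len(alphabet):
        raise ValueError("L must contain exactly sigma - 1 zeroes")
    ends = starts[1:] + [n]
    size = [j - i for i, j in zip(starts, ends)]  # |L|_c for each c
    sigma = len(alphabet)

    def step(state, ci):
        """State reached by appending alphabet[ci], or None if invalid."""
        p, b = state
        k = sum(p)  # length of the prefix read so far
        if p[ci] == size[ci]:
            return None  # all occurrences of this character are used
        if p[ci] >= 1 and b[ci] != 0:
            return None  # the pair closed by this occurrence is violated
        new_p = list(p)
        new_p[ci] += 1
        new_b = [0] * sigma
        if k + 1 < n:
            entering = padded[k + 1]  # value added to every open range
            for c in range(sigma):
                if not 1 <= new_p[c] < size[c]:
                    continue  # no open pair for this character
                target = padded[starts[c] + new_p[c]] - 1
                if entering < target:
                    return None  # partial pair constraint violated
                met_before = c != ci and b[c] == 0
                new_b[c] = 0 if (met_before or entering == target) else 1
        return tuple(new_p), tuple(new_b)

    start = ((0,) * sigma, (0,) * sigma)
    delta, layer = {}, {start}
    for _ in range(n):
        next_layer = set()
        for state in layer:
            for ci, ch in enumerate(alphabet):
                reached = step(state, ci)
                if reached is not None:
                    delta[(state, ch)] = reached
                    next_layer.add(reached)
        layer = next_layer
    return start, delta


def accepted_strings(L, alphabet):
    """List all strings of length n spelled by paths from the start state."""
    n = len(L) + 1
    start, delta = build_automaton(L, alphabet)
    out = []

    def walk(state, prefix):
        if len(prefix) == n:
            out.append(prefix)
            return
        for ch in alphabet:
            if (state, ch) in delta:
                walk(delta[(state, ch)], prefix + ch)

    walk(start, "")
    return out
```

The sketch does not prune states without a valid extension; they stay in the transition table but lie on no path of length n, so the accepted language is unaffected.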
Remark 1. The algorithm presented above also works correctly for a binary alphabet; however, its time and space complexity is worse than that of Algorithm 1.
The structure of the complete automaton A_L is depicted in Figure 9.
We start with the automaton A_L consisting of a single initial node v^(0) represented by a pair (p_0, b_0) = ([0, 0, 0], [0, 0, 0]) and contained in the set P_0. Next, we iterate over all sets P_k for k = 0, ..., n − 1 and check for possible extensions of each state v ∈ P_k. In all cases below we do not consider the obvious inconsistency of s · c for |s|_c = |L|_c. Summing up for P_1, ab, ba and cc are consistent with L, while aa, ac, bb, bc, ca and cb are not.
, hence we create the state v^(6) and the transition v^(4) → v^(6), labelled with a. We have p , hence we create the state v^(7) and the transition v^(4) → v^(7), labelled with b. We have p and it is the first occurrence of c, hence we create the state v^(8) and the transition v^(4) → v^(8), labelled with c.
and it is the first occurrence of a, hence we create the state v^(9) and the transition v^(5) → v^(9), labelled with a. We have and it is the first occurrence of b, hence we create the state v^(10) and the transition v^(5) → v^(10), labelled with b.
Summing up for P_2, aba, abb, abc, baa, bab, bac, cca and ccb are consistent with L.
, hence we create the state v^(11) and the transition v^(6) → v^(11), labelled with b. On the other hand, we have p , hence we create the transition v^(7) → v^(11), labelled with a. On the other hand, we have p , hence we create the state v^(12) and the transition v^(8) → v^(12), labelled with c. On the other hand we have p hence there is no consistent extension of v^(8) with a and b.
and and it is the first occurrence of b, hence we create the state v^(13) and the transition v^(9) → v^(13), labelled with b. On the other hand, we have p (10) • a) = [0, 0, 1] and and it is the first occurrence of a, hence we create the transition v^(10) → v^(13), labelled with a. On the other hand, we have p(S v (10) , hence there is no consistent extension of v^(10) with b.
Note that to list the strings which are inconsistent with L for each hyperplane P_k we consider only those having prefixes of length k − 1 which are consistent with L. If we skip this requirement, we can produce more examples of strings inconsistent with L.
E BCSILA to CCEC: An Example

F Identification of Swap Cores
As described in Section 6, the BCSILA instance derived from a CCEC instance has the strings ba^k b, k ∈ [1..m], as desired swap cores. We will next systematically inspect all other strings to identify all other (undesirable) swap cores. Recall that an interval [i..j) in V is a swap interval if and only if the following conditions hold:
1. [i..j) is an x-interval for a string x such that either occ(axa) = occ(bxb) = occ(x)/2 or occ(axb) = occ(bxa) = occ(x)/2, where occ(y) is the number of occurrences of y in W, and
2. LCP_W[i + 1..k) = LCP_W[k + 1..j), where k = (i + j)/2.
Notice that if occ(x) = j − i = 2, the second condition is trivially true.
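The second condition depends only on the LCP array and can be tested in isolation. The Python sketch below illustrates it, with the array given as the list LCP[1..n). It checks a necessary condition only: the first condition needs the occurrence counts of axa, axb, bxa and bxb in W, which are not computed here. The function name halves_match is hypothetical.

```python
def halves_match(L, i, j):
    """Test LCP[i+1..k) == LCP[k+1..j) for k = (i + j) / 2.

    L is the list LCP[1..n); intervals are half-open, as in the text.
    """
    if (j - i) % 2 != 0:
        return False  # an interval of odd length has no midpoint k
    padded = [None] + list(L)  # padded[d] is LCP[d]; index 0 is unused
    k = (i + j) // 2
    return padded[i + 1:k] == padded[k + 1:j]
```

On the array of Fig. 11, the three bracketed intervals [1..3), [3..7) and [10..12) pass the test, the first and last trivially because both compared ranges are empty, whereas the whole a-interval [0..8) does not.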
Let us start with unary strings. First, b, bb and a^k for k < m + 2n are not swap cores because they are preceded and succeeded by a more often than by b. We can also eliminate all other unary strings since they occur at most once. We also note that any string beginning (ending) with bb cannot be a swap core because it is always preceded (succeeded) by a. Let us then consider strings x of the following forms:
- x = ba^k. If k < m + 2n − 1, occ(xa) > occ(xb), and if k ≥ m + 2n, occ(x) ≤ 1. In either case, x is not a swap core. On the other hand, x = ba^{m+2n−1} is always a swap core with two occurrences.
- x = a^k b. This case is symmetric to the one above except we cannot be certain whether x = a^{m+2n−1} b is a swap core or not since the characters following the two occurrences of x are not fully determined. However, we count x as a potential swap core.
- x = a^k ba^k and x = a^k bba^k. If k > m, we have occ(x) = 0, and if k < m, we have occ(ax) > occ(bx). If k = m, then x is a swap core and occ(x) = 2.
- x = a^k ba^h and x = a^k bba^h for k < h. If k > m, we have occ(x) = 0 and if h ≤ m, we have occ(ax) > occ(bx). If k ≤ m < h, x is obviously not a swap core if occ(x) < 2 but also not if occ(x) > 2 because then we must have occ(xa) > occ(xb). On the other hand, if k ≤ m < h and occ(x) = 2, then x might be a swap core.
- x = a^k ba^h and x = a^k bba^h for k > h. This is symmetric to the case above.
- x = ba^k ba^h, x = ba^k bba^h, x = a^h ba^k b and x = a^h bba^k b. If k > m, occ(x) ≤ 1. If k ≤ m, every occurrence of x is either preceded (the first two cases) or succeeded (the latter two cases) by the same character. Thus x is never a swap core.
- x = a^k ba^i ba^h and x = a^k bba^i bba^h for i ∈ [1..m]. Obviously, x is not a swap core if occ(x) < 2 but also not if occ(x) > 2 because then occ(xa) > occ(xb). If occ(x) = 2 then x may or may not be a swap core.
Any string not mentioned above either does not occur at all or contains a substring of the form ba^k b for k > m and occurs once.
Notice that each of the potential extra swap cores has exactly two occurrences.

Fig. 2. SA, LCP and ST for terminated string aababa$. Notice how the LCP array encodes the shape of the suffix tree. The dashed arrows are suffix links, which connect the node representing cx, for a symbol c and a string x, to the node representing x.

Corollary 2. w ∈ W_L if and only if |w| = n and w is prefix consistent with L. See Examples 4 and 5 for illustration of strings consistent and inconsistent with a given LCP array. Let p(s) = (|s|_{a_1}, |s|_{a_2}, ..., |s|_{a_σ}) be the Parikh vector of s and p(L) = (|L|_{a_1}, |L|_{a_2}, ..., |L|_{a_σ}) be the Parikh vector of L. Define b

Fig. 9. The finite deterministic automaton A_L constructed for LCP array L = [1, 4, 0, 2, 1, 3]. Squares containing nodes represent Parikh vectors, i.e. the square in the i-th row and j-th column represents the vector p = (i − 1, j − 1). P_k denotes the sets of nodes representing all strings of length k consistent with L. The temporarily created states with no valid extension, which are not included in A_L, are marked with dashed lines. We have two possible paths leading from the initial to the final state, corresponding to strings babbbaa and bbabbaa.

Fig. 10. The finite deterministic automaton A_L constructed for LCP array L = [1, 0, 1, 0, 2]. The upper part of each state v contains the Parikh vector p_v, while the lower part contains the bit vector b_v. P_k denotes the sets of nodes representing all strings of length k consistent with L. The temporarily created states with no valid extension, which are not included in A_L, are marked with dashed lines. We have eight possible paths leading from the initial to the final state, corresponding to strings abccab, abccba, baccab, baccba, ccabab, ccabba, ccbaab and ccbaba.

Fig. 11. The graphs G_V (a) and G_V (b) for V = b[ab][aabb]baa[ab]aa, which is the BWT with swaps produced from the LCP array L = [2, 5, 1, 4, 3, 4, 2, 0, 3, 2, 5, 3, 1]. The solid edges in G_V are the edges of G_v for v = babaabbbaaabaa. The graph G_V is the CCEC instance derived from the BCSILA instance L. In the initial state, all vertices are in the straight state, so that the cycles in G_V correspond to the cycles in G_v. The only non-singleton partition in G_v is {3/5, 6/4}, corresponding to the only swap interval of length more than two in V.