Degenerate String Comparison and Applications

A generalised degenerate string (GD string) Ŝ is a sequence of n sets of strings of total size N, where the i-th set contains strings of the same length k_i but this length can vary between different sets. We denote the sum of these lengths k_0, k_1, . . . , k_{n−1} by W. This type of uncertain sequence can represent, for example, a gapless multiple sequence alignment of width W in a compact form. Our first result in this paper is an O(N + M)-time algorithm for deciding whether the intersection of two GD strings of total sizes N and M, respectively, over an integer alphabet is non-empty.


Introduction
A degenerate string (or indeterminate string) over an alphabet Σ is a sequence of subsets of Σ. A great deal of research has been conducted on degenerate strings (see [1,11,20,29,32] and references therein). These types of uncertain sequences have been used extensively for flexible modelling of DNA sequences known as IUPAC-encoded DNA sequences [23].
In [19], the authors introduced a more general definition of degenerate strings: an elastic-degenerate string (ED string) S over Σ is a sequence of subsets of Σ* (see also network expressions [28]) with the aim of representing multiple genomic sequences [10]. That is, a set of S does not, in general, contain only letters; a set may also contain strings, including the empty string. In a few recent papers on this notion, the authors provided several algorithms for pattern matching; specifically, for finding all exact [17] and approximate [8] occurrences of a standard string pattern in an ED text.
We introduce here another special type of uncertain sequence called generalised degenerate string; this can be viewed as an extension of degenerate strings or as a restricted variant of ED strings. Formally, a generalised degenerate string (GD string) Ŝ over Σ is a sequence of n sets of strings over Σ of total size N, where the i-th set contains strings of the same length k_i > 0 but this length can vary between different sets. We denote the sum of these lengths k_0, k_1, . . . , k_{n−1} by W. Thus a GD string can be used to represent a gapless multiple sequence alignment (MSA) of fixed width, that is, for example, a high-scoring local alignment of multiple sequences, in a compact form; see Figure 1. This type of alignment is used for finding functional sequence elements [14]. For instance, searching for palindromic motifs in these types of alignments is an important problem since many transcription factors bind as homodimers to palindromes [26]. Specifically, a set of virus species can be clustered using high-scoring MSA to obtain subsets of viruses that have a common hairpin structure [27].
Figure 1: A GD string representing a gapless multiple sequence alignment. Aligned sequences shown in the figure: CA--AGCTCTATCTCGTA--TT, C---AGCCGAAGCTCGTATATT, CATCAAGTCAACGCAG----TT. Panel (c): GD string obtained from the local gapless alignment.
Our motivation for this paper comes from finding palindromes in these types of uncertain sequences. Let us start off with standard strings. A palindrome is a sequence that reads the same from left to right and from right to left. Detection of palindromic factors in texts is a classical and well-studied problem in algorithms on strings and combinatorics on words with a lot of variants arising out of different practical scenarios. In molecular biology, for instance, palindromic sequences are extensively studied: they are often distributed around promoters, introns, and untranslated regions, playing important roles in gene regulation and other cell processes (e.g. see [4]). In particular these are strings of the form X X̄^R, also known as complemented palindromes, occurring in single-stranded DNA or, more commonly, in RNA, where X is a string and X̄^R is the reverse complement of X. In DNA, C-G are complements and A-T are complements; in RNA, C-G are complements and A-U are complements.
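As a small illustration of complemented palindromes, the following Python sketch tests whether a string equals its own reverse complement, which for an even-length string is the same as having the form X followed by the reverse complement of X. The function names and the two translation tables are illustrative choices and not part of the paper; the tables only encode the complement pairs stated above.

```python
# Complement pairs as stated in the text: C-G and A-T (DNA), C-G and A-U (RNA).
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(x, table=DNA_COMPLEMENT):
    """Complement every letter of x, then reverse the result."""
    return x.translate(table)[::-1]


def is_complemented_palindrome(s, table=DNA_COMPLEMENT):
    """True if s reads the same as its reverse complement."""
    return s == reverse_complement(s, table)


if __name__ == "__main__":
    print(is_complemented_palindrome("GAATTC"))                  # DNA example
    print(is_complemented_palindrome("GGAUCC", RNA_COMPLEMENT))  # RNA example
```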
A string X = X[0]X[1] . . . X[n − 1] is said to have an initial palindrome of length k if its prefix of length k is a palindrome. Manacher first discovered an on-line algorithm that finds all initial palindromes in a string [25]. Later Apostolico et al. observed that the algorithm given by Manacher is able to find all maximal palindromic factors in the string in O(n) time [6]. Gusfield gave an off-line linear-time algorithm to find all maximal palindromes in a string and also discussed the relation between biological sequences and gapped palindromes [18].
For uncertain sequences, we first need to have an algorithm for efficient string comparison, where automata provide the following baseline. Let X̂ and Ŷ be two GD (or two ED) strings of total sizes N and M, respectively. We first build the non-deterministic finite automaton (NFA) A of X̂ and the NFA B of Ŷ in time O(N + M). We then construct the product NFA C such that L(C) = L(A) ∩ L(B) in time O(NM). The non-emptiness decision problem, namely, checking if L(C) ≠ ∅, is decidable in time linear in the size of C, using breadth-first search (BFS). Hence the comparison of X̂ and Ŷ can be done in time O(NM). It is known that if there existed faster methods for obtaining the automata intersection, then significant improvements would be implied to many long standing open problems [24]. Hence an immediate reduction to the problem of NFA intersection does not particularly help. For GD strings we show at the beginning of Section 3 that we can build an ad-hoc deterministic finite automaton (DFA) for X̂ and Ŷ, so that the intersection can be performed efficiently, but this simple solution cannot achieve O(N + M) time as its cost is alphabet-dependent.
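The automata baseline can be sketched as a breadth-first search over pairs of states, building the product NFA only as far as it is reachable. The representation below (a triple of start state, accepting set and a nested transition dictionary, with no ε-transitions) is an assumption made for this sketch and is not taken from the paper.

```python
from collections import deque


def intersection_nonempty(nfa_a, nfa_b):
    """Decide whether L(A) ∩ L(B) is non-empty by BFS on the product NFA.

    Each NFA is a triple (start, accepting, delta), where delta[state][letter]
    is a set of successor states. No epsilon transitions are assumed.
    """
    start_a, accept_a, delta_a = nfa_a
    start_b, accept_b, delta_b = nfa_b
    seen = {(start_a, start_b)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        if p in accept_a and q in accept_b:
            return True  # an accepting product state is reachable
        for letter, targets_a in delta_a.get(p, {}).items():
            targets_b = delta_b.get(q, {}).get(letter, ())
            for p2 in targets_a:
                for q2 in targets_b:
                    if (p2, q2) not in seen:
                        seen.add((p2, q2))
                        queue.append((p2, q2))
    return False


if __name__ == "__main__":
    # A accepts {"ab"}, B accepts {"ab", "ac"}, C accepts {"ac"}.
    A = (0, {2}, {0: {"a": {1}}, 1: {"b": {2}}})
    B = (0, {2}, {0: {"a": {1}}, 1: {"b": {2}, "c": {2}}})
    C = (0, {2}, {0: {"a": {1}}, 1: {"c": {2}}})
    print(intersection_nonempty(A, B), intersection_nonempty(A, C))
```

The number of product states visited is at most the product of the two state counts, which is where the O(NM) bound of the baseline comes from.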
Our Contribution. Our first result in this paper is an O(N + M)-time algorithm for deciding whether the intersection of two GD strings of sizes N and M, respectively, over an integer alphabet is non-empty. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in only linear space. An automata model of computation can also be employed to obtain these results but we present here an efficient implementation in the standard word RAM model with word size w = Ω(log(N + M)) that works also for integer alphabets. We then apply our string comparison tool to compute palindromes in GD strings. We present an O(min{W, n^2}N)-time algorithm for computing all palindromes in Ŝ. Furthermore, we show a non-trivial Ω(n^2 |Σ|) lower bound under the Strong Exponential Time Hypothesis [21,22] for computing all maximal palindromes. Note that there exists an infinite family of GD strings over an integer alphabet of size |Σ| = Θ(N) on which our algorithm requires time O(n^2 N), thus matching the conditional lower bound. Finally, proof-of-concept experimental results are presented using real protein datasets; specifically, on applying our tools to find the location of palindromes in immunoglobulin genes of the human V regions.
WABI 2018

Preliminaries
An alphabet Σ is a non-empty finite set of letters of size σ = |Σ|. A string X on an alphabet Σ is a sequence of elements of Σ. The set of all strings on an alphabet Σ, including the empty string ε of length 0, is denoted by Σ*. For any string X, we denote by X[i . . . j] the substring or factor of X that starts at position i and ends at position j. In particular, X[0 . . . j] is the prefix of X that ends at position j, and X[i . . . |X| − 1] is the suffix of X that starts at position i, where |X| denotes the length of X. The suffix tree of X (generalised suffix tree for a set of strings) is a compact trie representing all suffixes of X. We denote the reversal of X by string X^R, i.e. X^R = X[|X| − 1] X[|X| − 2] . . . X[0].
A string P is said to be a palindrome if and only if P = P^R. In other words, a palindrome is a string that reads the same forward and backward, i.e. a string P is a palindrome if P = Y a Y^R where Y is a string, Y^R is the reversal of Y and a is either a single letter or the empty string. Moreover, a factor X[i . . . j] of X that is a palindrome is called a palindromic factor of X. It is said to be a maximal palindrome if there is no other palindrome in X with center (i + j)/2 and larger radius. Hence a string X of length n has exactly 2n − 1 maximal palindromes. A maximal palindrome P of X can be encoded as a pair (c, r), where c is the center of P in X and r is the radius of P.
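To make the (c, r) encoding concrete, here is a naive quadratic-time Python sketch that reports one maximal palindrome per center of a standard string. It is not Manacher's linear-time algorithm; the convention that the radius is half the palindrome length, and that an empty palindrome at a center between two different letters has radius 0, is an assumption of this sketch.

```python
def maximal_palindromes(x):
    """Return a list of (center, radius) pairs, one per each of the 2n - 1 centers.

    The center of X[i..j] is (i + j) / 2; the radius is taken to be half the
    length of the palindrome (an assumption made for this sketch).
    """
    n = len(x)
    result = []
    for twice_c in range(2 * n - 1):
        # Innermost pair of positions around the center twice_c / 2.
        i, j = twice_c // 2, (twice_c + 1) // 2
        while i >= 0 and j < n and x[i] == x[j]:
            i -= 1
            j += 1
        length = j - i - 1  # the last successful pair was (i + 1, j - 1)
        result.append((twice_c / 2, length / 2))
    return result


if __name__ == "__main__":
    for center, radius in maximal_palindromes("abaab"):
        print(center, radius)
```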
In this work, we generally consider GD strings over an integer alphabet of size σ = N^{O(1)}.
Example 3. The GD string Ŝ of Figure 1(c) has length n = 6, size N = 28, and W = 12.
Definition 4. Given two degenerate letters X̂ and Ŷ, their Cartesian concatenation is X̂ ⊗ Ŷ = {xy | x ∈ X̂, y ∈ Ŷ}.
Given two GD strings R̂ and Ŝ of equal total width the intersection of their languages is defined by L(R̂) ∩ L(Ŝ).
Definition 6. Let X̂ = {x_i ∈ Σ^k} and Ŷ = {y_j ∈ Σ^h} be two degenerate letters on alphabet Σ. Further let us assume without loss of generality that Ŷ is the set that contains the shorter strings (i.e. h ≤ k). We define the chop of X̂ and Ŷ and the active suffixes of X̂ and Ŷ as follows:
chop_{X̂,Ŷ} = {y_j ∈ Ŷ | y_j = x_i[0 . . . h − 1] for some x_i ∈ X̂},
active_{X̂,Ŷ} = {x_i[h . . . k − 1] | x_i ∈ X̂ and x_i[0 . . . h − 1] ∈ Ŷ}.
When active_{X̂,Ŷ} = {ε}, we set active_{X̂,Ŷ} = ∅. We then have that active_{X̂,Ŷ} = ∅ either if h = k or if there is no match between any of the strings in Ŷ and the prefix of a string in X̂; i.e. chop_{X̂,Ŷ} = ∅.
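The definitions above can be sketched directly with Python sets. The formulas used for chop and active suffixes follow the description given in Example 7 (strings of the shorter letter that are prefixes of strings of the longer letter, and the corresponding remaining suffixes); the function names, and the reading of L(Ŝ) as the Cartesian concatenation of all letters of Ŝ, are assumptions of this sketch.

```python
from itertools import product


def cartesian_concat(X, Y):
    """Cartesian concatenation of two degenerate letters given as sets of strings."""
    return {x + y for x in X for y in Y}


def language(gd):
    """All strings spelled by a GD string (a list of sets); exponential in general."""
    return {"".join(choice) for choice in product(*gd)}


def chop_and_active(X, Y):
    """Chop and active suffixes of X and Y, where Y holds the shorter strings."""
    h = len(next(iter(Y)))
    prefixes_of_X = {x[:h] for x in X}
    chop = {y for y in Y if y in prefixes_of_X}
    active = {x[h:] for x in X if x[:h] in Y}
    if active == {""}:  # equal widths: nothing remains to be matched
        active = set()
    return chop, active


if __name__ == "__main__":
    X = {"ACG", "ATG", "CCA"}
    Y = {"AC", "AT", "GG"}
    print(chop_and_active(X, Y))
    print(sorted(language([{"A", "C"}, {"GT"}])))
```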
Example 7. Consider the following degenerate letters X̂ and Ŷ where w(Ŷ) < w(X̂). The underlined strings in letter Ŷ are prefixes of strings in letter X̂, hence they are in chop_{X̂,Ŷ}. The suffixes of such strings in X̂ are the active suffixes in active_{X̂,Ŷ}.
Let R̂ and Ŝ be two GD strings of length r and s, respectively. [...] We call these the shortest synchronized prefixes of R̂ and Ŝ, respectively, when [...]

GD String Comparison
In this section, we consider the fundamental problem of GD string comparison. Let R̂ and Ŝ be of total size N and M, respectively. We provide an O(N + M)-time algorithm in the standard word RAM model with word size w = Ω(log(N + M)) that works also for integer alphabets. Before presenting our efficient implementation, we observe that there is the following simple algorithm based on DFAs.
Each degenerate letter of R̂ and Ŝ can be represented by a trie, where its leaves are collapsed to a single one. For every two consecutive degenerate letters, the collapsed leaves of the former trie coincide with the root of the latter trie. An acyclic DFA is obtained in this way, as illustrated in Appendix A. We can perform the comparison of R̂ and Ŝ by intersecting their corresponding DFAs using BFS on their product DFA. The trivial upper bound on the number of reachable states is O(NM), but this can be improved to O(N + M) by exploiting the structure of the two input DFAs.
Each state in such a DFA has a unique level: the common length of paths from the initial state; and this structure is inherited by the product DFA. In other words, a level-i state in the product DFA corresponds to a pair of level-i states in the input DFAs. Observe that a level-i state in one DFA is uniquely represented by the label of the path from the root of its trie, and for a fixed DFA and level, these labels have uniform lengths. Considering the two states composing a reachable state in the product DFA, it is easy to see that the shorter label must be a suffix of the longer label. Hence, the state in the DFA with longer labels at level i uniquely determines the state in the DFA with shorter labels at level i. Consequently, the number of reachable level-i states in the product DFA is bounded by the number of level-i states in the input DFAs, and the size is O(N + M).
We observe that the cost of implementing the above ideas has an extra logarithmic factor due to state branching and, moreover, GD string comparison requires building the DFAs each time. We show how to obtain O(N + M) time for integer alphabets, without creating DFAs. We show that, even if the size of L(R̂) ∩ L(Ŝ) can be exponential in the total sizes of R̂ and Ŝ (Fact 9), the problem of GD string comparison, i.e. deciding whether L(R̂) ∩ L(Ŝ) is non-empty, can be solved in time linear with respect to the sum of the total sizes of the two GD strings (Theorem 17) and is thus of independent interest. We next show when it is possible to factorize L(R̂) ∩ L(Ŝ) into a Cartesian concatenation.

Lemma 10. Consider two GD strings
, that is a contradiction.
By applying Lemma 10 wherever R̂ and Ŝ have synchronized prefixes, we are then left with the problem of intersecting GD strings with no synchronized proper prefixes. We now define an alternative decomposition within such strings (see also Example 12).
Definition 11. Let R̂ and Ŝ be two GD strings of length r and s, respectively, with no synchronized proper prefixes. We define c-chain(R̂, Ŝ) [...] where chop_i denotes the set chop_{Â_i,B̂_i}, and (Â_0, B̂_0), (Â_1, B̂_1), . . . , (Â_q, B̂_q), pos(Â_i), pos(B̂_i) are recursively defined as follows: [...] The generation of pairs (Â_i, B̂_i) stops at i = q either if q = r + s − 2, or when chop_{q+1} = ∅, in which case R̂ and Ŝ only match until (Â_q, B̂_q). Intuitively, Â_i (respectively, B̂_i) represents suffixes of the current position of R̂ (respectively, of Ŝ), while pos(B̂_i) (respectively, pos(Â_i)) tells which position of R̂ (respectively, Ŝ) we are chopping.
Example 12 (Definition 11). Consider the following GD strings R̂ and Ŝ with no synchronized proper prefixes: [...] chop_0 is the first red set from the left, chop_1 is the first blue one, chop_2 is the second red one, etc. The c-chain(R̂, Ŝ) terminates when q = 7.
Definition 13. Let R̂ and Ŝ be two GD strings of length r and s, respectively, with w(R̂) = w(Ŝ) and no synchronized proper prefixes. We define G_{R̂,Ŝ} as a directed acyclic graph with a structure of up to r + s − 1 levels, each node being a set of strings, as follows, where we assume without loss of generality that w(R̂[0]) > w(Ŝ[0]):
Level k = 0: consists of a single node: n_0 = {y_0 . . . y_{q_0} ∈ R̂[0] with y_j ∈ chop_j ∀j : 0 ≤ j ≤ q_0}, where q_0 is the index of the rightmost chop containing suffixes of R̂[0].
Level k > 0: consists of ℓ = |chop_{q_{k−1}}| nodes. Assuming without loss of generality that level k − 1 has been built with suffixes of R̂[pos(Â_{q_{k−1}})], level k contains suffixes of a position of Ŝ. Let c_0, . . . , c_{ℓ−1} denote the elements of chop_{q_{k−1}}. Then, for 0 ≤ i ≤ ℓ − 1, the i-th node of level k is: [...] Every string in level k − 1 whose suffix is c_i is the source of an edge having the whole node n_i as a sink.
We define paths(G_{R̂,Ŝ}) as the set of strings spelled by a path in G_{R̂,Ŝ} that starts at n_0 and ends at the last level.
Note that the size of G_{R̂,Ŝ} is at most linear in the sum of the sizes of R̂ and Ŝ, as the nodes contain strings either in R̂ or in Ŝ with no duplications, and each node has out-degree equal to the number of strings it contains.
Example 14 (Definition 13). G_{R̂,Ŝ} for the GD strings R̂, Ŝ of Example 12 is: [...] q_0 = 2 and the strings in level 0 belong to (chop_0 ⊗ chop_1 ⊗ chop_2) ∩ R̂[0]. Level 1 contains suffixes of strings in B̂_2 (and of strings in B̂_3 as chop_3 = {A, T} and indeed q_1 = 3), level 2 suffixes of strings in Â_3 (as q_2 = 5), level 3 suffixes of strings in B̂_5 (q_3 = 6), level 4 suffixes of strings in Â_6 (q_4 = 7). The three paths from level 0 to level 4 correspond to the three strings in L(R̂) ∩ L(Ŝ): AGCCGAATCTCG, AAGTCAATCTCG, AAGTCTAGCTCG.
[...] reached by an edge leaving a suffix of z′. By inductive hypothesis z′ [...] and, again by Definition 13, z″ ∈ chop [...] that can be written as u = u′u″ with u′ the prefix of u having length [...], then there is an edge linking a suffix of u′ at level k − 1 with a node at level [...]
As a special case of Lemma 15, if L(Ŝ) ∩ L(R̂) ≠ ∅, then G_{R̂,Ŝ} is built up to the last level and the following holds.
Theorem 16. Let R̂, Ŝ be two GD strings having lengths, respectively, r and s, with w(R̂) = w(Ŝ) and no synchronized proper prefixes. Then G_{R̂,Ŝ} has exactly r + s − 1 levels, and we have that paths(G_{R̂,Ŝ}) = L(Ŝ) ∩ L(R̂).
G_{R̂,Ŝ} is thus a linear-sized representation of the possibly exponential-sized (Fact 9) set L(Ŝ) ∩ L(R̂).
We now show an O(N + M)-time algorithm for the standard word RAM model, denoted by GDSC, that decides whether L(R̂) and L(Ŝ) share at least one string (returns 1) or not (returns 0). GDSC starts with constructing the generalized suffix tree T_{R̂,Ŝ} of all the strings in R̂ and Ŝ. Then it scans R̂ and Ŝ starting with R̂[0] and Ŝ[0], storing in chop_{R̂,Ŝ} the latest chop_i and in active_{R̂,Ŝ} the latest active_{Â_i,B̂_i} using T_{R̂,Ŝ}. For an efficient implementation, suffixes in active_{R̂,Ŝ} are stored (e.g. for active_{Â_0,B̂_0} assuming that w(R̂[0]) > w(Ŝ[0])) as index positions of R̂[0] and the starting position of the suffix as active_{R̂,Ŝ}.suff. The next comparison is made between the corresponding suffixes of R̂[0] of length w(R̂[0]) − active_{R̂,Ŝ}.suff and Ŝ[1], identifying first the minimum length of the two, and proceeding with the same process. The comparison of letters can be: (i) between R̂[i] and Ŝ[j]; or (ii) between the corresponding strings of active_{R̂,Ŝ}.index and R̂[i]; or (iii) between the corresponding strings of active_{R̂,Ŝ}.index and Ŝ[j]. If the two GD strings have a synchronized proper prefix, this will result in active_{R̂,Ŝ} = ∅ at positions i in R̂ and j in Ŝ. At this point, the comparison is restarted with the immediately following pair of degenerate letters.
Theorem 17. Algorithm GDSC is correct. Given two GD strings R̂ and Ŝ of total sizes N and M, respectively, over an integer alphabet, algorithm GDSC requires O(N + M) time.
Proof. The correctness follows directly from Lemma 10, Lemma 15, and Theorem 16.
Constructing the generalized suffix tree T_{R̂,Ŝ} can be done in time O(N + M) [12]. For the sets pair (Â_i, B̂_i) as in Definition 11, such that w(Â_i) = k and w(Â_i) ≤ w(B̂_i), we query T_{R̂,Ŝ} with the k-length prefixes of strings in B̂_i. For integer alphabets, instead of spelling the strings from the root of T_{R̂,Ŝ}, we locate the corresponding terminal nodes for (Â_i, B̂_i). It then suffices to find longest common prefixes between these suffixes to simulate the querying process. Since all suffixes are lexicographically sorted during the construction of T_{R̂,Ŝ}, we can also have the suffixes considered by pair (Â_i, B̂_i) lexicographically ranked with respect to (Â_i, B̂_i). Hence we do not perform the longest common prefix operation for all possible suffix pairs, but only for the lexicographically adjacent ones within this group. This can be done in O(1) time per pair after O(N + M)-time pre-processing over T_{R̂,Ŝ} [7]. chop_i is thus populated with the k-length prefixes of strings in B̂_i found in Â_i. The set active_{Â_i,B̂_i} of active suffixes can be found by chopping the suffixes of the strings in B̂_i from their prefixes successfully queried in T_{R̂,Ŝ}. This requires time [...]
Let R̂ and Ŝ be of length r and s, respectively. Assume that R̂ and Ŝ have no synchronized proper prefixes. Then Theorem 16 ensures that the total number of comparisons cannot exceed r + s − 2: this results in a time complexity of O(N + M). If R̂ and Ŝ have synchronized proper prefixes, we perform the comparison up to the shortest synchronized prefixes (i.e. the set of active suffixes becomes empty) and then restart the procedure from the immediately following pair of degenerate letters. Clearly the total number of comparisons also in this case cannot be more than r + s − 2.
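The scanning scheme of GDSC can be illustrated with a simplified Python sketch that keeps, at every step, the current letter (or the current set of active suffixes) on each side, intersects their prefixes of the shorter width to obtain the chop, and carries the remaining suffixes forward. This sketch replaces the generalized suffix tree with hash sets and string slicing, so it does not achieve the O(N + M) bound of Theorem 17; the function names and the list-of-sets representation are assumptions of the sketch.

```python
def gd_width(gd):
    """Total width W of a GD string given as a list of non-empty sets of strings."""
    return sum(len(next(iter(letter))) for letter in gd)


def gdsc_sketch(R, S):
    """Return True iff L(R) and L(S) share at least one string.

    Simplified illustration of the chop / active-suffix scan; set-based,
    not linear-time, and without the suffix tree used by the real GDSC.
    """
    if gd_width(R) != gd_width(S):
        return False
    i = j = 0
    a, b = set(), set()  # pending active suffixes (at most one side is non-empty)
    while i < len(R) or j < len(S) or a or b:
        if not a:  # R side is synchronized: load its next degenerate letter
            a = set(R[i])
            i += 1
        if not b:  # S side is synchronized: load its next degenerate letter
            b = set(S[j])
            j += 1
        h = min(len(next(iter(a))), len(next(iter(b))))
        chop = {x[:h] for x in a} & {y[:h] for y in b}
        if not chop:
            return False  # no string can be extended consistently
        # Keep the non-empty remainders of the strings whose prefix was matched.
        a = {x[h:] for x in a if len(x) > h and x[:h] in chop}
        b = {y[h:] for y in b if len(y) > h and y[:h] in chop}
    return True


if __name__ == "__main__":
    R = [{"ACG", "ATG"}, {"T"}]
    S = [{"A"}, {"CG", "TT"}, {"T", "A"}]
    print(gdsc_sketch(R, S))  # the two languages share ACGT
```

Tracking only the set of remaining suffixes is enough for the decision problem because the choices made in different degenerate letters are independent of each other.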

Computing Palindromes in GD Strings
Armed with the efficient GD string comparison tool, we shift our focus to our initial motivation, namely, computing palindromes in GD strings.

Definition 18. A GD string Ŝ is a GD palindrome if there exists a string in L(Ŝ) that is a palindrome.
[...]), can be encoded as a pair (c, r), where its [...] other GD palindrome (c, r′) exists in Ŝ with r′ > r. Note that we only consider the GD palindromes Ŝ[i] . . . Ŝ[j] that start with the first letter of some string X ∈ Ŝ[i] and end with the last letter of some string Y ∈ Ŝ[j], while the center can be anywhere: in between or inside degenerate letters. That is, in Ŝ there are 2 · w(Ŝ) − 1 = 2W − 1 possible centers.
In this section, we consider the following problem. Given a GD string Ŝ of length n, total size N, and total width W, find all GD strings Ŝ[i] . . . Ŝ[j], with 0 ≤ i ≤ j ≤ n − 1, that are GD palindromes. We give two alternative algorithms: one finds all GD palindromes seeking them for all (i, j) pairs; and the other one finds them starting from all possible centers. The two algorithms have different time complexities: which one is faster depends on W, N, and n. In fact, they compute all GD palindromes, but report only the maximal ones.
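For very small inputs, the problem statement can be checked against a brute-force reference that follows Definition 18 literally: enumerate every string spelled by Ŝ[i] . . . Ŝ[j] and test whether one of them is a palindrome. This is exponential in general and is not one of the two algorithms of the paper; the function name and the list-of-sets representation are assumptions of this sketch.

```python
from itertools import product


def gd_palindrome_pairs(gd):
    """Return all (i, j), i <= j, such that gd[i..j] spells at least one palindrome."""
    pairs = []
    for i in range(len(gd)):
        for j in range(i, len(gd)):
            for choice in product(*gd[i:j + 1]):
                s = "".join(choice)
                if s == s[::-1]:
                    pairs.append((i, j))
                    break  # one palindrome is enough for this pair
    return pairs


if __name__ == "__main__":
    gd = [{"AB", "CA"}, {"A"}, {"BA", "CC"}]
    print(gd_palindrome_pairs(gd))
```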
We first describe algorithm MaxPalPairs. For all i, j positions within Ŝ, in order to check whether Ŝ[i] . . . Ŝ[j] is a GD palindrome, we apply the GDSC algorithm to Ŝ[i] . . . Ŝ[j] and its reverse, denoted by rev(Ŝ[i] . . . Ŝ[j]); the reverse is defined by reversing the sequence of degenerate letters and also reversing the strings in every degenerate letter. GD palindromes are, finally, sorted per center, and the maximal GD palindromes are reported. Sorting the (i, j) pairs by their centers can be done in O(W) time using bucket sort, which is bounded by O(N) since N ≥ W.
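The reverse operation used by MaxPalPairs is simple to state in code. The sketch below, with an assumed list-of-sets representation and illustrative function names, reverses the order of the degenerate letters and every string inside them, and checks on a tiny example that the language of the reverse is exactly the set of reversed strings.

```python
from itertools import product


def rev(gd):
    """Reverse the sequence of degenerate letters and each string inside them."""
    return [{s[::-1] for s in letter} for letter in reversed(gd)]


def language(gd):
    """All strings spelled by a GD string; only meant for tiny examples."""
    return {"".join(choice) for choice in product(*gd)}


if __name__ == "__main__":
    gd = [{"AB", "CA"}, {"G"}]
    print(rev(gd))
    print(language(rev(gd)) == {s[::-1] for s in language(gd)})
```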
Since there are O(n^2) pairs (i, j), and since by Theorem 17 algorithm GDSC takes time proportional to the total size of Ŝ[i] . . . Ŝ[j] to check whether Ŝ[i] . . . Ŝ[j] is a GD palindrome, algorithm MaxPalPairs takes O(n^2 N) time in total.
In algorithm MaxPalCenters, we consider all possible centers c of Ŝ. In the case when c is in between two degenerate letters we simply try to extend to the left and to the right via applying GDSC. In the case when c is inside a degenerate letter we intuitively split the letter vertically into two letters and try to extend to the left and to the right via applying GDSC. At each extension step of this procedure we maintain two GD strings L̂ (left of the center) and R̂ (right of the center) such that they are of the same total width. We consider the reverse of L̂ (similar to algorithm MaxPalPairs) for the comparison. In the case where c occurs inside a degenerate letter, to make sure we do not identify palindromes which do not exist, for all j split strings of the degenerate letter, we check that [...] where L̂_R = rev(L̂) and k = min(w(L̂_R[0]), w(R̂[0])). If no matches are found, we move on to the next center. Otherwise, when a match is found, we update rev(L̂) and R̂ with the remainder of the split degenerate letter (if its length is greater than k), as well as the next degenerate letters. Algorithm GDSC is applied to compare rev(L̂) and R̂. After a positive comparison, we overwrite L̂ and R̂ by adding the degenerate letters of the current extension until w(L̂) = w(R̂) (or until the end of the string is reached). This process is repeated as long as GDSC returns a positive comparison, that is, until the maximal GD palindrome with center c is found. The radius reported is then the total sum of all values of w(L̂). If GDSC returns a negative comparison at center c, we proceed with the next center, because we clearly cannot have a GD palindrome centered at c extended further if rev(L̂) ∩ R̂ is empty. By Theorem 17 and the fact that there are 2W − 1 possible centers, we have that algorithm MaxPalCenters takes O(WN) time in total. We obtain the following result.
Theorem 20. Given a GD string of length n, total size N, and total width W, over an integer alphabet, all (maximal) GD palindromes can be computed in time O(min{W, n^2}N).
A problem that gained significant attention recently is the factorization of a string X of length n into a sequence of palindromes [3,13,30,9,5,2]. We say that X_1, X_2, . . . , X_ℓ is a (maximal) palindromic factorization of string X, if every X_i is a (maximal) palindrome, X = X_1 X_2 . . . X_ℓ, and ℓ is minimal. In biological applications we need to factorize a sequence into palindromes in order to identify hairpins, patterns that occur in single-stranded DNA or, more commonly, in RNA. Next, we define and solve the same problem for GD strings.
Definition 21. A (maximal) GD palindromic factorization of a GD string Ŝ is a sequence P̂_1, . . . , P̂_ℓ of GD strings, such that: (i) every P̂_i is either a (maximal) GD palindrome or a degenerate letter of Ŝ; (ii) Ŝ = P̂_1 . . . P̂_ℓ; (iii) ℓ is minimal.
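Given the set of GD palindromes as (i, j) pairs, a factorization in the sense of Definition 21 can be obtained as a shortest path over positions 0, . . . , n. The graph used in the following Python sketch is an assumed reading: one edge from i to j + 1 for every GD palindrome Ŝ[i] . . . Ŝ[j], plus one edge from i to i + 1 for every single degenerate letter. Since all edges have unit weight, breadth-first search suffices; the function name is illustrative.

```python
from collections import deque


def gd_palindromic_factorization(n, palindromes):
    """Return a minimum-length list of (i, j) factors covering positions 0..n-1.

    palindromes: iterable of (i, j) pairs that are GD palindromes.
    A single degenerate letter (i, i) is always allowed as a factor.
    """
    successors = [{i + 1} for i in range(n)]
    for i, j in palindromes:
        successors[i].add(j + 1)
    parent = {0: None}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        if u == n:
            break  # BFS reaches n first along a path with the fewest edges
        for v in sorted(successors[u], reverse=True):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    factors = []
    v = n
    while parent[v] is not None:  # walk back from n to 0
        u = parent[v]
        factors.append((u, v - 1))
        v = u
    return factors[::-1]


if __name__ == "__main__":
    print(gd_palindromic_factorization(4, [(0, 1), (1, 2), (2, 3)]))
```

Passing only the maximal GD palindromes gives the maximal variant of the factorization under the same assumptions.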
After locating all (maximal) GD palindromes in Ŝ using Theorem 20, we are in a position to amend the algorithm of Alatabbi et al. [3] to find a (maximal) GD palindromic factorization of Ŝ. We define a directed graph G_Ŝ = (V, E) [...] Note that V contains a node n being the sink of edges representing (maximal) GD palindromes ending at Ŝ[n − 1]. For maximal GD palindromes, E contains no more than 3W edges, as the maximum number of maximal GD palindromes is 2W − 1. For GD palindromes, E contains O(n^2) edges, as the maximum number of GD palindromes is O(n^2). A shortest path in G_Ŝ from 0 to n gives a (maximal) GD palindromic factorization. For maximal GD palindromes, the size of G_Ŝ is O(W), as n ≤ W, and so finding this shortest path requires O(W) time using a standard algorithm. For GD palindromes, the size of G_Ŝ, and thus the time, is O(n^2).

[...] and H were replaced by degenerate letters according to IUPAC [23]. Each other letter, c ∈ {A, C, G, U}, was treated as a single degenerate letter {c}. An average of 47% of the total number of positions within the 5 sequences consisted of one of the following: X, S, T, Y, Z, R and H. We then used algorithm MaxPalPairs to find all maximal palindromes in the 5 sequences.

We construct the DFA for R̂ and the DFA for Ŝ. We observe that computing the product DFA is alphabet-dependent, due to branching (transition function) on the same letter in the states of the two input DFAs.

Definition 1. A generalised degenerate string (GD string) Ŝ = Ŝ[0] Ŝ[1] . . . Ŝ[n − 1] of length n over an alphabet Σ is a finite sequence of n degenerate letters. Every degenerate letter Ŝ[i] of width k_i > 0, denoted also by w(Ŝ[i]), is a finite non-empty set of strings Ŝ[i][j] ∈ Σ^{k_i}, with 0 ≤ j < |Ŝ[i]|. For any GD string Ŝ, we denote by Ŝ[i] . . . Ŝ[j] the GD substring of Ŝ that starts at position i and ends at position j.
Definition 2. The total size N and total width W, denoted also by w(Ŝ), of a GD string Ŝ are respectively defined as N = Σ_{i=0}^{n−1} |Ŝ[i]| · k_i and W = Σ_{i=0}^{n−1} k_i.
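A GD string can be modelled in Python as a list of non-empty sets of equal-length strings. The sketch below computes its length n, total size N and total width W, assuming that N counts every letter of every string (that is, N is the sum over i of |Ŝ[i]| · k_i) and that W is the sum of the widths k_i; the function name is illustrative.

```python
def gd_stats(gd):
    """Return (n, N, W) for a GD string given as a list of sets of strings."""
    widths = []
    for letter in gd:
        lengths = {len(s) for s in letter}
        # Every degenerate letter must be non-empty, with one common width k_i > 0.
        if len(lengths) != 1 or 0 in lengths:
            raise ValueError("each degenerate letter needs strings of one positive length")
        widths.append(lengths.pop())
    n = len(gd)
    total_size = sum(len(letter) * k for letter, k in zip(gd, widths))
    total_width = sum(widths)
    return n, total_size, total_width


if __name__ == "__main__":
    gd = [{"A"}, {"CA", "GT", "TT"}, {"G", "C"}]
    print(gd_stats(gd))
```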

Fact 9.
Given two GD strings R̂ and Ŝ, L(Ŝ) ∩ L(R̂) can have size exponential in the total sizes of R̂ and Ŝ.
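One simple family that exhibits this behaviour, chosen here purely for illustration, takes both GD strings to be n copies of the degenerate letter {A, C}: each has total size 2n, while the two languages share all 2^n binary strings. The brute-force Python sketch below confirms the count for small n; the helper names are assumptions of the sketch.

```python
from itertools import product


def language(gd):
    """All strings spelled by a GD string; exponential, for small n only."""
    return {"".join(choice) for choice in product(*gd)}


def binary_gd(n):
    """n degenerate letters, each equal to {A, C}; total size 2n."""
    return [{"A", "C"} for _ in range(n)]


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        common = language(binary_gd(n)) & language(binary_gd(n))
        print(n, 2 * n, len(common))  # n, total size of each GD string, |intersection|
```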

Lemma 15.
Let G^k_{R̂,Ŝ} be G_{R̂,Ŝ} truncated at level k, and let |G^k_{R̂,Ŝ}| be the length of the strings it spells. Let L_k(Ŝ) denote the set of prefixes of length |G^k_{R̂,Ŝ}| of L(Ŝ). Let R̂, Ŝ be two GD strings with w(R̂) = w(Ŝ) = W and no synchronized proper prefixes. Then L [...]

Proof.
Again, let us assume without loss of generality that w(R̂[0]) > w(Ŝ[0]). We prove the result by induction on k. [Level k = 0] By construction, n_0 contains strings in R̂[0] ∩ (chop_0 ⊗ · · · ⊗ chop_{q_0}), which have length |G^0_{R̂,Ŝ}|, and are also in Ŝ[0], and hence belong to both L_0(Ŝ) and L_0(R̂). [Level k > 0] By inductive hypothesis, we have that [...]: by Definition 13, any z ∈ paths(G^k_{R̂,Ŝ}) can be written as z = z′z″ with z′ in paths(G^{k−1}_{R̂,Ŝ}) and with z″ that belongs to some node at level k of G^k_{R̂,Ŝ}

Example 25.
We illustrate here a simple automata-based approach. Say we want to compare the following two GD strings: [...]

Their product DFA gives their intersection: ACACAAC and CCCACCC.

Table 1
Coordinates of (maximal) palindromes identified within hypervariable regions I and II.

Table 1 shows the palindromes identified within hypervariable regions I and II. Our results are in accordance with Wuilmart et al. [34] who presented a statistical (fundamentally different) method to identify the location of palindromes within regions of immunoglobulin genes. The ranges we report are greater than or equal to the ones of [34] due to the maximality criterion.

30 Mikhail Rubinchik and Arseny M. Shur. Eertree: An efficient data structure for processing palindromes in strings. In IWOCA, volume 9538 of LNCS, pages 321-333. Springer International Publishing, 2016.
31 Randall T. Schuh. Major patterns in vertebrate evolution. Systematic Biology, 27(2):172, 1978.
32 Henry Soldano, Alain Viari, and Marc Champesme. Searching for flexible repeated patterns using a non-transitive similarity relation. Pattern Recognition Letters, 16(3):233-246, 1995.
33 Ryan Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theor. Comput. Sci., 348(2-3):357-365, 2005.
34 C. Wuilmart, J. Urbain, and D. Givol. On the location of palindromes in immunoglobulin genes. Proceedings of the National Academy of Sciences of the United States of America, 74(6):2526-2530, 1977.