Approximate Cover of Strings

Regularities in strings arise in various areas of science, including coding and automata theory, formal language theory, combinatorics, molecular biology and many others. A common notion to describe regularity in a string T is a cover, which is a string C for which every letter of T lies within some occurrence of C. The alignment of the cover repetitions in the given text is called a tiling. In many applications finding exact repetitions is not sufficient, due to the presence of errors. In this paper, we use a new approach for handling errors in coverable phenomena and define the approximate cover problem (ACP), in which we are given a text that is a sequence of some cover repetitions with possible mismatch errors, and we seek a string that covers the text with the minimum number of errors. We first show that the ACP is NP-hard, by studying the cover-size relaxation of the ACP, in which the requested size of the approximate cover is also given with the input string. We show this relaxation is already NP-hard. We also study another two relaxations of the ACP, which we call the partial-tiling relaxation of the ACP and the full-tiling relaxation of the ACP, in which a tiling of the requested cover is also given with the input string. A given full tiling retains all the occurrences of the cover before the errors, while in a partial tiling there can be additional occurrences of the cover that are not marked by the tiling. We show that the partial-tiling relaxation has a polynomial time complexity and give experimental evidence that the full-tiling also has polynomial time complexity. The study of these relaxations, besides shedding another light on the complexity of the ACP, also involves a deep understanding of the properties of covers, yielding some key lemmas and observations that may be helpful for a future study of regularities in the presence of errors. 
1998 ACM Subject Classification F.2.2 Nonnumerical Algorithms and Problems, G.2.1 Combinatorics, G.4 Mathematical Software, I.5.2 Design Methodology


Introduction
Regularities in strings arise in various areas of science, including coding and automata theory, formal language theory, combinatorics, molecular biology and many others. A typical form of regularity is periodicity, meaning that a "long" string T can be represented as a concatenation of copies of a "short" string P, possibly ending in a prefix of P. Periodicity has been extensively studied in Computer Science over the years (see [26]). In this paper, we define the approximate cover problem (ACP), in which we are given a text that is a sequence of some cover repetitions with possible mismatch errors, and we seek a string that covers the text with the minimum number of errors. The alignment of the cover repetitions in the given text is called a tiling. We prove that the ACP is NP-hard by studying a relaxation of this problem, which we call the cover-size relaxation of the ACP. In this relaxation the requested size of the approximate cover is also given with the input string. We prove that this relaxation is already NP-hard, thus proving the NP-hardness of the ACP.
We also study two other relaxations of the problem, which we call the partial-tiling relaxation of the ACP and the full-tiling relaxation of the ACP. In these relaxations a tiling of the requested cover is also given, and we seek a string such that, when the given tiling is used to align it with the given text, the number of mismatches is minimized. The full tiling retains all the occurrences of the cover before the errors, while in the partial tiling there can be additional occurrences of the cover that are not marked by the tiling. We examine these relaxations, show that the partial-tiling relaxation has polynomial time complexity, and give experimental evidence that the full-tiling relaxation also has polynomial time complexity. The study of these relaxations, besides shedding more light on the complexity of the ACP, also involves a deep understanding of the properties of covers and seeds, yielding some key lemmas and observations (such as [2]) that may be helpful for a future study of regularities in the presence of errors.

Paper Contributions. The main contributions of this paper are:
Proving that the ACP is NP-hard.
Formalizing the partial-tiling relaxation of the ACP and proving it is polynomial-time computable.
Formalizing the full-tiling relaxation of the ACP and suggesting a polynomial-time algorithm for its computation, while giving experimental evidence for the correctness of this algorithm.
The paper is organized as follows. In Section 2, we give formal definitions. In Section 3, we study the cover-size relaxation of the ACP and prove the NP-hardness of the ACP. In Section 4, we study the partial-tiling relaxation of the ACP and show it is polynomial-time computable. In Section 5, we study the full-tiling relaxation of the ACP, suggest a polynomial-time algorithm for this problem and experimentally test its correctness. We conclude with some open problems in Section 6.

Preliminaries
In this section we give the needed formal definitions.

Definition 2 (Tiling).
Let T be a string over alphabet Σ such that the string C over alphabet Σ is a cover of T. Then, the sorted list of indices representing the start positions of occurrences of the cover C in the text T is called the tiling of C in T.
In this paper we have a text T that may contain errors and, therefore, is not coverable. However, we would like to refer to a retained tiling of an unknown string C in T, although C does not cover T because of mismatch positions. Definition 3 makes a distinction between a list of indices that may be assumed to be a tiling of the text before mismatch errors occurred and a list of indices that cannot be such a tiling.
Note that S(C) is not uniquely defined even for a fixed n > m, since every different valid tiling of the m-length string C generates a different n-length string S(C). A unique version can be obtained if a valid tiling L is also given.
Notation 2. Let T be an n-length string over alphabet Σ and let L be a valid tiling of T. Let m = n + 1 − L_last, where L_last is the last index in the tiling L. For any m-length string C', let S_L(C') be the n-length string obtained using C' as a cover and L as the tiling as follows: S_L(C') begins with a copy of C', and for each index i in L a new copy of C' is concatenated starting from index i of S_L(C') (possibly running over a suffix of the last copy of C').
Definition 4. Let T be a string of length n over alphabet Σ. Let H be the Hamming distance. The distance of T from being covered is:

$$dist = \min_{|C| < n,\; S(C) \in \Sigma^n} H(S(C), T).$$
We will also refer to dist as the number of errors in T .
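As an illustration of Notation 2 and of the Hamming distance used in Definition 4, the following minimal Python sketch builds S_L(C') by writing a copy of the cover at every tiling index. The function names build_covered_string and hamming are hypothetical and do not come from the paper; tiling indices are 1-based, as in the text, and the tiling is assumed to be sorted.

```python
def build_covered_string(cover, tiling, n):
    """Return S_L(C'): write `cover` at every 1-based start index in `tiling`.

    Later copies run over the overlapping suffix of earlier copies,
    as described in Notation 2.
    """
    m = len(cover)
    if m != n + 1 - tiling[-1]:
        raise ValueError("cover length must equal n + 1 - L_last")
    out = [None] * n
    for start in tiling:
        if start < 1 or start - 1 + m > n:
            raise ValueError("tiling index out of range")
        out[start - 1:start - 1 + m] = cover  # slice assignment, char by char
    if any(ch is None for ch in out):
        raise ValueError("tiling leaves a gap, so it is not a valid tiling")
    return "".join(out)


def hamming(a, b):
    """Hamming distance H between two strings of equal length."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))
```

For example, with C' = aba, L = (1, 3, 5) and n = 7, the sketch produces abababa, and a text abbbaba is at Hamming distance 1 from it.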

NP-Hardness of the ACP
In this section we prove the NP-hardness of the ACP. To this end, we study a variant of the problem where m, the length of a requested approximate cover, is also given together with the input string T, and we are requested to find a string C of length m that is an m-length approximate cover of T, i.e., C covers T with the minimum number of errors over all strings of length m. We call this problem the cover-size relaxation of the ACP. Clearly, if the cover-size relaxation of the ACP is already NP-hard, then so is the ACP.
Our hardness proof uses a reduction from the 3-SAT problem, in which the input is a logical formula ϕ on N variables in 3-CNF (each clause has exactly three literals), and we need to decide whether ϕ is satisfiable or not. The NP-hardness of 3-SAT is well known (see, e.g., [11]).

The Reduction from 3-SAT
Let ϕ be a given 3-CNF formula on N variables, x_1, ..., x_N. Assume without loss of generality that the literals in each clause are sorted by the index of their variables. We need to define a text T of length n over an alphabet Σ and to specify the size m of the requested approximate cover. We will then show that ϕ is satisfiable if and only if T has an m-approximate cover with at most some specified number of errors to be defined.
We begin by defining the alphabet Σ to include all the variables and their negations, together with 4 additional dummy variables, x_0, x_{−1}, x_{N+1}, x_{N+2}, and also a special padding character p. The definition of the text T has two parts: a header and a body, where the body of T is defined according to the clauses of the given logical formula ϕ, and the header preceding this body imposes a structure on an m-approximate cover for T.
The definition of the body of T follows directly from the formula ϕ. For each clause j we add to the body of T the substring L^j_1 L^j_2 L^j_3 of its three literals, preceded and followed by a padding of 2N + 14 occurrences of the character p. The role of this padding is to avoid overlaps between occurrences of an approximate cover covering substrings originating from different clauses. The header is composed of (N + 3) copies of a fixed string, in which each padding contains N + 7 occurrences of p.
We define the size of the requested approximate cover m to be 3N + 18. Note that the sizes of T and m, as well as their construction, are polynomial in N and in the number of clauses. Lemma 8 assures the correctness of the reduction.

Lemma 8. ϕ is satisfiable if and only if T has an m-approximate cover with at most
We have, therefore, proven Theorem 9.
Theorem 9. ACP is NP-hard.

The Partial-Tiling Relaxation of the ACP
In this section we study another relaxation of the approximate cover problem: the partial-tiling relaxation, in which we are given a retained tiling of the cover before the errors occurred together with the input string itself. In order to formally define the relaxation we need Definitions 10 and 11. We describe an algorithm for the partial-tiling relaxation of the approximate cover problem in two parts. We first describe the mandatory part of the algorithm, which we call the Histogram Greedy Algorithm. This algorithm does the main work in finding an approximate cover subject to the tiling L. It returns a candidate for the final L-approximate cover to be output. This candidate is legal if it is primitive, and illegal otherwise. We then describe the second part, which we call the Partial-Tiling Primitivity Coercion. In this part, the legality of the candidate is checked and, if needed, the candidate is corrected in order to enforce the primitivity requirement.

The Histogram Greedy Algorithm
This part of the algorithm performs the following steps given the text T and the valid tiling L:
1. Find m, the length of an approximate cover subject to the tiling L, by computing the difference between n + 1 and the last index in the tiling L, L_last, which indicates the last occurrence of the cover in T.
2. Compute the m-length mask M of an approximate cover, by initializing M to zeroes, setting M[1] = 1, then reading the tiling L from beginning to end and for each i
3. Compute the m-long string V_C of variables from an auxiliary alphabet Σ_V. First, we initialize the m-long string V_C to v_1 v_2 ... v_m. Then, we read the mask M from end to beginning, and for every j such that M[j] = 1, we update the string V_C by equalizing the substrings V_C[1..m − j + 1] and V_C[j..m]. In the equalization process, when we obtain an equation v_k = v_ℓ for k < ℓ, we replace both letters by v_k. The resulting string V_C represents C in the sense that it carries the information on equalities imposed by the mask M between indices of C.
4. Compute the n-long string V_T of variables from the auxiliary alphabet Σ_V, which is a string covered by V_C according to the tiling L of T. V_T is computed using the tiling L and V_C as follows: it begins with a copy of V_C, and for each index i in L a new copy of V_C is concatenated starting from index i of V_T (possibly running over a suffix of the last copy of V_C).
5. Compute the histogram Hist_{V_C,Σ} using the alignment of T with V_T, counting for each variable V ∈ V_C and each σ ∈ Σ the number of indices i for which V_T[i] = V and T[i] = σ.
6. Compute an L-approximate cover candidate C greedily according to the histogram Hist_{V_C,Σ}, as follows: for every index 1 ≤ i ≤ m, C[i] = argmax_{σ ∈ Σ} Hist_{V_C,Σ}[V_C[i], σ], i.e., for each index in C we choose the alphabet symbol that minimizes the number of mismatch errors between S_L(C) and T in the relevant indices according to the tiling L.
The algorithm outputs the m-length string C from its last step and the histogram table Hist_{V_C,Σ}. Lemma 13 describes a property of the output C returned by the Histogram Greedy algorithm, and immediately follows from the greedy criterion used in step 6 of the algorithm. Lemma 14 describes the time complexity of the algorithm.

Lemma 13. Let C be the output of the Histogram Greedy algorithm. Then, $H(T, S_L(C)) = \min_{C' \in \Sigma^m} H(T, S_L(C'))$.
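The steps of the Histogram Greedy algorithm can be illustrated by the following minimal Python sketch. It is not the authors' implementation: instead of the mask M and the variable strings V_C and V_T, it groups cover positions into equivalence classes with a union-find structure, under the assumption that the equalities in question are exactly those forced by overlapping occurrences in the tiling. The name histogram_greedy is hypothetical, tiling indices are 1-based, and the running time of this sketch is O(|L| · m), which is not claimed to match the paper's bound.

```python
from collections import Counter, defaultdict


def histogram_greedy(text, tiling):
    """Return (cover, errors) for a text and a 1-based valid tiling."""
    n = len(text)
    if not tiling or tiling[0] != 1:
        raise ValueError("a valid tiling must start at index 1")
    m = n + 1 - tiling[-1]
    parent = list(range(m))  # union-find over cover positions 0..m-1

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # first_pos[i]: the first cover position seen at text position i.
    first_pos = [None] * n
    for start in tiling:
        for j in range(m):
            i = start - 1 + j
            if first_pos[i] is None:
                first_pos[i] = j
            else:
                # Two occurrences overlap at text position i, so the two
                # cover positions must hold the same symbol.
                parent[find(j)] = find(first_pos[i])
    if any(p is None for p in first_pos):
        raise ValueError("tiling leaves a gap, so it is not a valid tiling")

    # Histogram: for each class of cover positions, count text symbols.
    hist = defaultdict(Counter)
    for i, symbol in enumerate(text):
        hist[find(first_pos[i])][symbol] += 1

    # Greedy choice: the most frequent symbol of each class.
    cover = "".join(hist[find(j)].most_common(1)[0][0] for j in range(m))
    errors = sum(sum(c.values()) - max(c.values()) for c in hist.values())
    return cover, errors
```

For the text abbbaba with tiling (1, 3, 5), the sketch returns the cover aba with one error. For the text ababababab with tiling (1, 5) it returns ababab, which is not primitive.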

Lemma 14. The time complexity of the Histogram Greedy algorithm is:
Despite Lemma 13, the output C of the Histogram Greedy algorithm might not be an L-approximate cover of T, because it might not be primitive, as the following example shows.
Example: Assume that V_C = XYZWXY and Σ = {a, b}, and that the histogram Hist_{V_C,Σ} computed by the algorithm is such that the Histogram Greedy algorithm chooses X = a, Y = b, Z = a, W = b, and outputs C = ababab. This string cannot be considered a legal cover since it is not primitive, i.e., C itself can be covered by the shorter string ab. However, the partial L-approximate cover can have a tiling L' such that L ⊆ L', which is exactly the case with ab. Therefore, ab should be returned as the partial L-approximate cover of T. The Partial-Tiling Primitivity Coercion algorithm described in Subsection 4.2 is responsible for checking the legality of the output string received from the Histogram Greedy algorithm and returning a partial L-approximate cover.
Note that the input tiling L requires an m-length string as an output. Therefore, the (primitive) 2-length approximate cover ab is precluded as an L-approximate cover. Assuming that the input tiling L is the retained tiling of the cover of the original text before the errors occurred, such a case means that, though ab is a string covering T subject to a partial tiling L with the least number of errors, it does not cover T with L as a full tiling. In this sense, L is evidence that the original cover is longer than ab and that more errors actually happened. Section 5 is devoted to finding an L-approximate cover.

The Partial-Tiling Primitivity Coercion Algorithm
This part of the algorithm gets as input the string C returned by the Histogram Greedy algorithm and performs the following steps:
1. Check the primitivity of C (using the linear-time algorithm of [7]). If C is primitive, return C.
2. Else, return the primitive cover C' of C (found using the linear-time algorithm of [7] in the first step).
The time complexity of the Partial-Tiling Primitivity Coercion algorithm is immediate from the linear-time complexity of the algorithm in [7]. Thus, we get Lemma 15 and Theorem 16.

CPM 2017
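The two steps of the Partial-Tiling Primitivity Coercion can be illustrated by the following minimal Python sketch. It deliberately uses a simple quadratic-time search over prefixes rather than the linear-time algorithm of [7]; the names covers, shortest_cover and is_primitive are hypothetical. Here "primitive" is taken in the sense of the example above: a string that cannot be covered by a shorter string.

```python
def covers(c, s):
    """True if every position of s lies within some occurrence of c."""
    if not c:
        return False
    covered_up_to = 0  # s[:covered_up_to] is known to be covered
    pos = s.find(c)
    while pos != -1:
        if pos > covered_up_to:
            return False  # a gap before this occurrence
        covered_up_to = pos + len(c)
        pos = s.find(c, pos + 1)  # overlapping occurrences are allowed
    return covered_up_to == len(s)


def shortest_cover(s):
    """Return the shortest prefix of s that covers s (s itself if none is shorter)."""
    for k in range(1, len(s) + 1):
        if covers(s[:k], s):
            return s[:k]
    return s


def is_primitive(s):
    """True if s cannot be covered by a string shorter than itself."""
    return shortest_cover(s) == s


def partial_tiling_coercion(candidate):
    """Step 1: keep a primitive candidate. Step 2: else return its shortest cover."""
    return candidate if is_primitive(candidate) else shortest_cover(candidate)
```

On the candidate ababab from the example, partial_tiling_coercion returns ab, while a primitive candidate such as abaab is returned unchanged.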

The Full-Tiling Relaxation of the ACP
In this section we study another relaxation of the approximate cover problem: the full-tiling relaxation, in which we are given a retained tiling of the cover before the errors occurred together with the input string itself. Unlike the situation in the problem of the previous section, this tiling is assumed to be exact. Therefore, the algorithm cannot return as a cover a string that, in order to cover T, must have repetitions that are not marked in the tiling L.
The formal definition of the problem is as follows.
Definition 17 (The Full-Tiling Relaxation of the ACP). INPUT: String T of length n over alphabet Σ, and a valid tiling L of T. OUTPUT: An L-approximate cover C of T.
The definition of an L-approximate cover of T requires a primitive string such that all its repetitions needed to cover T (with the minimum number of errors) are marked in the tiling L. To impose this requirement we need a different primitivity coercion algorithm than the one described in the previous section. This algorithm is described in Subsection 5.1. Unfortunately, proving the correctness of this algorithm requires a deep understanding of the properties of coverability in the presence of mismatch errors. Although we are making progress in establishing this needed background (see, for example, [2]), an incomplete understanding of the phenomenon prevents us from proving the correctness formally. Hence, in Subsection 5.2, we resort to experimental evidence of the correctness.

The Full-Tiling Primitivity Coercion Algorithm
This part of the algorithm gets as input the string C returned by the Histogram Greedy algorithm (Subsection 4.1) and performs the following steps:
1. Check the primitivity of C (using the linear-time algorithm of [7]). If C is primitive, return C.
2. Else, find V_k ∈ V_C such that changing the assignment of V_k from the symbol with the largest value in the row of V_k in Hist_{V_C,Σ} to the symbol with the second largest value in this row yields a new m-length candidate string C' that is primitive and for which the difference H(S_L(C'), T) − H(S_L(C), T) is minimized.
Lemma 18 describes the time complexity of the Full-Tiling Primitivity Coercion algorithm and immediately follows from the linear-time complexity of the algorithm of [7] that we use in the first step and from the description of the second step.
Remark: Note that we can use a different algorithm that, instead of checking the change of single variables to the second best assignment and choosing the one that gives primitivity with the least number of errors (as our algorithm does), checks the change to the second best assignment for all subsets of variables and chooses the subset that gives primitivity with the least number of errors. This algorithm is obviously correct, i.e., it achieves primitivity with the least number of errors; however, it has exponential time complexity. On the other hand, our algorithm is assured to have polynomial time complexity, so a proof of its correctness will assure the polynomial time complexity of the full-tiling relaxation of the ACP.
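The second step of the Full-Tiling Primitivity Coercion can be illustrated by the following minimal Python sketch. It is only a sketch under stated assumptions: the primitivity test is a simple quadratic check rather than the linear-time algorithm of [7], ties between symbols are broken by alphabet order, and the function names and the histogram values in the usage example are hypothetical (the paper's own histogram table is not reproduced here).

```python
from collections import Counter


def covers(c, s):
    """True if every position of s lies within some occurrence of c."""
    covered_up_to = 0
    pos = s.find(c)
    while pos != -1:
        if pos > covered_up_to:
            return False
        covered_up_to = pos + len(c)
        pos = s.find(c, pos + 1)
    return covered_up_to == len(s)


def is_primitive(s):
    """True if no proper prefix of s covers s."""
    return not any(covers(s[:k], s) for k in range(1, len(s)))


def full_tiling_coercion(var_string, hist, alphabet):
    """Return (candidate, extra_errors), or None if no single switch helps.

    var_string: the variable string V_C, one variable name per cover position.
    hist: maps each variable to a Counter of symbol counts (its histogram row).
    """
    variables = sorted(set(var_string))
    # Symbols of each row ordered from largest to smallest count.
    ranked = {v: sorted(alphabet, key=lambda a: -hist[v][a]) for v in variables}
    best_choice = {v: ranked[v][0] for v in variables}
    candidate = "".join(best_choice[v] for v in var_string)
    if is_primitive(candidate):
        return candidate, 0
    best = None
    for v in variables:
        if len(ranked[v]) < 2:
            continue
        second = ranked[v][1]
        extra = hist[v][ranked[v][0]] - hist[v][second]  # added mismatches
        alt = "".join(second if u == v else best_choice[u] for u in var_string)
        if is_primitive(alt) and (best is None or extra < best[1]):
            best = (alt, extra)
    return best


# Hypothetical histogram for V_C = XYZWXY over the alphabet {a, b}.
example_hist = {
    "X": Counter(a=5, b=1),
    "Y": Counter(b=6),
    "Z": Counter(a=3, b=2),
    "W": Counter(a=1, b=3),
}
result = full_tiling_coercion("XYZWXY", example_hist, "ab")
```

With this hypothetical histogram the greedy candidate is ababab, which is not primitive; switching Z to its second best symbol costs one additional error and gives the primitive string abbbab, which is what the sketch returns.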

Experimental Tests of the Full-Tiling Relaxation Algorithm
Experiments were designed to test the full-tiling relaxation algorithm, which is composed of the algorithms of Subsections 4.1 and 5.1. In particular, we also wanted to experimentally test how many times the full-tiling primitivity coercion is necessary. Note that, due to the result of [3], this algorithm is only of interest under a rather high error rate, in which there is an error in every occurrence of the approximate cover in the text. Otherwise, the dynamic programming algorithm of [3] solving the candidate-relaxation of the ACP is applicable, by trying every substring of T as a candidate cover. In order to comprehensively test the algorithm, the inputs for the tests were classified according to the following criteria:
cover size: A cover C of size m is constructed, where m is small (less than 10), medium or large (100-400). Covers of size more than 400 were not created due to space limitations.
alphabet size: The alphabet size was chosen to be either small (at most √m) or large (more than √m).
tiling style: Given a cover C and its mask M, a tiling L for the text S_L(C) is constructed, where the decision on the next index in L is made according to one of the following styles: random (an equal priority is given to every set bit in M), left priority (a decreasing priority is given to the set bits in M), right priority (an increasing priority is given to the set bits in M).
error rate: The input string T is constructed from S_L(C) by inserting mismatch errors according to the following error rates: medium (in every m characters at least one error), high (in every m characters at least √m errors).
error style: The mismatching character is determined according to one of the following styles: random (replacing by another character of the alphabet chosen uniformly at random) or priority (replacing by another character with priority to the first character in the alphabet, and if the first character is to be replaced, then by a different character chosen uniformly at random).
These criteria guarantee that the inputs created for testing the algorithm all have a coverable original string whose valid tiling is retained. Errors are then introduced into this original string at a sufficiently high rate to produce the input string, which, together with the valid tiling, serves as input for the tiling relaxation algorithm. Therefore, all the tested inputs have an L-approximate cover and our tiling relaxation algorithm is indeed applicable to them. Moreover, the above criteria for input generation also aim at neutralizing the effect of the cover size, the alphabet size, the tiling style, the error rate or the error style on the validity of the hypothesis, by exhaustively using all reasonable alternatives.
A total of 372000 texts T were constructed as described above and served as inputs (together with the tiling L) to the full-tiling relaxation algorithm. The results are given in Tables 1 and 2 (see Appendix). The column "Percent of Inputs" describes how many of the input texts had each row's characteristics. Numbers are rounded to two digits after the decimal point. The column "Identical" describes in how many of the input texts the Histogram Greedy algorithm of Subsection 4.1 returned the original cover C of the text S_L(C) built prior to the error insertion process. The column "Primitive" describes in how many of the input texts the Histogram Greedy algorithm of Subsection 4.1 returned a primitive cover, so that there was no need to proceed with the second phase of the Full-Tiling Primitivity Coercion algorithm of Subsection 5.1. The column "Non-Primitive" describes in how many of the input texts the Histogram Greedy algorithm of Subsection 4.1 returned a non-primitive string and, therefore, the second phase of the Full-Tiling Primitivity Coercion algorithm of Subsection 5.1 was performed. This latter case happened in 8912 texts, which are about 2% of the texts.

Experiments Conclusion:
Primitivity coercion was necessary in about 2% of the total tested inputs. In 100% of the tests the string returned after the Full-Tiling Primitivity Coercion algorithm was indeed an L-approximate cover of the input string.

Open Problems
In this paper we initiated the study of the approximate cover problem using a new approach. We proved that the cover-size relaxation of the approximate cover problem is NP-hard, thus proving that the ACP is NP-hard, while the partial-tiling relaxation is polynomial-time computable and the full-tiling relaxation has a polynomial-time algorithm whose correctness is supported by experimental evidence. Some interesting questions and open problems are:
Our NP-hardness proof uses an alphabet of unbounded size. Is the ACP still NP-hard for an alphabet of bounded size?
It is interesting to define other relaxations of the ACP and to study their complexity in order to have a deeper understanding of the ACP.
In this paper we only experimentally checked the correctness of our full-tiling relaxation algorithm. We would like to have a formal proof of its correctness.
In this paper we considered the Hamming distance as the metric in the definition of an approximate cover. Other string metrics can be considered as well. It is interesting to see if and how the complexity of the problem changes with the use of other string metrics.

Definition 3 (A Valid Tiling). Let T be an n-length string over alphabet Σ and let L be a sorted list of indices L ⊂ {1, ..., n}. Let m = n + 1 − L_last, where L_last is the last index in L. Then, L is called a valid tiling of T if i_1 = 1 and for every i_k, i_{k+1} ∈ L, it holds that i_{k+1} − i_k ≤ m.
Notation 1. Let C be an m-length string over alphabet Σ. Denote by S(C) a string of length n, n > m, such that C is a cover of S(C).
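The conditions of Definition 3 can be checked directly, as in the following minimal Python sketch. The function name is_valid_tiling is hypothetical; the tiling is given as a list of 1-based indices, and the sketch additionally assumes that the indices are distinct, since L is a subset of {1, ..., n}.

```python
def is_valid_tiling(tiling, n):
    """Check Definition 3 for a list of 1-based indices and text length n."""
    if not tiling:
        return False
    if any(not 1 <= i <= n for i in tiling):
        return False
    consecutive = list(zip(tiling, tiling[1:]))
    if any(b <= a for a, b in consecutive):
        return False  # must be sorted with distinct indices
    m = n + 1 - tiling[-1]  # cover length implied by the last index
    # The first index is 1 and consecutive occurrences leave no gap.
    return tiling[0] == 1 and all(b - a <= m for a, b in consecutive)
```

For n = 7, the list (1, 3, 5) is a valid tiling with m = 3, whereas (1, 5) is not, because the two occurrences of a length-3 cover would leave position 4 uncovered.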

Definition 10. Let T be an n-length string over alphabet Σ and let L be a valid tiling of T. Let m = n + 1 − L_last, where L_last is the last index in the tiling L. Then, an L-approximate cover of T is a primitive string C such that for every string C' of length m over Σ, H(S_L(C'), T) ≥ H(S_L(C), T), where H is the Hamming distance of the given strings. $\min_{C \in \Sigma^m} H(S_L(C), T)$ is the number of errors of an L-approximate cover of T.
Definition 11. Let T be an n-length string over alphabet Σ. Let L be a valid tiling of T and let L' be a valid tiling of T such that L ⊆ L'. Let m = n + 1 − L'_last, where L'_last is the last index in the tiling L'. Then, a partial L-approximate cover of T is a primitive string C of length m such that for every string C' of length m over Σ, H(S_{L'}(C'), T) ≥ H(S_{L'}(C), T), where H is the Hamming distance of the given strings. $\min_{C \in \Sigma^m} H(S_{L'}(C), T)$ is the number of errors of a partial L-approximate cover of T.
Definition 12 (The Partial-Tiling Relaxation of the ACP). INPUT: String T of length n over alphabet Σ, and a valid tiling L of T. OUTPUT: A partial L-approximate cover C of T.

Lemma 15. The time complexity of the Partial-Tiling Primitivity Coercion algorithm is O(m).
Theorem 16 follows.
Theorem 16. Let T be a text of length n over alphabet Σ, let L be a valid tiling, and let L_last be the last index in L. Then, the partial-tiling relaxation of the approximate cover problem of T can be solved in O(|Σ| · m + n) time, where m = n + 1 − L_last.

Definition 5. Let T be an n-long string over alphabet Σ. An m-long string C over Σ, m ∈ N, m < n, is called an m-length approximate cover of T if for every string C' of length m over Σ, $\min_{S(C') \in \Sigma^n} H(S(C'), T) \geq \min_{S(C) \in \Sigma^n} H(S(C), T)$, where H is the Hamming distance of the given strings. We refer to $\min_{S(C) \in \Sigma^n} H(S(C), T)$ as the number of errors of an m-length approximate cover of T.
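For very small inputs, Definition 5 can be explored by exhaustive search, as in the following minimal Python sketch. It enumerates every valid tiling whose last index is n + 1 − m and, for each one, picks the best string that is consistent with the overlaps of that tiling. This is an exponential-time illustration only, not an algorithm from the paper; all function names are hypothetical, and the per-tiling routine uses union-find classes as a stand-in for the mask-based steps of the Histogram Greedy algorithm.

```python
from collections import Counter, defaultdict
from itertools import combinations


def best_cover_for_tiling(text, tiling):
    """Best (cover, errors) among strings consistent with a 1-based valid tiling."""
    n = len(text)
    m = n + 1 - tiling[-1]
    parent = list(range(m))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_pos = [None] * n
    for start in tiling:
        for j in range(m):
            i = start - 1 + j
            if first_pos[i] is None:
                first_pos[i] = j
            else:
                parent[find(j)] = find(first_pos[i])  # overlap forces equality
    hist = defaultdict(Counter)
    for i, symbol in enumerate(text):
        hist[find(first_pos[i])][symbol] += 1
    cover = "".join(hist[find(j)].most_common(1)[0][0] for j in range(m))
    errors = sum(sum(c.values()) - max(c.values()) for c in hist.values())
    return cover, errors


def valid_tilings(n, m):
    """Yield every valid tiling of a length-n text by a length-m cover."""
    last = n + 1 - m
    if last == 1:
        yield [1]
        return
    middle = range(2, last)
    for r in range(len(middle) + 1):
        for chosen in combinations(middle, r):
            tiling = [1, *chosen, last]
            if all(b - a <= m for a, b in zip(tiling, tiling[1:])):
                yield tiling


def brute_force_m_cover(text, m):
    """Return (cover, errors, tiling) with the fewest errors over all tilings."""
    best = None
    for tiling in valid_tilings(len(text), m):
        cover, errors = best_cover_for_tiling(text, tiling)
        if best is None or errors < best[1]:
            best = (cover, errors, tiling)
    return best
```

For the text abbbaba and m = 3, the search reports a minimum of one error, since no string of length 3 covers the text exactly.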