Faster Approximate Elastic-Degenerate String Matching – Part A
Abstract
An elastic-degenerate (ED) string is a sequence of finite sets of strings. The cardinality of is the total number of strings in , for all . The size of is the total length of all strings of . ED strings have been introduced to represent a set of closely-related DNA sequences. Let be a pattern of length and be an integer. We consider the problem of -Approximate ED String Matching (EDSM): searching -approximate occurrences of in the language of . We call -Approximate EDSM under the Hamming distance, -Mismatch EDSM; and we call -Approximate EDSM under edit distance, -Edit EDSM.
Bernardini et al. (Theoretical Computer Science, 2020) showed a simple -time algorithm for -Mismatch EDSM and an -time algorithm for -Edit EDSM. We improve the dependency on in both results, obtaining an -time algorithm for -Mismatch EDSM and an -time algorithm for -Edit EDSM.
Bernardini et al. (Theory of Computing Systems, 2024) presented several algorithms for 1-Approximate EDSM working in time. They have also left the possibility to generalize these solutions for as an open problem. We improve the runtime of their solution for 1-Mismatch and 1-Edit EDSM from to . We further show algorithms for -Approximate EDSM for the Hamming and edit distances working in time, for any constant .
Finally, we show how our techniques can be applied to improve upon the complexity of the -Approximate ED String Intersection and -Approximate Doubly EDSM problems that were introduced very recently by Gabory et al. (Information and Computation, 2025).
Keywords and phrases:
ED string, approximate string matching, Hamming distance, edit distanceFunding:
Solon P. Pissis: Supported in part by the PANGAIA and ALPACA projects that have received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively.Copyright and License:
![[Uncaptioned image]](x1.png)
2012 ACM Subject Classification:
Theory of computation Pattern matchingEditors:
Paola Bonizzoni and Veli MäkinenSeries and Publisher:

1 Introduction
Elastic-degenerate strings (ED strings, in short) were introduced in [23] to represent a set of closely-related DNA sequences. They allow, among others, efficient pattern matching [22, 2, 7] and approximate pattern matching [5, 6, 8], whereas pattern matching in general graph-based representations cannot be done as fast [16, 3]. An ED string usually arises from a multiple sequence alignment (MSA) of the underlying closely-related DNA sequences. It can also be treated as a compacted nondeterministic finite automaton (NFA); see Figure 1.
Let be a totally-ordered alphabet. An ED string is a sequence of finite sets, where is a subset of . The total size of is defined as , where is the total number of empty strings in . By we denote the total number of strings in all , i.e., . We say that has length , cardinality and size . The language generated by the ED string is . A standard string can be represented as an ED string in which each set is a singleton of a character.
The Hamming distance of two equal-length strings and is defined as the number of positions where the two strings do not match, that is, . The edit distance (a.k.a. the Levenshtein distance) of and is the minimum number of edit operations (single-character insertions, deletions, substitutions) that allow to transform to . We say that a substring of a string text is a -approximate occurrence of a string pattern under the Hamming distance (edit distance) if the Hamming (edit) distance of and is at most . The problem in scope is now stated as follows.
-Approximate EDSM Input: An ED string (called text) of length , cardinality and size , a string (called pattern) of length over an alphabet of size , and a metric (Hamming or edit distance). Output: Decision version: YES if there is a string that contains a -approximate occurrence of under the respective metric, and NO otherwise. Reporting version: The set of positions for which there exists a string and a (possibly empty) string such that has a -approximate occurrence in string that ends at some position in .
We call the -Approximate EDSM problem under the Hamming distance, -Mismatch EDSM, and under the edit distance, -Edit EDSM (inspect Figure 2 for an example).
Exact EDSM (i.e., 0-Approximate EDSM) was considered by Grossi et al. [22], Aoyama et al. [2], and Bernardini et al. [7], who presented , , and time algorithms, respectively (here is the exponent of matrix multiplication). A practical approach using bit parallelism was proposed by Cisłak et al. [13]. -Approximate EDSM under the Hamming and edit distances was considered by Bernardini et al. in [8] (general ) and in [6] (). Below we elaborate on these two results and present our improvements.
Bernardini et al. [8] showed a simple -time algorithm for -Mismatch EDSM and an -time algorithm for the -Edit EDSM. We improve the dependency on in both these results, obtaining an -time algorithm for -Mismatch EDSM and an -time algorithm for -Edit EDSM.
metric | time complexity | remarks | reference |
Hamming | [8] | ||
Theorem 5 | |||
edit | [8] | ||
Theorem 10 | |||
Hamming | [5, 6] | ||
edit | |||
Hamming | Theorem 14 | ||
edit |
Bernardini et al. [5, 6] presented several algorithms for 1-Approximate EDSM working in time and in time. In [6] they left three open questions, two related to improving the polylogarithmic factors in the complexity of 1-Mismatch and 1-Edit EDSM and one asking for an efficient generalization to mismatches or errors. We answer all these open questions. We show how to avoid all the polylogarithmic factors in the complexity and present algorithms for 1-Approximate EDSM (both Hamming and edit) that work in time. We also show an algorithm for -Approximate EDSM for each of the metrics working in time, for constant . Actually, there is a factor in the complexity; in [5, 6] it was mentioned that having an exponential dependency on in this form of complexity for -Approximate EDSM could be inevitable.
Let us note that the complexities of our algorithms for 1-Approximate EDSM match the complexity of the fastest currently known combinatorial algorithm (that is, without using fast Fourier transform (FFT) [2] or fast matrix multiplication [7]) for exact EDSM [22]. Moreover, in [7] a conditional lower bound for combinatorial algorithms for exact EDSM was given according to which exact EDSM cannot be solved in time, for constant . The bound for and the bound for any constant that we show here work for the Hamming and the edit distance. In Part B, it is shown how to improve both bounds for the Hamming distance by resorting to FFTs [21].
A summary of the existing results and our results for -Approximate EDSM is shown in Tables 1 and 2. We consider the word RAM model of computation and assume that the ED string and the string are over an integer alphabet . All algorithms mentioned in the tables process the text ED string on-line: after each set is processed, we report YES if the string pattern has a -approximate occurrence ending at set .
metric | time complexity | remarks | reference |
---|---|---|---|
Hamming | [8] | ||
or expected time | |||
Theorem 5 | |||
expected time | |||
edit | [8] | ||
or expected time | Theorem 10 | ||
Hamming | [6] | ||
edit | |||
, decision ver. | |||
Hamming | Theorem 20 | ||
edit | |||
Hamming | Theorem 14 | ||
edit |
Finally, we show how our techniques can be applied to improve upon the complexity of the -Approximate ED String Intersection (-Approximate EDSI) and -Approximate Doubly EDSM problems that were introduced very recently by Gabory et al. [19]. In both problems, the input are two ED strings and such that has cardinality and size . In -Approximate EDSI, we would like to check if there are strings and at distance (Hamming or edit) at most . -Approximate Doubly EDSM, at least in its decision version, consists in -Approximate EDSM for all string patterns from in . In [19], both problems were solved in time for the Hamming distance and in time for the edit distance. We obtain time and time for the Hamming and edit distance, respectively.
Comparison of Techniques.
For -Mismatch EDSM, Bernardini et al. [8] used kangaroo jumps [20, 25], whereas we use efficient -Mismatch Pattern Matching [1, 11] and an algorithm for computing the so-called th mismatch [24]. For -Edit EDSM, Bernardini et al. used the Landau-Vishkin algorithm [26]; we apply the same algorithm and also Tiskin’s seaweeds [29], via an interface that was developed by Charalampopoulos et al. [12].
For 1-Approximate EDSM, Bernardini et al. [5, 6] used special instances of classic computational geometry problems to arrive at a time complexity of . For -Mismatch EDSM, in particular, they also introduced a variant of errata trees [14] to shave a logarithmic factor and obtain a running time of . We first show that their technique based on errata trees can be extended to -Edit EDSM thus improving the runtime from to . Then we show a simple sorting-based algorithm working in time if all strings in are of length at most . A trade-off between the two algorithms lets us shave the remaining factor.
Roadmap.
In Section 2, we discuss the necessary preliminaries. In Section 3, we prove Theorems 5 and 10. In Section 4, we prove Theorem 14. In Section 5, we prove Theorem 20. In Section 6, we show our results on -Approximate EDSI and -Approximate Doubly EDSM.
2 Preliminaries
We use the notion of active prefix (and suffix) that was already present in [7] (exact) and [5] (1-approximate). We generalize this notion in a natural way to a -approximate version.
Definition 1.
We say that is a -active prefix of string ending at position in an ED string if is at distance at most from a suffix of a string in .
Symmetrically, we say that is a -active suffix of starting at position in an ED string if is at distance at most from a prefix of a string in .
By we denote an array of values in . We have if and only if is a -active prefix ending at position in and then is the minimum distance between and a suffix of a string in . By definition, for the Hamming distance and (or for ) for the edit distance. Symmetrically, we define as an array of values in such that if and only if is a -active suffix starting at position in and then is the minimum distance between and a prefix of a string in . We make the following fundamental observation.
Observation 2.
A string pattern of length has a -approximate occurrence (under the Hamming or the edit distance) ending at position of an ED string if has a -approximate occurrence (under the respective distance) in one of the strings in or there is some index such that (again, under the respective distance).
By Observation 2, there are four essential steps to design an on-line algorithm for -Approximate EDSM:
-
(1)
For every given on-line, check if the string pattern has a -mismatch occurrence in one of the strings in .
-
(2)
For every given on-line, compute the arrays , of active prefixes and suffixes produced by only this set of the ED string.
-
(3)
For every given on-line, compute the array using and together with the set of -approximate occurrences of strings from in .
-
(4)
For every given on-line, check if there is an index such that . This step takes just time, for a total of .
Similar steps were considered in previous work on -Approximate EDSM. For instance, Bernardini et al. [6] introduced four cases for 1-Approximate EDSM: Easy Case that is exactly step (1), Suffix Case that is covered by step (2), Prefix Case covered by steps (2) and (4), and Anchor Case covered by step (3).
Let and . (Thus and .)
Often we would like to construct a data structure for the strings in a given set and the string that requires that the strings in are over an integer alphabet of size (this could be, for example, the generalized suffix tree; see [17]). However, in -Approximate EDSM we are only guaranteed that the strings are over an integer alphabet of size . As we aim at an on-line algorithm, we use a technique discussed in [8] in which an index of all characters of is constructed. With the index we renumber all characters of with integers in and then, for every character in a string in , we check if it is contained in the index. If so, we assign the character the number of the character in the pattern in , and if not, we assign it a number . The index can be constructed using perfect hashing [18] in expected time or using a deterministic dictionary [27] in time; in both cases, an index query is answered in time.
3 Approximate EDSM via Approximate Overlaps
Our solutions to -Approximate EDSM depend on two building blocks: -Approximate Pattern Matching; and computing -approximate overlaps of two strings.
Let us consider the Hamming or edit distance. In the -Approximate Pattern Matching (-Approximate PM) problem, we are given two strings, a text and a pattern , and we are to compute all -approximate occurrences of , i.e., such that and are at distance at most (under the respective metric). In particular, we have under the Hamming distance and for the edit distance. The solutions to -Approximate PM also report the distance (Hamming or edit) between a -approximate occurrence and ; in this sense, they solve a more general -Bounded PM problem.
We say that two strings and have a -approximate overlap of length under the Hamming or edit distance if the length- prefix of is at (Hamming or edit) distance at most from some suffix of . We define the -Bounded Overlaps problem as follows.
-Bounded Overlaps Input: Two strings and and a metric (Hamming or edit distance). Output: For every length , the smallest such that and have a -approximate overlap of length or NO if no such exists.
We impose a restriction since otherwise the answer for the edit distance would be clearly NO. Under the Hamming distance, we can further restrict the problem to .
3.1 -Mismatch EDSM
The -Approximate PM problem under the Hamming distance is called -Mismatch PM. For a string text of length , if and are over an integer alphabet of size , -Mismatch PM can be solved in time with a classic result by Amir et al. [1] or by a recent Monte Carlo algorithm of Chan et al. [11] in expected time with high probability.
We call -approximate overlaps under the Hamming distance -mismatch overlaps. The -Bounded Overlaps problem under Hamming distance was present implicitly in [8] where an -time solution using kangaroo jumps was presented. We solve the -Bounded Overlaps problem under Hamming distance more efficiently using the algorithm for the th mismatch computation of Kaplan et al. [24].
In the th Mismatch problem, we are given two strings and , , and we are to determine, for every , if matches with at most mismatches and, if not, to report the position of the th mismatch in . Kaplan et al. [24] presented a solution to the th Mismatch problem working in time for strings and over an integer alphabet.
Lemma 3.
The -Bounded Overlaps problem under Hamming distance for two strings and such that and are over an integer alphabet of size can be solved in time.
Proof.
Let be the prefix of of length and be the suffix of of length . By definition, if is a length of -mismatch overlap of and , then and the approximately matching prefix of and suffix of are a prefix of and a suffix of , respectively. We construct strings and , where are symbols not occurring in and , and solve the th Mismatch problem for and , where .
Claim 4.
For every , is the smallest nonnegative integer such that and have a -mismatch overlap of length if and only if and the th mismatch between and is at position in .
Proof.
Assume that and
with . We have , so . Then
where the second string ends with a and . The two above strings are prefixes of and , respectively. Hence, the th mismatch between and is at position in , as required.
Assume that and the th mismatch between and is at position . By definition, the last positions of and do not match. This means that the remaining prefixes of the two strings, i.e., and , are at Hamming distance , which concludes the proof.
By the claim, the solution to th Mismatch problem for and allows us to recover in time the answer to -Bounded Overlap problem. We have . The complexity follows by [24].
Theorem 5.
-Mismatch EDSM can be solved on-line in worst-case time or expected time with high probability.
Proof.
We apply the general scheme of steps (1)–(4), according to Observation 2. In step (1), we check if the string pattern has a -mismatch occurrence in a string . If , we apply the -Mismatch PM algorithm for string pattern and string text . After the -time preprocessing on , both strings can be assumed to be over alphabet . Over all strings , this takes worst-case time [1] or expected time with high probability [11], which sums up over all to worst case or expected time.
In step (2), the array of active prefixes of a single set is computed using -Bounded Overlaps. More precisely, for and every string , for every such that and have a -mismatch overlap of length , we set to ; otherwise is set to . By Lemma 3, the time complexity, over all , is . We point out that renumbering the alphabet to integer to satisfy the requirements of Lemma 3, over all calls to -Bounded Overlaps, takes time. The array of active suffixes is computed using a symmetric application of -Bounded Overlaps in the same time complexity. This step is performed for all in total time.
In step (3), we compute , for on-line. Initially . For a given , for all such that , we compute all -mismatch occurrences of string pattern in string text . Using -Mismatch PM, this takes worst-case time [1], for a total of time for all .
Whenever a -mismatch occurrence of is discovered, we set where and . In total, this takes time for a given , and time overall.
Finally, in step (4) we check the condition of Observation 2 for each position in time.
3.2 -Edit EDSM
The -Approximate PM problem under the edit distance is called -Edit PM. For a string text of length , if and are over an integer alphabet of size , -Edit PM can be solved in time using the Landau-Vishkin algorithm [26].
We call -approximate overlaps under the edit distance -edit overlaps. We compute -Bounded Overlaps under edit distance using a modification of the All--LPAM problem of Charalampopoulos et al. [12] (LPAM stands for Longest Prefix Approximate Match). In the original All--LPAM problem we are given a string text of length and a string pattern of length and we are to compute, for each and , the length of the longest prefix of that matches a prefix of with at most edits. We introduce a similar but different problem, All--BO (BO stands for Bounded Overlap), in which, given a string text of length and a string pattern of length , we are to compute for each and , the edit distance between a length- prefix of and a length- suffix of provided that it is at most or state otherwise.
The output size for each of the problems All--LPAM and All--BO is . The All--LPAM problem can be solved in time [12]. We show that All--BO can be solved in time using the techniques from [12] that we recall next.
The deletion distance of two strings and is the minimum number of character insertions and deletions required to transform to (i.e., substitutions are not allowed). For a string , by we denote the string . By the following fact, we can consider the deletion distance instead of the edit distance.
Fact 6 ([12, Fact 24]).
For any two strings and that do not contain the character , we have .
A permutation matrix is a square matrix over that contains exactly one 1 in each row and in each column. A permutation matrix of size corresponds to a permutation of such that if and only if . For two permutations , and their corresponding permutation matrices , , by we denote a shortest sequence of transpositions of neighboring elements such that (that is, each transposition swaps two adjacent columns of the maintained permutation matrix). Such a transposition of and is ordered if . For an matrix , we denote by an matrix such that .
We refer to a lemma from [12] that relates computing the -bounded deletion distance between a prefix of and a substring of with a certain family of matrices (actually, Monge matrices [10]) that can be stored efficiently.
Lemma 7 ([12, Observation 27 and Lemma 28]).
Let and be strings of length and , respectively, and . There exists a sequence of matrices , for , that satisfies the following conditions.
-
(a)
Let , , , and . Then, if and only if .
-
(b)
For each , there is a permutation matrix such that holds for all . Matrix is an identity permutation matrix (i.e., ). Moreover, the sequence , , contains at most ordered transpositions of neighboring elements in total and all its non-empty elements can be computed in time plus longest common prefix (LCP) queries on pairs of suffixes of and .
We will also use a dynamic partial sums data structure that stores an integer array of size and allows to update its elements and query for its prefix sums. We use a dynamic partial sums data structure with -time queries and space [9].
We are ready to describe our algorithm for All--BO.
Lemma 8.
If string , with , and string , with , are over an integer alphabet of size , All--BO can be solved in time.
Proof.
We perform in time a preprocessing on , where is a symbol that does not occur in and , after which longest common prefix (LCP) queries on suffixes of and can be answered in time [4, 17] (we use the assumption on the alphabet).
The following computations are made for values of such that the intervals cover . We will treat the sequence of matrices for as a dynamic matrix . Each ordered transposition of adjacent columns in the maintained matrix corresponds to a sub-column increment in (increasing the entries in the sub-column by ). The th column of the dynamic matrix will be stored with the aid of a dynamic partial sums data structure on array . Initially, the arrays are zeroed. If the elements are to be increased by , in the array corresponding to the th column, we increase by and decrease by . Then
(1) |
can indeed be computed by a single dynamic partial sum query. By Lemma 7(b), the number of updates will be and the updates can be computed in time after -time preprocessing for LCP queries. The updates for the partial sums data structure are performed in total time [9].
Recall that position in translates to position in and Fact 6. For every and , we will compute the -bounded value of
(2) |
using for such that . More precisely, by Lemma 7(a) (with , , , and threshold ), if , then (2) equals , and otherwise (2) is greater than . We will process all pairs of interest as follows.
For each of the given values of (that satisfy ), we consider all lengths such that the interval
(3) |
is non-empty. For each such , we compute in time the interval from (3) and, for each odd , assuming , compute . This way, for we consider pairs . For each of them, can be computed in time using a dynamic prefix sum as in (1), which gives total time.
The time complexity is .
Lemma 9.
The -Bounded Overlaps problem under edit distance for two strings and such that and are over an integer alphabet of size can be solved in time.
Proof.
Let be the prefix of of length and be the suffix of of length . If is a length of a -edit overlap of and , then, by definition, . Further, the approximately overlapping prefix of and suffix of are a prefix of and a suffix of , respectively. We have .
We solve the All--BO problem for string pattern and string text using Lemma 8 in time. For each , we can then compute the -bounded minimum edit distance between and a length- suffix of , over all such that the distance can be at most , i.e., for all , in total time. This produces the desired output.
Theorem 10.
-Edit EDSM can be solved on-line in time, plus time preprocessing or in expectation.
Proof.
We apply the general scheme of steps (1)–(4), according to Observation 2. In step (1), we check if the string pattern has a -edit occurrence in a string . We apply the -Edit PM algorithm for string pattern and string text . Over all strings , this takes time using the Landau-Vishkin algorithm [26] which sums up to time.
In step (2), the array of active prefixes of a single set is computed using -Bounded Overlaps. More precisely, for and every string , for every such that and have a -edit overlap of length , we set to the smallest such that and have a -edit overlap of length ; otherwise is set to . By Lemma 9, the time complexity, over all , is where . We can assume that since otherwise occurs at every position in . The array of active suffixes is computed symmetrically. This step is performed for all in total time.
In step (3), we compute , for on-line. Initially . For a given , for all such that , we compute all -edit occurrences of string pattern in string text . Using Landau-Vishkin -Edit PM, this takes time [26], for a total of time for all and .
Whenever a -edit occurrence of is discovered, we set where and . There are such occurrences, so the time complexity is .
Finally, step (4) is performed in time.
4 Approximate EDSM via Approximate Dictionary Matching
Bernardini et al. [6] used the technique behind errata trees [14] to design their solution for 1-Approximate EDSM. We also use the data structure of [14], but basically as a black-box; overall, we design a general solution for -Approximate EDSM for a constant .
In Approximate Dictionary Matching, we are given a dictionary of strings of total length and an integer parameter . Our goal is to preprocess so that later, given a string text of length , we can list all the -approximate occurrences of the patterns from in , under the Hamming or edit distance. Lemma 11 below follows from Cole et al. [14]. Minor tweaks are needed in their approach to obtain the exact complexity as stated in the lemma; the original complexity would be in the worst case, where is the number of pairs (string from , substring of at distance at most ).
Lemma 11 ([14]).
If and all strings are over an integer alphabet, Approximate Dictionary Matching can be solved in time.
Proof.
Using the approach of Cole et al. [14], one first constructs the generalized suffix tree of all the patterns from the dictionary in expected time or worst-case time. This time complexity is required to implement an oracle with -time access to a child of a given node with a given first character of the edge; Cole et al. either use perfect hashing or a slower deterministic solution. The sole purpose of the oracle is to be able to extend the suffix tree with all suffixes of the string text in time, as in Ukkonen’s on-line suffix tree construction [30]. We can, instead, construct the generalized suffix tree of all patterns from and the text in worst-case time [17], as we consider only one text .
With the aid of the suffix tree, Cole et al. construct a data structure composed of several compacted tries in total time. Each explicit node of each compacted trie stores a (possibly empty) list of patterns from . We do not alter this part of their construction.
In the query algorithm, for a text , after all suffixes of have been located in the suffix tree as discussed above, for each suffix of , paths in various compacted tries are computed in total time. For each such path, all patterns from stored in the lists of the explicit nodes on the path are returned. If pattern is returned, this means that suffix of has a prefix that matches with (Hamming or edit) distance at most . From the path information one can easily extract a corresponding prefix of such that the (Hamming or edit) distance between and is at most .
In the worst case, approximate occurrences can be reported in total, which we would like to avoid. Each path computed in the query algorithm has length at most . If instead of reporting all elements of a list of an explicit node, we report one single arbitrarily chosen element, we report only approximate occurrences in total.
We generalize the Active Prefixes problem that was introduced in the exact variant in [7]. A solution to this problem is the non-trivial part of step (3).
-Approximate Active Prefixes (-Approximate AP) Input: A string of length , an array containing values in , a set of strings of total length , and a metric (Hamming or edit distance). Output: An array containing values in such that , for , is the smallest value for which there exist , and , such that is at distance (Hamming or edit) at most from and , or if such a value does not exist.
We obtain -Mismatch AP and -Edit AP problems for the respective distances. The -Approximate AP problem can be solved directly using Approximate Dictionary Matching.
Lemma 12.
If and all strings are over an integer alphabet, -Approximate AP under Hamming or edit distance can be solved in time.
Proof.
We construct an instance of Approximate Dictionary Matching for dictionary of strings of total length , for every . The text in the Approximate Dictionary Matching will be . By Lemma 11, for every , all -approximate occurrences of strings in in can be computed in time.
Initially, the array is filled with . For every and every -approximate occurrence of a string from , we set . If , we set it to . This step takes time as there are distinct occurrences.
Lemma 13.
If , -Approximate EDSM under the Hamming or edit distance can be solved on-line in time, where is the time complexity of a solution to -Approximate AP with parameters , , .
Proof.
We follow steps (1)–(4) according to Observation 2. In step (1), we need to check if the string pattern has a -mismatch occurrence in one of the strings in . In case of the Hamming distance, we can use the algorithm by Amir et al. [1] that works in time, and in case of edit distance, the Landau-Vishkin algorithm [26] that works in time. After -time preprocessing on , all strings in can be assumed to be over alphabet . Overall, for the time complexity is under each of the distance metrics.
Assuming the Hamming distance, in step (2) the array of active prefixes of a single set is computed using -Bounded Overlaps for and every string . Using a solution to -Bounded Overlaps with kangaroo jumps, the time complexity, over all , is . The array of active suffixes is computed using a symmetric application of -Bounded Overlaps in the same time complexity. This step is performed for all in total time, which is for constant .
Step (2) under edit distance is solved using -Bounded Overlaps exactly as in Theorem 10. However, this time we use Lemma 9 to bound the complexity for a single string by , which yields time over all . For all , we get just time as is constant.
Step (3) requires to compute the array using dynamic programming. We start with . Then we solve an instance of -Approximate AP problem with , and . For a given , we have , . Over all , the time complexity is .
Finally, step (4) is performed in time.
We plug in the solution to -Approximate AP from Lemma 12 to obtain the next theorem.
Theorem 14.
If , -Approximate EDSM under the Hamming or edit distance can be solved on-line in time.
5 1-Approximate EDSM in Time
In this section, we solve the other two open problems stated by Bernardini et al. [6]. We design a combinatorial -time algorithm for 1-Approximate EDSM under Hamming distance and also show how this can be extended to work for edit distance. In Section 5.1, we extend the -time algorithm presented in [6] for Hamming to edit distance. In Section 5.2, we show how we can shave the factor from the time complexity.
We introduce the 01-Approximate AP problem, in which the array contains values in and the array is to contain values in such that for if and only if there exist and , such that is at distance (Hamming, edit) at most from and . The 1-Approximate AP problem reduces to two instances of exact (i.e., 0-Approximate) AP problem, one in which we have and one with , and an instance of 01-Approximate AP.
5.1 1-Edit AP in Time using 1-Errata Tree
In [6], the 01-Mismatch AP problem is solved using a version of -errata tree, where the copied nodes of the input trie are explicitly inserted into the tree. This way the obtained errata tree is an actual trie and hence allows to apply standard tree traversal algorithms. In the next lemma, we generalize the construction in [6] to edit distance.
Lemma 15.
Let us consider an instance of the 01-Edit AP problem. In time and space, one can construct a 1-errata tree satisfying the following: if and only if contains a pair of nodes such that has label , it is an ancestor of node with label , where , , and one of the following is satisfied: for some integer or or .
Proof.
We start with the description of the -errata tree , and then show that it satisfies the conditions stated in Lemma 15.
For a general compacted trie with leaf nodes, we can construct its -errata tree as follows. We compute the heavy-light decomposition [28] of in time and make all direct children of branching nodes explicit. A node of the trie reached by reading the string is labeled with the reference to this string (a single node can have multiple labels; e.g., if the same string appears in the collection multiple times); here we just use the string itself for simplicity of the description, while in practice one would use a more succinct representation. For every light node at depth that is reached from with a character we make three copies of its subtree:
-
in one copy, for each label of each node, we add (the label becomes ) and merge it with the subtree of ;
-
in one copy, for each label of each node, we add (the label becomes ) and merge it with the subtree of the node reached by reading from the heavy sibling of (this node can be implicit, or even non-existent in , in which case we simply make it explicit/add it).
We change the original labels of nodes (the ones which were already present in ) from to so that each label forms a pair. We denote the resulting tree by .
Claim 16.
Let and be two nodes of labeled with and respectively, for two strings and . String is at edit distance at most from a prefix of if and only if contains two nodes, labeled with and its ancestor labeled with , and
-
for some integer ; or
-
or .
Proof.
We assume that to reduce the number of special cases – if such an assumption cannot be made we can simply modify the construction by appending the pattern with 3 characters out of the alphabet, which does not change the performance.
() Let node be the lowest common ancestor of and in . If , then is a prefix of and hence the pair is such a pair ( with label is an ancestor of labeled with ). Otherwise is the position of the error (substitution, deletion, or insertion) in some alignment (sometimes a string can be obtained in a few ways for example by removing different characters of a unary run). Now it is enough to consider the three possible cases:
-
is obtained from a prefix of by a substitution at position – in the heavy child of there exist nodes , with the descendant ancestor relation in with labels , , respectively, such that .
-
is obtained from a prefix of by a deletion of the character at position – if the node is in the subtree of the heavy child of in , then in the node with label is a descendant of with label , otherwise in a node with label is a descendant of .
-
is obtained from a prefix of by an insertion of a character between positions and – if the node is in the subtree of the heavy child of in , then in the node with label is an ancestor of with label , otherwise in a node with label is an ancestor of .
() Let us assume that the consequences are satisfied. Let be the node whose label contains and the node whose label contains , such that is an ancestor of . We first assume for some , then (like in the Hamming case) the prefix of of length can be obtained from by replacing its th character with . The same applies when and or and .
If and or and then the length- prefix of can be obtained from by removing the character on its th position.
If and or and then can be obtained from the length- prefix of by removing the character on its th position.
Now, Lemma 15 is a direct application of Claim 16 to the tree representing the set of the strings in plus the set of the suffixes of the pattern such that . This tree can be constructed in time and has nodes, hence its -errata tree can be constructed in time in the same way as in [6, Lemma 4.2]. (Actually, in [6, Lemma 4.2] the complexity is stated as , but it can be readily verified that the algorithm presented there achieves the claimed complexity; cf. [31] for a more general construction with such a bound.)
This property of our -errata tree for edit distance is exactly the same as the property of the -errata tree for the Hamming distance in [6, Lemma 4.3]. At the same time it is a simple special case of the property shown in [31] (together with the errata tree construction). Moreover, we can assume that because when we use a -time algorithm that was shown in [6] (such an algorithm also follows by Lemma 18 below as we can assume that ). We can now employ the algorithm from [6] without any modification (only the size of the tree and the number of bit-vectors used therein is up to times bigger).
Lemma 17.
The -Edit AP problem can be solved in time.
5.2 Shaving the Factor
To shave the factor from Lemma 17, we need to apply it for strings of . By the pigeonhole principle, we have strings of length greater than in . For the remaining short strings, we employ a simple sorting-based algorithm.
In the next lemma, in the case of Hamming distance, instead of performing every possible substitution in pattern substrings, which would incur an additional factor in the complexity, we replace every possible character in pattern substrings as well as in every string by a wildcard character . We then sort these strings using existing techniques. If two such modified strings are equal, the original strings were at Hamming distance at most 1. This is efficient only for 1 mismatch, because then we obtain only modified strings from .
Lemma 18.
Let us consider an instance of the 1-Approximate AP problem in which all strings in have length at most . The -Approximate AP problem under the Hamming or edit distance can be solved in time.
Proof.
We first solve the 01-Approximate AP problem; the 0-Approximate AP problem is easier. Our algorithm for 01-Approximate AP uses a technique based on sorting. We will describe it for the Hamming distance but edit distance works largely in the same manner. Consider the following grouping; for every , we construct group consisting of:
-
For all pairs , with such that and , the modified substrings of , such that the th character of every substring is replaced by a special character that does not occur in .
-
The modified strings , , such that the th character of is replaced by .
To simulate a single insertion or deletion operation, we delete the th character from or from . For technical purposes, we assume that every such modified string is concatenated with another special character that is lexicographically the smallest.
For efficiency, we never construct these strings explicitly. We rather encode them using a constant number of substrings from or from strings in (substrings are represented by intervals). The key observation is that the total number of strings in , for all , is . We then sort the strings in and compute the (longest common prefix) LCP between every pair of successive strings in this sorted list; we denote the sorted list by . This can be done in time, for all , using the gapped suffix array [15], because all strings in group have character at the same th position.
The strings in , whose unmodified version are within Hamming distance and the mismatching position is , form a consecutive sublist of . In particular, two elements are in a consecutive sublist if and only if . This can be checked in time because the length of their LCP is . It suffices to look if, in any such consecutive sublist, we have one element that originates from and one element that originates from . By thus traversing all and looking at the LCP values of consecutive entries, we can find if any string from is within Hamming distance 1 from any . If so, we set to 1. The time for processing one is . Thus the total time is .
For 0-Approximate AP, we construct just one group that contains all the considered substrings of and all strings in . The complexity is .
Lemma 19.
1-Approximate AP under Hamming or edit distance can be solved in time.
Proof.
For short strings of length at most , we employ Lemma 18 working in = time. For long strings of length greater than we use the -time algorithm for Hamming distance from [6] and for edit distance we use Lemma 17. By the pigeonhole principle, and the second complexity becomes .
Theorem 20.
-Approximate EDSM under the Hamming or edit distance can be solved on-line in time.
6 -Approximate Problems on Two ED Strings
We proceed to our results on -Approximate EDSI and -Approximate Doubly EDSM.
Theorem 21.
Given an ED string of cardinality and size , an ED string of cardinality and size , and an integer , we can check whether a pair of strings with (, respectively) exists in time ( time, respectively) and, if that is the case, return a pair with the smallest distance.
Proof.
Let and denote the lengths of and , respectively. We use the approach from [19, Theorem 8.2] that performs the following steps for either of the distance metrics:
-
1.
For , compute all -approximate occurrences of each string pattern from each set in each string text from each set , for and . Together with each approximate occurrence, report also the actual minimal distance (that is at most ).
-
2.
For , for each string from each set and for each string from each set , compute all pairs such that is a prefix of , is a suffix of and the distance between and is at most . Together with each prefix-suffix pair, report also the actual minimal distance (that is at most ).
-
3.
Use the computed outputs to construct a graph of size in case of the Hamming distance and of size in case of edit distance [19].
- 4.
Steps 3 and 4 are the same as in Gabory et al. [19]. In steps 1 and 2, Gabory et al. [19] achieved time for the Hamming distance and time for the edit distance. We use the techniques from Section 3 to improve these complexities.
In step 1, in each string text in , we perform -Approximate PM for patterns. Each -Mismatch PM is performed in time using the algorithm of Amir et al. [1] and each -Edit PM is performed in time using the Landau-Vishkin algorithm [26]. Over all patterns and texts, we obtain time for the Hamming distance and for the edit distance. We recall that each algorithm returns the actual minimal distance of the occurrence if it is at most .
In step 2 for the Hamming distance, we solve the -Bounded Overlaps problem for every string from and from . The total number of prefix-suffix pairs is bounded by the number of prefixes, which is at most , times the number of strings , which is . With Lemma 3, the time complexity is bounded by . For the edit distance, each prefix of each string from can match up to suffixes of each string from . Hence, the number of prefix-suffix pairs is bounded by . The approximately matching prefix-suffix pairs are computed using the All--BO problem (Lemma 8) in time. For all and , we achieve time. Thus the complexities of step 2 dominate the respective complexities in step 1.
The time complexities of steps 3 and 4, that is, for the Hamming distance and for the edit distance, are also dominated.
In [19, Corollary 8.3], it was shown that the complexity of -Approximate EDSI carries on to -Approximate Doubly EDSM, which we define next. We therefore obtain Corollary 22.
-Approximate Doubly EDSM Input: An ED string of cardinality and size (called text) and an ED string of cardinality and size (called pattern). Output: The set of positions for which there exists a string whose -approximate occurrence in starts in .
Corollary 22.
-Approximate Doubly EDSM problem can be solved in time and in time for the Hamming and edit distance, respectively.
References
- [1] Amihood Amir, Moshe Lewenstein, and Ely Porat. Faster algorithms for string matching with k mismatches. Journal of Algorithms, 50(2):257–275, 2004. doi:10.1016/S0196-6774(03)00097-X.
- [2] Kotaro Aoyama, Yuto Nakashima, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Faster online elastic degenerate string matching. In Gonzalo Navarro, David Sankoff, and Binhai Zhu, editors, Annual Symposium on Combinatorial Pattern Matching, CPM 2018, July 2-4, 2018 - Qingdao, China, volume 105 of LIPIcs, pages 9:1–9:10. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2018. doi:10.4230/LIPICS.CPM.2018.9.
- [3] Rocco Ascone, Giulia Bernardini, Alessio Conte, Massimo Equi, Estéban Gabory, Roberto Grossi, and Nadia Pisanti. A unifying taxonomy of pattern matching in degenerate strings and founder graphs. In Solon P. Pissis and Wing-Kin Sung, editors, 24th International Workshop on Algorithms in Bioinformatics, WABI 2024, September 2-4, 2024, Royal Holloway, London, United Kingdom, volume 312 of LIPIcs, pages 14:1–14:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPICS.WABI.2024.14.
- [4] Michael A. Bender and Martin Farach-Colton. The LCA problem revisited. In Gaston H. Gonnet, Daniel Panario, and Alfredo Viola, editors, LATIN 2000: Theoretical Informatics, 4th Latin American Symposium, Punta del Este, Uruguay, April 10-14, 2000, Proceedings, volume 1776 of Lecture Notes in Computer Science, pages 88–94. Springer, 2000. doi:10.1007/10719839_9.
- [5] Giulia Bernardini, Estéban Gabory, Solon P. Pissis, Leen Stougie, Michelle Sweering, and Wiktor Zuba. Elastic-degenerate string matching with 1 error. In Armando Castañeda and Francisco Rodríguez-Henríquez, editors, LATIN 2022: Theoretical Informatics - 15th Latin American Symposium, Guanajuato, Mexico, November 7-11, 2022, Proceedings, volume 13568 of Lecture Notes in Computer Science, pages 20–37. Springer, 2022. doi:10.1007/978-3-031-20624-5_2.
- [6] Giulia Bernardini, Estéban Gabory, Solon P. Pissis, Leen Stougie, Michelle Sweering, and Wiktor Zuba. Elastic-degenerate string matching with 1 error or mismatch. Theory of Computing Systems, 68:1442–1467, 2024. doi:10.1007/s00224-024-10194-8.
- [7] Giulia Bernardini, Paweł Gawrychowski, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Elastic-degenerate string matching via fast matrix multiplication. SIAM Journal on Computing, 51(3):549–576, 2022. doi:10.1137/20M1368033.
- [8] Giulia Bernardini, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Approximate pattern matching on elastic-degenerate text. Theoretical Computer Science, 812:109–122, 2020. doi:10.1016/J.TCS.2019.08.012.
- [9] Gerth Stølting Brodal, Pooya Davoodi, and S. Srinivasa Rao. Path minima queries in dynamic weighted trees. In Frank Dehne, John Iacono, and Jörg-Rüdiger Sack, editors, Algorithms and Data Structures - 12th International Symposium, WADS 2011, New York, NY, USA, August 15-17, 2011. Proceedings, volume 6844 of Lecture Notes in Computer Science, pages 290–301. Springer, 2011. doi:10.1007/978-3-642-22300-6_25.
- [10] Rainer E. Burkard, Bettina Klinz, and Rüdiger Rudolf. Perspectives of Monge properties in optimization. Discrete Applied Mathematics, 70(2):95–161, 1996. doi:10.1016/0166-218X(95)00103-X.
- [11] Timothy M. Chan, Ce Jin, Virginia Vassilevska Williams, and Yinzhan Xu. Faster algorithms for text-to-pattern Hamming distances. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023, pages 2188–2203. IEEE, 2023. doi:10.1109/FOCS57990.2023.00136.
- [12] Panagiotis Charalampopoulos, Tomasz Kociumaka, Jakub Radoszewski, Solon P. Pissis, Wojciech Rytter, Tomasz Waleń, and Wiktor Zuba. Approximate circular pattern matching. In Shiri Chechik, Gonzalo Navarro, Eva Rotenberg, and Grzegorz Herman, editors, 30th Annual European Symposium on Algorithms, ESA 2022, September 5-9, 2022, Berlin/Potsdam, Germany, volume 244 of LIPIcs, pages 35:1–35:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPICS.ESA.2022.35.
- [13] Aleksander Cisłak, Szymon Grabowski, and Jan Holub. SOPanG: online text searching over a pan-genome. Bioinformatics, 34(24):4290–4292, 2018. doi:10.1093/BIOINFORMATICS/BTY506.
- [14] Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don’t cares. In László Babai, editor, Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 91–100. ACM, 2004. doi:10.1145/1007352.1007374.
- [15] Maxime Crochemore and German Tischler. The gapped suffix array: A new index structure for fast approximate matching. In Edgar Chávez and Stefano Lonardi, editors, String Processing and Information Retrieval - 17th International Symposium, SPIRE 2010, Los Cabos, Mexico, October 11-13, 2010. Proceedings, volume 6393 of Lecture Notes in Computer Science, pages 359–364. Springer, 2010. doi:10.1007/978-3-642-16321-0_37.
- [16] Massimo Equi, Veli Mäkinen, Alexandru I. Tomescu, and Roberto Grossi. On the complexity of string matching for graphs. ACM Transactions on Algorithms, 19(3):21:1–21:25, 2023. doi:10.1145/3588334.
- [17] Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS 1997, Miami Beach, Florida, USA, October 19-22, 1997, pages 137–143. IEEE Computer Society, 1997. doi:10.1109/SFCS.1997.646102.
- [18] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. Journal of the ACM, 31(3):538–544, 1984. doi:10.1145/828.1884.
- [19] Estéban Gabory, Moses Njagi Mwaniki, Nadia Pisanti, Solon P. Pissis, Jakub Radoszewski, Michelle Sweering, and Wiktor Zuba. Elastic-degenerate string comparison. Information and Computation, 304:105296, 2025. doi:10.1016/j.ic.2025.105296.
- [20] Zvi Galil and Raffaele Giancarlo. Parallel string matching with k mismatches. Theoretical Computer Science, 51:341–348, 1987. doi:10.1016/0304-3975(87)90042-9.
- [21] Paweł Gawrychowski, Adam Górkiewicz, Pola Marciniak, Solon P. Pissis, and Karol Pokorski. Faster approximate elastic-degenerate string matching – Part B. In Paola Bonizzoni and Veli Mäkinen, editors, 36th Annual Symposium on Combinatorial Pattern Matching, CPM 2025, June 17-19, 2025, Milan, Italy, volume 331 of LIPIcs, pages 29:1–29:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2025. doi:10.4230/LIPIcs.CPM.2025.29.
- [22] Roberto Grossi, Costas S. Iliopoulos, Chang Liu, Nadia Pisanti, Solon P. Pissis, Ahmad Retha, Giovanna Rosone, Fatima Vayani, and Luca Versari. On-line pattern matching on similar texts. In Juha Kärkkäinen, Jakub Radoszewski, and Wojciech Rytter, editors, 28th Annual Symposium on Combinatorial Pattern Matching, CPM 2017, July 4-6, 2017, Warsaw, Poland, volume 78 of LIPIcs, pages 9:1–9:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPICS.CPM.2017.9.
- [23] Costas S. Iliopoulos, Ritu Kundu, and Solon P. Pissis. Efficient pattern matching in elastic-degenerate strings. Information and Computation, 279:104616, 2021. doi:10.1016/J.IC.2020.104616.
- [24] Haim Kaplan, Ely Porat, and Nira Shafrir. Finding the position of the k-mismatch and approximate tandem repeats. In Lars Arge and Rusins Freivalds, editors, Algorithm Theory - SWAT 2006, 10th ScandinavianWorkshop on Algorithm Theory, Riga, Latvia, July 6-8, 2006, Proceedings, volume 4059 of Lecture Notes in Computer Science, pages 90–101. Springer, 2006. doi:10.1007/11785293_11.
- [25] Gad M. Landau and Uzi Vishkin. Efficient string matching with k mismatches. Theoretical Computer Science, 43:239–249, 1986. doi:10.1016/0304-3975(86)90178-7.
- [26] Gad M. Landau and Uzi Vishkin. Fast parallel and serial approximate string matching. Journal of Algorithms, 10(2):157–169, 1989. doi:10.1016/0196-6774(89)90010-2.
- [27] Milan Ruzic. Constructing efficient dictionaries in close to sorting time. In Luca Aceto, Ivan Damgård, Leslie Ann Goldberg, Magnús M. Halldórsson, Anna Ingólfsdóttir, and Igor Walukiewicz, editors, Automata, Languages and Programming, 35th International Colloquium, ICALP 2008, Reykjavik, Iceland, July 7-11, 2008, Proceedings, Part I: Tack A: Algorithms, Automata, Complexity, and Games, volume 5125 of Lecture Notes in Computer Science, pages 84–95. Springer, 2008. doi:10.1007/978-3-540-70575-8_8.
- [28] Daniel Dominic Sleator and Robert Endre Tarjan. A data structure for dynamic trees. J. Comput. Syst. Sci., 26(3):362–391, 1983. doi:10.1016/0022-0000(83)90006-5.
- [29] Alexandre Tiskin. Semi-local string comparison: algorithmic techniques and applications. CoRR, abs/0707.3619, 2007. arXiv:0707.3619.
- [30] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995. doi:10.1007/BF01206331.
- [31] Wiktor Zuba, Grigorios Loukides, Solon P. Pissis, and Sharma V. Thankachan. Approximate suffix-prefix dictionary queries. In Rastislav Královic and Antonín Kucera, editors, 49th International Symposium on Mathematical Foundations of Computer Science, MFCS 2024, August 26-30, 2024, Bratislava, Slovakia, volume 306 of LIPIcs, pages 85:1–85:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPICS.MFCS.2024.85.