Faster Approximate Elastic-Degenerate String Matching – Part B
Abstract
We revisit the complexity of approximate pattern matching in an elastic-degenerate string. Such a string is a sequence of finite sets of strings of total length N, and compactly describes a collection of strings obtained by first choosing exactly one string in every set, and then concatenating them together. This is motivated by the need of storing a collection of highly similar DNA sequences.
The basic algorithmic question on elastic-degenerate strings is pattern matching: given such an elastic-degenerate string and a standard pattern of length m, check if the pattern occurs in one of the strings in the described collection. Bernardini et al. [SICOMP 2022] showed how to leverage fast matrix multiplication to obtain an Õ(nm^{ω−1} + N)-time complexity for this problem, where n is the length of the elastic-degenerate string and ω is the matrix multiplication exponent. However, from the point of view of possible applications, it is more desirable to work with approximate pattern matching, where we seek approximate occurrences of the pattern. This generalization has been considered in a few papers already, but the best result so far for occurrences with k mismatches, where k is a constant, is the Õ(nm² + N)-time algorithm presented in Part A [CPM 2025]. This brings the question whether increasing the dependency on m from m^{1.5} to quadratic is necessary when moving from k = 0 to larger (but still constant) k.
We design an Õ(nm^{1.5} + N)-time algorithm for pattern matching with k mismatches in an elastic-degenerate string, for any constant k. To obtain this time bound, we leverage the structural characterization of occurrences with mismatches of Charalampopoulos, Kociumaka, and Wellnitz [FOCS 2020] together with the fast Fourier transform. We need to work with multiple patterns at the same time, instead of a single pattern, which requires refining the original characterization. This might be of independent interest.
Keywords and phrases: ED string, approximate pattern matching, Hamming distance, mismatches
Funding: Paweł Gawrychowski: Partially supported by the Polish National Science Centre grant number 2023/51/B/ST6/01505.
2012 ACM Subject Classification: Theory of computation → Pattern matching
Editors: Paola Bonizzoni and Veli Mäkinen

1 Introduction
An elastic-degenerate string (ED-string, in short) T = T[1] T[2] ⋯ T[n] is a sequence of finite sets, where each T[i] is a subset of Σ* and Σ is an ordered finite alphabet. The length n of T is defined as the length of the associated sequence. The size N of T is defined as the total length of all the strings in all the sets plus e, where e is the total number of empty strings in T. The cardinality G of T is defined as the total number of strings in all the sets. Every ED-string represents a collection of strings, each of generally different length. We formalize this intuition as follows. The language generated by T is defined as L(T) = {S_1 S_2 ⋯ S_n : S_i ∈ T[i] for every 1 ≤ i ≤ n}.
The main motivation behind introducing ED-strings [19] was to encode a collection of highly similar DNA sequences in a compact form. Consider a multiple sequence alignment (MSA) of such a collection. Maximal conserved substrings (conserved columns of the MSA) form singleton sets of the ED-string, and the non-conserved ones form sets that list the corresponding variants. Moreover, the language of the ED-string consists of (at least) the underlying sequences of the MSA. Under the assumption that these underlying sequences are highly similar, the size of the ED-string is substantially smaller than the total size of the collection. ED-strings have been used in several applications: indexing a pangenome [8], on-line string matching in a pangenome [10], and pairwise comparison of pangenomes [15].
ED-strings are also interesting as a simplified model for string matching on node-labeled graphs [3]. The ED-string T can be viewed as a graph of n layers [21], where the nodes of layer i are the strings from T[i], such that from layer i to layer i+1 all possible edges are present and the nodes in layer i are adjacent only to the nodes in layer i+1. As a simplified model, ED-strings offer important advantages, such as on-line (left to right) string matching algorithms whose running times have a linear dependency on N [17, 2, 5]. This linear dependency on N (without any multiplicative polylogarithmic factors) is highly desirable in applications because, nowadays, it is typical to encode a very large number of genomes (e.g., millions of SARS-CoV-2 genomes, see https://gisaid.org) in a single representation, resulting in huge values of N.
In this work, we focus on the string matching (or pattern matching) task. In the elastic-degenerate string matching (EDSM) problem, we are given a string P of length m (known as the pattern) and an ED-string T (known as the text), and we are asked to find the occurrences of P in T. Grossi et al. showed that EDSM can be solved in O(nm² + N) time using combinatorial pattern matching tools [17]. Aoyama et al. improved this to O(nm^{1.5}√(log m) + N) time by employing the fast Fourier transform [2]. Finally, Bernardini et al. improved it to Õ(nm^{ω−1} + N) time [5] by employing fast matrix multiplication, where ω [1] is the matrix multiplication exponent. The authors of [5] also showed a conditional lower bound for combinatorial algorithms (the term “combinatorial” is not formally well-defined) for EDSM, stating that EDSM cannot be solved in O(nm^{1.5−ε} + N) time, for any constant ε > 0.
In the approximate counterpart of EDSM, we are also given an integer k > 0, and we are asked to find k-approximate occurrences of P in T; namely, the occurrences that are at Hamming or edit distance at most k from P. For Hamming distance, we call the problem EDSM with Mismatches (see below); and for edit distance, EDSM with Errors. The approximate EDSM problem was introduced by Bernardini et al., who showed a simple O(k²mG + kN)-time algorithm for EDSM with Mismatches and an O(k³mG + k²N)-time algorithm for EDSM with Errors using combinatorial pattern matching tools [6]. In Part A, the dependency on k for both the Hamming and the edit distance metrics is improved, obtaining an O(kmG + kN)-time algorithm for EDSM with Mismatches and an O(k²mG + kN)-time algorithm for EDSM with Errors [22].
Unfortunately, the cardinality G of T in the above complexities is bounded only by N, so even for k = 1, the existing algorithms run in O(mN) time in the worst case. Bernardini et al. [4] showed many algorithms for approximate EDSM for k = 1 working in Õ(nm² + N) time, for both the Hamming and the edit distance metrics. In Part A, the results for k = 1 (for both metrics) are improved to O(nm² + N) time and extended to work within the same complexity for any constant k (for both metrics) [22].
In this work, we consider the EDSM with Mismatches problem with constant k, and observe that all the existing algorithms have at best a quadratic dependency on m, the length of the pattern, for this problem. This is in stark contrast to the case of k = 0, and brings the question of whether non-combinatorial methods could be employed to solve EDSM with Mismatches, for any constant k, in time subquadratic in m, similar to EDSM (k = 0) [2, 5].
Theorem 1.
Given a pattern P of length m and an ED-string T of length n and size N, EDSM with Mismatches, for k = 1, can be solved in Õ(nm^{1.5} + N) time.
Theorem 2.
Given a pattern P of length m and an ED-string T of length n and size N, EDSM with Mismatches, for any constant k, can be solved in Õ(nm^{1.5} + N) time.
Other Approaches.
The key ingredient of [4] and [22] to achieve the O(nm² + N) term for k = 1 is a new counterpart of the k-errata tree [11], where the copied nodes of the input trie are explicitly inserted into the tree. This counterpart is an actual trie, and hence it allows applying standard tree traversal algorithms. Since, for k = 1, the constructed trie for P and the suffixes of P has O(m log m) nodes originating from P, bitwise O(m/w)-time operations per such node result in the desired complexity. The main tool in [22] for extending k = 1 to a constant k > 1 is also k-errata trees; however, the authors of [22] manage to apply k-errata trees as a black-box. We stress that those algorithms are combinatorial (they do not use the fast Fourier transform or fast matrix multiplication) and work also for the edit distance.
Our Approach.
As in the previous works on elastic-degenerate string matching, we work with the so-called active prefixes extension problem. In this problem, we are given a text of length m, an input bitvector of length m, and a collection of patterns of total length N. The goal is to produce an output bitvector, also of length m. Informally, whenever there is an occurrence of some pattern starting at position α+1 and ending at position β, we can propagate a one from the input bitvector at position α to a one in the output bitvector at position β. For approximate pattern matching, we have k+1 input and k+1 output bitvectors, corresponding to matching prefixes with different (i.e., the corresponding) numbers of mismatches.
The previous solutions can be seen as propagating the information from every α to every β explicitly. This, of course, cannot achieve time complexity better than quadratic in m. Instead, we leverage the following high-level idea: if a given pattern occurs very few times in the text, then we can afford to iterate through all of its occurrences and propagate the corresponding information. Otherwise, its occurrences are probably somewhat structured. More concretely, exact occurrences of a pattern in a text of at most twice its length form a single arithmetic progression. This has been extended by Bringmann, Künnemann, and Wellnitz [7] to occurrences with mismatches, and further refined (and made effectively computable) by Charalampopoulos, Kociumaka, and Wellnitz [9]: either there are few occurrences, or they can be represented by a few arithmetic progressions (where few means polynomial in k). Further, a representation of all the occurrences can be computed efficiently.
To implement this high-level idea, we first apply some relatively standard preliminary steps that allow us to handle short patterns efficiently with k-errata trees. Further, we show how to reduce a given instance to multiple instances in which the pattern and the text are roughly of the same length. Then, we handle patterns with only a few occurrences naively. For the remaining patterns, we obtain a compact representation (as a few arithmetic progressions) of their occurrences. We cannot afford to process each progression separately, but we observe that, because we have restricted the length of the text, their differences are in fact equal to the same d for every remaining pattern. Now, if d is somewhat large (with the exact threshold to be chosen at the very end to balance the complexities), we can afford to process every occurrence of every pattern naively. Otherwise, we would like to work with every remainder modulo d separately, leveraging the fast Fourier transform to process all progressions starting at the positions with that remainder together. As a very simplified example, if the progressions were the same for all the patterns, we would only need to compute the sumset of the set of starting positions with a one in the input bitvector (restricted to positions with a specific remainder modulo d) with the set of lengths of the patterns. This can be indeed done with a single fast Fourier transform. However, the structural characterization of Charalampopoulos et al. [9] only says that, for every pattern, we have a few arithmetic progressions with the same period d; the progressions themselves are possibly quite different for different patterns. Our new insight is that, in fact, we can group the progressions into only a few classes (polynomially many in k, irrespectively of the number of patterns), and then process each class together. This requires looking more carefully at the structural characterization of Charalampopoulos et al. [9], and might be of independent interest.
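As a small illustration of the structural fact exploited here (our own toy sketch, not code from the paper), exact occurrences of a pattern in a text of at most twice the pattern's length form a single arithmetic progression whose difference is the pattern's period:

```python
def occurrences(p, t):
    # 1-indexed starting positions of exact occurrences of p in t
    return [i + 1 for i in range(len(t) - len(p) + 1) if t[i:i + len(p)] == p]

# When |t| <= 2|p|, the starting positions form one arithmetic progression
# whose difference is the (smallest) period of p.
p = "abaabaaba"            # period 3
t = "abaabaabaabaaba"      # |t| = 15 <= 2 * |p|
print(occurrences(p, t))   # [1, 4, 7], difference 3
```

The refinement discussed above concerns the analogous (but weaker) structure of occurrences with mismatches.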
Structure of the Paper.
In Section 2, we provide some preliminaries and problem definitions. In Section 3, we discuss how the EDSM with Mismatches problem can be solved via the APE with Mismatches problem, the auxiliary problem used also in previous solutions [2, 5, 4] and in Part A [22]. In Section 4, we present our algorithms for three different cases: very short patterns in Section 4.1; short patterns in Section 4.2; and, finally, long patterns in Section 4.3, which is the most interesting case. We conclude with balancing the thresholds in Section 4.4.
Computational Model.
We assume the standard Word RAM model with words consisting of w = Ω(log(N + m)) bits, where N + m is the size of the input. Basic operations on such words, such as indirect addressing and arithmetic operations, are thus assumed to take constant time.
2 Preliminaries
Strings.
Let Σ be a finite ordered alphabet of size σ. We will usually assume that σ is polynomial in the size of the input, which is called the polynomial alphabet assumption. The elements of Σ are called characters. A sequence of characters from Σ, S = S[1]S[2]⋯S[ℓ], is called a (classic) string. We call ℓ the length of S, and denote it by |S|. The empty string is denoted by ε. By S[i..j], we denote a fragment of S (starting at position i and ending at position j), which equals ε when i > j. Fragments of the form S[1..j] and S[i..|S|] are called prefixes and suffixes of S, respectively. A fragment of S (or its prefix/suffix) is called proper if it is not equal to S. Strings that are fragments of S (for some i and j) are called substrings of S. We also write S(i..j] to denote S[i+1..j]. By S·S' (or simply SS'), we denote the concatenation of strings S and S'. String S' is a cyclic shift of S when S = XY and S' = YX, for some strings X and Y, and then we call S and S' cyclically equivalent. We say that S' has an occurrence in S (at position i), if S = XS'Y for some strings X and Y such that |X| = i − 1. Finally, S^R is the reversal of S, i.e., the string S[|S|]⋯S[2]S[1].
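To make the 1-indexed conventions concrete, here is a throwaway Python sketch (our own illustration; all names are ours) of fragments and cyclic equivalence:

```python
def frag(S, i, j):
    # S[i..j], 1-indexed and inclusive; empty when i > j
    return S[i - 1:j]

def cyclically_equivalent(S, T):
    # S = XY and T = YX for some X, Y  <=>  |S| = |T| and T occurs in SS
    return len(S) == len(T) and T in S + S

S = "abcab"
assert frag(S, 1, 3) == "abc"          # a prefix
assert frag(S, 3, 5) == "cab"          # a suffix
assert frag(S, 4, 2) == ""             # empty fragment
assert cyclically_equivalent("abcab", "cabab")
assert not cyclically_equivalent("aab", "abb")
```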
Elastic-Degenerate Strings.
We study the following extensions of classic strings.
A symbol (over alphabet Σ) is an unordered subset of (classic) strings from Σ*, different from ∅ and {ε}. Note that symbols may contain ε, but not as their only element. The size of a symbol is the total length of all strings in the symbol (with the additional assumption that the empty string is counted as if it had length 1). The Cartesian concatenation of two symbols A and B is defined as A ⊗ B = {xy : x ∈ A, y ∈ B}.
An elastic-degenerate string (or ED-string, in short) T = T[1]T[2]⋯T[n] (over alphabet Σ) is a sequence of symbols (over Σ). We use n = |T| to denote the length of T, i.e., the length of the associated sequence (the number of its symbols). The size N of T is the sum of the sizes of the symbols in T. As for classic strings, we denote a fragment of T by T[i..j]. We similarly denote prefixes and suffixes of T. The language of T is L(T) = T[1] ⊗ T[2] ⊗ ⋯ ⊗ T[n].
Given a (classic) string P and an ED-string T, we say that P matches the fragment T[i..j] (or that an occurrence of P starts at position i and ends at position j of T), if i = j and P is a fragment of at least one of the strings of T[i] (the whole pattern is fully contained in one of the symbols), or if i < j and there is a sequence of strings S_i, S_{i+1}, …, S_j, such that: P = S_i S_{i+1} ⋯ S_j; S_i is a suffix of one of the strings of T[i]; S_ℓ ∈ T[ℓ], for all i < ℓ < j; and S_j is a prefix of one of the strings of T[j] (P uses parts of at least two symbols).
Hamming Distance.
Given two (classic) strings S and S' of the same length over alphabet Σ, their Hamming distance δ_H(S, S') is defined as the number of mismatches (i.e., the positions i such that S[i] ≠ S'[i]). We use Mis(S, S') to denote the set of mismatches.
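A direct transcription of these definitions (our own helper names, for illustration only):

```python
def hamming(S, T):
    # Hamming distance; defined only for strings of the same length
    assert len(S) == len(T)
    return sum(a != b for a, b in zip(S, T))

def mismatches(S, T):
    # the set Mis(S, T) of mismatching positions, 1-indexed
    return {i + 1 for i, (a, b) in enumerate(zip(S, T)) if a != b}

assert hamming("abcde", "abxdy") == 2
assert mismatches("abcde", "abxdy") == {3, 5}
```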
Given two (classic) strings P and S over alphabet Σ, we say that P is an approximate fragment (with at most k mismatches) of S if there is a string P' with δ_H(P, P') ≤ k, such that P' is a substring of S. We similarly define approximate prefixes and approximate suffixes. We write Occ_k(P, S) to denote the set of all k-mismatch approximate occurrences of P in S, i.e., all positions i in S, such that δ_H(P, S[i..i+|P|−1]) ≤ k.
Given a string P, an ED-string T and an integer k ≥ 0, we say that P approximately matches the fragment T[i..j] (with at most k mismatches) of T, or that an approximate occurrence of P starts at position i and ends at position j of T, if there is a string P' such that δ_H(P, P') ≤ k and P' matches T[i..j]. We stress that, as in the case of exact occurrences, each approximate occurrence of P in T is of one of the following forms: either P has Hamming distance at most k to a fragment of a string in a symbol of T; or it uses parts of at least two symbols of T. In the latter case, a prefix of P is an approximate (possibly empty) suffix of a string in T[i], a suffix of P is an approximate (possibly empty) prefix of a string in T[j], and the remaining fragments of the pattern are approximate matches of a string in all other used symbols of T (except the first and the last one).
Periodicity.
For a string Q, we write Q^∞ to denote the string Q concatenated infinitely many times with itself. We call a string Q primitive when it cannot be represented as R^j, for some string R and some integer j ≥ 2. We say that a string Q is a d-period with offset i of some other string S when δ_H(S, Q^∞[i+1..i+|S|]) ≤ d, for some i ∈ {0, 1, …, |Q|−1}; and we call the elements of Mis(S, Q^∞[i+1..i+|S|]) the periodic mismatches. If d = 0, then Q is an exact period of S, or just a period. Note that all cyclic shifts of Q are also (approximate or exact) periods, but with different offsets.
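The definition of a d-period with offset i can be checked naively as follows (our own illustrative sketch; Q^∞ is truncated to the needed length):

```python
def is_d_period_with_offset(Q, S, d, i):
    # Compare S with Q^infinity read from position i + 1 onwards
    reps = Q * (len(S) // len(Q) + 2)
    window = reps[i:i + len(S)]
    return sum(a != b for a, b in zip(S, window)) <= d

assert is_d_period_with_offset("ab", "ababaxab", 1, 0)  # one periodic mismatch
assert is_d_period_with_offset("ab", "bababab", 0, 1)   # exact period, offset 1
assert not is_d_period_with_offset("ab", "bababab", 0, 0)
```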
ED-string Matching.
As in Part A and in previous works (cf. [4]), we define the following problem; we assume the integer k ≥ 1 to be a fixed constant and not part of the input.
Problem: Elastic-Degenerate String Matching (EDSM) with Mismatches
Input:
A string P of length m and
an ED-string T of length n and size N.
Output: All positions j in T where at least one approximate occurrence
(with at most k mismatches) of P ends.
In the above problem, we call P the pattern and T the text.
Active Prefixes Extension.
As in Part A and in previous works (cf. [4]), we solve EDSM with Mismatches through the following auxiliary problem.
Problem: Active Prefixes Extension (APE) with Mismatches
Input:
A string S of length m, k+1 bitvectors U_0, …, U_k of size m each, and strings
P_1, …, P_r of total length N.
Output: k+1 bitvectors V_0, …, V_k of size m each, where V_j[β] = 1
if and only if there is a string P_i
and j' with 0 ≤ j' ≤ j and U_{j'}[α] = 1, such that β = α + |P_i| and δ_H(P_i, S(α..β]) ≤ j − j'.
In the above problem, we call S the text, and P_1, …, P_r the patterns.
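For concreteness, a quadratic-time reference solution of APE with Mismatches (our own sketch, useful only for sanity-checking faster algorithms; bitvectors are 0/1 lists indexed from 0, so U[j][a-1] corresponds to position a):

```python
def hamming(s, t):
    return sum(x != y for x, y in zip(s, t))

def ape_naive(S, U, patterns, k):
    m = len(S)
    V = [[0] * m for _ in range(k + 1)]
    for P in patterns:
        for j in range(k + 1):
            for a in range(1, m + 1):
                b = a + len(P)
                if U[j][a - 1] and b <= m:
                    d = hamming(P, S[a:b])     # S(a..b], 0-indexed slice
                    for jj in range(j + d, k + 1):
                        V[jj][b - 1] = 1       # position b reached with j + d mismatches
    return V

# Position 1 is active with 0 mismatches; "bxd" matches S(1..4] = "bcd"
# with one mismatch, so position 4 becomes active with 1 mismatch.
S, k = "abcde", 1
U = [[1, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
print(ape_naive(S, U, ["bxd"], k))  # [[0, 0, 0, 0, 0], [0, 0, 0, 1, 0]]
```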
3 EDSM with Mismatches via APE with Mismatches
We begin by showing how to reduce EDSM with Mismatches to multiple instances of APE with Mismatches. This does not require any new ideas, and it proceeds similarly to Part A and previous works (cf. [4]), so we only state it here for completeness.
As we mentioned before, each approximate occurrence of pattern P in ED-string T is:
1. either an approximate fragment of a string of a symbol; or
2. crossing the boundary between two consecutive symbols.
We explain how to detect the occurrences of each form separately.
Approximate Fragments of Symbols.
To check if the pattern is an approximate fragment of a string of a symbol, we test each symbol of T separately (cf. [6]). To this end, we apply the technique of Landau and Vishkin [20], informally referred to as the kangaroo jumps. First, we preprocess the concatenation of all the strings of all the symbols of T and the pattern with the following.
Lemma 3 (suffix tree [12] with LCA queries [18]).
A string S over a polynomial alphabet can be preprocessed in O(|S|) time to allow computing the longest common prefix of any two suffixes S[i..|S|] and S[j..|S|] of S in constant time.
Recall that pattern P is of length m. For a symbol T[i] with strings S_1, S_2, …, we consider each string S_j and, for every position p in S_j, check if δ_H(P, S_j[p..p+m−1]) ≤ k in O(k) time by repeatedly computing the longest common prefix of the remaining suffix of P and the remaining suffix of S_j. This takes O(kN) total time.
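The kangaroo-jump verification can be sketched as follows, with a naive LCP computation standing in for the constant-time queries of Lemma 3 (our own illustration):

```python
def lcp(s, t):
    # naive stand-in for the O(1)-time suffix-tree LCP of Lemma 3
    n = 0
    while n < len(s) and n < len(t) and s[n] == t[n]:
        n += 1
    return n

def at_most_k_mismatches(P, S, k):
    # "kangaroo jumps": each LCP query skips to the next mismatch,
    # so at most k + 1 queries are needed per verification
    i, mism = 0, 0
    while i < len(P):
        i += lcp(P[i:], S[i:])
        if i < len(P):
            mism += 1
            if mism > k:
                return False
            i += 1  # jump over the mismatch
    return True

assert at_most_k_mismatches("abcab", "abxab", 1)
assert not at_most_k_mismatches("abcab", "xbxab", 1)
```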
Crossing the Boundary between two Consecutive Symbols.
To check if P approximately matches a fragment T[i..j], for some positions i < j, we reduce the problem to multiple instances of APE with Mismatches. We iterate through the symbols of T left-to-right and maintain k+1 bitvectors U_0, …, U_k, each of size m, such that U_ℓ[p] = 1 when the prefix P[1..p] of P is an ℓ-approximate suffix of some string in L(T[1..i]), for the current i and every 0 ≤ ℓ ≤ k. Let N_i denote the size of T[i], i.e., the total length of all strings in T[i].
To proceed to the next iteration, and compute the bitvectors for T[1..i+1] from the bitvectors for T[1..i], we need to consider two possibilities. First, to consider the case when the ℓ-approximate suffix lies fully within T[i+1], for every 0 ≤ ℓ ≤ k, we find all prefixes of P that are ℓ-approximate suffixes of some string of T[i+1]. This is done in O(kN_{i+1}) time by iterating over all strings in T[i+1], considering for each of them every sufficiently short prefix of P, and computing the number of mismatches (terminating if they exceed k) using kangaroo jumps in O(k) time. Second, to consider the case when the ℓ-approximate suffix crosses the boundary between T[i] and T[i+1], we create and solve an instance of APE with Mismatches with the bitvectors representing the results for T[1..i] and the strings in T[i+1]. We take as the new bitvectors the bitwise-OR of the bitvectors corresponding to both cases.
Before proceeding to the next iteration, we need to detect an occurrence that crosses the boundary between T[i] and T[i+1] and ends within T[i+1]. To this end, we consider each string S ∈ T[i+1]. Then, for every ℓ and p such that U_ℓ[p] = 1 and m − p ≤ |S|, we check if P[p+1..m] is a (k−ℓ)-approximate prefix of S using kangaroo jumps in O(k) time, and if so, report position i+1 as an end position of a k-approximate occurrence. Because we only need to consider min(m, |S|) possibilities for p for each string S, this takes O(kN_{i+1}) time.
We summarize the complexity of the reduction in the following lemma.
Lemma 4.
Assume that APE with Mismatches, on a text of length m and patterns of total length N', can be solved in f(m, N') time, where f(m, ·) is superadditive. Then EDSM with Mismatches can be solved in O(m + kN + ∑_{i=1}^{n} f(m, N_i)) time, where N_i denotes the size of T[i].
4 Faster APE with Mismatches
We now move to designing efficient algorithms for APE with Mismatches, separately for k = 1 and then any constant k. Combined with the reduction underlying Lemma 4, this will result in Theorem 1 and Theorem 2. Recall that the input to an instance of APE with Mismatches consists of a string S (called the text) of length m and a collection of strings P_1, …, P_r (called the patterns) of total length N.
For k = 1, the strings are partitioned depending on their lengths and parameters ℓ_1 and ℓ_2 depending on m:
1. Very Short Case: the length of each string is at most ℓ_1,
2. Short Case: the length of each string is more than ℓ_1 and at most ℓ_2,
3. Long Case: the length of each string is more than ℓ_2.
We separately solve the three obtained instances of APE with Mismatches and return the bitwise-OR of the obtained bitvectors. For an arbitrary constant k, we have two cases, for a parameter ℓ depending on m:
1. Short Case: the length of each string is at most ℓ,
2. Long Case: the length of each string is more than ℓ.
4.1 Very Short Case (for k = 1)
An efficient algorithm for this case can be obtained by using suffix trees and separately considering exact occurrences and occurrences with one mismatch. For completeness, we provide the proof in the appendix. An alternative algorithm is given in Part A, Lemma 18.
Theorem 5.
An instance of APE with Mismatches where k = 1 and the length of each pattern is at most ℓ_1 can be solved in Õ(m·ℓ_1 + N) time.
4.2 Short Case
Recall that an instance of APE with Mismatches consists of the patterns P_1, …, P_r of total length N and the text S. Further, in this case, the length of each P_i is more than ℓ_1 but at most ℓ_2. We start with observing that, after Õ(N + m)-time preprocessing, we can assume that the number of the patterns (and hence also their total length) is polynomial in m, because we only need to keep patterns P_i such that δ_H(P_i, F) ≤ k, for some fragment F of S. This relates the number of the patterns to m and ℓ_2. Then, we state another known tool, and finally provide the algorithm.
Reducing the Number of Patterns.
Recall that we only need to keep patterns P_i such that δ_H(P_i, F) ≤ k, for some fragment F of S. If the number of the patterns is already at most the bound derived below, then there is nothing to do. Otherwise, we consider all O(m²) fragments of the text. For every such fragment F, we choose up to k positions where there is a mismatch, and for each of them we either choose a special character not occurring in S, represented by #, or a character occurring somewhere in S. Overall, we have at most O(m²·(ℓ_2·(m+1))^k) possibilities. For each of them, we construct the corresponding candidate string. Then, we sort the obtained candidate strings together with the patterns (in which every character not occurring in S is first replaced by #, discarding patterns with more than k such characters) with radix sort, in time linear in their total length. We scan the obtained sorted list and only keep patterns equal to some candidate string.
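A toy version of this filtering for k = 1 (our own sketch, for small inputs only; note how characters of a pattern that do not occur in the text are normalized to the single representative '#', which is what makes the comparison against candidate strings work):

```python
def surviving_patterns(T, patterns):
    # k = 1 for simplicity: keep only patterns within Hamming distance 1
    # of some fragment of T
    sigma = set(T)
    norm = lambda P: "".join(c if c in sigma else "#" for c in P)
    candidates = set()
    for i in range(len(T)):
        for j in range(i + 1, len(T) + 1):
            F = T[i:j]
            candidates.add(F)                      # zero mismatches
            for p in range(len(F)):                # one substituted position
                for c in sigma | {"#"}:
                    candidates.add(F[:p] + c + F[p + 1:])
    return [P for P in patterns if norm(P) in candidates]

print(surviving_patterns("abab", ["abab", "azab", "abba", "bzb", "zzzz"]))
# ['abab', 'azab', 'bzb']
```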
The k-errata Trie.
Cole, Gottlieb, and Lewenstein [11] considered the problem of preprocessing a dictionary of patterns of total length s for finding, given a query string F, whether F is at Hamming distance at most k from some pattern. We provide a brief overview of their approach, following the exposition in [16] that provides some details not present in the original description.
For k = 0, this can be of course easily solved with a structure of size O(s) and query time O(|F|) by arranging the patterns in a trie. For larger values of k, the k-errata trie is defined recursively. In every step of the recursion, the input is a collection of strings, each of them being a suffix of some pattern decorated with its mismatch budget, initially set to k. We arrange the strings in a compact trie, and then recurse guided by the heavy-path decomposition [24] of the trie. The depth of the recursion is k, and on each level the overall number of strings increases by a factor of O(log s), starting from the number of the patterns. Answering a query requires the following primitive: given a node of one of the compact tries and the remaining suffix of the query string F, we need to navigate down starting from the given node while reading off the subsequent characters of F. This needs to be done while avoiding explicitly scanning F, as such a primitive is invoked multiple times. For a compact trie storing suffixes of the patterns, such a primitive can be implemented by a structure of size O(s) with query time O(log log s), assuming that we know the position of every suffix of the query string in the suffix tree of the patterns (also known as the generalized suffix tree of the patterns).
In our application, the query string will always be a fragment of the text S. Thus, we can guarantee that the position of every suffix of the query string in the generalized suffix tree of the patterns is known by building the generalized suffix tree of the patterns and S together. This gives us the position of every suffix of S in the generalized suffix tree of the patterns, from which we can infer the position of any fragment of S. We summarize the properties of such an implementation below.
Lemma 6 ([11]).
For any constant k, a dictionary of patterns P_1, …, P_r of total length s and a text S of length m can be preprocessed in Õ(s + m) time to obtain a structure of size Õ(s), such that for any fragment F of S we can check in Õ(1) time whether δ_H(F, P_i) ≤ k, for some i.
Theorem 7.
For any constant k, an instance of APE with Mismatches where the length of each pattern is at least ℓ_1 and at most ℓ_2 can be solved in Õ(N + m·ℓ_2) time.
Proof.
We start with applying Lemma 6 on the patterns and the text S. Then, we iterate over every position α, length ℓ ∈ [ℓ_1, ℓ_2], and j such that U_j[α] = 1. Next, for every j' with j + j' ≤ k, we check if δ_H(S(α..α+ℓ], P_i) ≤ j' for some i. If so, we set V_{j+j'}[α+ℓ] = 1.
We analyze the overall time complexity. First, we need to construct the k-errata trie for the patterns and S. This takes Õ(N + m) time. Then, we consider O(m·ℓ_2) possibilities for iterating over the position and the length, and for each of them spend Õ(1) time. As each P_i is of length at least ℓ_1, after the preprocessing the number of the patterns and their total length are polynomial in m, and the overall complexity is:
Õ(N + m + m·ℓ_2) = Õ(N + m·ℓ_2),
as claimed.
4.3 Long Case
In the most technical case, we assume that the length of each pattern is at least ℓ. We start with providing an overview, and then move to filling in the technical details.
The very high-level idea is to explicitly or implicitly process all occurrences of every pattern P_i. If a given pattern occurs sufficiently few times in the text, then we can afford to list and process each of its occurrences explicitly. Otherwise, we invoke the structural characterization of [9], which, roughly speaking, says that if there are many approximate occurrences of the same string sufficiently close to each other in the text, then the string and the relevant fragment of the text have a certain regular structure. Thus, we can certainly hope to process all occurrences of such a pattern together, faster than by considering each of those occurrences one-by-one. However, this alone would not result in a speed-up, and in fact, we need to consider multiple such patterns together. To this end, we need to further refine the characterization of [9]. Before we proceed with a description of our refinement, we start with a summary of the necessary tools from [9]. Then, we introduce some notation and simplifying assumptions, and finally describe our refinement.
Tools.
The authors of [9] phrase their algorithmic results using the framework of PILLAR operations. In this framework, we operate on strings, each of them specified by a handle. For two strings X and Y, the following operations are possible (among others):
1. Extract: retrieve any fragment X[i..j],
2. LCP: compute the length of the longest common prefix of X and Y,
3. IPM: assuming that |Y| ≤ 2|X|, return the starting positions of all exact occurrences of X in Y (at most two starting positions or an arithmetic progression of starting positions).
Lemma 8 ([9, Theorem 7.2]).
After an O(s)-time preprocessing of a collection of strings of total length s, each PILLAR operation can be performed in O(1) time.
We apply the above lemma on the text and all the patterns, in O(m + N) time.
The first main result of [9] is the following structural characterization.
Lemma 9 ([9, Theorem 3.1]).
For each pattern P of length p, assuming |S| ≤ 3p/2, at least one of the following holds:
- |Occ_k(P, S)| ≤ 864k,
- There is a primitive string Q of length at most p/(128k) such that δ_H(P, Q^∞[1..p]) ≤ 2k.
Then, they convert the structural characterization (Lemma 9) into an efficient algorithm.
Lemma 10 ([9, Main Theorem 8]).
For any pattern P of length p, assuming |S| ≤ 3p/2, we can compute (a representation of) Occ_k(P, S) in O(k²) time plus O(k²) PILLAR operations.
The representation is a set of arithmetic progressions. Further, as the algorithm follows the proof of Lemma 9, in fact it either outputs a set of at most 864k occurrences or finds a primitive string Q of length at most p/(128k) such that δ_H(P, Q^∞[1..p]) ≤ 2k.
Notation.
We rephrase the APE with Mismatches problem as follows. For a pattern P, an integer j ≥ 0, and a set B of positions in the text S, we define:
ext_j(P, B) = { α + |P| : α ∈ B and δ_H(P, S(α..α+|P|]) ≤ j },
which is also a set of positions in S. Then, APE with Mismatches for a given set of patterns and a text S can be solved as follows. For every j ∈ {0, 1, …, k}, we set B_j to be {α : U_j[α] = 1}. Then, for every pair j, j' ≥ 0 with j + j' ≤ k, we create an instance of computing:
⋃_i ext_{j'}(P_i, B_j),
where the union ranges over all the patterns. The obtained bitvector contributes to the bitvector V_{j+j'}. From now on, we focus on designing an algorithm that computes such a union, and identify the underlying sets with their characteristic bitvectors. For two sets of integers X and Y, we define their sumset X ⊕ Y = {x + y : x ∈ X, y ∈ Y}. For a set of integers X and a shift δ, we define X + δ = {x + δ : x ∈ X}.
Lemma 11 (e.g. [14]).
Given X, Y ⊆ {0, 1, …, m}, we can compute X ⊕ Y in O(m log m) time.
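Lemma 11 can be implemented with any fast convolution; as a compact stand-in (our own sketch), the following uses Python's big-integer multiplication, which is itself subquadratic, to convolve the two characteristic vectors, with enough bits per coefficient so that counts cannot overflow:

```python
def sumset(X, Y, M):
    # Compute X ⊕ Y = {x + y : x in X, y in Y} for X, Y ⊆ {0, ..., M}.
    # Big-integer multiplication plays the role of the FFT-based convolution.
    B = 2 * (M + 1).bit_length() + 2          # bits per coefficient
    px = sum(1 << (B * x) for x in X)
    py = sum(1 << (B * y) for y in Y)
    prod = px * py
    mask = (1 << B) - 1
    return {s for s in range(2 * M + 1) if (prod >> (B * s)) & mask}

assert sumset({1, 4, 7}, {0, 2}, 10) == {1, 3, 4, 6, 7, 9}
```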
Simplifying Assumptions.
It is convenient to assume that each pattern has roughly the same length, similar to the length of the text. More formally, our algorithm will assume that:
1. ℓ' ≤ |P_i| < 5ℓ'/4, for every i,
2. |S| ≤ 3ℓ'/2,
for some ℓ' ≥ ℓ. Any instance can be reduced to instances in which ℓ_q ≤ |P_i| < 5ℓ_q/4, for every i, by considering ℓ_q = (5/4)^q·ℓ, for q = 0, 1, …, O(log m). For each such q, we create a separate instance containing only patterns of length from [ℓ_q, 5ℓ_q/4). As each pattern falls within exactly one such instance, the running times of the algorithms for all such instances simply add up for a general instance. To additionally guarantee that the text is of length at most 3ℓ_q/2 (so that Lemma 9 can be directly applied), we choose fragments of S such that each potential occurrence of a pattern in S falls within some fragment. Formally, if F_j is the (3ℓ_q/2)-length fragment (possibly shorter for the very last fragment) starting at position (j−1)·ℓ_q/4 + 1, then:
every fragment of S of length less than 5ℓ_q/4 is fully contained in some F_j,
where we disregard fragments shorter than ℓ_q, as they cannot contain an occurrence of any P_i. From now on, always assume that we deal with a single text S, with |S| ≤ 3ℓ'/2, and a set of patterns with lengths in [ℓ', 5ℓ'/4) (we will sometimes omit the index of a pattern and simply write P). The preprocessing from Lemma 8 is performed only once, and then in each instance we assume that any PILLAR operation can be performed in O(1) time. The input bitvectors in such an instance are fragments of the original input bitvectors, and after computing the output bitvectors we update appropriate fragments of the original output bitvectors by computing bitwise-OR. The final number of restricted instances is O(m/ℓ), and each original pattern appears in O(m/|P_i|) instances.
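A quick sanity check of the covering property (our own sketch, writing the scale ℓ' as L and assuming, for simplicity, that 4 divides L): windows of length 3L/2 starting every L/4 contain every interval of length at most 5L/4.

```python
def covering_window(a, length, L, text_len):
    # Window of length 3L/2 starting on the grid of step L/4 that contains
    # the interval [a, a + length - 1]; requires L <= length <= 5L/4.
    assert L % 4 == 0 and L <= length <= 5 * L // 4
    step = L // 4
    start = ((a - 1) // step) * step + 1
    end = start + 3 * L // 2 - 1
    assert start <= a and a + length - 1 <= end   # the covering property
    return (start, min(end, text_len))

print(covering_window(5, 10, 8, 100))  # (5, 16)
print(covering_window(6, 10, 8, 100))  # (5, 16)
```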
Consider a restricted instance containing r' patterns. Our goal will be to solve it in Õ(ℓ' + r') time. Before we proceed to describe such an algorithm, we analyze what this time implies for an algorithm solving the original instance.
Theorem 12.
For any ℓ, an instance of APE with Mismatches where the length of each pattern is at least ℓ can be solved in Õ(m + N + m·N/ℓ²) time.
Proof.
We assume that a restricted instance with r' patterns and parameter ℓ' can be solved in Õ(ℓ' + r') time, and describe an algorithm for solving a general instance of APE with Mismatches.
Let r_q denote the number of patterns of length from [ℓ_q, 5ℓ_q/4), where ℓ_q = (5/4)^q·ℓ, in the original instance, and let N_q denote their total length. Recall that we only consider q such that ℓ_q ≤ m. After the initial Õ(m + N)-time preprocessing, ignoring factors polynomial in log m, the total time to solve the restricted instances is:
∑_q O(m/ℓ_q) · Õ(ℓ_q + r_q),
where we have used that there are O(m/ℓ_q) restricted instances corresponding to q, and that each of them contains at most r_q patterns. We split the sum by separately considering the two terms. The first contributes ∑_q O(m/ℓ_q)·Õ(ℓ_q) = Õ(m), as there are O(log m) values of q. For the second, using r_q ≤ N_q/ℓ_q, i.e., m·r_q/ℓ_q ≤ m·N_q/ℓ², this gives us:
∑_q Õ(m·r_q/ℓ_q) = Õ(m·N/ℓ²).
Thus, as long as we indeed manage to solve a restricted instance in the promised complexity, we obtain the theorem.
In what follows, we describe an algorithm for solving a restricted instance of APE with Mismatches containing r' patterns in Õ(ℓ' + r') time.
Additional Assumptions.
We start with applying Lemma 10 on every pattern P to obtain a representation of its occurrences in S, in Õ(r') total time. As mentioned earlier, the algorithm underlying Lemma 10 either outputs a set of at most 864k occurrences of P in S or finds a primitive string Q that is a 2k-period of P, with |Q| ≤ |P|/(128k) ≤ |S|/(128k) (note that the second inequality holds because |P| ≤ |S|). In the latter case, we also obtain a representation of the whole set of occurrences as arithmetic progressions.
If there are occurrences of in , then we process each of them naively in time. From now on we can thus assume otherwise for every pattern . Then, we consider the text , and ensure that it is fully covered by approximate occurrences of the patterns:
-
some pattern is a -mismatch prefix of , formally ; and
-
some pattern is a -mismatch suffix of , formally .
This is guaranteed by removing some prefix and some suffix of the text; it can be implemented in time by extracting the first and the last occurrence from each arithmetic progression in the representation. Then the following claim can be inferred from the characterization of the period case in [9]. We provide a proof in the appendix for completeness.
Lemma 13.
All s are cyclically equivalent, and every is a -period of the text .
We choose to be a cyclic shift of the period that we got for the pattern , so is a -period of every pattern, and a -period with offset 0 of the text. This can be implemented in time as follows. For every , because and , we have for some . Further, such a can be computed in time by trying every candidate and verifying each of them with two longest common prefix computations. Overall, this takes time. We start with setting . Then, we search for a cyclic shift of such that . To this end, we check all possible cyclic shifts. To verify whether is a good cyclic shift, we extract the mismatches between and , terminating when there are more than . The next mismatch can be found in constant time by first computing the longest common prefix of the remaining suffix of with an appropriate cyclic shift of , and, if there is none, by computing the longest common prefix of the remaining suffix of with the suffix shortened by characters. After having found , we also compute, for every pattern , an integer such that , which can be done with a single internal pattern matching query to find an occurrence of in .
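The search for a suitable cyclic shift can be sketched with the classical doubling trick: a string is a rotation of another if and only if it occurs in that string concatenated with itself. The function name `cyclic_shift` and the quadratic `find` are illustrative assumptions; the paper verifies candidate shifts with constant-time longest common prefix queries instead.

```python
def cyclic_shift(q, r):
    """Return s such that r equals q rotated left by s, or None if r is
    not a rotation of q.  A string r is a rotation of q iff r occurs in
    q + q; the position of the first occurrence is the shift."""
    if len(q) != len(r):
        return None
    s = (q + q).find(r)
    return s if 0 <= s < len(q) else None
```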
The Algorithm.
To obtain an efficient algorithm, we will partition the set of all positions into consecutive regions with the property that if we restrict the text to any region , the corresponding fragment is almost periodic with respect to ; more specifically, it may have a single periodic mismatch at the rightmost position. Then, for each pair of regions and , with , we separately calculate the set of extensions induced by occurrences of the pattern starting at positions such that :
Since , this allows us to reduce the problem to separate instances of calculating:
Consider a single pattern , and recall that by the additional assumptions, we have a primitive string of length such that:
Further, let be the positions congruent to modulo in the text. We start with recalling from [9] that the positions of all -mismatch occurrences of in are congruent modulo . We provide a proof in the appendix for completeness.
Lemma 14.
.
Following Lemma 14, choose such that . In order to characterize , let us now analyze the values for . From triangle inequality, we have
(1)
Observe that
(2)
since both strings have a period with offsets congruent modulo , which gives us
(3)
We will later show that the above inequality is in fact an equality for all except for exceptions . Specifically, define
which is the set of all starting positions in such that when comparing and , at least one pair of mismatches aligns with each other. Note that . Finally, we define . Let:
-
denote the sorted elements of ,
-
for all , where we set and .
This is illustrated in Figure 1.
The elements of can be computed in time by extracting the mismatches with longest common prefix queries, which allows us to find the regions in time. Similarly, we can compute the elements of in time, so we can also compute the set in time. We stress that the regions are the same for every considered pattern (but the set does depend on the pattern ).
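A naive sketch of extracting the periodic mismatches that delimit the regions; the paper replaces the linear scan with constant-time longest common prefix queries, but the set of positions computed is the same:

```python
def periodic_mismatches(t, p):
    """Positions i with t[i] != t[i + p]: the 'periodic mismatches'
    that delimit the almost-periodic regions of the text."""
    return [i for i in range(len(t) - p) if t[i] != t[i + p]]

# A text that is periodic with period 2 except for one corrupted
# character, which induces a short run of periodic mismatches.
mis = periodic_mismatches("abababXabab", 2)
```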
Now we can state our extension of the structural characterization of [9].
Lemma 15.
There exist an integer and a set , with , such that, for each pair , with :
Proof.
For any the resulting set is trivially empty. Now let us fix any and define
We will show that either or , depending on whether by using the following property:
Proposition 16.
For all , we have .
Proof.
We know that
-
, so ,
-
, so ,
therefore
and finally
Now assume that . In that case, recall that by (3) combined with Proposition 16, for every , we have
Consequently , and therefore . In that case observe that
where the third equality follows from and the fifth from . Since as desired, the proof for this case is complete.
For the second case, when , we need to make use of the following property:
Proposition 17.
For all , we have
Proof.
Consider the triangle inequality (1) stated explicitly for each position :
From (2) we already know that , thus
We will now show that the above inequality holds with equality by considering two cases. The proof is completed by summing the equations for all .
-
. In that case, observe that by triangle inequality
and similarly, since and , we again get
It remains to show that every falls into at least one of these two cases. Indeed, if for some we had and , then by the definition of , we would obtain , which is a contradiction.
By Proposition 17, combined with Proposition 16 for every , we have
therefore , which implies . Following the same reasoning as before, up to the fourth equality, we get
as required.
We apply Lemma 15 on every pattern . Whenever the second case applies, we process all occurrences of naively. We observe that by definition we have
Since we already have access to the previously calculated representation of , we can simply check for each element in whether it is in in time, as the representation of is of size , so time in total.
For the remaining patterns that fall into the first case, we still use the naive approach if for some threshold to be chosen later. Since , this takes time per pattern. Otherwise, . We partition the remaining patterns into groups with the same . Formally, let denote the set of patterns with a specific value of :
We calculate the result for each separately by phrasing it as a sumset of some common set of positions with the set of pattern lengths, where the result is then truncated to :
This takes total time by Lemma 11. Overall, the time complexity is , which by choosing becomes as promised.
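The sumset computation underlying Lemma 11 can be sketched via one big-integer multiplication (Kronecker substitution), which plays the same role as FFT-based convolution: encode each set as a polynomial, multiply, and read off the nonzero coefficients. The block size `blk = 32` is an illustrative assumption bounding the coefficients (each is at most the product of the set sizes).

```python
def sumset(a, b, blk=32):
    """Sumset {x + y : x in a, y in b} of two sets of nonnegative
    integers, via a single big-integer product standing in for the
    FFT-based convolution.  Coefficients must fit in blk bits."""
    pa = sum(1 << (x * blk) for x in a)
    pb = sum(1 << (y * blk) for y in b)
    prod, mask, out = pa * pb, (1 << blk) - 1, set()
    i = 0
    while prod >> (i * blk):
        if (prod >> (i * blk)) & mask:   # coefficient of x^i is nonzero
            out.add(i)
        i += 1
    return out
```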
4.4 Combining the Cases
After designing an algorithm for every case, we show how to combine them to obtain the claimed bounds.
See 1
Proof.
By Lemma 4, to prove the theorem it is enough to show how to solve APE with Mismatches, where , in time. We choose and . For patterns of length at most , we use Theorem 5. For patterns of length at least but at most , we use Theorem 7. Finally, for patterns of length at least , we use Theorem 12. Summing up the time complexities, we obtain
as required.
See 2
Proof.
References
- [1] Josh Alman, Ran Duan, Virginia Vassilevska Williams, Yinzhan Xu, Zixuan Xu, and Renfei Zhou. More asymmetry yields faster matrix multiplication. In Yossi Azar and Debmalya Panigrahi, editors, Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2025, New Orleans, LA, USA, January 12-15, 2025, pages 2005–2039. SIAM, 2025. doi:10.1137/1.9781611978322.63.
- [2] Kotaro Aoyama, Yuto Nakashima, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Faster online elastic degenerate string matching. In Gonzalo Navarro, David Sankoff, and Binhai Zhu, editors, Annual Symposium on Combinatorial Pattern Matching, CPM 2018, July 2-4, 2018 - Qingdao, China, volume 105 of LIPIcs, pages 9:1–9:10. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2018. doi:10.4230/LIPICS.CPM.2018.9.
- [3] Rocco Ascone, Giulia Bernardini, Alessio Conte, Massimo Equi, Estéban Gabory, Roberto Grossi, and Nadia Pisanti. A unifying taxonomy of pattern matching in degenerate strings and founder graphs. In Solon P. Pissis and Wing-Kin Sung, editors, 24th International Workshop on Algorithms in Bioinformatics, WABI 2024, September 2-4, 2024, Royal Holloway, London, United Kingdom, volume 312 of LIPIcs, pages 14:1–14:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPICS.WABI.2024.14.
- [4] Giulia Bernardini, Estéban Gabory, Solon P. Pissis, Leen Stougie, Michelle Sweering, and Wiktor Zuba. Elastic-degenerate string matching with 1 error or mismatch. Theory Comput. Syst., 68(5):1442–1467, 2024. doi:10.1007/S00224-024-10194-8.
- [5] Giulia Bernardini, Pawel Gawrychowski, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Elastic-degenerate string matching via fast matrix multiplication. SIAM J. Comput., 51(3):549–576, 2022. doi:10.1137/20M1368033.
- [6] Giulia Bernardini, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Approximate pattern matching on elastic-degenerate text. Theor. Comput. Sci., 812:109–122, 2020. doi:10.1016/J.TCS.2019.08.012.
- [7] Karl Bringmann, Marvin Künnemann, and Philip Wellnitz. Few matches or almost periodicity: Faster pattern matching with mismatches in compressed texts. In Timothy M. Chan, editor, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 1126–1145. SIAM, 2019. doi:10.1137/1.9781611975482.69.
- [8] Thomas Büchler, Jannik Olbrich, and Enno Ohlebusch. Efficient short read mapping to a pangenome that is represented by a graph of ED strings. Bioinform., 39(5), 2023. doi:10.1093/BIOINFORMATICS/BTAD320.
- [9] Panagiotis Charalampopoulos, Tomasz Kociumaka, and Philip Wellnitz. Faster approximate pattern matching: A unified approach. In Sandy Irani, editor, 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, Durham, NC, USA, November 16-19, 2020, pages 978–989. IEEE, 2020. doi:10.1109/FOCS46700.2020.00095.
- [10] Aleksander Cislak, Szymon Grabowski, and Jan Holub. Sopang: online text searching over a pan-genome. Bioinform., 34(24):4290–4292, 2018. doi:10.1093/BIOINFORMATICS/BTY506.
- [11] Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don’t cares. In László Babai, editor, Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 91–100. ACM, 2004. doi:10.1145/1007352.1007374.
- [12] Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS ’97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137–143. IEEE Computer Society, 1997. doi:10.1109/SFCS.1997.646102.
- [13] Nathan J. Fine and Herbert S. Wilf. Uniqueness theorems for periodic functions. Proceedings of the American Mathematical Society, 16:109–114, 1965. doi:10.1090/S0002-9939-1965-0174934-9.
- [14] Martin Fürer. How fast can we multiply large integers on an actual computer? In Alberto Pardo and Alfredo Viola, editors, LATIN 2014: Theoretical Informatics - 11th Latin American Symposium, Montevideo, Uruguay, March 31 - April 4, 2014. Proceedings, volume 8392 of Lecture Notes in Computer Science, pages 660–670. Springer, 2014. doi:10.1007/978-3-642-54423-1_57.
- [15] Esteban Gabory, Moses Njagi Mwaniki, Nadia Pisanti, Solon P. Pissis, Jakub Radoszewski, Michelle Sweering, and Wiktor Zuba. Pangenome comparison via ED strings. Frontiers in Bioinformatics, 4, 2024. doi:10.3389/fbinf.2024.1397036.
- [16] Pawel Gawrychowski, Gad M. Landau, and Tatiana Starikovskaya. Fast entropy-bounded string dictionary look-up with mismatches. In Igor Potapov, Paul G. Spirakis, and James Worrell, editors, 43rd International Symposium on Mathematical Foundations of Computer Science, MFCS 2018, August 27-31, 2018, Liverpool, UK, volume 117 of LIPIcs, pages 66:1–66:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2018. doi:10.4230/LIPICS.MFCS.2018.66.
- [17] Roberto Grossi, Costas S. Iliopoulos, Chang Liu, Nadia Pisanti, Solon P. Pissis, Ahmad Retha, Giovanna Rosone, Fatima Vayani, and Luca Versari. On-line pattern matching on similar texts. In Juha Kärkkäinen, Jakub Radoszewski, and Wojciech Rytter, editors, 28th Annual Symposium on Combinatorial Pattern Matching, CPM 2017, July 4-6, 2017, Warsaw, Poland, volume 78 of LIPIcs, pages 9:1–9:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPICS.CPM.2017.9.
- [18] Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J. Comput., 13(2):338–355, 1984. doi:10.1137/0213024.
- [19] Costas S. Iliopoulos, Ritu Kundu, and Solon P. Pissis. Efficient pattern matching in elastic-degenerate strings. Inf. Comput., 279:104616, 2021. doi:10.1016/J.IC.2020.104616.
- [20] Gad M. Landau and Uzi Vishkin. Efficient string matching with k mismatches. Theor. Comput. Sci., 43:239–249, 1986. doi:10.1016/0304-3975(86)90178-7.
- [21] Veli Mäkinen, Bastien Cazaux, Massimo Equi, Tuukka Norri, and Alexandru I. Tomescu. Linear time construction of indexable founder block graphs. In Carl Kingsford and Nadia Pisanti, editors, 20th International Workshop on Algorithms in Bioinformatics, WABI 2020, September 7-9, 2020, Pisa, Italy (Virtual Conference), volume 172 of LIPIcs, pages 7:1–7:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPICS.WABI.2020.7.
- [22] Solon P. Pissis, Jakub Radoszewski, and Wiktor Zuba. Faster approximate elastic-degenerate string matching – Part A. In Paola Bonizzoni and Veli Mäkinen, editors, 36th Annual Symposium on Combinatorial Pattern Matching, CPM 2025, June 17-19, 2025, Milan, Italy, volume 331 of LIPIcs, pages 28:1–28:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2025. doi:10.4230/LIPIcs.CPM.2025.28.
- [23] Milan Ruzic. Constructing efficient dictionaries in close to sorting time. In Luca Aceto, Ivan Damgård, Leslie Ann Goldberg, Magnús M. Halldórsson, Anna Ingólfsdóttir, and Igor Walukiewicz, editors, Automata, Languages and Programming, 35th International Colloquium, ICALP 2008, Reykjavik, Iceland, July 7-11, 2008, Proceedings, Part I: Track A: Algorithms, Automata, Complexity, and Games, volume 5125 of Lecture Notes in Computer Science, pages 84–95. Springer, 2008. doi:10.1007/978-3-540-70575-8_8.
- [24] Daniel Dominic Sleator and Robert Endre Tarjan. A data structure for dynamic trees. J. Comput. Syst. Sci., 26(3):362–391, 1983. doi:10.1016/0022-0000(83)90006-5.
- [25] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1–11. IEEE Computer Society, 1973. doi:10.1109/SWAT.1973.13.
Appendix A Very Short Case (for )
Before stating the algorithm, we need a few standard tools.
Suffix Tree.
A trie is a (rooted) tree, where every edge is labeled with a single character. Each node of a trie represents the string obtained by concatenating the labels on its path from the root. We consider only deterministic tries, meaning that the labels of all edges outgoing from the same node are pairwise distinct. Then, a compact trie is obtained from a trie by collapsing maximal downward paths on which every inner node has exactly one child. The remaining nodes are called explicit, and the nodes that have been removed while collapsing the paths are called implicit. In a compact trie, every edge is labeled with a nonempty string, and the first characters of all edges outgoing from the same node are pairwise distinct.
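A minimal deterministic trie in this spirit can be sketched as follows; this is an illustrative sketch only, and it omits the compaction of unary paths that turns a trie into a compact trie:

```python
class Trie:
    """A deterministic (uncompacted) trie: every edge carries a single
    character, and the labels of edges leaving a node are pairwise
    distinct (enforced here by using a dictionary of children)."""
    def __init__(self):
        self.children = {}
        self.terminal = False

    def insert(self, word):
        node = self
        for c in word:
            node = node.children.setdefault(c, Trie())
        node.terminal = True

    def contains_prefix(self, word):
        """Check whether word labels a downward path from the root."""
        node = self
        for c in word:
            if c not in node.children:
                return False
            node = node.children[c]
        return True

t = Trie()
for w in ("banana", "band"):
    t.insert(w)
```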
The suffix tree of a string is the compact trie of all the suffixes of , where is a special character not occurring anywhere in [25]. Thus, there are leaves in the suffix tree of , and it contains nodes and edges. The label of each edge is equal to some fragment , and we represent it by storing and ; thus the whole suffix tree needs only space. For constructing the suffix tree we apply the following result.
Lemma 18 ([12]).
The suffix tree of a string over a polynomial alphabet can be constructed in time.
The suffix tree of string , denoted by , allows us to easily check if any string is a substring of by starting at the root and simulating navigating down in the trie storing the suffixes of . In every step, we need to choose the outgoing edge labeled by the next character . If the current node is implicit, this is trivial to implement in constant time. Otherwise, we might have multiple outgoing edges, and we need to store the first characters of their edges in an appropriate structure. To this end, we use deterministic dictionaries.
Lemma 19 ([23, Theorem 3]).
Given a set of integer keys, we can build in time a structure of size , such that given any integer we can check in time if , and if so return its associated information.
We apply Lemma 19 at each explicit node of the suffix tree. This takes total time, and then allows us to implement navigating down in total time. This also gives us, for each prefix , its unique identifier: if we are at an explicit node then it is simply its preorder number, and otherwise it is a pair consisting of the preorder number of the nearest explicit ancestor and the length of the current prefix. If in any step we cannot proceed further, we set the identifier to null denoting that the prefix does not occur in . Such identifiers have the property that the identifier of is null if and only if does not occur in , and the identifiers of and that both occur in are equal if and only if the strings themselves are equal. Further, we can think that each identifier is an integer from .
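The identifier contract can be illustrated with a naive dictionary over all substrings, a quadratic-space stand-in for the suffix-tree identifiers described above; equal substrings receive equal identifiers, and absent strings map to a null identifier:

```python
def make_identifier(t):
    """Map every substring of t to a small integer identifier: equal
    substrings get equal identifiers, and strings not occurring in t
    get None.  Quadratic space, for illustration only; the suffix tree
    achieves the same contract in linear space."""
    ids = {}
    for i in range(len(t)):
        for j in range(i + 1, len(t) + 1):
            s = t[i:j]
            if s not in ids:
                ids[s] = len(ids)
    return lambda s: ids.get(s)

ident = make_identifier("banana")
```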
See 5
Proof.
We assume that the length of each pattern is at most . Recall that we are given bitvectors and the goal is to compute the bitvectors . This will be done by explicitly listing all fragments such that , for some , and , for some . For each such fragment, we propagate the appropriate information from the input bitvectors to the output bitvectors.
We begin with constructing the suffix trees of and , including the dictionary structures at each explicit node. Then, we distinguish two cases as follows.
Exact Occurrences.
For each pattern , we find its identifier in in time. If the identifier is non-null then we include it in a set .
We iterate over every position and length . While we iterate over the lengths, we simultaneously navigate in to maintain the identifier of . To check if for some , we thus need to check if a specific identifier belongs to . Recall that the identifiers are integers from . To avoid paying extra logarithmic factors or using randomization, we answer all such queries together: we gather all the queries, sort the elements of and the queries together with radix sort, and then scan the obtained sorted list, obtaining the answer to each query in linear total time. Finally, if for some and , then we set , for .
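The offline membership scheme can be sketched as follows; Python's comparison sort stands in for the radix sort that the paper uses to keep the running time linear:

```python
def batch_membership(elements, queries):
    """Answer many membership queries offline: tag elements and queries,
    sort everything together, and scan the merged list.  Elements sort
    before queries with the same key, so a single left-to-right scan
    suffices."""
    tagged = [(x, 0, -1) for x in elements] + \
             [(q, 1, i) for i, q in enumerate(queries)]
    tagged.sort()
    ans = [False] * len(queries)
    present, prev = False, object()
    for key, kind, idx in tagged:
        if key != prev:
            present, prev = False, key
        if kind == 0:
            present = True       # saw an element with this key
        else:
            ans[idx] = present   # answer the query with this key
    return ans
```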
Occurrences with one Mismatch.
For each pattern , we iterate over every position , assuming that the mismatch is at position . We would like to have access to the identifiers of and . This can be guaranteed by first navigating in to compute the identifier of every prefix in time, and similarly navigating in to compute the identifier of the reversal of every suffix . After such a preliminary step, for every position , if both identifiers are non-null then we form a pair consisting of the identifier of and the identifier of . Let denote the obtained set of pairs.
We iterate over every position , position and position , where is the considered fragment of and is the position of the mismatch. We would like to have access to the identifier of in and the identifier of in . This can be assumed without increasing the time complexity by first iterating over (in any order), then over in the increasing order, and finally over in decreasing order, all while simultaneously navigating in and , respectively. With the identifiers at hand, we need to check if the pair consisting of the identifier of and the identifier of belongs to . Similarly as for exact occurrences, this is done by answering the queries together with radix sort. Then, if , for some , and , we set .
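A naive sketch of the (prefix identifier, reversed-suffix identifier) pairing described above, with substrings serving as their own identifiers; the paper's suffix-tree identifiers make each comparison constant-time, and the batched radix-sort lookup replaces the set membership test used here:

```python
def one_mismatch_positions(patterns, t):
    """Starting positions in t where some pattern occurs with at most
    one mismatch, via the (prefix, reversed suffix) pairing: guessing
    the mismatch position j splits both the pattern and the fragment
    into a prefix and a suffix that must match exactly."""
    pairs = {(p[:j], p[j + 1:][::-1]) for p in patterns
             for j in range(len(p))}
    lengths = {len(p) for p in patterns}
    out = set()
    for m in lengths:
        for i in range(len(t) - m + 1):
            for j in range(m):  # guessed mismatch position in the fragment
                if (t[i:i + j], t[i + j + 1:i + m][::-1]) in pairs:
                    out.add(i)
    return out
```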
Summary.
The algorithm described above consists of the following steps. First, we need to construct the suffix trees of and in time. Constructing the deterministic dictionaries storing the outgoing edges takes time. Second, listing and processing the exact occurrences takes time. Third, listing and processing occurrences with one mismatch takes time.
Appendix B Omitted Proofs
See 13
Proof.
We first prove that the periods obtained for all the patterns must be cyclically equivalent. Let be the middle part of . Since all patterns are of length and the text is of length , all pattern occurrences must cover the middle part of . Recall that we assume that every pattern has some -period . By triangle inequality, every must be a -period of . We will first show that if the strings are primitive and of length , then they are all cyclically equivalent. Select any two such periods of , denoted by and , and assume (only to avoid clutter) that both of their offsets are equal to .
First, assume that and are not of the same length. Observe that since the size of the combined set of periodic mismatches is at most , there must exist a substring of , not containing any such mismatch, of length at least
The strings and are thus exact periods of . In addition we have
which by the periodicity lemma of Fine and Wilf [13] induces a period of length , and contradicts the assumption that and are primitive.
In the other case, when , assume that . We would then have
On the other hand, by triangle inequality
which again gives us a contradiction and proves that must be equivalent to .
Since we have assumed that some is a -mismatch prefix of and some is a -mismatch suffix of , both having -period , it can be proven with similar arguments that is a -period of .
See 14
Proof.
For any , by triangle inequality, we have
and since
-
,
-
,
-
we get
Recall that and both have a primitive period (with offsets and , respectively). If their offsets are not congruent modulo , we can bound the number of mismatches by
which yields a contradiction (the second inequality follows from ). Therefore