Abstract 1 Introduction 2 Preliminaries 3 EDSM with 𝒌 Mismatches via APE with 𝒌 Mismatches 4 Faster APE with 𝒌 Mismatches References Appendix A Very Short Case (for 𝒌=𝟏) Appendix B Omitted Proofs

Faster Approximate Elastic-Degenerate String Matching – Part B

Paweł Gawrychowski ORCID Institute of Computer Science, University of Wrocław, Poland Adam Górkiewicz ORCID Institute of Computer Science, University of Wrocław, Poland Pola Marciniak ORCID Institute of Computer Science, University of Wrocław, Poland Solon P. Pissis ORCID CWI, Amsterdam, The Netherlands
Vrije Universiteit, Amsterdam, The Netherlands
Karol Pokorski ORCID Institute of Computer Science, University of Wrocław, Poland
Abstract

We revisit the complexity of approximate pattern matching in an elastic-degenerate string. Such a string is a sequence of n finite sets of strings of total length N, and compactly describes a collection of strings obtained by first choosing exactly one string in every set, and then concatenating them together. This is motivated by the need of storing a collection of highly similar DNA sequences.

The basic algorithmic question on elastic-degenerate strings is pattern matching: given such an elastic-degenerate string and a standard pattern of length m, check if the pattern occurs in one of the strings in the described collection. Bernardini et al. [SICOMP 2022] showed how to leverage fast matrix multiplication to obtain an 𝒪~(nmω1)+𝒪(N)-time complexity for this problem, where ω is the matrix multiplication exponent. However, from the point of view of possible applications, it is more desirable to work with approximate pattern matching, where we seek approximate occurrences of the pattern. This generalization has been considered in a few papers already, but the best result so far for occurrences with k mismatches, where k is a constant, is the 𝒪~(nm2+N)-time algorithm presented in Part A [CPM 2025]. This brings the question whether increasing the dependency on m from mω1 to quadratic is necessary when moving from k=0 to larger (but still constant) k.

We design an 𝒪~(nm1.5+N)-time algorithm for pattern matching with k mismatches in an elastic-degenerate string, for any constant k. To obtain this time bound, we leverage the structural characterization of occurrences with k mismatches of Charalampopoulos, Kociumaka, and Wellnitz [FOCS 2020] together with fast Fourier transform. We need to work with multiple patterns at the same time, instead of a single pattern, which requires refining the original characterization. This might be of independent interest.

Keywords and phrases:
ED string, approximate pattern matching, Hamming distance, k mismatches
Funding:
Paweł Gawrychowski: Partially supported by the Polish National Science Centre grant number 2023/51/B/ST6/01505.
Adam Górkiewicz: Partially supported by the Polish National Science Centre grant number 2023/51/B/ST6/01505.
Pola Marciniak: Partially supported by the Polish National Science Centre grant number 2023/51/B/ST6/01505.
Solon P. Pissis: Partially supported by the PANGAIA and ALPACA projects that have received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively.
Copyright and License:
[Uncaptioned image] © Paweł Gawrychowski, Adam Górkiewicz, Pola Marciniak, Solon P. Pissis, and Karol Pokorski; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Pattern matching
Editors:
Paola Bonizzoni and Veli Mäkinen

1 Introduction

An elastic-degenerate string (ED-string, in short) T~ is a sequence T~=T~[0]T~[1]T~[n1] of n finite sets, where T~[i] is a subset of Σ and Σ is an ordered finite alphabet. The length of T~ is defined as the length n=|T~| of the associated sequence. The size of T~ is defined as N=Nε+i=0n1ST~[i]|S|, where Nε is the total number of empty strings in T~. The cardinality of T~ is defined as c=i=0n1|T~[i]|. Every ED-string represents a collection of strings, each of generally different length. We formalize this intuition as follows. The language (T~) generated by T~ is defined as (T~)={S0S1Sn1:SiT~[i], for all i[0,n1]}.

The main motivation behind introducing ED-strings [19] was to encode a collection of highly similar DNA sequences in a compact form. Consider a multiple sequence alignment (MSA) of such a collection (see below). Maximal conserved substrings (conserved columns of the MSA) form singleton sets of the ED-string and the non-conserved ones form sets that list the corresponding variants. Moreover, the language of the ED-string consists of (at least) the underlying sequences of the MSA. Under the assumption that these underlying sequences are highly similar, the size of the ED-string is substantially smaller than the total size of the collection. ED-strings have been used in several applications: indexing a pangenome [8], on-line string matching in a pangenome [10], and pairwise comparison of pangenomes [15].

ED-strings are also interesting as a simplified model for string matching on node-labeled graphs [3]. The ED-string T~ can be viewed as a graph of n layers [21], where the c nodes are strings from Σ, such that from layer i to layer i+1 all possible edges are present and the nodes in layer i are adjacent only to the nodes in layer i+1. As a simplified model, ED-strings offer important advantages, such as on-line (left to right) string matching algorithms whose running times have a linear dependency on N [17, 2, 5]. This linear dependency on N (without any multiplicative polylogarithmic factors) is highly desirable in applications because, nowadays, it is typical to encode a very large number of genomes (e.g., millions of SARS-CoV-2 genomes111https://gisaid.org) in a single representation resulting in huge N values.

In this work, we focus on the string matching (or pattern matching) task. In the elastic-degenerate string matching (EDSM) problem, we are given a string P of length m (known as the pattern) and an ED-string T~ (known as the text), and we are asked to find the occurrences of P in (T~). Grossi et al. showed that EDSM can be solved in 𝒪(nm2+N) time using combinatorial pattern matching tools [17]. Aoyama et al. improved this to 𝒪~(nm1.5)+𝒪(N) time by employing fast Fourier transform [2]. Finally, Bernardini et al. improved it to 𝒪~(nmw1)+𝒪(N) time [5] by employing fast matrix multiplication, where w<2.37134 [1] is the matrix multiplication exponent. The authors of [1] also showed a conditional lower bound for combinatorial algorithms222The term “combinatorial” is not formally well-defined. for EDSM stating that EDSM cannot be solved in 𝒪(nm1.5ϵ+N) time, for any constant ϵ>0.

In the approximate counterpart of EDSM, we are also given an integer k>0, and we are asked to find k-approximate occurrences of P in (T~); namely, the occurrences that are at Hamming or edit distance at most k from P. For Hamming distance, we call the problem EDSM with k Mismatches (see below); and for edit distance, EDSM with k Errors. The approximate EDSM problem was introduced by Bernardini et al. who showed a simple 𝒪(kmc+kN)-time algorithm for EDSM with k Mismatches and an 𝒪(k2mc+kN)-time algorithm for EDSM with k Errors using combinatorial pattern matching tools [6]. In Part A, the dependency on k for both the Hamming and the edit distance metrics is improved, obtaining an 𝒪~(k2/3mc+kN)-time algorithm for EDSM with k Mismatches and an 𝒪~(kmc+kN)-time algorithm for EDSM with k Errors [22].

Unfortunately, the cardinality c of T~ in the above complexities is bounded only by N, so even for k=1, the existing algorithms run in Ω(mN) time in the worst case. Bernardini et al. [4] showed many algorithms for approximate EDSM for k=1 working in 𝒪~(nm2+N) time or in 𝒪(nm3+N) time for both the Hamming and the edit distance metrics. In Part A, the results for k=1 (for both metrics) are improved to 𝒪(nm2+N) time and extended to work for k>1 (for both metrics) in 𝒪~(nm2+N) time, for any constant k>1 [22].

Time complexity Remarks Reference
𝒪(cm+N) k=𝒪(1) [6]
𝒪(nm3+N) k=1 [4]
𝒪(nm2+Nlogm) k=1 [4]
𝒪(nm2+N) k=1 Part A, [22]
𝒪~(nm1.5)+𝒪(N) k=1 Theorem 1
𝒪~(nm2+N) k=𝒪(1) Part A, [22]
𝒪~(nm1.5+N) k=𝒪(1) Theorem 2

In this work, we consider the EDSM with k Mismatches problem with constant k1, and observe that all the existing algorithms have at best a quadratic dependency on m, the length of the pattern, for this problem. This is in stark contrast to the case of k=0, and brings the question of whether non-combinatorial methods could be employed to solve EDSM with k Mismatches for any constant k1, similar to EDSM (k=0[2, 5].

Theorem 1.

Given a pattern P of length m and an ED-string T~ of length n and size N, EDSM with k Mismatches, for k=1, can be solved in 𝒪(nm1.5polylogm+N) time.

Theorem 2.

Given a pattern P of length m and an ED-string T~ of length n and size N, EDSM with k Mismatches, for any constant k1, can be solved in 𝒪((nm1.5+N)polylogm) time.

Other Approaches.

The key ingredient of [4] and [22] to achieve the m2 term for k=1 is a new counterpart of the k-errata tree [11], where the copied nodes of the input trie are explicitly inserted into the tree. This counterpart is an actual trie, and hence it allows to apply standard tree traversal algorithms. Since for k=1, the constructed trie for T~[i] and the suffixes of P has 𝒪(mlog(N+m)) nodes originating from P, bitwise 𝒪(m/log(N+m))-time operations per such node result in the desired complexity. The main tool in [22] for extending to a constant k>1 is also k-errata trees; however, the authors of [22] manage to apply k-errata trees as a black-box. We stress that those algorithms are combinatorial (they do not use fast Fourier transform or fast matrix multiplication) and work also for edit distance.

Our Approach.

As in the previous works on elastic-degenerate string matching, we work with the so-called active prefixes extension problem. In this problem, we are given a text of length m, an input bitvector of length m, and a collection of patterns of total length N. The goal is to produce an output bitvector, also of length m. Informally, whenever there is an occurrence of some pattern starting at position i and ending at position j, we can propagate a one from the input bitvector at position i1 to a one in the output bitvector at position j. For approximate pattern matching, we have k+1 input and output bitvectors, corresponding to matching prefixes with different (i.e., the corresponding) number of mismatches.

The previous solutions can be seen as propagating the information from every i1 to every j explicitly. This, of course, cannot achieve time complexity better than quadratic in m. Instead, we leverage the following high-level idea: if a given pattern occurs very few times in the text, then we can afford to iterate through all of its occurrences and propagate the corresponding information. Otherwise, its occurrences are probably somewhat structured. More concretely, exact occurrences of a pattern of length in a text of length 1.5 form a single arithmetic progression. This has been extended by Bringmann, Künnemann, and Wellnitz [7] to occurrences with k mismatches, and further refined (and made effectively computable) by Charalampopoulos, Kociumaka, and Wellnitz [9]: either there are few occurrences, or they can be represented by a few arithmetic progressions (where few means polynomial in k). Further, a representation of all the occurrences can be computed effectively.

To implement this high-level idea, we first apply some relatively standard preliminary steps that allow us to handle short patterns effectively with k-errata trees. Further, we show how to reduce a given instance to multiple instances in which the pattern and the text are roughly of the same length. Then, we handle patterns with only a few occurrences naively. For the remaining patterns, we obtain a compact representation (as a few arithmetic progressions) of their occurrences. We cannot afford to process each progression separately, but we observe that, because we have restricted the length of the text, their differences are in fact equal to the same q for every remaining pattern. Now, if q is somewhat large (with the exact threshold to be chosen at the very end to balance the complexities), we can afford to process every occurrence of every pattern naively. Otherwise, we would like to work with every remainder modulo q separately, leveraging fast Fourier transform to process all progressions starting at the positions with that remainder together. As a very simplified example, if the progressions were the same for all the patterns, we only need to compute the sumset of the set of starting positions with a one in the input bitvector (restricted to positions with a specific reminder modulo q) with the set of lengths of the patterns. This can be indeed done with a single fast Fourier transform. However, the structural characterization of Charalampopoulos et al. [9] only says that, for every pattern, we have 𝒪(k2) arithmetic progressions with the same period q. However, the progressions are possibly quite different for different patterns. Our new insight is that, in fact, we can group the progressions into only a few classes, more specifically 𝒪(k2) irrespectively of the number of patterns, and then process each class together. This requires looking more carefully at the structural characterization of Charalampopoulos et al. [9], and might be of independent interest.

Structure of the Paper.

In Section 2, we provide some preliminaries and problem definitions. In Section 3, we discuss how the EDSM with k Mismatches problem can be solved via the APE with k Mismatches problem, the auxiliary problem used also in previous solutions [2, 5, 4] and in Part A [22]. In Section 4, we present our algorithms for three different cases: very short in Section 4.1; short in Section 4.2; and, finally, long in Section 4.3, which is the most interesting case. We conclude with balancing the thresholds in Section 4.4.

Computational Model.

We assume the standard Word RAM model with words consisting of Ω(logn) bits, where n is the size of the input. Basic operations on such words, such as indirect addressing and arithmetic operations, are thus assumed to take constant time.

2 Preliminaries

Strings.

Let Σ be a finite ordered alphabet of size |Σ|=σ. We will usually assume that Σ={1,2,,poly(n)}, where n is the size of the input, which is called the polynomial alphabet. The elements of Σ are called characters. A sequence of characters from Σ, X[0]X[1]X[n1], is called a (classic) string X. We call n the length of X, and denote it by |X|. The empty string is denoted by ε. By X[i..j]=X[i]X[i+1]X[j], we denote a fragment of X (starting at position i and ending at position j), which equals ε when i>j. Fragments of the form X[..j]:=X[0..j] and X[i..]:=X[i..n1] are called prefixes and suffixes of X, respectively. A fragment of X (or its prefix/suffix) is called proper, if it is not equal to X. Strings that are fragments of X (for some i and j) are called substrings of X. We also write X[i..j) to denote X[i..j1]. By XY, we denote the concatenation of strings X and Y, i.e., the string X[0]X[|X|1]Y[0]Y[|Y|1]. String X is a cyclic shift of Y when X=X1X2 and Y=X2X1, and then we call X and Y cyclically equivalent. We say that X has an occurrence in Y (at position t), if Y=AXB for some strings A and B such that |A|=t. Finally, Xr is the reversal of X, i.e., the string X[n1]X[n2]X[0].

Elastic-Degenerate Strings.

We study the following extensions of classic strings.

A symbol (over alphabet Σ) is an unordered subset of (classic) strings from Σ, different from {ε} and . Note that symbols may contain ε but not as their only element. The size of a symbol is the total length of all strings in the symbol (with the additional assumption that the empty string is counted as if it had length 1). The Cartesian concatenation of two symbols X and Y is defined as XY:={xy|xX,yY}.

An elastic-degenerate string (or ED-string, in short) X~=X~[0]X~[1]X~[n1] (over alphabet Σ) is a sequence of symbols (over Σ). We use |X~| to denote the length of X~, i.e., the length of the associated sequence (the number of its symbols). The size of X~ is the sum of the sizes of the symbols in X~. As for classic strings, we denote a fragment of X~ by X~[i..j]=X~[i]X~[i+1]..X~[j]. We similarly denote prefixes and suffixes of X~. The language of X~ is (X~)=X~[0]X~[1]X~[n1].

Given a (classic) string P and an ED-string T~, we say that P matches the fragment T~[i..j] (or that an occurrence of P starts at position i and ends at position j of T~), if i=j and P is a fragment of at least one of the strings of T~[i] (the whole pattern is fully contained in one of the symbols), or if i<j and there is a sequence of strings (Pi,Pi+1,,Pj), such that: P=PiPi+1Pj; Pi is a suffix of one of the strings of T~[i], PkT~[k], for all i<k<j; and Pj is a prefix of one of the strings of T~[j] (P uses parts of at least two symbols).

Hamming Distance.

Given two (classic) strings X and Y of the same length over alphabet Σ, their Hamming distance δH(X,Y) is defined as the number of mismatches (i.e., the positions i such that X[i]Y[i]). We use Mis(X,Y) to denote the set of mismatches.

Given two (classic) strings X and Y over alphabet Σ, we say that X is an approximate fragment (with at most k mismatches) of Y if there is a string X with δH(X,X)k, such that X is a substring of Y. We similarly define approximate prefixes and approximate suffixes. We write OcckH(X,Y) to denote the set of all k-mismatch approximate occurrences of X in Y, i.e., all positions i in Y, such that δH(X,Y[i..i+|X|))k.

Given a string P, an ED-string T~ and an integer k1, we say that P approximately matches the fragment T~[i..j] (with at most k mismatches) of T~, or that an approximate occurrence of P starts at position i and ends at position j of T~, if there is a string P such that δH(P,P)k and P matches T~[i..j]. We stress that, as in the case of exact occurrences, each approximate occurrence of P in T~ is of the following forms: either P has Hamming distance at most k to a fragment of a string in a symbol of T~; or it uses parts of at least two symbols of T~. In the latter case, a prefix of P is an approximate (possibly empty) suffix of a string in T~[i], a suffix of P is an approximate (possibly empty) prefix of a string in T~[j], and the remaining fragments of the pattern are approximate matches of a string in all other used symbols of T~ (except the first and the last one).

Periodicity.

For a string X, we write X to denote the string X concatenated infinitely many times with itself. We call a string X primitive when it cannot be represented as Yh, for some string Y and some integer h2. We say that a string X is a d-period with offset α of some other string Y when δH(Y,X[α..α+|Y|))d for some d,α0; and we call the elements of Mis(Y,X[α..α+|Y|)) the periodic mismatches. If d=0, then X is an exact period of Y, or just a period. Note that all cyclic shifts of X are also (approximate or exact) periods, but with different offsets.

ED-string Matching.

Similarly as in Part A and in previous works (cf. [4]), we define the following problem – we assume integer k to be a fixed constant and not part of the input.

Problem: Elastic Degenerate String Matching (EDSM) with k Mismatches
Input: A string P of length m and an ED-string T~ of length n and size Nm.
Output: All positions in T~ where at least one approximate occurrence (with at most k mismatches) of P ends.

In the above problem, we call P the pattern and T~ the text.

Active Prefixes Extension.

Similarly as in Part A and in previous works (cf. [4]), we solve EDSM with k Mismatches through the following auxiliary problem.

Problem: Active Prefixes Extension (APE) with k Mismatches
Input: A string T of length m, k+1 bitvectors U0,U1,,Uk of size m each, and strings P1,P2, of total length N.
Output: k+1 bitvectors V0,V1,,Vk of size m each, where Vd[j]=1 if and only if there is a string Pi and j{0,1,,m1} with Ud[j]=1 such that j=j+|Pi| and dd+δH(Pi,T[j..j]).

In the above problem, we call T the text, and P1,P2, the patterns.

3 EDSM with 𝒌 Mismatches via APE with 𝒌 Mismatches

We begin by showing how to reduce EDSM with k Mismatches to multiple instances of APE with k Mismatches. This does not require any new ideas, and it proceeds similarly as in Part A and in previous works (cf. [4]), so we only state it here for completeness.

As we mentioned before, each approximate occurrence of pattern P in ED-string T~ is:

  1. 1.

    either an approximate fragment of a string of a symbol; or

  2. 2.

    crossing the boundary between two consecutive symbols.

We explain how to detect the occurrences of each form separately.

Approximate Fragments of Symbols.

To check if the pattern is an approximate fragment of a string of a symbol, we test each symbol of T~ separately (cf. [6]). To this end, we apply the technique of Landau and Vishkin [20], informally referred to as the kangaroo jumps. First, we preprocess the concatenation of all the symbols of T~ and the pattern with the following.

Lemma 3 (suffix tree [12] with LCA queries [18]).

A string T over a polynomial alphabet can be preprocessed in 𝒪(|T|) time to allow computing the longest common prefix of any two suffixes T[i..] and T[j..] of T in constant time.

Recall that pattern P is of length m. For a symbol X={X1,X2,,Xt} with t strings, we consider each string Xi and, for every j=0,1,,|Xi|m, check if δH(Xi[j..j+m),P)k in 𝒪(k) time by repeatedly computing the longest common prefix of the remaining suffix of Xi and P. This takes 𝒪(m+kN) total time.

Crossing the Boundary between two Consecutive Symbols.

To check if P approximately matches a fragment T~[i..i], for some positions i<i, we reduce the problem to multiple instances of APE with k Mismatches. We iterate through the symbols of T~ left-to-right and maintain k+1 bitvectors B0,B1,,Bk, each of size m, such that Bd[j]=1 when the prefix P[..j1] of P is a d-approximate suffix of the current T~[..i], for d=0,1,,k. Let Ni denote the size of T~[i], i.e., the total length of all strings in T~[i].

To proceed to the next iteration, and compute the bitvectors for T~[..i+1] from the bitvectors for T~[..i], we need to consider two possibilities. First, to consider the case when the d-approximate suffix is fully within T~[i+1], for every d=0,1,,k, we find all d-approximate prefixes of P that are suffixes of T~[i+1]. This is done in 𝒪(kNi+1) time by iterating over all strings in T~[i+1], considering for each of them every sufficiently short prefix of P, and computing the number of mismatches (if they do not exceed k) using kangaroo jumps in 𝒪(k) time. Second, to consider the case when the d-approximate suffix crosses the boundary between T~[i] and T~[i+1], we create and solve an instance of APE with k Mismatches with the bitvectors representing the results for T~[..i] and the strings in T~[i+1]. We take as the new bitvectors the bitwise-OR of the bitvectors corresponding to both cases.

Before proceeding to the next iteration, we need to detect an occurrence that crosses the boundary between T~[i] and T~[i+1]. To this end, we consider each string TT~[i+1]. Then, for every d=0,1,k and j=0,1,,m1 such that Bd[j]=1 and mj|T|, we check if P[j..] is a (kd)-approximate prefix of T~[i+1] using kangaroo jumps in 𝒪(k) time, and if so, report position i+1 as a k-approximate occurrence. Because we only need to consider |T|+1 possibilities for j, this takes 𝒪(kNi+1) time.

We summarize the complexity of the reduction in the following lemma.

Lemma 4.

Assume that APE with k Mismatches can be solved in fk(m,N) time. Then EDSM with k Mismatches can be solved in 𝒪(m+kiNi+ifk(m,Ni)) time.

4 Faster APE with 𝒌 Mismatches

We now move to designing efficient algorithms for APE with k Mismatches, separately for k=1 and then any constant k. Combined with the reduction underlying Lemma 4, this will result in Theorem 1 and Theorem 2. Recall that the input to an instance of APE with k Mismatches consists of a string T (called the text) of length m and a collection of strings P1,P2, (called the patterns) of total length N.

For k=1, the strings are partitioned depending on their lengths and parameters B and B depending on m:

  1. 1.

    Very Short Case: the length of each string is B,

  2. 2.

    Short Case: the length of each string is >B and B,

  3. 3.

    Long Case: the length of each string is >B.

We separately solve the three obtained instances of APE with k Mismatches and return the bitwise-OR of the obtained bitvectors. For an arbitrary constant k, we have two cases:

  1. 1.

    Short Case: the length of each string is B,

  2. 2.

    Long Case: the length of each string is >B.

4.1 Very Short Case (for 𝒌=𝟏)

An efficient algorithm for this case can be obtained by using suffix trees and separately considering exact occurrences and occurrences with one mismatch. For completeness, we provide the proof in the appendix. An alternative algorithm is given in Part A, Lemma 18.

Theorem 5.

An instance of APE with k Mismatches where k=1 and the length of each pattern is at most B can be solved in 𝒪(m(B)2+m(loglogm)2+N) time.

4.2 Short Case

Recall that an instance of APE with k Mismatches consists of the patterns P1,P2,,Pd of total length N and the text T[0..m1]. Further, in this case, the length of each Pi is at least B but at most B. We start with observing that, after 𝒪(m+N)-time preprocessing, we can assume that logd=𝒪(klogm), because we only need to keep patterns Pi such that δH(T[j..j],Pi)k, for some fragment T[j..j]. This relates the number d of the patterns to k and m. Then, we state another known tool, and finally provide the algorithm.

Reducing the Number of Patterns.

Recall that we only need to keep patterns Pi such that δH(T[j..j],Pi)k for some fragment T[j..j]. If d=m𝒪(k) then there is nothing to do. Otherwise, we consider all fragments of the text. For every such T[j..j], we choose up to k positions where there is a mismatch, and for each of them we either choose a special character not occurring in T, represented by 0, or a character occurring somewhere in T. Overall, we have at most m2mk(m+1)k=m𝒪(k) possibilities. For each of them, we construct the corresponding candidate string. Then, we sort the obtained candidate strings together with the patterns Pi with radix sort in 𝒪(N+m𝒪(k))=𝒪(N) time. We scan the obtained sorted list and only keep patterns Pi equal to some candidate string.

The 𝒌-errata Trie.

Cole, Gottlieb, and Lewenstein [11] considered the problem of preprocessing a dictionary of d patterns P1,P2,,Pd of total length N for finding, given a query string Q of length m, whether Q is at Hamming distance at most k from some Pi. We provide a brief overview of their approach, following the exposition in [16] that provides some details not present in the original description.

For k=0, this can be of course easily solved with a structure of size 𝒪(N) and query time 𝒪(m) by arranging the patterns in a trie. For larger values of k, the k-errata trie is defined recursively. In every step of the recursion, the input is a collection of xd strings, each of them being a suffix of some pattern Pi decorated with its mismatch budget initially set to k. We arrange the strings in a compact trie, and then recurse guided by the heavy-path decomposition [24] of the trie. The depth of the recursion is k, and on each level the overall number of strings increases by a factor of logd, starting from d. Answering a query requires the following primitive: given a node of one of the compact tries and the remaining suffix of the query string Q[i..], we need to navigate down starting from the given node while reading off the subsequent characters of Q[i..]. This needs to be done while avoiding explicitly scanning Q[i..], as such a primitive is invoked multiple times. For a compact trie storing x suffixes of the patterns, such a primitive can be implemented by a structure of size 𝒪(xlogx) and query time 𝒪(loglogN), assuming that we know the position of every suffix of the query string in the suffix tree of P1$1P2$2Pd$d (also known as the generalized suffix tree of P1,P2,,Pd).

In our application, the query string will always be a fragment of the text T[i..j]. Thus, we can guarantee that the position of every suffix of the query string in the generalized suffix tree of P1,P2,,Pd is known by building the generalized suffix tree of P1,P2,,Pd,T. This gives us the position of every suffix T[i..] in the generalized suffix tree of P1,P2,,Pd, from which we can infer the position of T[i..j]. We summarize the properties of such an implementation below.

Lemma 6 ([11]).

For any constant k, a dictionary of d patterns P1,P2,,Pd of total length N and a text T[0..m1] can be preprocessed in 𝒪(m+N+dlogk+1d) time to obtain a structure of size 𝒪(m+N+dlogkd), such that for any fragment T[i..j] we can check in 𝒪(logkdloglogN) time whether δH(T[i..j],Pi)k, for some i.

Theorem 7.

For any constant k, an instance of APE with k Mismatches where the length of each pattern is at least B and at most B can be solved in 𝒪(mBlogkmloglogm+N+N/Blogk+1m) time.

Proof.

We start with applying Lemma 6 on the patterns P1,P2, and the text T[0..m1]. Then, iterate over every position j=0,1,,m1, length =1,2,,min{mj,B}, and d=0,1,,k such that Ud[j]=1. Next, for every d=0,1,,kd we check if δH(T[i..i+),Pi)d for some i. If so, we set Vd+d[j+]=1.

We analyze the overall time complexity. First, we need to construct the k-errata trie for P1,P2,,Pd and T[0..m1]. This takes 𝒪(m+N+dlogk+1d) time. Then, we consider 𝒪(mB) possibilities for iterating over the position and the length, and for each of them spend 𝒪(logkdloglogN) time. As each Pi is of length at least B, from the preprocessing we have logd=𝒪(klogm) and Ndm, the overall complexity is:

𝒪(m+N+dlogk+1d+mBlogkdloglogN)=𝒪(N+N/Blogk+1m+mBlogkmloglogm)

as claimed.

4.3 Long Case

In the most technical case, we assume that the length of each pattern Pi is at least B. We start with providing an overview, and then move to filling in the technical details.

The very high-level idea is to explicitly or implicitly process all occurrences of every pattern Pi. If a given pattern Pi occurs sufficiently few times in the text then we can afford to list and process each of its occurrences explicitly. Otherwise, we invoke the structural characterization of [9], which, roughly speaking, says that if there are many approximate occurrences of the same string sufficiently close to each other in the text, then the string and the relevant fragment of the text have a certain regular structure. Thus, we can certainly hope to process all occurrences of such a pattern Pi together faster than by considering each of those occurrences one-by-one. However, this would not result in a speed-up, and in fact, we need to consider multiple such patterns Pi together. To this end, we need to further refine the characterization of [9]. Before we proceed with a description of our refinement, we start with a summary of the necessary tools from [9]. Then, we introduce some notation, introduce some simplifying assumptions, and then describe our refinement.

Tools.

The authors of [9] phrase their algorithmic results using the framework of PILLAR operations. In this framework, we operate on strings, each of them specified by a handle. For two strings S and T, the following operations are possible (among others):

  1. 1.

    Extract(S,,r): retrieve the string S[..r],

  2. 2.

    LCP(S,T): compute the length of the longest common prefix of S and T,

  3. 3.

    IPM(S,T): assuming that |T|2|S|, return the starting positions of all exact occurrences of S in T (at most two starting positions or an arithmetic progression of starting positions).

Lemma 8 ([9, Theorem 7.2]).

After an 𝒪(N)-time preprocessing of a collection of strings of total length N, each PILLAR operation can be performed in 𝒪(1) time.

We apply the above lemma on the text and all the patterns in 𝒪(m+N) time.

The first main result of [9] is the following structural characterization.

Lemma 9 ([9, Theorem 3.1]).

For each pattern of length |P|1.5|T|, at least one of the following holds:

  • |OcckH(P,T)|864k,

  • There is a primitive string Q of length |Q||P|/128k such that δH(P,Q[0..|P|))<2k.

Then, they convert the structural characterization (Lemma 9) into an efficient algorithm.

Lemma 10 ([9, Main Theorem 8]).

For any pattern of length |P|1.5|T|, we can compute (a representation of) OcckH(P,T) in time 𝒪(k2loglogk) plus 𝒪(k2) PILLAR operations.

The representation is a set of 𝒪(k2) arithmetic progressions. Further, as the algorithm follows the proof of Lemma 9, in fact it either outputs a set of 𝒪(k) occurrences or finds a primitive string Q of length |Q||P|/128k such that δH(P,Q[0..|P|))<2k.

Notation.

We rephrase the APE with k Mismatches problem as follows. For a pattern P and a set of positions A in the text T we define:

ExtkH(P,T,A):=(OcckH(P,T)A)+|P|,

which is also a set of positions in T. Then, APE with k Mismatches for for a given set of patterns P1,P2, and a text T can be solved as follows. For every d=0,1,,k we set A to be Ud. Then, for every d=d,d+1,,k we create an instance of computing:

iExtkH(Pi,T,A),

where k=dd. The obtained bitvector contributes to the bitvector Vd. From now on, we focus on designing an algorithm that computes iExtkH(Pi,T,A), and identify the underlying sets with their characteristic vectors. For two sets of integers X and Y, we define their sumset XY:={x+y:xX,yY}. For a set of integers X and a shift s, we define X+s:={x+s:xX}. The following result is well-known.

Lemma 11 (e.g. [14]).

Given X,Y[0,), we can compute XY in 𝒪(log) time.

Simplifying Assumptions.

It is convenient to assume that each pattern has roughly the same length, similar to the length of the text. More formally, our algorithm will assume that:

  1. 1.

    |Pi|[,1.1), for every i,

  2. 2.

    |T|[,1.5],

for some . Any instance can be reduced to 𝒪(logm) instances in which |Pi|[,1.1), for every i, by considering =1.10,1.11,1.12,m. For each such B, we create a separate instance containing only patterns of length from [,1.1). As each pattern Pi falls within exactly one such instance, designing an algorithm running in 𝒪(f(m)+N) time for every such instance, implies an algorithm running in 𝒪(f(m)logm+N) time for a general instance. To additionally guarantee that |T|1.5 (so that Lemma 9 can be directly applied), we choose |T|/0.4 fragments Tj such that each potential occurrence of a pattern Pi in T falls within some fragments Tj. Formally, if Tj is the 1.5-length fragment (possibly shorter for the very last fragment) starting at position 0.4j then:

OcckH(Pi,T)=jOcckH(Pi,Tj)+0.4j,

where we disregard fragments shorter than as they cannot contain an occurrence of any Pi. From now on, always assume that we deal with a single text T, with |T|[,1.5], and a set of patterns Pi with lengths in [,1.1) (we will sometimes omit the index of a pattern and simply write P). The preprocessing from Lemma 8 is performed only once, and then in each instance we assume that any PILLAR operation can be performed in 𝒪(1) time. The input bitvectors U0,U1,,Uk in such an instance are fragments of the original input bitvectors, and after computing the output bitvectors V0,V1,,Vk we update appropriate fragments of the original output bitvectors by computing bitwise-OR. The final number of restricted instances is 𝒪(m/logm), and each original pattern appears in 𝒪(m/) instances.

Consider a restricted instance containing d patterns. Our goal will be to solve it in 𝒪((d+dlog)poly(k)) time. Before we proceed to describe such an algorithm, we analyze what this time implies for an algorithm solving the original instance.

Theorem 12.

For any k, an instance of APE with k Mismatches where the length of each pattern is at least B can be solved in 𝒪(m+(Nm/B2+m2log2m/B)poly(k)) time.

Proof.

We assume that a restricted instance with d patterns can be solved in 𝒪((d+dlog)poly(k)) time, and describe an algorithm for solving a general instance of APE with k Mismatches.

Let di denote the number of patterns of length from [i,i+1), where i=1.1i, in the original instance. Recall that we only consider i such that iB. After the initial 𝒪(m+N)-time preprocessing, ignoring factors polynomial in k, the total time to solve the restricted instances is:

𝒪(im/i(di+idilogi) =𝒪(im/i2dii+imdilogm)
=𝒪(Nm/B2+imdilogm),

where we have used idii=N and iB. We split the sum by separately considering i such that mdilogmdii, i.e., dimlogm/i. This gives us:

𝒪(Nm/B2+idii+im2logm/i)=𝒪(Nm/B2+N+m2log2m/B).

Thus, as long as we indeed manage to solve a restricted instance in the promised complexity, we obtain the theorem.

In what follows, we describe an algorithm for solving a restricted instance of APE with k Mismatches containing d patterns in 𝒪((d+dlog)poly(k)) time.

Additional Assumptions.

We start with applying Lemma 10 on every pattern Pi to obtain a representation of its occurrences in T in 𝒪(dpoly(k)) time. As mentioned earlier, the algorithm underlying Lemma 10 either outputs a set of 𝒪(k) occurrences of Pi in T or finds a primitive 2k-period Qi of Pi such that |Qi||P|/128k</100k (note that the second inequality holds because |P|<1.1). In the latter case, we also obtain a representation of the whole set of occurrences as 𝒪(k2) arithmetic progressions.

If there are 𝒪(k) occurrences of Pi in T, then we process each of them naively in 𝒪(dk) time. From now on we can thus assume otherwise for every pattern Pi. Then, we consider the text T, and ensure that it is fully covered by approximate occurrences of the patterns:

  • some pattern P is a k-mismatch prefix of T, formally δH(P,T[0..|P|))k; and

  • some pattern P is a k-mismatch suffix of T, formally δH(P,T[|T||P|..|T|))k.

This is guaranteed by removing some prefix and some suffix of the text; it can be implemented in 𝒪(dpoly(k)) time by extracting the first and the last occurrence from each arithmetic progression in the representation. Then the following claim can be inferred from the characterization of the period case in [9]. We provide a proof in the appendix for completeness.

Lemma 13.

All Qis are cyclically equivalent, and every Qi is a 6k-period of the text T.

We choose Q to be a cyclic shift of the period Q1 that we got for the pattern P1, so Q is a 2k-period of every pattern, and a 6k-period with offset 0 of the text. This can be implemented in 𝒪(dpoly(k)+) time as follows. For every Pi, because |Qi|</100k and δH(Pi,Qi[0..|Pi|))2k, we have Pi[j|Qi|..(j+2)|Qi|)=QiQi for some j. Further, such a j can be computed in 𝒪(k) time by just trying j=0,1,2, and verifying each j by computing the longest common prefix twice. Overall, this takes 𝒪(dpoly(k)) time. We start with setting Q:=Q1. Then, we search for a cyclic shift Q of Q such that δH(T,Q[0..|T|))6k. To this end, we check all possible 𝒪(/k) cyclic shifts. To verify whether Q is a good cyclic shift, we extract the mismatches between T and Q[0..|T|), terminating when there are more than 6k. The next mismatch can be found in constant time by first computing the longest common prefix of the remaining suffix of T with an appropriate cyclic shift of Q, and if there is none by computing the longest common prefix of the remaining suffix of T with the suffix shortened by |Q| characters. Overall, this takes 𝒪() time. After having found Q, we also compute, for every pattern Pi, an integer r such that δH(Pi,Q[r..r+|Pi|))2k, which can be done with a single internal pattern matching query to find an occurrence of Q in QiQi.

The Algorithm.

To obtain an efficient algorithm, we will partition the set of all positions [0..|T|) into 𝒪(k) consecutive regions R0,R1,,Rb with the property that if we restrict the text to any region Ri, the corresponding fragment is almost periodic with respect to Q; more specifically, it may have a single periodic mismatch at the rightmost position. Then, for each pair of regions Rs and Rt, with st, we separately calculate the set of extensions induced by occurrences of the pattern starting at positions xRs such that (x+|P|)Rt:

iExtkH(Pi,T,A)=stiExtkH(Pi,T,ARs)Rt.

Since b=𝒪(k), this allows us to reduce the problem to 𝒪(k2) separate instances of calculating:

iExtkH(Pi,T,ARs)Rt.

Consider a single pattern P, and recall that by the additional assumptions, we have a primitive string Q of length |Q|</100k such that:

δH(P,P¯)2k for P¯ :=Q[r..r+|P|),
δH(T,T¯)6k for T¯ :=Q[0..|T|).

Further, let Cr be the positions congruent to r modulo |Q| in the text. We start with recalling from [9] that the positions of all k-mismatch occurrences of P in T are congruent modulo |Q|. We provide a proof in the appendix for completeness.

Lemma 14.

OcckH(P,T){r+i|Q|:i}.

Following Lemma 14, choose r such that OcckH(P,T)Cr. In order to characterize OcckH(P,T), let us now analyze the values δH(P,T[x..x+|P|)) for xCr. From triangle inequality, we have

δH(P,T[x..x+|P|))δH(P,P¯)+δH(P¯,T¯[x..x+|P|))+δH(T¯[x..x+|P|),T[x..x+|P|)). (1)

Observe that

P¯=T¯[x..x+|P|), (2)

since both strings have a period Q with offsets congruent modulo |Q|, which gives us

δH(P,T[x..x+|P|))δH(P,P¯)+δH(T¯[x..x+|P|),T[x..x+|P|)). (3)

We will later show that the above inequality is in fact an equality for all xCr except for 𝒪(k2) exceptions E. Specifically, define

E:={τπ:πMis(P,P¯),τMis(T,T¯)}[0,|T|),

which is the set of all starting positions x in T such that when comparing P and T[x..x+|P|), at least one pair of mismatches aligns with each other. Note that |E|δH(P,P¯)δH(T,T¯)2k6k=12k2. Finally, we define R0,R1,,Rb. Let:

  • τ0<τ1<<τb1 denote the sorted elements of Mis(T,T¯),

  • Ri:=(τi1..τi] for all 0ib, where we set τ1:=1 and τb:=|T|1.

This is illustrated in Figure 1.

The elements of Mis(T,T¯) can be computed in 𝒪(k) time by extracting the mismatches with longest common prefix queries, which allows us to find the regions in 𝒪(poly(k)) time. Similarly, we can compute the elements of Mis(P,P¯) in 𝒪(k) time, so we can also compute the set E in 𝒪(poly(k)) time. We stress that the regions are the same for every considered pattern P (but the set E does depend on the pattern P).

Figure 1: Regions R0,R1,,Rb. Black rectangles denote the elements of Mis(T,T¯).

Now we can state our extension of the structural characterization of [9].

Lemma 15.

There exist an integer r and a set E, with |E|=𝒪(k2), such that, for each pair s,t, with 0stb:

ExtkH(P,T,ARs)Rt={((CrARs)+|P|)Rt if δH(P,P¯)+tsk,some subset of E+|P| otherwise.

Proof.

For any s>t the resulting set is trivially empty. Now let us fix any st and define

IstP:=CrARs(Rt|P|).

We will show that either IstPOcckH(P,T) or IstPOcckH(P,T)E, depending on whether δH(P,P¯)+tsk by using the following property:

Proposition 16.

For all xIstP, we have δH(T[x..x+|P|),T¯[x..x+|P|))=ts.

Proof.

We know that

  • xRs, so τs1<xτs,

  • x+|P|Rt, so τt1<x+|P|τt,

therefore

[x,x+|P|){τ0,τ1,,τb1}={τs,τs+1,,τt1},

and finally

δH(T[x..x+|P|),T¯[x..x+|P|)) =|[x,x+|P|)Mis(T,T¯)|
=|[x,x+|P|){τ0,τ1,,τb1}|
=|{τs,τs+1,,τt1}|
=ts.

Now assume that δH(P,P¯)+tsk. In that case, recall that by (3) combined with Proposition 16, for every xIstP, we have

δH(P,T[x..x+|P|)) δH(P,P¯)+δH(T¯[x..x+|P|),T[x..x+|P|))=δH(P,P¯)+tsk.

Consequently xOcckH(P,T), and therefore IstPOcckH(P,T). In that case observe that

ExtkH(P,T,ARs)Rt =((OcckH(P,T)ARs)+|P|)Rt
=(OcckH(P,T)ARs(Rt|P|))+|P|
=(OcckH(P,T)CrARs(Rt|P|))+|P|
=(OcckH(P,T)IstP)+|P|
=IstP+|P|,

where the third equality follows from OcckH(P,T)Cr and the fifth from IstPOcckH(P,T). Since IstP+|P|=((CrARs)+|P|)Rt as desired, the proof for this case is complete.

For the second case, when δH(P,P¯)+ts>k, we need to make use of the following property:

Proposition 17.

For all xCrE, we have

δH(P,T[x..x+|P|))=δH(P,P¯)+δH(T¯[x..x+|P|),T[x..x+|P|)).

Proof.

Consider the triangle inequality (1) stated explicitly for each position i[0..|T|):

δH(P[i],T[i+x])δH(P[i],P¯[i])+δH(P¯[i],T¯[i+x])+δH(T¯[i+x],T[i+x]).

From (2) we already know that δH(P¯[i],T¯[i+x])=0, thus

δH(P[i],T[i+x])δH(P[i],P¯[i])+δH(T¯[i+x],T[i+x]).

We will now show that the above inequality holds with equality by considering two cases. The proof is completed by summing the equations for all i[0..|T|).

  1. 1

    i+xMis(T,T¯). In that case, observe that by triangle inequality

    δH(P[i],P¯[i])δH(P[i],T[i+x])+δH(T[i+x],T¯[i+x])+δH(T¯[i+x],P¯[i])

    and since δH(T[i+x],T¯[i+x])=0 by the assumption and δH(T¯[i+x],P¯[i])=0 by (2), we get

    δH(P[i],P¯[i])+δH(T¯[i+x],T[i+x])δH(P[i],T[i+x]).
  2. 2

    iMis(P,P¯). In that case, observe that by triangle inequality

    δH(T¯[i+x],T[i+x])δH(T¯[i+x],P¯[i])+δH(P¯[i],P[i])+δH(P[i],T[i+x])

    and similarly, since δH(P¯[i],P[i])=0 and δH(T¯[i+x],P¯[i])=0, we again get

    δH(P[i],P¯[i])+δH(T¯[i+x],T[i+x])δH(P[i],T[i+x]).

It remains to show that every i[0..|T|) falls into at least one of these two cases. Indeed, if for some i we would have iMis(P,P¯) and i+xMis(T,T¯), then by the definition of E, we would get xE, which is a contradiction.

By Proposition 17, combined with Proposition 16 for every xIstPE, we have

δH(P,T[x..x+|P|)) =δH(P,P¯)+δH(T¯[x..x+|P|),T[x..x+|P|))=δH(P,P¯)+ts>k,

therefore xOcckH(P,T), which implies IstPOcckH(P,T)E. Following the same reasoning as before, up to the fourth equality, we get

ExtkH(P,T,ARs)Rt=(OcckH(P,T)IstP)+|P|E+|P|

as required.

We apply Lemma 15 on every pattern Pi. Whenever the second case applies, we process all occurrences of Pi naively. We observe that by definition we have

ExtkH(P,T,ARs)Rt=((OcckH(P,T)EARs)+|P|)Rt.

Since we already have access to the previously calculated representation of OcckH(P,T), we can simply check for each element in E whether it is in OcckH(P,T)ARs in 𝒪(k2) time, as the representation of OcckH(P,T) is of size 𝒪(k2), so 𝒪(|E|k2)=𝒪(k4) time in total.

For the remaining patterns that fall into the first case, we still use the naive approach if |Q|>z for some threshold z to be chosen later. Since |Cr|=𝒪(/|Q|), this takes 𝒪(/z) time per pattern. Otherwise, |Q|z. We partition the remaining patterns into |Q| groups with the same r. Formally, let 𝒫r denote the set of patterns P with a specific value of r:

iExtkH(Pi,T,ARs)Rt=r=0|Q|1P𝒫r((CrARs)+|P|)Rt.

We calculate the result for each r separately by phrasing it as a sumset of some common set of positions with the set of pattern lengths, where the result is then truncated to Rt:

P𝒫r((CrARs)+|P|)Rt=((CrARs){|P|:P𝒫r})Rt.

This takes 𝒪(zlog) total time by Lemma 11. Overall, the time complexity is 𝒪(dpoly(k)++d/z+zlog), which by choosing z=d/log becomes 𝒪(dpoly(k)+dlog) as promised.

4.4 Combining the Cases

After designing an algorithm for every case, we show how to combine them to obtain the claimed bounds.

See 1

Proof.

By Lemma 4, to prove the theorem it is enough to show how to solve APE with k Mismatches, where k=1, in 𝒪(m1.5polylogm+N) time. We choose B=log2m and B=m. For patterns of length at most B, we use Theorem 5. For patterns of length at least B but at most B, we use Theorem 7. Finally, for patterns of length at least B, we use Theorem 12. Summing up the time complexities, we obtain

𝒪(mlog4m+N)+𝒪(mlog3mloglogm+N)+𝒪(m+N+m1.5log2m)=𝒪(m1.5polylogm+N)

as required.

See 2

Proof.

By Lemma 4, to prove the theorem it is enough to show how to solve APE with k Mismatches for any constant k1 in 𝒪((m1.5+N)polylogm) time. We choose B=m. For patterns of length at most B, we use Theorem 7. Finally, for patterns of length at least B, we use Theorem 12. Summing up the time complexities, we obtain

𝒪(m1.5logkmloglogm+Nlogk+1m)+𝒪(m+N+m1.5log2m)=𝒪((m1.5+N)polylogm)

as required.

References

  • [1] Josh Alman, Ran Duan, Virginia Vassilevska Williams, Yinzhan Xu, Zixuan Xu, and Renfei Zhou. More asymmetry yields faster matrix multiplication. In Yossi Azar and Debmalya Panigrahi, editors, Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2025, New Orleans, LA, USA, January 12-15, 2025, pages 2005–2039. SIAM, 2025. doi:10.1137/1.9781611978322.63.
  • [2] Kotaro Aoyama, Yuto Nakashima, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Faster online elastic degenerate string matching. In Gonzalo Navarro, David Sankoff, and Binhai Zhu, editors, Annual Symposium on Combinatorial Pattern Matching, CPM 2018, July 2-4, 2018 - Qingdao, China, volume 105 of LIPIcs, pages 9:1–9:10. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2018. doi:10.4230/LIPICS.CPM.2018.9.
  • [3] Rocco Ascone, Giulia Bernardini, Alessio Conte, Massimo Equi, Estéban Gabory, Roberto Grossi, and Nadia Pisanti. A unifying taxonomy of pattern matching in degenerate strings and founder graphs. In Solon P. Pissis and Wing-Kin Sung, editors, 24th International Workshop on Algorithms in Bioinformatics, WABI 2024, September 2-4, 2024, Royal Holloway, London, United Kingdom, volume 312 of LIPIcs, pages 14:1–14:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPICS.WABI.2024.14.
  • [4] Giulia Bernardini, Estéban Gabory, Solon P. Pissis, Leen Stougie, Michelle Sweering, and Wiktor Zuba. Elastic-degenerate string matching with 1 error or mismatch. Theory Comput. Syst., 68(5):1442–1467, 2024. doi:10.1007/S00224-024-10194-8.
  • [5] Giulia Bernardini, Pawel Gawrychowski, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Elastic-degenerate string matching via fast matrix multiplication. SIAM J. Comput., 51(3):549–576, 2022. doi:10.1137/20M1368033.
  • [6] Giulia Bernardini, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Approximate pattern matching on elastic-degenerate text. Theor. Comput. Sci., 812:109–122, 2020. doi:10.1016/J.TCS.2019.08.012.
  • [7] Karl Bringmann, Marvin Künnemann, and Philip Wellnitz. Few matches or almost periodicity: Faster pattern matching with mismatches in compressed texts. In Timothy M. Chan, editor, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 1126–1145. SIAM, 2019. doi:10.1137/1.9781611975482.69.
  • [8] Thomas Büchler, Jannik Olbrich, and Enno Ohlebusch. Efficient short read mapping to a pangenome that is represented by a graph of ED strings. Bioinform., 39(5), 2023. doi:10.1093/BIOINFORMATICS/BTAD320.
  • [9] Panagiotis Charalampopoulos, Tomasz Kociumaka, and Philip Wellnitz. Faster approximate pattern matching: A unified approach. In Sandy Irani, editor, 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, Durham, NC, USA, November 16-19, 2020, pages 978–989. IEEE, 2020. doi:10.1109/FOCS46700.2020.00095.
  • [10] Aleksander Cislak, Szymon Grabowski, and Jan Holub. Sopang: online text searching over a pan-genome. Bioinform., 34(24):4290–4292, 2018. doi:10.1093/BIOINFORMATICS/BTY506.
  • [11] Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don’t cares. In László Babai, editor, Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 91–100. ACM, 2004. doi:10.1145/1007352.1007374.
  • [12] Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS ’97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137–143. IEEE Computer Society, 1997. doi:10.1109/SFCS.1997.646102.
  • [13] Nathan J. Fine and Herbert S. Wilf. Uniqueness theorems for periodic functions. Proceedings of the American Mathematical Society, 16:109–114, 1965. doi:10.1090/S0002-9939-1965-0174934-9.
  • [14] Martin Fürer. How fast can we multiply large integers on an actual computer? In Alberto Pardo and Alfredo Viola, editors, LATIN 2014: Theoretical Informatics - 11th Latin American Symposium, Montevideo, Uruguay, March 31 - April 4, 2014. Proceedings, volume 8392 of Lecture Notes in Computer Science, pages 660–670. Springer, 2014. doi:10.1007/978-3-642-54423-1_57.
  • [15] Esteban Gabory, Moses Njagi Mwaniki, Nadia Pisanti, Solon P. Pissis, Jakub Radoszewski, Michelle Sweering, and Wiktor Zuba. Pangenome comparison via ED strings. Frontiers in Bioinformatics, 4, 2024. doi:10.3389/fbinf.2024.1397036.
  • [16] Pawel Gawrychowski, Gad M. Landau, and Tatiana Starikovskaya. Fast entropy-bounded string dictionary look-up with mismatches. In Igor Potapov, Paul G. Spirakis, and James Worrell, editors, 43rd International Symposium on Mathematical Foundations of Computer Science, MFCS 2018, August 27-31, 2018, Liverpool, UK, volume 117 of LIPIcs, pages 66:1–66:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2018. doi:10.4230/LIPICS.MFCS.2018.66.
  • [17] Roberto Grossi, Costas S. Iliopoulos, Chang Liu, Nadia Pisanti, Solon P. Pissis, Ahmad Retha, Giovanna Rosone, Fatima Vayani, and Luca Versari. On-line pattern matching on similar texts. In Juha Kärkkäinen, Jakub Radoszewski, and Wojciech Rytter, editors, 28th Annual Symposium on Combinatorial Pattern Matching, CPM 2017, July 4-6, 2017, Warsaw, Poland, volume 78 of LIPIcs, pages 9:1–9:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPICS.CPM.2017.9.
  • [18] Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J. Comput., 13(2):338–355, 1984. doi:10.1137/0213024.
  • [19] Costas S. Iliopoulos, Ritu Kundu, and Solon P. Pissis. Efficient pattern matching in elastic-degenerate strings. Inf. Comput., 279:104616, 2021. doi:10.1016/J.IC.2020.104616.
  • [20] Gad M. Landau and Uzi Vishkin. Efficient string matching with k mismatches. Theor. Comput. Sci., 43:239–249, 1986. doi:10.1016/0304-3975(86)90178-7.
  • [21] Veli Mäkinen, Bastien Cazaux, Massimo Equi, Tuukka Norri, and Alexandru I. Tomescu. Linear time construction of indexable founder block graphs. In Carl Kingsford and Nadia Pisanti, editors, 20th International Workshop on Algorithms in Bioinformatics, WABI 2020, September 7-9, 2020, Pisa, Italy (Virtual Conference), volume 172 of LIPIcs, pages 7:1–7:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPICS.WABI.2020.7.
  • [22] Solon P. Pissis, Jakub Radoszewski, and Wiktor Zuba. Faster approximate elastic-degenerate string matching – Part A. In Paola Bonizzoni and Veli Mäkinen, editors, 36th Annual Symposium on Combinatorial Pattern Matching, CPM 2025, June 17-19, 2025, Milan, Italy, volume 331 of LIPIcs, pages 28:1–28:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2025. doi:10.4230/LIPIcs.CPM.2025.28.
  • [23] Milan Ruzic. Constructing efficient dictionaries in close to sorting time. In Luca Aceto, Ivan Damgård, Leslie Ann Goldberg, Magnús M. Halldórsson, Anna Ingólfsdóttir, and Igor Walukiewicz, editors, Automata, Languages and Programming, 35th International Colloquium, ICALP 2008, Reykjavik, Iceland, July 7-11, 2008, Proceedings, Part I: Tack A: Algorithms, Automata, Complexity, and Games, volume 5125 of Lecture Notes in Computer Science, pages 84–95. Springer, 2008. doi:10.1007/978-3-540-70575-8_8.
  • [24] Daniel Dominic Sleator and Robert Endre Tarjan. A data structure for dynamic trees. J. Comput. Syst. Sci., 26(3):362–391, 1983. doi:10.1016/0022-0000(83)90006-5.
  • [25] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1–11. IEEE Computer Society, 1973. doi:10.1109/SWAT.1973.13.

Appendix A Very Short Case (for 𝒌=𝟏)

Before stating the algorithm, we need a few standard tools.

Suffix Tree.

A trie is a (rooted) tree, where every edge is labeled with a single character. Each node of a trie represents the string obtained by concatenating the labels on its path from the root. We consider only deterministic tries, meaning that the labels of all edges outgoing from the same node are pairwise distinct. Then, a compact trie is obtained from a trie by collapsing maximal downward paths on which every inner node has exactly one child. The remaining nodes are called explicit, and the nodes that have been removed while collapsing the paths are called implicit. In a compact trie, every edge is labeled with a nonempty string, and the first characters of all edges outgoing from the same node are pairwise distinct.

The suffix tree of a string T[0..n1] is the compact trie of all the suffixes of T$, where $ is a special character not occurring anywhere in T [25]. Thus, there are n+1 leaves in the suffix tree of T, and it contains 𝒪(n) nodes and edges. The label of each edge is equal to some fragment T[i..j], and we represent it by storing i and j; thus the whole suffix tree needs only 𝒪(n) space. For constructing the suffix tree we apply the following result.

Lemma 18 ([12]).

The suffix tree of a string T[0..n1] over a polynomial alphabet can be constructed in 𝒪(n) time.

The suffix tree of string T, denoted by STT, allows us to easily check if any string X is a substring of T by starting at the root and simulating navigating down in the trie storing the suffixes of T. In every step, we need to choose the outgoing edge labeled by the next character X[i]. If the current node is implicit, this is trivial to implement in constant time. Otherwise, we might have multiple outgoing edges, and we need to store the first characters of their edges in an appropriate structure. To this end, we use deterministic dictionaries.

Lemma 19 ([23, Theorem 3]).

Given a set S of n integer keys, we can build in 𝒪(n(loglogn)2) time a structure of size 𝒪(n), such that given any integer x we can check in 𝒪(1) time if xS, and if so return its associated information.

We apply Lemma 19 at each explicit node of the suffix tree. This takes 𝒪(n(loglogn)2) total time, and then allows us to implement navigating down in 𝒪(|X|) total time. This also gives us, for each prefix X[..i], its unique identifier: if we are at an explicit node then it is simply its preorder number, and otherwise it is a pair consisting of the preorder number of the nearest explicit ancestor and the length of the current prefix. If in any step we cannot proceed further, we set the identifier to null denoting that the prefix does not occur in T. Such identifiers have the property that the identifier of X[..i] is null if and only if X[..i] does not occur in T, and the identifiers of X[..i] and Y[..j] that both occur in T are equal if and only if the strings themselves are equal. Further, we can think that each identifier is an integer from {1,2,,n2}.

See 5

Proof.

We assume that the length of each pattern Pi is at most B. Recall that we are given bitvectors U0,U1 and the goal is to compute the bitvectors V0,V1. This will be done by explicitly listing all fragments T[j..j) such that δH(T[j..j),Pi)=0, for some i, and δH(T[j..j),Pi)=1, for some i. For each such a fragment, we propagate the appropriate information from the input bitvectors to the output bitvectors.

We begin with constructing the suffix trees of T and Tr, including the dictionary structures at each explicit node. Then, we distinguish two cases as follows.

Exact Occurrences.

For each pattern Pi, we find its identifier in STT in 𝒪(|Pi|) time. If the identifier is non-null then we include it in a set S.

We iterate over every position j=0,1,,m1 and length =1,2,,min{B,mj}. While we iterate over the lengths, we simultaneously navigate in STT to maintain the identifier of T[j..j+). To check if T[j..j+)=Pi for some i, we thus need to check if a specific identifier belongs to S. Recall that the identifiers are integers from {1,2,,m2}. To avoid the necessity of paying extra logarithmic factors or using randomization, we answer all such queries together. In more detail, we gather all the queries. Then, we sort the elements of S and the queries together with radix sort. Then, we scan the obtained sorted list and obtain the answer to each query in linear total time. Finally, if T[j..j+)=Pi for some i and Ud[j]=1, then we set Vd[j+]=1, for d=0,1.

Occurrences with one Mismatch.

For each pattern Pi, we iterate over every position j=0,1,,|Pi|1, assuming that the mismatch is at position j. We would like to have access to the identifiers of Pi[..j1] and Pi[j+1..]. This can be guaranteed by first navigating in STT to compute the identifier of every prefix Pi[..j] in 𝒪(|Pi|) time, and similarly navigating in STTr to compute the identifier of the reversal of every suffix (Pi[j..])r. After such a preliminary step, for every position j=0,1,,|Pi|1, if both identifiers are non-null then we form a pair consisting of the identifier of Pi[..j1] and the identifier of (Pi[j+1..])r. Let S denote the obtained set of pairs.

We iterate over every position j=0,1,,m1, position j=j,j+1,,m1 and position j′′=j,j+1,,m1, where T[j..j′′] is the considered fragment of T and j is the position of the mismatch. We would like to have access to the identifier of (T[j..j))r in STTr and the identifier of T(j..j′′] in STT. This can be assumed without increasing the time complexity by first iterating over j (in any order), then over j′′ in the increasing order, and finally over j in decreasing order, all while simultaneously navigating in STT and STTr, respectively. With the identifiers at hand, we need to check if the pair consisting of the identifier of T(j..j′′] and the identifier of (T[j..j))r belongs to S. Similarly as for exact occurrences, this is done by answering the queries together with radix sort. Then, if δH(T[j..j′′],Pi)1, for some i, and U0[j]=1, we set V1[j′′+1]=1.

Summary.

The algorithm described above consists of the following steps. First, we need to construct the suffix trees of T and TR in 𝒪(m) time. Constructing the deterministic dictionaries storing the outgoing edges takes 𝒪(m(loglogm)2) time. Second, listing and processing the exact occurrences takes 𝒪(mB+N) time. Third, listing and processing occurrences with one mismatch takes 𝒪(m(B)2+N) time.

Appendix B Omitted Proofs

See 13

Proof.

We first prove that the periods Qi obtained for all the patterns Pi must be cyclically equivalent. Let Tmid:=T[/2..) be the middle part of T. Since all patterns are of length |P| and the text is of length |T|1.5, all pattern occurrences must cover the middle part of T. Recall that we assume that every pattern Pi has some 2k-period Qi. By triangle inequality, every Qi must be a 3k-period of Tmid. We will first show that if the strings Qi are primitive and of length |Qi|</100k, then they are all cyclically equivalent. Select any two such periods of Tmid, denoted by Q1 and Q2, and assume (only to avoid clutter) that both of their offsets are equal to 0.

First, assume that Q1 and Q2 are not of the same length. Observe that since the size of the combined set of periodic mismatches Mis(Tmid,Q1[0../2))Mis(Tmid,Q2[0../2)) is at most 6k, there must exist a substring Tsub of Tmid that does not contain any such mismatch of length at least

|Tsub||Tmid|6k6k+1/2+16k+11>14k1.

The strings Q1 and Q2 are thus exact periods of Tsub. In addition we have

|Q1|+|Q2|/100k+/100k</14k<|Tsub|+1

which by the periodicity lemma of Fine and Wilf [13] induces a period of length gcd(|Q1|,|Q2|), and contradicts the assumption that Q1 and Q2 are primitive.

In the other case, when |Q1|=|Q2|, assume that Q1Q2. We would then have

δH(Q1[0../2),Q2[0../2))/2|Q1|/2/100k=50k.

On the other hand, by triangle inequality

δH(Q1[0../2),Q2[0../2))δH(Q1[0../2),Tmid)+δH(Tmid,Q2[0../2))3k+3k,

which again gives us a contradiction and proves that Q1 must be equivalent to Q2.

Since we have assumed that some P is a k-mismatch prefix of T and some P is a k-mismatch suffix of T, both having 2k-period Q, it can be proven with similar arguments that Q is a 6k-period of T.

See 14

Proof.

For any xOcckH(P,T), by triangle inequality, we have

δH(P¯,T¯[x..x+|P|))δH(P¯,P)+δH(P,T[x..x+|P|))+δH(T[x..x+|P|),T¯[x..x+|P|))

and since

  • δH(P¯,P)2k,

  • xOcckH(P,T)δH(P,T[x..x+|P|))k,

  • δH(T[x..x+|P|),T¯[x..x+|P|))δH(T,T¯)6k

we get

δH(P¯,T¯[x..x+|P|))9k.

Recall that P¯=Q[r..r+|P|) and T¯[x..x+|P|)=Q[x..x+|P|) both have a primitive period Q (with offsets r and x, respectively). If their offsets are not congruent modulo |Q|, we can bound the number of mismatches by

δH(P¯,T¯[x..x+|P|))|P|/|Q|>9k,

which yields a contradiction (the second inequality follows from |Q||P|/128k). Therefore x{r+i|Q|:i}.