
Repetition Aware Text Indexing for Matching Patterns with Wildcards

Daniel Gibney (corresponding author), University of Texas at Dallas, Richardson, TX, USA; Jackson Huffstutler, University of Texas at Dallas, Richardson, TX, USA; Mano Prakash Parthasarathi, North Carolina State University, Raleigh, NC, USA; Sharma V. Thankachan, North Carolina State University, Raleigh, NC, USA
Abstract

We study the problem of indexing a text T[1..n] to support pattern matching with wildcards. The input of a query is a pattern P[1..m] containing $h\in[0,k]$ wildcard (a.k.a. don’t care) characters, and the output is the set of occurrences of P in T (i.e., starting positions of substrings of T that match P), where $k=o(\log n)$ is fixed at index construction. A classic solution by Cole et al. [STOC 2004] provides an index with space complexity $O(n(c\log n)^k/k!)$ and query time $O(m+2^h\log\log n+\mathsf{occ})$, where $c>1$ is a constant and $\mathsf{occ}$ denotes the number of occurrences of P in T. We introduce a new data structure that significantly reduces space usage for highly repetitive texts while maintaining efficient query processing. Its space (in words) and query time are as follows:

$O\left(\delta\log\tfrac{n}{\delta}\cdot c^k\left(1+\tfrac{\log^k(\delta\log n)}{k!}\right)\right)$ and $O((m+2^h+\mathsf{occ})\log n)$

The parameter δ, known as substring complexity, is a recently introduced measure of repetitiveness that serves as a unifying and lower-bounding metric for several popular measures, including the number of phrases in the LZ77 factorization (denoted by z) and the number of runs in the Burrows-Wheeler Transform (denoted by r). Moreover, $O(\delta\log(n/\delta))$ is the optimal space required to encode the data in terms of n and δ, which lets us gauge how close our space is to the minimum required. In another trade-off, we match the query time of Cole et al.’s index using $O(n+\delta\log(n/\delta)(c\log\delta)^{k+\epsilon}/k!)$ space, where $\epsilon>0$ is an arbitrarily small constant. We also demonstrate how these techniques can be applied to a more general indexing problem, where the query pattern includes up to k gaps (a gap can be interpreted as a contiguous sequence of wildcard characters).

Keywords and phrases:
Pattern Matching, Text Indexing, Wildcard Matching
Category:
Track A: Algorithms, Complexity and Games
Funding:
Mano Prakash Parthasarathi: U.S. National Science Foundation (NSF) award CCF-2316691.
Sharma V. Thankachan: U.S. National Science Foundation (NSF) award CCF-2316691.
Copyright and License:
© Daniel Gibney, Jackson Huffstutler, Mano Prakash Parthasarathi, and Sharma V. Thankachan; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Pattern matching
Editors:
Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis

1 Introduction and Related Work

Efficient indexing of string or textual data is crucial in fields such as computational biology and information retrieval, enabling fast and accurate search queries. A fundamental task in this context is to preprocess a long string T[1..n] (the text) into a data structure that allows efficient retrieval of occurrences of a short string P[1..m] (the pattern) when queried. An occurrence of P refers to the starting position of a substring in T that matches P either exactly or approximately, depending on the matching criteria. We denote the number of occurrences by $\mathsf{occ}$. In the exact match setting, the well-known suffix tree [51] supports such queries in $O(m+\mathsf{occ})$ time. WLOG, we assume that the characters in T are from an integer alphabet $\Sigma=\{1,2,\ldots,\sigma\}$ with $\sigma\le n$.

A major drawback of suffix trees (and even closely related suffix arrays) is their space usage, which is O(n) words (equivalently, $O(n\log n)$ bits). This can be significantly larger than the space required to store the text itself, which is $n\log\sigma$ bits. In the early 2000s, a key research challenge was designing indexes that required space closer to that of the text. This led to major breakthroughs, such as compressed suffix arrays/trees [17, 46, 47], the FM-index [12], and their refinements. See [40] for an extensive survey on this topic. These innovations enabled encodings that use close to $n\log\sigma$ bits – or even approach entropy-compressed space – while supporting suffix tree/array operations efficiently. While these findings are significant, they face major challenges when applied to many modern datasets, such as large collections of DNA sequences or versioned texts. These datasets tend to be highly repetitive, i.e., contain many long, repeated substrings. Although large in size, repetitiveness makes them highly compressible, but statistical entropy-based compressors (and data structures of that space) are often less effective at capturing such redundancy. Instead, dictionary-based compressors perform better in these scenarios [37].

Two popular dictionary-based compression methods are Lempel-Ziv (LZ77) [53] and the (run-length-encoded) Burrows-Wheeler Transform (BWT) [7], which use O(z) and O(r) space, respectively. Here, z denotes the number of phrases in the LZ77 factorization of the text, while r represents the number of runs in the text’s BWT. A major recent breakthrough in this area is the introduction of two new measures of repetitiveness – the string attractor [25] and the δ-measure (a.k.a. substring complexity) [27, 43] – which led to the discovery of the asymptotic relationship between several seemingly distinct measures. A string attractor Γ of T[1..n] is a subset of the text’s positions such that for every substring T[i..j] there exist $i',j'\in[1,n]$ and $x\in\Gamma$ where $T[i'..j']=T[i..j]$ and $x\in[i',j']$. In linear time, we can compute a string attractor of size z or r; however, finding the smallest string attractor (whose size is denoted by γ) is NP-hard. The δ-measure is equal to $\max_t d_t(T)/t$, where $d_t(T)$ is the number of distinct substrings of T of length t; it can easily be computed in O(n) time using a suffix tree. An important and unifying result in the field of text compression is that $\delta\le\gamma\le z,r=O(\delta\log\delta\cdot\max(1,\log\tfrac{n}{\delta\log\delta}))$ [23]; also $z=O(\delta\log(n/\delta))$ [43]. We can also bound $\bar r$, the number of runs in the BWT of the text’s reverse, by $O(\delta\log\delta\cdot\max(1,\log\tfrac{n}{\delta\log\delta}))$ [23], since δ (and γ) are invariant under reversal of the text. In short, the compressibility measures mentioned are always within poly-logarithmic factors of one another, and δ lower bounds all of them. It is also known that the optimal space required to encode the data in terms of n and δ is $O(\delta\log(n/\delta))$ [28]. This raises an important question:

Can we index the text in (close to) optimal repetition-aware space of $O(\delta\log(n/\delta))$ words and support various pattern matching queries efficiently?

For exact pattern matching, this question has been answered positively; there exists such an optimal-space index that supports queries in $O(m+(1+\mathsf{occ})\log^\epsilon n)$ time [26], where $\epsilon>0$ is an arbitrarily small constant. The authors also showed how to achieve $O(m+\mathsf{occ})$ query time using a $\log^\epsilon(n/\delta)$ factor more space. A recently introduced structure (called δ-SA) encodes the suffix array and inverse suffix array in optimal space and supports random access in poly-logarithmic time [24]. Another structure encodes the suffix tree in $O(r\log(n/r))$ space and supports all basic operations in $O(\log(n/r))$ time [14]. We refer to [38] for a detailed survey on this topic. In summary, data structures for the exact pattern-matching problem have kept pace with the data in terms of space as repetitiveness has increased. However, this adaptation is still in its early stages for more advanced problems; see [1, 16, 39, 36, 45, 44, 50] for a list of problems where only limited progress has been achieved. To this end, we revisit two fundamental problems, which we define formally below.

Problem 1 (k-Wildcard Indexing).

Given a text T[1..n] and an integer $k=o(\log n)$, design a data structure (called an index) that supports the following query efficiently.

  • Input: A pattern P[1..m] with $h\le k$ wildcards. A wildcard is a character, denoted by #, that matches any character.

  • Output: The set of occurrences of P in T (denote the output size by 𝗈𝖼𝖼).

The alphabet set of T is $\Sigma=\{1,2,\ldots,\sigma\}$ with $\sigma\le n$. The alphabet set of P is $\Sigma\cup\{\#\}$.

This problem can be solved algorithmically in $O(n\log m)$ time for any k using the Fast Fourier Transform (FFT). The first non-trivial indexing solution was given by Cole et al. [10]. Their $O(n(c\log n)^k/k!)$ space structure, known as the k-wildcard errata trie, can answer queries in time $O(m+2^h\log\log n+\mathsf{occ})$, where $c>1$ is a fixed constant. There have been several attempts to generalize/improve this classic result for constant k, as follows.

  • $O(n\log n\log_\beta^{k-1}m)$ space and $O(m+\beta^h\log\log n+\mathsf{occ})$ query time [6], for any $\beta\in[2,\sigma]$.

  • $O(n\log^{k-1+\epsilon}n)$ space and $O(m+2^h\log\log n+\mathsf{occ})$ query time [32].

  • $O(n\log^{k-1}n\log\sigma)$ space and $O(m+2^h\log n+\mathsf{occ})$ query time [32]. We note that this time can be improved to $O(m+2^h\log\log n+\mathsf{occ})$.

There are also some linear (or slightly better) space indexes that can handle queries with an unbounded number h of wildcards. However, they generally have a factor $\sigma^h$ in the query time, making them comparable to the others only when the alphabet size is constant [6, 33, 31]. This includes an index achieving $O(m+\mathsf{occ})$ query time using $O(n(\sigma k\log\log n)^k)$ space [6]. We present the first set of repetition-aware solutions, as summarized below.

Theorem 1.

There exists a solution for Problem 1 with space

$O\left(\delta\log\tfrac{n}{\delta}\cdot c^k\left(1+\tfrac{\log^k(\delta\log n)}{k!}\right)\right)$ words and query time $O((m+2^h+\mathsf{occ})\log n)$.

Here δ is the substring complexity and c>1 is a fixed constant.

The query time is improved in the next result, matching that of Cole et al.’s index.

Theorem 2.

There exists a solution for Problem 1 with space

$O\left(\delta\log(n/\delta)\tfrac{(c\log\delta)^{k+\epsilon}}{k!}+\tfrac{n}{(\log_\sigma n)^{1-\epsilon}}\right)$ words and query time $O(m+2^h\log\log n+\mathsf{occ})$.

Here δ is the substring complexity, $c>1$ is a fixed constant, and $\epsilon>0$ is an arbitrarily small constant. The query time can be made $O(m+\mathsf{occ})$ by maintaining an additional structure of space $O(\delta\log(n/\delta)(c^k\log\log n)^{k+2}/k!)$.

We also show how our techniques can be extended to design a repetition-aware index for a closely related problem known as indexing with gaps [30].

2 Preliminaries

Notation.

For a length-n string T[1..n], we denote the ith character by T[i]. We use “$\cdot$” to denote concatenation between two strings. A substring of T starting at position i and ending at position j is denoted by $T[i..j]=T[i]\cdot T[i+1]\cdots T[j]$. If $i>j$, then T[i..j] is the empty string. We also use the notation $T[i..j)=T[i]\cdots T[j-1]$ and $T(i..j]=T[i+1]\cdots T[j]$. We use $T^R$ to denote the reverse of the string T. A suffix is a string of the form T[i..n], which we denote as T[i..]. A suffix T[i..] is called a proper suffix if $i>1$. A prefix is a string of the form T[1..i], which we denote as T[..i]. A prefix T[..i] is called a proper prefix if $i<n$.

Compact Tries and Suffix Trees.

Let $\mathcal S$ be a set of strings over Σ. We use $\mathcal T(\mathcal S)$ to denote a compact trie constructed over $\mathcal S$ after appending each string in $\mathcal S$ with a special character $\$\notin\Sigma$. The character \$ is lexicographically smaller than all other characters in Σ. The number of leaves in $\mathcal T(\mathcal S)$ is exactly $|\mathcal S|$. The leaves of $\mathcal T(\mathcal S)$, in left-to-right order, are denoted $\ell_1,\ell_2,\ldots,\ell_{|\mathcal S|}$, such that $\ell_i$ corresponds to the ith string in $\mathcal S$ when $\mathcal S$ is sorted in ascending lexicographic order. For a node u (either implicit or explicit; branching nodes and leaves are called explicit nodes, and positions on edges are called implicit nodes) in $\mathcal T(\mathcal S)$, we use str(u) to denote the string obtained by concatenating edge labels on the root-to-u path and string depth strlen(u) to denote its length.

Given a pattern P[1..m] and a compact trie $\mathcal T$, if $str(u)=P$ for some node u (either implicit or explicit) in $\mathcal T$, then u is called the locus of P. Given a locus, the collection of leaves under that locus forms a contiguous range $\ell_a,\ell_{a+1},\ldots,\ell_b$, and we call [a,b] the range of the locus. Observe that P is a prefix of $str(\ell_x)$ for all $x\in[a,b]$.

The suffix tree [51] of a string T[1..n], denoted as ST(T), is a compact trie built over all suffixes of T. The suffix tree can be constructed in O(n) time for polynomially-sized integer alphabets [11]. The suffix array SA[1..n] is an array such that T[SA[i]..] is the ith suffix when sorted in ascending lexicographic order. The inverse suffix array ISA[1..n] is an array such that $ISA[SA[i]]=i$; equivalently, ISA[i] is the lexicographic rank of the suffix T[i..]. The longest common extension of two suffixes T[i..] and T[j..], denoted by LCE(i,j), is the length of their longest common prefix. LCE queries can be answered in O(1) time using an O(n) space structure [18].
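To make these definitions concrete, the following is a minimal Python sketch (0-based positions, naive constructions for illustration only; a real index would use the linear-time algorithms cited above):

```python
def suffix_array(T: str) -> list[int]:
    # Naive O(n^2 log n) construction; ST(T)/SA can be built in O(n) time [11].
    return sorted(range(len(T)), key=lambda i: T[i:])

def inverse_suffix_array(SA: list[int]) -> list[int]:
    ISA = [0] * len(SA)
    for rank, pos in enumerate(SA):
        ISA[pos] = rank  # ISA[SA[i]] = i
    return ISA

def lce(T: str, i: int, j: int) -> int:
    # Longest common prefix length of suffixes T[i..] and T[j..].
    l = 0
    while i + l < len(T) and j + l < len(T) and T[i + l] == T[j + l]:
        l += 1
    return l

T = "abababbabab"
SA = suffix_array(T)
ISA = inverse_suffix_array(SA)
print(lce(T, 0, 2))  # 4: T[0..] and T[2..] share the prefix "abab"
```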

There exists a combination of compact data structures of total space $O(n\log\sigma\log_\sigma^\epsilon n)$ bits, equivalently $O(n/\log_\sigma^{1-\epsilon}n)$ words, that supports all suffix tree functionalities with no slowdown in performance [17, 47]. There also exist indexes of space $O(n\log\sigma)$ bits that support SA and ISA queries in $O(\log_\sigma^\epsilon n)$ time [17] and LCE queries in O(1) time [22].

Burrows-Wheeler Transform.

The Burrows-Wheeler Transform (BWT) [7] of a text T[1..n], denoted by BWT[1..n+1], is a permutation of all characters in $T\cdot\$$ such that BWT[i] is the last character of the ith string in the set $\{T[i..]\cdot\$\cdot T[..i)\mid i\in[1,n]\}\cup\{\$\cdot T\}$ when sorted lexicographically. The BWT can be computed in linear time. The BWT has been a popular form of compression used in several previous text indexes [3, 34, 35], including the well-known FM-index [12]. The popularity of the BWT is due to the following favorable properties: First, the BWT groups characters sharing similar contexts, resulting in a small run-length encoded form when the text is highly repetitive. The letter r denotes the number of maximal unary substrings (called runs) in the BWT. Second, the BWT is amenable to efficient pattern matching and string query procedures.
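A minimal sketch of this definition, building the BWT from the sorted rotations of T·$ and counting its runs (purely illustrative; linear-time construction goes through the suffix array):

```python
def bwt(T: str) -> str:
    # Last column of the sorted rotations of T + "$" ('$' < all ASCII letters).
    S = T + "$"
    rotations = sorted(S[i:] + S[:i] for i in range(len(S)))
    return "".join(rot[-1] for rot in rotations)

def num_runs(s: str) -> int:
    # r = number of maximal unary substrings (runs).
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])

B = bwt("abababbabab")
print(B, num_runs(B))  # a repetitive text yields a BWT with few runs
```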

Figure 1: LZ77-factorization of abababbabab (the dashed and the solid boxes capture a primary and a secondary occurrence of bab, respectively).
Figure 2: Reduction of finding primary occurrences to 2D range queries.
LZ77 and String Attractor.

Lempel-Ziv (LZ77) [53] is another popular type of repetition-based compression. LZ77 partitions the string greedily from left to right into what are called LZ factors or phrases, $T[i_1..i_2),T[i_2..i_3),\ldots$. Let z denote the number of factors. The following conditions are satisfied: (i) $T=T[i_1..i_2)\cdots T[i_z..]$ and (ii) $T[i_j..i_{j+1})$ is either the first occurrence of a character or the longest substring starting at position $i_j$ that has an occurrence starting at some position $<i_j$. The positions $i_1,\ldots,i_z$ are called phrase boundaries, and we can compute them in linear time [49]. Let $T[i..j]=S$; we say i is a primary occurrence of S if a phrase boundary exists in the range [i,j]. Every occurrence of S not containing a phrase boundary is called a secondary occurrence. It can easily be shown that every distinct substring of T has a primary occurrence [21], which makes $\{i_1,\ldots,i_z\}$ a string attractor. See Figure 1 for an example.
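The greedy parse is easy to state in code. A naive quadratic Python sketch (0-based; the search window below allows the earlier occurrence to overlap the phrase itself, as condition (ii) permits):

```python
def lz77_boundaries(T: str) -> list[int]:
    # Greedy left-to-right LZ77 parse; returns the 0-based phrase start positions.
    boundaries, i = [], 0
    while i < len(T):
        boundaries.append(i)
        length = 0
        # Extend while T[i..i+length] occurs starting at some position < i.
        while i + length < len(T) and T.find(T[i:i + length + 1], 0, i + length) != -1:
            length += 1
        i += max(length, 1)  # a first-occurrence character forms its own phrase
    return boundaries

print(lz77_boundaries("abababbabab"))  # [0, 1, 2, 6] -> phrases a|b|abab|babab
```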

Repetition-Aware Suffix Trees/Arrays and the 𝒓-index.

By indexing the text in space $O(r\log(n/r))$, we can support SA, ISA, and LCE operations in $O(\log(n/r))$ time [14]. There also exists an optimal $O(\delta\log(n/\delta))$ space index (called δ-SA) that supports SA and ISA operations in $O(\log^{4+\epsilon}n)$ time and LCE operations in $O(\log n)$ time [24]. The r-index is another important structure, of space O(r) words [14]; see [41] for its refinements. It supports efficient pattern matching, but it cannot support any of the basic operations mentioned earlier. Fortunately, the limited operations supported by the r-index are sufficient for our purpose.

Orthogonal Range Queries.

The task here is to preprocess a set of d-dimensional points $\mathcal P$. For a query $[a_1,b_1]\times\cdots\times[a_d,b_d]$, the output is the subset of points in $\mathcal P\cap[a_1,b_1]\times\cdots\times[a_d,b_d]$. Over a set of N 2-dimensional points in an $N\times N$ grid, we can maintain an O(N) space structure that supports queries in $O((1+t)\log^\varepsilon N)$ time [8], where $\varepsilon>0$ is an arbitrarily small constant, or an $O(N\log^\varepsilon N)$ space structure that supports queries in $O(\log\log N+t)$ time [9], where t denotes the output size.

A standard technique in compressed text indexing, and one that we expand on here, is the use of orthogonal range queries to report primary occurrences [13, 29, 38]. For a simple example, assume that we have the phrases $T=T[i_1..i_2)\cdot T[i_2..i_3)\cdots T[i_s..n]$. We build a compact trie $\mathcal T_F$ over the suffixes $T[i_1..],\ldots,T[i_s..]$, maintaining for each leaf its respective $i_j$ value. We also build a compact trie $\mathcal T_R$ over the reversed prefixes $T[..i_2)^R,\ldots,T[..i_s)^R$, maintaining for each leaf its respective $i_j$ value. Next, for each $i_j$, we create a 2D point (x,y), where x is the leaf index in $\mathcal T_R$ corresponding to $T[..i_j)^R$ and y is the leaf index in $\mathcal T_F$ corresponding to $T[i_j..]$, and we associate the value $i_j$ with (x,y). We construct a 2D range query structure over this set of points.

Given a query pattern P, for each $x\in[0,m)$, we find the locus of $P[..x]^R$ in $\mathcal T_R$; let its range be $[a_1,b_1]$. We also find the locus of P[x+1..] in $\mathcal T_F$; let its range be $[a_2,b_2]$. Observe that the query $[a_1,b_1]\times[a_2,b_2]$ returns points for phrase boundaries aligning with position x+1 in a primary occurrence of P. See Figure 2.
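The following sketch captures the reduction’s logic with the geometry stripped away: the 2D query “left part ends at a boundary, right part starts at it” is replaced by a direct scan over boundaries (illustrative only; in the actual index the tries and point set above replace this scan with fast range queries):

```python
def primary_occurrences(T: str, boundaries: list[int], P: str) -> set[int]:
    # An occurrence at i is primary iff some phrase boundary b lies in
    # [i, i+|P|-1]; we enumerate splits x so that i + x == b.
    occs = set()
    for x in range(len(P)):
        for b in boundaries:
            i = b - x
            if i >= 0 and T[i:i + len(P)] == P:
                occs.add(i)
    return occs

T = "abababbabab"
print(primary_occurrences(T, [0, 1, 2, 6], "bab"))  # {1, 6}; 3 and 8 are secondary
```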

Heavy Path Decomposition.

For a general tree $\mathcal T$ and node u in $\mathcal T$, we denote the subtree rooted at u as subtree(u). The size of subtree(u), denoted as |subtree(u)|, is defined as the number of leaves in subtree(u). We next describe heavy path decomposition [48]. Given any tree $\mathcal T$, we categorize its nodes into light and heavy: we take the root as a light node. Exactly one child of every (non-leaf) node is heavy, specifically the one with maximum subtree size. When there is a tie, we pick the leftmost child as heavy. Any path starting from a light node, recursively following its heavy child, and ending at a leaf node is a heavy path. Note that each node belongs to precisely one heavy path.

Property 1.

Each root-to-leaf path in a tree of size n contains at most $\log n$ light nodes.

Property 1 can be shown by observing that for any node u in $\mathcal T$ with light child v, $|subtree(v)|\le|subtree(u)|/2$. Letting L be the set of light nodes in $\mathcal T$, Property 1 further implies that $\sum_{u\in L}|subtree(u)|\le n\log n$.
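A short sketch of the decomposition (leaf-counted subtree sizes, leftmost child on ties), useful for checking Property 1 on small trees; the dictionary-based tree format is an assumption for illustration:

```python
def heavy_children(children: dict[int, list[int]], root: int) -> dict[int, int]:
    # children maps each internal node to its child list; leaves are absent.
    size, heavy = {}, {}
    def dfs(u: int) -> int:
        if u not in children:
            size[u] = 1          # size = number of leaves below u
            return 1
        size[u] = sum(dfs(v) for v in children[u])
        # Heavy child: maximum subtree size; max() keeps the leftmost on ties.
        heavy[u] = max(children[u], key=lambda v: size[v])
        return size[u]
    dfs(root)
    return heavy

# Tiny example tree: 0 -> {1, 2}, 1 -> {3, 4, 5}
print(heavy_children({0: [1, 2], 1: [3, 4, 5]}, 0))  # {1: 3, 0: 1}
```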

3 Our Solution

As the first building block, we maintain a base structure that supports suffix array, inverse suffix array, and LCE queries for the text and its reverse. This can be a suffix tree in some (compressed) form. We use $S_{\mathsf{DS}}$ to denote the space of the base structure and $t_Q$ to denote its query time. We perform the analysis with a general base structure and specify concrete trade-offs in Section 5. Our approach then builds two versions of Cole et al.’s k-wildcard errata structure [10] over a subset of substrings based on a parse of the text. Given a query pattern P[1..m], we first find the primary occurrences of P. This is done using a 2D range query structure constructed over a set of points created from the leaves of both k-wildcard errata structures. The secondary occurrences are obtained using the base structure. We start with a basic subroutine used throughout.

3.1 LCP Queries

We consider a compact trie $\mathcal T$ built from a subset of q substrings of T. With every leaf of $\mathcal T$ we associate a starting text position of the corresponding substring. We seek to answer queries of the following form: Given i, j, and a node u (either implicit or explicit) in $\mathcal T$, find the locus of $T[i..j']$ in subtree(u), where $T[i..j']$ is the longest prefix of T[i..j] for which such a locus exists. We call this type of query an unrooted LCP query. If u is restricted to always be the root of $\mathcal T$, we call this a rooted LCP query.

Lemma 3.

Assuming the availability of our base structure of space $S_{\mathsf{DS}}$, which supports basic suffix tree operations (SA, ISA, and LCE queries) in $t_Q$ time, we have the following results. An unrooted LCP query on a compact trie $\mathcal T$ constructed from q substrings of T can be answered in $O(t_Q+\log\log n)$ time. The (additional) space used by the data structure is $O(q\log q)$. Moreover, for rooted LCP queries, the (additional) space can be made O(q) with the same query time complexity.

Proof.

As preprocessing for unrooted LCP queries, we do the following. Let $u_r$ be the root of $\mathcal T$. For every light node u in $\mathcal T$, we consider all substrings corresponding to the leaves in its subtree, that is, strings obtained by concatenating edge labels on $u_r$-to-leaf paths for leaves in subtree(u). Suppose this is the set $\{T[i_1..i'_1],T[i_2..i'_2],\ldots\}$. For each u, we create a y-fast trie [52] over the values $ISA[i_1+strlen(u)],ISA[i_2+strlen(u)],\ldots$ that supports predecessor search in $O(\log\log n)$ time. Note that by Property 1 of the heavy path decomposition, the number of values over all light nodes, as well as the total space, is $O(q\log q)$.

Given a query consisting of positions i, j (specifying an arbitrary substring T[i..j]) and a locus u in $\mathcal T$, we first perform an $LCE(i,b+strlen(u))$ query, where b is the text position of the leaf on the heavy path containing u. If the entirety of T[i..j] is matched along this heavy path (i.e., $LCE(i,b+strlen(u))\ge j-i+1$), or if the matching ends prematurely at an implicit node of the heavy path, we are done. Otherwise, we obtain a node v on the heavy path from which we have to explore v’s light children. We next obtain the child w of v that needs to be traversed; note that w must be a light child of v. We query the base structure to obtain the value $x\leftarrow ISA[i+strlen(w)-strlen(u)]$. We then perform a predecessor search for x over the values stored for w. The result of the predecessor search gives us at most two leaves, corresponding to the predecessor and successor of x in terms of ISA value. We then perform LCE queries with these neighboring leaves and obtain the result.

For rooted LCP queries, we apply the same preprocessing technique but only at the root of 𝒯. Queries are answered in the same way as unrooted LCP queries.
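The crux of the query is the predecessor step at a light child: among the leaf substrings below w, the best match for the query is one of the two ISA-neighbors of its rank. A minimal sketch, with the y-fast trie replaced by a sorted list and bisect (lce is the naive routine from the earlier sketch; all names are illustrative):

```python
import bisect

def lce(T: str, i: int, j: int) -> int:
    l = 0
    while i + l < len(T) and j + l < len(T) and T[i + l] == T[j + l]:
        l += 1
    return l

def best_leaf(T: str, ISA: list[int], leaf_positions: list[int],
              depth: int, q: int) -> int:
    # leaf_positions: text positions of the substrings below light node w,
    # with strlen(w) = depth; q: text position of the part of the query that
    # must match below w. Returns the leaf position maximizing the LCE.
    ranked = sorted(leaf_positions, key=lambda p: ISA[p + depth])
    keys = [ISA[p + depth] for p in ranked]
    idx = bisect.bisect_left(keys, ISA[q])           # predecessor search
    cands = ranked[max(0, idx - 1):idx + 1]          # at most two neighbors
    return max(cands, key=lambda p: lce(T, q, p + depth))
```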

3.2 Cole et al.’s 𝒌-Wildcard Errata Structure

Figure 3: Visualization of Cole’s k-wildcard errata structure.
The Structure.

Cole et al.’s structure [10] consists of a collection of compact tries, which are further categorized into levels (from level 0 to k). We use $\mathcal L_h$ to denote the set of tries at level h, for $h\in[0,k]$. In $\mathcal L_0$ we have a single trie. In Cole et al.’s framework, this level 0 trie is taken as the complete suffix tree ST(T). In this work, we will be more general, allowing it to be an arbitrary trie constructed from q substrings of T. In $\mathcal L_1$, each trie corresponds to an internal node in the level 0 trie. The level 1 trie for an internal node u of the level 0 trie is constructed as follows: we collect all strings corresponding to leaves in the subtree of u except those in the subtree of the heavy child of u, remove their first $strlen(u)+1$ characters, and create a compact trie. Suppose that a substring $T[j..j']$ used in level 1 is obtained from $T[i..j']$ after removing the first $strlen(u)+1$ characters; then we say that the substring $T[j..j']$ in the level 1 trie corresponds to the original position i. Continuing, each trie in $\mathcal L_h$, $h\in[1,k]$, corresponds to a unique internal node in some trie in level $(h-1)$. A level h trie for a node u in a level $(h-1)$ trie is constructed over all substrings for leaves in the subtree of u except those in the heavy child of u, with their first $strlen(u)+1$ characters removed. We similarly maintain the correspondence with the original position in $\mathcal L_0$ for each substring in $\mathcal L_h$. We use $\mathcal E_k$ to denote the collection of tries across all k levels. See Figure 3.

It follows directly from Property 1 of heavy path decomposition that the overall space is $O(q(\log q)^k)$. With a tighter analysis, one can show that the space is $O(q\,2^k(\log q+k)^k/k!)$. As the details of this analysis are important to our solution, and the statement of Lemma 4 differs from that made in [10], we present the argument in full.

Lemma 4.

Let q be the number of leaves in $\mathcal L_0$. Also, let i be the original position for a leaf in $\mathcal L_0$. Then i is the original position for at most $2^k(k+\log q)^k/k!$ leaves in $\mathcal E_k$.

Proof.

We fix a particular leaf $\ell$ in $\mathcal L_0$. Suppose the light nodes on the root-to-$\ell$ path in $\mathcal L_0$ are $r=u_0,u_1,\ldots,u_x$, with respective parents $v_1,\ldots,v_x$ (excluding $v_0$, since $u_0$ is the root). There are two cases: if $\ell$ contributes to $u_0$’s level 1 trie (that is, $\ell$ is in the subtree of a light child of $u_0$), then the size of $v_1$ is at most q/2, the size of $v_2$ is at most q/4, and so on. Otherwise, the size of $v_1$ is at most q, the size of $v_2$ is at most q/2, and so on. Let the inductive hypothesis be that, in a (k−1)-level structure built over q substrings, the leaf occurs at most $2^{k-1}(k-1+\log q)^{k-1}/(k-1)!$ times. Based on the observation regarding subtree sizes, for a k-level structure, the number of leaves with original position i is bounded by

$1+\frac{2^{k-1}(k-1+\log q)^{k-1}}{(k-1)!}+\frac{2^{k-1}(k-1+\log(q/2))^{k-1}}{(k-1)!}+\cdots+\frac{2^{k-1}(k-1)^{k-1}}{(k-1)!}$
$=1+\frac{2^{k-1}}{(k-1)!}\left((k-1)^{k-1}+k^{k-1}+\cdots+(k-1+\log q)^{k-1}\right)$
$\le 1+\frac{2^{k-1}}{(k-1)!}\int_{0}^{k+\log q}x^{k-1}\,dx=1+\frac{2^{k-1}(k+\log q)^{k}}{k!}\le\frac{2^{k}(k+\log q)^{k}}{k!}\quad\text{for }k\ge1.$

From Lemma 4, the total space needed for q substrings is $O(q\,2^k(\log q+k)^k/k!)$.

Querying.

Querying is carried out by starting at the root of the level 0 trie and matching until the first wildcard character # is reached. If the locus is an implicit node, then we consider the match to continue on the implicit node’s edge. If the locus of the last character before the first # is an explicit non-leaf node u, then we consider the match as continuing both to the heavy child of u and the level 1 trie for u. In both cases, we match the # by skipping the next character. In the case of the level 1 trie for u, this is accomplished by skipping the # and starting from the root of the level 1 trie.

The same procedure is recursively applied in the level 1 trie when the next wildcard is reached. Assuming P has $h\le k$ wildcards, this causes at most $2^h$ bifurcations to occur. If a mismatch occurs at any point during a branch of this search, we stop exploring that branch further. Note that if $P=P_0\#P_1\#\cdots\#P_h$ and the search is performed character by character, then the total time required is $|P_0|+2|P_1|+\cdots+2^{h-1}|P_h|=O(2^h m)$. However, by preprocessing all pattern segments $P_i$ so that they are known substrings of T, we can apply Lemma 3 and find all pattern loci in $O(2^h(t_Q+\log\log n))$ time. Note that by using the unrooted LCP query structure for tries in levels $0,\ldots,k-1$ and the rooted LCP query structure for tries in $\mathcal L_k$, the space complexity is unaltered.

3.3 Our Data Structure

We maintain a base structure (as described in the beginning of Section 3) for T and for $T^R$. We assume that we are given a string attractor $\Gamma=\{i_1,\ldots,i_s\}$ of size s such that $1=i_1<i_2<\cdots<i_s$. This gives a factorization $T=T[i_1..i_2)\cdots T[i_s..]$ with s phrases.

Forward Sparse Structure.

The construction of our structure is similar to that of Cole et al.’s, but instead of having the full suffix tree at level 0, we keep a trie of only those suffixes corresponding to the phrase boundaries. We assign a global index to the leaves across all tries. The only condition we ensure is that all leaves in the same trie get contiguous global indices in the left-to-right order. We call the resulting structure $\mathcal E^F_k$. Following the same analysis as presented in Section 3.2, the space required for $\mathcal E^F_k$ is $O(s\,2^k(\log s+k)^k/k!)$. We next equip all tries in levels $0,\ldots,k-1$ of $\mathcal E^F_k$ with the unrooted LCP query structure described in Lemma 3 and the tries at level k with the rooted LCP query structure. This does not change the space complexity.

Reverse Sparse Structure.

Next, we collect all phrases. We create a compact trie from the collection of reversed phrases (in contrast to the forward sparse structure, we do not use the entire prefix ending at a phrase boundary). For each leaf in the level 0 trie, we maintain the phrase boundary in T that immediately follows it; that is, if the reversed phrase is $T[i_j..i_{j+1})^R$, we associate the original position $i_{j+1}$ with the corresponding leaf. With this trie as level 0, we then create Cole et al.’s k-level structure. We also give a global index labeling to the leaves, as before. We call the resulting structure $\mathcal E^R_k$. As before, we equip the tries in levels $0,\ldots,k-1$ with the unrooted LCP query data structures from Lemma 3 and the tries at level k with the rooted LCP structure. The total space for this structure is $O(s\,2^k(\log s+k)^k/k!)$.

Orthogonal Range Query Structure.

Our next component is a 2D orthogonal range query structure. The 2D points (x,y) are created as follows: let y be the global index of a leaf in a level α trie of $\mathcal E^F_k$ with original position i, and let x be the global index of a leaf in a level β trie of $\mathcal E^R_k$ with original position i. We create a point for all such x and y where the inequality $\alpha+\beta\le k$ holds. We associate the original position i with that point.

3.4 Space Analysis

Applying Lemma 4, the number of leaves with original position i at level α of $\mathcal E^F_k$ is bounded by $2^\alpha(\log s+\alpha)^\alpha/\alpha!$. Similarly, the number of leaves assigned original position i at level β in $\mathcal E^R_k$ is also bounded by $2^\beta(\log s+\beta)^\beta/\beta!$. As a result, the number of 2D points corresponding to a specific original position i is bounded by

$\sum_{\alpha+\beta\le k}\frac{2^\alpha(\log s+\alpha)^\alpha}{\alpha!}\cdot\frac{2^\beta(\log s+\beta)^\beta}{\beta!}\;\le\;\sum_{j=0}^{k}\sum_{\alpha=0}^{j}\frac{2^\alpha(\log s+k)^\alpha}{\alpha!}\cdot\frac{2^{j-\alpha}(\log s+k)^{j-\alpha}}{(j-\alpha)!}$
$=\sum_{j=0}^{k}2^j(\log s+k)^j\sum_{\alpha=0}^{j}\frac{1}{\alpha!\,(j-\alpha)!}$
$=\sum_{j=0}^{k}2^j(\log s+k)^j\,\frac{1}{j!}\sum_{\alpha=0}^{j}\binom{j}{\alpha}$
$=\sum_{j=0}^{k}\frac{4^j(\log s+k)^j}{j!}$

where the last equality applies that $\sum_{\alpha=0}^{j}\binom{j}{\alpha}=2^j$. Next, we consider the inequality

$\frac{4^{j+1}(\log s+k)^{j+1}}{(j+1)!}\ge\frac{4^{j}(\log s+k)^{j}}{j!}.$

This holds as long as $j+1\le 4(\log s+k)$, which is true since $j+1\le k+1\le 4(\log s+k)$. Thus, we can write

$\sum_{j=0}^{k}\frac{4^j(\log s+k)^j}{j!}\le 1+k\cdot\frac{4^k(\log s+k)^k}{k!}=O\left(\frac{(4e^{1/e})^k(\log s+k)^k}{k!}\right)$

where we used in the last inequality that $k\le e^{k/e}$. Letting $C=4e^{1/e}$, which is constant, the total number of 2D points is $N=O(sC^k(\log s+k)^k/k!)$. It is now convenient to split our analysis into two cases, the case where $\log s$ dominates k and vice versa:

  • If $\log s$ dominates k, then the space complexity is

    $\frac{C^k(\log s+k)^k}{k!}\le\frac{(2C)^k(\log s)^k}{k!}$
  • If k dominates $\log s$, then the space complexity is

    $\frac{C^k(\log s+k)^k}{k!}\le\frac{(2C)^k k^k}{k!}\le\frac{(2C)^k k^k}{\sqrt{2\pi k}\,(k/e)^k}\le(2Ce)^k$

Hence, we can say

$\frac{C^k(\log s+k)^k}{k!}\le\frac{(2C)^k(\log s)^k}{k!}+(2Ce)^k\le(2Ce)^k\left(\frac{(\log s)^k}{k!}+1\right).$

We will first consider using a linear-space 2D range query data structure over these N points. Note that the space required by the forward sparse structure and the reverse sparse structure is also bounded by the same expression. The resulting overall space complexity is therefore $S_{\mathsf{DS}}+O\left(sc^k\left(\frac{(\log s)^k}{k!}+1\right)\right)$ for some constant c.

3.5 Querying Algorithm

We first present a simpler querying procedure and then show how it can be improved upon. Assume the query is $P=P_0\#P_1\#\cdots\#P_h$ with $h\le k$. We begin by searching for each segment $P_i$ (resp., $P_i^R$) in the base structure. If some segment does not exist in T, then there are no occurrences of P. Otherwise, this enables us to apply Lemma 3.

Finding Primary Occurrences.

Next, we obtain all primary occurrences of P. Let $\mathsf{occ}_p$ denote the number of primary occurrences of P. To find all primary occurrences, we consider each split of P, that is, P[..x] and P[x+1..] for $x\in[0,m)$. For a given split P[..x] and P[x+1..], we match $P[..x]^R$ in $\mathcal E^R_k$ and P[x+1..] in $\mathcal E^F_k$. In more detail, suppose we have β wildcards in the reverse part and α wildcards in the forward part. Then, based on the analysis of the querying procedure in Section 3.2, there will be at most $2^\beta$ bifurcation loci in the search in $\mathcal E^R_k$ and $2^\alpha$ loci in $\mathcal E^F_k$. Applying Lemma 3, we spend $O(t_Q+\log\log n)$ time per bifurcation. For a given split of P, we now have the loci in all tries of $\mathcal E^R_k$ where the matching of $P[..x]^R$ ends. Denote this set of loci as $L^R$. We also have the loci in all tries of $\mathcal E^F_k$ where the matching of P[x+1..] ends. Denote this set of loci as $L^F$. We next consider each pair of loci, $\ell_1\in L^R$ and $\ell_2\in L^F$. For $\ell_1$ (resp., $\ell_2$) we obtain a contiguous range of global indices $[a_1,b_1]$ (resp., $[a_2,b_2]$) corresponding to the leaves in the subtree of $\ell_1$ (resp., $\ell_2$). For all such pairs of ranges, we make the 2D range query $[a_1,b_1]\times[a_2,b_2]$, and from each reported point’s value and the offset x, we obtain a primary occurrence.

Finding Secondary Occurrences.

From the set of primary occurrences, we first extract one occurrence for each distinct substring of T that matches P. To do this, we find the ISA values of all primary occurrences, sort them in ascending order of their ISA values, then scan this list from left to right and retain only those entries whose LCE with the previous entry is less than m. Then, for each occurrence i in the reduced set, we can find all other occurrences of T[i..i+m) in T as follows. First: initialize $j=ISA[i]+1$; while $LCE(i,SA[j])\ge m$, report SA[j] and increment j. Second: initialize $j=ISA[i]-1$; while $LCE(i,SA[j])\ge m$, report SA[j] and decrement j. This part of the query algorithm takes $O(\mathsf{occ}_p\log\mathsf{occ}_p+\mathsf{occ}\cdot t_Q)$ time.
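A sketch of this two-phase procedure, with SA/ISA as plain arrays and a naive LCE (in the actual index, each of these is a base-structure query costing $t_Q$):

```python
def all_occurrences(T, SA, ISA, primaries, m):
    # Phase 1: keep one representative per distinct length-m match.
    def lce(i, j):
        l = 0
        while i + l < len(T) and j + l < len(T) and T[i + l] == T[j + l]:
            l += 1
        return l

    by_rank = sorted(set(primaries), key=lambda p: ISA[p])
    reps = [p for t, p in enumerate(by_rank)
            if t == 0 or lce(by_rank[t - 1], p) < m]
    # Phase 2: expand each representative over its suffix-array neighborhood.
    occs = set()
    for i in reps:
        occs.add(i)
        j = ISA[i] + 1
        while j < len(SA) and lce(i, SA[j]) >= m:
            occs.add(SA[j]); j += 1
        j = ISA[i] - 1
        while j >= 0 and lce(i, SA[j]) >= m:
            occs.add(SA[j]); j -= 1
    return occs
```

For example, with the arrays from the Section 2 sketch and primaries = {1, 6} for P = bab (m = 3), this returns {1, 3, 6, 8}, recovering the two secondary occurrences.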

Time Complexity.

Regarding the query time complexity, as there are $2^\alpha+2^\beta\le2^{h+1}$ bifurcation loci per pattern split and m pattern splits, the total time for finding all pattern loci (using Lemma 3) is bounded by $O(m2^h(t_Q+\log\log n))$. For orthogonal range queries, the number of pairs of ranges we have to consider for a given pattern split is bounded by $2^\alpha\cdot2^\beta=2^h$. Each query takes $O((1+\mathsf{occ}^x_j)\log^\varepsilon N)$ time, where N is the number of 2D points and $\mathsf{occ}^x_j$ refers to the number of occurrences reported for the jth range query (the jth locus pair) with pattern split x. Since all loci used for the orthogonal range queries are for distinct substrings of T, all reported occurrences for two different orthogonal range queries are distinct, and $\sum_j\mathsf{occ}^x_j=\mathsf{occ}^x$, where $\mathsf{occ}^x$ is the total number of occurrences reported for a split x. Summing over all orthogonal range queries for a given split, we have the time used per split x bounded by

$\sum_j(1+\mathsf{occ}^x_j)\log^\varepsilon N\le(2^h+\mathsf{occ}^x)\log^\varepsilon N.$

We claim $\sum_{x=1}^{m}\mathsf{occ}^x=\mathsf{occ}_p$. It is easy to see that every primary occurrence will be reported. Moreover, each primary occurrence will be reported only once, because $\mathcal E^R_k$ is built over the set of phrases (rather than the entire prefix ending prior to the start of a phrase). Due to this, even if a primary occurrence contains multiple phrase boundaries, it is still only reported once. In particular, it gets reported for the split aligned with its leftmost phrase boundary.

To obtain the time used by orthogonal range queries for finding all primary occurrences, we sum over all x, resulting in the expression

$\sum_{x=1}^{m}(2^h+\mathsf{occ}^x)\log^\varepsilon N=(2^h m+\mathsf{occ}_p)\log^\varepsilon N.$

Note that N can be loosely bounded by $2^{O(k+\log s)}$. This is because the maximum value of the function $f(k)=(\log s)^k/k!$ is $f(\log s)=(\log s)^{\log s}/(\log s)!\le\frac{1}{\sqrt{2\pi\log s}}e^{\log s}=2^{O(\log s)}$. Therefore, $\log N=O(k+\log s)=O(\log n)$.

The total time is $O(2^h m(t_Q+\log n)+\mathsf{occ}_p(\log\mathsf{occ}_p+\log^\varepsilon n)+\mathsf{occ}\cdot t_Q)$, in addition to the time for finding one occurrence of each segment of P.

We summarize the results of Section 3 in Lemma 5.

Lemma 5.

For Problem 1, there exists an $S_{\mathsf{DS}}+O\left(sc^k\left(\frac{(\log s)^k}{k!}+1\right)\right)$ space data structure, where s is the size of the text’s string attractor and $c>1$ is a fixed constant. The query time is $O(2^h m(t_Q+\log n)+\mathsf{occ}_p(\log\mathsf{occ}_p+\log^\varepsilon n)+\mathsf{occ}\cdot t_Q)$, in addition to the time for the initial pattern search.

4 An Improved Solution

4.1 Our Data Structure

Our next aim is to replace the product $2^h m$ in the time complexity with $2^h+m$. The idea is to handle long patterns ($m>\tau$) and short patterns ($m\le\tau$) separately, where τ is a parameter we specify later.

Long Patterns.

For long patterns, we include some additional suffixes and phrases (extended slightly more to the right and left) within our construction. Specifically, for each phrase $T[i_j..i_{j+1})$ for $j\in[1,s]$, we include all of the suffixes $T[i_j+t..]$ for $t\in(-\tau,\tau)$ in the level 0 trie of $\mathcal E^F_k$ and the strings $T[i_{j-1}..i_j+t)^R$ in the level 0 trie of $\mathcal E^R_k$. We then build $\mathcal E^F_k$ and $\mathcal E^R_k$ based on their respective level 0 tries. We also create a set of 2D points, again with (x,y) values, where x is the global index of a leaf in $\mathcal E^R_k$, y is the global index of a leaf in $\mathcal E^F_k$, and x and y both have original position $i_j+t$ for some $j\in[1,s]$ and $t\in(-\tau,\tau)$. The linear-space 2D orthogonal range query structure used in Section 3.3 is constructed over these points.

Short Patterns.

To handle short patterns, we collect the substrings $T[i_j-\tau..i_j+\tau]$ for each $j\in[1,s]$ (also recording one of its occurrences in T). After appending each such substring with a \$ (a delimiter), we concatenate them and obtain a new string $T'$. We build the k-wildcard errata structure of Cole et al. [10] for $T'$.

4.2 Space Analysis

For the long pattern data structure, the number of substrings used in level 0 of both $\mathcal E^R_k$ and $\mathcal E^F_k$ now becomes $O(s\tau)$. Following Lemma 4, the space for both $\mathcal E^F_k$ and $\mathcal E^R_k$ becomes at most $2s\tau\cdot2^k(\log(2s\tau)+k)^k/k!$. Following the same analysis presented in Section 3.4, the number of 2D points created, and hence the final asymptotic space complexity, is $O(s\tau c^k(1+\log^k(s\tau)/k!))$ for a constant c. For short patterns, the length of $T'$ is $O(s\tau)$. The k-wildcard errata structure over $T'$ requires $O(s\tau c^k(1+\log^k(s\tau)/k!))$ space.

4.3 Querying Algorithm

As before, we first find occurrences of all pattern segments using the base structure. This enables us to apply Lemma 3. We first consider the long pattern case, where $m>\tau$. Now, for finding primary occurrences, rather than trying every split $x\in[0,m)$ as was done in Section 3.5, we split only at positions $x\in\{1,1+\tau,1+2\tau,1+3\tau,\ldots\}$. For a split of P at x, we find the loci of $P[..x]^R$ in $\mathcal E^R_k$ and P[x+1..] in $\mathcal E^F_k$. For each pair of loci, we obtain two ranges of global leaf labels and perform a range query as before. From the original position associated with a reported point, we can obtain the corresponding primary occurrence. Secondary occurrences are then reported using the base structure.

For the short pattern case, where $m\le\tau$, we first obtain the primary occurrences by querying the k-wildcard errata trie over $T'$. Secondary occurrences are obtained using the base structure.

4.4 Time Complexity

Figure 4: Sampling method for long patterns with τ=3 and phrase boundary si. The x’s in P indicate the splits considered, and the x’s in T indicate the sampled suffix positions used.

In the case of long patterns, we consider $O(m/\tau)$ different splits of the pattern. For a split, locating all of the loci in $\mathcal E^R_k$ and $\mathcal E^F_k$ requires $O(2^h(t_Q+\log\log n))$ time. Two key observations of our sampling method are that (i) every primary occurrence has a split that aligns with a sampled position, and (ii) each primary occurrence will be reported at most twice. See Figure 4. This follows from the splits considered being τ indexes apart in P, making it so that at most two will be within the same τ-sized window around a phrase boundary. For a split at x and the jth pair of loci, one in $\mathcal E^R_k$ and one in $\mathcal E^F_k$, let $\mathsf{occ}^x_j$ be the number of 2D points reported for a range query. Combining the above two facts, for a particular split x, the time complexity for range queries is $\sum_j(1+\mathsf{occ}^x_j)\log^\varepsilon n\le(2^h+\mathsf{occ}^x)\log^\varepsilon n$. Summing over all $O(m/\tau)$ splits, the time complexity for all range queries is $O((2^h m/\tau+\mathsf{occ}_p)\log^\varepsilon n)$. Finally, using the base structure to obtain the secondary occurrences once again requires $O(\mathsf{occ}_p\log\mathsf{occ}_p+\mathsf{occ}\cdot t_Q)$ time. We conclude that the total time taken for long patterns is the time for locating all segments of P in addition to

$O\left(\frac{m}{\tau}2^h(t_Q+\log\log n+\log^\varepsilon n)+\mathsf{occ}_p(\log\mathsf{occ}_p+\log^\varepsilon n)+\mathsf{occ}\cdot t_Q\right).$

For short patterns, each locus is found in $O(\log\log n)$ time. The resulting time complexity for finding all occurrences is the time for locating all segments of P in addition to $O(2^h\log\log n+\mathsf{occ}_p\log\mathsf{occ}_p+\mathsf{occ}\cdot t_Q)$.

Lemma 6.

For Problem 1, there exists an $S_{\mathsf{DS}}+O\left(s\tau c^k\left(\frac{(\log s\tau)^k}{k!}+1\right)\right)$ space data structure, where s is the size of the text’s string attractor, $c>1$ is a fixed constant, and $\tau\ge1$ is a parameter. The query time, in addition to the time for the initial pattern search, is $O\left(\left(1+\frac{m}{\tau}\right)2^h(t_Q+\log^\varepsilon n)+\mathsf{occ}_p(\log\mathsf{occ}_p+\log^\varepsilon n)+\mathsf{occ}\cdot t_Q\right)$.

5 Trade-Offs for Time and Space

We now look at some possible trade-offs based on the base structure and the choice of τ.

5.1 Fully Functional Suffix Trees in Repetition-Aware Space

We first consider using the fully functional compressed suffix trees of [14] for the text and its reverse. The space is $O(r\log\frac{n}{r}+\bar r\log\frac{n}{\bar r})$, where r and $\bar r$ are the number of runs in BWT(T) and BWT($T^R$), respectively. Setting $\tau=2^k$, the additional space complexity becomes

$s\tau c^k\left(\frac{\log^k(s\tau)}{k!}+1\right)\le s(2c)^k\left(\frac{(\log s+k)^k}{k!}+1\right)\le sC^k\left(\frac{\log^k s}{k!}+1\right).$

Here $C>c$ is a large enough constant. For query times, we have $t_Q=O(\log n)$. The suffix ranges of all pattern segments can be found in O(m) time, and one occurrence of each segment is reported in $O(\log n)$ time. This gives an $O\left(r\log\frac{n}{r}+\bar r\log\frac{n}{\bar r}+sc^k\left(\frac{(\log s)^k}{k!}+1\right)\right)$ space index with query time $O((m+2^h+\mathsf{occ})\log n)$. We improve this result next.

5.2 The r-Index and Optimal LCE Structure (Completing Theorem 1)

Here we replace the fully functional suffix trees in the base structure with two r-indexes [14] (one for T and one for $T^R$), which brings down the space to $O(r+\bar r)$. Although the r-index supports only limited functionality, we will see that it suffices for our purposes. We additionally maintain the δ-SA [24] for T and $T^R$ in $O(\delta\log(n/\delta))$ space, for supporting LCE queries in $O(\log n)$ time. In what follows, we examine the steps where the base structure is queried and observe how our new base structure suffices.

  • For each segment $P_i$ of the pattern, we need to find an occurrence (a text position of $P_i$ and its inverse suffix array position). This can be done with the O(r) space r-index. Moreover, the backward search procedure utilized by the r-index provides us with the corresponding values (i.e., a position in the text and its ISA) for every suffix $P_i[j..]$ of the pattern segment; we refer to the toehold lemma in [14], also see [42]. The same can be accomplished for the reverse of the pattern segment with the r-index for the reversed text. The total time for this step is $O(m\log\log n)$.

  • For rooted and unrooted LCP queries, we use the ISA value of the current suffix (or prefix) of a pattern segment as well as LCE queries. As observed above, the needed ISA values of suffixes and prefixes of pattern segments can be obtained while performing the backward search procedure. For LCE queries, we use the structure of Kempa et al. [24].

  • The next step utilizing the base structure is finding the secondary occurrences from the primary occurrences. The first step in our previous algorithm was to find a subset of primary occurrences such that, for each distinct substring of T matching P, we have exactly one occurrence in that set. Previously, this procedure required ISA queries on arbitrary positions, which cannot be supported by the r-index. Therefore, we follow an alternative procedure, which relies on the fact that the r-index can support the operations $\phi(p)=SA[ISA[p]-1]$, $\phi^{-1}(p)=SA[ISA[p]+1]$, $LCE(p,\phi^{-1}(p))$, and $LCE(p,\phi(p))$ for any given p in $O(\log\log n)$ time [14]. Therefore, by applying ϕ iteratively, we can compute $SA[ISA[p]-1],SA[ISA[p]-2],\ldots,SA[ISA[p]-t]$ for any t in $O(t\log\log n)$ time. Similarly, by iteratively applying $\phi^{-1}$, we can find $SA[ISA[p]+1],SA[ISA[p]+2],\ldots,SA[ISA[p]+t]$ for any t in $O(t\log\log n)$ time.

    We now describe the procedure. Let $\mathsf{OCC}_p$ be the non-reduced set of primary occurrences. Also, let $\mathsf{OCC}$ be a set, initially empty; we maintain its elements in the form of a balanced binary search tree [2], supporting membership queries in logarithmic time. We process each $p\in\mathsf{OCC}_p$ in any order as follows. If $p\in\mathsf{OCC}$, then skip p; otherwise,

    1. Initialize $p'=\phi(p)$. While $LCE(\phi^{-1}(p'),p')\ge m$, add $p'$ to $\mathsf{OCC}$ and set $p'\leftarrow\phi(p')$.

    2. Initialize $p'=\phi^{-1}(p)$. While $LCE(\phi(p'),p')\ge m$, add $p'$ to $\mathsf{OCC}$ and set $p'\leftarrow\phi^{-1}(p')$.

    Lastly, we output the set 𝖮𝖢𝖢.
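A compact sketch of this expansion, assuming callables phi, phi_inv, and lce supplied by the base structure (stubs returning None at the suffix array boundaries; an ordinary Python set stands in for the balanced BST, and each primary occurrence is also added to the output):

```python
def expand(phi, phi_inv, lce, OCC_p, m):
    # OCC_p: (non-reduced) primary occurrences; every position added to OCC
    # has LCE >= m with an adjacent suffix that is already a known occurrence.
    OCC = set()
    for p in OCC_p:
        if p in OCC:
            continue  # already reported while expanding an earlier p
        OCC.add(p)
        q = phi(p)      # step 1: walk toward lexicographically smaller suffixes
        while q is not None and lce(phi_inv(q), q) >= m:
            OCC.add(q)
            q = phi(q)
        q = phi_inv(p)  # step 2: walk toward lexicographically larger suffixes
        while q is not None and lce(phi(q), q) >= m:
            OCC.add(q)
            q = phi_inv(q)
    return OCC
```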

The total query time is $O((m+2^h)\log n+\mathsf{occ}_p\log^\varepsilon n+\mathsf{occ}(\log\log n+\log\mathsf{occ}))$. We summarize the results of this approach in Theorem 7.

Theorem 7.

There exists a solution for Problem 1 with space

$O\left(r+\bar r+\delta\log(n/\delta)+sc^k\left(1+\frac{\log^k s}{k!}\right)\right)$ words,

and query time $O((m+2^h)\log n+\mathsf{occ}(\log^\varepsilon n+\log\mathsf{occ}))\subseteq O((m+2^h+\mathsf{occ})\log n)$, where δ is the substring complexity, s is the size of a known string attractor, r (resp., $\bar r$) denotes the number of runs in the BWT of T (resp., $T^R$), and $c>1$ is a fixed constant.

By replacing s with $z=O(\delta\log(n/\delta))$ [43] and $r+\bar r$ with $O(\delta\log\delta\log(n/\delta))$ [23] in the space complexity in Theorem 7, we obtain the result in Theorem 1 for $k\ge1$. Recall that for $k=0$, an $O(\delta\log(n/\delta))$ space index with query time $O(m+(1+\mathsf{occ})\log^\epsilon n)$ is already known [26]. By combining both cases, we obtain the result in Theorem 1.

5.3 Matching Cole et al.’s Query Time (Completing Theorem 2)

If we are willing to utilize more space, an improved query time is possible. To that end, for the base structure, we use text indexes that support all queries in O(1) time, just like (uncompressed) suffix trees, but in space just $O(n\log\sigma\log_\sigma^\epsilon n)$ bits, or, equivalently, $O(n/(\log_\sigma n)^{1-\epsilon})$ words [17, 47]. We also maintain the version of the FM-index in [4] of space $O(n\log\sigma)$ bits, which can be utilized to find the suffix ranges of all segments of P in O(m) time. For 2D range queries, we use the super-linear space structure of Chan et al. [9]. Over a set of N points, it occupies $O(N\log^\varepsilon N)$ space and answers queries in $O(\log\log N+t)$ time, where t is the output size. We set $\tau=2^k\log\log n$. Note that since $k=O(\log n)$,

$\log\log N=\log\log\left(\frac{s\tau(\log s\tau+k)^k}{k!}\right)=O(\log\log n).$

The space complexity of our structure becomes $O\left(\frac{n}{(\log_\sigma n)^{1-\epsilon}}+s\tau c^k\left(\frac{(\log\tau s)^{k+\varepsilon}}{k!}+1\right)\right)$ for some constant c. We again substitute $s=z=O(\delta\log(n/\delta))$ and simplify to get the space bound shown in Theorem 2.

We now analyze the query complexity. The first step of finding an occurrence of each segment of the pattern can be done in O(m) total time. Regarding long pattern queries, for a particular split x, the time is bounded by $\sum_j(\log\log n+\mathsf{occ}^x_j)=2^h\log\log n+\mathsf{occ}^x$, where $\mathsf{occ}^x_j$ and $\mathsf{occ}^x$ are defined as in Section 4.4. Summing over all $O(m/\tau)$ splits, we can bound the time complexity of computing the primary occurrences by $O((2^h m\log\log n)/\tau+\mathsf{occ}_p)\subseteq O(m+\mathsf{occ}_p)$. To further obtain the secondary occurrences, we follow the same procedure as in Section 5.2; however, we use a bit string B[1..n] to support constant-time membership queries on the set $\mathsf{OCC}$ (i.e., $B[i]=1$ iff $i\in\mathsf{OCC}$). Thus, the overall query complexity for long patterns is $O(m+\mathsf{occ})$. For short pattern queries, the time complexity is $O(m+2^h\log\log n+\mathsf{occ})$. This completes the proof of the first part of Theorem 2.

5.3.1 Achieving $O(m+\mathsf{occ})$ Query Time

Our goal in this subsection is to achieve $O(m+\mathsf{occ})$ query time. When $m\ge\tau=2^k\log\log n$ or $\mathsf{occ}\ge\tau$, this has already been accomplished using Theorem 2. Therefore, we consider designing an auxiliary structure for queries with $m,\mathsf{occ}<\tau$. We first take all substrings of length τ containing a phrase boundary. There are at most $O(s\tau)$ such substrings. For each distinct substring constructed above, we consider all possible ways of substituting up to k of its characters with ⋆ (a special symbol). For a given string, there are at most $\sum_{h=1}^{k}\binom{\tau}{h}\le k\binom{\tau}{k}$ ways to do this, where we used that $k\le\tau/2$. Thus, the total number of strings generated in this way is bounded by

$s\tau\cdot k\binom{\tau}{k}\le\frac{sk\tau^{k+1}}{k!}=\frac{sk\,2^{k(k+1)}(\log\log n)^{k+1}}{k!}=O\left(\frac{sc^{k^2}(\log\log n)^{k+1}}{k!}\right)$

for some constant c. We remove duplicate strings and build a compact trie $\mathcal T$ over the resulting set. We associate with each node u in $\mathcal T$ the first τ positions in T matching str(u) (treating ⋆ as a wildcard). The total additional space becomes $O\left(\frac{sc^{k^2}(\log\log n)^{k+2}}{k!}\right)$ for some constant c. Finally, we fix $s=z$, where $z=O(\delta\log(n/\delta))$ as earlier.
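A sketch of the variant generation (the ASCII symbol '*' plays the role of ⋆; for $k\le\tau/2$ the count matches the bound above):

```python
from itertools import combinations

def substituted_variants(s: str, k: int) -> set[str]:
    # All ways of replacing up to k positions of s with the special symbol '*'.
    out = {s}
    for h in range(1, k + 1):
        for positions in combinations(range(len(s)), h):
            t = list(s)
            for p in positions:
                t[p] = '*'
            out.add(''.join(t))
    return out

print(sorted(substituted_variants("aba", 1)))  # ['*ba', 'a*a', 'ab*', 'aba']
```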

Given a query pattern P, we replace every # with ⋆ and find its locus in $\mathcal T$. Then, we report all associated occurrences if the number of associated occurrences is less than τ. This takes $O(m+\mathsf{occ})$ time. If the number of associated occurrences is greater than or equal to τ, we use the previous data structure.

This completes the second part of Theorem 2.

5.4 A Simple, Space Optimal Solution

The suffix tree of T can be used to answer the query in $O(m\sigma^h+\mathsf{occ})$ time for any h, by exhaustively exploring all possible substitutions of each wildcard with actual characters. This can be improved to $O(m+\sigma^h\log n+\mathsf{occ})$ by leveraging the fact that an unrooted LCP query on the suffix tree can be resolved via binary search, requiring $O(\log n)$ basic operations (i.e., SA, ISA, and LCE queries). The query procedure can be simulated on the $O(r\log(n/r))$ space repetition-aware suffix tree [14] in time $O(m+(\sigma^h\log n+\mathsf{occ})\log(n/r))$. Alternatively, we can use the δ-suffix array [24] of T and the structure in [26] for the initial pattern search. The combined space is $O(\delta\log(n/\delta))$ words and the query time is $O(m+(\sigma^h\log n+\mathsf{occ})\log^{4+\epsilon}n)$.
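A sketch of the $\sigma^h$ enumeration over a plain suffix array, with one binary search per concrete pattern (materializing all suffixes is for illustration only, and the sentinel assumes text characters below '\x7f'):

```python
import bisect
from itertools import product

def wildcard_occurrences(T, SA, P, alphabet):
    # Substitute every '#' with each character (sigma^h combinations) and
    # locate each concrete pattern via binary search over the suffix array.
    holes = [i for i, ch in enumerate(P) if ch == '#']
    suffixes = [T[i:] for i in SA]
    occs = set()
    for combo in product(alphabet, repeat=len(holes)):
        q = list(P)
        for pos, ch in zip(holes, combo):
            q[pos] = ch
        q = ''.join(q)
        lo = bisect.bisect_left(suffixes, q)
        hi = bisect.bisect_right(suffixes, q + '\x7f')
        occs.update(SA[lo:hi])  # all suffixes with prefix q
    return occs

T = "abababbabab"
SA = sorted(range(len(T)), key=lambda i: T[i:])
print(sorted(wildcard_occurrences(T, SA, "ba#", "ab")))  # [1, 3, 6, 8]
```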

6 Indexing for Pattern Matching with Gaps

We now consider a generalization of Problem 1, where the query consists of a pattern P with h gaps and we want to find its occurrences in the text. A maximal contiguous sequence of wildcards is defined as a single gap, and its size is the number of wildcards. The maximum number of gaps allowed and the maximum size of a gap are fixed at construction. A formal definition is below.

Problem 2 (k-Gapped Indexing).

Given a text T[1..n] and integers k and G, design a data structure that supports the following query efficiently.

  • Input: Strings $P_0,P_1,\ldots,P_h$ of total length m and integers $g_1,g_2,\ldots,g_h$, where $h\le k$ and $g_i\in[1,G]$ for all $i\in[1,h]$.

  • Output: The set of occurrences of the gapped pattern $P_0\#^{g_1}P_1\#^{g_2}\cdots\#^{g_h}P_h$ in T.

A solution by Lewenstein [30] offers $O(nG^{2k}\log^k n)$ space and $O(m+2^h\log\log n+\mathsf{occ})$ query time. To obtain a repetition-aware solution, one could replace each gap $g_i$ with $g_i$ wildcard characters and apply our solution from Section 3. However, this would require a structure with a space complexity exponential in kG. On a related note, see [5, 15, 19, 20] for some interesting solutions to the special case where $k=1$.

Our Data Structure.

We start with the sparse suffix trees for the level 0 tries of $\mathcal E^F_k$ and $\mathcal E^R_k$, obtained in the same way as in Section 3.3. To construct the level 1 tries of $\mathcal E^F_k$, we consider a particular explicit node u in the level 0 trie. For each $g\in\{1,\ldots,G\}$, consider every light node v that (i) branches off the heavy path containing u and (ii) has depth satisfying $strlen(u)\le strlen(v)\le strlen(u)+g$. For every substring represented by a leaf in the subtree of such a node v, remove its first $strlen(u)+g$ characters. These are combined into a single level 1 trie for u and g. The same procedure is performed recursively until all k levels of $\mathcal E^F_k$ are constructed. The same procedure is also used to construct $\mathcal E^R_k$ from its level 0 trie. Again, a linear-space 2D range query structure is constructed based on original positions, as was done in Section 3.3.

Querying Algorithm and Time Complexity.

To answer a query $P=P_0\#^{g_1}P_1\#^{g_2}\cdots\#^{g_h}P_h$, we try each split within each segment one by one, i.e., splits of the form $(P_0\#^{g_1}P_1\#^{g_2}\cdots\#^{g_j}P_j[..x))^R$ and $P_j[x..]\#^{g_{j+1}}\cdots\#^{g_h}P_h$. When a gap $g_i$ is reached in either a trie of $\mathcal E^R_k$ or $\mathcal E^F_k$, if the locus is on an edge (u,v), we subtract the remaining edge label length, say l, from the gap size. We then continue from node v, taking both the heavy path within the same trie and the trie associated with v for the gap $g_i-l$. The query time complexity remains the same as in the k-wildcard problem, that is, $O((m+2^h+\mathsf{occ})\log n)$.

Space Analysis.

To analyze the space required by this structure, consider that in our solution to Problem 1, a factor $2^k$ arose in the analysis due to a potential bifurcation at every node. Instead, we now take at most G associated side tries for each node, causing us to replace that $2^k$ with $G^k$. Following a similar analysis to Lemma 4, we arrive at a structure of size $O(s(cG)^k((\log s)^k/k!+1))$ for some constant c, in addition to the size of the base structure. The key advantage of this space complexity is that it is no longer exponential in G. Using the same base structure from Section 5.2 and $s=z$, we achieve Theorem 8.

Theorem 8.

There exists a solution for Problem 2 with space

$O\left(\delta\log(n/\delta)(cG)^k\left(1+\frac{\log^k(\delta\log n)}{k!}\right)\right)$ words and query time $O((m+2^h+\mathsf{occ})\log n)$.

Here δ is the substring complexity of the text and c>1 is a fixed constant.

References

  • [1] Paniz Abedin, Oliver A. Chubet, Daniel Gibney, and Sharma V. Thankachan. Contextual pattern matching in less space. In Data Compression Conference, DCC 2023, Snowbird, UT, USA, March 21-24, 2023, pages 160–167. IEEE, 2023. doi:10.1109/DCC55655.2023.00024.
  • [2] Georgii M Adel’son-Vel’skii. An algorithm for the organization of information. Soviet Math., 3:1259–1263, 1962.
  • [3] Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, and Mathieu Raffinot. Composite repetition-aware data structures. In Combinatorial Pattern Matching - 26th Annual Symposium, CPM 2015, Ischia Island, Italy, June 29 - July 1, 2015, Proceedings, volume 9133 of Lecture Notes in Computer Science, pages 26–39. Springer, 2015. doi:10.1007/978-3-319-19929-0_3.
  • [4] Djamal Belazzougui and Gonzalo Navarro. Alphabet-independent compressed text indexing. ACM Trans. Algorithms, 10(4):23:1–23:19, 2014. doi:10.1145/2635816.
  • [5] Philip Bille, Inge Li Gørtz, Moshe Lewenstein, Solon P. Pissis, Eva Rotenberg, and Teresa Anna Steiner. Gapped string indexing in subquadratic space and sublinear query time. In 41st International Symposium on Theoretical Aspects of Computer Science, STACS 2024, volume 289 of LIPIcs, pages 16:1–16:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPICS.STACS.2024.16.
  • [6] Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind. String indexing for patterns with wildcards. Theory Comput. Syst., 55(1):41–60, 2014. doi:10.1007/S00224-013-9498-4.
  • [7] Michael Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994. URL: https://api.semanticscholar.org/CorpusID:2167441.
  • [8] Timothy M. Chan, Kasper Green Larsen, and Mihai Pătraşcu. Orthogonal range searching on the RAM, revisited. In Proceedings of the 27th ACM Symposium on Computational Geometry, Paris, France, June 13-15, 2011, pages 1–10. ACM, 2011. doi:10.1145/1998196.1998198.
  • [9] Timothy M. Chan and Konstantinos Tsakalidis. Dynamic orthogonal range searching on the RAM, revisited. J. Comput. Geom., 9(2):45–66, 2018. doi:10.20382/JOCG.V9I2A5.
  • [10] Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don’t cares. In László Babai, editor, Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 91–100. ACM, 2004. doi:10.1145/1007352.1007374.
  • [11] Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS ’97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137–143. IEEE Computer Society, 1997. doi:10.1109/SFCS.1997.646102.
  • [12] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, 12-14 November 2000, Redondo Beach, California, USA, pages 390–398. IEEE Computer Society, 2000. doi:10.1109/SFCS.2000.892127.
  • [13] Travis Gagie, Pawel Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi. LZ77-based self-indexing with faster pattern matching. In LATIN 2014: Theoretical Informatics - 11th Latin American Symposium, Montevideo, Uruguay, March 31 - April 4, 2014. Proceedings, volume 8392 of Lecture Notes in Computer Science, pages 731–742. Springer, 2014. doi:10.1007/978-3-642-54423-1_63.
  • [14] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):2:1–2:54, 2020. doi:10.1145/3375890.
  • [15] Arnab Ganguly, Daniel Gibney, Paul Macnichol, and Sharma V. Thankachan. Bounded-ratio gapped string indexing. In String Processing and Information Retrieval - 31st International Symposium, SPIRE 2024, Puerto Vallarta, Mexico, September 23-25, 2024, Proceedings, volume 14899 of Lecture Notes in Computer Science, pages 118–126. Springer, 2024. doi:10.1007/978-3-031-72200-4_9.
  • [16] Daniel Gibney, Paul Macnichol, and Sharma V. Thankachan. Non-overlapping indexing in BWT-runs bounded space. In String Processing and Information Retrieval - 30th International Symposium, SPIRE 2023, Pisa, Italy, September 26-28, 2023, Proceedings, volume 14240 of Lecture Notes in Computer Science, pages 260–270. Springer, 2023. doi:10.1007/978-3-031-43980-3_21.
  • [17] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput., 35(2):378–407, 2005. doi:10.1137/S0097539702402354.
  • [18] Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J. Comput., 13(2):338–355, 1984. doi:10.1137/0213024.
  • [19] Md Helal Hossen, Daniel Gibney, and Sharma V Thankachan. Text indexing for faster gapped pattern matching. Algorithms, 17(12), 2024.
  • [20] Costas S. Iliopoulos and M. Sohel Rahman. Indexing factors with gaps. Algorithmica, 55(1):60–70, 2009. doi:10.1007/S00453-007-9141-3.
  • [21] Juha Kärkkäinen and Esko Ukkonen. Lempel-ziv parsing and sublinear-size index structures for string matching. In Proc. 3rd South American Workshop on String Processing (WSP), pages 141–155, 1996.
  • [22] Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 756–767. ACM, 2019. doi:10.1145/3313276.3316368.
  • [23] Dominik Kempa and Tomasz Kociumaka. Resolution of the burrows-wheeler transform conjecture. Commun. ACM, 65(6):91–98, 2022. doi:10.1145/3531445.
  • [24] Dominik Kempa and Tomasz Kociumaka. Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023, pages 1877–1886. IEEE, 2023. doi:10.1109/FOCS57990.2023.00114.
  • [25] Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 827–840. ACM, 2018. doi:10.1145/3188745.3188814.
  • [26] Tomasz Kociumaka, Gonzalo Navarro, and Francisco Olivares. Near-optimal search time in δ-optimal space, and vice versa. Algorithmica, 86(4):1031–1056, 2024. doi:10.1007/S00453-023-01186-0.
  • [27] Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Towards a definitive measure of repetitiveness. In LATIN 2020: Theoretical Informatics - 14th Latin American Symposium, São Paulo, Brazil, January 5-8, 2021, Proceedings, volume 12118 of Lecture Notes in Computer Science, pages 207–219. Springer, 2020. doi:10.1007/978-3-030-61792-9_17.
  • [28] Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repetitive sequences. IEEE Trans. Inf. Theory, 69(4):2074–2092, 2023. doi:10.1109/TIT.2022.3224382.
  • [29] Sebastian Kreft and Gonzalo Navarro. Self-indexing based on LZ77. In Combinatorial Pattern Matching - 22nd Annual Symposium, CPM 2011, Palermo, Italy, June 27-29, 2011. Proceedings, volume 6661 of Lecture Notes in Computer Science, pages 41–54. Springer, 2011. doi:10.1007/978-3-642-21458-5_6.
  • [30] Moshe Lewenstein. Indexing with gaps. In String Processing and Information Retrieval, 18th International Symposium, SPIRE 2011, Pisa, Italy, October 17-21, 2011. Proceedings, volume 7024 of Lecture Notes in Computer Science, pages 135–143. Springer, 2011. doi:10.1007/978-3-642-24583-1_14.
  • [31] Moshe Lewenstein, J. Ian Munro, Yakov Nekrich, and Sharma V. Thankachan. Document retrieval with one wildcard. In Mathematical Foundations of Computer Science 2014 - 39th International Symposium, MFCS 2014, Budapest, Hungary, August 25-29, 2014. Proceedings, Part II, volume 8635 of Lecture Notes in Computer Science, pages 529–540. Springer, 2014. doi:10.1007/978-3-662-44465-8_45.
  • [32] Moshe Lewenstein, J. Ian Munro, Venkatesh Raman, and Sharma V. Thankachan. Less space: Indexing for queries with wildcards. Theor. Comput. Sci., 557:120–127, 2014. doi:10.1016/J.TCS.2014.09.003.
  • [33] Moshe Lewenstein, Yakov Nekrich, and Jeffrey Scott Vitter. Space-efficient string indexing for wildcard pattern matching. In 31st International Symposium on Theoretical Aspects of Computer Science (STACS 2014), STACS 2014, March 5-8, 2014, Lyon, France, volume 25 of LIPIcs, pages 506–517. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2014. doi:10.4230/LIPICS.STACS.2014.506.
  • [34] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinform., 25(14):1754–1760, 2009. doi:10.1093/BIOINFORMATICS/BTP324.
  • [35] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol., 17(3):281–308, 2010. doi:10.1089/CMB.2009.0169.
  • [36] César Martínez-Guardiola, Nathaniel K. Brown, Fernando Silva-Coira, Dominik Köppl, Travis Gagie, and Susana Ladra. Augmented thresholds for MONI. In Data Compression Conference, DCC 2023, Snowbird, UT, USA, March 21-24, 2023, pages 268–277. IEEE, 2023. doi:10.1109/DCC55655.2023.00035.
  • [37] Gonzalo Navarro. Indexing highly repetitive string collections, part I: repetitiveness measures. ACM Comput. Surv., 54(2):29:1–29:31, 2022. doi:10.1145/3434399.
  • [38] Gonzalo Navarro. Indexing highly repetitive string collections, part II: compressed indexes. ACM Comput. Surv., 54(2):26:1–26:32, 2022. doi:10.1145/3432999.
  • [39] Gonzalo Navarro. Computing MEMs and relatives on repetitive text collections. ACM Trans. Algorithms, 21(1):12:1–12:33, 2025. doi:10.1145/3701561.
  • [40] Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Comput. Surv., 39(1):2, 2007. doi:10.1145/1216370.1216372.
  • [41] Takaaki Nishimoto and Yasuo Tabei. Optimal-time queries on BWT-runs compressed indexes. In 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference), volume 198 of LIPIcs, pages 101:1–101:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ICALP.2021.101.
  • [42] Alberto Policriti and Nicola Prezza. LZ77 computation based on the run-length encoded BWT. Algorithmica, 80(7):1986–2011, 2018. doi:10.1007/S00453-017-0327-Z.
  • [43] Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, and Adam D. Smith. Sublinear algorithms for approximating string compressibility. Algorithmica, 65(3):685–709, 2013. doi:10.1007/S00453-012-9618-6.
  • [44] Massimiliano Rossi, Marco Oliva, Paola Bonizzoni, Ben Langmead, Travis Gagie, and Christina Boucher. Finding maximal exact matches using the r-index. J. Comput. Biol., 29(2):188–194, 2022. doi:10.1089/CMB.2021.0445.
  • [45] Massimiliano Rossi, Marco Oliva, Ben Langmead, Travis Gagie, and Christina Boucher. MONI: A pangenomic index for finding maximal exact matches. J. Comput. Biol., 29(2):169–187, 2022. doi:10.1089/CMB.2021.0290.
  • [46] Luís M. S. Russo, Gonzalo Navarro, and Arlindo L. Oliveira. Fully compressed suffix trees. ACM Trans. Algorithms, 7(4):53:1–53:34, 2011. doi:10.1145/2000807.2000821.
  • [47] Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory Comput. Syst., 41(4):589–607, 2007. doi:10.1007/S00224-006-1198-X.
  • [48] Daniel Dominic Sleator and Robert Endre Tarjan. A data structure for dynamic trees. J. Comput. Syst. Sci., 26(3):362–391, 1983. doi:10.1016/0022-0000(83)90006-5.
  • [49] James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. J. ACM, 29(4):928–951, 1982. doi:10.1145/322344.322346.
  • [50] Igor Tatarnikov, Ardavan Shahrabi Farahani, Sana Kashgouli, and Travis Gagie. MONI can find k-MEMs. In 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 26:1–26:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.CPM.2023.26.
  • [51] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1–11. IEEE Computer Society, 1973. doi:10.1109/SWAT.1973.13.
  • [52] Dan E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(N). Inf. Process. Lett., 17(2):81–84, 1983. doi:10.1016/0020-0190(83)90075-3.
  • [53] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, 23(3):337–343, 1977. doi:10.1109/TIT.1977.1055714.