Repetition Aware Text Indexing for Matching Patterns with Wildcards
Abstract
We study the problem of indexing a text $T[1..n]$ to support pattern matching with wildcards. The input of a query is a pattern $P[1..m]$ containing up to $k$ wildcard (a.k.a. don't care) characters and the output is the set of occurrences of $P$ in $T$ (i.e., starting positions of substrings of $T$ that match $P$), where $k$ is fixed at index construction. A classic solution by Cole et al. [STOC 2004] provides an index with space complexity $O(n\log^k n)$ and query time $O(m + 2^k\log\log n + \mathrm{occ})$, where $k$ is treated as a constant, and $\mathrm{occ}$ denotes the number of occurrences of $P$ in $T$. We introduce a new data structure that significantly reduces space usage for highly repetitive texts while maintaining efficient query processing. Its space (in words) and query time are as follows:
The parameter $\delta$, known as substring complexity, is a recently introduced measure of repetitiveness that serves as a unifying and lower-bounding metric for several popular measures, including the number of phrases in the LZ77 factorization (denoted by $z$) and the number of runs in the Burrows-Wheeler Transform (denoted by $r$). Moreover, $\Theta(\delta\log\frac{n}{\delta})$ represents the optimal space required to encode the data in terms of $n$ and $\delta$, helping us see how close our space is to the minimum required. In another trade-off, we match the query time of Cole et al.'s index using more space, where the bound depends on an arbitrarily small constant $\epsilon > 0$. We also demonstrate how these techniques can be applied to a more general indexing problem, where the query pattern includes up to $k$ gaps (a gap can be interpreted as a contiguous sequence of wildcard characters).
Keywords and phrases: Pattern Matching, Text Indexing, Wildcard Matching
Category: Track A: Algorithms, Complexity and Games
Funding: Mano Prakash Parthasarathi: U.S. National Science Foundation (NSF) award CCF-2316691.
Copyright and License: Sharma V. Thankachan; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification: Theory of computation → Pattern matching
Editors: Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis
1 Introduction and Related Work
Efficient indexing of string or textual data is crucial in fields such as computational biology and information retrieval, enabling fast and accurate search queries. A fundamental task in this context is to preprocess a long string $T$ (the text) into a data structure that allows efficient retrieval of occurrences of a short string $P$ (the pattern) when queried. An occurrence of $P$ refers to the starting position of a substring in $T$ that matches $P$ either exactly or approximately, depending on the matching criteria. We denote the number of occurrences as $\mathrm{occ}$. In the exact match setting, the well-known suffix tree [51] structure supports such queries in $O(m + \mathrm{occ})$ time. WLOG, we assume that the characters in $T$ are from an integer alphabet $\Sigma$ of size $\sigma \le n$.
A major drawback of suffix trees (and even closely related suffix arrays) is their space usage, which is $\Theta(n)$ words (equivalently, $\Theta(n\log n)$ bits). This can be significantly larger than the space required to store the text itself, which is $n\lceil\log\sigma\rceil$ bits. In the early 2000s, a key research challenge was designing indexes that required space closer to that of the text. This led to major breakthroughs, such as compressed suffix arrays/trees [17, 46, 47], the FM-index [12] and their refinements. See [40] for an extensive survey on this topic. These innovations enabled encodings that use close to $n\log\sigma$ bits – or even approach entropy-compressed space – while supporting suffix tree/array operations efficiently. While these findings are significant, they face major challenges when applied to many modern datasets, such as large collections of DNA sequences or versioned texts. These datasets tend to be highly repetitive, i.e., contain many long, repeated substrings. Although large in size, repetitiveness makes them highly compressible, but statistical entropy-based compressors (and data structures of that space) are often less effective at capturing such redundancy. Instead, dictionary-based compressors perform better in these scenarios [37].
Two popular dictionary-based compression methods are Lempel-Ziv (LZ77) [53] and the (run-length-encoded) Burrows-Wheeler Transform (BWT) [7], which use $O(z)$ and $O(r)$ space, respectively. Here, $z$ denotes the number of phrases in the LZ77 factorization of the text, while $r$ represents the number of runs in the text's BWT. A major recent breakthrough in this area is the introduction of two new measures of repetitiveness – the string attractor [25] and the $\delta$-measure (a.k.a. substring complexity) [27, 43] – which led to the discovery of the asymptotic relationship between several seemingly distinct measures. A string attractor of $T[1..n]$ is a subset $\Gamma$ of the text's positions such that for every substring $T[i..j]$ there exists an occurrence $T[i'..j'] = T[i..j]$ and a position $p \in \Gamma$ where $i' \le p \le j'$. In linear time, we can compute a string attractor of size $z$ or $r$; however, finding the smallest string attractor (its size is denoted by $\gamma$) is NP-hard. The $\delta$-measure is equal to $\max_{k \ge 1}\{d_k(T)/k\}$, where $d_k(T)$ is the number of distinct substrings of $T$ of length $k$, which can be easily computed in $O(n)$ time using a suffix tree. An important and unifying result in the field of text compression is that $r = O(\delta\log\delta\log\frac{n}{\delta})$ [23]; also $\delta \le \gamma \le z = O(\delta\log\frac{n}{\delta})$ [43]. We can also bound $\bar{r}$, the number of runs in the BWT of the text's reverse, by the same quantity [23], since $\delta$ (and $\gamma$) are invariant under the reversal of the text. In short, the compressibility measures mentioned are always within poly-logarithmic factors of one another, and $\delta$ lower bounds all others. It is also known that the optimal space required to encode the data in terms of $n$ and $\delta$ is $\Theta(\delta\log\frac{n}{\delta})$ words [28]. This raises an important question:
Can we index the text in (close to) optimal repetition-aware space of $O(\delta\log\frac{n}{\delta})$ words
and support various pattern matching queries efficiently?
For exact pattern matching, this question has been answered positively; there exists such an optimal space index that supports queries in $O(m + (\mathrm{occ}+1)\log^\epsilon n)$ time [26], where $\epsilon > 0$ is an arbitrarily small constant. The authors also showed how to achieve $O(m + \mathrm{occ})$ query time using a small additional factor of space. The recently introduced structure (called $\delta$-SA) encodes the suffix array and inverse suffix array in optimal space and supports random access in poly-logarithmic time [24]. Another structure encodes the suffix tree in $O(r\log\frac{n}{r})$ space and supports all basic operations in poly-logarithmic time [14]. We refer to [38] for a detailed survey on this topic. In summary, data structures for the exact pattern-matching problem have kept pace with the data in terms of space as repetitiveness has increased. However, this adaptation is still in its early stages for more advanced problems; see [1, 16, 39, 36, 45, 44, 50] for a list of problems where only limited progress has been achieved. To this end, we revisit two fundamental problems, which we define formally below.
Problem 1 ($k$-Wildcard Indexing).
Given a text $T[1..n]$ and an integer $k$, design a data structure (called an index) that supports the following query efficiently.
-
Input: A pattern $P[1..m]$ with up to $k$ wildcards. A wildcard is a character, denoted by $\phi$, that matches any character.
-
Output: The set of occurrences of $P$ in $T$ (denote the output size by $\mathrm{occ}$).
The alphabet set of $T$ is $\Sigma$ and $\sigma = |\Sigma|$. The alphabet set of $P$ is $\Sigma \cup \{\phi\}$.
This problem can be solved algorithmically in $O(n\log m)$ time for any $k$ using the Fast Fourier Transform (FFT). The first non-trivial indexing solution was given by Cole et al. [10]. Their $O(n\log^k n)$-space structure, known as the $k$-wildcard errata trie, can answer queries in time $O(m + 2^k\log\log n + \mathrm{occ})$, where $k$ is a fixed constant. There have been several attempts to generalize/improve this classic result for constant $k$, as follows.
There are also some linear (or slightly better) space indexes that can handle queries with an unbounded number of wildcards. However, they generally have a $\sigma^k$-type factor in the query time (where $k$ is the number of wildcards in the query), making them comparable to others only when the alphabet size is constant [6, 33, 31]. This includes an index achieving $O(m + \sigma^k\log\log n + \mathrm{occ})$ query time using $O(n)$ space [6]. We present the first set of repetition-aware solutions, as summarized below.
Theorem 1.
There exists a solution for Problem 1 with space
Here $\delta$ is the substring complexity and $c$ is a fixed constant.
The query time is improved in the next result, matching that of Cole et al.’s index.
Theorem 2.
There exists a solution for Problem 1 with space
Here $\delta$ is the substring complexity, $c$ is a fixed constant, and $\epsilon > 0$ is an arbitrarily small constant. The query time can be made $O(m + \mathrm{occ})$ by maintaining an additional structure (see Section 5.3.1).
We also show how our techniques can be extended to design a repetition-aware index for a closely related problem known as indexing with gaps [30].
2 Preliminaries
Notation.
For a length-$n$ string $T$, we denote the $i$-th character by $T[i]$. We use "$\cdot$" to denote concatenation between two strings. A substring of $T$ starting at position $i$ and ending at position $j$ is denoted by $T[i..j]$. If $i > j$, then $T[i..j]$ is the empty string. We also use the notation $T[..j] = T[1..j]$ and $T[i..] = T[i..n]$. We use $\overleftarrow{T}$ to denote the reverse of the string $T$. A suffix is a string of the form $T[i..n]$, which we denote as $T_i$. A suffix is called a proper suffix if $i > 1$. A prefix is a string of the form $T[1..j]$, which we denote as $T^j$. A prefix is called a proper prefix if $j < n$.
Compact Tries and Suffix Trees.
Let $\mathcal{S}$ be a set of strings over $\Sigma$. We use $\mathcal{T}(\mathcal{S})$ to denote a compact trie constructed over $\mathcal{S}$ after appending each string in $\mathcal{S}$ with a special character $\$$. The character $\$$ is lexicographically smaller than all other characters in $\Sigma$. The number of leaves in $\mathcal{T}(\mathcal{S})$ will be exactly $|\mathcal{S}|$. The leaves of $\mathcal{T}(\mathcal{S})$, denoted in left-to-right order as $\ell_1, \ell_2, \dots, \ell_{|\mathcal{S}|}$, are such that $\ell_i$ corresponds to the $i$-th string in $\mathcal{S}$ when $\mathcal{S}$ is sorted in ascending lexicographic order. For a node $u$ (either implicit or explicit; branching nodes and leaves are called explicit nodes, and positions on edges are called implicit nodes) in $\mathcal{T}(\mathcal{S})$, we use $\mathrm{str}(u)$ to denote the string obtained by concatenating edge labels on the root-to-$u$ path and string depth to denote its length.
Given a pattern $P$ and a compact trie $\mathcal{T}(\mathcal{S})$, if $\mathrm{str}(u) = P$ for some node $u$ (either implicit or explicit) in $\mathcal{T}(\mathcal{S})$, then $u$ is called the locus of $P$. Given a locus, the collection of leaves under that locus forms a contiguous range $\ell_s, \ell_{s+1}, \dots, \ell_e$, and we call $[s, e]$ the range of the locus. Observe that $P$ is a prefix of $\mathrm{str}(\ell_i)$ for all $i \in [s, e]$.
The suffix tree [51] of a string $T$, denoted as $\mathsf{ST}$, is a compact trie built over all suffixes of $T$. The suffix tree can be constructed in $O(n)$ time for polynomially-sized integer alphabets [11]. The suffix array is an array $\mathsf{SA}[1..n]$ such that $T_{\mathsf{SA}[i]}$ is the $i$-th suffix when sorted in ascending lexicographic order. The inverse suffix array is an array $\mathsf{ISA}[1..n]$ such that $\mathsf{ISA}[\mathsf{SA}[i]] = i$; equivalently, $\mathsf{ISA}[j]$ is the lexicographic rank of the suffix $T_j$. The longest common extension of two suffixes $T_i$ and $T_j$, denoted by $\mathsf{LCE}(i, j)$, is the length of their longest common prefix. $\mathsf{LCE}$ queries can be answered in $O(1)$ time using an $O(n)$ space structure [18].
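As a concrete (and deliberately naive) illustration of these definitions, the following sketch builds $\mathsf{SA}$ and $\mathsf{ISA}$ by sorting suffixes and answers $\mathsf{LCE}$ by direct comparison; the stated $O(n)$ construction [11] and $O(1)$ LCE [18] bounds require the cited machinery, and the function names here are ours.

```python
# A minimal sketch of SA / ISA / LCE, 1-based as in the text. Sorting suffix
# slices is O(n^2 log n); it is for illustration only.
def sa_isa(T):
    SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])
    ISA = [0] * (len(T) + 1)
    for rank, i in enumerate(SA, start=1):
        ISA[i] = rank          # ISA[SA[rank]] = rank
    return SA, ISA

def lce(T, i, j):
    # length of the longest common prefix of suffixes T_i and T_j
    l = 0
    while i + l <= len(T) and j + l <= len(T) and T[i + l - 1] == T[j + l - 1]:
        l += 1
    return l

T = "banana"
SA, ISA = sa_isa(T)
print(SA, ISA[1:], lce(T, 2, 4))  # suffixes "anana" and "ana" share "ana" (3)
```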
There exists a combination of compact data structures of total space $O(n\log\sigma)$ bits, equivalently $O(n\log\sigma/\log n)$ words, that supports all suffix tree functionalities with no slowdown in performance [17, 47]. There also exist indexes of space $O(n\log\sigma)$ bits that support SA and ISA queries in $O(\log^\epsilon n)$ time [17] and LCE queries in $O(1)$ time [22].
Burrows-Wheeler Transform.
The Burrows-Wheeler Transform (BWT) [7] of a text $T$, denoted by $\mathsf{BWT}[1..n]$, is a permutation of all characters in $T$, such that $\mathsf{BWT}[i]$ is the last character of the $i$-th string in the set of rotations of $T$ when sorted lexicographically. The BWT can be computed in linear time. The BWT has been a popular form of compression used in several previous text indexes [3, 34, 35], including the well-known FM-index [12]. The popularity of the BWT is due to the following favorable properties: First, the BWT groups characters sharing similar contexts, resulting in a small run-length encoded form when the text is highly repetitive. The letter $r$ denotes the number of maximal unary substrings (called runs) in the BWT. Second, the BWT is amenable to efficient pattern matching and string query procedures.
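The sketch below makes the definition and the run measure $r$ concrete; it builds the BWT from sorted rotations in quadratic time, whereas linear-time construction goes through the suffix array. The `$` end marker is an assumption of this illustration.

```python
# A minimal sketch of the BWT via sorted rotations, plus the run count r.
def bwt(T: str) -> str:
    T += "$"  # end marker, smaller than all other characters
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(rot[-1] for rot in rotations)

def run_count(s: str) -> int:
    # number of maximal unary substrings (runs)
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)

b = bwt("banana")
print(b, run_count(b))  # repetitive texts yield few runs
```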
LZ77 and String Attractor.
Lempel-Ziv (LZ77) [53] is another popular type of repetition-based compression. LZ77 partitions the string greedily from left-to-right into what are called LZ factors or phrases, $f_1, f_2, \dots, f_z$. Let $z$ denote the number of factors. The following conditions are satisfied: (i) $T = f_1\cdot f_2\cdots f_z$, and (ii) each $f_i$ is either the first occurrence of a character or is the longest substring starting at its position that has an occurrence starting at some earlier position. The positions $p_1, p_2, \dots, p_z$ where the phrases start are called phrase boundaries, and we can compute them in linear time [49]. We say an occurrence $T[i..j]$ of a string is a primary occurrence if a phrase boundary exists in the range $[i, j]$. Every occurrence not containing a phrase boundary is called a secondary occurrence. It can easily be shown that every distinct substring of $T$ has a primary occurrence [21], which makes $\{p_1, \dots, p_z\}$ a string attractor. See Figure 2 for an example.
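A minimal sketch of the greedy factorization is below, returning the phrase boundaries $p_1, \dots, p_z$; this naive scan is quadratic (linear-time constructions exist [49]), and allowing the copied occurrence to overlap the current phrase follows the classic self-referential convention.

```python
# Greedy LZ77 factorization; returns 1-based phrase boundaries.
def lz77_boundaries(T: str):
    n, i, boundaries = len(T), 0, []
    while i < n:
        boundaries.append(i + 1)          # phrase starts here
        best = 0                          # longest match with an earlier start
        for j in range(i):                # candidate earlier start j < i
            l = 0
            while i + l < n and T[j + l] == T[i + l]:
                l += 1
            best = max(best, l)
        i += best if best > 0 else 1      # new character => phrase of length 1
    return boundaries

print(lz77_boundaries("abababcabab"))  # [1, 2, 3, 7, 8], i.e., z = 5
```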
Repetition-Aware Suffix Trees/Arrays and the -index.
By indexing the text in $O(r\log\frac{n}{r})$ space, we can support SA, ISA, and LCE operations in $O(\log\frac{n}{r})$ time [14]. There also exists an optimal $O(\delta\log\frac{n}{\delta})$ space index (called $\delta$-SA) that supports SA, ISA, and LCE operations in poly-logarithmic time [24]. The $r$-index is another important structure of space $O(r)$ words [14]; see [41] for its refinements. It supports efficient pattern matching, but it cannot support any of the basic operations mentioned earlier. Fortunately, the limited operations supported by the $r$-index are sufficient for our purpose.
Orthogonal Range Queries.
The task here is to preprocess a set of $2$-dimensional points so that, for a query rectangle, the output is the subset of points lying in it. Over a set of $N$ $2$-dimensional points on an $N\times N$ grid, we can maintain an $O(N)$ space structure and support queries in $O((1 + occ)\log^\epsilon N)$ time [8], where $\epsilon > 0$ is an arbitrarily small constant and $occ$ denotes the output size, (or) an $O(N\log^\epsilon N)$ space structure and support queries in $O(\log\log N + occ)$ time [9].
A standard technique in compressed text indexing, and one that we expand on here, is the use of orthogonal range queries to report primary occurrences [13, 29, 38]. For a simple example, assume that we have the phrases $f_1, \dots, f_z$ with boundaries $p_1, \dots, p_z$. We build a compact trie over the suffixes $T[p_1..], T[p_2..], \dots, T[p_z..]$, maintaining for each leaf its respective boundary value. We also build a compact trie over the reversed prefixes $\overleftarrow{T[..p_1-1]}, \overleftarrow{T[..p_2-1]}, \dots, \overleftarrow{T[..p_z-1]}$, maintaining for each leaf its respective boundary value. Next, for each $i \in [1, z]$, we create a 2D point $(x_i, y_i)$, where $x_i$ is the leaf index in the first trie corresponding to $T[p_i..]$ and $y_i$ is the leaf index in the second trie corresponding to $\overleftarrow{T[..p_i-1]}$, and associate the value $p_i$ with $(x_i, y_i)$. Construct a 2D range query structure over this set of points.
Given a query pattern $P[1..m]$, for each $j \in [0, m-1]$, we find the locus of $P[j+1..m]$ in the first trie; let its range be $[s_1, e_1]$. We also find the locus of $\overleftarrow{P[1..j]}$ in the second trie; let its range be $[s_2, e_2]$. Observe that the query $[s_1, e_1]\times[s_2, e_2]$ returns points for phrase boundaries aligning with position $j+1$ in a primary occurrence of $P$. See Figure 2.
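A brute-force sketch of this split-and-intersect idea (for exact patterns, as in this example) follows: each phrase boundary plays the role of one 2D point, and the two prefix/suffix tests stand in for the locus ranges and the rectangle query; it is meant only to make the mechanics concrete.

```python
# Report primary occurrences of P in T by testing, for every split of P and
# every boundary p, whether the two halves of P align around p.
def primary_occurrences(T, boundaries, P):
    out = set()
    for j in range(len(P) + 1):                 # split P into P[:j] and P[j:]
        for p in boundaries:                    # 1-based boundary p
            left, right = T[:p - 1], T[p - 1:]
            if right.startswith(P[j:]) and left.endswith(P[:j]) and p - j >= 1:
                out.add(p - j)                  # occurrence starts at p - j
    return sorted(out)

# Occurrences of "ab" in "abababcabab" start at 1,3,5,8,10; position 10 is
# secondary (no boundary inside), the rest are primary.
print(primary_occurrences("abababcabab", [1, 2, 3, 7, 8], "ab"))  # [1, 3, 5, 8]
```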
Heavy Path Decomposition.
For a general tree $\mathcal{T}$ and node $u$ in $\mathcal{T}$, we denote the subtree rooted at $u$ as $\mathcal{T}_u$. The size of $\mathcal{T}_u$, denoted as $\mathrm{size}(u)$, is defined as the number of leaves in $\mathcal{T}_u$. We next describe heavy path decomposition [48]. Given any tree $\mathcal{T}$, we categorize its nodes into light and heavy: we take the root as a light node. Exactly one child of every (non-leaf) node is heavy, specifically the one with maximum subtree size. When there is a tie, we pick the leftmost child as heavy. Any path starting from a light node, recursively following its heavy child, and ending at a leaf node is a heavy path. Note that each node belongs to precisely one heavy path.
Property 1.
Each root-to-leaf path in a tree of size $s$ contains at most $O(\log s)$ light nodes.
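A minimal sketch of the decomposition on a toy rooted tree follows; the tie-breaking toward the leftmost child matches the convention above, and one heavy path is started per light node, consistent with Property 1.

```python
# Heavy path decomposition on a tree given as a child-list dict.
def heavy_paths(children, root):
    leaves = {}
    def count(u):  # number of leaves below u (size(u))
        leaves[u] = sum(count(c) for c in children.get(u, [])) or 1
        return leaves[u]
    count(root)
    paths, light = [], [root]           # every light node starts a heavy path
    while light:
        u = light.pop()
        path = [u]
        while children.get(u):
            # max() returns the first maximum, i.e., the leftmost on ties
            heavy = max(children[u], key=lambda c: leaves[c])
            light += [c for c in children[u] if c is not heavy]
            path.append(heavy)
            u = heavy
        paths.append(path)
    return paths

tree = {0: [1, 2], 1: [3, 4], 2: [5]}
print(heavy_paths(tree, 0))  # e.g., [[0, 1, 3], [2, 5], [4]]
```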
3 Our Solution
As the first building block, we maintain a base structure that supports suffix array, inverse suffix array, and LCE queries for the text and its reverse. This can be suffix trees in some (compressed) form. We use $s_b$ to denote the space of the base structure and $t_b$ to denote its query time. We perform the analysis with a general base structure and specify specific trade-offs in Section 5. Our approach then builds two versions of Cole et al.'s $k$-wildcard errata structure [10] over a subset of substrings based on a parse of the text. Given a query pattern $P$, we first find the primary occurrences of $P$. This is done using a 2D-range query structure constructed over a set of points created with the leaves of both $k$-wildcard errata structures. The secondary occurrences are obtained using the base structure. We start with a basic subroutine used throughout.
3.1 LCP Queries
We consider a compact trie $\mathcal{T}$ built from a subset of substrings of $T$. With every leaf of $\mathcal{T}$, we associate a starting text position of the corresponding substring. We seek to answer queries of the following form: Given positions $i, j$ and a node $u$ (either implicit or explicit) in $\mathcal{T}$, find the locus of $\mathrm{str}(u)\cdot T[i..j']$ in $\mathcal{T}$, where $T[i..j']$ is the longest prefix of $T[i..j]$ for which such a locus exists. We call this type of query an unrooted LCP query. If $u$ is restricted to always be the root of $\mathcal{T}$, we call this a rooted LCP query.
Lemma 3.
Assuming the availability of our base structure of space $s_b$, which supports basic suffix tree operations ($\mathsf{SA}$, $\mathsf{ISA}$, and $\mathsf{LCE}$ queries) in $t_b$ time, we have the following results. An unrooted LCP query on a compact trie constructed from $d$ substrings of $T$ can be answered in $O(t_b + \log\log n)$ time. The (additional) space used by the data structure is $O(d\log d)$. Moreover, for rooted LCP queries, the (additional) space can be made $O(d)$ with the same query time complexity.
Proof.
As preprocessing for unrooted LCP queries, we do the following. Let $\rho$ be the root of $\mathcal{T}$. For every light node $v$ in $\mathcal{T}$, we consider all substrings corresponding to the leaves in its subtree, that is, strings obtained by concatenating edge labels on $\rho$-to-leaf paths for leaves in $\mathcal{T}_v$. For each light node $v$, we create a y-fast trie [52] over the $\mathsf{ISA}$ values of the text positions of these strings, which supports predecessor search in $O(\log\log n)$ time. Note that by Property 1 of the heavy path decomposition, the number of values over all light nodes, as well as the total space, is $O(d\log d)$.
Given a query consisting of positions $i, j$ (specifying an arbitrary substring $T[i..j]$) and a locus $u$ in $\mathcal{T}$, we first perform an $\mathsf{LCE}$ query between $T[i..j]$ and the suffix starting at the text position of the leaf on the heavy path containing $u$. If the entirety of $T[i..j]$ is matched along this heavy path, or if the matching ends prematurely at an implicit node of the heavy path, we are done. Otherwise, we obtain a node $w$ on the heavy path from which we have to explore $w$'s light children. We next obtain the child of $w$ that needs to be traversed; note that it must be a light child of $w$. We query the base structure to obtain the $\mathsf{ISA}$ value of the remaining portion of the query substring. We then perform a predecessor search for this value over the values stored for that light child. The result of the predecessor search gives us at most two leaves, corresponding to the predecessor and successor in terms of $\mathsf{ISA}$ value. We then perform $\mathsf{LCE}$ queries with these neighboring leaves and obtain the result.
For rooted LCP queries, we apply the same preprocessing technique, but only at the root of $\mathcal{T}$. Queries are answered in the same way as unrooted LCP queries.
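The sketch below mimics the rooted case on a plain sorted list of strings: `bisect` plays the role of the predecessor search over ISA values, and the direct character comparisons with the (at most two) neighbors play the role of the LCE queries.

```python
# Rooted LCP query sketch: the longest prefix of q that occurs as a prefix of
# some indexed string equals the longer of its LCPs with the lexicographic
# predecessor and successor of q.
import bisect

def rooted_lcp(sorted_strs, q):
    pos = bisect.bisect_left(sorted_strs, q)         # predecessor/successor
    best = 0
    for s in sorted_strs[max(0, pos - 1):pos + 1]:   # at most two neighbors
        l = 0
        while l < min(len(s), len(q)) and s[l] == q[l]:
            l += 1
        best = max(best, l)
    return q[:best]      # longest prefix of q with a locus in the trie

strs = sorted(["abab", "abc", "banana", "band"])
print(rooted_lcp(strs, "abda"))  # "ab"
```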
3.2 Cole et al.’s -Wildcard Errata Structure
The Structure.
Cole et al.'s structure [10] consists of a collection of compact tries, which are further categorized into levels (from level $0$ to level $k$). We use $\mathcal{L}_t$ to denote the set of tries at level $t$, for $t \in [0, k]$. In $\mathcal{L}_0$ we have a single trie. In Cole et al.'s framework, this level-$0$ trie is taken as the complete suffix tree $\mathsf{ST}$. In this work, we will be more general, allowing it to be an arbitrary trie constructed from substrings of $T$. In $\mathcal{L}_1$, each trie corresponds to an internal node in the level-$0$ trie. A level-$1$ trie for an internal node $u$ of the level-$0$ trie is constructed as follows: we collect all strings corresponding to leaves in the subtree of $u$ except those in the subtree of the heavy child of $u$, remove their first $\mathrm{strdepth}(u)+1$ characters, and create a compact trie. Suppose that a substring used in level 1 is obtained from a substring starting at text position $i$ after removing its first characters; then we say that the substring in the level-$1$ trie corresponds to the original position $i$. Continuing, each trie in $\mathcal{L}_t$, $t \ge 2$, corresponds to a unique internal node in some trie in level $t-1$. A level-$t$ trie for a node $v$ in a level-$(t-1)$ trie is constructed over all substrings for leaves in the subtree of $v$ except those in the heavy child of $v$, and with their first $\mathrm{strdepth}(v)+1$ characters removed. We similarly maintain the correspondence with the original position in $T$ for each substring in $\mathcal{L}_t$. We use $\mathcal{L}$ to denote the collection of all levels. See Figure 3.
It follows directly from Property 1 of heavy path decomposition that the overall space is $O(n\log^k n)$. With a tighter analysis, one can show that the space is $O\big(n\binom{\lceil\log n\rceil + k}{k}\big)$. As the details of this analysis are important to our solution, and the statement of Lemma 4 differs from that made in [10], we present the argument in full.
Lemma 4.
Let $s$ be the number of leaves in the level-$0$ trie. Also, let $i$ be the original position for a leaf in the level-$0$ trie. Then $i$ is the original position for at most $\binom{\lceil\log s\rceil + k}{k}$ leaves in the entire $k$-level structure.
Proof.
We fix a particular leaf $\ell$ in the level-$0$ trie. Suppose the light nodes on the root-to-$\ell$ path are $u_1, u_2, \dots$, with respective parents $v_2, v_3, \dots$ (excluding $u_1$, since $u_1$ is the root). There are two cases: if $\ell$ contributes to $v_2$'s level-$1$ trie (that is, $\ell$ is in the subtree of a light child of $v_2$), then the size of that trie is at most $s/2$, the size of the one for $v_3$ is at most $s/4$, and so on. Otherwise, the size of the first level-$1$ trie that $\ell$ contributes to is at most $s/2$, the next at most $s/4$, and so on. Let the inductive hypothesis be that in a $k$-level structure for $s$ substrings, a given original position occurs for at most $\binom{\lceil\log s\rceil + k}{k}$ leaves. Based on the observation regarding subtree sizes, for a $(k+1)$-level structure, the number of leaves with original position $i$ is bounded by
$1 + \sum_{t \ge 1}\binom{\lceil\log(s/2^t)\rceil + k}{k} \le \binom{\lceil\log s\rceil + k + 1}{k + 1},$
where the last step uses the hockey-stick identity for binomial coefficients.
From Lemma 4, the total space needed for $s$ substrings is $O\big(s\binom{\lceil\log s\rceil + k}{k}\big)$ words.
Querying.
Querying is carried out by starting at the root of the level-$0$ trie and matching until the first wildcard character is reached. If the locus is an implicit node, then we consider the match to continue on the implicit node's edge. If the locus of the last character before the first $\phi$ is an explicit non-leaf node $u$, then we consider the match as continuing both to the heavy child of $u$ and to the level-$1$ trie for $u$. In both cases, we match the $\phi$ by skipping the next character. In the case of the level-$1$ trie for $u$, this is accomplished by skipping the $\phi$ and starting from the root of the level-$1$ trie.
The same procedure is recursively applied for the level-$t$ tries when the next wildcard is reached. Assuming $P$ has $k$ wildcards, this causes at most $2^k$ bifurcations to occur. If a mismatch occurs at any point during a branch of this search, we stop exploring that branch further. Note that if the search is performed character-by-character, then the total time required is $O(2^k m)$. However, by preprocessing all pattern segments so that they are known substrings of $T$, we can apply Lemma 3 and find all pattern loci in $O(2^k(t_b + \log\log n))$ time. Note that by using the unrooted LCP query structure for tries in levels $0, \dots, k-1$ and the rooted LCP query structure for tries in $\mathcal{L}_k$, the space complexity is unaltered.
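To see where the bifurcations come from, the sketch below runs a naive wildcard search on a plain dict-based trie; here a wildcard branches to every child (a $\sigma^k$ blow-up), which the errata structure replaces with just two continuations per wildcard, the heavy child and one side trie, yielding the $2^k$ bound.

```python
# Naive wildcard search on a nested-dict trie ('#' marks end of string).
def wildcard_search(trie, P, phi="?"):
    nodes = [trie]
    for ch in P:
        nxt = []
        for node in nodes:
            if ch == phi:                     # wildcard: branch to every child
                nxt += [v for key, v in node.items() if key != "#"]
            elif ch in node:                  # normal character: descend
                nxt.append(node[ch])
        nodes = nxt
    return any("#" in node for node in nodes)

trie = {}
for s in ["abab", "abc", "adc"]:
    node = trie
    for ch in s:
        node = node.setdefault(ch, {})
    node["#"] = True
print(wildcard_search(trie, "a?c"))  # True: matches "abc" and "adc"
```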
3.3 Our Data Structure
We maintain a base structure (as described in the beginning of Section 3) for $T$ and for $\overleftarrow{T}$. We assume that we are given a string attractor of size $\gamma$ (for example, the LZ77 phrase boundaries). This gives a factorization of $T$ with $O(\gamma)$ phrases.
Forward Sparse Structure.
The construction of our structure is similar to that of Cole et al.'s, but instead of having the full suffix tree as the level-$0$ trie, we keep a trie of only those suffixes corresponding to the phrase boundaries. We assign a global index to the leaves across all tries. The only condition we ensure is that all leaves in the same trie get contiguous global indices in left-to-right order. We call the resulting structure $\mathsf{T_{fwd}}$. Following the same analysis as presented in Section 3.2, the space required for $\mathsf{T_{fwd}}$ is $O\big(\gamma\binom{\lceil\log\gamma\rceil + k}{k}\big)$. We next equip all tries in levels $0, \dots, k-1$ with the unrooted LCP query structure described in Lemma 3 and the tries in $\mathcal{L}_k$ with the rooted LCP query structure. This does not change the space complexity.
Reverse Sparse Structure.
Next, we collect all phrases. We create a compact trie from the collection of reversed phrases (in contrast to the forward sparse structure, we do not use the entire prefix ending at a phrase boundary). For each leaf in the level-$0$ trie, we maintain the phrase boundary in $T$ that immediately follows it; that is, if the reversed phrase is $\overleftarrow{T[i..j]}$, we associate the original position $j+1$ with the corresponding leaf. With this trie as level 0, we then create Cole et al.'s $k$-level structure. We also give a global index labeling to the leaves as before. We call the resulting structure $\mathsf{T_{rev}}$. As before, we equip the tries in levels $0, \dots, k-1$ with the unrooted LCP query data structures from Lemma 3 and the rooted LCP structure on the tries in $\mathcal{L}_k$. The total space for this structure is $O\big(\gamma\binom{\lceil\log\gamma\rceil + k}{k}\big)$.
Orthogonal Range Query Structure.
Our next component is a 2D orthogonal range query structure. The 2D points are created as follows: let $x$ be the global index of a leaf in a level trie of $\mathsf{T_{fwd}}$ with original position $p$, and let $y$ be the global index of a leaf in a level trie of $\mathsf{T_{rev}}$ with original position $p'$. We create a point $(x, y)$ for all such $x$ and $y$ where $p = p'$ and the sum of the two levels is at most $k$. We associate the original position $p$ with that point.
3.4 Space Analysis
Applying Lemma 4, the number of leaves with original position $p$ at any level of $\mathsf{T_{fwd}}$ is bounded by $\binom{\lceil\log\gamma\rceil+k}{k}$. Similarly, the number of leaves assigned original position $p$ at any level in $\mathsf{T_{rev}}$ is also bounded by $\binom{\lceil\log\gamma\rceil+k}{k}$. As a result, the number of 2D points corresponding to a specific original position is bounded by
$\binom{\lceil\log\gamma\rceil+k}{k}^{2} \le \left(\frac{e(\lceil\log\gamma\rceil+k)}{k}\right)^{2k},$
where the last inequality applies the standard bound $\binom{a}{b}\le(ea/b)^{b}$. Next, we consider the inequality
$\lceil\log\gamma\rceil + k \le 2k\lceil\log\gamma\rceil.$
This holds as long as $k, \lceil\log\gamma\rceil \ge 1$, which is true. Thus, we can write
$\left(\frac{e(\lceil\log\gamma\rceil+k)}{k}\right)^{2k} \le \big(2e\lceil\log\gamma\rceil\big)^{2k} \le (c\log\gamma)^{2k},$
where we used in the last inequality that $\lceil\log\gamma\rceil \le 2\log\gamma$ for $\gamma \ge 2$. Letting $\lambda = c^{2k}$, which is constant for fixed $k$, the total number of 2D points is $O(\gamma\,\lambda\log^{2k}\gamma)$. It is now convenient to split our analysis into two cases, the case where $\log\gamma$ dominates $\log\frac{n}{\gamma}$ and vice versa:
-
If $\log\gamma$ dominates $\log\frac{n}{\gamma}$ (i.e., $\gamma \ge \sqrt{n}$), then the number of points is $O\big(\gamma(c\log\gamma)^{2k}\big)$.
-
If $\log\frac{n}{\gamma}$ dominates $\log\gamma$ (i.e., $\gamma < \sqrt{n}$), then the number of points is $O\big(\gamma(c\log\frac{n}{\gamma})^{2k}\big)$.
Hence, we can say the number of points is $O\big(\gamma(c\log n)^{2k}\big)$ in both cases.
We will first consider using a linear-space 2D range query data structure over these points. Note that the space required by the forward sparse structure and the reverse sparse structure is also bounded by the same expression. The resulting overall space complexity is therefore $O\big(\gamma(c\log n)^{2k}\big)$ for some constant $c$.
3.5 Querying Algorithm
We first present a simpler querying procedure and then show how it can be improved upon. Assume the query is $P[1..m]$ with $h \le k$ wildcards. We begin by searching for each maximal wildcard-free segment of $P$ (resp., its reverse) in the base structure of $T$ (resp., $\overleftarrow{T}$). If some segment does not exist in $T$, then there are no occurrences of $P$. Otherwise, this enables us to apply Lemma 3.
Finding Primary Occurrences.
Next, we obtain all primary occurrences of $P$. Let $\mathrm{occ}_p$ denote the number of primary occurrences of $P$. To accomplish finding all primary occurrences, we consider each split of $P$, that is, $\overleftarrow{P[1..j]}$ and $P[j+1..m]$ for each $j$. For a given split at $j$, we match $P[j+1..m]$ in $\mathsf{T_{fwd}}$ and $\overleftarrow{P[1..j]}$ in $\mathsf{T_{rev}}$. In more detail, suppose we have $h_1$ wildcards in the reverse part and $h_2$ wildcards in the forward part. Then, based on the analysis of the querying procedure in Section 3.2, there will be at most $2^{h_1}$ bifurcation loci in the search in $\mathsf{T_{rev}}$ and $2^{h_2}$ loci in $\mathsf{T_{fwd}}$. Applying Lemma 3, we spend $O(t_b + \log\log n)$ time per bifurcation. For a given split of $P$, we now have the loci in all tries of $\mathsf{T_{fwd}}$ where the matching of $P[j+1..m]$ ends. Denote this set of loci as $\Lambda_{\mathsf{fwd}}$. We also have the loci in all tries of $\mathsf{T_{rev}}$ where the matching of $\overleftarrow{P[1..j]}$ ends. Denote this set of loci as $\Lambda_{\mathsf{rev}}$. We next consider each pair of loci, $u \in \Lambda_{\mathsf{fwd}}$ and $v \in \Lambda_{\mathsf{rev}}$. For $u$ (resp., $v$) we obtain a contiguous range of global indices $[s_1, e_1]$ (resp., $[s_2, e_2]$) corresponding to the leaves in the subtree of $u$ (resp., $v$). For all such pairs of ranges, we make the 2D range query $[s_1, e_1]\times[s_2, e_2]$, and from each reported point's value $p$ and the offset $j$, we obtain a primary occurrence $p - j$.
Finding Secondary Occurrences.
From the set of primary occurrences, we first extract one occurrence per each distinct substring of $T$ that matches $P$. To do this, we find the $\mathsf{ISA}$ values of all primary occurrences first, sort them in ascending order, then scan this list in left-to-right order and retain only those entries whose $\mathsf{LCE}$ with the previous retained entry is less than $m$. Then, for each occurrence $i$ in the reduced set, we can find all other occurrences of $T[i..i+m-1]$ in $T$ as follows: First, initialize $t \leftarrow \mathsf{ISA}[i]+1$; while $\mathsf{LCE}(\mathsf{SA}[t], i) \ge m$, report $\mathsf{SA}[t]$ and increment $t$. Second, initialize $t \leftarrow \mathsf{ISA}[i]-1$; while $\mathsf{LCE}(\mathsf{SA}[t], i) \ge m$, report $\mathsf{SA}[t]$ and decrement $t$. This part of the query algorithm takes $O(\mathrm{occ}_p\log\mathrm{occ}_p + (\mathrm{occ}_p + \mathrm{occ})\,t_b)$ time.
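A minimal sketch of this expansion step follows; the naive SA/ISA/LCE here stand in for the base structure, and all indices in the code are 0-based internally with 1-based output.

```python
# Report all occurrences of T[i..i+m-1] by scanning outward from ISA[i] in
# suffix array order while the LCE with suffix i stays >= m.
def all_occurrences(T, i, m):
    n = len(T)
    SA = sorted(range(n), key=lambda s: T[s:])
    ISA = {s: r for r, s in enumerate(SA)}
    lce = lambda a, b: next((l for l in range(n)
                             if T[a + l:a + l + 1] != T[b + l:b + l + 1]), n)
    occ, r = [i], ISA[i]
    t = r + 1
    while t < n and lce(SA[t], i) >= m:   # forward scan in SA order
        occ.append(SA[t]); t += 1
    t = r - 1
    while t >= 0 and lce(SA[t], i) >= m:  # backward scan in SA order
        occ.append(SA[t]); t -= 1
    return sorted(p + 1 for p in occ)     # 1-based positions

print(all_occurrences("abababcabab", 0, 2))  # occurrences of "ab": [1,3,5,8,10]
```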
Time Complexity.
Regarding the query time complexity, as there are $O(2^k)$ bifurcation loci per pattern split and $m$ pattern splits, the total time for finding all pattern loci (using Lemma 3) is bounded by $O(m\,2^k(t_b + \log\log n))$. For orthogonal range queries, the number of pairs of ranges we have to consider for a given pattern split is bounded by $2^{h_1}\cdot 2^{h_2} \le 2^k$. Each query takes $O((1 + occ_{\beta,q})\log^\epsilon N)$ time, where $N$ is the number of 2D points and $occ_{\beta,q}$ refers to the number of occurrences reported for the range query $q$ (the locus pair) with pattern split $\beta$. Since all loci used for the orthogonal range queries are for distinct substrings, all reported occurrences for two different orthogonal range queries are distinct, and $\sum_q occ_{\beta,q} = occ'_\beta$, where $occ'_\beta$ is the total number of occurrences reported for a split $\beta$. Summing over all orthogonal range queries for a given split, we have the time used per split bounded by
$O\big((2^k + occ'_\beta)\log^\epsilon N\big).$
We claim $\sum_\beta occ'_\beta = \mathrm{occ}_p$. It is easy to see that every primary occurrence will be reported. Moreover, each primary occurrence will be reported only once, because $\mathsf{T_{rev}}$ is built over the set of phrases (rather than the entire prefix ending prior to the start of a phrase). Due to this, even if a primary occurrence contains multiple phrase boundaries, it is still only reported once. In particular, it gets reported for the split aligned with its leftmost phrase boundary.
To obtain the time used by orthogonal range queries for finding all primary occurrences, we sum over all splits $\beta$, resulting in the expression
$O\big((m\,2^k + \mathrm{occ}_p)\log^\epsilon N\big).$
Note that $\log N$ can be loosely bounded by $O(\log n)$. This is because the maximum value of the function $\gamma(c\log n)^{2k}$ over $\gamma \le n$ is $n(c\log n)^{2k}$, whose logarithm is $O(\log n)$ for constant $k$. Therefore, $\log^\epsilon N = O(\log^\epsilon n)$.
The total time is $O\big(m\,2^k(t_b + \log\log n) + (m\,2^k + \mathrm{occ}_p)\log^\epsilon n + \mathrm{occ}\cdot t_b\big)$, in addition to the time for finding one occurrence of all segments of $P$.
Lemma 5.
For Problem 1, there exists an $O\big(\gamma(c\log n)^{2k}\big)$-space data structure, where $\gamma$ is the size of the text's string attractor and $c$ is a fixed constant. The query time is $O\big(m\,2^k(t_b + \log\log n) + (m\,2^k + \mathrm{occ})\log^\epsilon n + \mathrm{occ}\cdot t_b\big)$, in addition to the time for the initial pattern search.
4 An Improved Solution
4.1 Our Data Structure
Our next aim is to replace the product $m\cdot 2^k$ in the time complexity with $\lceil m/\tau\rceil\cdot 2^k$. The idea is to handle long patterns ($m > \tau$) and short patterns ($m \le \tau$) separately, where $\tau$ is a parameter we specify later.
Long Patterns.
For long patterns, we include some additional suffixes and phrases (extended slightly more to the right and left) within our construction. Specifically, for each phrase boundary $p$, we include all of the suffixes $T[p+j..]$ for $j \in (-\tau, \tau)$ in the level-$0$ trie of $\mathsf{T_{fwd}}$ and the corresponding reversed prefixes $\overleftarrow{T[..p+j-1]}$ in that of $\mathsf{T_{rev}}$. We then build $\mathsf{T_{fwd}}$ and $\mathsf{T_{rev}}$ based on their respective level-$0$ tries. We also create a set of 2D points, again where $x$ is the global index of a leaf in $\mathsf{T_{fwd}}$, $y$ is the global index of a leaf in $\mathsf{T_{rev}}$, and $x$ and $y$ both have the same original position $p+j$ for some boundary $p$ and $|j| < \tau$. The linear space 2D orthogonal range query structure used in Section 3.3 is constructed over these points.
Short Patterns.
To handle short patterns, we collect the substrings $T[p-\tau..p+\tau]$ for each phrase boundary $p$ (also recording one of its occurrences in $T$). After appending each such substring with a $\#$ (a delimiter), we concatenate them and obtain a new string $Z$. We build the $k$-wildcard errata structure of Cole et al. [10] for $Z$.
4.2 Space Analysis
For the long pattern data structure, the number of substrings used in level $0$ of both $\mathsf{T_{fwd}}$ and $\mathsf{T_{rev}}$ now becomes $O(\gamma\tau)$. Following Lemma 4, the space for both $\mathsf{T_{fwd}}$ and $\mathsf{T_{rev}}$ becomes at most $O\big(\gamma\tau\binom{\lceil\log(\gamma\tau)\rceil+k}{k}\big)$. Following the same analysis presented in Section 3.4, the number of 2D points created, and hence the final asymptotic space complexity, is $O\big(\gamma\tau(c\log n)^{2k}\big)$ for a constant $c$. For short patterns, the length of $Z$ is $O(\gamma\tau)$. The $k$-wildcard errata structure over $Z$ requires $O\big(\gamma\tau\log^k(\gamma\tau)\big)$ space.
4.3 Querying Algorithm
As before, we first find occurrences of all pattern segments using the base structure. This enables us to apply Lemma 3. We first consider the long pattern case where $m > \tau$. Now, for finding primary occurrences, rather than trying every split as was done in Section 3.5, we split only at positions $j$ that are multiples of $\tau$. For a split of $P$ at $j$, we find the loci of $P[j+1..m]$ in $\mathsf{T_{fwd}}$ and of $\overleftarrow{P[1..j]}$ in $\mathsf{T_{rev}}$. For each pair of loci, we obtain two ranges of global leaf labels and perform a range query as before. From the original position associated with a reported point, we can obtain the corresponding primary occurrence. Secondary occurrences are then reported using the base structure.
For the short pattern case where $m \le \tau$, we first obtain the primary occurrences by querying the $k$-wildcard errata trie over $Z$. Secondary occurrences are obtained using the base structure.
4.4 Time Complexity
In the case of long patterns, we consider $O(m/\tau)$ different splits of the pattern. For a split, locating all of the loci in $\mathsf{T_{fwd}}$ and $\mathsf{T_{rev}}$ requires $O(2^k(t_b + \log\log n))$ time. Two key observations of our sampling method are that (i) every primary occurrence has a split that aligns with a sampled position, and (ii) each primary occurrence will be reported at most twice. See Figure 4. This follows from the splits considered being $\tau$ indices apart in $P$, making it so that at most two of them fall within the same $2\tau$-sized window around a phrase boundary. For a split at $j$ and a pair of loci, one in $\mathsf{T_{fwd}}$ and one in $\mathsf{T_{rev}}$, let $occ_{j,q}$ be the number of 2D points reported for a range query $q$. Combining the above two facts, for a particular split $j$, the time complexity for range queries is $O\big((2^k + \sum_q occ_{j,q})\log^\epsilon n\big)$. Summing over all splits, the time complexity for all range queries is $O\big(((m/\tau)2^k + \mathrm{occ}_p)\log^\epsilon n\big)$. Finally, using the base structure to obtain the secondary occurrences once again requires $O(\mathrm{occ}\cdot t_b)$ time. We conclude that the total time taken for long patterns is the time for locating all segments of $P$ in addition to
$O\big(\lceil m/\tau\rceil\,2^k(t_b + \log\log n + \log^\epsilon n) + \mathrm{occ}\,(t_b + \log^\epsilon n)\big).$
For short patterns, each locus is found in $O(t_b + \log\log n)$ time. The resulting time complexity for finding all occurrences is the time for locating all segments of $P$ in addition to
$O\big(2^k(t_b + \log\log n) + \mathrm{occ}\cdot t_b\big).$
Lemma 6.
For Problem 1, there exists an $O\big(\gamma\tau(c\log n)^{2k}\big)$-space data structure, where $\gamma$ is the size of the text's string attractor, $c$ is a fixed constant, and $\tau$ is a parameter. The query time, in addition to the time for the initial pattern search, is $O\big((m/\tau + 1)\,2^k(t_b + \log\log n + \log^\epsilon n) + \mathrm{occ}\,(t_b + \log^\epsilon n)\big)$.
5 Trade-Offs for Time and Space
We now look at some possible trade-offs based on the base structure and the choice of $\tau$.
5.1 Fully Functional Suffix Trees in Repetition-Aware Space
We first consider using the fully functional compressed suffix trees in [14] for the text and its reverse. The space is $O(r\log\frac{n}{r} + \bar{r}\log\frac{n}{\bar{r}})$ words, where $r$ and $\bar{r}$ are the number of runs in the BWT of $T$ and $\overleftarrow{T}$, respectively. Choosing $\tau$ appropriately, the additional space complexity becomes
Here $c$ is a large enough constant. For query times, we have $t_b = O(\log\frac{n}{r})$. The suffix ranges of all pattern segments can be found in $O(m\log\log n)$ time, and one occurrence of each segment is reported in $O(\log\frac{n}{r})$ time. This gives an index with the above space and the corresponding query time. We improve this result next.
5.2 The r-Index and Optimal LCE Structure (Completing Theorem 1)
Here we replace the fully functional suffix trees in the base structure with two r-indexes [14] (one for $T$ and one for $\overleftarrow{T}$), which brings down that part of the space to $O(r + \bar{r})$ words. Although the r-index supports only limited functionalities, we will see that they suffice for our purposes. We additionally maintain the $\delta$-SA [24] for $T$ and $\overleftarrow{T}$ in $O(\delta\log\frac{n}{\delta})$ space, for supporting LCE queries in poly-logarithmic time. In what follows, we examine the steps where the base structure is queried, and observe how our new base structure suffices.
-
For each segment of the pattern $P$, we need to find an occurrence (a text position of the segment and its inverse suffix array position). This can be done with the $O(r)$-space r-index. Moreover, the backward search procedure utilized by the r-index provides us with the corresponding values (i.e., a position in the text and its ISA) for every suffix of the pattern segment; we refer to the toehold lemma in [14], also see [42]. The same can be accomplished for the reverse of each pattern segment with the r-index for the reversed text. The total time for this step is $O(m\log\log n)$.
-
For rooted and unrooted LCP queries, we use the $\mathsf{ISA}$ value of the current suffix (or prefix) of a pattern segment as well as $\mathsf{LCE}$ queries. As observed above, the needed $\mathsf{ISA}$ values of suffixes and prefixes of pattern segments can be obtained while performing the backward search procedure. For LCE queries, we use the structure of Kempa and Kociumaka [24].
-
The next step utilizing the base structure is finding the secondary occurrences from the primary occurrences. The first step in our previous algorithm was to find a subset of primary occurrences such that, for each distinct substring of $T$ matching $P$, we have exactly one occurrence in that set. Previously, this procedure required ISA queries on arbitrary positions, which cannot be supported by the r-index. Therefore, we follow an alternative procedure, which relies on the fact that the r-index can support the operations $\Phi(i) = \mathsf{SA}[\mathsf{ISA}[i]-1]$ and $\Phi^{-1}(i) = \mathsf{SA}[\mathsf{ISA}[i]+1]$ for any given $i$ in $O(\log\log n)$ time [14]. Therefore, by applying $\Phi$ iteratively, we can enumerate, for any occurrence $i$, the starting positions of the lexicographically preceding suffixes one by one. Similarly, by iteratively applying $\Phi^{-1}$, we can enumerate the positions of the lexicographically succeeding suffixes.
We now describe the procedure. Let $\mathcal{O}$ be the non-reduced set of primary occurrences. Also, let $S$ be a set, which is initially empty; we maintain its elements in the form of a balanced binary search tree [2], supporting membership queries in logarithmic time. We process each $i \in \mathcal{O}$ in any order as follows. If $i \in S$, then skip $i$; otherwise,
1.
Initialize $j \leftarrow \Phi(i)$. While $\mathsf{LCE}(i, j) \ge m$, add $j$ to $S$ and make $j \leftarrow \Phi(j)$.
2.
Initialize $j \leftarrow \Phi^{-1}(i)$. While $\mathsf{LCE}(i, j) \ge m$, add $j$ to $S$ and make $j \leftarrow \Phi^{-1}(j)$.
Lastly, we output the set $\mathcal{O}\setminus S$.
The total query time follows by combining the earlier bounds with the poly-logarithmic cost per $\Phi$/$\Phi^{-1}$ step and per membership query. We summarize the results of this approach in Theorem 7.
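A minimal sketch of this deduplication loop is below; `phi`, `phi_inv`, and `lce` are assumed callables standing in for the r-index operations and the LCE structure, and a Python set stands in for the balanced BST of [2].

```python
# Keep one representative per distinct matching substring, using only
# Phi/Phi-inverse walks plus LCE checks (never ISA on arbitrary positions).
# phi / phi_inv are assumed to return None at the ends of the suffix array.
def dedup(occs, m, phi, phi_inv, lce):
    S = set()                      # positions covered by a chosen representative
    for i in occs:
        if i in S:
            continue               # same substring as an earlier representative
        for step in (phi, phi_inv):
            j = step(i)
            while j is not None and lce(i, j) >= m:
                S.add(j)           # j holds the same length-m substring as i
                j = step(j)
    return [i for i in occs if i not in S]
```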
Theorem 7.
There exists a solution for Problem 1 with space
and query time as derived above, where $\delta$ is the substring complexity, $\gamma$ is the size of a known string attractor, $r$ (resp., $\bar{r}$) denotes the number of runs in the BWT of $T$ (resp., $\overleftarrow{T}$), and $c$ is a fixed constant.
5.3 Matching Cole et al.’s Query Time (Completing Theorem 2)
If we are willing to utilize more space, an improved query time is possible. To that end, for the base structure, we use text indexes that support all queries in $O(1)$ time, just like (uncompressed) suffix trees, but in space just $O(n\log\sigma)$ bits, or, equivalently, $O(n\log\sigma/\log n)$ words [17, 47]. We also maintain the version of the FM-index in [4] of space $O(n\log\sigma)$ bits, which can be utilized to find the suffix ranges of all segments of $P$ in $O(m)$ total time. For 2D range queries, we use the super-linear space structure of Chan et al. [9]. Over a set of $N$ points, it occupies $O(N\log^\epsilon N)$ space and answers queries in $O(\log\log N + occ)$ time, where $occ$ is the output size. We choose $\tau$ accordingly.
The space complexity of our structure becomes $O\big(\gamma\tau(c\log n)^{2k}\log^\epsilon n\big)$ for some constant $c$. We again substitute $\gamma = O(\delta\log\frac{n}{\delta})$ and simplify to get the space bound shown in Theorem 2.
We now analyze the query complexity. The first step of finding an occurrence of each segment of the pattern can be done in $O(m)$ total time. Regarding long pattern queries, for a particular split $j$, the time is bounded by $O(2^k\log\log n + \sum_q occ_{j,q})$, where $occ_{j,q}$ and the splits are defined as in Section 4.4. Summing over all splits, we can bound the time complexity of computing the primary occurrences by $O((m/\tau)2^k\log\log n + \mathrm{occ}_p)$. To further obtain the secondary occurrences, we follow the same procedure as in Section 5.2; however, we use a bit string to support constant time membership queries on the set $S$ (i.e., the $i$-th bit is $1$ iff $i \in S$). Thus, the overall query complexity for long patterns is $O(m + 2^k\log\log n + \mathrm{occ})$ once $\tau$ is fixed as above. For short pattern queries, the time complexity is $O(m + 2^k\log\log n + \mathrm{occ})$ as well. This completes the proof of the first part of Theorem 2.
5.3.1 Achieving $O(m + \mathrm{occ})$ Query Time
Our goal in this subsection is to achieve $O(m + \mathrm{occ})$ query time. When $m = \Omega(2^k\log\log n)$ or $\mathrm{occ} = \Omega(2^k\log\log n)$, this has already been accomplished using Theorem 2. Therefore, we consider designing an auxiliary structure for queries with $m, \mathrm{occ} < \alpha$, where $\alpha = 2^k\log\log n$. We first take all substrings of length $2\alpha$ containing a phrase boundary. There are at most $O(\gamma\alpha)$ such substrings. For each distinct substring constructed above, we consider all possible ways of substituting up to $k$ of its characters with $\phi$ ($\phi$ is a special symbol). For a given string, there are at most $(2\alpha)^k$ ways to do this, where we used that $k \le \alpha$. Thus, the total number of strings generated in this way is bounded by
$O\big(\gamma\,\alpha\,(2\alpha)^{k}\big)$
for some constant in the exponent depending on $k$. We remove duplicate strings and build a compact trie over the resulting set. We associate with each node $u$ in the trie the first $\alpha$ positions in $T$ matching $\mathrm{str}(u)$ (treating $\phi$ as a wildcard). The total additional space becomes $O\big(\gamma\,\alpha^{2}(2\alpha)^{k}\big)$ for $\alpha = 2^k\log\log n$, as defined earlier.
Given a query pattern $P$, we replace every wildcard with $\phi$ and find its locus in the trie. Then, we report all associated occurrences if the number of associated occurrences is less than $\alpha$. This takes $O(m + \mathrm{occ})$ time. If the number of associated occurrences is greater than or equal to $\alpha$, we use the previous data structure.
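The sketch below shows the key preprocessing step, enumerating all ways to substitute up to $k$ characters with the wildcard symbol; the function name and the `?` stand-in for $\phi$ are ours, for illustration only.

```python
# Enumerate every variant of s with up to k positions replaced by phi.
from itertools import combinations

def wildcard_variants(s, k, phi="?"):
    out = {s}
    for t in range(1, k + 1):
        for pos in combinations(range(len(s)), t):
            w = list(s)
            for p in pos:
                w[p] = phi
            out.add("".join(w))
    return out

print(sorted(wildcard_variants("abc", 1)))  # ['?bc', 'a?c', 'ab?', 'abc']
```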
This completes the second part of Theorem 2.
5.4 A Simple, Space Optimal Solution
The suffix tree of $T$ can be used to answer the query in $O(\sigma^k m + \mathrm{occ})$ time for any $k$, by exhaustively exploring all possible substitutions of each wildcard with actual characters. This can be improved by leveraging the fact that an unrooted LCP query on the suffix tree can be resolved via binary search, requiring $O(\log n)$ basic operations (i.e., SA, ISA, and LCE queries). The query procedure can be simulated on the $O(r\log\frac{n}{r})$ space repetition-aware suffix tree [14]. Alternatively, we can use the $\delta$-SA [24] of $T$ and the structure in [26] for the initial pattern search. The combined space is $O(\delta\log\frac{n}{\delta})$ words, and the query time carries an extra $\sigma^k$ factor with poly-logarithmic overheads per operation.
6 Indexing for Pattern Matching with Gaps
We now consider a generalization of Problem 1, where the query consists of a pattern with gaps and we want to find its occurrences in the text. A maximal contiguous sequence of wildcards is defined as a single gap, and its size is the number of wildcards in it. The maximum number of gaps $k$ and the maximum size $g$ of a gap are fixed at construction. A formal definition is below.
Problem 2 ($\langle k, g\rangle$-Gapped Indexing).
Given a text $T[1..n]$ and integers $k$ and $g$, design a data structure that supports the following query efficiently.
-
Input: Strings $P_1, P_2, \dots, P_{k+1}$ of total length $m$ and integers $g_1, \dots, g_k$, where $g_i \le g$ for all $i$.
-
Output: The set of occurrences of the gapped-pattern $P_1\,\phi^{g_1}\,P_2\,\phi^{g_2}\cdots\phi^{g_k}\,P_{k+1}$ in $T$.
A solution by Lewenstein [30] offers an index whose space grows exponentially with $k$ (via $\log^k n$-type factors) and supports efficient query time. To obtain a repetition-aware solution, one could replace each gap with $g$ wildcard characters and apply our solution from Section 3. However, this would require a structure with a space complexity exponential in $k\cdot g$. On a related note, see [5, 15, 19, 20] for some interesting solutions for the special case where $k = 1$.
Our Data Structure.
We start with the sparse suffix trees for $T$ and $\overleftarrow{T}$ obtained in the same way as in Section 3.3. To construct the level-$1$ tries, we consider a particular explicit node $u$ in the level-$0$ trie. For each $d \in [1, g]$, consider every light node $v$ that (i) branches off the heavy path containing $u$ and (ii) has string depth satisfying $\mathrm{strdepth}(u) < \mathrm{strdepth}(v) \le \mathrm{strdepth}(u) + d$. For every substring represented by a leaf in the subtree of such a node $v$, remove its first $\mathrm{strdepth}(u) + d$ characters. These are combined into a single level-$1$ trie for $u$ and $d$. The same procedure is performed recursively until all $k$ levels are constructed over the forward structure. The same procedure is also used to construct the reverse structure. Again, a linear space 2D range query structure is constructed based on original positions, as was done in Section 3.3.
Querying Algorithm and Time Complexity.
To answer a query, we try each split within each segment one by one, i.e., splits of the form $\overleftarrow{P_1\phi^{g_1}\cdots P_i[1..j]}$ and $P_i[j+1..]\,\phi^{g_i}\cdots P_{k+1}$. When a gap is reached in either a trie of the forward structure or the reverse structure, if the locus is on an edge, we subtract the remaining edge label length from the gap size, obtaining a residual gap size $d$. We then continue from the lower endpoint of that edge, taking both the heavy path within the same trie (skipping $d$ characters) and the associated level trie for that node and gap size $d$. The query time complexity remains the same as in the $k$-wildcard problem.
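For concreteness, the sketch below spells out the gapped-pattern semantics by brute force; the index above reports the same occurrence set without scanning $T$.

```python
# Occurrences of P_1 phi^{g_1} P_2 ... phi^{g_k} P_{k+1}, by direct scan.
def gapped_occurrences(T, segments, gaps):
    m = sum(map(len, segments)) + sum(gaps)   # total gapped-pattern length
    out = []
    for start in range(len(T) - m + 1):
        pos, ok = start, True
        for idx, seg in enumerate(segments):
            if T[pos:pos + len(seg)] != seg:
                ok = False
                break
            pos += len(seg) + (gaps[idx] if idx < len(gaps) else 0)
        if ok:
            out.append(start + 1)             # 1-based occurrence
    return out

print(gapped_occurrences("abcxxabc", ["ab", "ab"], [3]))  # [1]: "ab"..."ab"
```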
Space Analysis.
To analyze the space required by this structure, consider that in our solution to Problem 1, a $\log$ factor arose in the analysis due to a potential bifurcation at every light node. Instead, we now take at most $g$ associated side tries for each node, causing us to replace that $\log$ factor with $g\log$. Following a similar analysis as Lemma 4, we arrive at a structure of size $O\big(\gamma(c\,g\log n)^{2k}\big)$ for some constant $c$, in addition to the size of the base structure. The key advantage of this space complexity is that it has ceased to be exponential in $g$. Using the same base structure from Section 5.2 and the same choice of $\tau$, we achieve Theorem 8.
Theorem 8.
There exists a solution for Problem 2 with space
Here $\delta$ is the substring complexity of the text and $c$ is a fixed constant.
References
- [1] Paniz Abedin, Oliver A. Chubet, Daniel Gibney, and Sharma V. Thankachan. Contextual pattern matching in less space. In Data Compression Conference, DCC 2023, Snowbird, UT, USA, March 21-24, 2023, pages 160–167. IEEE, 2023. doi:10.1109/DCC55655.2023.00024.
- [2] Georgii M. Adel'son-Vel'skii and Evgenii M. Landis. An algorithm for the organization of information. Soviet Math. Dokl., 3:1259–1263, 1962.
- [3] Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, and Mathieu Raffinot. Composite repetition-aware data structures. In Combinatorial Pattern Matching - 26th Annual Symposium, CPM 2015, Ischia Island, Italy, June 29 - July 1, 2015, Proceedings, volume 9133 of Lecture Notes in Computer Science, pages 26–39. Springer, 2015. doi:10.1007/978-3-319-19929-0_3.
- [4] Djamal Belazzougui and Gonzalo Navarro. Alphabet-independent compressed text indexing. ACM Trans. Algorithms, 10(4):23:1–23:19, 2014. doi:10.1145/2635816.
- [5] Philip Bille, Inge Li Gørtz, Moshe Lewenstein, Solon P. Pissis, Eva Rotenberg, and Teresa Anna Steiner. Gapped string indexing in subquadratic space and sublinear query time. In 41st International Symposium on Theoretical Aspects of Computer Science, STACS 2024, volume 289 of LIPIcs, pages 16:1–16:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPICS.STACS.2024.16.
- [6] Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind. String indexing for patterns with wildcards. Theory Comput. Syst., 55(1):41–60, 2014. doi:10.1007/S00224-013-9498-4.
- [7] Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994. URL: https://api.semanticscholar.org/CorpusID:2167441.
- [8] Timothy M. Chan, Kasper Green Larsen, and Mihai Pătraşcu. Orthogonal range searching on the RAM, revisited. In Proceedings of the 27th ACM Symposium on Computational Geometry, Paris, France, June 13-15, 2011, pages 1–10. ACM, 2011. doi:10.1145/1998196.1998198.
- [9] Timothy M. Chan and Konstantinos Tsakalidis. Dynamic orthogonal range searching on the RAM, revisited. J. Comput. Geom., 9(2):45–66, 2018. doi:10.20382/JOCG.V9I2A5.
- [10] Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don’t cares. In László Babai, editor, Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 91–100. ACM, 2004. doi:10.1145/1007352.1007374.
- [11] Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS ’97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137–143. IEEE Computer Society, 1997. doi:10.1109/SFCS.1997.646102.
- [12] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, 12-14 November 2000, Redondo Beach, California, USA, pages 390–398. IEEE Computer Society, 2000. doi:10.1109/SFCS.2000.892127.
- [13] Travis Gagie, Pawel Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi. LZ77-based self-indexing with faster pattern matching. In LATIN 2014: Theoretical Informatics - 11th Latin American Symposium, Montevideo, Uruguay, March 31 - April 4, 2014. Proceedings, volume 8392 of Lecture Notes in Computer Science, pages 731–742. Springer, 2014. doi:10.1007/978-3-642-54423-1_63.
- [14] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):2:1–2:54, 2020. doi:10.1145/3375890.
- [15] Arnab Ganguly, Daniel Gibney, Paul Macnichol, and Sharma V. Thankachan. Bounded-ratio gapped string indexing. In String Processing and Information Retrieval - 31st International Symposium, SPIRE 2024, Puerto Vallarta, Mexico, September 23-25, 2024, Proceedings, volume 14899 of Lecture Notes in Computer Science, pages 118–126. Springer, 2024. doi:10.1007/978-3-031-72200-4_9.
- [16] Daniel Gibney, Paul Macnichol, and Sharma V. Thankachan. Non-overlapping indexing in BWT-runs bounded space. In String Processing and Information Retrieval - 30th International Symposium, SPIRE 2023, Pisa, Italy, September 26-28, 2023, Proceedings, volume 14240 of Lecture Notes in Computer Science, pages 260–270. Springer, 2023. doi:10.1007/978-3-031-43980-3_21.
- [17] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput., 35(2):378–407, 2005. doi:10.1137/S0097539702402354.
- [18] Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J. Comput., 13(2):338–355, 1984. doi:10.1137/0213024.
- [19] Md Helal Hossen, Daniel Gibney, and Sharma V. Thankachan. Text indexing for faster gapped pattern matching. Algorithms, 17(12), 2024.
- [20] Costas S. Iliopoulos and M. Sohel Rahman. Indexing factors with gaps. Algorithmica, 55(1):60–70, 2009. doi:10.1007/S00453-007-9141-3.
- [21] Juha Kärkkäinen and Esko Ukkonen. Lempel-ziv parsing and sublinear-size index structures for string matching. In Proc. 3rd South American Workshop on String Processing (WSP), pages 141–155, 1996.
- [22] Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 756–767. ACM, 2019. doi:10.1145/3313276.3316368.
- [23] Dominik Kempa and Tomasz Kociumaka. Resolution of the burrows-wheeler transform conjecture. Commun. ACM, 65(6):91–98, 2022. doi:10.1145/3531445.
- [24] Dominik Kempa and Tomasz Kociumaka. Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023, pages 1877–1886. IEEE, 2023. doi:10.1109/FOCS57990.2023.00114.
- [25] Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 827–840. ACM, 2018. doi:10.1145/3188745.3188814.
- [26] Tomasz Kociumaka, Gonzalo Navarro, and Francisco Olivares. Near-optimal search time in $\delta$-optimal space, and vice versa. Algorithmica, 86(4):1031–1056, 2024. doi:10.1007/S00453-023-01186-0.
- [27] Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Towards a definitive measure of repetitiveness. In LATIN 2020: Theoretical Informatics - 14th Latin American Symposium, São Paulo, Brazil, January 5-8, 2021, Proceedings, volume 12118 of Lecture Notes in Computer Science, pages 207–219. Springer, 2020. doi:10.1007/978-3-030-61792-9_17.
- [28] Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repetitive sequences. IEEE Trans. Inf. Theory, 69(4):2074–2092, 2023. doi:10.1109/TIT.2022.3224382.
- [29] Sebastian Kreft and Gonzalo Navarro. Self-indexing based on LZ77. In Combinatorial Pattern Matching - 22nd Annual Symposium, CPM 2011, Palermo, Italy, June 27-29, 2011. Proceedings, volume 6661 of Lecture Notes in Computer Science, pages 41–54. Springer, 2011. doi:10.1007/978-3-642-21458-5_6.
- [30] Moshe Lewenstein. Indexing with gaps. In String Processing and Information Retrieval, 18th International Symposium, SPIRE 2011, Pisa, Italy, October 17-21, 2011. Proceedings, volume 7024 of Lecture Notes in Computer Science, pages 135–143. Springer, 2011. doi:10.1007/978-3-642-24583-1_14.
- [31] Moshe Lewenstein, J. Ian Munro, Yakov Nekrich, and Sharma V. Thankachan. Document retrieval with one wildcard. In Mathematical Foundations of Computer Science 2014 - 39th International Symposium, MFCS 2014, Budapest, Hungary, August 25-29, 2014. Proceedings, Part II, volume 8635 of Lecture Notes in Computer Science, pages 529–540. Springer, 2014. doi:10.1007/978-3-662-44465-8_45.
- [32] Moshe Lewenstein, J. Ian Munro, Venkatesh Raman, and Sharma V. Thankachan. Less space: Indexing for queries with wildcards. Theor. Comput. Sci., 557:120–127, 2014. doi:10.1016/J.TCS.2014.09.003.
- [33] Moshe Lewenstein, Yakov Nekrich, and Jeffrey Scott Vitter. Space-efficient string indexing for wildcard pattern matching. In 31st International Symposium on Theoretical Aspects of Computer Science (STACS 2014), STACS 2014, March 5-8, 2014, Lyon, France, volume 25 of LIPIcs, pages 506–517. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2014. doi:10.4230/LIPICS.STACS.2014.506.
- [34] Heng Li and Richard Durbin. Fast and accurate short read alignment with burrows-wheeler transform. Bioinform., 25(14):1754–1760, 2009. doi:10.1093/BIOINFORMATICS/BTP324.
- [35] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol., 17(3):281–308, 2010. doi:10.1089/CMB.2009.0169.
- [36] César Martínez-Guardiola, Nathaniel K. Brown, Fernando Silva-Coira, Dominik Köppl, Travis Gagie, and Susana Ladra. Augmented thresholds for MONI. In Data Compression Conference, DCC 2023, Snowbird, UT, USA, March 21-24, 2023, pages 268–277. IEEE, 2023. doi:10.1109/DCC55655.2023.00035.
- [37] Gonzalo Navarro. Indexing highly repetitive string collections, part I: repetitiveness measures. ACM Comput. Surv., 54(2):29:1–29:31, 2022. doi:10.1145/3434399.
- [38] Gonzalo Navarro. Indexing highly repetitive string collections, part II: compressed indexes. ACM Comput. Surv., 54(2):26:1–26:32, 2022. doi:10.1145/3432999.
- [39] Gonzalo Navarro. Computing mems and relatives on repetitive text collections. ACM Trans. Algorithms, 21(1):12:1–12:33, 2025. doi:10.1145/3701561.
- [40] Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Comput. Surv., 39(1):2, 2007. doi:10.1145/1216370.1216372.
- [41] Takaaki Nishimoto and Yasuo Tabei. Optimal-time queries on bwt-runs compressed indexes. In 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference), volume 198 of LIPIcs, pages 101:1–101:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ICALP.2021.101.
- [42] Alberto Policriti and Nicola Prezza. LZ77 computation based on the run-length encoded BWT. Algorithmica, 80(7):1986–2011, 2018. doi:10.1007/S00453-017-0327-Z.
- [43] Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, and Adam D. Smith. Sublinear algorithms for approximating string compressibility. Algorithmica, 65(3):685–709, 2013. doi:10.1007/S00453-012-9618-6.
- [44] Massimiliano Rossi, Marco Oliva, Paola Bonizzoni, Ben Langmead, Travis Gagie, and Christina Boucher. Finding maximal exact matches using the r-index. J. Comput. Biol., 29(2):188–194, 2022. doi:10.1089/CMB.2021.0445.
- [45] Massimiliano Rossi, Marco Oliva, Ben Langmead, Travis Gagie, and Christina Boucher. MONI: A pangenomic index for finding maximal exact matches. J. Comput. Biol., 29(2):169–187, 2022. doi:10.1089/CMB.2021.0290.
- [46] Luís M. S. Russo, Gonzalo Navarro, and Arlindo L. Oliveira. Fully compressed suffix trees. ACM Trans. Algorithms, 7(4):53:1–53:34, 2011. doi:10.1145/2000807.2000821.
- [47] Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory Comput. Syst., 41(4):589–607, 2007. doi:10.1007/S00224-006-1198-X.
- [48] Daniel Dominic Sleator and Robert Endre Tarjan. A data structure for dynamic trees. J. Comput. Syst. Sci., 26(3):362–391, 1983. doi:10.1016/0022-0000(83)90006-5.
- [49] James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. J. ACM, 29(4):928–951, 1982. doi:10.1145/322344.322346.
- [50] Igor Tatarnikov, Ardavan Shahrabi Farahani, Sana Kashgouli, and Travis Gagie. MONI can find k-MEMs. In 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 26:1–26:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.CPM.2023.26.
- [51] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1–11. IEEE Computer Society, 1973. doi:10.1109/SWAT.1973.13.
- [52] Dan E. Willard. Log-logarithmic worst-case range queries are possible in space $\Theta(N)$. Inf. Process. Lett., 17(2):81–84, 1983. doi:10.1016/0020-0190(83)90075-3.
- [53] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, 23(3):337–343, 1977. doi:10.1109/TIT.1977.1055714.