Text Indexing for Simple Regular Expressions
Abstract
We study the problem of indexing a text T so that, later, given a query regular expression pattern R of size m, we can report all the substrings of T matching R. The problem is known to be hard for arbitrary patterns R, so in this paper we consider the following two types of patterns: (1) character-class Kleene-star patterns of the form P1 [C]* P2, where P1 and P2 are strings and [C] is a character class (shorthand for the regular expression (c1 | c2 | ... | cj) over characters c1, ..., cj), and (2) string Kleene-star patterns of the form P1 (P2)* P3, where P1, P2, and P3 are strings. In case (1), we describe an index using near-linear space that solves queries in time near-linear in m on constant-sized alphabets. We also describe a general solution for any alphabet size. This result is conditioned on the existence of an anchor: a character of the pattern that does not belong to C. We justify this assumption by proving that no efficient indexing solution can exist if an anchor is not present, unless the Set Disjointness Conjecture fails. In case (2), we describe a near-linear-size index answering queries in time near-linear in m for any alphabet size.
Keywords and phrases:
Text indexing, regular expressions, data structures
Funding:
Hideo Bannai: JSPS KAKENHI Grant Number JP24K02899.
2012 ACM Subject Classification:
Theory of computation → Pattern matching
Acknowledgements:
Work initiated at Dagstuhl Seminar 24472 “Regular Expressions: Matching and Indexing.”
Editors:
Paola Bonizzoni and Veli Mäkinen
1 Introduction
A regular expression specifies a set of strings formed by characters from an alphabet combined with the concatenation (·), union (|), and Kleene star (*) operators. For instance, (b|ab)* describes the set of strings of a's and b's such that every a is followed by a b. The text indexing for regular expressions problem is to preprocess a text T to support efficient regular expression matching queries on T, that is, given a regular expression R, report all occurrences of R in T. Here, an occurrence is a substring of T that matches one of the strings belonging to the regular language of R. We also consider existential regular expression matching queries, that is, determining whether or not there is an occurrence of R in T. The goal is to obtain a compact data structure while supporting efficient queries.
Regular expressions are a fundamental concept in formal language theory, introduced by Kleene in the 1950s [24], and regular expression matching queries are a basic tool in computer science for searching and processing text. Standard tools such as grep and sed provide direct support for regular expression matching in files, and the scripting language perl [45] is a complete programming language designed to easily support regular expression matching queries. Regular expression matching appears in many large-scale data processing applications, such as internet traffic analysis [22, 46, 28], data mining [17], databases [32, 33], computational biology [37], and human-computer interaction [23]. Most of these solutions are based on efficient algorithms for the classic regular expression matching problem, where we are given both the text T and the regular expression R as input, and the goal is to report the occurrences of R in T. However, in many scenarios, the text T is available before we are given the regular expressions, and we may want to ask multiple regular expression matching queries on T. In this case, we ideally want to take advantage of preprocessing to speed up the queries, and thus the indexing version of the problem applies.
While the regular expression matching problem is a well-studied classic problem [44, 34, 5, 4, 11, 12, 2, 13, 43, 7, 16], surprisingly few results are known for the text indexing for regular expressions problem. Let n and m be the lengths of T and R, respectively. Gibney and Thankachan [19] recently showed that text indexing for regular expressions is hard to solve efficiently under popular complexity conjectures. More precisely, they showed that, conditioned on the online matrix-vector multiplication conjecture, even with arbitrary polynomial preprocessing time, we cannot answer existential queries in time, for any . They also show that, conditioned on a slightly stronger assumption, we cannot even answer existential queries in time, for any . Gibney and Thankachan also studied upper bound time-space trade-offs with exponential preprocessing. Specifically, given a parameter fixed at preprocessing, we can solve the problem using space and preprocessing time and query time.
On the other hand, a few text indexing solutions have been studied for highly restricted kinds of regular expressions or regular expression-like patterns. These include text indexing for string patterns (simple strings corresponding to regular expressions that only use concatenations) and string patterns with wildcards and gaps (strings that include special characters or sequences of special characters that match any other character) and similar extensions [14, 10, 31, 41, 6, 21, 29, 27, 8, 30, 18].
Thus, we should not hope to efficiently solve text indexing for general regular expressions, and efficient solutions are only known for highly restricted regular expressions. Hence, a natural question is whether there are simple regular expressions for which efficient solutions are possible and that form a large subset of those used in practice. This paper considers the following two such kinds of regular expressions and provides either efficient solutions or conditional lower bounds for them:
-
Character-class Kleene-star patterns. These are patterns of the form P1 [C]* P2, where P1 and P2 are strings and [C] is a character class, shorthand for the regular expression (c1 | c2 | ... | cj).
-
String Kleene-star patterns. These are patterns of the form P1 (P2)* P3, where P1, P2, and P3 are strings.
In other words, we provide solutions (or lower bounds) for all regular patterns containing only concatenations and at most one occurrence of a Kleene star (either of a string or of a character class). Using the notation introduced by the seminal paper of Backurs and Indyk [2] on the hardness of (non-indexed) regular expression matching, character-class Kleene-star patterns belong to the “concatenation of Kleene stars of unions” type, where the unions may be degenerate, i.e., consist of a single character. To see this, observe that the characters of P1 and P2 can be interpreted as degenerate unions of one character (without a Kleene star). String Kleene-star patterns, on the other hand, belong to the “concatenation of Kleene stars of concatenations” type. Again (as discussed in [2]), since any level of the regular expression tree is allowed to contain leaves (i.e., individual characters), patterns of the form P1 (P2)* P3 belong to this type by interpreting the characters of P1 and P3 as leaves in the regular expression tree. Our main results are new text indices that use near-linear space while supporting both kinds of queries in time near-linear in the length of the pattern (under certain unavoidable assumptions discussed in detail below; if the assumptions fail, we show that the problem becomes hard again). Below, we introduce our results and discuss them in the context of the results obtained in [2].
1.1 Setup and Results
We first consider text indexing for character-class Kleene-star patterns P1 [C]* P2, where [C] is a character class. We say that the pattern is anchored if either P1 or P2 has a character that is not in C, and we call such a character an anchor. If the pattern is anchored, we show the following result.
Theorem 1.
Let T be a text of length n over an alphabet Σ. Given a parameter and a constant fixed at preprocessing time, we can build a data structure that uses space and supports anchored character-class Kleene-star queries P1 [C]* P2, where [C] is a character class, in time with high probability. Here, occ is the number of occurrences of the pattern in T.
In particular, our solution supports queries in almost optimal time for constant-sized alphabets. We also extend the result of Theorem 1 to handle slightly more general character-class interval patterns of the form P1 [C]^{>=k1} P2, P1 [C]^{<=k2} P2, and P1 [C]^{[k1,k2]} P2, meaning that there are at least k1, at most k2, and between k1 and k2 copies of characters from C, respectively.
Intuitively, our strategy is to identify, for every possible starting position i and every possible set C′ ⊆ Σ, all the right-maximal substrings of T starting at i that contain only symbols in C′. Such a substring will form the “[C]*” part of the occurrences. For each such substring, we then insert into a range reporting data structure a three-dimensional point whose (lexicographically sorted) coordinates encode the context of the substring in T. The data structure is labeled by the set C′. We finally observe that the pattern can be used to query the right range data structure and report all matches of P1 [C]* P2 in T.
Conversely, we show the following conditional lower bound if the pattern is not anchored.
Theorem 2.
Let T be a text of length n over an alphabet Σ and let . Assuming the strong Set Disjointness Conjecture, any data structure that supports existential (non-anchored) character-class Kleene-star pattern matching queries P1 [C]* P2, where [C] is a character class with at least 3 characters, in time, requires space.
With , Theorem 2 implies that any near-linear-space solution must have query time . On the other hand, with , Theorem 2 implies that any solution using query time independent of must use space.
To prove Theorem 2, we reduce from the Set Disjointness Problem: preprocess a collection of sets so that we can quickly answer, for any pair of sets, whether they are disjoint. It was shown in [9] that, without loss of generality, we can assume every element appears in the same number of sets. The idea is then to define a string gadget representing each set, and a block for each element of the universe containing the string gadget of every set the element is included in. The blocks are separated by a character that does not appear in any block. This way, the intersection of two sets is non-empty if and only if their gadgets appear somewhere in the string separated only by characters that appear in a block.
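For concreteness, the reduction can be sketched in Python, using the re module as a stand-in for a (non-anchored) character-class Kleene-star query. The gadget encoding and the separator character # are illustrative choices, not the ones used in the formal proof:

```python
import re

def gadget(i):
    # hypothetical encoding of set number i; any code over characters
    # distinct from the block separator '#' works
    return f"<{i}>"

def build_text(sets, universe):
    # one block per universe element, listing the gadgets of all sets
    # containing it; blocks are separated by '#', a character that
    # appears in no block
    blocks = []
    for e in universe:
        blocks.append("".join(gadget(i) for i, S in enumerate(sets) if e in S))
    return "#".join(blocks)

def intersect(text, a, b):
    # sets a < b intersect iff gadget(a) and gadget(b) occur in the same
    # block, i.e., separated only by non-'#' characters: exactly a
    # non-anchored character-class Kleene-star query gadget(a)[^#]*gadget(b)
    pattern = re.escape(gadget(a)) + "[^#]*" + re.escape(gadget(b))
    return re.search(pattern, text) is not None
```

An existential query for the pattern thus decides disjointness of a pair of sets, so an efficient index for non-anchored patterns would violate the conjecture.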
As noted above, character-class Kleene-star patterns belong to the “concatenation of Kleene stars of unions” type. Backurs and Indyk [2] prove a quadratic lower bound for this class of regular expressions. Our result shows that even the more restricted subclass of character-class Kleene-star patterns is hard if no anchor is present.
We then consider text indexing for string Kleene-star patterns P1 (P2)* P3. We show the following result.
Theorem 3.
Let T be a text of length n over an alphabet Σ. Given a constant fixed at preprocessing time, we can build a data structure that uses space and supports string Kleene-star pattern queries P1 (P2)* P3 in time , where occ is the number of occurrences of the pattern in T.
As discussed above, string Kleene-star patterns belong to the “concatenation of Kleene stars of concatenations” type. For this type of pattern, Backurs and Indyk [2] proved a conditional lower bound (for any constant ) in the offline setting for both pattern matching and membership queries. Our result, instead, implies an offline solution running in near-linear time (by stopping after locating the first pattern occurrence) after the indexing phase. This does not contradict Backurs and Indyk’s lower bound, since our patterns are a very specific case of the broader type. Equivalently, this indicates that including more than one Kleene star makes the problem hard again and thus justifies an index for the simpler case P1 (P2)* P3.
The main idea behind the strategy for Theorem 3 is to preprocess all maximal periodic substrings (called runs) in the string, so that we can quickly find patterns ending just before or starting just after a run. However, there are some difficulties to overcome: firstly, P2 may itself be periodic, in which case we must avoid reporting spurious occurrences; secondly, a run may end with a partial occurrence of the period; and lastly, P2 may share a suffix with P1 or a prefix with P3, in which case their occurrences should overlap with the run. We show how to deal with these difficulties in Section 4.
2 Preliminaries
A string S of length n is a sequence of n characters drawn from an ordered alphabet Σ of size σ. The string S[i]S[i+1]⋯S[j], denoted S[i..j], is called a substring of S; S[1..j] and S[i..n] are called a prefix and a suffix of S, respectively. We use ε to denote the empty string (i.e., the string of length 0). The reverse string of a string S of length n, denoted by S^R, is given by S[n]S[n-1]⋯S[1]. Let P and T be strings over an alphabet Σ. We say that the range [i, j] is an occurrence of P in T iff T[i..j] = P.
Lexicographic order and Lyndon words.
The order of the alphabet defines a lexicographic order on the set of strings as follows: for two strings S1 and S2, let l be the length of the longest common prefix of S1 and S2. We have S1 < S2 if and only if either i) |S1| = l < |S2| or ii) both S1 and S2 have length at least l + 1 and S1[l+1] < S2[l+1]. A string S is a Lyndon word if it is lexicographically smaller than all of its proper cyclic shifts, i.e., S < S[i..|S|]S[1..i-1] for all 1 < i ≤ |S|.
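A minimal Python sketch of the Lyndon word definition, checking all proper cyclic shifts directly (quadratic time, for illustration only):

```python
def is_lyndon(s):
    # a nonempty string is a Lyndon word iff it is strictly smaller than
    # every proper cyclic shift of itself
    return len(s) > 0 and all(s < s[i:] + s[:i] for i in range(1, len(s)))
```

Note that a non-primitive string such as "aa" is not a Lyndon word, since it equals one of its cyclic shifts.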
Concatenation of strings.
The concatenation of two strings S1 and S2 is defined as S1 · S2 = S1S2. The concatenation of k copies of a string S is denoted by S^k, where k ≥ 0; i.e., S^0 = ε and S^k = S^{k-1}S. A string S is called primitive if there is no string U and integer k ≥ 2 such that S = U^k.
Sets of strings.
We denote by , , , and . The concatenation of a string with a set of strings is defined as . Similarly, the concatenation of two sets of strings and is defined as . We define , , , and for sets analogously. We say that the range is an occurrence of a set of strings if there is a such that is an occurrence of in .
Period of a string.
An integer p is a period of a string S of length n if and only if S[i] = S[i+p] for all 1 ≤ i ≤ n - p. A string is called periodic if it has a period p ≤ n/2. The smallest period of S will be called the period of S.
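The smallest period can be computed in linear time from the KMP failure function, since the smallest period of S equals |S| minus the length of the longest proper border of S. A short Python sketch:

```python
def period(s):
    # smallest period via the KMP failure function: fail[i] is the length
    # of the longest proper border of s[:i+1]
    n = len(s)
    fail = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return n - fail[-1] if n > 0 else 0
```

For example, "abcabcab" has longest border "abcab" and hence period 3; it is periodic since 3 ≤ 8/2.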
Tries and suffix trees.
A trie for a collection of strings S1, ..., Sr is a rooted labeled tree such that: (1) the label on each edge is a character of Σ; (2) each string in the collection is represented by a path going from the root down to some node (obtained by concatenating the labels on the edges of the path); (3) each root-to-leaf path represents a string from the collection; (4) common prefixes of two strings share the same path maximally. A compact trie is obtained by dissolving all nodes except the root, the branching nodes, and the leaves, and concatenating the labels on the edges incident to dissolved nodes to obtain string labels for the remaining edges.
The suffix tree of a string S over an alphabet Σ is the compact trie of the set of all suffixes of S. Throughout this paper, we assume that nodes in a compact trie or the suffix tree use deterministic dictionaries to store their children.
3 Character-class Kleene-star Patterns
In this section we give our data structure for answering anchored character-class Kleene-star pattern queries P1 [C]* P2. Without loss of generality, we can assume that the anchor belongs to P2 (the other case is captured by building our structures on the reversed text and querying the reversed pattern).
Recall that we assume for some parameter fixed at construction time. We first describe a solution for the case , and then in Section 3.3 show how to handle the case where .
Our general strategy is to identify all the right-maximal substrings of T, for every possible starting position i, that contain all and only the symbols of C (we later generalize the solution to consider all the possible subsets of C). Such a substring forms the “[C]*” part of the occurrences. For this sake, the substring must be preceded by P1 and followed by P2. However, if P2 starts with some symbols in C, those symbols will belong to the right-maximal substring. We therefore split P2 into its longest prefix containing only symbols from C, followed by a remainder that starts with the anchor. The new condition is then that the right-maximal substring ends with this prefix and is followed by the remainder. See Figure 1.
We need the following definitions.
Definition 4.
The C-prefix of P2 is the longest prefix of P2 that is formed only by symbols in C. We define the remainder of P2 as the suffix such that P2 is the concatenation of its C-prefix and this remainder.
Definition 5.
The k-run of T that starts at position i is the maximal range [i, j] such that T[i..j] contains exactly k distinct symbols. If the suffix T[i..n] has fewer than k distinct symbols, then there is no k-run starting at i. We call C_i the set of symbols that occur in the k-run that starts at position i.
Note that T contains at most n k-runs, each starting at a distinct position i.
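Since the right endpoint of a k-run can only grow as the starting position increases, all k-runs can be computed with a standard two-pointer scan. A Python sketch (0-indexed, returning the set C_i with each run; illustrative, not part of the formal construction):

```python
from collections import Counter

def k_runs(T, k):
    # for each start i, report the k-run (i, j, C_i): the maximal range
    # [i, j] such that T[i..j] contains exactly k distinct symbols
    runs, counts, j = [], Counter(), 0
    for i in range(len(T)):
        # extend the right endpoint while it does not exceed k distinct symbols
        while j < len(T) and (T[j] in counts or len(counts) < k):
            counts[T[j]] += 1
            j += 1
        if len(counts) == k:
            runs.append((i, j - 1, frozenset(counts)))
        # drop T[i] before moving the start to i + 1
        counts[T[i]] -= 1
        if counts[T[i]] == 0:
            del counts[T[i]]
    return runs
```

Each text position is added to and removed from the window once, so the scan runs in linear time for a fixed k.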
We first show how to find occurrences matching all symbols of C in the “[C]*” part of the pattern. Then, we complete this solution by allowing matches with any subset of C.
3.1 Matching all Characters of
We show how to build a data structure for the case where the set C is known at construction time, and we only find the occurrences that match exactly all distinct letters of C in the “[C]*” part of the occurrence. Recall that we also assume that P2 contains an anchor.
Data structure.
Let be the set of subsets of size that occur as a -run in . Our data structure consists of the following:
-
The suffix tree of and the suffix tree of the reversed text, .
-
A data structure for each set indexing all the text positions . The structure consists of an orthogonal range reporting data structure for a four-dimensional grid in with points, one per -run with . For each such -run we store a point with coordinates , where:
–
the first coordinate is the lexicographic rank of among all the reversed prefixes of ;
–
the second coordinate is the lexicographic rank of among all the reversed prefixes of ;
–
the third coordinate is the lexicographic rank of among all the suffixes of .
Each point stores the limits of its -run (so as to report occurrence positions).
-
A trie storing all the strings of length formed by sorting in increasing order the characters of , for every .
Note that the fourth coordinate of point could be avoided (i.e. using a 3D range reporting data structure) by defining to be the lexicographic rank of (where is a special terminator character) in the set formed by all the reversed prefixes of and strings of the form , for all -runs . While this solution would work in the same asymptotic space and query time (because we will only need one-sided queries on the fourth coordinate), we will need the fourth dimension in Subsection 3.4.
Basic search.
At query time, we first compute the C-prefix of P2 and its remainder. For any occurrence of the query pattern, the “[C]*” part extended with the C-prefix will necessarily be a suffix of a k-run. This is why we need P2 to contain an anchor; P1 is not restricted because we index every possible initial position i.
We then sort the symbols of C and use the trie to find the corresponding data structure.
We now find the lexicographic range using the suffix tree of and the suffix tree of the reversed text, . The range then corresponds to the leaf range of the locus of in , the range to the leaf range of the locus of in , and the range to the leaf range of the locus of in .
Once the four-dimensional range is identified, we extract all the points from in the range using the range reporting data structure.
Time and space.
The suffix trees use space . The total number of points in the range reporting data structures is as there are at most -runs. Because we will perform one-sided searches on the fourth coordinate, the grid of can be represented in space, for any constant , so that range searches on it take time to report the points in the range [38, Thm. 7]. Thus, the total space for the range reporting data structures is . The space of the trie is .
The string can easily be computed in time with high probability using a dictionary data structure [15]. Sorting can be done in time [1]. By implementing the pointers of node children in and in the suffix trees using perfect hashing (see [36]), the search in takes worst-case time and the three searches in and take total time . The range reporting query takes time . In total, a query takes time with high probability. (Unfortunately, [38, Thm. 7] does not describe the construction of the range reporting data structure that we use, so we are not able to provide the construction time and working space of our index.)
3.2 Matching any Subset of
We now show how to find all occurrences of , that is, also the ones containing only a subset of the characters of in the part of the occurrence.
Our previous search will not capture the runs containing only characters from a proper subset of C, as we only find occurrences surrounding the runs containing all characters from C. To solve this, we will build an orthogonal range reporting data structure for every subset. To capture all the occurrences, we search the corresponding grids of all the nonempty subsets of C, which leads to a cost exponential in the number of distinct characters of C. We wish to avoid, however, the cost of searching the suffix trees for every subset of C. In the following we show how to do this.
Data Structure.
Let . Our data structure consists of the following.
-
The suffix tree of and the suffix tree of the reversed text, .
-
The data structure from Section 3.1 for each set .
-
A trie storing all the strings of length 1 to , in increasing alphabetic order of characters, that correspond to some .
The suffix trees use linear space. The space for each of the range reporting data structures is . Added over all , the total space becomes . The space for the trie is , since there are at most strings, each of length at most . Since we assume , the total space is .
Search.
To perform the search, we traverse the trie to find all the subsets of C as follows. Let Q be the string formed by concatenating all symbols of C in sorted order. Letting N_i be the set of trie nodes reached after processing Q[1..i] (initially, N_0 contains only the root), N_{i+1} is obtained by inserting into N_i the nodes reached by following the edges labeled with character Q[i+1] from nodes in N_i. In other words, for each symbol of Q we try both skipping it and descending by it in the trie. The last set contains all the nodes of the trie corresponding to subsets of C. Each time we reach a node corresponding to some set which has an associated range reporting data structure, we perform a range reporting query on it.
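The traversal can be sketched as follows in Python. The Node class and the helper names are ours, and plain dictionaries play the role of the deterministic child dictionaries assumed in the preliminaries:

```python
class Node:
    def __init__(self):
        self.children = {}   # child edges, keyed by character
        self.subset = None   # set label if this node stores a subset string

def build_trie(subset_strings):
    root = Node()
    for s in subset_strings:   # each s lists a stored subset in sorted order
        node = root
        for c in s:
            node = node.children.setdefault(c, Node())
        node.subset = s
    return root

def stored_subsets(root, sorted_chars):
    # frontier N_i: trie nodes reachable by subsequences of the first i
    # characters; for each character we either skip it or descend by it
    frontier = {root}
    for c in sorted_chars:
        frontier |= {n.children[c] for n in frontier if c in n.children}
    return sorted(n.subset for n in frontier if n.subset is not None)
```

The final frontier contains exactly the trie nodes whose strings are subsequences of the sorted query characters, i.e., the stored subsets of C.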
Since the range is the same for all queries, we only compute this once. This is done by a search for in . The intervals and , on the other hand, change during the search, as the split of into and depends on the subset .
To compute these intervals we first preprocess as follows. Compute the ranges for all reversed prefixes of using the suffix tree : Start by looking up the locus for and then find the remaining ones by following suffix links.
Similarly, we compute the ranges for the suffixes of following suffix links in .
If we know the length of , we can then easily look up the corresponding intervals.
Maintaining the split of P2. We now explain how to maintain the length of the prefix of P2 matching the current subset in constant time for every trie node we meet during the traversal. The difficulty is that, when traversing the trie, we add characters to the current subset in lexicographic order and not in the order they occur in P2 (see Figure 2).
First we compute, for each character c of C, the position of the first occurrence of c in P2; if c does not occur in P2, we set this position to infinity. For each character, we furthermore compute its position rank, i.e., its rank among the characters sorted by first occurrence. We build:
-
a dictionary saving the position rank of each element .
-
an array containing the position of the first occurrence of the characters in in rank order, i.e., for each character , .
Let be the first character in position rank order that is not in . Then . The main idea is to maintain the intervals of characters in in position rank order. The position rank of can then easily be computed from the set of intervals and used to compute . Let be the first interval in in sorted order. If then otherwise, . We use an array to store the intervals of . We will maintain the invariant that if and only if the element with position rank is in . Furthermore, we will maintain the invariant that the first, respectively last, position of an interval of nonzero entries in contains the position of the end, respectively start, of the interval. Initially, all positions in are .
We proceed as follows. Initialize and initialize an empty stack . We now maintain as follows:
When we go down during the traversal adding a character to the set, we first lookup in and set . If there are no changes. Otherwise, let , i.e., is the set we had before inserting . We set and compute the leftmost position of the interval in containing : If then set . Otherwise, there is an interval in ending at position that must be merged with , and contains the left endpoint of this interval. Therefore we set . To compute the rightmost position of the interval in containing : If then set . Otherwise, there is another interval starting at position and we set to be the end of this interval, i.e., . We then push onto the stack to be able to quickly undo the operations later. Then we update by setting and . Finally, we update : If set . Otherwise, does not change.
When going up in the traversal removing character we first lookup . If there are no changes. Otherwise, we pop from the stack and set , , , and .
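The bookkeeping described above can be sketched as follows. This is an illustrative Python implementation of the array of intervals with LIFO undo (ranks are 1-based; first_missing plays the role of finding the first position rank not in the set, from which the split length is derived):

```python
def make_interval_tracker(m):
    # I[r] != 0 iff rank r is in the set; the first and last positions of
    # each maximal interval of nonzero entries store the interval's other
    # endpoint, so an insert merges neighbouring intervals in O(1)
    I = [0] * (m + 2)   # ranks 1..m, with zero sentinels at 0 and m+1
    stack = []          # saved entries, so going up the trie can undo

    def insert(r):
        left = I[r - 1] or r    # start of the merged interval
        right = I[r + 1] or r   # end of the merged interval
        stack.append((r, left, right, I[left], I[right]))
        I[r] = r                # mark r present (endpoints set below)
        I[left], I[right] = right, left

    def undo():
        r, left, right, old_l, old_r = stack.pop()
        I[left], I[right] = old_l, old_r
        I[r] = 0

    def first_missing():
        # smallest rank not in the set: 1 if rank 1 is absent, otherwise
        # one past the end of the interval starting at rank 1
        return 1 if I[1] == 0 else I[1] + 1

    return insert, undo, first_missing
```

Each insert saves the two overwritten endpoint entries on the stack, so going back up the traversal restores the previous interval structure in constant time.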
Figure 2: An example string and its characters listed in position rank order.
Time.
It takes time to search for in . Computing and for all splits of takes time . Sorting can be done in time [1]. Computing for all characters in , sorting them, computing the ranks , and constructing the arrays and and the dictionary takes linear time in the pattern length with high probability. The size of the subtrie we visit in the search is and in each step we use constant time to compute the length of . The total time for the range queries is . Thus, in total we use time with high probability.
3.3 Solution for
In the discussion above, we assumed that . If , we build the data structure described above by replacing with . The space of the data structure is still . At query time, if we use the data structure to answer queries in time.
If, on the other hand, then . We first find all occurrences of P1 and P2 using the suffix tree. Let L1 be the end positions of the occurrences of P1 and let L2 be the start positions of the occurrences of P2. We sort the lists L1 and L2. This can all be done in time and linear space using radix sort. We also mark with a 1, in a bitvector B of length n, all text positions containing a character from C. This can be done in time with high probability, with a simple scan of T and a dictionary over C [15]. We build a data structure over the bitvector supporting rank queries in constant time [42]. We can now find all occurrences of the pattern by considering the occurrences in sorted order in a merge-like fashion. Recall that P2 has an anchor. We consider the first occurrence in the list L1 and find the first occurrence in L2 that comes after it. If all characters between them are from C (constant time with two rank operations on the bitvector B), we output the occurrence. We delete the occurrence from the list and continue in the same way. In total, we find all occurrences in time with high probability. In summary, this proves Theorem 1.
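The rank-based check that all characters between two positions belong to C can be sketched with prefix counts, a simple stand-in for the constant-time rank structure of [42]:

```python
def build_prefix_counts(T, C):
    # prefix[i] = number of positions among T[0..i-1] whose character is
    # in C; this emulates rank queries on the bitvector B described above
    prefix = [0]
    for ch in T:
        prefix.append(prefix[-1] + (ch in C))
    return prefix

def all_in_class(prefix, i, j):
    # True iff every character of T[i..j] (inclusive, 0-indexed) is in C:
    # two "rank" lookups, i.e., constant time per check
    return prefix[j + 1] - prefix[i] == j - i + 1
```

A succinct rank structure achieves the same constant-time query within o(n) extra bits instead of an integer array.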
3.4 Character-Class Interval Patterns
We extend our solution to handle patterns of the form P1 [C]^{>=k1} P2, P1 [C]^{<=k2} P2, and P1 [C]^{[k1,k2]} P2, meaning that there are at least k1, at most k2, and between k1 and k2 copies of characters from C, respectively. We collectively call these character-class interval patterns.
By using one-sided restrictions on the fourth dimension, we can easily handle queries of the form P1 [C]^{>=k1} P2 in our solution from the previous section. Handling queries of the form P1 [C]^{<=k2} P2 or P1 [C]^{[k1,k2]} P2 requires a two-sided restriction on the fourth dimension. This raises the space of the grid to , while retaining its query time [38, Thm. 7] [39]. With these observations we obtain the following results.
Theorem 6.
Let be a text of length over an alphabet . Given a parameter and a constant fixed at preprocessing time, we can build a data structure that uses space and supports anchored character-class interval queries of the form in time , where is a character class with characters, , and is the number of occurrences of the pattern in .
Theorem 7.
Let be a text of length over an alphabet . Given a parameter and a constant fixed at preprocessing time, we can build a data structure that uses space and supports anchored character-class interval queries of the form or in time , where is a character class with characters, , and is the number of occurrences of the pattern in .
An alternative solution, when longer matches are more interesting than shorter ones, is to store the points in a three-dimensional grid and use the run lengths as the point weights. Three-dimensional grids on weighted points can use space and report points from larger to smaller weight in time [35, Lem. A.5]. We can use this to report the occurrences from longer to shorter k-runs, thereby stopping when the length drops below the threshold. We insert the first answer of each of the grids into a priority queue, where the priority is the length of the matched k-run, then extract the longest answer and replace it by the next point from the same grid, repeating until returning all the desired answers. The time per returned element now includes an extra factor if we implement the priority queue with a dynamic predecessor search data structure, plus the cost of the initial insertions. We can also return only the longest answers in this case, within a total time of .
4 String Kleene-star Patterns
In this section we give our data structure for supporting string Kleene-star pattern queries.
As an intermediate step, we first create a structure that, given strings x and z, a primitive string y, and numbers a and b with a ≤ b, where x and y do not share a suffix and y and z do not share a prefix, finds all occurrences in T of patterns of the form x y^i z, where a ≤ i ≤ b. Later we will show that this is sufficient to find occurrences of P1 (P2)* P3. For now, we assume that x and z are not the empty string; we will handle these cases later. We will also assume that y is not the empty string: in our transformation from P1 (P2)* P3 to this form, y will be empty if and only if P2 is empty, in which case the problem reduces to matching a plain string in the suffix tree.
To define our data structures, we need the notion of a run (or maximal repetition) in .
Definition 8.
A run of T is a periodic substring T[i..j] such that its period cannot be extended to the left or the right. That is, if the smallest period of T[i..j] is p, then T[i-1] ≠ T[i-1+p] and T[j+1] ≠ T[j+1-p] (whenever these positions exist). We can write T[i..j] = y^e y′, where |y| = p, e ≥ 2, and y′ is a proper prefix of y. We also call T[i..j] a run of y. The Lyndon root of a run of y is the cyclic shift of y that is a Lyndon word.
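For illustration, runs can be enumerated naively by testing every candidate period and extending it maximally in both directions; this quadratic-time Python sketch is not the linear-time runs algorithm used by the index:

```python
def find_runs(T):
    # all runs (i, j, p): maximal periodic substrings T[i..j] whose
    # smallest period is p and whose length is at least 2p
    def smallest_period(s):
        k, fail = 0, [0] * len(s)
        for i in range(1, len(s)):
            while k > 0 and s[i] != s[k]:
                k = fail[k - 1]
            if s[i] == s[k]:
                k += 1
            fail[i] = k
        return len(s) - fail[-1]

    n, found = len(T), set()
    for p in range(1, n // 2 + 1):
        for i in range(n - 2 * p + 1):
            if T[i:i + p] != T[i + p:i + 2 * p]:
                continue
            # extend the periodicity maximally to the right and left
            j = i + 2 * p
            while j < n and T[j] == T[j - p]:
                j += 1
            s = i
            while s > 0 and T[s - 1] == T[s - 1 + p]:
                s -= 1
            if smallest_period(T[s:j]) == p:   # keep only the smallest period
                found.add((s, j - 1, p))
    return sorted(found)
```

For example, "aabaab" contains three runs: the two squares "aa" and the whole string with period 3, matching the runs theorem's linear bound.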
Our general strategy is to preprocess all runs into a data structure, such that we can quickly determine the runs preceded by and followed by , which additionally end on and have a length that matches the query.
Data structure.
Let with be a run in . For each we insert a point in a three-dimensional grid where . Each point stores the positions and has coordinates defined as follows:
-
the first coordinate is the lexicographic rank of among all the reversed prefixes of ;
-
the second coordinate is the lexicographic rank of among all the suffixes of ;
-
the third coordinate is .
Furthermore, we construct a compact trie of the strings of all runs and a lookup table such that given and we can find . Finally, we store the suffix tree of and the suffix tree of the reversed text .
By the runs theorem, the sum of exponents of all runs in T is O(n) [26, 3]; hence the total number of grids and points is O(n). Let be the number of points in the grid . We store in the orthogonal range reporting data structure [39] using space, so that 5-sided searches on it take time , for any constant , to report the points in the range. Hence, our structure uses space in total.
Query.
To answer a query as above, we find the query ranges using the suffix trees and . The ranges and correspond to the leaf ranges of the loci of in and in , respectively. Finally, we find all occurrences of with as the points in inside the 5-sided query .
The ranges in and can be found in time if the suffix tree nodes use deterministic dictionaries to store their children (see [36]). Again, we augment each suffix tree node with the lexicographic range of the suffixes represented by the leaves below . We then do a single query to the range data structure , which reports points in time. We have proven the following:
Lemma 9.
Given a text T over alphabet Σ, we can build a data structure that uses space and can answer the following queries: given two non-empty strings x and z, a primitive string y, and numbers a and b with a ≤ b, where x and y do not share a suffix and y and z do not share a prefix, find all occurrences in T of patterns of the form x y^i z, where a ≤ i ≤ b. The query time is , where occ is the number of occurrences.
Transforming P1 (P2)* P3 into the form x y^i z.
Given we compute the strings , and and the numbers , , , and as follows: The string is , where is the length of the longest common suffix of and . Let and . We compute and such that and is maximal (this can be done in time, e.g., using KMP [25]). By definition of and , we have that . Therefore, and do not share a suffix.
Let be the length of the longest common prefix of and . We define as and . Note that by definition of , and do not share a prefix. Finally, we let and . See Figure 3.
The transformation can be done in time: The longest common suffix of and can be computed in time and the longest common prefix of and in time. Further, as mentioned, the period of can be found in time. Other than that, the transformation consists of modulo calculations and cyclic shifts, which clearly can be done in linear time.
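The linear-time ingredients of this transformation can be sketched as follows (illustrative helper names; the smallest period is obtained from the classical border/period duality underlying KMP [25]):

```python
def failure(s):
    """KMP failure function: f[i] is the length of the longest
    proper border (prefix that is also a suffix) of s[:i+1]."""
    f = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k and s[i] != s[k]:
            k = f[k - 1]
        if s[i] == s[k]:
            k += 1
        f[i] = k
    return f

def smallest_period(s):
    """Smallest period of s: |s| minus its longest border."""
    return len(s) - failure(s)[-1]

def lcp(a, b):
    """Length of the longest common prefix of a and b; the
    longest common suffix is lcp(a[::-1], b[::-1])."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return i
```

All three helpers run in time linear in their input, matching the linear-time bound claimed for the transformation.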
4.1 When one of and is the Empty String
In the transformation above, it might happen that or or both are empty, in which case the data structure from Lemma 9 cannot be used. We give additional data structures to handle these cases in this and the next subsection. Let us first consider the case where and . The general idea is that to answer a query , , where and do not share a suffix, we need to find all occurrences of followed by a long enough run of . Note that each one of these occurrences can contain multiple occurrences of our pattern, for different choices of .
Data structure.
Let with be a run in . For each such run, we insert a point into a two-dimensional grid . Each point stores the positions and of the occurrence of the run. The coordinates of the point in are defined as follows:
-
is the lexicographic rank of among all reversed prefixes of .
-
.
In terms of space complexity, as before, by the runs theorem, the sum of exponents of all runs in is [26, 3]. Thus, the total number of points in is . Further, we store a compact trie of all ’s together with a dictionary for finding and using linear space. The two-dimensional points can be processed into a data structure allowing -sided range queries in linear space and running time [40], where is the number of reported points.
Query.
To answer a query , as before, we find the lexicographic range for using the suffix tree . Then, we query the grid for . For a point with obtained this way, we report for all such that and , which is equivalent to .
Querying the grid reports points in time, and each reported point gives at least one occurrence. The additional occurrences can be found in constant time per occurrence. Thus, the total query time is .
We can deal with the case where analogously, by building the same structure on and reversing the pattern.
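The constant-time-per-occurrence enumeration inside a reported run can be sketched as follows (the interface is an assumption made for illustration: the run contains `copies` aligned full repetitions of the period, starting at text position `run_start`):

```python
def occurrences_in_run(run_start, copies, period_len, lo, hi):
    """Enumerate (position, k) for all occurrences of the k-fold
    repetition of the period, lo <= k <= hi, inside a run with
    `copies` aligned full copies of the period. Each occurrence
    is emitted in constant time."""
    out = []
    for k in range(lo, min(hi, copies) + 1):
        # the k-fold repetition can start at any of the first
        # copies - k + 1 aligned positions of the run
        for m in range(copies - k + 1):
            out.append((run_start + m * period_len, k))
    return out
```

For instance, a run with three copies of a length-2 period starting at position 5 contains two occurrences with exponent 2 and one with exponent 3.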
4.2 When both and are the Empty String
If both and are the empty string, then we cannot “anchor” our occurrences at the start of a run; that is, may occur in runs whose period is a shift of . To deal with this, we characterize all runs by their Lyndon root, and write as a query of the form , where is a Lyndon word. In the following, we show how to answer these kinds of queries.
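The Lyndon root of a run is the lexicographically least rotation of its (primitive) period; it can be computed with the following simple sketch (quadratic in the worst case; Booth's least-rotation algorithm achieves linear time):

```python
def lyndon_root(c):
    """Lexicographically least rotation of the primitive string c;
    this is the Lyndon root shared by all runs whose period is a
    rotation of c."""
    d = c + c  # every rotation of c is a length-|c| window of c+c
    best = 0
    for i in range(1, len(c)):
        if d[i:i + len(c)] < d[best:best + len(c)]:
            best = i
    return d[best:best + len(c)]
```

For example, runs with periods "ab" and "ba" share the Lyndon root "ab", which is what lets us canonicalize a query whose period may be shifted.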
We create a structure that, given a primitive string that is a Lyndon word and numbers , , , , and , finds all occurrences of patterns of the form in , where and .
Data structure.
For a run with in , let be the Lyndon root of the run, and let , and be such that . We build a three-dimensional grid . For each run, we store and the point . We store in a linear space data structure which supports five-sided range queries in time , where is the number of reported points, given in [39]. By the runs theorem, the total number of points in all s is bounded by , and thus so is the space of our data structure.
Query.
Assume we are given a query . In the following, we again have to find runs of which are long enough, but with an extra caveat: we need to treat the runs differently depending on i) whether and ii) whether , since, depending on those, the leftmost and rightmost occurrences in the run have different positions. This gives us four cases to investigate.
-
1.
We find all points in . For each such point, we output the following occurrences: , where and .
-
2.
We find all points in . For each such point, we output all occurrences of the form , where and .
-
3.
We find all points in and output the occurrences of the form , where and .
-
4.
We find all points in and output all occurrences of the form , where and .
Each range query uses time, where is the number of reported points, and each reported point gives at least one occurrence. Additional occurrences within the same run can be found in constant time per occurrence. Thus, the total time is .
In summary, we have proved Theorem 3.
5 Conditional Lower Bound for Character-class Kleene-star Patterns without an Anchor
We now prove Theorem 2. The conditional lower bound is based on the Strong Set Disjointness Conjecture formulated in [20] and stated in the following.
Definition 10 (The Set Disjointness Problem).
In the Set Disjointness problem, the goal is to preprocess sets of elements from a universe into a data structure, to answer the following kind of query: For a pair of sets and , is empty or not?
Conjecture 11 (The Strong Set Disjointness Conjecture).
For an instance satisfying , any solution to the Set Disjointness problem answering queries in time must use space.
The lower bound example in [9], Section 5.2, specifically shows that, assuming Conjecture 11, indexing to solve queries of the form requires space, assuming one desires to answer queries in time, for any . The alphabet size in their lower bound example is 3. To extend this lower bound to queries of the form , we have to slightly adapt this lower bound and increase the alphabet size to 4 ( will equal 3 in the example).
When reducing from Set Disjointness, as a first step, [9] shows that we can assume that every universe element appears in the same number of sets (Lemma 6 in [9]). Call this number . Then, they construct a string of length from alphabet as follows: For each element , they build a gadget consisting of the concatenation of the binary encodings of the sets is contained in, each encoding followed by a . Such a gadget has length . To each gadget, they append a block of many , and then concatenate the resulting strings of length in an arbitrary order.
We adapt this reduction as follows: the gadgets are defined in the same way as before, only each gadget is followed by a symbol , where , instead of a block . The rest of the construction is the same. Now, to answer a query to the Set Disjointness problem, we set to the binary encoding of , to the binary encoding of , and . This query finds an occurrence if and only if there is a gadget, corresponding to an element , that contains both the encoding of and the encoding of , which means that is contained in both and . The rest of the proof proceeds as in [9].
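The construction and its query can be sketched as follows (the separator '#', the terminator 'c', and the encoding width are illustrative assumptions; the regular-expression search merely mirrors the character-class Kleene-star query and is not the paper's index):

```python
import re

def build_text(sets, k):
    """Concatenate one gadget per universe element: the binary
    encodings of the indices of the sets containing the element,
    each followed by '#', terminated by 'c'. Assumes every
    element appears in exactly k sets (Lemma 6 in [9])."""
    u = max((max(s) for s in sets if s), default=-1) + 1
    w = max(1, (len(sets) - 1).bit_length())  # bits per set index
    gadgets = []
    for x in range(u):
        idx = [i for i, s in enumerate(sets) if x in s]
        g = "".join(format(i, f"0{w}b") + "#" for i in idx)
        gadgets.append(g + "c" * (k * (w + 1) - len(g) + 1))
    return "".join(gadgets)

def disjoint(text, i, j, w):
    """Answer a Set Disjointness query (i != j) by searching for
    enc(i)# [01#]* enc(j)# (in either order): since 'c' is not in
    the character class, a match stays inside one gadget and thus
    witnesses an element contained in both sets."""
    pi = re.escape(format(i, f"0{w}b") + "#")
    pj = re.escape(format(j, f"0{w}b") + "#")
    pat = f"{pi}[01#]*{pj}|{pj}[01#]*{pi}"
    return re.search(pat, text) is None
```

For example, with sets [{0}, {1}, {0}, {1}] (every element in exactly two sets), the pair of sets 0 and 1 is disjoint, while sets 0 and 2 share the element 0.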
References
- [1] Arne Andersson, Torben Hagerup, Stefan Nilsson, and Rajeev Raman. Sorting in linear time? In Proc. 27th STOC, pages 427–436, 1995. doi:10.1145/225058.225173.
- [2] Arturs Backurs and Piotr Indyk. Which regular expression patterns are hard to match? In Proc. 57th FOCS, pages 457–466, 2016. doi:10.1109/FOCS.2016.56.
- [3] Hideo Bannai, Tomohiro I, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. The "runs" theorem. SIAM J. Comput., 46(5):1501–1514, 2017. doi:10.1137/15M1011032.
- [4] Philip Bille. New algorithms for regular expression matching. In Proc. 33rd ICALP, pages 643–654, 2006. doi:10.1007/11786986_56.
- [5] Philip Bille and Martin Farach-Colton. Fast and compact regular expression matching. Theoret. Comput. Sci., 409:486–496, 2008. doi:10.1016/J.TCS.2008.08.042.
- [6] Philip Bille and Inge Li Gørtz. Substring range reporting. Algorithmica, 69:384–396, 2014. doi:10.1007/S00453-012-9733-4.
- [7] Philip Bille and Inge Li Gørtz. Sparse regular expression matching. In Proc. 35th SODA, pages 3354–3375, 2024. doi:10.1137/1.9781611977912.120.
- [8] Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, and Teresa Anna Steiner. Gapped indexing for consecutive occurrences. Algorithmica, 85(4):879–901, 2023. doi:10.1007/S00453-022-01051-6.
- [9] Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, and Teresa Anna Steiner. Gapped indexing for consecutive occurrences. Algorithmica, 85(4):879–901, 2023. doi:10.1007/S00453-022-01051-6.
- [10] Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind. String indexing for patterns with wildcards. Theory Comput. Syst., 55(1):41–60, 2014. doi:10.1007/S00224-013-9498-4.
- [11] Philip Bille and Mikkel Thorup. Faster regular expression matching. In Proc. 36th ICALP, pages 171–182, 2009. doi:10.1007/978-3-642-02927-1_16.
- [12] Philip Bille and Mikkel Thorup. Regular expression matching with multi-strings and intervals. In Proc. 21st SODA, 2010.
- [13] Karl Bringmann, Allan Grønlund, and Kasper Green Larsen. A dichotomy for regular expression membership testing. In Proc. 58th FOCS, pages 307–318, 2017. doi:10.1109/FOCS.2017.36.
- [14] Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don’t cares. In Proc. 36th STOC, pages 91–100, 2004. doi:10.1145/1007352.1007374.
- [15] Martin Dietzfelbinger and Friedhelm Meyer auf der Heide. A new universal class of hash functions and dynamic hashing in real time. In Proc. 17th ICALP, pages 6–19, 1990. doi:10.1007/BFB0032018.
- [16] Bartłomiej Dudek, Paweł Gawrychowski, Garance Gourdel, and Tatiana Starikovskaya. Streaming regular expression membership and pattern matching. In Proc. 33rd SODA, pages 670–694, 2022.
- [17] Minos N Garofalakis, Rajeev Rastogi, and Kyuseok Shim. SPIRIT: Sequential pattern mining with regular expression constraints. In Proc. 25th VLDB, pages 223–234, 1999. URL: http://www.vldb.org/conf/1999/P22.pdf.
- [18] Daniel Gibney. An efficient elastic-degenerate text index? not likely. In Proc. 27th SPIRE, pages 76–88, 2020. doi:10.1007/978-3-030-59212-7_6.
- [19] Daniel Gibney and Sharma V. Thankachan. Text indexing for regular expression matching. Algorithms, 14(5), 2021. doi:10.3390/a14050133.
- [20] Isaac Goldstein, Tsvi Kopelowitz, Moshe Lewenstein, and Ely Porat. Conditional lower bounds for space/time tradeoffs. In Proc. 15th WADS, pages 421–436, 2017. doi:10.1007/978-3-319-62127-2_36.
- [21] Costas S. Iliopoulos and M. Sohel Rahman. Indexing factors with gaps. Algorithmica, 55(1):60–70, 2009. doi:10.1007/S00453-007-9141-3.
- [22] Theodore Johnson, S. Muthukrishnan, and Irina Rozenbaum. Monitoring regular expressions on out-of-order streams. In Proc. 23rd ICDE, pages 1315–1319, 2007. doi:10.1109/ICDE.2007.369001.
- [23] Kenrick Kin, Björn Hartmann, Tony DeRose, and Maneesh Agrawala. Proton: multitouch gestures as regular expressions. In Proc. SIGCHI, pages 2885–2894, 2012. doi:10.1145/2207676.2208694.
- [24] S. C. Kleene. Representation of events in nerve nets and finite automata. In C. E. Shannon and J. McCarthy, editors, Automata Studies, Ann. Math. Stud. No. 34, pages 3–41. Princeton U. Press, 1956.
- [25] Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350, 1977. doi:10.1137/0206024.
- [26] Roman M. Kolpakov and Gregory Kucherov. Finding maximal repetitions in a word in linear time. In Proc. 40th FOCS, pages 596–604, 1999. doi:10.1109/SFFCS.1999.814634.
- [27] Tsvi Kopelowitz and Robert Krauthgamer. Color-distance oracles and snippets. In Proc. 27th CPM, pages 24:1–24:10, 2016. doi:10.4230/LIPICS.CPM.2016.24.
- [28] Sailesh Kumar, Sarang Dharmapurikar, Fang Yu, Patrick Crowley, and Jonathan Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Proc. SIGCOMM, pages 339–350, 2006. doi:10.1145/1159913.1159952.
- [29] Moshe Lewenstein. Indexing with gaps. In Proc. 18th SPIRE, pages 135–143, 2011. doi:10.1007/978-3-642-24583-1_14.
- [30] Moshe Lewenstein, J. Ian Munro, Venkatesh Raman, and Sharma V. Thankachan. Less space: Indexing for queries with wildcards. Theor. Comput. Sci., 557:120–127, 2014. doi:10.1016/J.TCS.2014.09.003.
- [31] Moshe Lewenstein, Yakov Nekrich, and Jeffrey Scott Vitter. Space-efficient string indexing for wildcard pattern matching. In Proc. 31st STACS, pages 506–517, 2014. doi:10.4230/LIPICS.STACS.2014.506.
- [32] Quanzhong Li and Bongki Moon. Indexing and querying XML data for regular path expressions. In Proc. 27th VLDB, pages 361–370, 2001. URL: http://www.vldb.org/conf/2001/P361.pdf.
- [33] Makoto Murata. Extended path expressions of XML. In Proc. 20th PODS, pages 126–137, 2001.
- [34] E. W. Myers. A four-Russians algorithm for regular expression pattern matching. J. ACM, 39(2):430–448, 1992. doi:10.1145/128749.128755.
- [35] G. Navarro and Y. Nekrich. Top- document retrieval in compressed space. In Proc. 36th SODA, pages 4009–4030, 2025.
- [36] Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Comput. Surv., 39(1):2–es, 2007. doi:10.1145/1216370.1216372.
- [37] Gonzalo Navarro and Mathieu Raffinot. Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. J. Comput. Bio., 10(6):903–923, 2003. doi:10.1089/106652703322756140.
- [38] Yakov Nekrich. New data structures for orthogonal range reporting and range minima queries. arXiv preprint arXiv:2007.11094, 2020. arXiv:2007.11094.
- [39] Yakov Nekrich. New data structures for orthogonal range reporting and range minima queries. In Proc. 32nd SODA, pages 1191–1205, 2021. doi:10.1137/1.9781611976465.73.
- [40] Yakov Nekrich and Gonzalo Navarro. Sorted range reporting. In Proc. 13th SWAT, pages 271–282, 2012. doi:10.1007/978-3-642-31155-0_24.
- [41] Pierre Peterlongo, Julien Allali, and Marie-France Sagot. Indexing gapped-factors using a tree. Int. J. Found. Comput. Sci., 19(1):71–87, 2008. doi:10.1142/S0129054108005541.
- [42] Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. 13th SODA, pages 233–242, 2002. URL: http://dl.acm.org/citation.cfm?id=545381.545411.
- [43] Philipp Schepper. Fine-grained complexity of regular expression pattern matching and membership. In Proc. 28th ESA, 2020.
- [44] K. Thompson. Regular expression search algorithm. Commun. ACM, 11:419–422, 1968. doi:10.1145/363347.363387.
- [45] Larry Wall. The Perl Programming Language. Prentice Hall Software Series, 1994.
- [46] Fang Yu, Zhifeng Chen, Yanlei Diao, T. V. Lakshman, and Randy H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Proc. ANCS, pages 93–102, 2006. doi:10.1145/1185347.1185360.