Text Indexing for Simple Regular Expressions
Abstract
We study the problem of indexing a text T so that, later, given a query regular expression pattern R of size m, we can report all the substrings of T matching R. The problem is known to be hard for arbitrary patterns R, so in this paper we consider the following two types of patterns: (1) character-class Kleene-star patterns of the form P1 [C]* P2, where P1 and P2 are strings and [C] is a character class (shorthand for the regular expression (c1 | c2 | ... | cj) over characters c1, ..., cj), and (2) string Kleene-star patterns of the form P1 (P2)* P3, where P1, P2, and P3 are strings. In case (1), we describe an index using near-linear space that solves queries in time near-linear in m on constant-sized alphabets. We also describe a general solution for any alphabet size. This result is conditioned on the existence of an anchor: a character of the pattern that does not belong to C. We justify this assumption by proving that no efficient indexing solution can exist if an anchor is not present, unless the Set Disjointness Conjecture fails. In case (2), we describe a near-linear-size index answering queries in time near-linear in m for any alphabet size.
Keywords and phrases:
Text indexing, regular expressions, data structures
Funding:
Hideo Bannai: JSPS KAKENHI Grant Number JP24K02899.
2012 ACM Subject Classification:
Theory of computation → Pattern matching
Acknowledgements:
Work initiated at Dagstuhl Seminar 24472 “Regular Expressions: Matching and Indexing.”
Editors:
Paola Bonizzoni and Veli Mäkinen
1 Introduction
A regular expression specifies a set of strings formed by characters from an alphabet combined with the concatenation (·), union (|), and Kleene star (*) operators. For instance, (b|ab)* describes the set of strings of a's and b's such that every a is followed by a b. The text indexing for regular expressions problem is to preprocess a text T to support efficient regular expression matching queries on T, that is, given a regular expression R, report all occurrences of R in T. Here, an occurrence is a substring of T that matches one of the strings belonging to the regular language of R. We also consider existential regular expression matching queries, that is, determining whether or not there is an occurrence of R in T. The goal is to obtain a compact data structure while supporting efficient queries.
Regular expressions are a fundamental concept in formal language theory, introduced by Kleene in the 1950s [24], and regular expression matching queries are a basic tool in computer science for searching and processing text. Standard tools such as grep and sed provide direct support for regular expression matching in files, and the scripting language perl [45] is a complete programming language designed to easily support regular expression matching queries. Regular expression matching appears in many large-scale data processing applications, such as internet traffic analysis [22, 46, 28], data mining [17], databases [32, 33], computational biology [37], and human-computer interaction [23]. Most of these solutions are based on efficient algorithms for the classic regular expression matching problem, where we are given both the text T and the regular expression R as input, and the goal is to report the occurrences of R in T. However, in many scenarios, the text T is available before we are given the regular expressions, and we may want to ask multiple regular expression matching queries on T. In this case, we ideally want to take advantage of preprocessing to speed up the queries, and thus the indexing version of the problem applies.
While the regular expression matching problem is a well-studied classic problem [44, 34, 5, 4, 11, 12, 2, 13, 43, 7, 16], surprisingly few results are known for the text indexing for regular expressions problem. Let n and m be the lengths of T and R, respectively. Gibney and Thankachan [19] recently showed that text indexing for regular expressions is hard to solve efficiently under popular complexity conjectures. More precisely, they showed that, conditioned on the online matrix-vector multiplication conjecture, even with arbitrary polynomial preprocessing time, we cannot answer existential queries in time, for any . They also show that, conditioned on a slightly stronger assumption, we cannot even answer existential queries in time, for any . Gibney and Thankachan also studied upper bound time-space trade-offs with exponential preprocessing. Specifically, given a parameter fixed at preprocessing, we can solve the problem using space and preprocessing time and query time.
On the other hand, a few text indexing solutions have been studied for highly restricted kinds of regular expressions or regular expression-like patterns. These include text indexing for string patterns (simple strings corresponding to regular expressions that only use concatenations) and string patterns with wildcards and gaps (strings that include special characters or sequences of special characters that match any other character) and similar extensions [14, 10, 31, 41, 6, 21, 29, 27, 8, 30, 18].
Thus, we should not hope to efficiently solve text indexing for general regular expressions, and efficient solutions are only known for highly restricted regular expressions. Hence, a natural question is whether there are simple regular expressions for which efficient solutions are possible and that form a large subset of those used in practice. This paper considers the following two such kinds of regular expressions and provides either efficient solutions or conditional lower bounds for them:
-
Character-class Kleene-star patterns. These are patterns of the form P1 [C]* P2, where P1 and P2 are strings and [C] is a character class, shorthand for the regular expression (c1 | c2 | ... | cj).
-
String Kleene-star patterns. These are patterns of the form P1 (P2)* P3, where P1, P2, and P3 are strings.
In other words, we provide solutions (or lower bounds) for all regular patterns containing only concatenations and at most one occurrence of a Kleene star (either of a string or of a character class). Using the notation introduced by the seminal paper of Backurs and Indyk [2] on the hardness of (non-indexed) regular expression matching, character-class Kleene-star patterns belong to the “concatenation of Kleene stars of unions” type, where the unions may be degenerate, i.e., consist of a single character. To see this, observe that the characters of P1 and P2 can be interpreted as degenerate unions of one character (without a Kleene star). String Kleene-star patterns, on the other hand, belong to the “concatenation of Kleene stars of concatenations” type. Again (as discussed in [2]), since any level of the regular expression tree is allowed to contain leaves (i.e., individual characters), patterns of the form P1 (P2)* P3 belong to this type by interpreting the characters of P1 and P3 as leaves in the regular expression tree. Our main results are new text indices that use near-linear space while supporting both kinds of queries in time near-linear in the length of the pattern (under certain unavoidable assumptions discussed in detail below; if the assumptions fail, we show that the problem becomes hard again). Below, we introduce our results and discuss them in the context of the results obtained in [2].
1.1 Setup and Results
We first consider text indexing for character-class Kleene-star patterns P1 [C]* P2, where [C] is a character class. We say that the pattern is anchored if either P1 or P2 has a character that is not in C, and we call such a character an anchor. If the pattern is anchored, we show the following result.
Theorem 1.
Let T be a text of length n over an alphabet Σ. Given a parameter and a constant fixed at preprocessing time, we can build a data structure that uses space and supports anchored character-class Kleene-star queries P1 [C]* P2, where [C] is a character class, in time with high probability. Here, occ is the number of occurrences of the pattern in T.
In particular, our solution supports queries in almost optimal time for constant-sized alphabets. We also extend the result of Theorem 1 to handle slightly more general character-class interval patterns of the form P1 [C]^{>=k1} P2, P1 [C]^{<=k2} P2, and P1 [C]^{[k1,k2]} P2, meaning that there are at least k1, at most k2, and between k1 and k2 copies of characters from C, respectively.
Intuitively, our strategy is to identify, for every possible starting position i and every possible set C′ ⊆ Σ, all the right-maximal substrings of T starting at i that contain only symbols in C′. Such a substring will form the “[C]*” part of the occurrences. For each such substring, we then insert into a range reporting data structure a three-dimensional point whose (lexicographically sorted) coordinates encode the context of the substring in T. The data structure is labeled by the set C′. We finally observe that the pattern can be used to query the right range data structure and report all matches of P1 [C]* P2 in T.
Conversely, we show the following conditional lower bound if the pattern is not anchored.
Theorem 2.
Let T be a text of length n over an alphabet Σ and let . Assuming the strong Set Disjointness Conjecture, any data structure that supports existential (non-anchored) character-class Kleene-star pattern matching queries P1 [C]* P2, where [C] is a character class with at least 3 characters, in time, requires space.
With , Theorem 2 implies that any near-linear-space solution must have query time . On the other hand, with , Theorem 2 implies that any solution using query time independent of must use space.
To prove Theorem 2, we reduce from the Set Disjointness Problem: preprocess a collection of sets so that we can quickly answer, for any pair of sets, whether they are disjoint. It was shown in [9] that, without loss of generality, we can assume every element appears in the same number of sets. The idea is then to define a string gadget representing each set, and a block for each element of the universe containing the string gadget of every set the element is included in. The blocks are separated by a character that does not appear in any block. This way, the intersection of two sets is non-empty if and only if their gadgets appear somewhere in the string separated only by characters that appear in a block.
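For concreteness, the reduction can be sketched in Python, using the re module as a stand-in for a (non-anchored) character-class Kleene-star query. The gadget encoding and the separator character # are illustrative choices, not the ones used in the formal proof:

```python
import re

def gadget(i):
    # hypothetical encoding of set number i; any code over characters
    # distinct from the block separator '#' works
    return f"<{i}>"

def build_text(sets, universe):
    # one block per universe element, listing the gadgets of all sets
    # containing it; blocks are separated by '#', a character that
    # appears in no block
    blocks = []
    for e in universe:
        blocks.append("".join(gadget(i) for i, S in enumerate(sets) if e in S))
    return "#".join(blocks)

def intersect(text, a, b):
    # sets a < b intersect iff gadget(a) and gadget(b) occur in the same
    # block, i.e., separated only by non-'#' characters: exactly a
    # non-anchored character-class Kleene-star query gadget(a)[^#]*gadget(b)
    pattern = re.escape(gadget(a)) + "[^#]*" + re.escape(gadget(b))
    return re.search(pattern, text) is not None
```

An existential query for the pattern thus decides disjointness of a pair of sets, so an efficient index for non-anchored patterns would violate the conjecture.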
As noted above, character-class Kleene-star patterns belong to the “concatenation of Kleene stars of unions” type. Backurs and Indyk [2] prove a quadratic lower bound for this class of regular expressions. Our result shows that even the more restricted subclass of character-class Kleene-star patterns is hard if no anchor is present.
We then consider text indexing for string Kleene-star patterns P1 (P2)* P3. We show the following result.
Theorem 3.
Let T be a text of length n over an alphabet Σ. Given a constant fixed at preprocessing time, we can build a data structure that uses space and supports string Kleene-star pattern queries P1 (P2)* P3 in time , where occ is the number of occurrences of the pattern in T.
As discussed above, string Kleene-star patterns belong to the “concatenation of Kleene stars of concatenations” type. For this type of pattern, Backurs and Indyk [2] proved a conditional lower bound (for any constant ) in the offline setting for both pattern matching and membership queries. Our result, instead, implies an offline solution running in near-linear time (by stopping after locating the first pattern occurrence) after the indexing phase. This does not contradict Backurs and Indyk’s lower bound, since our patterns are a very specific case of the broader type. Equivalently, this indicates that including more than one Kleene star makes the problem hard again and thus justifies an index for the simpler case P1 (P2)* P3.
The main idea behind the strategy for Theorem 3 is to preprocess all maximal periodic substrings (called runs) in the string, so that we can quickly find patterns ending just before or starting just after a run. However, there are some difficulties to overcome: firstly, P2 may itself be periodic, in which case we must avoid reporting spurious occurrences; secondly, a run may end with a partial occurrence of the period; and lastly, P2 may share a suffix with P1 or a prefix with P3, in which case their occurrences should overlap with the run. We show how to deal with these difficulties in Section 4.
2 Preliminaries
A string S of length n is a sequence of n characters drawn from an ordered alphabet Σ of size σ. The string S[i]S[i+1]⋯S[j], denoted S[i..j], is called a substring of S; S[1..j] and S[i..n] are called a prefix and a suffix of S, respectively. We use ε to denote the empty string (i.e., the string of length 0). The reverse string of a string S of length n, denoted by S^R, is given by S[n]S[n-1]⋯S[1]. Let P and T be strings over an alphabet Σ. We say that the range [i, j] is an occurrence of P in T iff T[i..j] = P.
Lexicographic order and Lyndon words.
The order of the alphabet defines a lexicographic order on the set of strings as follows: for two strings S1 and S2, let l be the length of the longest common prefix of S1 and S2. We have S1 < S2 if and only if either i) |S1| = l < |S2| or ii) both S1 and S2 have length at least l + 1 and S1[l+1] < S2[l+1]. A string S is a Lyndon word if it is lexicographically smaller than all of its proper cyclic shifts, i.e., S < S[i..|S|]S[1..i-1] for all 1 < i ≤ |S|.
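A minimal Python sketch of the Lyndon word definition, checking all proper cyclic shifts directly (quadratic time, for illustration only):

```python
def is_lyndon(s):
    # a nonempty string is a Lyndon word iff it is strictly smaller than
    # every proper cyclic shift of itself
    return len(s) > 0 and all(s < s[i:] + s[:i] for i in range(1, len(s)))
```

Note that a non-primitive string such as "aa" is not a Lyndon word, since it equals one of its cyclic shifts.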
Concatenation of strings.
The concatenation of two strings S1 and S2 is defined as S1 · S2 = S1S2. The concatenation of k copies of a string S is denoted by S^k, where k ≥ 0; i.e., S^0 = ε and S^k = S^{k-1}S. A string S is called primitive if there is no string U and integer k ≥ 2 such that S = U^k.
Sets of strings.
We denote by , , , and . The concatenation of a string with a set of strings is defined as . Similarly, the concatenation of two sets of strings and is defined as . We define , , , and for sets analogously. We say that the range is an occurrence of a set of strings if there is a such that is an occurrence of in .
Period of a string.
An integer p is a period of a string S of length n if and only if S[i] = S[i+p] for all 1 ≤ i ≤ n - p. A string is called periodic if it has a period p ≤ n/2. The smallest period of S will be called the period of S.
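The smallest period can be computed in linear time from the KMP failure function, since the smallest period of S equals |S| minus the length of the longest proper border of S. A short Python sketch:

```python
def period(s):
    # smallest period via the KMP failure function: fail[i] is the length
    # of the longest proper border of s[:i+1]
    n = len(s)
    fail = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return n - fail[-1] if n > 0 else 0
```

For example, "abcabcab" has longest border "abcab" and hence period 3; it is periodic since 3 ≤ 8/2.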
Tries and suffix trees.
A trie for a collection of strings S1, ..., Sr is a rooted labeled tree such that: (1) the label on each edge is a character of Σ; (2) each string in the collection is represented by a path going from the root down to some node (obtained by concatenating the labels on the edges of the path); (3) each root-to-leaf path represents a string from the collection; (4) common prefixes of two strings share the same path maximally. A compact trie is obtained by dissolving all nodes except the root, the branching nodes, and the leaves, and concatenating the labels on the edges incident to dissolved nodes to obtain string labels for the remaining edges.
The suffix tree of a string S over an alphabet Σ is the compact trie of the set of all suffixes of S. Throughout this paper, we assume that nodes in a compact trie or the suffix tree use deterministic dictionaries to store their children.
3 Character-class Kleene-star Patterns
In this section we give our data structure for answering anchored character-class Kleene-star pattern queries P1 [C]* P2. Without loss of generality, we can assume that the anchor belongs to P2 (the other case is captured by building our structures on the reversed text and querying the reversed pattern).
Recall that we assume for some parameter fixed at construction time. We first describe a solution for the case , and then in Section 3.3 show how to handle the case where .
Our general strategy is to identify all the right-maximal substrings of T, for every possible starting position i, that contain all and only the symbols of C (we later generalize the solution to consider all the possible subsets of C). Such a substring forms the “[C]*” part of the occurrences. For this sake, the substring must be preceded by P1 and followed by P2. However, if P2 starts with some symbols in C, those symbols will belong to the right-maximal substring. We therefore split P2 into its longest prefix containing only symbols from C, followed by a remainder that starts with the anchor. The new condition is then that the right-maximal substring ends with this prefix and is followed by the remainder. See Figure 1.
We need the following definitions.
Definition 4.
The C-prefix of P2 is the longest prefix of P2 that is formed only by symbols in C. We define the remainder of P2 as the suffix such that P2 is the concatenation of its C-prefix and this remainder.
Definition 5.
The k-run of T that starts at position i is the maximal range [i, j] such that T[i..j] contains exactly k distinct symbols. If the suffix T[i..n] has fewer than k distinct symbols, then there is no k-run starting at i. We call C_i the set of symbols that occur in the k-run that starts at position i.
Note that T contains at most n k-runs, each starting at a distinct position i.
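Since the right endpoint of a k-run can only grow as the starting position increases, all k-runs can be computed with a standard two-pointer scan. A Python sketch (0-indexed, returning the set C_i with each run; illustrative, not part of the formal construction):

```python
from collections import Counter

def k_runs(T, k):
    # for each start i, report the k-run (i, j, C_i): the maximal range
    # [i, j] such that T[i..j] contains exactly k distinct symbols
    runs, counts, j = [], Counter(), 0
    for i in range(len(T)):
        # extend the right endpoint while it does not exceed k distinct symbols
        while j < len(T) and (T[j] in counts or len(counts) < k):
            counts[T[j]] += 1
            j += 1
        if len(counts) == k:
            runs.append((i, j - 1, frozenset(counts)))
        # drop T[i] before moving the start to i + 1
        counts[T[i]] -= 1
        if counts[T[i]] == 0:
            del counts[T[i]]
    return runs
```

Each text position is added to and removed from the window once, so the scan runs in linear time for a fixed k.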
We first show how to find occurrences matching all symbols of C in the “[C]*” part of the pattern. Then, we complete this solution by allowing matches with any subset of C.
3.1 Matching all Characters of
We show how to build a data structure for the case where the set C is known at construction time, and we only find the occurrences that match exactly all distinct letters of C in the “[C]*” part of the occurrence. Recall that we also assume that P2 contains an anchor.
Data structure.
Let be the set of subsets of size that occur as a -run in . Our data structure consists of the following:
-
The suffix tree of and the suffix tree of the reversed text, .
-
A data structure for each set indexing all the text positions . The structure consists of an orthogonal range reporting data structure for a four-dimensional grid in with points, one per -run with . For each such -run we store a point with coordinates , where:
–
the first coordinate is the lexicographic rank of among all the reversed prefixes of ;
–
the second coordinate is the lexicographic rank of among all the reversed prefixes of ;
–
the third coordinate is the lexicographic rank of among all the suffixes of .
Each point stores the limits of its -run (so as to report occurrence positions).
-
A trie storing all the strings of length formed by sorting in increasing order the characters of , for every .
Note that the fourth coordinate of point could be avoided (i.e. using a 3D range reporting data structure) by defining to be the lexicographic rank of (where is a special terminator character) in the set formed by all the reversed prefixes of and strings of the form , for all -runs . While this solution would work in the same asymptotic space and query time (because we will only need one-sided queries on the fourth coordinate), we will need the fourth dimension in Subsection 3.4.
Basic search.
At query time, we first compute the C-prefix of P2 and its remainder. For any occurrence of the query pattern, the “[C]*” part extended with the C-prefix will necessarily be a suffix of a k-run. This is why we need P2 to contain an anchor; P1 is not restricted because we index every possible initial position i.
We then sort the symbols of C and use the trie to find the corresponding data structure.
We now find the lexicographic range using the suffix tree of and the suffix tree of the reversed text, . The range then corresponds to the leaf range of the locus of in , the range to the leaf range of the locus of in , and the range to the leaf range of the locus of in .
Once the four-dimensional range is identified, we extract all the points from in the range using the range reporting data structure.
Time and space.
The suffix trees use space . The total number of points in the range reporting data structures is as there are at most -runs. Because we will perform one-sided searches on the fourth coordinate, the grid of can be represented in space, for any constant , so that range searches on it take time to report the points in the range [38, Thm. 7]. Thus, the total space for the range reporting data structures is . The space of the trie is .
The string can easily be computed in time with high probability using a dictionary data structure [15]. Sorting can be done in time [1]. By implementing the pointers of node children in and in the suffix trees using perfect hashing (see [36]), the search in takes worst-case time and the three searches in and take total time . The range reporting query takes time . In total, a query takes time with high probability. (Unfortunately, [38, Thm. 7] does not describe the construction of the range reporting data structure that we use, so we are not able to provide the construction time and working space of our index.)
3.2 Matching any Subset of
We now show how to find all occurrences of , that is, also the ones containing only a subset of the characters of in the part of the occurrence.
Our previous search will not capture the runs containing only characters from a proper subset of C, as we only find occurrences surrounding the runs containing all characters from C. To solve this, we will build an orthogonal range reporting data structure for every subset. To capture all the occurrences, we search the corresponding grids of all the nonempty subsets of C, which leads to a cost exponential in the number of distinct characters of C. We wish to avoid, however, the cost of searching the suffix trees for every subset of C. In the following we show how to do this.
Data Structure.
Let . Our data structure consists of the following.
-
The suffix tree of and the suffix tree of the reversed text, .
-
The data structure from Section 3.1 for each set .
-
A trie storing all the strings of length 1 to , in increasing alphabetic order of characters, that correspond to some .
The suffix trees use linear space. The space for each of the range reporting data structures is . Added over all , the total space becomes . The space for the trie is , since there are at most strings, each of length at most . Since we assume , the total space is .
Search.
To perform the search, we traverse the trie to find all the subsets of C as follows. Let Q be the string formed by concatenating all symbols of C in sorted order. Letting N_i be the set of trie nodes reached after processing Q[1..i] (initially, N_0 contains only the root), N_{i+1} is obtained by inserting into N_i the nodes reached by following the edges labeled with character Q[i+1] from nodes in N_i. In other words, for each symbol of Q we try both skipping it and descending by it in the trie. The last set contains all the nodes of the trie corresponding to subsets of C. Each time we reach a node corresponding to some set which has an associated range reporting data structure, we perform a range reporting query on it.
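The traversal can be sketched as follows in Python. The Node class and the helper names are ours, and plain dictionaries play the role of the deterministic child dictionaries assumed in the preliminaries:

```python
class Node:
    def __init__(self):
        self.children = {}   # child edges, keyed by character
        self.subset = None   # set label if this node stores a subset string

def build_trie(subset_strings):
    root = Node()
    for s in subset_strings:   # each s lists a stored subset in sorted order
        node = root
        for c in s:
            node = node.children.setdefault(c, Node())
        node.subset = s
    return root

def stored_subsets(root, sorted_chars):
    # frontier N_i: trie nodes reachable by subsequences of the first i
    # characters; for each character we either skip it or descend by it
    frontier = {root}
    for c in sorted_chars:
        frontier |= {n.children[c] for n in frontier if c in n.children}
    return sorted(n.subset for n in frontier if n.subset is not None)
```

The final frontier contains exactly the trie nodes whose strings are subsequences of the sorted query characters, i.e., the stored subsets of C.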
Since the range is the same for all queries, we only compute this once. This is done by a search for in . The intervals and , on the other hand, change during the search, as the split of into and depends on the subset .
To compute these intervals we first preprocess as follows. Compute the ranges for all reversed prefixes of using the suffix tree : Start by looking up the locus for and then find the remaining ones by following suffix links.
Similarly, we compute the ranges for the suffixes of following suffix links in .
If we know the length of , we can then easily look up the corresponding intervals.
Maintaining the split of P2. We now explain how to maintain the length of the prefix of P2 matching the current subset in constant time for every trie node we meet during the traversal. The difficulty is that, when traversing the trie, we add characters to the current subset in lexicographic order and not in the order they occur in P2 (see Figure 2).
First we compute, for each character c of C, the position of the first occurrence of c in P2; if c does not occur in P2, we set this position to infinity. For each character, we furthermore compute its position rank, i.e., its rank among the characters sorted by first occurrence. We build:
-
a dictionary saving the position rank of each element .
-
an array containing the position of the first occurrence of the characters in in rank order, i.e., for each character , .
Let be the first character in position rank order that is not in . Then . The main idea is to maintain the intervals of characters in in position rank order. The position rank of can then easily be computed from the set of intervals and used to compute . Let be the first interval in in sorted order. If then otherwise, . We use an array to store the intervals of . We will maintain the invariant that if and only if the element with position rank is in . Furthermore, we will maintain the invariant that the first, respectively last, position of an interval of nonzero entries in contains the position of the end, respectively start, of the interval. Initially, all positions in are .
We proceed as follows. Initialize and initialize an empty stack . We now maintain as follows:
When we go down during the traversal adding a character to the set, we first lookup in and set . If there are no changes. Otherwise, let , i.e., is the set we had before inserting . We set and compute the leftmost position of the interval in containing : If then set . Otherwise, there is an interval in ending at position that must be merged with , and contains the left endpoint of this interval. Therefore we set . To compute the rightmost position of the interval in containing : If then set . Otherwise, there is another interval starting at position and we set to be the end of this interval, i.e., . We then push onto the stack to be able to quickly undo the operations later. Then we update by setting and . Finally, we update : If set . Otherwise, does not change.
When going up in the traversal removing character we first lookup . If there are no changes. Otherwise, we pop from the stack and set , , , and .
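The bookkeeping described above can be sketched as follows. This is an illustrative Python implementation of the array of intervals with LIFO undo (ranks are 1-based; first_missing plays the role of finding the first position rank not in the set, from which the split length is derived):

```python
def make_interval_tracker(m):
    # I[r] != 0 iff rank r is in the set; the first and last positions of
    # each maximal interval of nonzero entries store the interval's other
    # endpoint, so an insert merges neighbouring intervals in O(1)
    I = [0] * (m + 2)   # ranks 1..m, with zero sentinels at 0 and m+1
    stack = []          # saved entries, so going up the trie can undo

    def insert(r):
        left = I[r - 1] or r    # start of the merged interval
        right = I[r + 1] or r   # end of the merged interval
        stack.append((r, left, right, I[left], I[right]))
        I[r] = r                # mark r present (endpoints set below)
        I[left], I[right] = right, left

    def undo():
        r, left, right, old_l, old_r = stack.pop()
        I[left], I[right] = old_l, old_r
        I[r] = 0

    def first_missing():
        # smallest rank not in the set: 1 if rank 1 is absent, otherwise
        # one past the end of the interval starting at rank 1
        return 1 if I[1] == 0 else I[1] + 1

    return insert, undo, first_missing
```

Each insert saves the two overwritten endpoint entries on the stack, so going back up the traversal restores the previous interval structure in constant time.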
Figure 2: An example string and its characters listed in position rank order.
Time.
It takes time to search for in . Computing and for all splits of takes time . Sorting can be done in time [1]. Computing for all characters in , sorting them, computing the ranks , and constructing the arrays and and the dictionary takes linear time in the pattern length with high probability. The size of the subtrie we visit in the search is and in each step we use constant time to compute the length of . The total time for the range queries is . Thus, in total we use time with high probability.
3.3 Solution for
In the discussion above, we assumed that . If , we build the data structure described above by replacing with . The space of the data structure is still . At query time, if we use the data structure to answer queries in time.
If, on the other hand, then . We first find all occurrences of P1 and P2 using the suffix tree. Let L1 be the end positions of the occurrences of P1 and let L2 be the start positions of the occurrences of P2. We sort the lists L1 and L2. This can all be done in time and linear space using radix sort. We also mark with a 1, in a bitvector B of length n, all text positions containing a character from C. This can be done in time with high probability, with a simple scan of T and a dictionary over C [15]. We build a data structure over the bitvector supporting rank queries in constant time [42]. We can now find all occurrences of the pattern by considering the occurrences in sorted order in a merge-like fashion. Recall that P2 has an anchor. We consider the first occurrence in the list L1 and find the first occurrence in L2 that comes after it. If all characters between them are from C (constant time with two rank operations on the bitvector B), we output the occurrence. We delete the occurrence from the list and continue in the same way. In total, we find all occurrences in time with high probability. In summary, this proves Theorem 1.
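The rank-based check that all characters between two positions belong to C can be sketched with prefix counts, a simple stand-in for the constant-time rank structure of [42]:

```python
def build_prefix_counts(T, C):
    # prefix[i] = number of positions among T[0..i-1] whose character is
    # in C; this emulates rank queries on the bitvector B described above
    prefix = [0]
    for ch in T:
        prefix.append(prefix[-1] + (ch in C))
    return prefix

def all_in_class(prefix, i, j):
    # True iff every character of T[i..j] (inclusive, 0-indexed) is in C:
    # two "rank" lookups, i.e., constant time per check
    return prefix[j + 1] - prefix[i] == j - i + 1
```

A succinct rank structure achieves the same constant-time query within o(n) extra bits instead of an integer array.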
3.4 Character-Class Interval Patterns
We extend our solution to handle patterns of the form P1 [C]^{>=k1} P2, P1 [C]^{<=k2} P2, and P1 [C]^{[k1,k2]} P2, meaning that there are at least k1, at most k2, and between k1 and k2 copies of characters from C, respectively. We collectively call these character-class interval patterns.
By using one-sided restrictions on the fourth dimension, we can easily handle queries of the form P1 [C]^{>=k1} P2 in our solution from the previous section. Handling queries of the form P1 [C]^{<=k2} P2 or P1 [C]^{[k1,k2]} P2 requires a two-sided restriction on the fourth dimension. This raises the space of the grid to , while retaining its query time [38, Thm. 7] [39]. With these observations we obtain the following results.
Theorem 6.
Let be a text of length over an alphabet . Given a parameter and a constant fixed at preprocessing time, we can build a data structure that uses space and supports anchored character-class interval queries of the form in time , where is a character class with characters, , and is the number of occurrences of the pattern in .
Theorem 7.
Let be a text of length over an alphabet . Given a parameter and a constant fixed at preprocessing time, we can build a data structure that uses space and supports anchored character-class interval queries of the form or in time , where is a character class with characters, , and is the number of occurrences of the pattern in .
An alternative solution, when longer matches are more interesting than shorter ones, is to store the points in a three-dimensional grid and use the run lengths as the point weights. Three-dimensional grids on weighted points can use space and report points from larger to smaller weight in time [35, Lem. A.5]. We can use this to report the occurrences from longer to shorter k-runs, thereby stopping when the length drops below the threshold. We insert the first answer of each of the grids into a priority queue, where the priority is the length of the matched k-run, then extract the longest answer and replace it by the next point from the same grid, repeating until returning all the desired answers. The time per returned element now includes an extra factor if we implement the priority queue with a dynamic predecessor search data structure, plus the cost of the initial insertions. We can also return only the longest answers in this case, within a total time of .
4 String Kleene-star Patterns
In this section we give our data structure for supporting string Kleene-star pattern queries.
As an intermediate step, we first create a structure that, given strings x and z, a primitive string y, and numbers a and b with a ≤ b, where x and y do not share a suffix and y and z do not share a prefix, finds all occurrences in T of patterns of the form x y^i z, where a ≤ i ≤ b. Later we will show that this is sufficient to find occurrences of P1 (P2)* P3. For now, we assume that x and z are not the empty string; we will handle these cases later. We will also assume that y is not the empty string: in our transformation from P1 (P2)* P3 to this form, y will be empty if and only if P2 is empty, in which case the problem reduces to matching a plain string in the suffix tree.
To define our data structures, we need the notion of a run (or maximal repetition) in .
Definition 8.
A run of T is a periodic substring T[i..j] such that its period cannot be extended to the left or the right. That is, if the smallest period of T[i..j] is p, then T[i-1] ≠ T[i-1+p] and T[j+1] ≠ T[j+1-p] (whenever these positions exist). We can write T[i..j] = y^e y′, where |y| = p, e ≥ 2, and y′ is a proper prefix of y. We also call T[i..j] a run of y. The Lyndon root of a run of y is the cyclic shift of y that is a Lyndon word.
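For illustration, runs can be enumerated naively by testing every candidate period and extending it maximally in both directions; this quadratic-time Python sketch is not the linear-time runs algorithm used by the index:

```python
def find_runs(T):
    # all runs (i, j, p): maximal periodic substrings T[i..j] whose
    # smallest period is p and whose length is at least 2p
    def smallest_period(s):
        k, fail = 0, [0] * len(s)
        for i in range(1, len(s)):
            while k > 0 and s[i] != s[k]:
                k = fail[k - 1]
            if s[i] == s[k]:
                k += 1
            fail[i] = k
        return len(s) - fail[-1]

    n, found = len(T), set()
    for p in range(1, n // 2 + 1):
        for i in range(n - 2 * p + 1):
            if T[i:i + p] != T[i + p:i + 2 * p]:
                continue
            # extend the periodicity maximally to the right and left
            j = i + 2 * p
            while j < n and T[j] == T[j - p]:
                j += 1
            s = i
            while s > 0 and T[s - 1] == T[s - 1 + p]:
                s -= 1
            if smallest_period(T[s:j]) == p:   # keep only the smallest period
                found.add((s, j - 1, p))
    return sorted(found)
```

For example, "aabaab" contains three runs: the two squares "aa" and the whole string with period 3, matching the runs theorem's linear bound.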
Our general strategy is to preprocess all runs into a data structure, such that we can quickly determine the runs preceded by and followed by , which additionally end on and have a length that matches the query.
Data structure.
Let with be a run in . For each we insert a point in a three-dimensional grid where . Each point stores the positions and has coordinates defined as follows:
-
the first coordinate is the lexicographic rank of among all the reversed prefixes of ;
-
the second coordinate is the lexicographic rank of among all the suffixes of ;
-
the third coordinate is .
Furthermore, we construct a compact trie of the strings of all runs and a lookup table such that given and we can find . Finally, we store the suffix tree of and the suffix tree of the reversed text .
By the runs theorem, the sum of exponents of all runs in T is O(n) [26, 3]; hence the total number of grids and points is O(n). Let be the number of points in the grid . We store in the orthogonal range reporting data structure [39] using space, so that 5-sided searches on it take time , for any constant , to report the points in the range. Hence, our structure uses space in total.
Query.
To answer a query as above, we find the query ranges using the suffix trees and . The ranges and correspond to the leaf ranges of the loci of in and in , respectively. Finally, we find all occurrences of with as the points in inside the 5-sided query .
The ranges in and can be found in time if the suffix tree nodes use deterministic dictionaries to store their children (see [36]). Again, we augment each suffix tree node with the lexicographic range of the suffixes represented by the leaves below . We then do a single query to the range data structure , which reports points in time. We have proven the following:
Lemma 9.
Given a text T over alphabet Σ, we can build a data structure that uses space and can answer the following queries: given two non-empty strings x and z, a primitive string y, and numbers a and b with a ≤ b, where x and y do not share a suffix and y and z do not share a prefix, find all occurrences in T of patterns of the form x y^i z, where a ≤ i ≤ b. The query time is , where occ is the number of occurrences.
Transforming P1 (P2)* P3 into the form x y^i z.
Given we compute the strings , and and the numbers , , , and as follows: The string is , where is the length of the longest common suffix of and . Let and . We compute and such that and is maximal (this can be done in time, e.g., using KMP [25]). By definition of and , we have that . Therefore, and do not share a suffix.
Let be the length of the longest common prefix of and . We define as and . Note that by definition of , and do not share a prefix. Finally, we let and . See Figure 3.
The transformation can be done in time: The longest common suffix of and can be computed in time and the longest common prefix of and in time. Further, as mentioned, the period of can be found in time. Other than that, the transformation consists of modulo calculations and cyclic shifts, which clearly can be done in linear time.
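The linear-time ingredients of this transformation can be sketched as follows (illustrative helper names; the smallest period is obtained from the classical border/period duality underlying KMP [25]):

```python
def failure(s):
    """KMP failure function: f[i] is the length of the longest
    proper border (prefix that is also a suffix) of s[:i+1]."""
    f = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k and s[i] != s[k]:
            k = f[k - 1]
        if s[i] == s[k]:
            k += 1
        f[i] = k
    return f

def smallest_period(s):
    """Smallest period of s: |s| minus its longest border."""
    return len(s) - failure(s)[-1]

def lcp(a, b):
    """Length of the longest common prefix of a and b; the
    longest common suffix is lcp(a[::-1], b[::-1])."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return i
```

All three helpers run in time linear in their input, matching the linear-time bound claimed for the transformation.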
4.1 When one of and is the Empty String
In the transformation above, it might happen that or or both are empty, in which case the data structure from Lemma 9 cannot be used. We give additional data structures to handle these cases in this and the next subsection. Let us first consider the case where and . The general idea is that to answer a query , , where and do not share a suffix, we need to find all occurrences of followed by a long enough run of . Note that each one of these occurrences can contain multiple occurrences of our pattern, for different choices of .
Data structure.
Let with be a run in . For each such run, we insert a point into a two-dimensional grid . Each point stores the positions and of the occurrence of the run. The coordinates of the point in are defined as follows:
-
is the lexicographic rank of among all reversed prefixes of .
-
.
In terms of space complexity, as before, by the runs theorem, the sum of exponents of all runs in is [26, 3]. Thus, the total number of points in is . Further, we store a compact trie of all ’s together with a dictionary for finding and using linear space. The two-dimensional points can be processed into a data structure allowing -sided range queries in linear space and running time [40], where is the number of reported points.
Query.
To answer a query , as before, we find the lexicographic range for using the suffix tree . Then, we query the grid for . For a point with obtained this way, we report for all such that and , which is equivalent to .
Querying the grid reports points in time, and each reported point gives at least one occurrence. The additional occurrences can be found in constant time per occurrence. Thus, the total query time is .
We can deal with the case where analogously, by building the same structure on and reversing the pattern.
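The constant-time-per-occurrence enumeration inside a reported run can be sketched as follows (the interface is an assumption made for illustration: the run contains `copies` aligned full repetitions of the period, starting at text position `run_start`):

```python
def occurrences_in_run(run_start, copies, period_len, lo, hi):
    """Enumerate (position, k) for all occurrences of the k-fold
    repetition of the period, lo <= k <= hi, inside a run with
    `copies` aligned full copies of the period. Each occurrence
    is emitted in constant time."""
    out = []
    for k in range(lo, min(hi, copies) + 1):
        # the k-fold repetition can start at any of the first
        # copies - k + 1 aligned positions of the run
        for m in range(copies - k + 1):
            out.append((run_start + m * period_len, k))
    return out
```

For instance, a run with three copies of a length-2 period starting at position 5 contains two occurrences with exponent 2 and one with exponent 3.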
4.2 When both and are the Empty String
If both and are the empty string, then we cannot “anchor” our occurrences at the start of a run; that is, may occur in runs whose period is a shift of . To deal with this, we characterize all runs by their Lyndon root, and write as a query of the form , where is a Lyndon word. In the following, we show how to answer these kinds of queries.
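The Lyndon root of a run is the lexicographically least rotation of its (primitive) period; it can be computed with the following simple sketch (quadratic in the worst case; Booth's least-rotation algorithm achieves linear time):

```python
def lyndon_root(c):
    """Lexicographically least rotation of the primitive string c;
    this is the Lyndon root shared by all runs whose period is a
    rotation of c."""
    d = c + c  # every rotation of c is a length-|c| window of c+c
    best = 0
    for i in range(1, len(c)):
        if d[i:i + len(c)] < d[best:best + len(c)]:
            best = i
    return d[best:best + len(c)]
```

For example, runs with periods "ab" and "ba" share the Lyndon root "ab", which is what lets us canonicalize a query whose period may be shifted.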
We create a structure that, given a primitive string that is a Lyndon word and numbers , , , , and , finds all occurrences of patterns of the form in , where and .
Data structure.
For a run with in , let be the Lyndon root of the run, and let , and be such that . We build a three-dimensional grid . For each run, we store and the point . We store in a linear space data structure which supports five-sided range queries in time , where is the number of reported points, given in [39]. By the runs theorem, the total number of points in all s is bounded by , and thus so is the space of our data structure.
Query.
Assume we are given a query . In the following, we again have to find runs of which are long enough, but with an extra caveat: we need to treat the runs differently depending on i) whether and ii) whether , since, depending on those, the leftmost and rightmost occurrences in the run have different positions. This gives us four cases to investigate.
-
1.
We find all points in . For each such point, we output the following occurrences: , where and .
-
2.
We find all points in . For each such point, we output all occurrences of the form , where and .
-
3.
We find all points in and output the occurrences of the form , where and .
-
4.
We find all points in and output all occurrences of the form , where and .
Each range query uses time, where is the number of reported points, and each reported point gives at least one occurrence. Additional occurrences within the same run can be found in constant time per occurrence. Thus, the total time is .
In summary, we have proved Theorem 3.
5 Conditional Lower Bound for Character-class Kleene-star Patterns without an Anchor
We now prove Theorem 2. The conditional lower bound is based on the Strong Set Disjointness Conjecture formulated in [20] and stated in the following.
Definition 10 (The Set Disjointness Problem).
In the Set Disjointness problem, the goal is to preprocess sets of elements from a universe into a data structure, to answer the following kind of query: For a pair of sets and , is empty or not?
Conjecture 11 (The Strong Set Disjointness Conjecture).
For an instance satisfying , any solution to the Set Disjointness problem answering queries in time must use space.
The lower bound example in [9], Section 5.2, specifically shows that, assuming Conjecture 11, indexing to solve queries of the form requires space, assuming one desires to answer queries in time, for any . The alphabet size in their lower bound example is 3. To extend this lower bound to queries of the form , we have to slightly adapt this lower bound and increase the alphabet size to 4 ( will equal 3 in the example).
When reducing from Set Disjointness, as a first step, [9] shows that we can assume that every universe element appears in the same number of sets (Lemma 6 in [9]). Call this number . Then, they construct a string of length from alphabet as follows: For each element , they build a gadget consisting of the concatenation of the binary encodings of the sets is contained in, each encoding followed by a . Such a gadget has length . To each gadget, they append a block of many , and then concatenate the resulting strings of length in an arbitrary order.
We adapt this reduction as follows: the gadgets are defined in the same way as before, only each gadget is followed by a symbol , where , instead of a block . The rest of the construction is the same. Now, to answer a query to the Set Disjointness problem, we set to the binary encoding of , to the binary encoding of , and . This query finds an occurrence if and only if there is a gadget, corresponding to an element , that contains both the encoding of and the encoding of , which means that is contained in both and . The rest of the proof proceeds as in [9].
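The construction and its query can be sketched as follows (the separator '#', the terminator 'c', and the encoding width are illustrative assumptions; the regular-expression search merely mirrors the character-class Kleene-star query and is not the paper's index):

```python
import re

def build_text(sets, k):
    """Concatenate one gadget per universe element: the binary
    encodings of the indices of the sets containing the element,
    each followed by '#', terminated by 'c'. Assumes every
    element appears in exactly k sets (Lemma 6 in [9])."""
    u = max((max(s) for s in sets if s), default=-1) + 1
    w = max(1, (len(sets) - 1).bit_length())  # bits per set index
    gadgets = []
    for x in range(u):
        idx = [i for i, s in enumerate(sets) if x in s]
        g = "".join(format(i, f"0{w}b") + "#" for i in idx)
        gadgets.append(g + "c" * (k * (w + 1) - len(g) + 1))
    return "".join(gadgets)

def disjoint(text, i, j, w):
    """Answer a Set Disjointness query (i != j) by searching for
    enc(i)# [01#]* enc(j)# (in either order): since 'c' is not in
    the character class, a match stays inside one gadget and thus
    witnesses an element contained in both sets."""
    pi = re.escape(format(i, f"0{w}b") + "#")
    pj = re.escape(format(j, f"0{w}b") + "#")
    pat = f"{pi}[01#]*{pj}|{pj}[01#]*{pi}"
    return re.search(pat, text) is None
```

For example, with sets [{0}, {1}, {0}, {1}] (every element in exactly two sets), the pair of sets 0 and 1 is disjoint, while sets 0 and 2 share the element 0.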
References
- [1] Arne Andersson, Torben Hagerup, Stefan Nilsson, and Rajeev Raman. Sorting in linear time? In Proc. 27th STOC, pages 427–436, 1995. doi:10.1145/225058.225173.
- [2] Arturs Backurs and Piotr Indyk. Which regular expression patterns are hard to match? In Proc. 57th FOCS, pages 457–466, 2016. doi:10.1109/FOCS.2016.56.
- [3] Hideo Bannai, Tomohiro I, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. The "runs" theorem. SIAM J. Comput., 46(5):1501–1514, 2017. doi:10.1137/15M1011032.
- [4] Philip Bille. New algorithms for regular expression matching. In Proc. 33rd ICALP, pages 643–654, 2006. doi:10.1007/11786986_56.
- [5] Philip Bille and Martin Farach-Colton. Fast and compact regular expression matching. Theoret. Comput. Sci., 409:486–496, 2008. doi:10.1016/J.TCS.2008.08.042.
- [6] Philip Bille and Inge Li Gørtz. Substring range reporting. Algorithmica, 69:384–396, 2014. doi:10.1007/S00453-012-9733-4.
- [7] Philip Bille and Inge Li Gørtz. Sparse regular expression matching. In Proc. 35th SODA, pages 3354–3375, 2024. doi:10.1137/1.9781611977912.120.
- [8] Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, and Teresa Anna Steiner. Gapped indexing for consecutive occurrences. Algorithmica, 85(4):879–901, 2023. doi:10.1007/S00453-022-01051-6.
- [9] Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, and Teresa Anna Steiner. Gapped indexing for consecutive occurrences. Algorithmica, 85(4):879–901, 2023. doi:10.1007/S00453-022-01051-6.
- [10] Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind. String indexing for patterns with wildcards. Theory Comput. Syst., 55(1):41–60, 2014. doi:10.1007/S00224-013-9498-4.
- [11] Philip Bille and Mikkel Thorup. Faster regular expression matching. In Proc. 36th ICALP, pages 171–182, 2009. doi:10.1007/978-3-642-02927-1_16.
- [12] Philip Bille and Mikkel Thorup. Regular expression matching with multi-strings and intervals. In Proc. 21st SODA, 2010.
- [13] Karl Bringmann, Allan Grønlund, and Kasper Green Larsen. A dichotomy for regular expression membership testing. In Proc. 58th FOCS, pages 307–318, 2017. doi:10.1109/FOCS.2017.36.
- [14] Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don’t cares. In Proc. 36th STOC, pages 91–100, 2004. doi:10.1145/1007352.1007374.
- [15] Martin Dietzfelbinger and Friedhelm Meyer auf der Heide. A new universal class of hash functions and dynamic hashing in real time. In Proc. 17th ICALP, pages 6–19, 1990. doi:10.1007/BFB0032018.
- [16] Bartłomiej Dudek, Paweł Gawrychowski, Garance Gourdel, and Tatiana Starikovskaya. Streaming regular expression membership and pattern matching. In Proc. 33rd SODA, pages 670–694, 2022.
- [17] Minos N Garofalakis, Rajeev Rastogi, and Kyuseok Shim. SPIRIT: Sequential pattern mining with regular expression constraints. In Proc. 25th VLDB, pages 223–234, 1999. URL: http://www.vldb.org/conf/1999/P22.pdf.
- [18] Daniel Gibney. An efficient elastic-degenerate text index? not likely. In Proc. 27th SPIRE, pages 76–88, 2020. doi:10.1007/978-3-030-59212-7_6.
- [19] Daniel Gibney and Sharma V. Thankachan. Text indexing for regular expression matching. Algorithms, 14(5), 2021. doi:10.3390/a14050133.
- [20] Isaac Goldstein, Tsvi Kopelowitz, Moshe Lewenstein, and Ely Porat. Conditional lower bounds for space/time tradeoffs. In Proc. 15th WADS, pages 421–436, 2017. doi:10.1007/978-3-319-62127-2_36.
- [21] Costas S. Iliopoulos and M. Sohel Rahman. Indexing factors with gaps. Algorithmica, 55(1):60–70, 2009. doi:10.1007/S00453-007-9141-3.
- [22] Theodore Johnson, S. Muthukrishnan, and Irina Rozenbaum. Monitoring regular expressions on out-of-order streams. In Proc. 23rd ICDE, pages 1315–1319, 2007. doi:10.1109/ICDE.2007.369001.
- [23] Kenrick Kin, Björn Hartmann, Tony DeRose, and Maneesh Agrawala. Proton: multitouch gestures as regular expressions. In Proc. SIGCHI, pages 2885–2894, 2012. doi:10.1145/2207676.2208694.
- [24] S. C. Kleene. Representation of events in nerve nets and finite automata. In C. E. Shannon and J. McCarthy, editors, Automata Studies, Ann. Math. Stud. No. 34, pages 3–41. Princeton U. Press, 1956.
- [25] Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350, 1977. doi:10.1137/0206024.
- [26] Roman M. Kolpakov and Gregory Kucherov. Finding maximal repetitions in a word in linear time. In Proc. 40th FOCS, pages 596–604, 1999. doi:10.1109/SFFCS.1999.814634.
- [27] Tsvi Kopelowitz and Robert Krauthgamer. Color-distance oracles and snippets. In Proc. 27th CPM, pages 24:1–24:10, 2016. doi:10.4230/LIPICS.CPM.2016.24.
- [28] Sailesh Kumar, Sarang Dharmapurikar, Fang Yu, Patrick Crowley, and Jonathan Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Proc. SIGCOMM, pages 339–350, 2006. doi:10.1145/1159913.1159952.
- [29] Moshe Lewenstein. Indexing with gaps. In Proc. 18th SPIRE, pages 135–143, 2011. doi:10.1007/978-3-642-24583-1_14.
- [30] Moshe Lewenstein, J. Ian Munro, Venkatesh Raman, and Sharma V. Thankachan. Less space: Indexing for queries with wildcards. Theor. Comput. Sci., 557:120–127, 2014. doi:10.1016/J.TCS.2014.09.003.
- [31] Moshe Lewenstein, Yakov Nekrich, and Jeffrey Scott Vitter. Space-efficient string indexing for wildcard pattern matching. In Proc. 31st STACS, pages 506–517, 2014. doi:10.4230/LIPICS.STACS.2014.506.
- [32] Quanzhong Li and Bongki Moon. Indexing and querying XML data for regular path expressions. In Proc. 27th VLDB, pages 361–370, 2001. URL: http://www.vldb.org/conf/2001/P361.pdf.
- [33] Makoto Murata. Extended path expressions of XML. In Proc. 20th PODS, pages 126–137, 2001.
- [34] E. W. Myers. A four-Russians algorithm for regular expression pattern matching. J. ACM, 39(2):430–448, 1992. doi:10.1145/128749.128755.
- [35] G. Navarro and Y. Nekrich. Top- document retrieval in compressed space. In Proc. 36th SODA, pages 4009–4030, 2025.
- [36] Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Comput. Surv., 39(1):2–es, 2007. doi:10.1145/1216370.1216372.
- [37] Gonzalo Navarro and Mathieu Raffinot. Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. J. Comput. Bio., 10(6):903–923, 2003. doi:10.1089/106652703322756140.
- [38] Yakov Nekrich. New data structures for orthogonal range reporting and range minima queries. arXiv preprint arXiv:2007.11094, 2020. arXiv:2007.11094.
- [39] Yakov Nekrich. New data structures for orthogonal range reporting and range minima queries. In Proc. 32nd SODA, pages 1191–1205, 2021. doi:10.1137/1.9781611976465.73.
- [40] Yakov Nekrich and Gonzalo Navarro. Sorted range reporting. In Proc. 13th SWAT, pages 271–282, 2012. doi:10.1007/978-3-642-31155-0_24.
- [41] Pierre Peterlongo, Julien Allali, and Marie-France Sagot. Indexing gapped-factors using a tree. Int. J. Found. Comput. Sci., 19(1):71–87, 2008. doi:10.1142/S0129054108005541.
- [42] Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. 13th SODA, pages 233–242, 2002. URL: http://dl.acm.org/citation.cfm?id=545381.545411.
- [43] Philipp Schepper. Fine-grained complexity of regular expression pattern matching and membership. In Proc. 28th ESA, 2020.
- [44] K. Thompson. Regular expression search algorithm. Commun. ACM, 11:419–422, 1968. doi:10.1145/363347.363387.
- [45] Larry Wall. The Perl Programming Language. Prentice Hall Software Series, 1994.
- [46] Fang Yu, Zhifeng Chen, Yanlei Diao, T. V. Lakshman, and Randy H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Proc. ANCS, pages 93–102, 2006. doi:10.1145/1185347.1185360.