
Linear Time Subsequence and Supersequence Regex Matching

Antoine Amarilli, Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France; Florin Manea, Institute for Computer Science and CIDAS, University of Göttingen, Germany; Tina Ringleb, Institute for Computer Science, University of Göttingen, Germany; Markus L. Schmid, Humboldt-Universität zu Berlin, Germany
Abstract

It is well-known that checking whether a given string w matches a given regular expression r can be done in quadratic time O(|w||r|) and that this cannot be improved to a truly subquadratic running time of O((|w||r|)^{1−ϵ}) assuming the strong exponential time hypothesis (SETH). We study a different matching paradigm where we ask instead whether w has a subsequence that matches r, and show that regex matching in this sense can be solved in linear time O(|w|+|r|). Further, the same holds if we ask for a supersequence. We show that the quantitative variants where we want to compute a longest or shortest subsequence or supersequence of w that matches r can be solved in O(|w||r|), i. e., asymptotically no worse than classical regex matching; and we show that O(|w|+|r|) is conditionally not possible for these problems. We also investigate these questions with respect to other natural string relations like the infix, prefix, left-extension or extension relation instead of the subsequence and supersequence relation. We further study the complexity of the universal problem where we ask if all subsequences (or supersequences, infixes, prefixes, left-extensions or extensions) of an input string satisfy a given regular expression.

Keywords and phrases:
subsequence, supersequence, regular language, regular expression, automata
Funding:
Antoine Amarilli: Partially supported by the ANR project EQUUS ANR-19-CE48-0019 and by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – 431183758.
Florin Manea: Supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Heisenberg grant 466789228.
Markus L. Schmid: Supported by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) – project number 522576760 (gefördert durch die Deutsche Forschungsgemeinschaft (DFG) – Projektnummer 522576760).
Copyright and License:
© Antoine Amarilli, Florin Manea, Tina Ringleb, and Markus L. Schmid; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Regular languages
Related Version:
Full Version: https://arxiv.org/abs/2504.16288 [5]
Editors:
Paweł Gawrychowski, Filip Mazowiecki, and Michał Skrzypczak

1 Introduction

Regular expressions (also called regex) were first introduced by Kleene in 1956 [37] as a theoretical concept and quickly found their way into practice with the classical construction by Thompson [60]. Nowadays, they are a standard tool for text processing and data retrieval tasks, and they constitute computational primitives in virtually all modern programming languages (see the standard textbook [24]). The most important problem about regular expressions is their matching problem, i. e., checking whether a given regular expression r matches a given string w. One of the two main algorithmic approaches to this problem is still Thompson’s original construction: transform r into a nondeterministic finite automaton with ε-transitions (or εNFA) A in linear time, and then simulate A on w in time O(|w||A|). (The other approach is to transform r into a deterministic finite automaton or DFA.)

In addition to their well-known applications as text analysis tools, regular expressions are also used in many different areas and are at the core of several data management principles: graph databases [6], information extraction [20], complex event processing [31], network monitoring [44], financial fraud detection [59], infrastructure security [41], etc. They have also recently gained new interest in the field of fine-grained complexity, where it has been shown that the long-standing quadratic upper bound of O(|w||r|) for matching cannot be improved to truly subquadratic complexity of O((|w||r|)^{1−ϵ}) unless the strong exponential time hypothesis (SETH) fails: see [9, 12]. In particular, the following question has been investigated [9]: for which classes of regex can the upper bound be improved, ideally to linear time O(|w|+|r|), and for which classes is this impossible (assuming SETH)?

We contribute to the research on regex matching by exploring the following new angle. Instead of asking whether the whole input string w matches the regex r, we ask whether w has a subsequence that matches r, or a supersequence that matches r, or, more generally, whether r matches some string u with u ⪯ w, where ⪯ is a fixed string relation (and classical regex matching is then the case when ⪯ is the equality relation). While technically any possible string relation instantiates a ⪯-version of the regex matching problem, we focus on relations that are relevant for string matching: the infix relation (u ⪯𝗂𝗇 w ⇔ w = vuv′ for some strings v, v′), the prefix relation (u ⪯𝗉𝗋𝖾 w ⇔ w = uv for some string v), the left-extension relation (u ⪯𝗅𝖾𝗑𝗍 w ⇔ u = vw for some string v), the extension relation (u ⪯𝖾𝗑𝗍 w ⇔ w ⪯𝗂𝗇 u), the subsequence relation (x1x2⋯xn ⪯𝗌𝗎𝖻 w ⇔ w = w0x1w1x2⋯xnwn for some strings w0, …, wn), and the supersequence relation (u ⪯𝗌𝗎𝗉 w ⇔ w ⪯𝗌𝗎𝖻 u).

Our Results.

Our main focus is on the subsequence and supersequence relations, and our first contribution is to show that the complexity of regular expression matching can be substantially improved when performed with these relations. Namely, we show that given a regex r and string w, we can check in time O(|w|+|r|) whether some subsequence of w matches r, or whether some supersequence of w matches r. As we show, this surprising tractability result is in strong contrast to the same problem for other string relations (infix, prefix, left-extension and extension), or for regex matching in the usual sense – in all these contexts, the O(|w||r|) algorithm is optimal assuming SETH.

Our linear time upper bound for subsequence and supersequence matching is achieved by transforming the regex r into an εNFA A that accepts w if and only if w has a subsequence (or supersequence) that matches r, following a known construction in the context of so-called upward and downward closures [8]. We then exploit special properties of A that allow us to decide whether it accepts w in linear time. While it is in general easy to see that the quadratic upper bound can be improved in the case of subsequence and supersequence matching, we think that linear time complexity is rather unexpected, and some algorithmic care is required to achieve it. In particular, we stress the fact that our O(|w|+|r|) complexity bound does not assume the input alphabet to be constant.

Motivated by this positive algorithmic result, we investigate a natural generalisation of the matching problem: compute a maximum-length/minimum-length string u with uw that matches r. For the subsequence and supersequence relations, we can show (conditionally) that this generalisation worsens the complexity: our O(|w|+|r|) matching algorithm does not generalise to this stronger problem. More precisely, for the max-variant of the subsequence relation or the min-variant of the supersequence relation, we show that truly subquadratic algorithms would imply truly subquadratic algorithms for the longest common subsequence or the shortest common supersequence problems, which is impossible under SETH (see [13, 1]). For the min-variant of the subsequence relation or the max-variant of the supersequence relation, we show a weaker bound: an O(|w|+|r|) algorithm would imply that we can detect triangles in a dense graph in linear time. On the positive side, however, we show that all these problems can be solved within the original matching complexity of O(|w||r|), in particular generalising the fact that longest common subsequences and shortest common supersequences can be computed within this bound. We also show that the same upper bound applies for the infix, prefix, and (left-)extension relations, and that it is optimal under SETH.

Finally, we investigate the complexity of checking whether all strings u with uw match r. While it is easy to see that this can be solved efficiently for the infix and prefix relation, it becomes intractable for the (left-)extension, subsequence and supersequence relations. We pinpoint the complexity of all these variants (conditionally to SETH) except for the infix relation, i. e., we leave a gap between the upper and lower bound for the problem of deciding whether the input string has an infix which is rejected by an input regex.

Motivations and Related Work.

Subsequences and supersequences play an important role in many different areas of theoretical computer science: in formal languages and logics (e. g., piecewise testable languages [56, 57, 36, 48], or subword order and downward closures [32, 43, 42, 62]), in combinatorics on words [51, 23, 54, 46, 52, 53] or combinatorial pattern matching [33, 35, 45, 58, 19, 27], for modelling concurrency [50, 55, 15], in fine-grained complexity [11, 14, 1, 2, 49], etc. See also the surveys [10, 18, 40] and the references therein. Moreover, algorithmic problems related to the analysis of the set of subsequences of the strings of a formal language, given as a grammar or automaton, are studied in [4] and [21]. Closer to our topic, matching regular expressions to the subsequences of a string is an important matching paradigm in event stream processing, where we receive a stream of events that result from different processes, which means that consecutive events from the same process are represented as a subsequence in this string (see, e. g., [7, 28, 63, 38, 39, 25, 31]).

To prove our linear upper bound for the subsequence regex matching problem, we first transform the regular expression into an εNFA A𝗌𝗎𝖻 that accepts the upward closure of 𝖫(r), i. e., the set {u ∣ ∃v ∈ 𝖫(r): v ⪯𝗌𝗎𝖻 u} (note that w has a subsequence that matches r if and only if w ∈ 𝖫(A𝗌𝗎𝖻)). Similarly, the upper bound for the supersequence regex matching problem is based on constructing an εNFA A𝗌𝗎𝗉 for the downward closure {u ∣ ∃v ∈ 𝖫(r): v ⪯𝗌𝗎𝗉 u}. The downward and upward closures are well-investigated concepts in formal language theory (see citations above). In particular, it has been noted in [8, Lemma 9] that the powerset-DFA of A𝗌𝗎𝖻 always transitions from a state set to a superset of this state set (analogously, the powerset-DFA of A𝗌𝗎𝗉 always transitions from a state set to a subset of this state set). However, this property does not imply that the powerset-DFAs of A𝗌𝗎𝖻 and A𝗌𝗎𝗉 are necessarily small: [47] proves an exponential lower bound on their number of states. Our notion of subsequence or supersequence matching is related to so-called upward-closed and downward-closed languages (which are languages respectively equal to their upward or downward closure), because for such languages the usual notion of regex matching coincides with subsequence and supersequence matching, respectively. These languages have been investigated, e. g., for downward-closed languages in [26] or in [3] (under the name “simple regular expressions”) and for upward-closed languages in [29]. However, to our knowledge, none of the works focusing on the upward or downward closure, or on upward-closed or downward-closed languages, have investigated the complexity of the matching problem like we do – so it was not known that these problems could be solved in time O(|w|+m), or indeed that their complexity was lower than that of standard regex matching.

Paper Structure.

In Section 2, we give preliminaries and formally define the matching problems that we study. In Section 3, we first recall some basics about the state-set simulation of εNFAs and then explain how our regex matching problems for the subsequence and supersequence relation reduce to the state-set simulation of certain εNFAs. Then, in Section 4, we give the corresponding linear time algorithms. Sections 5 and 6 are concerned with the quantitative and universal problem variants for all considered string relations. The conditional lower bounds that complement our upper bounds are discussed in Section 7. Due to space restrictions, we only give proof sketches of some of our results (all proof details can be found in the full version [5] of this paper).

2 Preliminaries and Problem Statement

Strings, Regular Expressions and Automata.

We let Σ be a finite alphabet of symbols (sometimes called letters) and write Σ* for the set of strings (sometimes called words) over Σ. We write ε for the empty string. For a string w ∈ Σ*, we denote by |w| its length and, for every i ∈ {1, 2, …, |w|}, we denote by w[i] the ith symbol of w. For i, j ∈ {1, 2, …, |w|} with i ≤ j, we denote by w[i:j] the factor (also called infix) w[i]w[i+1]⋯w[j]; in particular, w[i:i] = w[i].

Regular expressions over Σ are defined as follows. The empty set ∅ is a regular expression with 𝖫(∅) = ∅, and every x ∈ Σ ∪ {ε} is a regular expression with 𝖫(x) = {x}. If s and t are regular expressions, then s⋅t, s∨t, and s* are regular expressions with 𝖫(s⋅t) = 𝖫(s)𝖫(t), 𝖫(s∨t) = 𝖫(s) ∪ 𝖫(t), and 𝖫(s*) = (𝖫(s))*. Here, we define as usual: L1L2 = {uv ∣ u ∈ L1, v ∈ L2}, L^0 = {ε}, L^k = L^{k−1}L for every k ≥ 1, and L* = ⋃_{k≥0} L^k. The length |s| of a regular expression s is the total number of symbols of s considered as a string.

We work with nondeterministic finite automata with ε-transitions, called εNFA for brevity. An εNFA is a tuple A = (Q, Σ, q0, q𝖿, δ), where Q is a finite set of states, q0 is the initial state, q𝖿 is the final state, and δ ⊆ Q × (Σ ∪ {ε}) × Q is the set of transitions. A transition of the form (p,a,q) is called an a-transition. We will also interpret an εNFA A as a graph which has vertex set Q and which has directed edges labelled by symbols from Σ ∪ {ε} given by the transitions of δ, i. e., any transition (p,a,q) ∈ δ is interpreted as a directed edge from p to q labelled by a. A transition (p,a,p) with p ∈ Q and a ∈ Σ ∪ {ε} is called a self-loop. A run of A on an input string w ∈ Σ* is a path (not necessarily simple) from q0 to some state p which is labelled with w, where the label of a path is just the concatenation of all Σ-labels (ignoring ε-labels). A run is accepting if p = q𝖿. We write 𝖫(A) for the language accepted by A, i. e., the set of all strings for which A has an accepting run. The size |A| of A is its number of transitions: we usually denote it by m, while n denotes the number |Q| of states.

It is well-known that a given regular expression r can be converted in time O(|r|) into an εNFA A such that 𝖫(A)=𝖫(r) and |A|=O(|r|). This can be achieved, e. g., using Thompson’s construction [34, Section 3.2.3]. Thus, in the sequel, we assume that all input regular expressions are given as εNFAs, and we state our results directly for arbitrary εNFAs.

Moreover, we assume that our εNFAs are trimmed, which means that every state is reachable from q0 and every state can reach q𝖿. This can always be ensured in time O(m). If an εNFA is trimmed, then we also have that the number of states of the input automaton is at most twice the number of transitions. We also assume that for each xΣ we have at least one transition labelled with x, which means that |Σ| is in O(m).

String Relations.

A string relation (over Σ) is a subset of Σ* × Σ*. For any string relation ⪯ and w ∈ Σ*, we define Λ⪯(w) = {u ∈ Σ* ∣ u ⪯ w}, i. e., the set of all strings that are in relation ⪯ to w; we also lift this notation to languages L ⊆ Σ*, i. e., Λ⪯(L) = ⋃_{w∈L} Λ⪯(w). Next, we define several well-known string relations.¹

¹ Our generalised regex matching setting works for any string relation, but those that we study are all reflexive, transitive and antisymmetric; hence, we use the symbol ⪯.

The prefix and infix relations are denoted by ⪯𝗉𝗋𝖾 and ⪯𝗂𝗇: formally, we write u ⪯𝗉𝗋𝖾 w when uv = w for some v ∈ Σ*, and we write u ⪯𝗂𝗇 w when vuv′ = w for some v, v′ ∈ Σ*. The left-extension and extension relations are denoted by ⪯𝗅𝖾𝗑𝗍 and ⪯𝖾𝗑𝗍: formally, we write u ⪯𝗅𝖾𝗑𝗍 w when u = vw for some v ∈ Σ*, and write u ⪯𝖾𝗑𝗍 w when w ⪯𝗂𝗇 u. The subsequence and supersequence relations are denoted by ⪯𝗌𝗎𝖻 and ⪯𝗌𝗎𝗉: we write u ⪯𝗌𝗎𝖻 w when w[j1]w[j2]⋯w[j|u|] = u for some 1 ≤ j1 < j2 < ⋯ < j|u| ≤ |w|, and write u ⪯𝗌𝗎𝗉 w when w ⪯𝗌𝗎𝖻 u. Note that we do not study the suffix and right-extension relations because they amount to prefix and left-extension up to mirroring the strings.
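For concreteness, the subsequence relation (and hence the supersequence relation, which is its inverse) can be tested greedily in time O(|w|). The following Python sketch is purely illustrative and not part of the algorithms of this paper:

```python
def is_subsequence(u, w):
    """Test u <=_sub w: greedily match the symbols of u left to right in w."""
    it = iter(w)
    return all(x in it for x in u)  # `x in it` advances the iterator past x

def is_supersequence(u, w):
    """Test u <=_sup w, which by definition means w <=_sub u."""
    return is_subsequence(w, u)
```

The greedy strategy is correct because matching each symbol of u at its earliest possible position in w never rules out a later match.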

Variants of Regex Matching.

The well-known regex matching problem is to decide whether w ∈ 𝖫(r) for a given regular expression r and input string w. As explained above, this generalises to the εNFA acceptance problem, where we want to decide whether w ∈ 𝖫(A) for a given εNFA A and input string w. In this paper, we study the ⪯-matching problem for a string relation ⪯ among those presented above: for a given string w ∈ Σ* and an εNFA A, decide whether A accepts a string u with u ⪯ w, i. e., decide whether Λ⪯(w) ∩ 𝖫(A) ≠ ∅.

We also define quantitative versions of the matching problem. For a string relation , the min-variant of the -matching problem is as follows: For a given string wΣ and an εNFA A, compute a shortest string u with uw and u𝖫(A), or report that no such string u exists. The max-variant of the -matching problem is as follows: For a given string wΣ and an εNFA A, either report that there are arbitrarily large strings u with uw and u𝖫(A), or compute a longest string u with uw and u𝖫(A), or report that no such string u exists. Further, all lower bounds on the quantitative versions of the matching problem will already apply in the setting where we are only required to compute the length of the resulting string.

Finally, we define a universal variant of the matching problem. For a string relation ⪯, the universal-variant of the ⪯-matching problem is as follows: for a given string w ∈ Σ* and an εNFA A, decide whether A matches all strings u with u ⪯ w, i. e., decide the inclusion Λ⪯(w) ⊆ 𝖫(A).

See Figure 1 for an overview of all our upper and lower bounds with respect to the variants of the regex matching problem. Recall that m is the size of the input εNFA.

① Upper bounds:

                    ⪯𝗂𝗇        ⪯𝗉𝗋𝖾       ⪯𝖾𝗑𝗍/⪯𝗅𝖾𝗑𝗍   ⪯𝗌𝗎𝖻       ⪯𝗌𝗎𝗉
⪯-matching          O(|w|m)    O(|w|m)    O(|w|m)      O(|w|+m)   O(|w|+m)
min-variant         O(|w|m)    O(|w|m)    O(|w|m)      O(|w|m)    O(|w|m)
max-variant         O(|w|m)    O(|w|m)    O(|w|m)      O(|w|m)    O(|w|m)
universal-variant   O(|w|²m)   O(|w|m)    PSPACE       coNP       PSPACE

② Conditional lower bounds:

                    ⪯𝗂𝗇/⪯𝗉𝗋𝖾             ⪯𝖾𝗑𝗍/⪯𝗅𝖾𝗑𝗍           ⪯𝗌𝗎𝖻                 ⪯𝗌𝗎𝗉
⪯-matching          no O((|w|m)^{1−ϵ})   no O((|w|m)^{1−ϵ})   —                    —
min-variant         no O((|w|m)^{1−ϵ})   no O((|w|m)^{1−ϵ})   no O(|w|+m)          no O((|w|m)^{1−ϵ})
max-variant         no O((|w|m)^{1−ϵ})   no O((|w|m)^{1−ϵ})   no O((|w|m)^{1−ϵ})   no O(|w|+m)
universal-variant   no O((|w|m)^{1−ϵ})   PSPACE-hard          coNP-hard            PSPACE-hard

Figure 1: Upper bounds ① and conditional lower bounds ② for the different problem variants. Recall that m is the size of the εNFA.

Computational Model and Basic Input-Preprocessing.

The computational model we use to state our algorithms is the standard unit-cost word RAM with logarithmic word-size ω, which allows processing inputs of size n, where ω ≥ log n. This is a standard computational model for the analysis of algorithms, defined in [22]. To make some of our algorithms faster, we use the standard technique of lazy initialisation of arrays [30] to avoid spending time linear in the array size to initialise its cells. The time needed, after the lazy initialisation, to store a value in a cell of the array, or to check if a cell of the array was initialised and if so return the value it stores, is O(1).
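The lazy-initialisation technique can be sketched as follows. This is an illustrative Python rendering of the classic back-pointer scheme from [30]; in Python the allocations are in fact zero-filled, which we simply treat as if it were uninitialised garbage:

```python
class LazyArray:
    """Lazily initialised array: O(1) set/get with no initialisation pass.
    The back/stack pair witnesses which cells were ever written, so stale
    (conceptually garbage) contents of val and back are never trusted."""

    def __init__(self, size):
        # On a word RAM these allocations would be left uninitialised;
        # Python zero-fills them, which we treat as arbitrary garbage.
        self.val = [0] * size
        self.back = [0] * size
        self.stack = []  # cells written so far, in order of first write

    def _written(self, i):
        b = self.back[i]
        return b < len(self.stack) and self.stack[b] == i

    def set(self, i, v):
        if not self._written(i):
            self.back[i] = len(self.stack)
            self.stack.append(i)
        self.val[i] = v

    def get(self, i, default=None):
        return self.val[i] if self._written(i) else default
```

A cell i counts as initialised only if back[i] points into the stack and the stack entry points back at i, a condition that garbage values cannot satisfy by accident.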

Recall that the inputs for our problems (see previous paragraph) are always an εNFA and a string over Σ. For these inputs, we make the following assumptions. For an εNFA A=(Q,Σ,q0,q𝖿,δ) with |Q|=n, |Σ|=σ, and m=|δ|, we assume that Q={1,,n} and Σ={1,,σ}; we can assume that both Q and Σ are ordered sets, w.r.t. the canonical order on natural numbers. Hence, the processed strings are sequences of integers (representing the symbols), each of these integers fitting in one memory word. This is a common assumption in string algorithms: the input alphabet is said to be an integer alphabet (see, e. g., [17]). As we assume εNFAs to be trimmed, we also know that n,σ are in O(m). Moreover, an εNFA A is given by its number of states, size of the input alphabet, initial and final state, and a list of m transitions of the form (q,a,q), with q,qQ and aΣ{ε}.

Further, using radix sort, we can prepare for each state q having outgoing transitions the list L[q] of these transitions, sorted first according to the symbol labelling them, and then by the target state; for states q without outgoing transitions, L[q] is the empty list. This allows us to simultaneously construct, for each q ∈ Q and a ∈ Σ ∪ {ε} for which q has an outgoing a-transition, the list T[q,a] of transitions of the form (q,a,q′); T[q,a] is undefined for q ∈ Q and a ∈ Σ ∪ {ε} in the case where q has no outgoing a-transition. We make the convention that, if a self-loop (q,a,q) exists, then it is the first transition stored in T[q,a]; this allows us to test in O(1) whether there exists a self-loop (q,a,q) for any q ∈ Q and a ∈ Σ ∪ {ε}. Both L[⋅] and T[⋅,⋅] are implemented as arrays (of lists), which are lazily initialised (as mentioned above). So, computing the defined elements of L[⋅] and T[⋅,⋅] can be done in linear time, i. e., in O(m).
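This preprocessing might be rendered as follows. In this illustrative Python sketch, stable bucket passes play the role of radix sort over the integer-encoded states and symbols, and plain dicts stand in for the lazily initialised arrays L[⋅] and T[⋅,⋅]:

```python
def preprocess(n, sigma, transitions):
    """Build, for each state q, the list L[q] of its outgoing transitions
    sorted by (label, target), and T[q][a], its sub-list with label a.
    States are 1..n; labels are 0..sigma, where 0 plays the role of epsilon.
    Three stable counting passes (radix sort) keep the total time O(m).
    By convention, a self-loop (q, a, q) is moved to the front of T[q][a],
    so its existence can be tested in O(1)."""
    def counting_pass(items, key, size):
        buckets = [[] for _ in range(size)]
        for t in items:
            buckets[key(t)].append(t)
        return [t for b in buckets for t in b]

    ts = counting_pass(transitions, lambda t: t[2], n + 1)  # by target state
    ts = counting_pass(ts, lambda t: t[1], sigma + 1)       # then by label
    ts = counting_pass(ts, lambda t: t[0], n + 1)           # then by source

    L, T = {}, {}
    for t in ts:
        q, a, q2 = t
        L.setdefault(q, []).append(t)
        bucket = T.setdefault(q, {}).setdefault(a, [])
        if q == q2:
            bucket.insert(0, t)  # self-loop first, per the convention above
        else:
            bucket.append(t)
    return L, T
```

Because each counting pass is stable, the final order within L[q] is sorted by label and then by target, exactly as described.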

3 State-Set Simulation and Simple Upper Bounds

Given a string wΣ and a regular expression represented as an εNFA A, we can solve the regex matching problem of deciding whether w𝖫(A) using the classical approach of state-set simulation. It is well-known [9] that, assuming the strong exponential time hypothesis (SETH), this yields an optimal O(|w||A|) algorithm. For our variants of regex matching, it is often possible to build an εNFA A from A that accepts suitable strings (e. g., the subsequences or supersequences of strings of 𝖫(A)), so that we can solve the matching problem by checking whether w𝖫(A) by state-set simulation (though this is generally not optimal). Since many of our algorithms will nevertheless be specialised variants of state-set simulation, we review the technique in more detail below.

State-Set Simulation for 𝜺NFA.

For an NFA A without ε-transitions and an input string w, the state-set simulation starts with S0 = {q0} and then, for every i = 1, 2, …, |w|, computes the set Si = 𝖢w[i](Si−1), where, for any state set S and symbol b ∈ Σ, we call 𝖢b(S) the set of all states that can be reached from a state in S by a b-transition, i. e., 𝖢b(S) = {q ∣ ∃p ∈ S: (p,b,q) ∈ δ}. Clearly, each Si contains exactly the states that are reachable from q0 by a w[1:i]-labelled path. Now if we are dealing with an εNFA, then we can use the same idea, but we also have to compute the ε-closures 𝖢ε(S) in between, i. e., the set of all states that can be reached from a state in S by a path of ε-transitions. More precisely, we start with S0 = 𝖢ε({q0}) and then, for every i = 1, 2, …, |w|, we compute the set Si = 𝖢ε(𝖢w[i](Si−1)). Now each Si contains exactly the states that are reachable from q0 by a path that is labelled with w[1:i] (with ε-transitions being ignored in the path label). Computing S0 is called the initialisation and computing Si from Si−1 is called an update step of the state-set simulation. For every i = 0, 1, 2, …, |w|, the set Si is the set of active states at step i. We also say that a state p is active at step i if p ∈ Si.
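The initialisation and update steps above can be sketched directly. This is an illustrative Python rendering in which None plays the role of ε and plain dicts stand in for the transition tables T[⋅,⋅]:

```python
from collections import deque

def eps_closure(states, delta_eps):
    """C_eps(S): BFS from all states of S along epsilon-transitions."""
    seen = set(states)
    queue = deque(states)
    while queue:
        p = queue.popleft()
        for q in delta_eps.get(p, ()):
            if q not in seen:
                seen.add(q)
                queue.append(q)
    return seen

def accepts(q0, qf, transitions, w):
    """State-set simulation of an epsilon-NFA in time O(|w| m).
    None is used as the label of epsilon-transitions."""
    delta, delta_eps = {}, {}
    for (p, a, q) in transitions:
        if a is None:
            delta_eps.setdefault(p, []).append(q)
        else:
            delta.setdefault((p, a), []).append(q)
    S = eps_closure({q0}, delta_eps)            # initialisation: S_0
    for b in w:                                 # update step: S_i
        S = eps_closure({q for p in S for q in delta.get((p, b), ())},
                        delta_eps)
    return qf in S
```

Each update step touches every transition at most once, giving the O(|w|m) bound discussed next.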

Given a state set S and a symbol b ∈ Σ, computing the set 𝖢b(S) can easily be done in time O(m), since we only have to compute, for every p ∈ S, the states q with a transition (p,b,q), which are given by T[p,b]. Thus, we have to inspect every transition at most once. Computing the ε-closure 𝖢ε(S) can also be done in time O(m) by a breadth-first search (BFS) that starts in all states of S and only considers ε-transitions (again, the entries T[p,ε] can be used for that). This means that the initialisation and each update step can be done in time O(m), which yields a total running time of O(|w|m) for the whole state-set simulation. What is more, this running time is conditionally optimal. Indeed:

Lemma 3.1 ([9]).

Given an εNFA A and a string w, we can check in time O(|A||w|) whether w ∈ 𝖫(A), but not in time O((|A||w|)^{1−ϵ}) for any ϵ > 0 unless SETH fails.

Solving Sub- and Supersequence Matching via State-Set Simulation.

We now explain how we can use state-set simulation to solve the ⪯-matching problem, specifically for the subsequence and supersequence relations (which are our main focus in this paper). It is easy to see that we can transform an εNFA A into two εNFAs A𝗌𝗎𝖻 and A𝗌𝗎𝗉 such that w ∈ 𝖫(A𝗌𝗎𝖻) ⇔ Λ𝗌𝗎𝖻(w) ∩ 𝖫(A) ≠ ∅ and w ∈ 𝖫(A𝗌𝗎𝗉) ⇔ Λ𝗌𝗎𝗉(w) ∩ 𝖫(A) ≠ ∅. For example, this is done in [8, Lemma 8]. We review this construction as we adapt it later.

The εNFA A𝗌𝗎𝖻 is obtained from A by simply adding a transition (p,b,p) for every state p and b ∈ Σ. Intuitively speaking, these loops equip A with the general possibility to consume symbols from w without transitioning to a new state, which corresponds to ignoring symbols from w. Hence, the “non-ignoring” transitions of an accepting run of A𝗌𝗎𝖻 on w spell out a subsequence of w accepted by A. On the other hand, the εNFA A𝗌𝗎𝗉 is obtained from A by simply adding a transition (p,ε,q) for every existing transition (p,b,q) with b ∈ Σ. This means that while reading w the automaton can always virtually process some symbols that do not belong to w. Hence, in an accepting run of A𝗌𝗎𝗉 on w, the actual transitions that read the symbols from w, together with the added ε-transitions that virtually read some symbols, spell out a supersequence of w accepted by A.
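The two constructions can be written down directly; in this illustrative Python sketch, None again stands for ε and an automaton is represented by its transition list:

```python
def upward_closure_nfa(transitions, states, alphabet):
    """A_sub: add a self-loop (p, b, p) for every state p and symbol b,
    letting the automaton consume (i.e. ignore) arbitrary symbols of w."""
    return list(transitions) + [(p, b, p) for p in states for b in alphabet]

def downward_closure_nfa(transitions):
    """A_sup: add an epsilon-copy (p, None, q) of every labelled transition,
    letting the automaton virtually read symbols that are absent from w."""
    return list(transitions) + [(p, None, q)
                                for (p, a, q) in transitions if a is not None]
```

The size bounds stated below can be read off these definitions: A𝗌𝗎𝖻 adds n|Σ| self-loops, whereas A𝗌𝗎𝗉 at most doubles the number of transitions.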

Consequently, we can solve the subsequence and supersequence matching problem by checking whether w𝖫(A𝗌𝗎𝖻) and w𝖫(A𝗌𝗎𝗉), respectively, via state-set simulation. Since |A𝗌𝗎𝖻|=O(n|Σ|+m) and |A𝗌𝗎𝗉|=O(m), we get the following:

Theorem 3.2.

The subsequence matching problem can be solved in time O(|w|(n|Σ|+m)) and the supersequence matching problem can be solved in time O(|w|m).

It has been observed in [8, Lemma 9] that the εNFAs A𝗌𝗎𝖻 and A𝗌𝗎𝗉 have special properties that can be exploited in the state-set simulations to improve the bounds of Theorem 3.2.

Firstly, for A𝗌𝗎𝖻, whenever a state p of A𝗌𝗎𝖻 is added to the set of active states in the state-set simulation, it will stay active until the end of the state-set simulation. This is due to the existence of the transition (p,b,p) for every b ∈ Σ. Consequently, for the state-set simulation of A𝗌𝗎𝖻, we have S0 ⊆ S1 ⊆ ⋯ ⊆ S|w|. Secondly, for A𝗌𝗎𝗉, we can observe that the state-set simulation starts with S0 = Q, i. e., the set of all states, which is due to the fact that every state of A is reachable from q0 by ε-transitions and therefore is in the ε-closure of q0 with respect to A𝗌𝗎𝗉. Moreover, when a state q is removed, it is never added back. Indeed, assume by contradiction that q ∉ Si−1 but q gets added to Si by an update step. Then q is not reachable from Si−1 by ε-transitions, but it is reachable from 𝖢w[i](Si−1) by ε-transitions: this is impossible because in A𝗌𝗎𝗉 there is a transition (p,ε,q) for every existing transition (p,b,q) with b ∈ Σ. Thus, S0 ⊇ S1 ⊇ ⋯ ⊇ S|w|.

This means that in the state-set simulation for A𝗌𝗎𝖻 and A𝗌𝗎𝗉 on an input string w, we can encounter at most n+1 different sets of active states. Further, for each set of active states, there are at most |Σ| possible update steps necessary (since if 𝖢ε(𝖢w[i+1](Si)) = Si and w[i+1] = w[j+1] and Si = Sj for some i < j hold, then we do not need to update Sj). Every actual update can again be done in time O(m), which yields the following:

Theorem 3.3.

There are algorithms that solve the subsequence matching problem in time O(|w|+n|Σ|(n|Σ|+m)) and the supersequence matching problem in time O(|w|+n|Σ|m).

This is already a significant improvement over Theorem 3.2, since the running time is linear in |w|. In the next two sections, we look deeper into the problem of subsequence and supersequence matching and manage to lower its complexity to the optimum of O(|w|+m).

4 Subsequence and Supersequence Matching in Linear Time

In this section, we prove that the 𝗌𝗎𝖻- and 𝗌𝗎𝗉-matching problem can be solved in linear time O(|w|+m). Note that this also holds for the usual regex matching problem when the input regex or εNFA is assumed to accept an upward-closed or downward-closed language (as matching is then equivalent to 𝗌𝗎𝖻- or 𝗌𝗎𝗉-matching). We first prove the result for the subsequence relation, which is slightly easier, and then cover the supersequence relation.

4.1 Subsequence Matching in Linear Time

Theorem 4.1.

Given a string w and εNFA A with n states and m transitions, the subsequence matching problem can be solved in time O(|w|+m).

Proof Sketch.

We perform the state-set simulation of A𝗌𝗎𝖻 on w, i. e., we compute the state sets S0, …, S|w|, but without explicitly constructing A𝗌𝗎𝖻, i. e., we work on A directly. Thus, the correctness follows from w ∈ 𝖫(A𝗌𝗎𝖻) ⇔ Λ𝗌𝗎𝖻(w) ∩ 𝖫(A) ≠ ∅ (see Section 3).

To do this in time O(|w|+m), we can consider every transition only a constant number of times, which is possible due to the property S0 ⊆ S1 ⊆ ⋯ ⊆ S|w|. More precisely, when we update Si−1 to Si, we have to compute S = 𝖢w[i](Si−1) and then 𝖢ε(S). However, if q is added to S due to a transition (p,w[i],q) with p ∈ Si−1, then q will stay in the set of active states until the end of the state-set simulation, and the transition (p,w[i],q) can therefore be ignored afterwards (i. e., it will not play a role in the computation of 𝖢w[j](Sj−1) for any j > i). In the computation of 𝖢w[i](Si−1), we therefore consider only unmarked transitions (p,w[i],q) with p ∈ Si−1 (i. e., the ones not already used in any previous update step) and then mark them when we use them. After having computed 𝖢w[i](Si−1), we must compute 𝖢ε(𝖢w[i](Si−1)), but again we can note that transitions (p,ε,q) that have already been traversed while computing some previous 𝖢ε(𝖢w[j](Sj−1)) with j < i can be ignored, since they necessarily lead to states that are already in Si−1 and therefore in Si. Consequently, when we compute the ε-closure, we ignore marked transitions and also mark all transitions that we traverse; this also handles the initialisation, where we compute 𝖢ε({q0}).

To implement this idea efficiently, we use a lazily initialised array H[⋅] of lists, indexed by the symbols of Σ, such that, after Si−1 is computed, H[a] contains exactly the unmarked transitions (p,a,q) with p ∈ Si−1. Hence, the transitions from H[w[i]] are exactly the ones that we need to compute 𝖢w[i](Si−1) as described above. While computing 𝖢w[i](Si−1), we store all new states (i. e., the states discovered via transitions from H[w[i]] that are not in Si−1 already) into a queue R. Then we compute the ε-closure 𝖢ε(R), ignoring marked transitions and marking traversed transitions. During the computation of 𝖢ε(R), we add to R all new states that we discover. Finally, in order to update H, we only have to look at outgoing transitions (p,a,q) with p ∈ R (which are necessarily unmarked) and add them to H[a].

In this procedure, every transition is either not marked at all, or it is marked in the computation of some ε-closure (including the initialisation) and then ignored, or it is put into H[a] at some update step (or initialisation) and then marked at some later update step and then ignored. Hence, in addition to the |w| update steps, we have to spend O(m) additional time over the whole state-set simulation.
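The proof sketch might be rendered as follows. This is an illustrative Python sketch under the stated conventions: None stands for ε, plain dicts and lists replace the lazily initialised arrays, and a transition counts as "marked" once it has been filed into and consumed from H (each of which happens at most once per transition):

```python
from collections import deque

def has_matching_subsequence(n, q0, qf, transitions, w):
    """Check in time O(|w| + m + n) whether some subsequence of w is
    accepted by the epsilon-NFA with states 0..n-1 (simulating A_sub on w
    without constructing it)."""
    out = {}  # state -> indices of its outgoing transitions
    for i, (p, a, q) in enumerate(transitions):
        out.setdefault(p, []).append(i)

    active = [False] * n
    H = {}       # symbol -> unmarked transition indices from active states
    R = deque()  # queue of newly activated states

    def activate(q):
        if not active[q]:     # states stay active forever (S_i monotone)
            active[q] = True
            R.append(q)

    def flush():
        # Process newly active states: follow (and thereby mark) their
        # epsilon-transitions; file their labelled transitions into H.
        while R:
            p = R.popleft()
            for i in out.get(p, ()):
                _, a, q = transitions[i]
                if a is None:
                    activate(q)
                else:
                    H.setdefault(a, []).append(i)

    activate(q0)
    flush()                    # initialisation: C_eps({q0})
    for b in w:
        for i in H.pop(b, ()):  # use and thereby mark each transition once
            activate(transitions[i][2])
        flush()
    return active[qf]
```

Since every transition enters H at most once and is popped at most once, the total work beyond reading w is O(m), matching the analysis in the proof sketch.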

4.2 Supersequence Matching in Linear Time

The general idea is again to check w ∈ 𝖫(A𝗌𝗎𝗉), and unlike in the previous subsection, we can afford to build A𝗌𝗎𝗉 explicitly in time O(m). However, performing a state-set simulation with A𝗌𝗎𝗉 in linear time is more difficult. Intuitively speaking, in order to obtain Si, we have to remove from Si−1 all states that cannot be reached by any w[i]-labelled path from some other state of Si−1. It is not clear how this can be done by touching each transition only a constant number of times over the whole course of the state-set simulation. One ingredient to achieve this is to first decompose A𝗌𝗎𝗉 into its strongly connected components (SCCs).

Recall that the SCCs of a directed graph G are the maximal subsets of vertices R such that, for any two distinct vertices u and v in R, there is a directed path from u to v and from v to u in G. The SCCs of an εNFA are simply its SCCs when seeing the automaton as a directed graph (ignoring the edge labels in Σ{ε}).

The condensation of an εNFA A, denoted by cond(A), is an εNFA whose states are the SCCs of A. For convenience, for every state p of A, we denote by SCCA[p] (or simply SCC[p] if A is clear from the context) the SCC of A that contains p, which, by definition, is a state of cond(A). The transitions of cond(A) are as follows: for every transition (p,a,q) of A, we have a transition (SCCA[p],a,SCCA[q]) in cond(A). Note that SCCA[p]=SCCA[q] is possible, and note that several transitions may be merged in cond(A). The initial state of cond(A) is SCCA[q0] and the final state of cond(A) is SCCA[q𝖿].

Proposition 4.2.

Given an εNFA A, we can construct cond(A) in time O(m).
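A standard way to obtain Proposition 4.2 is a two-pass (Kosaraju-style) SCC computation. The following Python sketch uses our own encoding (integer states, transitions as (p, a, q) triples); it returns the SCC index of each state together with the condensed transition set, in time O(n+m):

```python
from collections import defaultdict

def condense(n, edges):
    """Kosaraju SCC computation on the εNFA seen as a directed graph
    (edge labels ignored for the SCCs, kept for the condensed transitions).
    Returns (comp, cedges): comp[q] is the SCC index of state q, and
    cedges is the set of transitions (comp[p], a, comp[q])."""
    adj, radj = defaultdict(list), defaultdict(list)
    for p, a, q in edges:
        adj[p].append(q)
        radj[q].append(p)
    # first pass: iterative DFS on the graph, record finishing order
    order, seen = [], [False] * n
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, 0)]
        while stack:
            v, i = stack.pop()
            if i < len(adj[v]):
                stack.append((v, i + 1))
                u = adj[v][i]
                if not seen[u]:
                    seen[u] = True
                    stack.append((u, 0))
            else:
                order.append(v)
    # second pass: DFS on the reverse graph in reverse finishing order
    comp, c = [-1] * n, 0
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s] = c
        stack = [s]
        while stack:
            v = stack.pop()
            for u in radj[v]:
                if comp[u] == -1:
                    comp[u] = c
                    stack.append(u)
        c += 1
    cedges = {(comp[p], a, comp[q]) for p, a, q in edges}
    return comp, cedges
```

Note that, as in the definition above, parallel transitions of A that connect the same SCCs with the same label are merged in the condensed transition set.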

Leaving aside the self-loops, the condensation of an εNFA has the convenient property of being a directed acyclic graph (DAG). However, constructing the condensation of an automaton changes in general the language that it accepts. However, we shall next see that this is not the case for the εNFA A𝗌𝗎𝗉, i. e., we have 𝖫(A𝗌𝗎𝗉)=𝖫(cond(A𝗌𝗎𝗉)).

Lemma 4.3.

Let A be an εNFA and let w∈Σ*. Then 𝖫(A𝗌𝗎𝗉)=𝖫(cond(A𝗌𝗎𝗉)).

Hence, A accepts a supersequence of w if and only if w∈𝖫(A𝗌𝗎𝗉) (by Section 3), which holds if and only if w∈𝖫(cond(A𝗌𝗎𝗉)) (by Lemma 4.3). Moreover, cond(A𝗌𝗎𝗉) inherits the crucial properties of A𝗌𝗎𝗉, namely:

Proposition 4.4.

If cond(A𝗌𝗎𝗉) has a transition (C,b,C′) for some b∈Σ, then it also has a transition (C,ε,C′). Further, if S0,S1,…,S|w| are the state sets of the state-set simulation of cond(A𝗌𝗎𝗉) on w, then S0⊇S1⊇⋯⊇S|w|, and S0 contains all states of cond(A𝗌𝗎𝗉).

Our next goal is to show that we can implement the state-set simulation of cond(A𝗌𝗎𝗉) on w in linear time. For this, we will need to efficiently identify states C∈Si−1 that do not have a self-loop for the next input symbol w[i] (since if such a self-loop exists, then C∈Si holds). Identifying such states is challenging: while there are at most m self-loops, there might be up to n|Σ| pairs (C,a) overall where C does not have a self-loop labelled a, so we cannot explicitly materialise these pairs. Instead, we use a specific data structure:

Lemma 4.5.

For an εNFA A=(Q,Σ,q0,q𝖿,δ), we can build in time O(1) a data structure storing a set of states, initially empty, and supporting the following operations:

  • Push: Insert a state into the set.

  • Pop: Given a∈Σ, retrieve and remove all states q without a self-loop labelled with a (or indicate that no such state exists).

Over a sequence of ℓ push and pop operations in which each state of A is pushed at most once, the total running time is O(ℓ+m), where m is the number of transitions of A.
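To illustrate the interface of Lemma 4.5 (and only the interface: the paper's amortised implementation is not reproduced here), the following deliberately naive Python sketch realises the push/pop contract. Each pop scans the whole set, so it does not achieve the O(ℓ+m) bound, but it shows what the algorithm below relies on:

```python
from collections import defaultdict

class LoopFilter:
    """Naive realisation of the push/pop interface of Lemma 4.5.
    Correct, but each pop scans the whole current set, so it does NOT
    meet the amortised O(ℓ+m) running time of the lemma."""

    def __init__(self, transitions):
        # self_loops[q]: set of labels a with a self-loop (q, a, q)
        self.self_loops = defaultdict(set)
        for p, a, q in transitions:
            if p == q and a != '':
                self.self_loops[p].add(a)
        self.current = set()

    def push(self, q):
        self.current.add(q)

    def pop(self, a):
        """Retrieve and remove all stored states without an a-self-loop."""
        removed = {q for q in self.current if a not in self.self_loops[q]}
        self.current -= removed
        return removed
```

For instance, after pushing states 0, 1 and 2 of an automaton where 0 has only an a-self-loop, 1 has only a b-self-loop, and 2 has both, pop('a') returns {1} and a subsequent pop('b') returns {0}, leaving only state 2 stored.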

We will use the above structure for A′=cond(A𝗌𝗎𝗉). We can finally show our main result:

Theorem 4.6.

Given a string w and an εNFA A with n states and m transitions, the supersequence matching problem can be solved in time O(|w|+m).

Proof Sketch.

By our considerations from above, we only have to check whether w∈𝖫(A′), where A′=cond(A𝗌𝗎𝗉)=(Q′,Σ,q0′,q𝖿′,δ′) with n′=|Q′| and m′=|δ′|; recall that A′ can be constructed in linear time. We will do this by a state-set simulation as usual, but we exploit the properties of A′ (see Proposition 4.4) to implement it in linear time.

We consider an arbitrary update step. The current state-set Si−1 is a downward closed set of states (meaning that if a state is in Si−1, then so are all its successors); this follows from the fact that A′ is a DAG and from Proposition 4.4. The states in Q′∖Si−1 and their outgoing transitions can be completely ignored (they will never become active again, nor be responsible for removing some states of the current set of active states), so we may as well assume that they are not part of A′ anymore. As A′ is a DAG, the set Si−1 has some roots (at least one) without incoming transitions (apart from transitions from Q′∖Si−1, which we ignore anyway). Let R be the set of these roots; it will be maintained efficiently throughout the computation using the data structure of Lemma 4.5. To compute Si, we now have to identify and remove those states of Si−1 that cannot be reached by any w[i]-labelled path from some state of Si−1, which can be done as follows.

For every root p∈R we know that p∈Si if and only if p has a w[i]-self-loop, and if p has such a self-loop, then all its successors are also in Si (and can thus be ignored in the rest of the update procedure). The roots p∈R without a w[i]-self-loop are found using Lemma 4.5 and removed together with their outgoing transitions; further, for every such outgoing transition (p,x,q) with x∈Σ∪{ε}, we have to do the following. If x=w[i], then q∈Si; and if x≠w[i], then we check whether q has a w[i]-self-loop and, if so, q∈Si as well. If, on the other hand, q has no w[i]-self-loop and at least one other incoming transition, then we do not yet know whether it is in Si or not, since this might be determined by other incoming transitions. If x≠w[i] and q has no w[i]-self-loop, but q becomes a root when we delete (p,x,q), then it has to be treated as one of the initial roots (i. e., it will be deleted at some point and its outgoing transitions have to be deleted and processed as described above). The states that become roots can easily be found by keeping, for each state q, a count of the number of transitions that still lead to q. This procedure is repeated until we have taken care of all the initial roots and of all states that become roots (and are not already determined to be in Si) during this procedure. Once this terminates, we have computed the set Si, and the set of roots has also been updated accordingly.

Intuitively, the correctness follows from the fact that our algorithm constructs exactly the same sets Si as the state-set simulation. With respect to the running time, note that we perform O(|w|) update steps and consider each transition at most once. However, at the beginning of each update step (say, when computing Si), we need access to all the roots of Si−1 without w[i]-self-loops. For this, we use the data structure from Lemma 4.5. We initially perform O(n) push operations (corresponding to the roots of A′) and then, over the course of all update steps, we push and pop each state at most once, with one pop call per update step. So we perform O(n+|w|) operations in total, which, by Lemma 4.5, adds up to O(n+|w|+m) time.
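As a correctness baseline for the characterisation used above (A accepts a supersequence of w if and only if w∈𝖫(A𝗌𝗎𝗉)), the following Python sketch performs the plain O(|w|m) state-set simulation on A𝗌𝗎𝗉; it computes the same sets Si as the linear-time algorithm, just without the condensation and the amortised bookkeeping (the encoding, with '' standing for ε, is ours):

```python
from collections import defaultdict

def accepts_supersequence(n, transitions, q0, qf, w):
    """Does the εNFA A accept some supersequence of w?  We build A_sup by
    adding an ε-copy of every Σ-transition (Section 3) and run the standard
    state-set simulation of w on A_sup: O(m) per step, O(|w|·m) overall."""
    sup = list(transitions) + [(p, '', q) for (p, a, q) in transitions if a != '']
    eps = defaultdict(list)   # ε-successor lists
    sym = defaultdict(list)   # sym[p, a]: a-successors of p
    for p, a, q in sup:
        (eps[p] if a == '' else sym[p, a]).append(q)

    def closure(states):
        stack = list(states)
        while stack:
            p = stack.pop()
            for q in eps[p]:
                if q not in states:
                    states.add(q)
                    stack.append(q)
        return states

    current = closure({q0})
    for c in w:
        current = closure({q for p in current for q in sym[p, c]})
    return qf in current
```

For instance, if A accepts only "ab", then w = "b" is matched (since "ab" is a supersequence of "b"), while w = "ba" is not.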

5 Quantitative Problem Variants

For every x∈{𝗂𝗇,𝗉𝗋𝖾,𝖾𝗑𝗍,𝗅𝖾𝗑𝗍,𝗌𝗎𝖻,𝗌𝗎𝗉}, we consider now the min- and max-variant of the x-matching problem, i. e., computing a shortest or longest string u∈𝖫(A) that stands in relation x to w. We can show an upper bound of O(|w|m) for all these problem variants. (Observe that, of course, these upper bounds also constitute upper bounds for the (non-quantitative) matching problems for 𝗂𝗇, 𝗉𝗋𝖾, 𝖾𝗑𝗍 and 𝗅𝖾𝗑𝗍; see the upper table in Figure 1.)

Theorem 5.1.

The min- and max-variant of the matching problem can be solved in time O(|w|m) for every relation x∈{𝗂𝗇,𝗉𝗋𝖾,𝖾𝗑𝗍,𝗅𝖾𝗑𝗍,𝗌𝗎𝖻,𝗌𝗎𝗉}.

However, there are some differences with respect to the lower bounds (to be presented in Section 7): We can show that the O(|w|m) upper bound is conditionally tight, i. e., O((|w|m)^{1−ϵ}) is not possible unless SETH fails, for all problem variants except the min-variant of subsequence matching and the max-variant of supersequence matching. For these two variants, we can only show the somewhat weaker statement that an O(|w|+m) algorithm yields an O(|G|) algorithm that decides whether a dense graph G contains a triangle (note that any truly subcubic combinatorial algorithm for triangle detection implies a truly subcubic combinatorial algorithm for Boolean matrix multiplication [61]).

Let us now give proof sketches for the results of Theorem 5.1.

General Proof Technique.

We prove the upper bounds of the different problem variants in the same way: we reduce A and w to a (Σ∪{ε})-labelled directed graph G with edge weights of 0 or 1 and with a designated start vertex s and target vertex t, such that a shortest or longest infix, prefix, extension, left-extension, subsequence or supersequence corresponds to a path from s to t with minimum or maximum weight, respectively; the label of the path corresponds to the witness string (i. e., the infix, prefix, etc.) and its weight to the length of the witness string. By basic graph algorithmic techniques, it is not difficult to check in time O(|G|) whether there is a path from s to t and, if so, to compute such a path (along with its label) having minimum or maximum weight (or to report that the weight is unbounded for such paths). Since O(|G|)=O(|w|m) for all our reductions, the O(|w|m) upper bound follows. Let us in the following sketch the reductions for the different relations.

Infix and Prefix Relations.

Let us just discuss the infix relation, since the prefix relation is similar. We can build in time O(|w|m) the product graph GA,w, which has vertices {0,…,|w|}×Q and edges as follows. For each transition (q,ε,q′) in A and for each i∈{0,…,|w|}, add an edge from (i,q) to (i,q′) in GA,w with label ε and weight 0. For each transition (q,b,q′) with b∈Σ and for each i∈{1,…,|w|} with w[i]=b, add an edge from (i−1,q) to (i,q′) in GA,w with label b and weight 1.

To this product graph GA,w we add a source vertex s with ε-labelled edges with weight 0 to (i,q0) for each i∈{0,…,|w|}, and a target vertex t with ε-labelled edges with weight 0 from (i,q𝖿) for each i∈{0,…,|w|}; we call the resulting graph G. It can now be seen that paths from s to t in G are in one-to-one correspondence with accepting runs of A over the infixes of w, with the weight of the path being the length of the corresponding infix. As mentioned above, we can compute weight-minimum and weight-maximum paths from s to t along with their weights in time O(|G|)=O(|w|m).
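For the min-variant of infix matching, this reduction can be realised as follows; the sketch uses our own encoding (integer states, '' for ε), and 0-1 BFS stands in for the unspecified "basic graph algorithmic techniques":

```python
from collections import deque

def shortest_matching_infix(n, transitions, q0, qf, w):
    """Length of a shortest u ∈ L(A) that is an infix of w (None if there is
    none), via the 0/1-weighted product graph G_{A,w} and 0-1 BFS."""
    INF = float('inf')
    L = len(w)
    # vertex (i, q) is encoded as i*n + q; s and t are two extra vertices
    s, t = L * n + n, L * n + n + 1
    adj = [[] for _ in range(L * n + n + 2)]
    for p, a, q in transitions:
        if a == '':                      # ε-edge within layer i, weight 0
            for i in range(L + 1):
                adj[i * n + p].append((i * n + q, 0))
        else:                            # b-edge from layer i to i+1, weight 1
            for i in range(L):
                if w[i] == a:
                    adj[i * n + p].append(((i + 1) * n + q, 1))
    for i in range(L + 1):
        adj[s].append((i * n + q0, 0))   # start reading at any position
        adj[i * n + qf].append((t, 0))   # stop reading at any position
    # 0-1 BFS: weight-0 edges go to the front of the deque
    dist = [INF] * len(adj)
    dist[s] = 0
    dq = deque([s])
    while dq:
        v = dq.popleft()
        for u, wgt in adj[v]:
            if dist[v] + wgt < dist[u]:
                dist[u] = dist[v] + wgt
                (dq.appendleft if wgt == 0 else dq.append)(u)
    return None if dist[t] == INF else dist[t]
```

For instance, if A accepts only "ba", then for w = "aba" the shortest matching infix has length 2; the max-variant is analogous with a longest-path computation.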

Extension and Left-Extension Relations.

Let us only discuss the case of the extension relation, since the left-extension can be handled similarly (and is generally a bit simpler).

We build again the product graph GA,w, but we also add to GA,w two graphs that are essentially copies of the εNFA A. One copy has vertices {q→ ∣ q∈Q} and edges as follows: for each transition (p,x,q) in A there is an edge from p→ to q→ labelled x, with weight 0 if x=ε and weight 1 otherwise. The other copy has vertices {q← ∣ q∈Q} and is defined in exactly the same way, with the “→” subscripts replaced by “←” subscripts. Then, for each state q∈Q we add an ε-labelled edge with weight 0 from q→ to (0,q), and for each state q∈Q we add an ε-labelled edge with weight 0 from (|w|,q) to q←. We call the resulting graph G and we note that this construction can be done in time O(|w|m). Let the source vertex s of G be q0→ and let the target vertex t of G be q𝖿←.

Now, every path from s to t in G can be decomposed into three parts: (1) a path from s to q→ for some q∈Q, which is labelled with a string v∈Σ*, has weight |v|, and is such that there is a run from q0 to q in A when reading v; (2) a path from (0,q) to (|w|,q′) for some q′∈Q, which is labelled by w, has weight |w|, and witnesses that there is a run from q to q′ in A when reading w; (3) a path from q′← to t labelled with a string v′∈Σ*, with weight |v′|, such that there is a run from q′ to q𝖿 in A when reading v′. Consequently, vwv′ is an extension of w accepted by A. Conversely, it can be seen that, for any strings v,v′∈Σ* such that the extension vwv′ of w is accepted by A, there must be a path in G from s to t labelled with vwv′ and with weight |vwv′|.

Subsequence and Supersequence Relations.

Let us begin with the subsequence relation. We start with the product graph GA,w and add edges with weight 0 and label ε from (i,q) to (i+1,q) for each state q of A and each i∈{0,…,|w|−1}. Let us call these additional edges extra-edges, and note that they mimic the self-loops (q,b,q) for every b∈Σ added in the εNFA A𝗌𝗎𝖻 defined in Section 3, relative to the original εNFA A. However, note that we do not explicitly construct A𝗌𝗎𝖻, as we cannot afford this within the time bound O(|w|m). We call this graph G and note that it can be constructed in time O(|w|m). Let the source s of G be (0,q0) and let the target t be (|w|,q𝖿).

An accepting run of A𝗌𝗎𝖻 on w that corresponds to a subsequence u of w (by the symbols read by the original transitions from A) translates into a path from s to t with label u and weight |u|: Each transition (q,x,q′) of the run that is an original transition of A translates into a corresponding x-labelled edge of GA,w from (i,q) to (i+1,q′) or from (i,q) to (i,q′) (depending on whether x∈Σ or x=ε), where i depends on the length of the prefix of w already consumed; and each loop-transition (q,b,q) that is added to A in the construction of A𝗌𝗎𝖻 translates into an extra-edge from (i,q) to (i+1,q). Conversely, any path from s to t with label u and weight |u| translates into an accepting run of A𝗌𝗎𝖻 on w such that its original transitions of A read exactly u, which therefore must be a subsequence of w.
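The max-variant then amounts to a longest-path computation over this graph. The following Python sketch (our own encoding) assumes, for simplicity, that A has no ε-cycles, so that the graph is a DAG and one topological pass per layer suffices; the general case would first contract ε-cycles:

```python
def longest_matching_subsequence(n, transitions, q0, qf, w):
    """Length of a longest u ∈ L(A) that is a subsequence of w, or None.
    Simplifying assumption: A is ε-acyclic, so the layered graph with the
    weight-0 extra-edges is a DAG and a longest path exists."""
    NEG = float('-inf')
    eps = [[] for _ in range(n)]
    sym = {}
    for p, a, q in transitions:
        if a == '':
            eps[p].append(q)
        else:
            sym.setdefault((p, a), []).append(q)
    # topological order of the states w.r.t. ε-transitions (assumed acyclic)
    indeg = [0] * n
    for p in range(n):
        for q in eps[p]:
            indeg[q] += 1
    order = [q for q in range(n) if indeg[q] == 0]
    for v in order:
        for q in eps[v]:
            indeg[q] -= 1
            if indeg[q] == 0:
                order.append(q)
    L = len(w)
    dist = [[NEG] * n for _ in range(L + 1)]
    dist[0][q0] = 0
    for i in range(L + 1):
        for p in order:                      # in-layer ε-edges, weight 0
            if dist[i][p] > NEG:
                for q in eps[p]:
                    dist[i][q] = max(dist[i][q], dist[i][p])
        if i < L:
            for p in range(n):
                if dist[i][p] > NEG:
                    # extra-edge: skip w[i], weight 0
                    dist[i + 1][p] = max(dist[i + 1][p], dist[i][p])
                    # original transition: consume w[i], weight 1
                    for q in sym.get((p, w[i]), []):
                        dist[i + 1][q] = max(dist[i + 1][q], dist[i][p] + 1)
    return None if dist[L][qf] == NEG else dist[L][qf]
```

For instance, for an automaton accepting a*b* and w = "bab", a longest matching subsequence ("ab" or "bb") has length 2.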

For the supersequence relation, the reduction is slightly different. We start again with the product graph GA,w, but now we add, for each transition (q,b,q′) in A with b∈Σ and each i∈{0,…,|w|}, a b-labelled edge with weight 1 from (i,q) to (i,q′). Now, these extra-edges mimic the ε-transitions added in the εNFA A𝗌𝗎𝗉 defined in Section 3, relative to the original εNFA A. We note that for any ε-transition (q,ε,q′) of A𝗌𝗎𝗉 that is not a transition of A, there is at least one (and possibly more than one) transition (q,b,q′) with b∈Σ in A; thus, there is also at least one extra-edge from (i,q) to (i,q′), and several such extra-edges differ in their labels.

Similarly to the subsequence case, the paths from s to t in G correspond to accepting runs of A𝗌𝗎𝗉 on w, where the label of the path is the supersequence u of w induced by the accepting run and its weight is |u|. The original transitions of A translate into edges of GA,w as before, but every transition (q,ε,q′) of A𝗌𝗎𝗉 that is not a transition of A translates into an extra-edge from (i,q) to (i,q′) of G (where i again depends on the length of the prefix of w that has already been consumed). As observed above, there might be several extra-edges from (i,q) to (i,q′) that differ in their labels; we just take any of those (in fact, this decision only affects the symbols of the supersequence u that do not belong to w, but not its length). Also note that, unlike in the subsequence case, extra-edges now have weight 1, since they correspond to symbols of the supersequence of w (while in the subsequence case they correspond to symbols of w that do not belong to the accepted subsequence u).

6 Universal Problem Variants

We now consider the universal variants, i. e., checking whether all strings u that are related to w (under the respective relation) are in 𝖫(A).

Theorem 6.1.

The universal-variant of the matching problem can be solved in time O(|w|m) for the prefix relation and in time O(|w|2m) for the infix relation. It is PSPACE-complete for the extension and left-extension relation, coNP-complete for the subsequence relation, and PSPACE-complete for the supersequence relation.

Proof Sketch.

Checking whether all prefixes of a string are accepted by an automaton can be easily done via state-set simulation, and checking the same for all infixes can be done similarly by running a state-set simulation from every starting position. Let us show why the cases of the relations 𝖾𝗑𝗍,𝗅𝖾𝗑𝗍,𝗌𝗎𝖻 and 𝗌𝗎𝗉 follow from known facts about finite automata.

The PSPACE upper bound with respect to 𝖾𝗑𝗍 follows from the fact that Λ𝖾𝗑𝗍(w) can be expressed by an εNFA, and εNFA-inclusion can be decided in PSPACE. The reasoning is analogous for 𝗅𝖾𝗑𝗍. The same argument shows PSPACE membership for 𝗌𝗎𝗉 by building an εNFA accepting the supersequences of w. The 𝗌𝗎𝖻-variant is in coNP, since we can guess a subsequence of w and check acceptance for it.

The PSPACE-hardness with respect to 𝖾𝗑𝗍, 𝗅𝖾𝗑𝗍 and 𝗌𝗎𝗉 is easy to see: for w=ε, these problems are equivalent to εNFA-universality, i. e., checking whether 𝖫(A)=Σ*, which is PSPACE-hard. This leaves the coNP-hardness for 𝗌𝗎𝖻, which can be shown as follows. For a 3-CNF formula F over n variables we build an εNFA AF that accepts all {0,1}-strings of length k∈{1,2,…,n−1,n+1,…,2n} and all {0,1}-strings of length n that represent non-satisfying assignments of F. Then, Λ𝗌𝗎𝖻((01)^n)⊆𝖫(AF) if and only if F is not satisfiable.
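The prefix case of the proof sketch can be made concrete in a few lines: all prefixes of w (including ε and w itself) are accepted if and only if every state set S0,…,S|w| of the state-set simulation contains the final state. A small Python sketch, with our own εNFA encoding ('' standing for ε):

```python
def all_prefixes_accepted(n, transitions, q0, qf, w):
    """Universal variant for the prefix relation: does A accept every
    prefix of w?  One state-set simulation suffices: check that the final
    state lies in each of the sets S_0, ..., S_|w|."""
    def closure(states):
        stack = list(states)
        while stack:
            p = stack.pop()
            for pp, a, q in transitions:
                if pp == p and a == '' and q not in states:
                    states.add(q)
                    stack.append(q)
        return states

    current = closure({q0})
    if qf not in current:          # the empty prefix ε must be accepted
        return False
    for c in w:
        current = closure({q for (p, a, q) in transitions if a == c and p in current})
        if qf not in current:      # the prefix read so far must be accepted
            return False
    return True
```

For instance, for an automaton accepting a*, every prefix of "aa" is accepted, while for "ab" the full string fails. The infix variant runs this check from every starting position, giving O(|w|²m).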

7 Conditional Lower Bounds

We now present the conditional lower bounds that are stated in Figure 1. These bounds apply to all problems for which we showed a time-complexity higher than the optimal linear complexity O(|w|+m). Further, they match the upper bounds except in three cases: the universal-variant of the infix relation, and the min-variant (resp., max-variant) of the subsequence relation (resp., supersequence relation). The section is structured in two parts. We first show lower bounds for the infix, prefix, extension and left-extension relations: Theorem 7.1 covers the matching problem and its min- and max-variants, and Theorem 7.2 covers lower bounds for the universal-variant with the infix and prefix relations, noting that the universal-variant for extension and left-extension was shown earlier to be PSPACE-complete (Theorem 6.1). Second, we show lower bounds for the min- and max-variants of the sub- and supersequence relations; the case of the universal-variant was covered in Theorem 6.1 too.

Infix, Prefix, Extension and Left-Extension Relations.

We first observe that SETH-based lower bounds for the 𝗂𝗇-, 𝗉𝗋𝖾-, 𝖾𝗑𝗍- and 𝗅𝖾𝗑𝗍-matching problem can be concluded by minor adaptations of the original construction from [9]. Moreover, since the min- and max-variants also solve the non-quantitative version of the matching problem, we can conclude that these lower bounds also hold for the quantitative variants:

Theorem 7.1.

If the matching problem for the infix, prefix, extension and left-extension relation can be solved in time O((|w|m)^{1−ϵ}) for some ϵ>0, then SETH fails. The same holds for the min- and max-variants, even if we only require the length of the answer strings.

Further, we can show that the O(|w|m) upper bound for the universal variant of the matching problem for the prefix or infix relation is also optimal under SETH. The construction can be sketched as follows. For a string w and an εNFA A, we build an εNFA A′ with 𝖫(A′)=({#}·𝖫(A)·{#}) ∪ ({#}·Σ*) ∪ (Σ*·{#}) ∪ Σ* for a fresh symbol #∉Σ. It is not hard to see that w∈𝖫(A) if and only if Λ𝗂𝗇(#w#)⊆𝖫(A′). Hence, checking w∈𝖫(A) reduces to the universal variant of the 𝗂𝗇-matching problem, so that the classical SETH-based lower bound for regex matching applies. The case of 𝗉𝗋𝖾 is very similar. We therefore have:

Theorem 7.2.

If the universal variant of the matching problem for the prefix or infix relation can be solved in time O((|w|m)^{1−ϵ}) for some ϵ>0, then SETH fails.

Subsequence and Supersequence Relations.

With respect to 𝗌𝗎𝖻 and 𝗌𝗎𝗉, we have proven optimal linear upper bounds for the non-quantitative variants in Section 4. However, for the min- and max-variants, we only have upper bounds of O(|w|m), which we will now complement with conditional lower bounds.

We can easily construct εNFAs Au,𝗌𝗎𝖻 and Au,𝗌𝗎𝗉 that accept exactly the subsequences of a string u (and supersequences of a string u, respectively) – note that this is an easy special case of [8, Lemma 8] which we reviewed in Section 3. This means that the max-variant of the 𝗌𝗎𝖻-matching problem applied to the εNFA Au,𝗌𝗎𝖻 and a string v amounts to computing the longest common subsequence of u and v. Likewise, the min-variant of the 𝗌𝗎𝗉-matching problem applied to Au,𝗌𝗎𝗉 and v admits a reduction from the shortest common supersequence problem. Thus, solving these problems would allow us to solve the longest common subsequence problem (or shortest common supersequence problem, respectively) for u and v. This means that the known SETH-conditional lower bounds on the longest common subsequence and shortest common supersequence problems carry over to these quantitative variants of the matching problem (see [13, 1]). Note that these lower bounds already hold when computing the length of the sequences, and further they already hold for constant-sized alphabets so it is not a problem that the automaton Au,𝗌𝗎𝗉 generally has size Θ(|u||Σ|).
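The automaton Au,𝗌𝗎𝖻 is easy to construct explicitly: states 0,…,|u| with a letter-transition and a skip ε-transition between consecutive positions. The following Python sketch (with our own encoding, and brute force used only to make the claim checkable on tiny inputs) illustrates that the max-variant of 𝗌𝗎𝖻-matching on Au,𝗌𝗎𝖻 and v computes exactly the longest common subsequence of u and v:

```python
from itertools import combinations

def subseq_nfa(u):
    """εNFA accepting exactly the subsequences of u: at each position i
    either read u[i] or skip it via an ε-transition."""
    trans = []
    for i, c in enumerate(u):
        trans.append((i, c, i + 1))    # read u[i]
        trans.append((i, '', i + 1))   # skip u[i]
    return len(u) + 1, trans, 0, len(u)

def accepts(n, transitions, q0, qf, s):
    """Standard state-set simulation of an εNFA ('' stands for ε)."""
    def closure(states):
        stack = list(states)
        while stack:
            p = stack.pop()
            for pp, a, q in transitions:
                if pp == p and a == '' and q not in states:
                    states.add(q)
                    stack.append(q)
        return states
    states = closure({q0})
    for c in s:
        states = closure({q for (p, a, q) in transitions if a == c and p in states})
    return qf in states

def lcs_len(u, v):
    """Textbook LCS dynamic program, for comparison."""
    dp = [[0] * (len(v) + 1) for _ in range(len(u) + 1)]
    for i in range(len(u)):
        for j in range(len(v)):
            dp[i + 1][j + 1] = dp[i][j] + 1 if u[i] == v[j] \
                else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(u)][len(v)]

def longest_accepted_subsequence(u, v):
    """Max-variant of sub-matching on A_{u,sub} and v, by brute force over
    the subsequences of v (tiny inputs only); equals LCS(u, v)."""
    n, trans, q0, qf = subseq_nfa(u)
    for k in range(len(v), -1, -1):
        for idx in combinations(range(len(v)), k):
            if accepts(n, trans, q0, qf, ''.join(v[i] for i in idx)):
                return k
    return 0
```

For instance, for u = "abcb" and v = "bacb", both computations yield 3 (e.g., the common subsequence "bcb").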

Theorem 7.3.

If the max-variant of the matching problem for the subsequence relation or the min-variant of the matching problem for the supersequence relation can be solved in time O((|w|m)^{1−ϵ}) for some ϵ>0, then SETH fails. This holds even if we only require the length of the answer strings.

This leaves the question of the optimality of O(|w|m)-time for the min-variant of the 𝗌𝗎𝖻-matching problem and the max-variant of the 𝗌𝗎𝗉-matching problem. For these, we are not able to prove SETH lower bounds that exclude O((|w|m)^{1−ϵ}) for some ϵ>0, but we can argue that the existence of algorithms with running time O(|w|+m) would have some unlikely consequences.

Theorem 7.4.

If the min-variant of the matching problem for the subsequence relation or the max-variant of the matching problem for the supersequence relation can be solved in time O(|w|+m), then we can decide whether a given dense graph G has a triangle in time O(|G|). This holds even if we only require the length of the answer strings.

Let us mention that any truly subcubic combinatorial algorithm for triangle detection (i. e., an algorithm with running time O(n^{3−ϵ}) for some ϵ>0, where n is the number of vertices) yields a truly subcubic combinatorial algorithm for Boolean matrix multiplication, which is considered unlikely (see [61]). So, conditional on this hypothesis, the problems above cannot be solved in time O(|w|+m), i. e., they do not enjoy the same linear-time complexity as the non-quantitative versions of subsequence and supersequence matching.

We now sketch the proof of Theorem 7.4. A graph G=({v1,…,vn},E) which is dense (i. e., a graph with |E|=Ω(n²)) can be transformed into an NFA MG as follows. We initialise MG with 4 layers of states {pi,qi,ri,si ∣ 1≤i≤n} and with b-transitions following G, i. e., every {vi,vj}∈E translates into transitions (pi,b,qj),(qi,b,rj),(ri,b,sj). This is called the middle part of MG. For every i∈{1,…,n}, we call pi the ith-entry point and si the ith-exit point. Now, G has a triangle if and only if there is an i∈{1,…,n} and a bbb-labelled path from the ith-entry point to the ith-exit point. We next add a left part to MG: we add a state s, which will be the initial state, and for every i∈{1,…,n} there is a path from s to the ith-entry point with 2(n+1)−i transitions, i many of which are labelled with a and the rest (so 2(n+1)−2i transitions) are labelled with b (the order does not matter). For the right part, we add a state t, which will be the accepting state, and for every i∈{1,…,n} there is a path from the ith-exit point to t with n+1+i transitions, n−i many of which are labelled with a and the rest (so n+1+i−(n−i)=2i+1 transitions) are labelled with b (the order does not matter). This concludes the definition of MG. Note that any string accepted by MG must go through precisely one entry point and precisely one exit point. We make the following general observation: every string accepted by MG with an accepting run going through the ith-entry point and the jth-exit point (1.) has length 3(n+1)+(j−i)+3 and (2.) has n−(j−i) occurrences of a (due to how the a’s are chosen in the left and right part of MG).
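The two items of the general observation can be sanity-checked mechanically; the following quick check of the arithmetic (not part of the proof) verifies them for all entry/exit pairs of small instances:

```python
# Sanity check of the general observation for M_G:
# via the ith-entry and jth-exit point, an accepted string traverses
#   left part:  2(n+1)-i transitions, i of them labelled a,
#   middle:     3 b-transitions (the bbb-path),
#   right part: n+1+j transitions, n-j of them labelled a.
for n in range(1, 10):
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            left_len, left_as = 2 * (n + 1) - i, i
            right_len, right_as = n + 1 + j, n - j
            # observation (1.): total length
            assert left_len + 3 + right_len == 3 * (n + 1) + (j - i) + 3
            # observation (2.): total number of a's
            assert left_as + right_as == n - (j - i)
```

In particular, the length equals N := 3(n+1)+3 exactly when i = j, and the number of a's equals n exactly in the same case, which is what the two reductions below exploit.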

Now for the min-variant of the matching problem for the subsequence relation, we consider the string w=b^{2(n+1)}bbb(ab^{2(n+1)}bbb)^n and ask if the minimum length of a subsequence of w accepted by MG is N:=3(n+1)+3. The value N is a lower bound on the length of subsequences of w accepted by MG. Indeed, consider any subsequence u of w accepted via the ith-entry point and the jth-exit point: it has at most n occurrences of a (by definition of w), so by item (2.) of our general observation, j≥i, which means by item (1.) of the general observation that N=3(n+1)+3≤|u|. Now, let us show that a length-N subsequence of w is accepted by MG if and only if G has a triangle. In one direction, if a length-N subsequence u of w is accepted by MG, then by item (1.) of the general observation this is only possible via the ith-entry point and the ith-exit point for the same i; thus, G has a triangle. Conversely, if G has a triangle containing vi, then we can read a string with entry point and exit point i, which, since it contains n occurrences of a in total, can easily be obtained as a length-(3(n+1)+3) subsequence of w.

For the max-variant of the matching problem for the supersequence relation, we consider the string w=a^n and ask if the maximum length of a supersequence of w accepted by MG is N:=3(n+1)+3 (i. e., the same length as in the previous proof). The value N is now an upper bound on the length of supersequences of w accepted by MG. Indeed, any supersequence u of w accepted via the ith-entry point and the jth-exit point has at least n occurrences of a, so by item (2.) of our general observation we have j≤i, which means that |u|≤3(n+1)+3=N by item (1.) of our general observation. Now, the argument is similar as before: if MG accepts a length-N supersequence u of w, then this is only possible via the ith-entry point and the ith-exit point for the same i, and therefore G has a triangle; and if G has a triangle containing vi, then we can read a string with entry point and exit point i, which, since it has n occurrences of a in total, is a length-N supersequence of w.

8 Conclusion and Future Work

We investigated the regex matching problem in a more general setting, where instead of asking whether w matches r, we ask whether there is a string u that matches r and is related to w under some string relation. We demonstrated that for the well-known subsequence and supersequence relations, this makes regex matching linear-time solvable. Interestingly, this tractability does not extend to other natural and simple relations, and it also does not generalise to quantitative variants asking for the shortest or longest result. For future research it would be interesting to find more string relations with this property, e. g., the subsequence relation with gap size bounded by k: x1x2⋯xn ⊑𝗌𝗎𝖻,k w if and only if w=w0x1w1x2⋯xnwn with |wi|≤k for every i∈{1,2,…,n−1}. However, we expect that such small modifications would make the complexity quadratic again, i. e., the classical SETH-reduction applies.

It seems particularly interesting that we can compute in O(|w||r|) a longest subsequence u and a shortest supersequence v of w that match r, since the values |w|−|u| and |v|−|w| (or their sum) can be interpreted as a measure of how much w fails to match r (note that |w|−|u|=|v|−|w|=0 if and only if w∈𝖫(r)). It might be interesting to investigate how fast this measure can be computed for other (practically motivated) classes of regular expressions, e. g., deterministic regular expressions or regular expressions with counters (or even strictly more powerful language classes). Another relevant question is whether we can close the gap between the upper bound and the lower bound for the universal-variant of the infix problem. In other words, given an εNFA A and a string w, can one check whether there is an infix of w that is not accepted by A faster than the easy O(|w|²|A|) upper bound? A similar question is whether there is a genuine difference between the complexities of the min- and max-variants of the subsequence and supersequence problems, given that we could not show the same conditional lower bounds for these problems.

Let us also point out that none of our upper bounds rely on heavy algorithmic machinery or data structures and therefore our asymptotic running times do not hide huge constant factors. Hence, we believe that our algorithms have straightforward and practically relevant implementations (just like the classical state-set simulation does).

Finally, our tractability results on sub- and supersequence matching can also be seen as tractability results for the matching problem on automata that satisfy certain conditions, namely, their language is upward- or downward-closed. Our results imply that, on such automata, we can perform matching in time O(|w|+m) instead of O(|w|m). One natural question is which other restrictions on automata make it possible to achieve such improved bounds. Examples in this direction are the regex classes studied in [9, 12] or automata classes where determinisation is done in polynomial time, e. g., NFAs of bounded co-lex width [16].

References

  • [1] Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for LCS and other sequence similarity measures. In FOCS, 2015. doi:10.1109/FOCS.2015.14.
  • [2] Amir Abboud, Virginia Vassilevska Williams, and Oren Weimann. Consequences of faster alignment of sequences. In ICALP, 2014. doi:10.1007/978-3-662-43948-7_4.
  • [3] Parosh Aziz Abdulla, Ahmed Bouajjani, and Bengt Jonsson. On-the-fly analysis of systems with unbounded, lossy fifo channels. In CAV, 1998. doi:10.1007/BFB0028754.
  • [4] Duncan Adamson, Pamela Fleischmann, Annika Huch, Tore Koß, Florin Manea, and Dirk Nowotka. k-Universality of regular languages. In ISAAC, volume 283 of LIPIcs, 2023. doi:10.4230/LIPICS.ISAAC.2023.4.
  • [5] Antoine Amarilli, Florin Manea, Tina Ringleb, and Markus L. Schmid. Linear time subsequence and supersequence regex matching. CoRR, abs/2504.16288, 2025. doi:10.48550/arXiv.2504.16288.
  • [6] Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan L. Reutter, and Domagoj Vrgoc. Foundations of modern query languages for graph databases. ACM Comput. Surv., 50(5), 2017. doi:10.1145/3104031.
  • [7] Alexander Artikis, Alessandro Margara, Martín Ugarte, Stijn Vansummeren, and Matthias Weidlich. Complex event recognition languages: Tutorial. In DEBS, 2017. doi:10.1145/3093742.3095106.
  • [8] Georg Bachmeier, Michael Luttenberger, and Maximilian Schlund. Finite automata for the sub-and superword closure of CFLs: Descriptional and computational complexity. In ICALP, 2015. doi:10.1007/978-3-319-15579-1_37.
  • [9] Arturs Backurs and Piotr Indyk. Which regular expression patterns are hard to match? In FOCS, 2016. doi:10.1109/FOCS.2016.56.
  • [10] Lasse Bergroth, Harri Hakonen, and Timo Raita. A survey of longest common subsequence algorithms. In SPIRE, 2000. doi:10.1109/SPIRE.2000.878178.
  • [11] Karl Bringmann and Bhaskar Ray Chaudhury. Sketching, streaming, and fine-grained complexity of (weighted) LCS. In FSTTCS, LIPIcs, 2018. doi:10.4230/LIPICS.FSTTCS.2018.40.
  • [12] Karl Bringmann, Allan Grønlund, Marvin Künnemann, and Kasper Green Larsen. The NFA acceptance hypothesis: Non-combinatorial and dynamic lower bounds. In ITCS, 2024. doi:10.4230/LIPICS.ITCS.2024.22.
  • [13] Karl Bringmann and Marvin Künnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In FOCS, 2015. doi:10.1109/FOCS.2015.15.
  • [14] Karl Bringmann and Marvin Künnemann. Multivariate fine-grained complexity of longest common subsequence. In SODA, 2018. doi:10.1137/1.9781611975031.79.
  • [15] Sam Buss and Michael Soltys. Unshuffling a square is NP-hard. J. Comput. Syst. Sci., 80(4), 2014. doi:10.1016/j.jcss.2013.11.002.
  • [16] Nicola Cotumaccio, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza. Co-lexicographically ordering automata and regular languages - Part I. J. ACM, 70(4), 2023. doi:10.1145/3607471.
  • [17] Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. Cambridge University Press, 2007.
  • [18] Maxime Crochemore, Borivoj Melichar, and Zdenek Tronícek. Directed acyclic subsequence graph — overview. J. Discrete Algorithms, 1(3-4), 2003. doi:10.1016/S1570-8667(03)00029-7.
  • [19] Joel D. Day, Pamela Fleischmann, Maria Kosche, Tore Koß, Florin Manea, and Stefan Siemer. The edit distance to k-subsequence universality. In STACS, 2021. doi:10.4230/LIPIcs.STACS.2021.25.
  • [20] Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document spanners: A formal approach to information extraction. J. ACM, 62(2), 2015. doi:10.1145/2699442.
  • [21] Szilárd Zsolt Fazekas, Tore Koß, Florin Manea, Robert Mercas, and Timo Specht. Subsequence matching and analysis problems for formal languages. In ISAAC, volume 322 of LIPIcs, 2024. doi:10.4230/LIPICS.ISAAC.2024.28.
  • [22] Michael L. Fredman and Dan E. Willard. BLASTING through the information theoretic barrier with FUSION TREES. In STOC, 1990. doi:10.1145/100216.100217.
  • [23] Dominik D. Freydenberger, Pawel Gawrychowski, Juhani Karhumäki, Florin Manea, and Wojciech Rytter. Testing k-binomial equivalence. In Multidisciplinary Creativity, a collection of papers dedicated to G. Păun's 65th birthday, 2015. Available at CoRR abs/1509.00622.
  • [24] Jeffrey E. F. Friedl. Mastering regular expressions - Understand your data and be more productive: for Perl, PHP, Java, .NET, Ruby, and more. O’Reilly, 2006.
  • [25] André Frochaux and Sarah Kleest-Meißner. Puzzling over subsequence-query extensions: Disjunction and generalised gaps. In AMW, 2023. URL: https://ceur-ws.org/Vol-3409/paper3.pdf.
  • [26] Moses Ganardi, Irmak Sağlam, and Georg Zetzsche. Directed regular and context-free languages. In STACS, 2024.
  • [27] Pawel Gawrychowski, Maria Kosche, Tore Koß, Florin Manea, and Stefan Siemer. Efficiently testing Simon’s congruence. In STACS, volume 187 of LIPIcs, 2021. doi:10.4230/LIPICS.STACS.2021.34.
  • [28] Nikos Giatrakos, Elias Alevizos, Alexander Artikis, Antonios Deligiannakis, and Minos N. Garofalakis. Complex event recognition in the big data era: A survey. VLDB J., 29(1), 2020. doi:10.1007/s00778-019-00557-w.
  • [29] Jean Goubault-Larrecq, Simon Halfon, Prateek Karandikar, K Narayan Kumar, and Philippe Schnoebelen. The ideal approach to computing closed subsets in well-quasi-orderings. Well-Quasi Orders in Computation, Logic, Language and Reasoning: A Unifying Concept of Proof Theory, Automata Theory, Formal Languages and Descriptive Set Theory, 2020.
  • [30] Étienne Grandjean and Louis Jachiet. Which arithmetic operations can be performed in constant time in the RAM model with addition?, 2023. URL: https://arxiv.org/abs/2206.13851.
  • [31] Alejandro Grez, Cristian Riveros, Martín Ugarte, and Stijn Vansummeren. A formal framework for complex event recognition. ACM Trans. Database Syst., 46(4), 2021. doi:10.1145/3485463.
  • [32] Simon Halfon, Philippe Schnoebelen, and Georg Zetzsche. Decidability, complexity, and expressiveness of first-order logic over the subword ordering. In LICS, 2017. doi:10.1109/LICS.2017.8005141.
  • [33] Daniel S. Hirschberg. Algorithms for the longest common subsequence problem. J. ACM, 24(4), 1977. doi:10.1145/322033.322044.
  • [34] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 2001.
  • [35] James W. Hunt and Thomas G. Szymanski. A fast algorithm for computing longest subsequences. Communications of the ACM, 20(5), 1977. doi:10.1145/359581.359603.
  • [36] Prateek Karandikar and Philippe Schnoebelen. The height of piecewise-testable languages and the complexity of the logic of subwords. Log. Methods Comput. Sci., 15(2), 2019. doi:10.23638/LMCS-15(2:6)2019.
  • [37] S.C. Kleene. Representation of events in nerve nets and finite automata. In Automata Studies, volume 34 of Annals of Mathematics Studies, pages 3–41. Princeton University Press, 1956.
  • [38] Sarah Kleest-Meißner, Rebecca Sattler, Markus L. Schmid, Nicole Schweikardt, and Matthias Weidlich. Discovering event queries from traces: Laying foundations for subsequence-queries with wildcards and gap-size constraints. In ICDT, 2022.
  • [39] Sarah Kleest-Meißner, Rebecca Sattler, Markus L. Schmid, Nicole Schweikardt, and Matthias Weidlich. Discovering multi-dimensional subsequence queries from traces - from theory to practice. In Datenbanksysteme für Business, Technologie und Web (BTW 2023), 20. Fachtagung des GI-Fachbereichs „Datenbanken und Informationssysteme“ (DBIS), 2023. doi:10.18420/BTW2023-24.
  • [40] Maria Kosche, Tore Koß, Florin Manea, and Stefan Siemer. Combinatorial algorithms for subsequence matching: A survey. In NCMA, volume 367 of EPTCS, 2022. doi:10.4204/EPTCS.367.2.
  • [41] Sailesh Kumar, Balakrishnan Chandrasekaran, Jonathan S. Turner, and George Varghese. Curing regular expressions matching algorithms from insomnia, amnesia, and acalculia. In ANCS, 2007. doi:10.1145/1323548.1323574.
  • [42] Dietrich Kuske. The subtrace order and counting first-order logic. In CSR, volume 12159 of LNCS, 2020. doi:10.1007/978-3-030-50026-9_21.
  • [43] Dietrich Kuske and Georg Zetzsche. Languages ordered by the subword order. In FOSSACS, volume 11425 of LNCS, 2019. doi:10.1007/978-3-030-17127-8_20.
  • [44] Alex X. Liu and Eric Norige. A de-compositional approach to regular expression matching for network security. IEEE/ACM Trans. Netw., 27(6), 2019. doi:10.1109/TNET.2019.2941920.
  • [45] David Maier. The complexity of some problems on subsequences and supersequences. J. ACM, 25(2), 1978. doi:10.1145/322063.322075.
  • [46] Alexandru Mateescu, Arto Salomaa, and Sheng Yu. Subword histories and Parikh matrices. J. Comput. Syst. Sci., 68(1):1–21, 2004. doi:10.1016/J.JCSS.2003.04.001.
  • [47] Alexander Okhotin. On the state complexity of scattered substrings and superstrings. Fundamenta Informaticae, 99(3), 2010. doi:10.3233/FI-2010-252.
  • [48] M. Praveen, Philippe Schnoebelen, Julien Veron, and Isa Vialard. On the piecewise complexity of words and periodic words. In SOFSEM, 2024. doi:10.1007/978-3-031-52113-3_32.
  • [49] Steven Purtzel and Matthias Weidlich. SuSe: Summary selection for regular expression subsequence aggregation over streams. In SIGMOD, 2025. To appear.
  • [50] William E. Riddle. An approach to software system modelling and analysis. Comput. Lang., 4(1), 1979. doi:10.1016/0096-0551(79)90009-2.
  • [51] Michel Rigo and Pavel Salimov. Another generalization of abelian equivalence: Binomial complexity of infinite words. Theor. Comput. Sci., 601, 2015. doi:10.1016/J.TCS.2015.07.025.
  • [52] Arto Salomaa. Connections between subwords and certain matrix mappings. Theoret. Comput. Sci., 340(2):188–203, 2005. doi:10.1016/J.TCS.2005.03.024.
  • [53] Philippe Schnoebelen and Julien Veron. On arch factorization and subword universality for words and compressed words. In WORDS, 2023. doi:10.1007/978-3-031-33180-0_21.
  • [54] Shinnosuke Seki. Absoluteness of subword inequality is undecidable. Theor. Comput. Sci., 418, 2012. doi:10.1016/J.TCS.2011.10.017.
  • [55] Alan C. Shaw. Software descriptions with flow expressions. IEEE Trans. Software Eng., 4(3), 1978. doi:10.1109/TSE.1978.231501.
  • [56] Imre Simon. Hierarchies of events with dot-depth one. PhD thesis, University of Waterloo, 1972.
  • [57] Imre Simon. Piecewise testable events. In Automata Theory and Formal Languages, 2nd GI Conference, volume 33 of LNCS, 1975. doi:10.1007/3-540-07407-4_23.
  • [58] Imre Simon. Words distinguished by their subwords (Extended abstract). In WORDS, volume 27 of TUCS General Publication, 2003.
  • [59] Peter Snyder and Chris Kanich. No please, after you: Detecting fraud in affiliate marketing networks. In WEIS, 2015. URL: http://www.econinfosec.org/archive/weis2015/papers/WEIS_2015_snyder.pdf.
  • [60] K. Thompson. Programming techniques: Regular expression search algorithm. Communications of the ACM, 11(6), 1968.
  • [61] Virginia Vassilevska Williams and R. Ryan Williams. Subcubic equivalences between path, matrix, and triangle problems. J. ACM, 65(5), 2018. doi:10.1145/3186893.
  • [62] Georg Zetzsche. The complexity of downward closure comparisons. In ICALP, volume 55 of LIPIcs, 2016. doi:10.4230/LIPICS.ICALP.2016.123.
  • [63] Haopeng Zhang, Yanlei Diao, and Neil Immerman. On complexity and optimization of expensive queries in complex event processing. In SIGMOD, 2014. doi:10.1145/2588555.2593671.