
Linear Time Subsequence and Supersequence Regex Matching

Antoine Amarilli, Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France; Florin Manea, Institute for Computer Science and CIDAS, University of Göttingen, Germany; Tina Ringleb, Institute for Computer Science, University of Göttingen, Germany; Markus L. Schmid, Humboldt-Universität zu Berlin, Germany
Abstract

It is well-known that checking whether a given string w matches a given regular expression r can be done in quadratic time O(|w||r|) and that this cannot be improved to a truly subquadratic running time of O((|w||r|)^{1−ϵ}) assuming the strong exponential time hypothesis (SETH). We study a different matching paradigm where we ask instead whether w has a subsequence that matches r, and show that regex matching in this sense can be solved in linear time O(|w|+|r|). Further, the same holds if we ask for a supersequence. We show that the quantitative variants where we want to compute a longest or shortest subsequence or supersequence of w that matches r can be solved in O(|w||r|), i. e., asymptotically no worse than classical regex matching; and we show that O(|w|+|r|) is conditionally not possible for these problems. We also investigate these questions with respect to other natural string relations like the infix, prefix, left-extension or extension relation instead of the subsequence and supersequence relation. We further study the complexity of the universal problem where we ask if all subsequences (or supersequences, infixes, prefixes, left-extensions or extensions) of an input string satisfy a given regular expression.

Keywords and phrases:
subsequence, supersequence, regular language, regular expression, automata
Funding:
Antoine Amarilli: Partially supported by the ANR project EQUUS ANR-19-CE48-0019 and by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – 431183758.
Florin Manea: Supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Heisenberg grant 466789228.
Markus L. Schmid: Supported by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) – project number 522576760 (gefördert durch die Deutsche Forschungsgemeinschaft (DFG) – Projektnummer 522576760).
Copyright and License:
© Antoine Amarilli, Florin Manea, Tina Ringleb, and Markus L. Schmid; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Regular languages
Related Version:
Full Version: https://arxiv.org/abs/2504.16288 [5]
Editors:
Paweł Gawrychowski, Filip Mazowiecki, and Michał Skrzypczak

1 Introduction

Regular expressions (also called regex) were first introduced by Kleene in 1956 [37] as a theoretical concept and quickly found their way into practice with the classical construction by Thompson [60]. Nowadays, they are a standard tool for text processing and data retrieval tasks, and they constitute computational primitives in virtually all modern programming languages (see the standard textbook [24]). The most important problem about regular expressions is their matching problem, i. e., checking whether a given regular expression r matches a given string w. One of the two main algorithmic approaches to this problem is still Thompson’s original construction: transform r into a nondeterministic finite automaton with ε-transitions (or εNFA) A in linear time, and then simulate A on w in time O(|w||A|). (The other approach is to transform r into a deterministic finite automaton or DFA.)

In addition to their well-known applications as text analysis tools, regular expressions are also used in many different areas and are at the core of several data management principles: graph databases [6], information extraction [20], complex event processing [31], network monitoring [44], financial fraud detection [59], infrastructure security [41], etc. They have also recently gained new interest in the field of fine-grained complexity, where it has been shown that the long-standing quadratic upper bound of O(|w||r|) for matching cannot be improved to truly subquadratic complexity of O((|w||r|)^{1−ϵ}) unless the strong exponential time hypothesis (SETH) fails: see [9, 12]. In particular, the following question has been investigated [9]: for which classes of regex can the upper bound be improved, ideally to linear time O(|w|+|r|), and for which classes is this impossible (assuming SETH)?

We contribute to the research on regex matching by exploring the following new angle. Instead of asking whether the whole input string w matches the regex r, we ask whether w has a subsequence that matches r, or a supersequence that matches r, or, more generally, whether r matches some string u with u ⪯ w, where ⪯ is a fixed string relation (and classical regex matching is then the case when ⪯ is the equality relation). While technically any possible string relation instantiates a ⪯-version of the regex matching problem, we focus on relations that are relevant for string matching: the infix relation (u ⪯𝗂𝗇 w ⇔ w = vuv′ for some strings v, v′), the prefix relation (u ⪯𝗉𝗋𝖾 w ⇔ w = uv for some string v), the left-extension relation (u ⪯𝗅𝖾𝗑𝗍 w ⇔ u = vw for some string v), the extension relation (u ⪯𝖾𝗑𝗍 w ⇔ w ⪯𝗂𝗇 u), the subsequence relation (x1x2⋯xn ⪯𝗌𝗎𝖻 w ⇔ w = w0x1w1x2⋯xnwn for some strings w0, …, wn), and the supersequence relation (u ⪯𝗌𝗎𝗉 w ⇔ w ⪯𝗌𝗎𝖻 u).

Our Results.

Our main focus is on the subsequence and supersequence relations, and our first contribution is to show that the complexity of regular expression matching can be substantially improved when performed with these relations. Namely, we show that given a regex r and string w, we can check in time O(|w|+|r|) whether some subsequence of w matches r, or whether some supersequence of w matches r. As we show, this surprising tractability result is in strong contrast to the same problem for other string relations (infix, prefix, left-extension and extension), or for regex matching in the usual sense – in all these contexts, the O(|w||r|) algorithm is optimal assuming SETH.

Our linear time upper bound for subsequence and supersequence matching is achieved by transforming the regex r into an εNFA A that accepts w if and only if w has a subsequence (or supersequence) that matches r, following a known construction in the context of so-called upward and downward closures [8]. We then exploit special properties of A that allow us to decide whether it accepts w in linear time. While it is in general easy to see that the quadratic upper bound can be improved in the case of subsequence and supersequence matching, we think that linear time complexity is rather unexpected, and some algorithmic care is required to achieve it. In particular, we stress the fact that our O(|w|+|r|) complexity bound does not assume the input alphabet to be constant.

Motivated by this positive algorithmic result, we investigate a natural generalisation of the matching problem: compute a maximum-length/minimum-length string u with uw that matches r. For the subsequence and supersequence relations, we can show (conditionally) that this generalisation worsens the complexity: our O(|w|+|r|) matching algorithm does not generalise to this stronger problem. More precisely, for the max-variant of the subsequence relation or the min-variant of the supersequence relation, we show that truly subquadratic algorithms would imply truly subquadratic algorithms for the longest common subsequence or the shortest common supersequence problems, which is impossible under SETH (see [13, 1]). For the min-variant of the subsequence relation or the max-variant of the supersequence relation, we show a weaker bound: an O(|w|+|r|) algorithm would imply that we can detect triangles in a dense graph in linear time. On the positive side, however, we show that all these problems can be solved within the original matching complexity of O(|w||r|), in particular generalising the fact that longest common subsequences and shortest common supersequences can be computed within this bound. We also show that the same upper bound applies for the infix, prefix, and (left-)extension relations, and that it is optimal under SETH.

Finally, we investigate the complexity of checking whether all strings u with uw match r. While it is easy to see that this can be solved efficiently for the infix and prefix relation, it becomes intractable for the (left-)extension, subsequence and supersequence relations. We pinpoint the complexity of all these variants (conditionally to SETH) except for the infix relation, i. e., we leave a gap between the upper and lower bound for the problem of deciding whether the input string has an infix which is rejected by an input regex.

Motivations and Related Work.

Subsequences and supersequences play an important role in many different areas of theoretical computer science: in formal languages and logics (e. g., piecewise testable languages [56, 57, 36, 48], or subword order and downward closures [32, 43, 42, 62]), in combinatorics on words [51, 23, 54, 46, 52, 53] or combinatorial pattern matching [33, 35, 45, 58, 19, 27], for modelling concurrency [50, 55, 15], in fine-grained complexity [11, 14, 1, 2, 49], etc. See also the surveys [10, 18, 40] and the references therein. Moreover, algorithmic problems related to the analysis of the set of subsequences of the strings of a formal language, given as a grammar or automaton, are studied in [4] and [21]. Closer to our topic, matching regular expressions to the subsequences of a string is an important matching paradigm in event stream processing, where we receive a stream of events that result from different processes, which means that consecutive events from the same process are represented as a subsequence in this string (see, e. g., [7, 28, 63, 38, 39, 25, 31]).

To prove our linear upper bound for the subsequence regex matching problem, we first transform the regular expression into an εNFA A𝗌𝗎𝖻 that accepts the upward closure of 𝖫(r), i. e., the set {u ∣ ∃v ∈ 𝖫(r): v ⪯𝗌𝗎𝖻 u} (note that w has a subsequence that matches r if and only if w ∈ 𝖫(A𝗌𝗎𝖻)). Similarly, the upper bound for the supersequence regex matching problem is based on constructing an εNFA A𝗌𝗎𝗉 for the downward closure {u ∣ ∃v ∈ 𝖫(r): v ⪯𝗌𝗎𝗉 u}. The downward and upward closures are well-investigated concepts in formal language theory (see citations above). In particular, it has been noted in [8, Lemma 9] that the powerset-DFA of A𝗌𝗎𝖻 always transitions from a state set to a superset of this state set (analogously, the powerset-DFA of A𝗌𝗎𝗉 always transitions from a state set to a subset of this state set). However, this property does not imply that the powerset-DFAs of A𝗌𝗎𝖻 and A𝗌𝗎𝗉 are necessarily small: [47] proves an exponential lower bound on their number of states. Our notion of subsequence or supersequence matching is related to so-called upward-closed and downward-closed languages (which are languages respectively equal to their upward or downward closure), because for such languages the usual notion of regex matching coincides with subsequence and supersequence matching, respectively. These languages have been investigated, e. g., for downward-closed languages in [26] or in [3] (under the name “simple regular expressions”) and for upward-closed languages in [29]. However, to our knowledge, none of the works focusing on the upward or downward closure, or on upward-closed or downward-closed languages, have investigated the complexity of the matching problem like we do – so it was not known that these problems could be solved in time O(|w|+m), or indeed that their complexity was lower than that of standard regex matching.

Paper Structure.

In Section 2, we give preliminaries and formally define the matching problems that we study. In Section 3, we first recall some basics about the state-set simulation of εNFAs and then explain how our regex matching problems for the subsequence and supersequence relation reduce to the state-set simulation of certain εNFAs. Then, in Section 4, we give the corresponding linear time algorithms. Sections 5 and 6 are concerned with the quantitative and universal problem variants for all considered string relations. The conditional lower bounds that complement our upper bounds are discussed in Section 7. Due to space restrictions, we only give proof sketches of some of our results (all proof details can be found in the full version [5] of this paper).

2 Preliminaries and Problem Statement

Strings, Regular Expressions and Automata.

We let Σ be a finite alphabet of symbols (sometimes called letters) and write Σ* for the set of strings (sometimes called words) over Σ. We write ε for the empty string. For a string w ∈ Σ*, we denote by |w| its length and, for every i ∈ {1, 2, …, |w|}, we denote by w[i] the ith symbol of w. For i, j ∈ {1, 2, …, |w|} with i ≤ j, we denote by w[i:j] the factor (also called infix) w[i]w[i+1]⋯w[j]; in particular, w[i:i] = w[i].

Regular expressions over Σ are defined as follows. The empty set ∅ is a regular expression with 𝖫(∅) = ∅, and every x ∈ Σ ∪ {ε} is a regular expression with 𝖫(x) = {x}. If s and t are regular expressions, then s⋅t, s∨t, and s* are regular expressions with 𝖫(s⋅t) = 𝖫(s)𝖫(t), 𝖫(s∨t) = 𝖫(s) ∪ 𝖫(t), and 𝖫(s*) = (𝖫(s))*. Here, we define as usual: L1L2 = {uv ∣ u ∈ L1, v ∈ L2}, L^0 = {ε}, L^k = L^{k−1}L for every k ≥ 1, and L* = ⋃_{k≥0} L^k. The length |s| of a regular expression s is the total number of symbols of s considered as a string.

We work with nondeterministic finite automata with ε-transitions, called εNFA for brevity. An εNFA is a tuple A = (Q, Σ, q0, q𝖿, δ), where Q is a finite set of states, q0 is the initial state, q𝖿 is the final state, and δ ⊆ Q × (Σ ∪ {ε}) × Q is the set of transitions. A transition of the form (p,a,q) is called an a-transition. We will also interpret an εNFA A as a graph which has vertex set Q and which has directed edges labelled by symbols from Σ ∪ {ε} given by the transitions of δ, i. e., any transition (p,a,q) ∈ δ is interpreted as a directed edge from p to q labelled by a. A transition (p,a,p) with p ∈ Q and a ∈ Σ ∪ {ε} is called a self-loop. A run of A on an input string w ∈ Σ* is a path (not necessarily simple) from q0 to some state p which is labelled with w, where the label of a path is just the concatenation of all Σ-labels (ignoring ε-labels). A run is accepting if p = q𝖿. We write 𝖫(A) for the language accepted by A, i. e., the set of all strings for which A has an accepting run. The size |A| of A is its number of transitions: we usually denote it by m, while n denotes the number |Q| of states.

It is well-known that a given regular expression r can be converted in time O(|r|) into an εNFA A such that 𝖫(A)=𝖫(r) and |A|=O(|r|). This can be achieved, e. g., using Thompson’s construction [34, Section 3.2.3]. Thus, in the sequel, we assume that all input regular expressions are given as εNFAs, and we state our results directly for arbitrary εNFAs.

Moreover, we assume that our εNFAs are trimmed, which means that every state is reachable from q0 and every state can reach q𝖿. This can always be ensured in time O(m). If an εNFA is trimmed, then we also have that the number of states of the input automaton is at most twice the number of transitions. We also assume that for each xΣ we have at least one transition labelled with x, which means that |Σ| is in O(m).

String Relations.

A string relation (over Σ) is a subset of Σ* × Σ*. For any string relation ⪯ and w ∈ Σ*, we define Λ⪯(w) = {u ∈ Σ* ∣ u ⪯ w}, i. e., the set of all strings that are in relation ⪯ to w; we also lift this notation to languages L ⊆ Σ*, i. e., Λ⪯(L) = ⋃_{w∈L} Λ⪯(w). Next, we define several well-known string relations.¹

¹ Our generalised regex matching setting works for any string relation, but those that we study are all reflexive, transitive and antisymmetric; hence, we use the symbol ⪯.

The prefix and infix relations are denoted by ⪯𝗉𝗋𝖾 and ⪯𝗂𝗇: formally, we write u ⪯𝗉𝗋𝖾 w when uv = w for some v ∈ Σ*, and we write u ⪯𝗂𝗇 w when vuv′ = w for some v, v′ ∈ Σ*. The left-extension and extension relations are denoted by ⪯𝗅𝖾𝗑𝗍 and ⪯𝖾𝗑𝗍: formally, we write u ⪯𝗅𝖾𝗑𝗍 w when u = vw for some v ∈ Σ*, and write u ⪯𝖾𝗑𝗍 w when w ⪯𝗂𝗇 u. The subsequence and supersequence relations are denoted by ⪯𝗌𝗎𝖻 and ⪯𝗌𝗎𝗉: we write u ⪯𝗌𝗎𝖻 w when w[j1]w[j2]⋯w[j|u|] = u for some 1 ≤ j1 < j2 < ⋯ < j|u| ≤ |w|, and write u ⪯𝗌𝗎𝗉 w when w ⪯𝗌𝗎𝖻 u. Note that we do not study the suffix and right-extension relations because they amount to prefix and left-extension up to mirroring the strings.
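For concreteness, the subsequence relation (and hence the supersequence relation, which is its inverse) can be tested greedily in time O(|w|). The following Python sketch is purely illustrative and not part of the algorithms of this paper:

```python
def is_subsequence(u, w):
    """Test u <=_sub w: greedily match the symbols of u left to right in w."""
    it = iter(w)
    return all(x in it for x in u)  # `x in it` advances the iterator past x

def is_supersequence(u, w):
    """Test u <=_sup w, which by definition means w <=_sub u."""
    return is_subsequence(w, u)
```

The greedy strategy is correct because matching each symbol of u at its earliest possible position in w never rules out a later match.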

Variants of Regex Matching.

The well-known regex matching problem is to decide whether w ∈ 𝖫(r) for a given regular expression r and input string w. As explained above, this generalises to the εNFA acceptance problem, where we want to decide whether w ∈ 𝖫(A) for a given εNFA A and input string w. In this paper, we study the ⪯-matching problem for a string relation ⪯ among those presented above: for a given string w ∈ Σ* and an εNFA A, decide whether A accepts a string u with u ⪯ w, i. e., decide whether Λ⪯(w) ∩ 𝖫(A) ≠ ∅.

We also define quantitative versions of the matching problem. For a string relation , the min-variant of the -matching problem is as follows: For a given string wΣ and an εNFA A, compute a shortest string u with uw and u𝖫(A), or report that no such string u exists. The max-variant of the -matching problem is as follows: For a given string wΣ and an εNFA A, either report that there are arbitrarily large strings u with uw and u𝖫(A), or compute a longest string u with uw and u𝖫(A), or report that no such string u exists. Further, all lower bounds on the quantitative versions of the matching problem will already apply in the setting where we are only required to compute the length of the resulting string.

Finally, we define a universal variant of the matching problem. For a string relation ⪯, the universal-variant of the ⪯-matching problem is as follows: for a given string w ∈ Σ* and an εNFA A, decide whether A matches all strings u with u ⪯ w, i. e., decide the inclusion Λ⪯(w) ⊆ 𝖫(A).

See Figure 1 for an overview of all our upper and lower bounds with respect to the variants of the regex matching problem. Recall that m is the size of the input εNFA.

① Upper bounds:

                    ⪯𝗂𝗇        ⪯𝗉𝗋𝖾       ⪯𝖾𝗑𝗍/⪯𝗅𝖾𝗑𝗍   ⪯𝗌𝗎𝖻       ⪯𝗌𝗎𝗉
⪯-matching          O(|w|m)    O(|w|m)    O(|w|m)      O(|w|+m)   O(|w|+m)
min-variant         O(|w|m)    O(|w|m)    O(|w|m)      O(|w|m)    O(|w|m)
max-variant         O(|w|m)    O(|w|m)    O(|w|m)      O(|w|m)    O(|w|m)
universal-variant   O(|w|²m)   O(|w|m)    PSPACE       coNP       PSPACE

② Conditional lower bounds:

                    ⪯𝗂𝗇/⪯𝗉𝗋𝖾             ⪯𝖾𝗑𝗍/⪯𝗅𝖾𝗑𝗍           ⪯𝗌𝗎𝖻                 ⪯𝗌𝗎𝗉
⪯-matching          no O((|w|m)^{1−ϵ})   no O((|w|m)^{1−ϵ})   —                    —
min-variant         no O((|w|m)^{1−ϵ})   no O((|w|m)^{1−ϵ})   no O(|w|+m)          no O((|w|m)^{1−ϵ})
max-variant         no O((|w|m)^{1−ϵ})   no O((|w|m)^{1−ϵ})   no O((|w|m)^{1−ϵ})   no O(|w|+m)
universal-variant   no O((|w|m)^{1−ϵ})   PSPACE-hard          coNP-hard            PSPACE-hard

Figure 1: Upper bounds ① and conditional lower bounds ② for the different problem variants. Recall that m is the size of the εNFA.

Computational Model and Basic Input-Preprocessing.

The computational model we use to state our algorithms is the standard unit-cost word RAM with logarithmic word-size ω, which allows processing inputs of size n, where ω ≥ log n. This is a standard computational model for the analysis of algorithms, defined in [22]. To make some of our algorithms faster, we use the standard technique of lazy initialisation of arrays [30] to avoid spending time linear in the array size to initialise its cells. The time needed, after the lazy initialisation, to store a value in a cell of the array, or to check if a cell of the array was initialised and if so return the value it stores, is O(1).
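The lazy-initialisation technique can be sketched as follows. This is an illustrative Python rendering of the classic back-pointer scheme from [30]; in Python the allocations are in fact zero-filled, which we simply treat as if it were uninitialised garbage:

```python
class LazyArray:
    """Lazily initialised array: O(1) set/get with no initialisation pass.
    The back/stack pair witnesses which cells were ever written, so stale
    (conceptually garbage) contents of val and back are never trusted."""

    def __init__(self, size):
        # On a word RAM these allocations would be left uninitialised;
        # Python zero-fills them, which we treat as arbitrary garbage.
        self.val = [0] * size
        self.back = [0] * size
        self.stack = []  # cells written so far, in order of first write

    def _written(self, i):
        b = self.back[i]
        return b < len(self.stack) and self.stack[b] == i

    def set(self, i, v):
        if not self._written(i):
            self.back[i] = len(self.stack)
            self.stack.append(i)
        self.val[i] = v

    def get(self, i, default=None):
        return self.val[i] if self._written(i) else default
```

A cell i counts as initialised only if back[i] points into the stack and the stack entry points back at i, a condition that garbage values cannot satisfy by accident.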

Recall that the inputs for our problems (see previous paragraph) are always an εNFA and a string over Σ. For these inputs, we make the following assumptions. For an εNFA A=(Q,Σ,q0,q𝖿,δ) with |Q|=n, |Σ|=σ, and m=|δ|, we assume that Q={1,,n} and Σ={1,,σ}; we can assume that both Q and Σ are ordered sets, w.r.t. the canonical order on natural numbers. Hence, the processed strings are sequences of integers (representing the symbols), each of these integers fitting in one memory word. This is a common assumption in string algorithms: the input alphabet is said to be an integer alphabet (see, e. g., [17]). As we assume εNFAs to be trimmed, we also know that n,σ are in O(m). Moreover, an εNFA A is given by its number of states, size of the input alphabet, initial and final state, and a list of m transitions of the form (q,a,q), with q,qQ and aΣ{ε}.

Further, using radix sort, we can prepare for each state q having outgoing transitions the list L[q] of these transitions, sorted first according to the symbol labelling them, and then by the target state; for states q without outgoing transitions, L[q] is the empty list. This allows us to simultaneously construct, for each q ∈ Q and a ∈ Σ ∪ {ε} for which q has an outgoing a-transition, the list T[q,a] of transitions of the form (q,a,q′); T[q,a] is undefined for q ∈ Q and a ∈ Σ ∪ {ε} in the case where q has no outgoing a-transition. We make the convention that, if a self-loop (q,a,q) exists, then it is the first transition stored in T[q,a]; this allows us to test in O(1) whether there exists a self-loop (q,a,q) for any q ∈ Q and a ∈ Σ ∪ {ε}. Both L[⋅] and T[⋅,⋅] are implemented as arrays (of lists), which are lazily initialised (as mentioned above). So, computing the defined elements of L[⋅] and T[⋅,⋅] can be done in linear time, i. e., in O(m).
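This preprocessing might be rendered as follows. In this illustrative Python sketch, stable bucket passes play the role of radix sort over the integer-encoded states and symbols, and plain dicts stand in for the lazily initialised arrays L[⋅] and T[⋅,⋅]:

```python
def preprocess(n, sigma, transitions):
    """Build, for each state q, the list L[q] of its outgoing transitions
    sorted by (label, target), and T[q][a], its sub-list with label a.
    States are 1..n; labels are 0..sigma, where 0 plays the role of epsilon.
    Three stable counting passes (radix sort) keep the total time O(m).
    By convention, a self-loop (q, a, q) is moved to the front of T[q][a],
    so its existence can be tested in O(1)."""
    def counting_pass(items, key, size):
        buckets = [[] for _ in range(size)]
        for t in items:
            buckets[key(t)].append(t)
        return [t for b in buckets for t in b]

    ts = counting_pass(transitions, lambda t: t[2], n + 1)  # by target state
    ts = counting_pass(ts, lambda t: t[1], sigma + 1)       # then by label
    ts = counting_pass(ts, lambda t: t[0], n + 1)           # then by source

    L, T = {}, {}
    for t in ts:
        q, a, q2 = t
        L.setdefault(q, []).append(t)
        bucket = T.setdefault(q, {}).setdefault(a, [])
        if q == q2:
            bucket.insert(0, t)  # self-loop first, per the convention above
        else:
            bucket.append(t)
    return L, T
```

Because each counting pass is stable, the final order within L[q] is sorted by label and then by target, exactly as described.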

3 State-Set Simulation and Simple Upper Bounds

Given a string wΣ and a regular expression represented as an εNFA A, we can solve the regex matching problem of deciding whether w𝖫(A) using the classical approach of state-set simulation. It is well-known [9] that, assuming the strong exponential time hypothesis (SETH), this yields an optimal O(|w||A|) algorithm. For our variants of regex matching, it is often possible to build an εNFA A from A that accepts suitable strings (e. g., the subsequences or supersequences of strings of 𝖫(A)), so that we can solve the matching problem by checking whether w𝖫(A) by state-set simulation (though this is generally not optimal). Since many of our algorithms will nevertheless be specialised variants of state-set simulation, we review the technique in more detail below.

State-Set Simulation for 𝜺NFA.

For an NFA A without ε-transitions and an input string w, the state-set simulation starts with S0 = {q0} and then, for every i = 1, 2, …, |w|, computes the set Si = 𝖢w[i](Si−1), where, for any state set S and symbol b ∈ Σ, we call 𝖢b(S) the set of all states that can be reached from a state in S by a b-transition, i. e., 𝖢b(S) = {q ∣ ∃p ∈ S: (p,b,q) ∈ δ}. Clearly, each Si contains exactly the states that are reachable from q0 by a w[1:i]-labelled path. Now if we are dealing with an εNFA, then we can use the same idea, but we also have to compute the ε-closures 𝖢ε(S) in between, i. e., the set of all states that can be reached from a state in S by a path of ε-transitions. More precisely, we start with S0 = 𝖢ε({q0}) and then, for every i = 1, 2, …, |w|, we compute the set Si = 𝖢ε(𝖢w[i](Si−1)). Now each Si contains exactly the states that are reachable from q0 by a path that is labelled with w[1:i] (with ε-transitions being ignored in the path label). Computing S0 is called the initialisation and computing Si from Si−1 is called an update step of the state-set simulation. For every i = 0, 1, 2, …, |w|, the set Si is the set of active states at step i. We also say that a state p is active at step i if p ∈ Si.
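The initialisation and update steps above can be sketched directly. This is an illustrative Python rendering in which None plays the role of ε and plain dicts stand in for the transition tables T[⋅,⋅]:

```python
from collections import deque

def eps_closure(states, delta_eps):
    """C_eps(S): BFS from all states of S along epsilon-transitions."""
    seen = set(states)
    queue = deque(states)
    while queue:
        p = queue.popleft()
        for q in delta_eps.get(p, ()):
            if q not in seen:
                seen.add(q)
                queue.append(q)
    return seen

def accepts(q0, qf, transitions, w):
    """State-set simulation of an epsilon-NFA in time O(|w| m).
    None is used as the label of epsilon-transitions."""
    delta, delta_eps = {}, {}
    for (p, a, q) in transitions:
        if a is None:
            delta_eps.setdefault(p, []).append(q)
        else:
            delta.setdefault((p, a), []).append(q)
    S = eps_closure({q0}, delta_eps)            # initialisation: S_0
    for b in w:                                 # update step: S_i
        S = eps_closure({q for p in S for q in delta.get((p, b), ())},
                        delta_eps)
    return qf in S
```

Each update step touches every transition at most once, giving the O(|w|m) bound discussed next.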

Given a state set S and a symbol b ∈ Σ, computing the set 𝖢b(S) can easily be done in time O(m), since we only have to compute, for every p ∈ S, the states q with a transition (p,b,q), which are given by T[p,b]. Thus, we have to inspect every transition at most once. Computing the ε-closure 𝖢ε(S) can also be done in time O(m) by a breadth-first search (BFS) that starts in all states of S and only considers ε-transitions (again, the entries T[p,ε] can be used for that). This means that the initialisation and each update step can be done in time O(m), which yields a total running time of O(|w|m) for the whole state-set simulation. What is more, this running time is conditionally optimal. Indeed:

Lemma 3.1 ([9]).

Given an εNFA A and a string w, we can check in time O(|A||w|) whether w ∈ 𝖫(A), but not in time O((|A||w|)^{1−ϵ}) for any ϵ > 0 unless SETH fails.

Solving Sub- and Supersequence Matching via State-Set Simulation.

We now explain how we can use state-set simulation to solve the ⪯-matching problem, specifically for the subsequence and supersequence relations (which are our main focus in this paper). It is easy to see that we can transform an εNFA A into two εNFAs A𝗌𝗎𝖻 and A𝗌𝗎𝗉 such that w ∈ 𝖫(A𝗌𝗎𝖻) ⇔ Λ𝗌𝗎𝖻(w) ∩ 𝖫(A) ≠ ∅ and w ∈ 𝖫(A𝗌𝗎𝗉) ⇔ Λ𝗌𝗎𝗉(w) ∩ 𝖫(A) ≠ ∅. For example, this is done in [8, Lemma 8]. We review this construction as we adapt it later.

The εNFA A𝗌𝗎𝖻 is obtained from A by simply adding a transition (p,b,p) for every state p and b ∈ Σ. Intuitively speaking, these loops equip A with the general possibility to consume symbols from w without transitioning to a new state, which corresponds to ignoring symbols from w. Hence, the “non-ignoring” transitions of an accepting run of A𝗌𝗎𝖻 on w spell out a subsequence of w accepted by A. On the other hand, the εNFA A𝗌𝗎𝗉 is obtained from A by simply adding a transition (p,ε,q) for every existing transition (p,b,q) with b ∈ Σ. This means that while reading w the automaton can always virtually process some symbols that do not belong to w. Hence, in an accepting run of A𝗌𝗎𝗉 on w, the actual transitions that read the symbols from w, together with the added ε-transitions that virtually read some symbols, spell out a supersequence of w accepted by A.
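The two constructions can be written down directly; in this illustrative Python sketch, None again stands for ε and an automaton is represented by its transition list:

```python
def upward_closure_nfa(transitions, states, alphabet):
    """A_sub: add a self-loop (p, b, p) for every state p and symbol b,
    letting the automaton consume (i.e. ignore) arbitrary symbols of w."""
    return list(transitions) + [(p, b, p) for p in states for b in alphabet]

def downward_closure_nfa(transitions):
    """A_sup: add an epsilon-copy (p, None, q) of every labelled transition,
    letting the automaton virtually read symbols that are absent from w."""
    return list(transitions) + [(p, None, q)
                                for (p, a, q) in transitions if a is not None]
```

The size bounds stated below can be read off these definitions: A𝗌𝗎𝖻 adds n|Σ| self-loops, whereas A𝗌𝗎𝗉 at most doubles the number of transitions.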

Consequently, we can solve the subsequence and supersequence matching problem by checking whether w𝖫(A𝗌𝗎𝖻) and w𝖫(A𝗌𝗎𝗉), respectively, via state-set simulation. Since |A𝗌𝗎𝖻|=O(n|Σ|+m) and |A𝗌𝗎𝗉|=O(m), we get the following:

Theorem 3.2.

The subsequence matching problem can be solved in time O(|w|(n|Σ|+m)) and the supersequence matching problem can be solved in time O(|w|m).

It has been observed in [8, Lemma 9] that the εNFAs A𝗌𝗎𝖻 and A𝗌𝗎𝗉 have special properties that can be exploited in the state-set simulations to improve the bounds of Theorem 3.2.

Firstly, for A𝗌𝗎𝖻, whenever a state p of A𝗌𝗎𝖻 is added to the set of active states in the state-set simulation, it will stay active until the end of the state-set simulation. This is due to the existence of the transition (p,b,p) for every b ∈ Σ. Consequently, for the state-set simulation of A𝗌𝗎𝖻, we have S0 ⊆ S1 ⊆ ⋯ ⊆ S|w|. Secondly, for A𝗌𝗎𝗉, we can observe that the state-set simulation starts with S0 = Q, i. e., the set of all states, which is due to the fact that every state of A is reachable from q0 by ε-transitions and therefore is in the ε-closure of q0 with respect to A𝗌𝗎𝗉. Moreover, when a state q is removed, it is never added back. Indeed, assume by contradiction that q ∉ Si−1 but q gets added to Si by an update step. Then q is not reachable from Si−1 by ε-transitions, but it is reachable from 𝖢w[i](Si−1) by ε-transitions: this is impossible because in A𝗌𝗎𝗉 there is a transition (p,ε,q) for every existing transition (p,b,q) with b ∈ Σ. Thus, S0 ⊇ S1 ⊇ ⋯ ⊇ S|w|.

This means that in the state-set simulation for A𝗌𝗎𝖻 and A𝗌𝗎𝗉 on an input string w, we can encounter at most n+1 different sets of active states. Further, for each set of active states, there are at most |Σ| possible update steps necessary (since if 𝖢ε(𝖢w[i+1](Si)) = Si and w[i+1] = w[j+1] and Si = Sj for some i < j hold, then we do not need to update Sj). Every actual update can again be done in time O(m), which yields the following:

Theorem 3.3.

There are algorithms that solve the subsequence matching problem in time O(|w|+n|Σ|(n|Σ|+m)) and the supersequence matching problem in time O(|w|+n|Σ|m).

This is already a significant improvement over Theorem 3.2, since the running time is linear in |w|. In the next two sections, we look deeper into the problem of subsequence and supersequence matching and manage to lower its complexity to the optimum of O(|w|+m).

4 Subsequence and Supersequence Matching in Linear Time

In this section, we prove that the 𝗌𝗎𝖻- and 𝗌𝗎𝗉-matching problem can be solved in linear time O(|w|+m). Note that this also holds for the usual regex matching problem when the input regex or εNFA is assumed to accept an upward-closed or downward-closed language (as matching is then equivalent to 𝗌𝗎𝖻- or 𝗌𝗎𝗉-matching). We first prove the result for the subsequence relation, which is slightly easier, and then cover the supersequence relation.

4.1 Subsequence Matching in Linear Time

Theorem 4.1.

Given a string w and εNFA A with n states and m transitions, the subsequence matching problem can be solved in time O(|w|+m).

Proof Sketch.

We perform the state-set simulation of A𝗌𝗎𝖻 on w, i. e., we compute the state sets S0, …, S|w|, but without explicitly constructing A𝗌𝗎𝖻, i. e., we work on A directly. Thus, the correctness follows from w ∈ 𝖫(A𝗌𝗎𝖻) ⇔ Λ𝗌𝗎𝖻(w) ∩ 𝖫(A) ≠ ∅ (see Section 3).

To do this in time O(|w|+m), we can consider every transition only a constant number of times, which is possible due to the property S0 ⊆ S1 ⊆ ⋯ ⊆ S|w|. More precisely, when we update Si−1 to Si, we have to compute S = 𝖢w[i](Si−1) and then 𝖢ε(S). However, if q is added to S due to a transition (p,w[i],q) with p ∈ Si−1, then q will stay in the set of active states until the end of the state-set simulation, and the transition (p,w[i],q) can therefore be ignored afterwards (i. e., it will not play a role in the computation of 𝖢w[j](Sj−1) for any j > i). In the computation of 𝖢w[i](Si−1), we therefore consider only unmarked transitions (p,w[i],q) with p ∈ Si−1 (i. e., the ones not already used in any previous update step) and then mark them when we use them. After having computed 𝖢w[i](Si−1), we must compute 𝖢ε(𝖢w[i](Si−1)), but again we can note that transitions (p,ε,q) that have already been traversed while computing some previous 𝖢ε(𝖢w[j](Sj−1)) with j < i can be ignored, since they necessarily lead to states that are already in Si−1 and therefore in Si. Consequently, when we compute the ε-closure, we ignore marked transitions and also mark all transitions that we traverse; this also handles the initialisation, where we compute 𝖢ε({q0}).

To implement this idea efficiently, we use a lazily initialised array H[⋅] of lists, indexed by the symbols of Σ, such that, after Si−1 is computed, H[a] contains exactly the unmarked transitions (p,a,q) with p ∈ Si−1. Hence, the transitions from H[w[i]] are exactly the ones that we need to compute 𝖢w[i](Si−1) as described above. While computing 𝖢w[i](Si−1), we store all new states (i. e., the states discovered via transitions from H[w[i]] that are not in Si−1 already) into a queue R. Then we compute the ε-closure 𝖢ε(R), ignoring marked transitions and marking traversed transitions. During the computation of 𝖢ε(R), we add to R all new states that we discover. Finally, in order to update H, we only have to look at outgoing transitions (p,a,q) with p ∈ R (which are necessarily unmarked) and add them to H[a].

In this procedure, every transition is either not marked at all, or it is marked in the computation of some ε-closure (including the initialisation) and then ignored, or it is put into H[a] at some update step (or initialisation) and then marked at some later update step and then ignored. Hence, in addition to the |w| update steps, we have to spend O(m) additional time over the whole state-set simulation.
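The proof sketch might be rendered as follows. This is an illustrative Python sketch under the stated conventions: None stands for ε, plain dicts and lists replace the lazily initialised arrays, and a transition counts as "marked" once it has been filed into and consumed from H (each of which happens at most once per transition):

```python
from collections import deque

def has_matching_subsequence(n, q0, qf, transitions, w):
    """Check in time O(|w| + m + n) whether some subsequence of w is
    accepted by the epsilon-NFA with states 0..n-1 (simulating A_sub on w
    without constructing it)."""
    out = {}  # state -> indices of its outgoing transitions
    for i, (p, a, q) in enumerate(transitions):
        out.setdefault(p, []).append(i)

    active = [False] * n
    H = {}       # symbol -> unmarked transition indices from active states
    R = deque()  # queue of newly activated states

    def activate(q):
        if not active[q]:     # states stay active forever (S_i monotone)
            active[q] = True
            R.append(q)

    def flush():
        # Process newly active states: follow (and thereby mark) their
        # epsilon-transitions; file their labelled transitions into H.
        while R:
            p = R.popleft()
            for i in out.get(p, ()):
                _, a, q = transitions[i]
                if a is None:
                    activate(q)
                else:
                    H.setdefault(a, []).append(i)

    activate(q0)
    flush()                    # initialisation: C_eps({q0})
    for b in w:
        for i in H.pop(b, ()):  # use and thereby mark each transition once
            activate(transitions[i][2])
        flush()
    return active[qf]
```

Since every transition enters H at most once and is popped at most once, the total work beyond reading w is O(m), matching the analysis in the proof sketch.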

4.2 Supersequence Matching in Linear Time

The general idea is again to check w ∈ 𝖫(A𝗌𝗎𝗉), and unlike in the previous subsection, we can afford to build A𝗌𝗎𝗉 explicitly in time O(m). However, performing a state-set simulation with A𝗌𝗎𝗉 in linear time is more difficult. Intuitively speaking, in order to obtain Si, we have to remove from Si−1 all states that cannot be reached by any w[i]-labelled path from some other state of Si−1. It is not clear how this can be done by touching each transition only a constant number of times over the whole course of the state-set simulation. One ingredient to achieve this is to first decompose A𝗌𝗎𝗉 into its strongly connected components (SCCs).

Recall that the SCCs of a directed graph G are the maximal subsets of vertices R such that, for any two distinct vertices u and v in R, there is a directed path from u to v and from v to u in G. The SCCs of an εNFA are simply its SCCs when seeing the automaton as a directed graph (ignoring the edge labels in Σ{ε}).

The condensation of an εNFA A, denoted by cond(A), is an εNFA whose states are the SCCs of A. For convenience, for every state p of A, we denote by SCCA[p] (or simply SCC[p] if A is clear from the context) the SCC of A that contains p, which, by definition, is a state of cond(A). The transitions of cond(A) are as follows: for every transition (p,a,q) of A, we have a transition (SCCA[p],a,SCCA[q]) in cond(A). Note that SCCA[p]=SCCA[q] is possible, and note that several transitions may be merged in cond(A). The initial state of cond(A) is SCCA[q0] and the final state of cond(A) is SCCA[q𝖿].

Proposition 4.2.

Given an εNFA A, we can construct cond(A) in time O(m).
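A standard way to obtain Proposition 4.2 is a two-pass (Kosaraju-style) SCC computation. The following Python sketch uses our own encoding (integer states, transitions as (p, a, q) triples); it returns the SCC index of each state together with the condensed transition set, in time O(n+m):

```python
from collections import defaultdict

def condense(n, edges):
    """Kosaraju SCC computation on the εNFA seen as a directed graph
    (edge labels ignored for the SCCs, kept for the condensed transitions).
    Returns (comp, cedges): comp[q] is the SCC index of state q, and
    cedges is the set of transitions (comp[p], a, comp[q])."""
    adj, radj = defaultdict(list), defaultdict(list)
    for p, a, q in edges:
        adj[p].append(q)
        radj[q].append(p)
    # first pass: iterative DFS on the graph, record finishing order
    order, seen = [], [False] * n
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, 0)]
        while stack:
            v, i = stack.pop()
            if i < len(adj[v]):
                stack.append((v, i + 1))
                u = adj[v][i]
                if not seen[u]:
                    seen[u] = True
                    stack.append((u, 0))
            else:
                order.append(v)
    # second pass: DFS on the reverse graph in reverse finishing order
    comp, c = [-1] * n, 0
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s] = c
        stack = [s]
        while stack:
            v = stack.pop()
            for u in radj[v]:
                if comp[u] == -1:
                    comp[u] = c
                    stack.append(u)
        c += 1
    cedges = {(comp[p], a, comp[q]) for p, a, q in edges}
    return comp, cedges
```

Note that, as in the definition above, parallel transitions of A that connect the same SCCs with the same label are merged in the condensed transition set.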

Leaving aside the self-loops, the condensation of an εNFA has the convenient property of being a directed acyclic graph (DAG). However, constructing the condensation of an automaton changes in general the language that it accepts. However, we shall next see that this is not the case for the εNFA A𝗌𝗎𝗉, i. e., we have 𝖫(A𝗌𝗎𝗉)=𝖫(cond(A𝗌𝗎𝗉)).

Lemma 4.3.

Let A be an εNFA and let w∈Σ*. Then 𝖫(A𝗌𝗎𝗉)=𝖫(cond(A𝗌𝗎𝗉)).

Hence, A accepts a supersequence of w if and only if w∈𝖫(A𝗌𝗎𝗉) (by Section 3), which holds if and only if w∈𝖫(cond(A𝗌𝗎𝗉)) (by Lemma 4.3). Moreover, cond(A𝗌𝗎𝗉) inherits the crucial properties of A𝗌𝗎𝗉, namely:

Proposition 4.4.

If cond(A𝗌𝗎𝗉) has a transition (C,b,C′) for some b∈Σ, then it also has a transition (C,ε,C′). Further, if S0,S1,…,S|w| are the state sets of the state-set simulation of cond(A𝗌𝗎𝗉) on w, then S0⊇S1⊇⋯⊇S|w|, and S0 contains all states of cond(A𝗌𝗎𝗉).

Our next goal is to show that we can implement the state-set simulation of cond(A𝗌𝗎𝗉) on w in linear time. For this, we will need to efficiently identify states C∈Si−1 that do not have a self-loop for the next input symbol w[i] (since if such a self-loop exists, then C∈Si holds). Identifying such states is challenging: while there are at most m self-loops, there might be up to n|Σ| pairs (C,a) overall where C does not have a self-loop labelled a, so we cannot explicitly materialise these pairs. Instead, we use a specific data structure:

Lemma 4.5.

For an εNFA A=(Q,Σ,q0,q𝖿,δ), we can build in time O(1) a data structure storing a set of states, initially empty, and supporting the following operations:

  • Push: Insert a state into the set.

  • Pop: Given a∈Σ, retrieve and remove all states q without a self-loop labelled with a (or indicate that no such state exists).

Over a sequence of ℓ push and pop operations in which each state of A is pushed at most once, the total running time is O(ℓ+m), where m is the number of transitions of A.
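To illustrate the interface of Lemma 4.5 (and only the interface: the paper's amortised implementation is not reproduced here), the following deliberately naive Python sketch realises the push/pop contract. Each pop scans the whole set, so it does not achieve the O(ℓ+m) bound, but it shows what the algorithm below relies on:

```python
from collections import defaultdict

class LoopFilter:
    """Naive realisation of the push/pop interface of Lemma 4.5.
    Correct, but each pop scans the whole current set, so it does NOT
    meet the amortised O(ℓ+m) running time of the lemma."""

    def __init__(self, transitions):
        # self_loops[q]: set of labels a with a self-loop (q, a, q)
        self.self_loops = defaultdict(set)
        for p, a, q in transitions:
            if p == q and a != '':
                self.self_loops[p].add(a)
        self.current = set()

    def push(self, q):
        self.current.add(q)

    def pop(self, a):
        """Retrieve and remove all stored states without an a-self-loop."""
        removed = {q for q in self.current if a not in self.self_loops[q]}
        self.current -= removed
        return removed
```

For instance, after pushing states 0, 1 and 2 of an automaton where 0 has only an a-self-loop, 1 has only a b-self-loop, and 2 has both, pop('a') returns {1} and a subsequent pop('b') returns {0}, leaving only state 2 stored.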

We will use the above structure for A′=cond(A𝗌𝗎𝗉). We can finally show our main result:

Theorem 4.6.

Given a string w and an εNFA A with n states and m transitions, the supersequence matching problem can be solved in time O(|w|+m).

Proof Sketch.

By our considerations from above, we only have to check whether w∈𝖫(A′), where A′=cond(A𝗌𝗎𝗉)=(Q′,Σ,q0′,q𝖿′,δ′) with n′=|Q′| and m′=|δ′|; recall that A′ can be constructed in linear time. We will do this by a state-set simulation as usual, but we exploit the properties of A′ (see Proposition 4.4) to implement it in linear time.

We consider an arbitrary update step. The current state-set Si−1 is a downward closed set of states (meaning that if a state is in Si−1, then so are all its successors); this follows from the fact that A′ is a DAG and from Proposition 4.4. The states in Q′∖Si−1 and their outgoing transitions can be completely ignored (they will never become active again, nor be responsible for removing some states of the current set of active states), so we may as well assume that they are not part of A′ anymore. As A′ is a DAG, the set Si−1 has some roots (at least one) without incoming transitions (apart from transitions from Q′∖Si−1, which we ignore anyway). Let R be the set of these roots; it will be maintained efficiently throughout the computation using the data structure of Lemma 4.5. To compute Si, we now have to identify and remove those states of Si−1 that cannot be reached by any w[i]-labelled path from some state of Si−1, which can be done as follows.

For every root p∈R we know that p∈Si if and only if p has a w[i]-self-loop, and if p has such a self-loop, then all its successors are also in Si (and can thus be ignored in the rest of the update procedure). The roots p∈R without a w[i]-self-loop are found using Lemma 4.5 and removed together with their outgoing transitions; further, for every such outgoing transition (p,x,q) with x∈Σ∪{ε}, we have to do the following. If x=w[i], then q∈Si; and if x≠w[i], then we check whether q has a w[i]-self-loop and, if so, q∈Si as well. If, on the other hand, q has no w[i]-self-loop and at least one other incoming transition, then we do not yet know whether it is in Si or not, since this might be determined by other incoming transitions. If x≠w[i] and q has no w[i]-self-loop, but q becomes a root when we delete (p,x,q), then it has to be treated as one of the initial roots (i. e., it will be deleted at some point and its outgoing transitions have to be deleted and processed as described above). The states that become roots can easily be found by keeping, for each state q, a count of the number of transitions that still lead to q. This procedure is repeated until we have taken care of all the initial roots and of all states that become roots (and are not already determined to be in Si) during this procedure. Once this terminates, we have computed the set Si, and the set of roots has also been updated accordingly.

Intuitively, the correctness follows from the fact that our algorithm constructs exactly the same sets Si as the state-set simulation. With respect to the running time, note that we perform O(|w|) update steps and consider each transition at most once. However, at the beginning of each update step (say, when computing Si), we need access to all the roots of Si−1 without w[i]-self-loops. For this, we use the data structure from Lemma 4.5. We initially perform O(n) push operations (corresponding to the roots of A′) and then, over the course of all update steps, we push and pop each state at most once, with one pop call per update step. So we perform O(n+|w|) operations in total, which, by Lemma 4.5, adds up to O(n+|w|+m) time.
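As a correctness baseline for the characterisation used above (A accepts a supersequence of w if and only if w∈𝖫(A𝗌𝗎𝗉)), the following Python sketch performs the plain O(|w|m) state-set simulation on A𝗌𝗎𝗉; it computes the same sets Si as the linear-time algorithm, just without the condensation and the amortised bookkeeping (the encoding, with '' standing for ε, is ours):

```python
from collections import defaultdict

def accepts_supersequence(n, transitions, q0, qf, w):
    """Does the εNFA A accept some supersequence of w?  We build A_sup by
    adding an ε-copy of every Σ-transition (Section 3) and run the standard
    state-set simulation of w on A_sup: O(m) per step, O(|w|·m) overall."""
    sup = list(transitions) + [(p, '', q) for (p, a, q) in transitions if a != '']
    eps = defaultdict(list)   # ε-successor lists
    sym = defaultdict(list)   # sym[p, a]: a-successors of p
    for p, a, q in sup:
        (eps[p] if a == '' else sym[p, a]).append(q)

    def closure(states):
        stack = list(states)
        while stack:
            p = stack.pop()
            for q in eps[p]:
                if q not in states:
                    states.add(q)
                    stack.append(q)
        return states

    current = closure({q0})
    for c in w:
        current = closure({q for p in current for q in sym[p, c]})
    return qf in current
```

For instance, if A accepts only "ab", then w = "b" is matched (since "ab" is a supersequence of "b"), while w = "ba" is not.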

5 Quantitative Problem Variants

For every x∈{𝗂𝗇,𝗉𝗋𝖾,𝖾𝗑𝗍,𝗅𝖾𝗑𝗍,𝗌𝗎𝖻,𝗌𝗎𝗉}, we consider now the min- and max-variant of the x-matching problem, i. e., computing a shortest or longest string u∈𝖫(A) that stands in relation x to w. We can show an upper bound of O(|w|m) for all these problem variants. (Observe that, of course, these upper bounds also constitute upper bounds for the (non-quantitative) matching problems for 𝗂𝗇, 𝗉𝗋𝖾, 𝖾𝗑𝗍 and 𝗅𝖾𝗑𝗍; see the upper table in Figure 1.)

Theorem 5.1.

The min- and max-variant of the matching problem can be solved in time O(|w|m) for every relation x∈{𝗂𝗇,𝗉𝗋𝖾,𝖾𝗑𝗍,𝗅𝖾𝗑𝗍,𝗌𝗎𝖻,𝗌𝗎𝗉}.

However, there are some differences with respect to the lower bounds (to be presented in Section 7): We can show that the O(|w|m) upper bound is conditionally tight, i. e., O((|w|m)^{1−ϵ}) is not possible unless SETH fails, for all problem variants except the min-variant of subsequence matching and the max-variant of supersequence matching. For these two variants, we can only show the somewhat weaker statement that an O(|w|+m) algorithm yields an O(|G|) algorithm that decides whether a dense graph G contains a triangle (note that any truly subcubic combinatorial algorithm for triangle detection implies a truly subcubic combinatorial algorithm for Boolean matrix multiplication [61]).

Let us now give proof sketches for the results of Theorem 5.1.

General Proof Technique.

We prove the upper bounds of the different problem variants in the same way: we reduce A and w to a (Σ∪{ε})-labelled directed graph G with edge weights of 0 or 1 and with a designated start vertex s and target vertex t, such that a shortest or longest infix, prefix, extension, left-extension, subsequence or supersequence corresponds to a path from s to t with minimum or maximum weight, respectively; the label of the path corresponds to the witness string (i. e., the infix, prefix, etc.) and its weight to the length of the witness string. By basic graph algorithmic techniques, it is not difficult to check in time O(|G|) whether there is a path from s to t and, if so, to compute such a path (along with its label) having minimum or maximum weight (or to report that the weight is unbounded for such paths). Since O(|G|)=O(|w|m) for all our reductions, the O(|w|m) upper bound follows. Let us in the following sketch the reductions for the different relations.

Infix and Prefix Relations.

Let us just discuss the infix relation, since the prefix relation is similar. We can build in time O(|w|m) the product graph GA,w, which has vertices {0,…,|w|}×Q and edges as follows. For each transition (q,ε,q′) in A and for each i∈{0,…,|w|}, add an edge from (i,q) to (i,q′) in GA,w with label ε and weight 0. For each transition (q,b,q′) with b∈Σ and for each i∈{1,…,|w|} with w[i]=b, add an edge from (i−1,q) to (i,q′) in GA,w with label b and weight 1.

To this product graph GA,w we add a source vertex s with ε-labelled edges with weight 0 to (i,q0) for each i∈{0,…,|w|}, and a target vertex t with ε-labelled edges with weight 0 from (i,q𝖿) for each i∈{0,…,|w|}; we call the resulting graph G. It can now be seen that paths from s to t in G are in one-to-one correspondence with accepting runs of A over the infixes of w, with the weight of the path being the length of the corresponding infix. As mentioned above, we can compute weight-minimum and weight-maximum paths from s to t along with their weights in time O(|G|)=O(|w|m).
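For the min-variant of infix matching, this reduction can be realised as follows; the sketch uses our own encoding (integer states, '' for ε), and 0-1 BFS stands in for the unspecified "basic graph algorithmic techniques":

```python
from collections import deque

def shortest_matching_infix(n, transitions, q0, qf, w):
    """Length of a shortest u ∈ L(A) that is an infix of w (None if there is
    none), via the 0/1-weighted product graph G_{A,w} and 0-1 BFS."""
    INF = float('inf')
    L = len(w)
    # vertex (i, q) is encoded as i*n + q; s and t are two extra vertices
    s, t = L * n + n, L * n + n + 1
    adj = [[] for _ in range(L * n + n + 2)]
    for p, a, q in transitions:
        if a == '':                      # ε-edge within layer i, weight 0
            for i in range(L + 1):
                adj[i * n + p].append((i * n + q, 0))
        else:                            # b-edge from layer i to i+1, weight 1
            for i in range(L):
                if w[i] == a:
                    adj[i * n + p].append(((i + 1) * n + q, 1))
    for i in range(L + 1):
        adj[s].append((i * n + q0, 0))   # start reading at any position
        adj[i * n + qf].append((t, 0))   # stop reading at any position
    # 0-1 BFS: weight-0 edges go to the front of the deque
    dist = [INF] * len(adj)
    dist[s] = 0
    dq = deque([s])
    while dq:
        v = dq.popleft()
        for u, wgt in adj[v]:
            if dist[v] + wgt < dist[u]:
                dist[u] = dist[v] + wgt
                (dq.appendleft if wgt == 0 else dq.append)(u)
    return None if dist[t] == INF else dist[t]
```

For instance, if A accepts only "ba", then for w = "aba" the shortest matching infix has length 2; the max-variant is analogous with a longest-path computation.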

Extension and Left-Extension Relations.

Let us only discuss the case of the extension relation, since the left-extension can be handled similarly (and is generally a bit simpler).

We build again the product graph GA,w, but we also add to GA,w two graphs that are essentially copies of the εNFA A. One copy has vertices {q→ ∣ q∈Q} and edges as follows: for each transition (p,x,q) in A there is an edge from p→ to q→ labelled x, with weight 0 if x=ε and weight 1 otherwise. The other copy has vertices {q← ∣ q∈Q} and is defined in exactly the same way, with the “→” subscripts replaced by “←” subscripts. Then, for each state q∈Q we add an ε-labelled edge with weight 0 from q→ to (0,q), and for each state q∈Q we add an ε-labelled edge with weight 0 from (|w|,q) to q←. We call the resulting graph G and we note that this construction can be done in time O(|w|m). Let the source vertex s of G be q0→ and let the target vertex t of G be q𝖿←.

Now, every path from s to t in G can be decomposed into three parts: (1) a path from s to q→ for some q∈Q, which is labelled with a string v∈Σ*, has weight |v|, and is such that there is a run from q0 to q in A when reading v; (2) a path from (0,q) to (|w|,q′) for some q′∈Q, which is labelled by w, has weight |w|, and witnesses that there is a run from q to q′ in A when reading w; (3) a path from q′← to t labelled with a string v′∈Σ*, with weight |v′|, such that there is a run from q′ to q𝖿 in A when reading v′. Consequently, vwv′ is an extension of w accepted by A. Conversely, it can be seen that, for any strings v,v′∈Σ* such that the extension vwv′ of w is accepted by A, there must be a path in G from s to t labelled with vwv′ and with weight |vwv′|.

Subsequence and Supersequence Relations.

Let us begin with the subsequence relation. We start with the product graph GA,w and add edges with weight 0 and label ε from (i,q) to (i+1,q) for each state q of A and each i∈{0,…,|w|−1}. Let us call these additional edges extra-edges, and note that they mimic the self-loops (q,b,q) for every b∈Σ added in the εNFA A𝗌𝗎𝖻 defined in Section 3, relative to the original εNFA A. However, note that we do not explicitly construct A𝗌𝗎𝖻, as we cannot afford this within the time bound O(|w|m). We call this graph G and note that it can be constructed in time O(|w|m). Let the source s of G be (0,q0) and let the target t be (|w|,q𝖿).

An accepting run of A𝗌𝗎𝖻 on w that corresponds to a subsequence u of w (by the symbols read by the original transitions from A) translates into a path from s to t with label u and weight |u|: Each transition (q,x,q′) of the run that is an original transition of A translates into a corresponding x-labelled edge of GA,w from (i,q) to (i+1,q′) or from (i,q) to (i,q′) (depending on whether x∈Σ or x=ε), where i depends on the length of the prefix of w already consumed; and each loop-transition (q,b,q) that is added to A in the construction of A𝗌𝗎𝖻 translates into an extra-edge from (i,q) to (i+1,q). Conversely, any path from s to t with label u and weight |u| translates into an accepting run of A𝗌𝗎𝖻 on w such that its original transitions of A read exactly u, which therefore must be a subsequence of w.
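The max-variant then amounts to a longest-path computation over this graph. The following Python sketch (our own encoding) assumes, for simplicity, that A has no ε-cycles, so that the graph is a DAG and one topological pass per layer suffices; the general case would first contract ε-cycles:

```python
def longest_matching_subsequence(n, transitions, q0, qf, w):
    """Length of a longest u ∈ L(A) that is a subsequence of w, or None.
    Simplifying assumption: A is ε-acyclic, so the layered graph with the
    weight-0 extra-edges is a DAG and a longest path exists."""
    NEG = float('-inf')
    eps = [[] for _ in range(n)]
    sym = {}
    for p, a, q in transitions:
        if a == '':
            eps[p].append(q)
        else:
            sym.setdefault((p, a), []).append(q)
    # topological order of the states w.r.t. ε-transitions (assumed acyclic)
    indeg = [0] * n
    for p in range(n):
        for q in eps[p]:
            indeg[q] += 1
    order = [q for q in range(n) if indeg[q] == 0]
    for v in order:
        for q in eps[v]:
            indeg[q] -= 1
            if indeg[q] == 0:
                order.append(q)
    L = len(w)
    dist = [[NEG] * n for _ in range(L + 1)]
    dist[0][q0] = 0
    for i in range(L + 1):
        for p in order:                      # in-layer ε-edges, weight 0
            if dist[i][p] > NEG:
                for q in eps[p]:
                    dist[i][q] = max(dist[i][q], dist[i][p])
        if i < L:
            for p in range(n):
                if dist[i][p] > NEG:
                    # extra-edge: skip w[i], weight 0
                    dist[i + 1][p] = max(dist[i + 1][p], dist[i][p])
                    # original transition: consume w[i], weight 1
                    for q in sym.get((p, w[i]), []):
                        dist[i + 1][q] = max(dist[i + 1][q], dist[i][p] + 1)
    return None if dist[L][qf] == NEG else dist[L][qf]
```

For instance, for an automaton accepting a*b* and w = "bab", a longest matching subsequence ("ab" or "bb") has length 2.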

For the supersequence relation, the reduction is slightly different. We start again with the product graph GA,w, but now we add, for each transition (q,b,q′) in A with b∈Σ and each i∈{0,…,|w|}, a b-labelled edge with weight 1 from (i,q) to (i,q′). Now, these extra-edges mimic the ε-transitions added in the εNFA A𝗌𝗎𝗉 defined in Section 3, relative to the original εNFA A. We note that for any ε-transition (q,ε,q′) of A𝗌𝗎𝗉 that is not a transition of A, there is at least one (and possibly more than one) transition (q,b,q′) with b∈Σ in A; thus, there is also at least one extra-edge from (i,q) to (i,q′), and several such extra-edges differ in their labels.

Similarly to the subsequence case, the paths from s to t in G correspond to accepting runs of A𝗌𝗎𝗉 on w, where the label of the path is the supersequence u of w induced by the accepting run and its weight is |u|. The original transitions of A translate into edges of GA,w as before, but every transition (q,ε,q′) of A𝗌𝗎𝗉 that is not a transition of A translates into an extra-edge from (i,q) to (i,q′) of G (where i again depends on the length of the prefix of w that has already been consumed). As observed above, there might be several extra-edges from (i,q) to (i,q′) that differ in their labels; we just take any of those (in fact, this decision only affects the symbols of the supersequence u that do not belong to w, but not its length). Also note that, unlike in the subsequence case, extra-edges now have weight 1, since they correspond to symbols of the supersequence of w (while in the subsequence case they correspond to symbols of w that do not belong to the accepted subsequence u).

6 Universal Problem Variants

We now consider the universal variants, i. e., checking whether all strings u that are related to w (under the respective relation) are in 𝖫(A).

Theorem 6.1.

The universal-variant of the matching problem can be solved in time O(|w|m) for the prefix relation and in time O(|w|2m) for the infix relation. It is PSPACE-complete for the extension and left-extension relation, coNP-complete for the subsequence relation, and PSPACE-complete for the supersequence relation.

Proof Sketch.

Checking whether all prefixes of a string are accepted by an automaton can be easily done via state-set simulation, and checking the same for all infixes can be done similarly by running a state-set simulation from every starting position. Let us show why the cases of the relations 𝖾𝗑𝗍,𝗅𝖾𝗑𝗍,𝗌𝗎𝖻 and 𝗌𝗎𝗉 follow from known facts about finite automata.

The PSPACE upper bound with respect to 𝖾𝗑𝗍 follows from the fact that Λ𝖾𝗑𝗍(w) can be expressed by an εNFA, and εNFA-inclusion can be decided in PSPACE. The reasoning is analogous for 𝗅𝖾𝗑𝗍. The same argument shows PSPACE membership for 𝗌𝗎𝗉 by building an εNFA accepting the supersequences of w. The 𝗌𝗎𝖻-variant is in coNP, since we can guess a subsequence of w and check acceptance for it.

The PSPACE-hardness with respect to 𝖾𝗑𝗍, 𝗅𝖾𝗑𝗍 and 𝗌𝗎𝗉 is easy to see: for w=ε, these problems are equivalent to εNFA-universality, i. e., checking whether 𝖫(A)=Σ*, which is PSPACE-hard. This leaves the coNP-hardness for 𝗌𝗎𝖻, which can be shown as follows. For a 3-CNF formula F over n variables we build an εNFA AF that accepts all {0,1}-strings of length k∈{1,2,…,n−1,n+1,…,2n} and all {0,1}-strings of length n that represent non-satisfying assignments of F. Then, Λ𝗌𝗎𝖻((01)^n)⊆𝖫(AF) if and only if F is not satisfiable.
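The prefix case of the proof sketch can be made concrete in a few lines: all prefixes of w (including ε and w itself) are accepted if and only if every state set S0,…,S|w| of the state-set simulation contains the final state. A small Python sketch, with our own εNFA encoding ('' standing for ε):

```python
def all_prefixes_accepted(n, transitions, q0, qf, w):
    """Universal variant for the prefix relation: does A accept every
    prefix of w?  One state-set simulation suffices: check that the final
    state lies in each of the sets S_0, ..., S_|w|."""
    def closure(states):
        stack = list(states)
        while stack:
            p = stack.pop()
            for pp, a, q in transitions:
                if pp == p and a == '' and q not in states:
                    states.add(q)
                    stack.append(q)
        return states

    current = closure({q0})
    if qf not in current:          # the empty prefix ε must be accepted
        return False
    for c in w:
        current = closure({q for (p, a, q) in transitions if a == c and p in current})
        if qf not in current:      # the prefix read so far must be accepted
            return False
    return True
```

For instance, for an automaton accepting a*, every prefix of "aa" is accepted, while for "ab" the full string fails. The infix variant runs this check from every starting position, giving O(|w|²m).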

7 Conditional Lower Bounds

We now present the conditional lower bounds that are stated in Figure 1. These bounds apply to all problems for which we showed a time-complexity higher than the optimal linear complexity O(|w|+m). Further, they match the upper bounds except in three cases: the universal-variant of the infix relation, and the min-variant (resp., max-variant) of the subsequence relation (resp., supersequence relation). The section is structured in two parts. We first show lower bounds for the infix, prefix, extension and left-extension relations: Theorem 7.1 covers the matching problem and its min- and max-variants, and Theorem 7.2 covers lower bounds for the universal-variant with the infix and prefix relations, noting that the universal-variant for extension and left-extension was shown earlier to be PSPACE-complete (Theorem 6.1). Second, we show lower bounds for the min- and max-variants of the sub- and supersequence relations; the case of the universal-variant was covered in Theorem 6.1 too.

Infix, Prefix, Extension and Left-Extension Relations.

We first observe that SETH-based lower bounds for the 𝗂𝗇-, 𝗉𝗋𝖾-, 𝖾𝗑𝗍- and 𝗅𝖾𝗑𝗍-matching problem can be concluded by minor adaptations of the original construction from [9]. Moreover, since the min- and max-variants also solve the non-quantitative version of the matching problem, we can conclude that these lower bounds also hold for the quantitative variants:

Theorem 7.1.

If the matching problem for the infix, prefix, extension and left-extension relation can be solved in time O((|w|m)^{1−ϵ}) for some ϵ>0, then SETH fails. The same holds for the min- and max-variants, even if we only require the length of the answer strings.

Further, we can show that the O(|w|m) upper bound for the universal variant of the matching problem for the prefix or infix relation is also optimal under SETH. The construction can be sketched as follows. For a string w and an εNFA A, we build an εNFA A′ with 𝖫(A′)=({#}·𝖫(A)·{#}) ∪ ({#}·Σ*) ∪ (Σ*·{#}) ∪ Σ* for a fresh symbol #∉Σ. It is not hard to see that w∈𝖫(A) if and only if Λ𝗂𝗇(#w#)⊆𝖫(A′). Hence, checking w∈𝖫(A) reduces to the universal variant of the 𝗂𝗇-matching problem, so that the classical SETH-based lower bound for regex matching applies. The case of 𝗉𝗋𝖾 is very similar. We therefore have:

Theorem 7.2.

If the universal variant of the matching problem for the prefix or infix relation can be solved in time O((|w|m)^{1−ϵ}) for some ϵ>0, then SETH fails.

Subsequence and Supersequence Relations.

With respect to 𝗌𝗎𝖻 and 𝗌𝗎𝗉, we have proven optimal linear upper bounds for the non-quantitative variants in Section 4. However, for the min- and max-variants, we only have upper bounds of O(|w|m), which we will now complement with conditional lower bounds.

We can easily construct εNFAs Au,𝗌𝗎𝖻 and Au,𝗌𝗎𝗉 that accept exactly the subsequences of a string u (and supersequences of a string u, respectively) – note that this is an easy special case of [8, Lemma 8] which we reviewed in Section 3. This means that the max-variant of the 𝗌𝗎𝖻-matching problem applied to the εNFA Au,𝗌𝗎𝖻 and a string v amounts to computing the longest common subsequence of u and v. Likewise, the min-variant of the 𝗌𝗎𝗉-matching problem applied to Au,𝗌𝗎𝗉 and v admits a reduction from the shortest common supersequence problem. Thus, solving these problems would allow us to solve the longest common subsequence problem (or shortest common supersequence problem, respectively) for u and v. This means that the known SETH-conditional lower bounds on the longest common subsequence and shortest common supersequence problems carry over to these quantitative variants of the matching problem (see [13, 1]). Note that these lower bounds already hold when computing the length of the sequences, and further they already hold for constant-sized alphabets so it is not a problem that the automaton Au,𝗌𝗎𝗉 generally has size Θ(|u||Σ|).
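The automaton Au,𝗌𝗎𝖻 is easy to construct explicitly: states 0,…,|u| with a letter-transition and a skip ε-transition between consecutive positions. The following Python sketch (with our own encoding, and brute force used only to make the claim checkable on tiny inputs) illustrates that the max-variant of 𝗌𝗎𝖻-matching on Au,𝗌𝗎𝖻 and v computes exactly the longest common subsequence of u and v:

```python
from itertools import combinations

def subseq_nfa(u):
    """εNFA accepting exactly the subsequences of u: at each position i
    either read u[i] or skip it via an ε-transition."""
    trans = []
    for i, c in enumerate(u):
        trans.append((i, c, i + 1))    # read u[i]
        trans.append((i, '', i + 1))   # skip u[i]
    return len(u) + 1, trans, 0, len(u)

def accepts(n, transitions, q0, qf, s):
    """Standard state-set simulation of an εNFA ('' stands for ε)."""
    def closure(states):
        stack = list(states)
        while stack:
            p = stack.pop()
            for pp, a, q in transitions:
                if pp == p and a == '' and q not in states:
                    states.add(q)
                    stack.append(q)
        return states
    states = closure({q0})
    for c in s:
        states = closure({q for (p, a, q) in transitions if a == c and p in states})
    return qf in states

def lcs_len(u, v):
    """Textbook LCS dynamic program, for comparison."""
    dp = [[0] * (len(v) + 1) for _ in range(len(u) + 1)]
    for i in range(len(u)):
        for j in range(len(v)):
            dp[i + 1][j + 1] = dp[i][j] + 1 if u[i] == v[j] \
                else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(u)][len(v)]

def longest_accepted_subsequence(u, v):
    """Max-variant of sub-matching on A_{u,sub} and v, by brute force over
    the subsequences of v (tiny inputs only); equals LCS(u, v)."""
    n, trans, q0, qf = subseq_nfa(u)
    for k in range(len(v), -1, -1):
        for idx in combinations(range(len(v)), k):
            if accepts(n, trans, q0, qf, ''.join(v[i] for i in idx)):
                return k
    return 0
```

For instance, for u = "abcb" and v = "bacb", both computations yield 3 (e.g., the common subsequence "bcb").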

Theorem 7.3.

If the max-variant of the matching problem for the subsequence relation or the min-variant of the matching problem for the supersequence relation can be solved in time O((|w|m)^{1−ϵ}) for some ϵ>0, then SETH fails. This holds even if we only require the length of the answer strings.

This leaves the question of the optimality of O(|w|m)-time for the min-variant of the 𝗌𝗎𝖻-matching problem and the max-variant of the 𝗌𝗎𝗉-matching problem. For these, we are not able to prove SETH lower bounds that exclude O((|w|m)^{1−ϵ}) for some ϵ>0, but we can argue that the existence of algorithms with running time O(|w|+m) would have some unlikely consequences.

Theorem 7.4.

If the min-variant of the matching problem for the subsequence relation or the max-variant of the matching problem for the supersequence relation can be solved in time O(|w|+m), then we can decide whether a given dense graph G has a triangle in time O(|G|). This holds even if we only require the length of the answer strings.

Let us mention that any truly subcubic combinatorial algorithm for triangle detection (i. e., an algorithm with running time O(n^{3−ϵ}) for some ϵ>0, where n is the number of vertices) yields a truly subcubic combinatorial algorithm for Boolean matrix multiplication, which is considered unlikely (see [61]). So, conditional on this hypothesis, the problems above cannot be solved in time O(|w|+m), i. e., they do not enjoy the same linear-time complexity as the non-quantitative versions of subsequence and supersequence matching.

We now sketch the proof of Theorem 7.4. A graph G=({v1,…,vn},E) which is dense (i. e., a graph with |E|=Ω(n²)) can be transformed into an NFA MG as follows. We initialise MG with 4 layers of states {pi,qi,ri,si ∣ 1≤i≤n} and with b-transitions following G, i. e., every {vi,vj}∈E translates into transitions (pi,b,qj),(qi,b,rj),(ri,b,sj). This is called the middle part of MG. For every i∈{1,…,n}, we call pi the ith-entry point and si the ith-exit point. Now, G has a triangle if and only if there is an i∈{1,…,n} and a bbb-labelled path from the ith-entry point to the ith-exit point. We next add a left part to MG: we add a state s, which will be the initial state, and for every i∈{1,…,n} there is a path from s to the ith-entry point with 2(n+1)−i transitions, i many of which are labelled with a and the rest (so 2(n+1)−2i transitions) are labelled with b (the order does not matter). For the right part, we add a state t, which will be the accepting state, and for every i∈{1,…,n} there is a path from the ith-exit point to t with n+1+i transitions, n−i many of which are labelled with a and the rest (so n+1+i−(n−i)=2i+1 transitions) are labelled with b (the order does not matter). This concludes the definition of MG. Note that any string accepted by MG must go through precisely one entry point and precisely one exit point. We make the following general observation: every string accepted by MG with an accepting run going through the ith-entry point and the jth-exit point (1.) has length 3(n+1)+(j−i)+3 and (2.) has n−(j−i) occurrences of a (due to how the a’s are chosen in the left and right part of MG).
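The two items of the general observation can be sanity-checked mechanically; the following quick check of the arithmetic (not part of the proof) verifies them for all entry/exit pairs of small instances:

```python
# Sanity check of the general observation for M_G:
# via the ith-entry and jth-exit point, an accepted string traverses
#   left part:  2(n+1)-i transitions, i of them labelled a,
#   middle:     3 b-transitions (the bbb-path),
#   right part: n+1+j transitions, n-j of them labelled a.
for n in range(1, 10):
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            left_len, left_as = 2 * (n + 1) - i, i
            right_len, right_as = n + 1 + j, n - j
            # observation (1.): total length
            assert left_len + 3 + right_len == 3 * (n + 1) + (j - i) + 3
            # observation (2.): total number of a's
            assert left_as + right_as == n - (j - i)
```

In particular, the length equals N := 3(n+1)+3 exactly when i = j, and the number of a's equals n exactly in the same case, which is what the two reductions below exploit.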

Now for the min-variant of the matching problem for the subsequence relation, we consider the string w=b^{2(n+1)}bbb(ab^{2(n+1)}bbb)^n and ask if the minimum length of a subsequence of w accepted by MG is N:=3(n+1)+3. The value N is a lower bound on the length of subsequences of w accepted by MG. Indeed, consider any subsequence u of w accepted via the ith-entry point and the jth-exit point: it has at most n occurrences of a (by definition of w), so by item (2.) of our general observation, j≥i, which means by item (1.) of the general observation that N=3(n+1)+3≤|u|. Now, let us show that a length-N subsequence of w is accepted by MG if and only if G has a triangle. In one direction, if a length-N subsequence u of w is accepted by MG, then by item (1.) of the general observation this is only possible via the ith-entry point and the ith-exit point for the same i; thus, G has a triangle. Conversely, if G has a triangle containing vi, then we can read a string with entry point and exit point i, which, since it contains n occurrences of a in total, can easily be obtained as a length-(3(n+1)+3) subsequence of w.

For the max-variant of the matching problem for the supersequence relation, we consider the string w=a^n and ask if the maximum length of a supersequence of w accepted by MG is N:=3(n+1)+3 (i. e., the same length as in the previous proof). The value N is now an upper bound on the length of supersequences of w accepted by MG. Indeed, any supersequence u of w accepted via the ith-entry point and the jth-exit point has at least n occurrences of a, so by item (2.) of our general observation we have j≤i, which means that |u|≤3(n+1)+3=N by item (1.) of our general observation. Now, the argument is similar as before: if MG accepts a length-N supersequence u of w, then this is only possible via the ith-entry point and the ith-exit point for the same i, and therefore G has a triangle; and if G has a triangle containing vi, then we can read a string with entry point and exit point i, which, since it has n occurrences of a in total, is a length-N supersequence of w.

8 Conclusion and Future Work

We investigated the regex matching problem in a more general setting, where instead of asking whether w matches r, we ask whether there is a string u that matches r and is related to w under some string relation. We demonstrated that for the well-known subsequence and supersequence relations, this makes regex matching linear-time solvable. Interestingly, this tractability does not extend to other natural and simple relations, and it also does not generalise to quantitative variants asking for the shortest or longest result. For future research it would be interesting to find more string relations with this property, e. g., the subsequence relation with gap size bounded by k: x1x2⋯xn ⊑𝗌𝗎𝖻,k w if and only if w=w0x1w1x2⋯xnwn with |wi|≤k for every i∈{1,2,…,n−1}. However, we expect that such small modifications would make the complexity quadratic again, i. e., the classical SETH-reduction applies.

It seems particularly interesting that we can compute in O(|w||r|) a longest subsequence u and a shortest supersequence v of w that match r, since the values |w|−|u| and |v|−|w| (or their sum) can be interpreted as a measure of how much w fails to match r (note that |w|−|u|=|v|−|w|=0 if and only if w∈𝖫(r)). It might be interesting to investigate how fast this measure can be computed for other (practically motivated) classes of regular expressions, e. g., deterministic regular expressions or regular expressions with counters (or even strictly more powerful language classes). Another relevant question is whether we can close the gap between the upper bound and the lower bound for the universal-variant of the infix problem. In other words, given an εNFA A and a string w, can one check whether there is an infix of w that is not accepted by A faster than the easy O(|w|²|A|) upper bound? A similar question is whether there is a genuine difference between the complexities of the min- and max-variants of the subsequence and supersequence problems, given that we could not show the same conditional lower bounds for these problems.

Let us also point out that none of our upper bounds rely on heavy algorithmic machinery or data structures and therefore our asymptotic running times do not hide huge constant factors. Hence, we believe that our algorithms have straightforward and practically relevant implementations (just like the classical state-set simulation does).

Finally, our tractability results on sub- and supersequence matching can also be seen as tractability results for the matching problem on automata that satisfy certain conditions, namely, their language is upward- or downward-closed. Our results imply that, on such automata, we can perform matching in time O(|w|+m) instead of O(|w|m). One natural question is which other restrictions on automata make it possible to achieve such improved bounds. Examples in this direction are the regex classes studied in [9, 12] or automata classes where determinisation is done in polynomial time, e. g., NFAs of bounded co-lex width [16].

References

  • [1] Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for LCS and other sequence similarity measures. In FOCS, 2015. doi:10.1109/FOCS.2015.14.
  • [2] Amir Abboud, Virginia Vassilevska Williams, and Oren Weimann. Consequences of faster alignment of sequences. In ICALP, 2014. doi:10.1007/978-3-662-43948-7_4.
  • [3] Parosh Aziz Abdulla, Ahmed Bouajjani, and Bengt Jonsson. On-the-fly analysis of systems with unbounded, lossy fifo channels. In CAV, 1998. doi:10.1007/BFB0028754.
  • [4] Duncan Adamson, Pamela Fleischmann, Annika Huch, Tore Koß, Florin Manea, and Dirk Nowotka. k-Universality of regular languages. In ISAAC, volume 283 of LIPIcs, 2023. doi:10.4230/LIPICS.ISAAC.2023.4.
  • [5] Antoine Amarilli, Florin Manea, Tina Ringleb, and Markus L. Schmid. Linear time subsequence and supersequence regex matching. CoRR, abs/2504.16288, 2025. doi:10.48550/arXiv.2504.16288.
  • [6] Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan L. Reutter, and Domagoj Vrgoc. Foundations of modern query languages for graph databases. ACM Comput. Surv., 50(5), 2017. doi:10.1145/3104031.
  • [7] Alexander Artikis, Alessandro Margara, Martín Ugarte, Stijn Vansummeren, and Matthias Weidlich. Complex event recognition languages: Tutorial. In DEBS, 2017. doi:10.1145/3093742.3095106.
  • [8] Georg Bachmeier, Michael Luttenberger, and Maximilian Schlund. Finite automata for the sub-and superword closure of CFLs: Descriptional and computational complexity. In ICALP, 2015. doi:10.1007/978-3-319-15579-1_37.
  • [9] Arturs Backurs and Piotr Indyk. Which regular expression patterns are hard to match? In FOCS, 2016. doi:10.1109/FOCS.2016.56.
  • [10] Lasse Bergroth, Harri Hakonen, and Timo Raita. A survey of longest common subsequence algorithms. In SPIRE, 2000. doi:10.1109/SPIRE.2000.878178.
  • [11] Karl Bringmann and Bhaskar Ray Chaudhury. Sketching, streaming, and fine-grained complexity of (weighted) LCS. In FSTTCS, LIPIcs, 2018. doi:10.4230/LIPICS.FSTTCS.2018.40.
  • [12] Karl Bringmann, Allan Grønlund, Marvin Künnemann, and Kasper Green Larsen. The NFA acceptance hypothesis: Non-combinatorial and dynamic lower bounds. In ITCS, 2024. doi:10.4230/LIPICS.ITCS.2024.22.
  • [13] Karl Bringmann and Marvin Künnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In FOCS, 2015. doi:10.1109/FOCS.2015.15.
  • [14] Karl Bringmann and Marvin Künnemann. Multivariate fine-grained complexity of longest common subsequence. In SODA, 2018. doi:10.1137/1.9781611975031.79.
  • [15] Sam Buss and Michael Soltys. Unshuffling a square is NP-hard. J. Comput. Syst. Sci., 80(4), 2014. doi:10.1016/j.jcss.2013.11.002.
  • [16] Nicola Cotumaccio, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza. Co-lexicographically ordering automata and regular languages - Part I. J. ACM, 70(4), 2023. doi:10.1145/3607471.
  • [17] Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. Cambridge University Press, 2007.
  • [18] Maxime Crochemore, Borivoj Melichar, and Zdenek Tronícek. Directed acyclic subsequence graph — overview. J. Discrete Algorithms, 1(3-4), 2003. doi:10.1016/S1570-8667(03)00029-7.
  • [19] Joel D. Day, Pamela Fleischmann, Maria Kosche, Tore Koß, Florin Manea, and Stefan Siemer. The edit distance to k-subsequence universality. In STACS, 2021. doi:10.4230/LIPIcs.STACS.2021.25.
  • [20] Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document spanners: A formal approach to information extraction. J. ACM, 62(2), 2015. doi:10.1145/2699442.
  • [21] Szilárd Zsolt Fazekas, Tore Koß, Florin Manea, Robert Mercas, and Timo Specht. Subsequence matching and analysis problems for formal languages. In ISAAC, volume 322 of LIPIcs, 2024. doi:10.4230/LIPICS.ISAAC.2024.28.
  • [22] Michael L. Fredman and Dan E. Willard. BLASTING through the information theoretic barrier with FUSION TREES. In STOC, 1990. doi:10.1145/100216.100217.
  • [23] Dominik D. Freydenberger, Pawel Gawrychowski, Juhani Karhumäki, Florin Manea, and Wojciech Rytter. Testing k-binomial equivalence. In Multidisciplinary Creativity, a collection of papers dedicated to G. Păun's 65th birthday, 2015. Available at CoRR abs/1509.00622.
  • [24] Jeffrey E. F. Friedl. Mastering regular expressions - Understand your data and be more productive: for Perl, PHP, Java, .NET, Ruby, and more. O’Reilly, 2006.
  • [25] André Frochaux and Sarah Kleest-Meißner. Puzzling over subsequence-query extensions: Disjunction and generalised gaps. In AMW, 2023. URL: https://ceur-ws.org/Vol-3409/paper3.pdf.
  • [26] Moses Ganardi, Irmak Sağlam, and Georg Zetzsche. Directed regular and context-free languages. In STACS, 2024.
  • [27] Pawel Gawrychowski, Maria Kosche, Tore Koß, Florin Manea, and Stefan Siemer. Efficiently testing Simon’s congruence. In STACS, volume 187 of LIPIcs, 2021. doi:10.4230/LIPICS.STACS.2021.34.
  • [28] Nikos Giatrakos, Elias Alevizos, Alexander Artikis, Antonios Deligiannakis, and Minos N. Garofalakis. Complex event recognition in the big data era: A survey. VLDB J., 29(1), 2020. doi:10.1007/s00778-019-00557-w.
  • [29] Jean Goubault-Larrecq, Simon Halfon, Prateek Karandikar, K Narayan Kumar, and Philippe Schnoebelen. The ideal approach to computing closed subsets in well-quasi-orderings. Well-Quasi Orders in Computation, Logic, Language and Reasoning: A Unifying Concept of Proof Theory, Automata Theory, Formal Languages and Descriptive Set Theory, 2020.
  • [30] Étienne Grandjean and Louis Jachiet. Which arithmetic operations can be performed in constant time in the RAM model with addition?, 2023. URL: https://arxiv.org/abs/2206.13851.
  • [31] Alejandro Grez, Cristian Riveros, Martín Ugarte, and Stijn Vansummeren. A formal framework for complex event recognition. ACM Trans. Database Syst., 46(4), 2021. doi:10.1145/3485463.
  • [32] Simon Halfon, Philippe Schnoebelen, and Georg Zetzsche. Decidability, complexity, and expressiveness of first-order logic over the subword ordering. In LICS, 2017. doi:10.1109/LICS.2017.8005141.
  • [33] Daniel S. Hirschberg. Algorithms for the longest common subsequence problem. J. ACM, 24(4), 1977. doi:10.1145/322033.322044.
  • [34] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 2001.
  • [35] James W. Hunt and Thomas G. Szymanski. A fast algorithm for computing longest subsequences. Communications of the ACM, 20(5), 1977. doi:10.1145/359581.359603.
  • [36] Prateek Karandikar and Philippe Schnoebelen. The height of piecewise-testable languages and the complexity of the logic of subwords. Log. Methods Comput. Sci., 15(2), 2019. doi:10.23638/LMCS-15(2:6)2019.
  • [37] S.C. Kleene. Representation of events in nerve nets and finite automata. In Automata Studies, volume 34 of Annals of Mathematics Studies, pages 3–41. Princeton University Press, 1956.
  • [38] Sarah Kleest-Meißner, Rebecca Sattler, Markus L. Schmid, Nicole Schweikardt, and Matthias Weidlich. Discovering event queries from traces: Laying foundations for subsequence-queries with wildcards and gap-size constraints. In ICDT, 2022.
  • [39] Sarah Kleest-Meißner, Rebecca Sattler, Markus L. Schmid, Nicole Schweikardt, and Matthias Weidlich. Discovering multi-dimensional subsequence queries from traces - from theory to practice. In Datenbanksysteme für Business, Technologie und Web (BTW 2023), 20. Fachtagung des GI-Fachbereichs „Datenbanken und Informationssysteme“ (DBIS), 2023. doi:10.18420/BTW2023-24.
  • [40] Maria Kosche, Tore Koß, Florin Manea, and Stefan Siemer. Combinatorial algorithms for subsequence matching: A survey. In NCMA, volume 367 of EPTCS, 2022. doi:10.4204/EPTCS.367.2.
  • [41] Sailesh Kumar, Balakrishnan Chandrasekaran, Jonathan S. Turner, and George Varghese. Curing regular expressions matching algorithms from insomnia, amnesia, and acalculia. In ANCS, 2007. doi:10.1145/1323548.1323574.
  • [42] Dietrich Kuske. The subtrace order and counting first-order logic. In CSR, volume 12159 of LNCS, 2020. doi:10.1007/978-3-030-50026-9_21.
  • [43] Dietrich Kuske and Georg Zetzsche. Languages ordered by the subword order. In FOSSACS, volume 11425 of LNCS, 2019. doi:10.1007/978-3-030-17127-8_20.
  • [44] Alex X. Liu and Eric Norige. A de-compositional approach to regular expression matching for network security. IEEE/ACM Trans. Netw., 27(6), 2019. doi:10.1109/TNET.2019.2941920.
  • [45] David Maier. The complexity of some problems on subsequences and supersequences. J. ACM, 25(2), 1978. doi:10.1145/322063.322075.
  • [46] Alexandru Mateescu, Arto Salomaa, and Sheng Yu. Subword histories and Parikh matrices. J. Comput. Syst. Sci., 68(1):1–21, 2004. doi:10.1016/J.JCSS.2003.04.001.
  • [47] Alexander Okhotin. On the state complexity of scattered substrings and superstrings. Fundamenta Informaticae, 99(3), 2010. doi:10.3233/FI-2010-252.
  • [48] M. Praveen, Philippe Schnoebelen, Julien Veron, and Isa Vialard. On the piecewise complexity of words and periodic words. In SOFSEM, 2024. doi:10.1007/978-3-031-52113-3_32.
  • [49] Steven Purtzel and Matthias Weidlich. SuSe: Summary selection for regular expression subsequence aggregation over streams. In SIGMOD, 2025. To appear.
  • [50] William E. Riddle. An approach to software system modelling and analysis. Comput. Lang., 4(1), 1979. doi:10.1016/0096-0551(79)90009-2.
  • [51] Michel Rigo and Pavel Salimov. Another generalization of abelian equivalence: Binomial complexity of infinite words. Theor. Comput. Sci., 601, 2015. doi:10.1016/J.TCS.2015.07.025.
  • [52] Arto Salomaa. Connections between subwords and certain matrix mappings. Theoret. Comput. Sci., 340(2):188–203, 2005. doi:10.1016/J.TCS.2005.03.024.
  • [53] Philippe Schnoebelen and Julien Veron. On arch factorization and subword universality for words and compressed words. In WORDS, 2023. doi:10.1007/978-3-031-33180-0_21.
  • [54] Shinnosuke Seki. Absoluteness of subword inequality is undecidable. Theor. Comput. Sci., 418, 2012. doi:10.1016/J.TCS.2011.10.017.
  • [55] Alan C. Shaw. Software descriptions with flow expressions. IEEE Trans. Software Eng., 4(3), 1978. doi:10.1109/TSE.1978.231501.
  • [56] Imre Simon. Hierarchies of events with dot-depth one. PhD thesis, University of Waterloo, 1972.
  • [57] Imre Simon. Piecewise testable events. In Automata Theory and Formal Languages, 2nd GI Conference, volume 33 of LNCS, 1975. doi:10.1007/3-540-07407-4_23.
  • [58] Imre Simon. Words distinguished by their subwords (Extended abstract). In WORDS, volume 27 of TUCS General Publication, 2003.
  • [59] Peter Snyder and Chris Kanich. No please, after you: Detecting fraud in affiliate marketing networks. In WEIS, 2015. URL: http://www.econinfosec.org/archive/weis2015/papers/WEIS_2015_snyder.pdf.
  • [60] K. Thompson. Programming techniques: Regular expression search algorithm. Communications of the ACM, 11(6), 1968.
  • [61] Virginia Vassilevska Williams and R. Ryan Williams. Subcubic equivalences between path, matrix, and triangle problems. J. ACM, 65(5), 2018. doi:10.1145/3186893.
  • [62] Georg Zetzsche. The complexity of downward closure comparisons. In ICALP, volume 55 of LIPIcs, 2016. doi:10.4230/LIPICS.ICALP.2016.123.
  • [63] Haopeng Zhang, Yanlei Diao, and Neil Immerman. On complexity and optimization of expensive queries in complex event processing. In SIGMOD, 2014. doi:10.1145/2588555.2593671.