Linear Time Subsequence and Supersequence Regex Matching
Abstract
It is well-known that checking whether a given string $w$ matches a given regular expression $r$ can be done in quadratic time $O(|w| \cdot |r|)$ and that this cannot be improved to a truly subquadratic running time of $O((|w| \cdot |r|)^{1-\epsilon})$ assuming the strong exponential time hypothesis (SETH). We study a different matching paradigm where we ask instead whether $w$ has a subsequence that matches $r$, and show that regex matching in this sense can be solved in linear time $O(|w| + |r|)$. Further, the same holds if we ask for a supersequence. We show that the quantitative variants where we want to compute a longest or shortest subsequence or supersequence of $w$ that matches $r$ can be solved in $O(|w| \cdot |r|)$, i. e., asymptotically no worse than classical regex matching; and we show that $O(|w| + |r|)$ is conditionally not possible for these problems. We also investigate these questions with respect to other natural string relations like the infix, prefix, left-extension or extension relation instead of the subsequence and supersequence relation. We further study the complexity of the universal problem where we ask if all subsequences (or supersequences, infixes, prefixes, left-extensions or extensions) of an input string satisfy a given regular expression.
Keywords and phrases: subsequence, supersequence, regular language, regular expression, automata
Funding: Antoine Amarilli: Partially supported by the ANR project EQUUS ANR-19-CE48-0019 and by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – 431183758.
2012 ACM Subject Classification: Theory of computation → Regular languages
Editors: Paweł Gawrychowski, Filip Mazowiecki, and Michał Skrzypczak
Series and Publisher: Leibniz International Proceedings in Informatics (LIPIcs), Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Regular expressions (also called regex) were first introduced by Kleene in 1956 [37] as a theoretical concept and quickly found their way into practice with the classical construction by Thompson [60]. Nowadays, they are a standard tool for text processing and data retrieval tasks, and they constitute computational primitives in virtually all modern programming languages (see the standard textbook [24]). The most important problem about regular expressions is their matching problem, i. e., checking whether a given regular expression $r$ matches a given string $w$. One of the two main algorithmic approaches to this problem is still Thompson's original construction: transform $r$ into a nondeterministic finite automaton with $\varepsilon$-transitions (or NFA) $A_r$ in linear time, and then simulate $A_r$ on $w$ in time $O(|w| \cdot |r|)$. (The other approach is to transform $r$ into a deterministic finite automaton or DFA.)
In addition to their well-known applications as text analysis tools, regular expressions are also used in many different areas and are at the core of several data management principles: graph databases [6], information extraction [20], complex event processing [31], network monitoring [44], financial fraud detection [59], infrastructure security [41], etc. They have also recently gained new interest in the field of fine-grained complexity, where it has been shown that the long-standing quadratic upper bound of $O(|w| \cdot |r|)$ for matching cannot be improved to truly subquadratic complexity of $O((|w| \cdot |r|)^{1-\epsilon})$ unless the strong exponential time hypothesis (SETH) fails: see [9, 12]. In particular, the following question has been investigated [9]: For which classes of regex can the upper bound $O(|w| \cdot |r|)$ be improved, ideally to linear time $O(|w| + |r|)$, and for which classes is this impossible (assuming SETH)?
We contribute to the research on regex matching by exploring the following new angle. Instead of asking whether the whole input string $w$ matches the regex $r$, we ask whether $w$ has a subsequence that matches $r$, or a supersequence that matches $r$, or, more generally, whether $r$ matches some string $u$ with $u \preceq w$, where $\preceq$ is a fixed string relation (and classical regex matching is then the case when $\preceq$ is the equality relation). While technically any possible string relation instantiates a $\preceq$-version of the regex matching problem, we focus on relations that are relevant for string matching: the infix relation ($\preceq_{\mathsf{inf}}$), prefix relation ($\preceq_{\mathsf{pref}}$), left-extension relation ($\preceq_{\mathsf{lext}}$), extension relation ($\preceq_{\mathsf{ext}}$), subsequence relation ($\preceq_{\mathsf{sub}}$), and supersequence relation ($\preceq_{\mathsf{sup}}$).
Our Results.
Our main focus is on the subsequence and supersequence relations, and our first contribution is to show that the complexity of regular expression matching can be substantially improved when performed with these relations. Namely, we show that given a regex $r$ and string $w$, we can check in time $O(|w| + |r|)$ whether some subsequence of $w$ matches $r$, or whether some supersequence of $w$ matches $r$. As we show, this surprising tractability result is in strong contrast to the same problem for other string relations (infix, prefix, left-extension and extension), or for regex matching in the usual sense – in all these contexts, the $O(|w| \cdot |r|)$ algorithm is optimal assuming SETH.
Our linear time upper bound for subsequence and supersequence matching is achieved by transforming the regex $r$ into an NFA $A$ that accepts $w$ if and only if $w$ has a subsequence (or supersequence) that matches $r$, following a known construction in the context of so-called upward and downward closures [8]. We then exploit special properties of $A$ that allow us to decide whether it accepts $w$ in linear time. While it is in general easy to see that the quadratic upper bound can be improved in the case of subsequence and supersequence matching, we think that linear time complexity is rather unexpected, and some algorithmic care is required to achieve it. In particular, we stress the fact that our complexity bound does not assume the input alphabet to be constant.
Motivated by this positive algorithmic result, we investigate a natural generalisation of the matching problem: compute a maximum-length/minimum-length string $u$ with $u \preceq w$ that matches $r$. For the subsequence and supersequence relations, we can show (conditionally) that this generalisation worsens the complexity: our linear-time matching algorithm does not generalise to this stronger problem. More precisely, for the max-variant of the subsequence relation or the min-variant of the supersequence relation, we show that truly subquadratic algorithms would imply truly subquadratic algorithms for the longest common subsequence or the shortest common supersequence problems, which is impossible under SETH (see [13, 1]). For the min-variant of the subsequence relation or the max-variant of the supersequence relation, we show a weaker bound: an $O(|w| + |r|)$ algorithm would imply that we can detect triangles in a dense graph in linear time. On the positive side, however, we show that all these problems can be solved within the original matching complexity of $O(|w| \cdot |r|)$, in particular generalising the fact that longest common subsequences and shortest common supersequences can be computed within this bound. We also show that the same upper bound applies for the infix, prefix, and (left-)extension relations, and that it is optimal under SETH.
Finally, we investigate the complexity of checking whether all strings $u$ with $u \preceq w$ match $r$. While it is easy to see that this can be solved efficiently for the infix and prefix relation, it becomes intractable for the (left-)extension, subsequence and supersequence relations. We pinpoint the complexity of all these variants (conditionally on SETH) except for the infix relation, i. e., we leave a gap between the upper and lower bound for the problem of deciding whether the input string has an infix which is rejected by an input regex.
Motivations and Related Work.
Subsequences and supersequences play an important role in many different areas of theoretical computer science: in formal languages and logics (e. g., piecewise testable languages [56, 57, 36, 48], or subword order and downward closures [32, 43, 42, 62]), in combinatorics on words [51, 23, 54, 46, 52, 53] or combinatorial pattern matching [33, 35, 45, 58, 19, 27], for modelling concurrency [50, 55, 15], in fine-grained complexity [11, 14, 1, 2, 49], etc. See also the surveys [10, 18, 40] and the references therein. Moreover, algorithmic problems related to the analysis of the set of subsequences of the strings of a formal language, given as a grammar or automaton, are studied in [4] and [21]. Closer to our topic, matching regular expressions to the subsequences of a string is an important matching paradigm in event stream processing, where we receive a stream of events that result from different processes, which means that consecutive events from the same process are represented as a subsequence in this string (see, e. g., [7, 28, 63, 38, 39, 25, 31]).
To prove our linear upper bound for the subsequence regex matching problem, we first transform the regular expression $r$ into an NFA that accepts the upward closure of $L(r)$, i. e., the set $\{u \mid \exists v \in L(r): v \preceq_{\mathsf{sub}} u\}$ (note that $w$ has a subsequence that matches $r$ if and only if $w$ belongs to this upward closure). Similarly, the upper bound for the supersequence regex matching problem is based on constructing an NFA for the downward closure $\{u \mid \exists v \in L(r): u \preceq_{\mathsf{sub}} v\}$. The downward and upward closures are well-investigated concepts in formal language theory (see citations above). In particular, it has been noted in [8, Lemma 9] that the powerset-DFA of an NFA for the upward closure always transitions from a state set to a superset of this state set (analogously, the powerset-DFA of an NFA for the downward closure always transitions from a state set to a subset of this state set). However, this property does not imply that these powerset-DFAs are necessarily small, since [47] proves an exponential lower bound on their number of states. Our notion of subsequence or supersequence matching is related to so-called upward-closed and downward-closed languages (which are languages respectively equal to their upward or downward closure), because for such languages the usual notion of regex matching coincides with subsequence and supersequence matching, respectively. These languages have been investigated, e. g., for downward-closed languages in [26] or in [3] (under the name "simple regular expressions") and for upward-closed languages in [29]. However, to our knowledge, none of the works focusing on the upward or downward closure, or on upward-closed or downward-closed languages, have investigated the complexity of the matching problem like we do – so it was not known that these problems could be solved in time $O(|w| + |r|)$, or indeed that their complexity was lower than that of standard regex matching.
Paper Structure.
In Section 2, we give preliminaries and formally define the matching problems that we study. In Section 3, we first recall some basics about the state-set simulation of NFAs and then explain how our regex matching problems for the subsequence and supersequence relation reduce to the state-set simulation of certain NFAs. Then, in Section 4, we give the corresponding linear time algorithms. Sections 5 and 6 are concerned with the quantitative and universal problem variants for all considered string relations. The conditional lower bounds that complement our upper bounds are discussed in Section 7. Due to space restrictions, we only give proof sketches of some of our results (all proof details can be found in the full version [5] of this paper).
2 Preliminaries and Problem Statement
Strings, Regular Expressions and Automata.
We let $\Sigma$ be a finite alphabet of symbols (sometimes called letters) and write $\Sigma^*$ for the set of strings (sometimes called words) over $\Sigma$. We write $\varepsilon$ for the empty string. For a string $w \in \Sigma^*$, we denote by $|w|$ its length and, for every $i \in \{1, 2, \ldots, |w|\}$, we denote by $w[i]$ the $i$-th symbol of $w$. For $i, j$ with $1 \le i \le j \le |w|$, we denote by $w[i..j]$ the factor (also called infix) $w[i] w[i{+}1] \cdots w[j]$ of $w$; in particular, $w[i..i] = w[i]$.
Regular expressions over $\Sigma$ are defined as follows. The empty set $\emptyset$ is a regular expression with $L(\emptyset) = \emptyset$, and every $a \in \Sigma \cup \{\varepsilon\}$ is a regular expression with $L(a) = \{a\}$. If $r$ and $s$ are regular expressions, then $(r \cdot s)$, $(r \mid s)$, and $(r)^*$ are regular expressions with $L(r \cdot s) = L(r) \cdot L(s)$, $L(r \mid s) = L(r) \cup L(s)$, and $L((r)^*) = (L(r))^*$. Here, we define as usual: $K \cdot L := \{uv \mid u \in K, v \in L\}$, $L^0 := \{\varepsilon\}$, $L^{i+1} := L \cdot L^i$ for every $i \ge 0$, and $L^* := \bigcup_{i \ge 0} L^i$. The length $|r|$ of a regular expression $r$ is the total number of symbols of $r$ considered as a string.
We work with nondeterministic finite automata with $\varepsilon$-transitions, called NFA for brevity. An NFA is a tuple $A = (Q, \Sigma, q_0, q_f, \delta)$ where $Q$ is a finite set of states, $q_0 \in Q$ is the initial state, $q_f \in Q$ is the final state, and $\delta \subseteq Q \times (\Sigma \cup \{\varepsilon\}) \times Q$ is the set of transitions. A transition of the form $(p, \varepsilon, q)$ is called an $\varepsilon$-transition. We will also interpret an NFA $A$ as a graph which has vertex set $Q$ and which has directed edges labelled by symbols from $\Sigma \cup \{\varepsilon\}$ given by the transitions of $A$, i. e., any transition $(p, x, q) \in \delta$ is interpreted as a directed edge from $p$ to $q$ labelled by $x$. A transition $(p, x, q)$ with $p = q$ and $x \in \Sigma \cup \{\varepsilon\}$ is called a self-loop. A run of $A$ on an input string $w$ is a path (not necessarily simple) from $q_0$ to some state $q$ which is labelled with $w$, where the label of a path is just the concatenation of all $\Sigma$-labels (ignoring $\varepsilon$-labels). A run is accepting if $q = q_f$. We write $L(A)$ for the language accepted by $A$, i. e., the set of all strings for which $A$ has an accepting run. The size of $A$ is its number of transitions: we usually denote it by $m$, while $n$ denotes the number of states.
It is well-known that a given regular expression $r$ can be converted in time $O(|r|)$ into an NFA $A$ such that $L(A) = L(r)$ and $m = O(|r|)$. This can be achieved, e. g., using Thompson's construction [34, Section 3.2.3]. Thus, in the sequel, we assume that all input regular expressions are given as NFAs, and we state our results directly for arbitrary NFAs.
Moreover, we assume that our NFAs are trimmed, which means that every state is reachable from $q_0$ and every state can reach $q_f$. This can always be ensured in time $O(n + m)$. If an NFA is trimmed, then we also have that the number of states of the input automaton is at most twice the number of transitions, i. e., $n \le 2m$. We also assume that for each $a \in \Sigma$ we have at least one transition labelled with $a$, which means that $|\Sigma|$ is in $O(m)$.
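To fix the representation used throughout the paper, here is a minimal sketch of the trimming step (our own Python rendering of the folklore procedure; `None` stands for $\varepsilon$, and the function names are ours). It performs one forward and one backward BFS and keeps only states on some path from $q_0$ to $q_f$:

```python
from collections import deque

def trim(n_states, q0, qf, delta):
    """Keep only states reachable from q0 and co-reachable from qf.
    delta is a list of transitions (p, x, q), with x a symbol or None
    for epsilon. Runs in O(n + m) via two BFS passes."""
    fwd = [[] for _ in range(n_states)]
    bwd = [[] for _ in range(n_states)]
    for p, _, q in delta:
        fwd[p].append(q)
        bwd[q].append(p)

    def bfs(start, adj):
        seen = [False] * n_states
        seen[start] = True
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if not seen[u]:
                    seen[u] = True
                    queue.append(u)
        return seen

    reach = bfs(q0, fwd)      # reachable from the initial state
    coreach = bfs(qf, bwd)    # can reach the final state
    keep = [reach[v] and coreach[v] for v in range(n_states)]
    return [(p, x, q) for p, x, q in delta if keep[p] and keep[q]]
```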
String Relations.
A string relation (over $\Sigma$) is a subset of $\Sigma^* \times \Sigma^*$. For any string relation $\preceq$ and $w \in \Sigma^*$ we define $\preceq(w) := \{u \in \Sigma^* \mid u \preceq w\}$, i. e., the set of all strings that are in relation to $w$; we also lift this notation to languages $L \subseteq \Sigma^*$, i. e., $\preceq(L) := \bigcup_{w \in L} \preceq(w)$. Next, we define several well-known string relations. (Our generalised regex matching setting works for any string relation, but those that we study are all reflexive, transitive and antisymmetric; hence, we use the symbol $\preceq$.)
The prefix and infix relations are denoted by $\preceq_{\mathsf{pref}}$ and $\preceq_{\mathsf{inf}}$: formally, we write $u \preceq_{\mathsf{pref}} w$ when $w = uv$ for some $v \in \Sigma^*$ and we write $u \preceq_{\mathsf{inf}} w$ when $w = xuy$ for some $x, y \in \Sigma^*$. The left-extension and extension relations are denoted by $\preceq_{\mathsf{lext}}$ and $\preceq_{\mathsf{ext}}$: formally, we write $u \preceq_{\mathsf{lext}} w$ when $u = vw$ for some $v \in \Sigma^*$ and write $u \preceq_{\mathsf{ext}} w$ when $u = vwx$ for some $v, x \in \Sigma^*$. The subsequence and supersequence relations are denoted by $\preceq_{\mathsf{sub}}$ and $\preceq_{\mathsf{sup}}$: we write $u \preceq_{\mathsf{sub}} w$ when $u = w[i_1] w[i_2] \cdots w[i_k]$ for some $1 \le i_1 < i_2 < \cdots < i_k \le |w|$, and write $u \preceq_{\mathsf{sup}} w$ when $w \preceq_{\mathsf{sub}} u$. Note that we do not study the suffix and right-extension relations because they amount to prefix and left-extension up to mirroring the strings.
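For concreteness, each of the six relations can be decided by an elementary scan; the following sketch (function names are ours) checks $u \preceq w$ for each of them:

```python
def is_prefix(u, w):          # u <=_pref w:  w = u v
    return w.startswith(u)

def is_infix(u, w):           # u <=_inf w:  w = x u y
    return u in w

def is_left_extension(u, w):  # u <=_lext w:  u = v w
    return u.endswith(w)

def is_extension(u, w):       # u <=_ext w:  u = v w x
    return w in u

def is_subsequence(u, w):     # u <=_sub w: u arises from w by deletions
    it = iter(w)
    return all(any(c == d for d in it) for c in u)  # greedy embedding

def is_supersequence(u, w):   # u <=_sup w:  w <=_sub u
    return is_subsequence(w, u)
```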
Variants of Regex Matching.
The well-known regex matching problem is to decide whether $w \in L(r)$ for a given regular expression $r$ and input string $w$. As explained above, this generalises to the NFA acceptance problem, where we want to decide whether $w \in L(A)$ for a given NFA $A$ and input string $w$. In this paper, we study the $\preceq$-matching problem for a string relation $\preceq$ among those presented above: For a given string $w$ and an NFA $A$, decide whether $A$ accepts a string $u$ with $u \preceq w$, i. e., decide whether $L(A) \cap \preceq(w) \neq \emptyset$.
We also define quantitative versions of the matching problem. For a string relation $\preceq$, the min-variant of the $\preceq$-matching problem is as follows: For a given string $w$ and an NFA $A$, compute a shortest string $u$ with $u \preceq w$ and $u \in L(A)$, or report that no such string exists. The max-variant of the $\preceq$-matching problem is as follows: For a given string $w$ and an NFA $A$, either report that there are arbitrarily large strings $u$ with $u \preceq w$ and $u \in L(A)$, or compute a longest string $u$ with $u \preceq w$ and $u \in L(A)$, or report that no such string exists. Further, all lower bounds on the quantitative versions of the matching problem will already apply in the setting where we are only required to compute the length of the resulting string.
Finally, we define a universal variant of the matching problem. For a string relation $\preceq$, the universal-variant of the $\preceq$-matching problem is as follows: For a given string $w$ and an NFA $A$, decide whether $A$ matches all strings $u$ with $u \preceq w$, i. e., decide the inclusion problem $\preceq(w) \subseteq L(A)$.
See Figure 1 for an overview of all our upper and lower bounds with respect to the variants of the regex matching problem. Recall that $m$ is the size (i. e., the number of transitions) of the input NFA.
| upper bounds | $\preceq_{\mathsf{pref}}$ | $\preceq_{\mathsf{inf}}$ | $\preceq_{\mathsf{lext}}$ / $\preceq_{\mathsf{ext}}$ | $\preceq_{\mathsf{sub}}$ | $\preceq_{\mathsf{sup}}$ |
|---|---|---|---|---|---|
| $\preceq$-matching | $O(\lvert w\rvert \cdot m)$ | $O(\lvert w\rvert \cdot m)$ | $O(\lvert w\rvert \cdot m)$ | $O(\lvert w\rvert + m)$ | $O(\lvert w\rvert + m)$ |
| min-variant | $O(\lvert w\rvert \cdot m)$ | $O(\lvert w\rvert \cdot m)$ | $O(\lvert w\rvert \cdot m)$ | $O(\lvert w\rvert \cdot m)$ | $O(\lvert w\rvert \cdot m)$ |
| max-variant | $O(\lvert w\rvert \cdot m)$ | $O(\lvert w\rvert \cdot m)$ | $O(\lvert w\rvert \cdot m)$ | $O(\lvert w\rvert \cdot m)$ | $O(\lvert w\rvert \cdot m)$ |
| universal-variant | $O(\lvert w\rvert \cdot m)$ | $O(\lvert w\rvert^2 \cdot m)$ | PSPACE | coNP | PSPACE |

| lower bounds | $\preceq_{\mathsf{pref}}$ / $\preceq_{\mathsf{inf}}$ | $\preceq_{\mathsf{lext}}$ / $\preceq_{\mathsf{ext}}$ | $\preceq_{\mathsf{sub}}$ | $\preceq_{\mathsf{sup}}$ |
|---|---|---|---|---|
| $\preceq$-matching | no $O((\lvert w\rvert \cdot m)^{1-\epsilon})$ | no $O((\lvert w\rvert \cdot m)^{1-\epsilon})$ | — | — |
| min-variant | no $O((\lvert w\rvert \cdot m)^{1-\epsilon})$ | no $O((\lvert w\rvert \cdot m)^{1-\epsilon})$ | no $O(\lvert w\rvert + m)$ | no $O((\lvert w\rvert \cdot m)^{1-\epsilon})$ |
| max-variant | no $O((\lvert w\rvert \cdot m)^{1-\epsilon})$ | no $O((\lvert w\rvert \cdot m)^{1-\epsilon})$ | no $O((\lvert w\rvert \cdot m)^{1-\epsilon})$ | no $O(\lvert w\rvert + m)$ |
| universal-variant | no $O((\lvert w\rvert \cdot m)^{1-\epsilon})$ | PSPACE-hard | coNP-hard | PSPACE-hard |

Figure 1: Overview of our results. Top: upper bounds. Bottom: lower bounds, where "no $X$" means that no algorithm with running time $X$ exists. All lower bounds are conditional on SETH, except for the two bounds of the form "no $O(\lvert w\rvert + m)$" (the min-variant for $\preceq_{\mathsf{sub}}$ and the max-variant for $\preceq_{\mathsf{sup}}$), which are conditional on the hypothesis that triangles in dense graphs cannot be detected in linear time (see Section 7).
Computational Model and Basic Input-Preprocessing.
The computational model we use to state our algorithms is the standard unit-cost word RAM with logarithmic word-size $\omega$, which allows processing inputs of size $N$, where $\omega = \Theta(\log N)$. This is a standard computational model for the analysis of algorithms, defined in [22]. To make some of our algorithms faster, we use the standard technique of lazy initialisation of arrays [30] to avoid spending linear time in the array size to initialise its cells. The time needed, after the lazy initialisation, to store a value in a cell of the array, or to check if a cell of the array was initialised and if so return the value it stores, is $O(1)$.
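The following is a minimal sketch of this folklore technique (our own illustration; in a genuine word-RAM setting `val` and `idx` would be allocated uninitialised and contain garbage, which Python cannot express): a cell counts as initialised only if the stack of initialised positions vouches for it.

```python
class LazyArray:
    """Array with O(1) creation, set and get, without spending O(size)
    time to clear the cells (folklore lazy-initialisation trick)."""
    def __init__(self, size):
        self.val = [None] * size  # would be uninitialised memory on a RAM
        self.idx = [0] * size     # idx[i]: claimed position of i in stack
        self.stack = []           # cells initialised so far

    def _initialised(self, i):
        j = self.idx[i]
        return j < len(self.stack) and self.stack[j] == i

    def set(self, i, v):
        if not self._initialised(i):
            self.idx[i] = len(self.stack)
            self.stack.append(i)
        self.val[i] = v

    def get(self, i):
        return self.val[i] if self._initialised(i) else None
```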
Recall that the inputs for our problems (see previous paragraph) are always an NFA $A$ and a string $w$ over $\Sigma$. For these inputs, we make the following assumptions. For an NFA $A = (Q, \Sigma, q_0, q_f, \delta)$ with $|Q| = n$ and $|\delta| = m$, we assume that $Q = \{1, 2, \ldots, n\}$ and $\Sigma = \{1, 2, \ldots, |\Sigma|\}$; we can assume that both $Q$ and $\Sigma$ are ordered sets, w.r.t. the canonical order on natural numbers. Hence, the processed strings are sequences of integers (representing the symbols), each of these integers fitting in one memory word. This is a common assumption in string algorithms: the input alphabet is said to be an integer alphabet (see, e. g., [17]). As we assume NFAs to be trimmed, we also know that $n$ and $|\Sigma|$ are in $O(m)$. Moreover, an NFA is given by its number of states, the size of the input alphabet, its initial and final state, and a list of transitions of the form $(p, x, q)$, with $p, q \in Q$ and $x \in \Sigma \cup \{\varepsilon\}$.
Further, using radix sort, we can prepare for each state $q$ having outgoing transitions the list $\mathsf{adj}[q]$ of these transitions, sorted first according to the symbol labelling them, and then by the target state; $\mathsf{adj}[q]$ is undefined for states without outgoing transitions. This allows us to simultaneously construct, for each $q \in Q$ and $x \in \Sigma \cup \{\varepsilon\}$ for which $q$ has an outgoing $x$-transition, the list $\mathsf{adj}_x[q]$ of transitions of the form $(q, x, q')$; $\mathsf{adj}_x[q]$ is undefined for $q$ and $x$ in the case where $q$ has no outgoing $x$-transition. We will make the convention that, in the case that a self-loop $(q, x, q)$ exists, then it is the first transition stored in $\mathsf{adj}_x[q]$; this allows us to test in $O(1)$ whether there exists a self-loop $(q, x, q)$ for any $q$ and $x$. Both $\mathsf{adj}$ and $\mathsf{adj}_x$ are implemented as arrays (of lists), which are lazily initialised (as mentioned above). So, computing the defined elements of $\mathsf{adj}$ and $\mathsf{adj}_x$ can be done in linear time, i. e., in $O(m)$.
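A simplified sketch of this preprocessing under our conventions (symbols are integers $1, \ldots, |\Sigma|$; `None` stands for $\varepsilon$; a Python dictionary stands in for the lazily initialised arrays): two stable bucketing passes put, for each state and symbol, the self-loop first.

```python
def build_adjacency(sigma, delta):
    """Group transitions (p, x, q) (x = None for epsilon) into lists
    adj_x[(p, x)] of targets, such that a self-loop (p, x, p), if it
    exists, comes first. Two stable passes: O(m + sigma) time."""
    # pass 1: all self-loops before all other transitions (stable)
    ordered = [t for t in delta if t[0] == t[2]] + \
              [t for t in delta if t[0] != t[2]]
    # pass 2: stable counting sort on the symbol (bucket 0 = epsilon)
    buckets = [[] for _ in range(sigma + 1)]
    for p, x, q in ordered:
        buckets[0 if x is None else x].append((p, x, q))
    adj_x = {}
    for bucket in buckets:
        for p, x, q in bucket:
            adj_x.setdefault((p, x), []).append(q)
    return adj_x

def has_self_loop(adj_x, p, x):
    """O(1) self-loop test thanks to the ordering convention."""
    lst = adj_x.get((p, x))
    return lst is not None and lst[0] == p
```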
3 State-Set Simulation and Simple Upper Bounds
Given a string $w$ and a regular expression represented as an NFA $A$, we can solve the regex matching problem of deciding whether $w \in L(A)$ using the classical approach of state-set simulation. It is well-known [9] that, assuming the strong exponential time hypothesis (SETH), this yields an optimal algorithm. For our variants of regex matching, it is often possible to build an NFA $A'$ from $A$ that accepts suitable strings (e. g., the subsequences or supersequences of strings of $L(A)$), so that we can solve the matching problem by checking whether $w \in L(A')$ by state-set simulation (though this is generally not optimal). Since many of our algorithms will nevertheless be specialised variants of state-set simulation, we review the technique in more detail below.
State-Set Simulation for NFA.
For an NFA $A$ without $\varepsilon$-transitions and an input string $w$, the state-set simulation starts with $S_0 := \{q_0\}$ and then, for every $i = 1, 2, \ldots, |w|$, computes the set $S_i := \delta(S_{i-1}, w[i])$, where, for any state set $S \subseteq Q$ and symbol $a \in \Sigma$, we call $\delta(S, a) := \{q \mid (p, a, q) \in \delta, p \in S\}$ the set of all states that can be reached from a state in $S$ by an $a$-transition. Clearly, each $S_i$ contains exactly the states that are reachable from $q_0$ by a $w[1..i]$-labelled path. Now if we are dealing with an NFA with $\varepsilon$-transitions, then we can use the same idea, but we also have to compute the $\varepsilon$-closures in between, i. e., the set $\varepsilon\mathsf{cl}(S)$ of all states that can be reached from a state in $S$ by a path of $\varepsilon$-transitions. More precisely, we start with $S_0 := \varepsilon\mathsf{cl}(\{q_0\})$ and then, for every $i = 1, 2, \ldots, |w|$, we compute the set $S_i := \varepsilon\mathsf{cl}(\delta(S_{i-1}, w[i]))$. Now each $S_i$ contains exactly the states that are reachable from $q_0$ by a path that is labelled with $w[1..i]$ (with $\varepsilon$-transitions being ignored in the path label). Computing $S_0$ is called the initialisation and computing $S_i$ from $S_{i-1}$ is called an update step of the state-set simulation. For every $i \in \{0, 1, \ldots, |w|\}$, the set $S_i$ is the set of active states at step $i$. We also say that a state $q$ is active at step $i$ if $q \in S_i$.
Given a state set $S$ and a symbol $a$, computing the set $\delta(S, a)$ can be easily done in time $O(m)$, since we only have to consider the states $p \in S$ with a transition $(p, a, q)$, whose transitions are given by the lists $\mathsf{adj}_a[p]$. Thus, we have to inspect every transition at most once. Computing the $\varepsilon$-closure can also be done in time $O(m)$ by a breadth-first search (BFS) that starts in all states of $S$ and only considers $\varepsilon$-transitions (again, the entries $\mathsf{adj}_\varepsilon[\cdot]$ can be used for that). This means that the initialisation and each update step can be done in time $O(m)$, which yields a total running time of $O(|w| \cdot m)$ for the whole state-set simulation. What is more, this running time is conditionally optimal. Indeed:
Lemma 3.1 ([9]).
Given an NFA $A$ and string $w$, we can check in time $O(|w| \cdot m)$ whether $w \in L(A)$, but not in time $O((|w| \cdot m)^{1-\epsilon})$ for any $\epsilon > 0$ unless SETH fails.
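For reference, here is a direct sketch of the state-set simulation for NFAs with $\varepsilon$-transitions (our own rendering; `None` stands for $\varepsilon$). The initialisation and each update step inspect every transition at most once, which gives the $O(|w| \cdot m)$ bound:

```python
def accepts(w, n_states, q0, qf, delta):
    """Classical state-set simulation of an epsilon-NFA in O(|w| * m)."""
    eps = [[] for _ in range(n_states)]
    out = [{} for _ in range(n_states)]  # out[p][a]: a-successors of p
    for p, x, q in delta:
        if x is None:
            eps[p].append(q)
        else:
            out[p].setdefault(x, []).append(q)

    def eps_closure(states):
        stack, closed = list(states), set(states)
        while stack:
            p = stack.pop()
            for q in eps[p]:
                if q not in closed:
                    closed.add(q)
                    stack.append(q)
        return closed

    S = eps_closure({q0})                # initialisation: S_0
    for a in w:                          # update step: S_{i-1} -> S_i
        S = eps_closure({q for p in S for q in out[p].get(a, ())})
        if not S:
            return False                 # no active states left
    return qf in S
```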
Solving Sub- and Supersequence Matching via State-Set Simulation.
We now explain how we can use state-set simulation to solve the $\preceq$-matching problem, specifically for the subsequence and supersequence relations (which are our main focus in this paper). It is easy to see that we can transform an NFA $A$ into two NFAs $A^{\uparrow}$ and $A^{\downarrow}$ such that $L(A^{\uparrow}) = \{u \mid \exists v \in L(A): v \preceq_{\mathsf{sub}} u\}$ and $L(A^{\downarrow}) = \{u \mid \exists v \in L(A): u \preceq_{\mathsf{sub}} v\}$. For example, this is done in [8, Lemma 8]. We review this construction as we adapt it later.
The NFA $A^{\uparrow}$ is obtained from $A$ by simply adding a transition $(q, a, q)$ for every state $q \in Q$ and $a \in \Sigma$. Intuitively speaking, these loops equip $A$ with the general possibility to consume symbols from $w$ without transitioning to a new state, which corresponds to ignoring symbols from $w$. Hence, the "non-ignoring" transitions of an accepting run of $A^{\uparrow}$ on $w$ spell out a subsequence of $w$ accepted by $A$. On the other hand, the NFA $A^{\downarrow}$ is obtained from $A$ by simply adding a transition $(p, \varepsilon, q)$ for every existing transition $(p, a, q)$ with $a \in \Sigma$. This means that while reading $w$ the automaton can always virtually process some symbols that do not belong to $w$. Hence, in an accepting run of $A^{\downarrow}$ on $w$, the actual transitions that read the symbols from $w$ along with the added $\varepsilon$-transitions that virtually read some symbols spell out a supersequence of $w$ accepted by $A$.
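In our transition-list representation, both constructions of [8, Lemma 8] are one-liners (a sketch; `None` stands for $\varepsilon$ and symbols are integers $1, \ldots, |\Sigma|$):

```python
def upward_nfa(n_states, sigma, delta):
    """A_up: accepts the strings having a subsequence in L(A); note the
    Theta(n * sigma) added self-loops, which is why Section 4 avoids
    building A_up explicitly."""
    return delta + [(q, a, q) for q in range(n_states)
                              for a in range(1, sigma + 1)]

def downward_nfa(delta):
    """A_down: accepts the subsequences of strings in L(A); one extra
    epsilon-transition per labelled transition, so the size stays O(m)."""
    return delta + [(p, None, q) for p, x, q in delta if x is not None]
```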
Consequently, we can solve the subsequence and supersequence matching problem by checking whether $w \in L(A^{\uparrow})$ and $w \in L(A^{\downarrow})$, respectively, via state-set simulation. Since $A^{\uparrow}$ has $m + n \cdot |\Sigma|$ transitions and $A^{\downarrow}$ has at most $2m$ transitions, we get the following:
Theorem 3.2.
The subsequence matching problem can be solved in time $O(|w| \cdot (m + n \cdot |\Sigma|))$ and the supersequence matching problem can be solved in time $O(|w| \cdot m)$.
It has been observed in [8, Lemma 9] that the NFAs $A^{\uparrow}$ and $A^{\downarrow}$ have special properties that can be exploited in the state-set simulations to improve the bounds of Theorem 3.2.
Firstly, for $A^{\uparrow}$, whenever a state of $A^{\uparrow}$ is added to the set of active states in the state-set simulation, it will stay active until the end of the state-set simulation. This is due to the existence of the transition $(q, a, q)$ for every $q \in Q$ and $a \in \Sigma$. Consequently, for the state-set simulation of $A^{\uparrow}$, we have $S_0 \subseteq S_1 \subseteq \cdots \subseteq S_{|w|}$. Secondly, for $A^{\downarrow}$, we can observe that the state-set simulation starts with $S_0 = Q$, i. e., the set of all states, which is due to the fact that every state in $A^{\downarrow}$ is reachable from $q_0$ by $\varepsilon$-transitions and therefore is in the $\varepsilon$-closure of $\{q_0\}$ with respect to $A^{\downarrow}$. Moreover, when a state is removed, it is never added back. Indeed, assume by contradiction that $q \notin S_{i-1}$ but $q$ gets added to $S_i$ by an update step. Then $q$ is not reachable from $S_{i-1}$ by $\varepsilon$-transitions, but it is reachable from $S_{i-1}$ by a path using one $w[i]$-transition and otherwise $\varepsilon$-transitions: this is impossible because in $A^{\downarrow}$ there is a transition $(p, \varepsilon, q')$ for every existing transition $(p, a, q')$ with $a \in \Sigma$, so the latter path could be turned into a path of $\varepsilon$-transitions only. Thus, $S_0 \supseteq S_1 \supseteq \cdots \supseteq S_{|w|}$.
This means that in the state-set simulation for $A^{\uparrow}$ and $A^{\downarrow}$ on an input string $w$, we can encounter at most $n + 1$ different sets of active states. Further, for each set of active states, there are at most $|\Sigma|$ possible update steps necessary (since if $S_{i-1} = S_{j-1}$ and $w[i] = w[j]$ hold for some $i < j$, then $S_i = S_j$, so we do not need to recompute the update). Every actual update can again be done in time $O(m)$, which yields the following:
Theorem 3.3.
There are algorithms that solve both the subsequence matching problem and the supersequence matching problem in time $O(|w| + n \cdot |\Sigma| \cdot m)$.
This is already a significant improvement over Theorem 3.2, since the running time is linear in $|w|$. In the next two sections, we look deeper into the problem of subsequence and supersequence matching and manage to lower its complexity to the optimum of $O(|w| + m)$.
4 Subsequence and Supersequence Matching in Linear Time
In this section, we prove that the $\preceq_{\mathsf{sub}}$- and $\preceq_{\mathsf{sup}}$-matching problem can be solved in linear time $O(|w| + m)$. Note that this also holds for the usual regex matching problem when the input regex or NFA is assumed to accept an upward-closed or downward-closed language (as matching is then equivalent to $\preceq_{\mathsf{sub}}$- or $\preceq_{\mathsf{sup}}$-matching). We first prove the result for the subsequence relation, which is slightly easier, and then cover the supersequence relation.
4.1 Subsequence Matching in Linear Time
Theorem 4.1.
Given a string $w$ and NFA $A$ with $n$ states and $m$ transitions, the subsequence matching problem can be solved in time $O(|w| + m)$.
Proof Sketch.
We perform the state-set simulation of $A^{\uparrow}$ on $w$, i. e., we compute the state sets $S_0, S_1, \ldots, S_{|w|}$, but without explicitly constructing $A^{\uparrow}$, i. e., we work on $A$ directly. Thus, the correctness follows from $L(A^{\uparrow}) = \{u \mid \exists v \in L(A): v \preceq_{\mathsf{sub}} u\}$ (see Section 3).

To do this in time $O(|w| + m)$, we can consider every transition only a constant number of times, which is possible due to the property $S_0 \subseteq S_1 \subseteq \cdots \subseteq S_{|w|}$. More precisely, when we update $S_{i-1}$ to $S_i$, we have to compute $\delta(S_{i-1}, w[i])$ and then $S_i = S_{i-1} \cup \varepsilon\mathsf{cl}(\delta(S_{i-1}, w[i]))$. However, if $q$ is added to $S_i$ due to a transition $(p, w[i], q)$ with $p \in S_{i-1}$, then $q$ will stay in the set of active states until the end of the state-set simulation, and the transition $(p, w[i], q)$ can therefore be ignored afterwards (i. e., it will not play a role in the computation of $S_j$ for any $j > i$). In the computation of $\delta(S_{i-1}, w[i])$ we therefore consider only unmarked transitions $(p, w[i], q)$ with $p \in S_{i-1}$ (i. e., the ones not already used in any previous update step) and then mark them when we use them. After having computed $\delta(S_{i-1}, w[i])$, we must compute the $\varepsilon$-closure, but again we can note that $\varepsilon$-transitions that have already been traversed while computing some previous set $S_j$ with $j < i$ can be ignored since they necessarily lead to states that are already in $S_{i-1}$ and therefore in $S_i$. Consequently, when we compute the $\varepsilon$-closure, we ignore marked transitions and also mark all transitions that we traverse; this also handles the initialisation, where we compute $S_0 = \varepsilon\mathsf{cl}(\{q_0\})$.

To implement this idea efficiently, we use a lazily initialised array $T$ of lists, indexed by the symbols of $\Sigma$, such that, after $S_i$ is computed, $T[a]$ contains exactly the unmarked transitions $(p, a, q)$ with $p \in S_i$. Hence, the transitions from $T[w[i+1]]$ are exactly the ones that we need to compute $S_{i+1}$ as described above. While computing $\delta(S_i, w[i+1])$, we store all new states (i. e., the states discovered via transitions from $T[w[i+1]]$ that are not in $S_i$ already) into a queue $U$. Then we compute the $\varepsilon$-closure of these states, ignoring marked transitions and marking traversed transitions. During this computation, we add to $U$ all new states that we discover. Finally, in order to update $T$, we only have to look at the outgoing transitions $(p, a, q)$ with $a \in \Sigma$ of the states $p \in U$ (which are necessarily unmarked) and add them to $T[a]$.

In this procedure, every transition is either not marked at all, or it is marked in the computation of some $\varepsilon$-closure (including the initialisation) and then ignored, or it is put into $T$ at some update step (or the initialisation) and then marked at some later update step and then ignored. Hence, in addition to the $O(|w|)$ cost of driving the update steps, we have to spend additional time $O(m)$ over the whole state-set simulation, which yields the claimed bound.
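The following self-contained sketch implements this proof idea (our own rendering; Python dictionaries play the role of the lazily initialised arrays, and popping a list corresponds to marking its transitions):

```python
from collections import deque

def matches_some_subsequence(w, n_states, q0, qf, delta):
    """Decide whether some subsequence of w is accepted by the NFA
    (states 0..n_states-1, initial q0, final qf; delta lists transitions
    (p, x, q) with x a symbol or None for epsilon). Every transition is
    handled O(1) times, giving O(|w| + m)."""
    eps = {}   # eps[p]: epsilon-successors of p (consumed once)
    out = {}   # out[p]: outgoing labelled transitions (a, q) of p
    for p, x, q in delta:
        if x is None:
            eps.setdefault(p, []).append(q)
        else:
            out.setdefault(p, []).append((x, q))

    active = [False] * n_states
    T = {}         # T[a]: pending a-transitions leaving active states
    U = deque()    # queue of newly activated states

    def activate(q):
        if not active[q]:
            active[q] = True
            U.append(q)

    def settle():
        # epsilon-closure of the new states, plus registering their
        # outgoing labelled transitions; each transition touched once
        while U:
            p = U.popleft()
            for q in eps.pop(p, ()):
                activate(q)
            for a, q in out.pop(p, ()):
                T.setdefault(a, []).append(q)

    activate(q0)
    settle()                    # S_0
    for a in w:                 # update step S_{i-1} -> S_i
        for q in T.pop(a, ()):  # marked: never considered again
            activate(q)
        settle()
    return active[qf]
```

Note that `T.pop(a, ())` both retrieves and discards the pending $a$-transitions, so no transition is inspected twice, exactly in line with the charging argument above.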
4.2 Supersequence Matching in Linear Time
The general idea is again to check whether $w \in L(A^{\downarrow})$, and unlike in the previous subsection, we can afford to build $A^{\downarrow}$ in time $O(m)$ explicitly. However, performing a state-set simulation in linear time with $A^{\downarrow}$ is more difficult. Intuitively speaking, in order to obtain $S_i$, we have to remove from $S_{i-1}$ all states that cannot be reached by any $w[i]$-labelled path from some other state from $S_{i-1}$. It is not clear how this can be done by touching each transition only a constant number of times over the whole course of the state-set simulation. One ingredient to achieve this is to first decompose $A^{\downarrow}$ into its strongly connected components (SCCs).
Recall that the SCCs of a directed graph are the maximal subsets $C$ of vertices such that, for any two distinct vertices $u$ and $v$ in $C$, there is a directed path from $u$ to $v$ and from $v$ to $u$ in the graph. The SCCs of an NFA $A$ are simply its SCCs when seeing the automaton as a directed graph (ignoring the edge labels in $A$).
The condensation of an NFA $A$, denoted by $\mathcal{C}(A)$, is an NFA whose states are the SCCs of $A$. For convenience, for every state $q$ of $A$, we denote by $C_A(q)$ (or simply $C(q)$ if $A$ is clear from the context) the SCC of $A$ that contains $q$, which, by definition, is a state of $\mathcal{C}(A)$. The transitions of $\mathcal{C}(A)$ are as follows: for every transition $(p, x, q)$ of $A$, we have a transition $(C(p), x, C(q))$ in $\mathcal{C}(A)$. Note that $C(p) = C(q)$ is possible, and note that several transitions may be merged into one in $\mathcal{C}(A)$. The initial state of $\mathcal{C}(A)$ is $C(q_0)$ and the final state of $\mathcal{C}(A)$ is $C(q_f)$.
Proposition 4.2.
Given an NFA $A$, we can construct $\mathcal{C}(A)$ in time $O(m)$.
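Proposition 4.2 can be realised with any linear-time SCC algorithm; the following sketch uses an iterative version of Kosaraju's algorithm (our own rendering) and merges the transitions between each pair of SCCs. The initial and final state of the condensation are then `comp[q0]` and `comp[qf]`.

```python
def condensation(n_states, delta):
    """Map each state to its SCC id (comp) and merge transitions.
    Iterative Kosaraju: O(n + m)."""
    fwd = [[] for _ in range(n_states)]
    rev = [[] for _ in range(n_states)]
    for p, x, q in delta:
        fwd[p].append(q)
        rev[q].append(p)
    # pass 1: states by increasing DFS finishing time on the fwd graph
    order, seen = [], [False] * n_states
    for s in range(n_states):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, iter(fwd[s]))]
        while stack:
            v, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:
                order.append(v)
                stack.pop()
            elif not seen[nxt]:
                seen[nxt] = True
                stack.append((nxt, iter(fwd[nxt])))
    # pass 2: collect SCCs on the reversed graph, latest finisher first
    comp, num_sccs = [-1] * n_states, 0
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s] = num_sccs
        stack = [s]
        while stack:
            v = stack.pop()
            for u in rev[v]:
                if comp[u] == -1:
                    comp[u] = num_sccs
                    stack.append(u)
        num_sccs += 1
    merged = {(comp[p], x, comp[q]) for p, x, q in delta}
    return comp, num_sccs, merged
```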
Leaving aside the self-loops, the condensation of an NFA has the convenient property of being a directed acyclic graph (DAG). Constructing the condensation of an automaton changes in general the language that it accepts. However, we shall next see that this is not the case for the NFA $A^{\downarrow}$, i. e., we have $L(\mathcal{C}(A^{\downarrow})) = L(A^{\downarrow})$.
Lemma 4.3.
Let $A$ be an NFA and let $B := \mathcal{C}(A^{\downarrow})$. Then $L(B) = L(A^{\downarrow})$.
Hence, $L(B) = \{u \mid \exists v \in L(A): u \preceq_{\mathsf{sub}} v\}$. Moreover, $B$ inherits the crucial properties of $A^{\downarrow}$, namely:
Proposition 4.4.
If $B$ has a transition $(p, a, q)$ for some $a \in \Sigma$, then it also has a transition $(p, \varepsilon, q)$. Further, if $S_0, S_1, \ldots, S_{|w|}$ are the state sets of the state-set simulation of $B$ on $w$, then $S_0 \supseteq S_1 \supseteq \cdots \supseteq S_{|w|}$ and $S_0$ contains all states of $B$.
Our next goal is to show that we can implement the state-set simulation of $B$ on $w$ in linear time. For this, we will need to efficiently identify states that do not have a self-loop for the next input symbol (since if such a self-loop exists, then the state simply stays active). Identifying such states is challenging: while there are at most $m$ self-loops, there might be up to $n \cdot |\Sigma|$ pairs $(q, a)$ overall where $q$ does not have a self-loop labelled $a$, so we cannot explicitly materialise these pairs. Instead, we use a specific data structure:
Lemma 4.5.
For an NFA $B$, we can build in time $O(m + |\Sigma|)$ a data structure storing a set $S$ of states, initially empty, and supporting the following operations:

- Push: Insert a state in $S$.
- Pop: Given a symbol $a \in \Sigma$, retrieve and remove all states of $S$ without a self-loop labelled with $a$ (or indicate that no such state exists).

Over a sequence of push and pop operations where each state of $B$ is pushed at most once, the total running time is $O(m + |\Sigma|)$ plus $O(1)$ per operation, where $m$ is the number of transitions of $B$.
We will use the above structure for $B = \mathcal{C}(A^{\downarrow})$. We can finally show our main result:
Theorem 4.6.
Given a string $w$ and NFA $A$ with $n$ states and $m$ transitions, the supersequence matching problem can be solved in time $O(|w| + m)$.
Proof Sketch.
By our considerations from above, we only have to check whether $w \in L(B)$, where $B = \mathcal{C}(A^{\downarrow})$ with $L(B) = L(A^{\downarrow})$ and $|B| = O(m)$; recall that $B$ can be constructed in linear time. We will do this by a state-set simulation as usual, but we exploit the properties of $B$ (see Proposition 4.4) to implement it in linear time.
We consider an arbitrary update step, in which $S_{i-1}$ is updated to $S_i$ upon reading $a = w[i]$. The current state-set $S_{i-1}$ is a downward closed set of states (meaning that if a state is in $S_{i-1}$, then so are all its successors); this follows from the fact that $B$ is a DAG (leaving aside self-loops) and from Proposition 4.4. The states in $Q \setminus S_{i-1}$ and their outgoing transitions can be completely ignored (they will never become active again or be responsible for removing some states of the current set of active states), so we can as well assume that they are not part of $B$ anymore. As $B$ is a DAG, the set $S_{i-1}$ has some roots (at least one), i. e., states without incoming transitions (apart from those from $Q \setminus S_{i-1}$, which we ignore anyway). Let $R$ be the set of these roots: it will be efficiently maintained throughout the computation using the data structure of Lemma 4.5. To compute $S_i$, we now have to identify and remove those states from $S_{i-1}$ that cannot be reached by an $a$-labelled path from some state of $S_{i-1}$, which can be done as follows.
For every root $r \in R$ we know that $r \in S_i$ if and only if $r$ has an $a$-self-loop, and if $r$ has such a self-loop, then all its successors are also in $S_i$ (and can thus be ignored in the rest of the update procedure). The roots without an $a$-self-loop are found using Lemma 4.5 and removed together with their outgoing transitions; further, for every such outgoing transition $(r, x, q)$, we have to do the following. If $x = a$, then $q \in S_i$, and if $x = \varepsilon$, then we check if $q$ has an $a$-self-loop and, if so, $q \in S_i$ as well. If, on the other hand, $q$ has no $a$-self-loop and at least one other incoming transition, then we do not yet know whether it is in $S_i$ or not, since this might be determined by other incoming transitions. If $q$ has neither an $a$-self-loop nor has it already been determined to be in $S_i$, but it becomes a root when we delete $r$, then it has to be treated as one of the initial roots (i. e., it will be deleted at some point and its outgoing transitions have to be deleted and processed as described above). The states that become roots can be easily found by keeping for each state $q$ a count of the number of transitions that still lead to $q$. This procedure is repeated until we have taken care of all the initial roots and all states that become roots (and are not already determined to be in $S_i$) during this procedure. Once this terminates, we have computed the set $S_i$, and the set of roots was also updated accordingly.
Intuitively, the correctness follows from the fact that our algorithm constructs exactly the same sets $S_i$ as the state-set simulation. With respect to the running time, note that we perform $|w|$ update steps and consider each transition at most once. However, at the beginning of each update step (say, when computing $S_i$), we need access to all the roots of $S_{i-1}$ without $w[i]$-self-loops. For this, we use the data structure from Lemma 4.5. We initially perform push operations for the roots of $B$ and then, over the course of all update steps, we push and pop each state at most once. So we use $O(|w| + n)$ operations in total, which, by Lemma 4.5, adds up to $O(|w| + m + |\Sigma|) = O(|w| + m)$ time.
5 Quantitative Problem Variants
For every $\preceq \in \{\preceq_{\mathsf{pref}}, \preceq_{\mathsf{inf}}, \preceq_{\mathsf{lext}}, \preceq_{\mathsf{ext}}, \preceq_{\mathsf{sub}}, \preceq_{\mathsf{sup}}\}$, we consider now the min- and max-variant of the $\preceq$-matching problem, i. e., computing a shortest or longest string $u \in L(A)$ with $u \preceq w$. We can show an upper bound of $O(|w| \cdot m)$ for all these problem variants. (Observe that, of course, these upper bounds also constitute upper bounds for the non-quantitative matching problems for $\preceq_{\mathsf{pref}}$, $\preceq_{\mathsf{inf}}$, $\preceq_{\mathsf{lext}}$ and $\preceq_{\mathsf{ext}}$; see the upper table in Figure 1.)
Theorem 5.1.
The min- and max-variant of the matching problem can be solved in time $O(|w| \cdot m)$ for all relations $\preceq \in \{\preceq_{\mathsf{pref}}, \preceq_{\mathsf{inf}}, \preceq_{\mathsf{lext}}, \preceq_{\mathsf{ext}}, \preceq_{\mathsf{sub}}, \preceq_{\mathsf{sup}}\}$.
However, there are some differences with respect to the lower bounds (to be presented in Section 7): We can show that the upper bound is conditionally tight, i. e., $O((|w| \cdot m)^{1-\epsilon})$ is not possible unless SETH fails, for all problem variants except the min-variant of subsequence matching and the max-variant of supersequence matching. For these two variants, we can only show the somewhat weaker statement that an $O(|w| + m)$ algorithm yields an algorithm that decides in linear time whether a dense graph contains a triangle (note that any truly subcubic combinatorial algorithm for triangle detection implies a truly subcubic algorithm for Boolean matrix multiplication [61]).
Let us now give proof sketches for the results of Theorem 5.1.
General Proof Technique.
We prove the upper bounds of the different problem variants in the same way: We reduce $A$ and $w$ to a $(\Sigma \cup \{\varepsilon\})$-labelled directed graph $G$ with edge weights of $0$ or $1$ and with a designated start vertex $s$ and target vertex $t$, such that the shortest or longest infix, prefix, extension, left-extension, subsequence or supersequence of $w$ accepted by $A$ corresponds to a path from $s$ to $t$ with minimum or maximum weight, respectively, where the label of the path corresponds to the witness string (i. e., the infix, prefix, etc.) and its weight to the length of the witness string. By basic graph algorithmic techniques, it is not difficult to check in time $O(|G|)$ whether there is a path from $s$ to $t$ and, if so, to compute such a path (along with its label) having minimum or maximum weight (or to report that the weight is unbounded for such paths). Since $|G| = O(|w| \cdot m)$ for all our reductions, the upper bound follows. Let us in the following sketch the reductions for the different relations.
Infix and Prefix Relations.
Let us just discuss the infix relation, since the prefix relation is similar. We can build in time $O(|w| \cdot m)$ the product graph $G$, which has the vertices $(q, i)$ for all $q \in Q$ and $i \in \{0, 1, \ldots, |w|\}$, and edges as follows. For each transition $(p, \varepsilon, q)$ in $\delta$ and for each $i \in \{0, 1, \ldots, |w|\}$, add an edge from $(p, i)$ to $(q, i)$ in $G$ with label $\varepsilon$ and weight $0$. For each transition $(p, a, q)$ with $a \in \Sigma$ and for each $i \in \{1, \ldots, |w|\}$ with $w[i] = a$, add an edge from $(p, i-1)$ to $(q, i)$ in $G$ with label $a$ and weight $1$.

To this product graph we add a source vertex $s$ with $\varepsilon$-labelled edges of weight $0$ from $s$ to $(q_0, i)$ for each $i$, and a target vertex $t$ with $\varepsilon$-labelled edges of weight $0$ from $(q_f, i)$ to $t$ for each $i$. It can now be seen that paths from $s$ to $t$ in $G$ are in one-to-one correspondence with accepting runs of $A$ over the infixes of $w$, with the weight of the path being the length of the corresponding infix. As mentioned above, we can compute weight-minimum and weight-maximum paths from $s$ to $t$ along with their weights in time $O(|w| \cdot m)$.
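For the min-variant, one concrete way to realise the claimed bound is a 0-1-BFS over the product graph, built implicitly; the following sketch (our own rendering, with 0-indexed positions and `None` for $\varepsilon$) returns the length of a shortest infix of $w$ accepted by the NFA:

```python
from collections import deque

def shortest_accepted_infix(w, n_states, q0, qf, delta):
    """Length of a shortest infix of w accepted by the NFA, or None.
    0-1-BFS on the product graph: vertices (state, position), epsilon
    edges cost 0, symbol edges cost 1; O(|w| * m) time overall."""
    n = len(w)
    eps = [[] for _ in range(n_states)]
    sym = [[] for _ in range(n_states)]  # pairs (a, q), a a real symbol
    for p, x, q in delta:
        if x is None:
            eps[p].append(q)
        else:
            sym[p].append((x, q))
    INF = float('inf')
    dist = [[INF] * (n + 1) for _ in range(n_states)]
    dq = deque()
    for i in range(n + 1):        # source: an infix may start anywhere
        dist[q0][i] = 0
        dq.appendleft((q0, i))
    best = INF
    while dq:
        p, i = dq.popleft()
        d = dist[p][i]
        if p == qf:               # target: an infix may end anywhere
            best = min(best, d)
        for q in eps[p]:          # weight-0 edge (p, i) -> (q, i)
            if d < dist[q][i]:
                dist[q][i] = d
                dq.appendleft((q, i))
        if i < n:                 # weight-1 edge (p, i) -> (q, i + 1)
            for a, q in sym[p]:
                if a == w[i] and d + 1 < dist[q][i + 1]:
                    dist[q][i + 1] = d + 1
                    dq.append((q, i + 1))
    return None if best == INF else best
```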
Extension and Left-Extension Relations.
Let us only discuss the case of the extension relation, since the left-extension can be handled similarly (and is generally a bit simpler).
We build again the product graph $G$, but we also add to it two graphs that are basically copies of the NFA $A$. One such copy has the vertices $q_{\mathsf{L}}$ for all $q \in Q$, and edges as follows: for each transition $(p, x, q)$ in $\delta$ there is an edge from $p_{\mathsf{L}}$ to $q_{\mathsf{L}}$ labelled $x$ and having weight $0$ if $x = \varepsilon$ and weight $1$ otherwise. The other copy has the vertices $q_{\mathsf{R}}$ for all $q \in Q$ and is defined in exactly the same way as the first copy but replacing the "$\mathsf{L}$" subscripts by "$\mathsf{R}$" subscripts. Then, for each state $q \in Q$ we add an $\varepsilon$-labelled edge with weight $0$ from $q_{\mathsf{L}}$ to $(q, 0)$, and for each state $q \in Q$ we add an $\varepsilon$-labelled edge with weight $0$ from $(q, |w|)$ to $q_{\mathsf{R}}$. We call the resulting graph $G'$ and we note that this construction can be done in time $O(|w| \cdot m)$. Let the source vertex of $G'$ be $(q_0)_{\mathsf{L}}$ and let the target vertex of $G'$ be $(q_f)_{\mathsf{R}}$.
Now, every path from $(q_0)_{\mathsf{L}}$ to $(q_f)_{\mathsf{R}}$ in $G'$ can be decomposed into three parts: (1) a path from $(q_0)_{\mathsf{L}}$ to $p_{\mathsf{L}}$ for some $p \in Q$, which is a path labelled with a string $u$, with weight $|u|$, and such that there is a run from $q_0$ to $p$ in $A$ when reading $u$; (2) a path from $(p, 0)$ to $(p', |w|)$ for some $p' \in Q$, which is a path labelled by $w$, with weight $|w|$, and that is witnessing that there is a run from $p$ to $p'$ in $A$ when reading $w$; (3) a path from $p'_{\mathsf{R}}$ to $(q_f)_{\mathsf{R}}$ labelled with a string $v$, with weight $|v|$, and such that there is a run from $p'$ to $q_f$ in $A$ when reading $v$. Consequently, $uwv$ is an extension of $w$ accepted by $A$, and the weight of the path is $|uwv|$. Conversely, it can be seen that, for any strings $u, v$ such that the extension $uwv$ of $w$ is accepted by $A$, there must be a path in $G'$ from $(q_0)_{\mathsf{L}}$ to $(q_f)_{\mathsf{R}}$ labelled with $uwv$ and with weight $|u| + |w| + |v|$.
Subsequence and Supersequence Relations.
Let us begin with the subsequence relation. We start with the product graph $G$ and add edges with weight $0$ and label $\varepsilon$ from $(q, i-1)$ to $(q, i)$ for each state $q$ of $A$ and each $i \in \{1, \ldots, |w|\}$. Let us call these additional edges extra-edges, and note that they mimic the self-loops $(q, a, q)$ with $a \in \Sigma$ added in the NFA $A^{\uparrow}$ defined in Section 3, relative to the original NFA $A$. However, note that we do not explicitly construct $A^{\uparrow}$, as we cannot afford its $\Theta(n \cdot |\Sigma|)$ additional transitions within the time bound $O(|w| \cdot m)$. We call this graph $G_{\mathsf{sub}}$ and note that it can be constructed in time $O(|w| \cdot m)$. Let the source of $G_{\mathsf{sub}}$ be $(q_0, 0)$ and let the target be $(q_f, |w|)$.

An accepting run of $A^{\uparrow}$ on $w$ that corresponds to a subsequence $u$ of $w$ (spelled out by the symbols read by the original transitions from $A$) translates into a path from $(q_0, 0)$ to $(q_f, |w|)$ with label $u$ and weight $|u|$: Each transition of the run that is an original transition of $A$ translates into a corresponding edge of $G_{\mathsf{sub}}$ from $(p, i-1)$ to $(q, i)$ or from $(p, i)$ to $(q, i)$ (depending on whether it reads the symbol $w[i]$ or is an $\varepsilon$-transition), where $i$ depends on the length of the prefix of $w$ already consumed; and each loop-transition that is added to $A$ in the construction of $A^{\uparrow}$ translates into an extra-edge from $(q, i-1)$ to $(q, i)$. Conversely, any path from $(q_0, 0)$ to $(q_f, |w|)$ with label $u$ and weight $|u|$ translates into an accepting run of $A^{\uparrow}$ on $w$ such that its original transitions of $A$ read exactly $u$, which therefore must be a subsequence of $w$.
For the supersequence relation, the reduction is slightly different. We start again with the product graph $G$, but now we add, for each transition $(p, a, q)$ in $\delta$ with $a \in \Sigma$ and each $i \in \{0, 1, \ldots, |w|\}$, an $a$-labelled edge with weight $1$ from $(p, i)$ to $(q, i)$. Now, these extra-edges mimic the $\varepsilon$-transitions added in the NFA $A^{\downarrow}$ defined in Section 3, relative to the original NFA $A$. We note that for any $\varepsilon$-transition $(p, \varepsilon, q)$ of $A^{\downarrow}$ that is not a transition of $A$, there is at least one (and there may be several) transition $(p, a, q)$ with $a \in \Sigma$ in $A$; thus, there are also several (and at least one) extra-edges from $(p, i)$ to $(q, i)$ that differ in their labels. We call the resulting graph $G_{\mathsf{sup}}$.

Similarly to the subsequence case, the paths from $(q_0, 0)$ to $(q_f, |w|)$ in $G_{\mathsf{sup}}$ correspond to accepting runs of $A^{\downarrow}$ on $w$, where the label of the path is the supersequence $u$ of $w$ induced by the accepting run and its weight is $|u|$. The original transitions of $A$ translate into edges of $G_{\mathsf{sup}}$ as before, but every transition of $A^{\downarrow}$ that is not a transition of $A$ translates into an extra-edge from $(p, i)$ to $(q, i)$ (where again $i$ depends on the length of the prefix of $w$ that has already been consumed). As observed above, there might be several extra-edges from $(p, i)$ to $(q, i)$ that differ in their labels; we just take any of those (in fact, this decision only affects the symbols of the supersequence that do not belong to $w$, but not its length). Also note that unlike in the subsequence case, extra-edges now have weight $1$, since they correspond to symbols of the supersequence $u$ of $w$ (while in the subsequence case they correspond to symbols of $w$ that do not belong to the accepted subsequence $u$).
6 Universal Problem Variants
We now consider the universal variants, i. e., checking whether all strings $u$ with $u \preceq w$ are in $L(A)$, i. e., whether $\preceq(w) \subseteq L(A)$.
Theorem 6.1.
The universal-variant of the matching problem can be solved in time $O(|w| \cdot m)$ for the prefix relation and in time $O(|w|^2 \cdot m)$ for the infix relation. It is PSPACE-complete for the extension and left-extension relations, coNP-complete for the subsequence relation, and PSPACE-complete for the supersequence relation.
Proof Sketch.
Checking whether all prefixes of a string are accepted by an automaton can be easily done via state-set simulation (by checking after the initialisation and after every update step that the final state is active), and checking the same for all infixes can be done similarly by running a state-set simulation from every starting position. Let us show why the cases of the relations $\preceq_{\mathsf{lext}}$, $\preceq_{\mathsf{ext}}$, $\preceq_{\mathsf{sub}}$ and $\preceq_{\mathsf{sup}}$ follow from known facts about finite automata.
The PSPACE upper bound with respect to $\preceq_{\mathsf{ext}}$ follows from the fact that $\preceq_{\mathsf{ext}}(w) = \Sigma^* w \Sigma^*$ can be expressed by an NFA, and NFA-inclusion can be decided in PSPACE. The reasoning is analogous for $\preceq_{\mathsf{lext}}$. The same argument shows PSPACE membership for $\preceq_{\mathsf{sup}}$ by building an NFA accepting the supersequences of $w$. The $\preceq_{\mathsf{sub}}$-variant is in coNP, since to refute universality we can guess a subsequence of $w$ and check that it is not accepted.
The PSPACE-hardness with respect to $\preceq_{\mathsf{lext}}$, $\preceq_{\mathsf{ext}}$ and $\preceq_{\mathsf{sup}}$ is easy to see: For $w = \varepsilon$, these problems are equivalent to NFA-universality, i. e., checking whether $L(A) = \Sigma^*$, which is PSPACE-hard. This leaves the coNP-hardness for $\preceq_{\mathsf{sub}}$, which can be shown as follows. For a $3$-CNF formula $\varphi$ over $n$ variables we build an NFA $A_\varphi$ that accepts all $\{0, 1\}$-strings whose length differs from $n$ and all $\{0, 1\}$-strings of length $n$ that represent non-satisfying assignments of $\varphi$. Then, for the string $w = (01)^n$ (which has every $\{0, 1\}$-string of length $n$ as a subsequence), we have $\preceq_{\mathsf{sub}}(w) \subseteq L(A_\varphi)$ if and only if $\varphi$ is not satisfiable.
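To make the gadget concrete, here is a sketch of one possible encoding (our own rendering; the exact automaton in the full version [5] may differ): one branch per clause accepts exactly the length-$n$ assignments falsifying that clause, and an extra chain accepts all lengths $\ne n$ up to $2n$ (the maximal length of a subsequence of $(01)^n$).

```python
from itertools import count

def unsat_nfa(n_vars, clauses):
    """NFA over {'0','1'} accepting all strings of length != n_vars
    (up to 2 * n_vars) and all length-n strings falsifying some clause.
    A clause is a list of signed ints, e.g. [1, -3] for (x1 or not x3).
    Returns (q0, qf, delta); None stands for epsilon."""
    fresh = count(2)
    q0, qf = 0, 1
    delta = []
    # one branch per clause: position i must read the falsifying bit if
    # x_i occurs in the clause, and may read either bit otherwise
    for clause in clauses:
        forced = {abs(lit): ('0' if lit > 0 else '1') for lit in clause}
        prev = q0
        for i in range(1, n_vars + 1):
            nxt = qf if i == n_vars else next(fresh)
            for bit in ('0', '1'):
                if forced.get(i, bit) == bit:
                    delta.append((prev, bit, nxt))
            prev = nxt
    # a chain accepting every length l with 0 <= l <= 2n except l == n
    delta.append((q0, None, qf))           # the empty string
    prev = q0
    for l in range(1, 2 * n_vars + 1):
        nxt = next(fresh)
        for bit in ('0', '1'):
            delta.append((prev, bit, nxt))
        if l != n_vars:
            delta.append((nxt, None, qf))  # epsilon-jump: accept length l
        prev = nxt
    return q0, qf, delta
```

With this encoding, $\preceq_{\mathsf{sub}}((01)^n) \subseteq L(A_\varphi)$ holds exactly when no length-$n$ subsequence encodes a satisfying assignment, i. e., when $\varphi$ is unsatisfiable.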
7 Conditional Lower Bounds
We now present the conditional lower bounds that are stated in Figure 1. These bounds apply to all problems for which we showed a time complexity higher than the optimal linear complexity $O(|w| + m)$. Further, they match the upper bounds except in three cases: the universal-variant of the infix relation, and the min-variant (resp., max-variant) of the subsequence relation (resp., supersequence relation). The section is structured in two parts. We first show lower bounds for the infix, prefix, extension and left-extension relations: Theorem 7.1 covers the matching problem and its min- and max-variants, and Theorem 7.2 covers lower bounds for the universal-variant with the infix and prefix relations, noting that the universal-variant for extension and left-extension was shown earlier to be PSPACE-complete (Theorem 6.1). Second, we show lower bounds for the min- and max-variants of the sub- and supersequence relations; the case of the universal-variant was covered in Theorem 6.1 too.
Infix, Prefix, Extension and Left-Extension Relations.
We first observe that SETH-based lower bounds for the $\preceq_{\mathsf{inf}}$-, $\preceq_{\mathsf{pref}}$-, $\preceq_{\mathsf{ext}}$- and $\preceq_{\mathsf{lext}}$-matching problem can be concluded by minor adaptations of the original construction from [9]. Moreover, since the min- and max-variants also solve the non-quantitative version of the matching problem, we can conclude that these lower bounds also hold for the quantitative variants:
Theorem 7.1.
If the matching problem for the infix, prefix, extension or left-extension relation can be solved in time $O((|w| \cdot m)^{1-\epsilon})$ for some $\epsilon > 0$, then SETH fails. The same holds for the min- and max-variants, even if we only require the length of the answer strings.
Further, we can show that the upper bound for the universal variant of the matching problem for the prefix or infix relation is also optimal under SETH. The construction can be sketched as follows. For a string $w$ and NFA $A$, we build an NFA $A'$ with $L(A') = \Sigma^* \cup L(A) \cdot \#$ for a fresh symbol $\#$. It is not hard to see that all prefixes of $w\#$ match $A'$ if and only if $w \in L(A)$. Hence, the classical matching problem reduces to the universal variant of the $\preceq_{\mathsf{pref}}$-matching problem, so that lower bounds as in Theorem 7.1 apply. The case of $\preceq_{\mathsf{inf}}$ is very similar. We therefore have:
Theorem 7.2.
If the universal variant of the matching problem for the prefix or infix relation can be solved in time $O((|w| \cdot m)^{1-\epsilon})$ for some $\epsilon > 0$, then SETH fails.
Subsequence and Supersequence Relations.
With respect to $\preceq_{\mathsf{sub}}$ and $\preceq_{\mathsf{sup}}$, we have proven optimal linear upper bounds for the non-quantitative variants in Section 4. However, for the min- and max-variants, we only have upper bounds of $O(|w| \cdot m)$, which we will now complement with conditional lower bounds.
We can easily construct NFAs that accept exactly the subsequences of a string $v$ (and supersequences of a string $v$, respectively) – note that this is an easy special case of [8, Lemma 8] which we reviewed in Section 3. This means that the max-variant of the $\preceq_{\mathsf{sub}}$-matching problem applied to such an NFA and a string $w$ amounts to computing the longest common subsequence of $v$ and $w$. Likewise, the min-variant of the $\preceq_{\mathsf{sup}}$-matching problem applied to such an NFA and $w$ admits a reduction from the shortest common supersequence problem. Thus, solving these problems would allow us to solve the longest common subsequence problem (or shortest common supersequence problem, respectively) for $v$ and $w$. This means that the known SETH-conditional lower bounds on the longest common subsequence and shortest common supersequence problems carry over to these quantitative variants of the matching problem (see [13, 1]). Note that these lower bounds already hold when computing the length of the sequences, and further they already hold for constant-sized alphabets, so it is not a problem that the automaton for the supersequences of $v$ generally has size $\Theta(|v| \cdot |\Sigma|)$.
Theorem 7.3.
If the max-variant of the matching problem for the subsequence relation or the min-variant of the matching problem for the supersequence relation can be solved in time $O((|w| \cdot m)^{1-\epsilon})$ for some $\epsilon > 0$, then SETH fails. This holds even if we only require the length of the answer strings.
This leaves the question of the optimality of the $O(|w| \cdot m)$ upper bound for the min-variant of the $\preceq_{\mathsf{sub}}$-matching problem and the max-variant of the $\preceq_{\mathsf{sup}}$-matching problem. For these, we are not able to prove SETH lower bounds that exclude $O((|w| \cdot m)^{1-\epsilon})$ for some $\epsilon > 0$, but we can argue that the existence of algorithms with running time $O(|w| + m)$ has some unlikely consequences.
Theorem 7.4.
If the min-variant of the matching problem for the subsequence relation or the max-variant of the matching problem for the supersequence relation can be solved in time $O(|w| + m)$, then we can decide whether a given dense graph has a triangle in time linear in the size of the graph. This holds even if we only require the length of the answer strings.
Let us mention that any truly subcubic combinatorial algorithm for triangle detection (i. e., an algorithm with running time $O(n^{3-\epsilon})$ for some $\epsilon > 0$, where $n$ is the number of vertices) yields a truly subcubic combinatorial algorithm for Boolean matrix multiplication, which is considered unlikely (see [61]). So, conditional on this hypothesis, the problems above cannot be solved in time $O(|w| + m)$, i. e., they do not enjoy the same linear-time complexity as the non-quantitative versions of subsequence or supersequence matching.
We now sketch the proof of Theorem 7.4. A graph $G = (V, E)$ with $V = \{v_1, \ldots, v_n\}$ which is dense (i. e., a graph with $|E| = \Theta(|V|^2)$) can be transformed into an NFA $A_G$ over the alphabet $\{a, b\}$ as follows. We initialise $A_G$ with four layers of states, each containing a copy of every vertex of $V$, and with $a$-transitions following $E$, i. e., every edge $\{u, v\} \in E$ translates into $a$-transitions from the copies of $u$ and $v$ in each layer to the copies of $v$ and $u$, respectively, in the next layer. This is called the middle part of $A_G$. For every $v_i \in V$, we call the copy of $v_i$ in the first layer the $v_i$-entry point and the copy of $v_i$ in the fourth layer the $v_i$-exit point. Now, $G$ has a triangle if and only if there is an $i$ and an $aaa$-labelled path from the $v_i$-entry point to the $v_i$-exit point. We next add a left part to $A_G$: we add a state $q_0$, which will be the initial state, and for every $i \in \{1, \ldots, n\}$ there is a path from $q_0$ to the $v_i$-entry point with $n + i$ transitions, $n - i$ many of which are labelled with $b$ and the rest (so $2i$ transitions) are labelled with $a$ (the order does not matter). For the right part, we add a state $q_f$, which will be the accepting state, and for every $j \in \{1, \ldots, n\}$ there is a path from the $v_j$-exit point to $q_f$ with $2n - j$ transitions, $j$ many of which are labelled with $b$ and the rest (so $2n - 2j$ transitions) are labelled with $a$ (the order does not matter). This concludes the definition of $A_G$. Note that any string accepted by $A_G$ must go through precisely one entry point and precisely one exit point. We make the following general observation: Every string accepted by $A_G$ with an accepting run going through the $v_i$-entry point and the $v_j$-exit point (1.) has length $3n + 3 + (i - j)$ and (2.) has $(n - i) + j$ occurrences of $b$ (due to how the numbers of $b$'s are chosen in the left and right part of $A_G$).
Now for the min-variant of the matching problem for the subsequence relation, we consider the string $w = (a^{4n+3} b)^n a^{4n+3}$ and ask if the minimum length of a subsequence of $w$ accepted by $A_G$ is $3n + 3$. The value $3n + 3$ is a lower bound on the length of subsequences of $w$ accepted by $A_G$. Indeed, consider any subsequence $u$ of $w$ accepted via the $v_i$-entry point and the $v_j$-exit point: it has at most $n$ occurrences of $b$ (by definition of $w$), so by item (2.) of our general observation, $(n - i) + j \le n$, i. e., $j \le i$, which means by item (1.) of the general observation that $|u| = 3n + 3 + (i - j) \ge 3n + 3$. Now, let us show that a length-$(3n+3)$ subsequence of $w$ is accepted by $A_G$ if and only if $G$ has a triangle. In one direction, if a length-$(3n+3)$ subsequence of $w$ is accepted by $A_G$, then this is only possible via the same entry point and exit point (i. e., $i = j$) by item (1.) of the general observation; thus, $G$ has a triangle. And if $G$ has a triangle containing $v_i$, then we can read a string with entry point $v_i$ and exit point $v_i$, which, since it contains $n$ occurrences of $b$ in total and no run of more than $4n + 3$ consecutive $a$'s, can easily be obtained as a length-$(3n+3)$ subsequence of $w$.
For the max-variant of the matching problem for the supersequence relation, we consider the string $w = b^n$ and ask if the maximum length of a supersequence of $w$ accepted by $A_G$ is $3n + 3$ (i. e., the same length as in the previous proof). The value $3n + 3$ is now an upper bound on the length of supersequences of $w$ accepted by $A_G$. Indeed, any supersequence of $w$ accepted via the $v_i$-entry point and the $v_j$-exit point has at least $n$ occurrences of $b$, so by item (2.) of our general observation we have $(n - i) + j \ge n$, i. e., $i \le j$, which means by item (1.) of our general observation that its length is at most $3n + 3$. Now, the argument is similar as before: If $A_G$ accepts a length-$(3n+3)$ supersequence of $w$, then this is only possible via the same entry point and exit point (i. e., $i = j$), and therefore $G$ has a triangle; and if $G$ has a triangle containing $v_i$, then we can read a string with entry point $v_i$ and exit point $v_i$, which, since it has $n$ occurrences of $b$ in total, is a length-$(3n+3)$ supersequence of $w = b^n$.
8 Conclusion and Future Work
We investigated the regex matching problem in a more general setting, where instead of asking whether $w$ matches $r$, we ask whether there is a string $u$ with $u \preceq w$ that matches $r$, where $\preceq$ is some string relation. We demonstrated that for $\preceq$ being the well-known subsequence and supersequence relations, this makes regex matching linear-time solvable. Interestingly, this tractability does not extend to other natural and simple relations, and it also does not generalise to quantitative variants asking for the shortest or longest result. For future research it would be interesting to find more string relations with this property, e. g., the subsequence relation with gap size bounded by some $k$, where $u \preceq w$ if $u = w[i_1] w[i_2] \cdots w[i_\ell]$ with $i_1 < i_2 < \cdots < i_\ell$ and $i_{t+1} - i_t \le k + 1$ for every $t \in \{1, \ldots, \ell - 1\}$. However, we expect that such small modifications would make the complexity quadratic again, i. e., the classical SETH-reduction applies.
It seems particularly interesting that we can compute in $O(|w| \cdot m)$ a longest subsequence $u$ and a shortest supersequence $v$ of $w$ that match $r$, since the values $|w| - |u|$ and $|v| - |w|$ (or their sum) can be interpreted as a measure of how much $w$ does not match $r$ (note that this measure is $0$ if and only if $w$ matches $r$). It might be interesting to investigate how fast this measure can be computed for other (practically motivated) classes of regular expressions, e. g., deterministic regular expressions or regular expressions with counters (or even strictly more powerful language classes). Another relevant question is whether we can close the gap between the upper bound $O(|w|^2 \cdot m)$ and the lower bound for the universal-variant of the infix problem. In other words, given an NFA $A$ and string $w$, can one efficiently check whether there is an infix of $w$ which is not accepted by $A$, faster than the easy $O(|w|^2 \cdot m)$ upper bound? Another similar question is whether there is a genuine difference between the complexities of the min- and max-variants of the subsequence and supersequence problems, given that we could not show the same conditional lower bounds for these problems.
Let us also point out that none of our upper bounds rely on heavy algorithmic machinery or data structures and therefore our asymptotic running times do not hide huge constant factors. Hence, we believe that our algorithms have straightforward and practically relevant implementations (just like the classical state-set simulation does).
Finally, our tractability results on sub- and supersequence matching can also be seen as tractability results for the matching problem on automata that satisfy certain conditions, namely, that their language is upward- or downward-closed. Our results imply that, on such automata, we can perform matching in time $O(|w| + m)$ instead of $O(|w| \cdot m)$. One natural question is which other restrictions on automata make it possible to achieve such improved bounds. Examples in this direction are the regex classes studied in [9, 12] or automata classes where determinisation is done in polynomial time, e. g., NFAs of bounded co-lex width [16].
References
- [1] Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for LCS and other sequence similarity measures. In FOCS, 2015. doi:10.1109/FOCS.2015.14.
- [2] Amir Abboud, Virginia Vassilevska Williams, and Oren Weimann. Consequences of faster alignment of sequences. In ICALP, 2014. doi:10.1007/978-3-662-43948-7_4.
- [3] Parosh Aziz Abdulla, Ahmed Bouajjani, and Bengt Jonsson. On-the-fly analysis of systems with unbounded, lossy fifo channels. In CAV, 1998. doi:10.1007/BFB0028754.
- [4] Duncan Adamson, Pamela Fleischmann, Annika Huch, Tore Koß, Florin Manea, and Dirk Nowotka. k-Universality of regular languages. In ISAAC, volume 283 of LIPIcs, 2023. doi:10.4230/LIPICS.ISAAC.2023.4.
- [5] Antoine Amarilli, Florin Manea, Tina Ringleb, and Markus L. Schmid. Linear time subsequence and supersequence regex matching. CoRR, abs/2504.16288, 2025. doi:10.48550/arXiv.2504.16288.
- [6] Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan L. Reutter, and Domagoj Vrgoc. Foundations of modern query languages for graph databases. ACM Comput. Surv., 50(5), 2017. doi:10.1145/3104031.
- [7] Alexander Artikis, Alessandro Margara, Martín Ugarte, Stijn Vansummeren, and Matthias Weidlich. Complex event recognition languages: Tutorial. In DEBS, 2017. doi:10.1145/3093742.3095106.
- [8] Georg Bachmeier, Michael Luttenberger, and Maximilian Schlund. Finite automata for the sub-and superword closure of CFLs: Descriptional and computational complexity. In ICALP, 2015. doi:10.1007/978-3-319-15579-1_37.
- [9] Arturs Backurs and Piotr Indyk. Which regular expression patterns are hard to match? In FOCS, 2016. doi:10.1109/FOCS.2016.56.
- [10] Lasse Bergroth, Harri Hakonen, and Timo Raita. A survey of longest common subsequence algorithms. In SPIRE, 2000. doi:10.1109/SPIRE.2000.878178.
- [11] Karl Bringmann and Bhaskar Ray Chaudhury. Sketching, streaming, and fine-grained complexity of (weighted) LCS. In FSTTCS, LIPIcs, 2018. doi:10.4230/LIPICS.FSTTCS.2018.40.
- [12] Karl Bringmann, Allan Grønlund, Marvin Künnemann, and Kasper Green Larsen. The NFA acceptance hypothesis: Non-combinatorial and dynamic lower bounds. In ITCS, 2024. doi:10.4230/LIPICS.ITCS.2024.22.
- [13] Karl Bringmann and Marvin Künnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In FOCS, 2015. doi:10.1109/FOCS.2015.15.
- [14] Karl Bringmann and Marvin Künnemann. Multivariate fine-grained complexity of longest common subsequence. In SODA, 2018. doi:10.1137/1.9781611975031.79.
- [15] Sam Buss and Michael Soltys. Unshuffling a square is NP-hard. J. Comput. Syst. Sci., 80(4), 2014. doi:10.1016/j.jcss.2013.11.002.
- [16] Nicola Cotumaccio, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza. Co-lexicographically ordering automata and regular languages - Part I. J. ACM, 70(4), 2023. doi:10.1145/3607471.
- [17] Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. Cambridge University Press, 2007.
- [18] Maxime Crochemore, Borivoj Melichar, and Zdenek Tronícek. Directed acyclic subsequence graph — overview. J. Discrete Algorithms, 1(3-4), 2003. doi:10.1016/S1570-8667(03)00029-7.
- [19] Joel D. Day, Pamela Fleischmann, Maria Kosche, Tore Koß, Florin Manea, and Stefan Siemer. The edit distance to k-subsequence universality. In STACS, 2021. doi:10.4230/LIPIcs.STACS.2021.25.
- [20] Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document spanners: A formal approach to information extraction. J. ACM, 62(2), 2015. doi:10.1145/2699442.
- [21] Szilárd Zsolt Fazekas, Tore Koß, Florin Manea, Robert Mercas, and Timo Specht. Subsequence matching and analysis problems for formal languages. In ISAAC, volume 322 of LIPIcs, 2024. doi:10.4230/LIPICS.ISAAC.2024.28.
- [22] Michael L. Fredman and Dan E. Willard. BLASTING through the information theoretic barrier with FUSION TREES. In STOC, 1990. doi:10.1145/100216.100217.
- [23] Dominik D. Freydenberger, Pawel Gawrychowski, Juhani Karhumäki, Florin Manea, and Wojciech Rytter. Testing k-binomial equivalence. In Multidisciplinary Creativity, a collection of papers dedicated to G. Păun 65th birthday, 2015. Available in CoRR abs/1509.00622.
- [24] Jeffrey E. F. Friedl. Mastering regular expressions - Understand your data and be more productive: for Perl, PHP, Java, .NET, Ruby, and more. O’Reilly, 2006.
- [25] André Frochaux and Sarah Kleest-Meißner. Puzzling over subsequence-query extensions: Disjunction and generalised gaps. In AMW, 2023. URL: https://ceur-ws.org/Vol-3409/paper3.pdf.
- [26] Moses Ganardi, Irmak Sağlam, and Georg Zetzsche. Directed regular and context-free languages. In STACS, 2024.
- [27] Pawel Gawrychowski, Maria Kosche, Tore Koß, Florin Manea, and Stefan Siemer. Efficiently testing Simon’s congruence. In STACS, volume 187 of LIPIcs, 2021. doi:10.4230/LIPICS.STACS.2021.34.
- [28] Nikos Giatrakos, Elias Alevizos, Alexander Artikis, Antonios Deligiannakis, and Minos N. Garofalakis. Complex event recognition in the big data era: A survey. VLDB J., 29(1), 2020. doi:10.1007/s00778-019-00557-w.
- [29] Jean Goubault-Larrecq, Simon Halfon, Prateek Karandikar, K. Narayan Kumar, and Philippe Schnoebelen. The ideal approach to computing closed subsets in well-quasi-orderings. In Well-Quasi Orders in Computation, Logic, Language and Reasoning: A Unifying Concept of Proof Theory, Automata Theory, Formal Languages and Descriptive Set Theory, 2020.
- [30] Étienne Grandjean and Louis Jachiet. Which arithmetic operations can be performed in constant time in the RAM model with addition?, 2023. URL: https://arxiv.org/abs/2206.13851.
- [31] Alejandro Grez, Cristian Riveros, Martín Ugarte, and Stijn Vansummeren. A formal framework for complex event recognition. ACM Trans. Database Syst., 46(4), 2021. doi:10.1145/3485463.
- [32] Simon Halfon, Philippe Schnoebelen, and Georg Zetzsche. Decidability, complexity, and expressiveness of first-order logic over the subword ordering. In LICS, 2017. doi:10.1109/LICS.2017.8005141.
- [33] Daniel S. Hirschberg. Algorithms for the longest common subsequence problem. J. ACM, 24(4), 1977. doi:10.1145/322033.322044.
- [34] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 2001.
- [35] James W. Hunt and Thomas G. Szymanski. A fast algorithm for computing longest subsequences. Communications of the ACM, 20(5), 1977. doi:10.1145/359581.359603.
- [36] Prateek Karandikar and Philippe Schnoebelen. The height of piecewise-testable languages and the complexity of the logic of subwords. Log. Methods Comput. Sci., 15(2), 2019. doi:10.23638/LMCS-15(2:6)2019.
- [37] S. C. Kleene. Representation of events in nerve nets and finite automata. In Automata Studies, volume 34 of Annals of Mathematics Studies, pages 3–41. Princeton University Press, 1956.
- [38] Sarah Kleest-Meißner, Rebecca Sattler, Markus L. Schmid, Nicole Schweikardt, and Matthias Weidlich. Discovering event queries from traces: Laying foundations for subsequence-queries with wildcards and gap-size constraints. In ICDT, 2022.
- [39] Sarah Kleest-Meißner, Rebecca Sattler, Markus L. Schmid, Nicole Schweikardt, and Matthias Weidlich. Discovering multi-dimensional subsequence queries from traces - from theory to practice. In Datenbanksysteme für Business, Technologie und Web (BTW 2023), 20. Fachtagung des GI-Fachbereichs „Datenbanken und Informationssysteme“ (DBIS), 2023. doi:10.18420/BTW2023-24.
- [40] Maria Kosche, Tore Koß, Florin Manea, and Stefan Siemer. Combinatorial algorithms for subsequence matching: A survey. In NCMA, volume 367 of EPTCS, 2022. doi:10.4204/EPTCS.367.2.
- [41] Sailesh Kumar, Balakrishnan Chandrasekaran, Jonathan S. Turner, and George Varghese. Curing regular expressions matching algorithms from insomnia, amnesia, and acalculia. In ANCS, 2007. doi:10.1145/1323548.1323574.
- [42] Dietrich Kuske. The subtrace order and counting first-order logic. In CSR, volume 12159 of LNCS, 2020. doi:10.1007/978-3-030-50026-9_21.
- [43] Dietrich Kuske and Georg Zetzsche. Languages ordered by the subword order. In FOSSACS, volume 11425 of LNCS, 2019. doi:10.1007/978-3-030-17127-8_20.
- [44] Alex X. Liu and Eric Norige. A de-compositional approach to regular expression matching for network security. IEEE/ACM Trans. Netw., 27(6), 2019. doi:10.1109/TNET.2019.2941920.
- [45] David Maier. The complexity of some problems on subsequences and supersequences. J. ACM, 25(2), 1978. doi:10.1145/322063.322075.
- [46] Alexandru Mateescu, Arto Salomaa, and Sheng Yu. Subword histories and Parikh matrices. J. Comput. Syst. Sci., 68(1):1–21, 2004. doi:10.1016/J.JCSS.2003.04.001.
- [47] Alexander Okhotin. On the state complexity of scattered substrings and superstrings. Fundamenta Informaticae, 99(3), 2010. doi:10.3233/FI-2010-252.
- [48] M. Praveen, Philippe Schnoebelen, Julien Veron, and Isa Vialard. On the piecewise complexity of words and periodic words. In SOFSEM, 2024. doi:10.1007/978-3-031-52113-3_32.
- [49] Steven Purtzel and Matthias Weidlich. SuSe: Summary selection for regular expression subsequence aggregation over streams. In SIGMOD, 2025. To appear.
- [50] William E. Riddle. An approach to software system modelling and analysis. Comput. Lang., 4(1), 1979. doi:10.1016/0096-0551(79)90009-2.
- [51] Michel Rigo and Pavel Salimov. Another generalization of abelian equivalence: Binomial complexity of infinite words. Theor. Comput. Sci., 601, 2015. doi:10.1016/J.TCS.2015.07.025.
- [52] Arto Salomaa. Connections between subwords and certain matrix mappings. Theoret. Comput. Sci., 340(2):188–203, 2005. doi:10.1016/J.TCS.2005.03.024.
- [53] Philippe Schnoebelen and Julien Veron. On arch factorization and subword universality for words and compressed words. In WORDS, 2023. doi:10.1007/978-3-031-33180-0_21.
- [54] Shinnosuke Seki. Absoluteness of subword inequality is undecidable. Theor. Comput. Sci., 418, 2012. doi:10.1016/J.TCS.2011.10.017.
- [55] Alan C. Shaw. Software descriptions with flow expressions. IEEE Trans. Software Eng., 4(3), 1978. doi:10.1109/TSE.1978.231501.
- [56] Imre Simon. Hierarchies of events with dot-depth one. PhD thesis, University of Waterloo, 1972.
- [57] Imre Simon. Piecewise testable events. In Automata Theory and Formal Languages, 2nd GI Conference, volume 33 of LNCS, 1975. doi:10.1007/3-540-07407-4_23.
- [58] Imre Simon. Words distinguished by their subwords (Extended abstract). In WORDS, volume 27 of TUCS General Publication, 2003.
- [59] Peter Snyder and Chris Kanich. No please, after you: Detecting fraud in affiliate marketing networks. In WEIS, 2015. URL: http://www.econinfosec.org/archive/weis2015/papers/WEIS_2015_snyder.pdf.
- [60] K. Thompson. Programming techniques: Regular expression search algorithm. Communications of the ACM, 11(6), 1968. doi:10.1145/363347.363387.
- [61] Virginia Vassilevska Williams and R. Ryan Williams. Subcubic equivalences between path, matrix, and triangle problems. J. ACM, 65(5), 2018. doi:10.1145/3186893.
- [62] Georg Zetzsche. The complexity of downward closure comparisons. In ICALP, volume 55 of LIPIcs, 2016. doi:10.4230/LIPICS.ICALP.2016.123.
- [63] Haopeng Zhang, Yanlei Diao, and Neil Immerman. On complexity and optimization of expensive queries in complex event processing. In SIGMOD, 2014. doi:10.1145/2588555.2593671.
