Finiteness of Symbolic Derivatives in Lean
Abstract
Brzozowski proved that the set of derivatives of any regular expression is finite modulo associativity, idempotence and, notably, commutativity of the union operator. We extend this result to the case of symbolic location based derivatives, for which we prove finiteness of the state space by quotienting only by associativity, deduplication and idempotence (ADI); the fact that we don’t use commutativity allows for this result to carry over to the derivative based backtracking (PCRE) match semantics, where the union operator is noncommutative. Furthermore, we consider regular expressions extended with lookarounds, intersection, and negation. We also show that our method for proving finiteness allows us to include certain simplification rules in the derivative operation while preserving finiteness. The finiteness proof is constructive: given an expression , we construct a finite set that is an overapproximation (modulo ADI) of the set of derivatives of . We reuse some of the infrastructure provided in previous formalization efforts for regular expressions in Lean 4, showing the flexibility and reusability of the framework.
Keywords and phrases:
Lean, regular languages, lookarounds, derivatives, finitenessCopyright and License:
2012 ACM Subject Classification:
Theory of computation Formal languages and automata theorySupplementary Material:
Software (Source Code): https://github.com/ezhuchko/finiteness-derivativesarchived at
swh:1:dir:c58d9c49e42533ffa3ea99405b0cc991999096ae
Funding:
E. Zhuchko and H. Maarand were supported by the Estonian Research Council grant PRG1210 (Automata in Learning, Interaction and Concurrency).Editors:
Yannick Forster and Chantal KellerSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Regular expression derivatives, first introduced by Brzozowski [6], provide a powerful technique to lazily convert regular expressions into finite state automata. A finite automaton, by definition, has a finite number of states. Thus, the set of iterated derivatives of an expression (which correspond to states) must be finite. Brzozowski showed that, under a suitable quotienting relation, the number of derivatives of any expression is always finite. Remarkably, to obtain finiteness it is sufficient to quotient by associativity, commutativity and idempotence, or ACI, of union (alternation). In this paper, we extend this framework to a broader class of regular expressions that includes lookarounds, and where union is not necessarily commutative. This noncommutativity reflects the behavior of many modern regex engines – including those in Python, Java, and JavaScript – which use a backtracking matching strategy. In such engines, alternation is evaluated sequentially, meaning that the order affects match selection.
Moreover, we consider symbolic regular expressions. Many practical applications need large or even infinite alphabets where predicates of an effective Boolean algebra (EBA), rather than individual characters, are used in regular expressions. A typical example is the predicate or character class \w which denotes the set of all Unicode word-letters. Furthermore, the alphabet algebra can even be undecidable or semidecidable, where one cannot reliably compute satisfiable Boolean combinations of predicates, so-called minterms [10] – implying that there is no guaranteed method for computing a finite alphabet, in which case the problem does not straightforwardly reduce to the classical case.
The main result of this paper, the proof of finiteness, is needed to formalize termination proofs of various algorithms that incrementally unfold symbolic extended regular expressions into symbolic automata, such as: formalizing and generalizing [24, Theorem 7.1] to the case of semidecidable alphabet theories; formalizing correctness of MatchEnd in [19, Theorem 5.3] where alternation is noncommutative; formalizing termination of LLMatch [29, Theorem 4]; and formalizing -regularity modulo theories of RLTL [30, Theorem 6] that uses symbolic derivatives of extended regular expressions modulo alphabet algebras that are not required to be decidable.
Contribution
We develop a formalized proof in Lean111See [31] for the complete self-contained Lean formalization. of finiteness of the set of all iterated derivatives in the class of regular expressions that allow intersection, complement and lookarounds, modulo any effective Boolean algebra (EBA) over characters. A key insight of the proof is that in the equivalence relation of regular expressions we can replace commutativity of alternation with deduplication, which maintains the order of alternations as required in the backtracking semantics.
We first lift the technique of transition regexes from [24] and formalize it in Lean through symbolic location derivatives where the symbolic location derivative of an expression is a transition tree. This way we abstract away from concrete locations and can take advantage of the symbolic representation as transition trees. In this context, the set of iterated (concrete) derivatives of is a subset of the set of leaves of (the transition trees of) all iterated symbolic location derivatives of . This is because transition trees, due to their symbolic representation, may contain unreachable leaves. Our approach to proving finiteness is to first define an explicit procedure that, given a regular expression , constructs a (finite) list of regular expressions. We then show that any leaf, reachable or unreachable, of an iterated symbolic location derivative of is contained in this list (modulo certain equations). We also show how this overapproximation can be used to include certain simplifications in the derivative operation while preserving finiteness.
We use the formalization of the core theory of location derivatives of in [32] to show that symbolic location derivatives indeed preserve the match semantics. In doing so, we work with alternation up to associativity, deduplication and idempotence (ADI) and make use of several standard Lean libraries. Although our ultimate goal is practical efficiency, the finiteness proof presented in this paper prioritizes simplicity and ease of formalization while keeping the set of quotienting operations (relatively) minimal.
Synopsis
The remainder of the paper is structured as follows. Section 2 discusses related work. Section 3 provides the preliminaries used in the rest of the paper, including the key results and definitions needed from [32]. Section 4 introduces the Lean theory of symbolic location derivatives that builds on transition regexes from [24]. Section 5 develops the Lean formalization of the finiteness result including the main theorem, finiteness. Finally, Section 6 discusses future work.
2 Related Work
Location based derivatives were developed and implemented as a new backend in the .NET platform [19]. The finiteness of the state space produced by the rewrite rules in [19] is not directly implied by Brzozowski’s original proof, since in this setting the union operator is noncommutative. This paper fills the gap by considering deduplication, which is present in [19] with the key rule AltUni. To remove duplicates, the AltUni rule uses deduplication and the fact that the union is unital with respect to the element. Location based derivatives were further extended in [29] to support intersection and complement operations, while shifting the match semantics from PCRE to POSIX, where commutativity is permissible. The correctness of the matching algorithm was formally verified in Lean in [32]. The finiteness result in this paper applies directly to both works, as our proof – though based on deduplication – naturally extends to settings where union is commutative.
Symbolic Boolean derivatives and their underlying representation of transition regexes were first introduced in [24], and they generalize the definitions of derivatives in [6] and [2]. The formalism of symbolic derivatives enables more efficient decision procedures for solving extended regular expression constraints. This idea was later generalized to the framework of transition terms in [30].
There are several formally-verified decision procedures based on regular expression derivatives. Coquand and Siles [9] implement a decision procedure for regular language equivalence based on Brzozowski derivatives, and highlight how the key technical difficulty in the formalization work is proving the finiteness of the state space using Brzozowski’s original proof of finiteness, which is done using Bar induction. Similarly, Traytel and Nipkow [20] present a framework for regular expression equivalence which they instantiate with several procedures including Brzozowski derivatives; the authors prove finiteness of the standard fragment of regular expressions and formalize their results in the Isabelle/HOL proof assistant. Given a regular expression , both of the above mentioned works produce a finite list of normalized regular expressions which is closed under derivation. Our work similarly computes an explicit overapproximation but without relying on any normalization procedure.
In terms of relations used, authors of [9] prove finiteness up to language equivalence and remark that it seems possible to prove finiteness up to ACI equivalence only with the extra condition that the derivative function respects ACI equivalence of regular expressions. Authors of [20] also refer to this, and show that the derivative function respects equivalence. Interestingly, in our work the derivative function actually violates ACI preservation (see Section 4.3), but it does not affect the finiteness proof. As noted in [9], the finiteness proof does not depend on the alphabet being finite: our work similarly highlights this since our underlying symbolic representation of transition regexes does not rely on concrete characters and supports infinite alphabets. Moreover, our work highlights that the set of predicates from the underlying EBA may also be infinite.
Several formalizations of Antimirov’s partial derivatives have been developed, including those by [18], [14] and [20]. Antimirov’s definition simplifies the finiteness proof, as discussed in Section 5.3. However, extending it to support extended operators is nontrivial [7]. In contrast, the Brzozowski derivative approach naturally handles extended operators.
Derivatives have also been used to describe match search algorithms for lexing [25, 26] that have subsequently been improved upon and formalized in Isabelle/HOL by [3]. The match semantics in this setting is POSIX [21] style. The work was recently expanded by [28] to allow bounded loops as well as character-sets; the latter extension is captured in our work by the use of effective Boolean algebras , which allow us to formalize regular expressions modulo predicates of . Recently, [27] presented a functional recursive variant of the [25] algorithm, also formalized in Isabelle/HOL. A recent derivative based lexing algorithm [11, 12] has been formalized in Coq, which differs from [28] in the way POSIX tokens are processed, as well as how simplifications are applied to derivatives.
Miyazaki and Minamide [17] introduced derivatives for regular expressions with lookaheads. Similarly, Berglund et al. [5] proposed a matching construction using automata with -transitions to support lookaheads. However, neither approach trivially extends to lookbehinds due to the inherent asymmetry between forward and backward matching. More recently, several works [15, 4] have proposed linear-time matching algorithms that support both lookaheads and lookbehinds through the use of oracle NFAs.
3 Preliminaries
Words, locations and spans
Let be an alphabet type, may denote an infinite set. A word or string over is represented by a list of elements of type , that is denoted by and represented in Lean as List σ. We let or denote the empty word. Concatenation of is formally denoted by . We also write for prepending to (the cons operation on lists). Informally, we write the juxtaposition and in both cases.
A location describes a position in a string and is represented as a pair of words in . Let , a location in is a pair , analogous to a list zipper [13]. A location is initial and a location is final. For a non-final location , we write to refer to the current symbol . Let define the next location from any given nonfinal location and let .
A span is defined as , i.e., a span is a triple of words that we denote by with and where represents the matching section of the string, and and represent the left and right context, respectively. A span can also be viewed as two locations and , each pointing to the beginning and end of the matching section in a string, we refer to this as .
Effective Boolean algebras
An Effective Boolean algebra (EBA) over an element universe [10] is a tuple where is a set of predicates that is closed under the Boolean connectives and contains and . For and the models relation obeys classical Tarski laws. Let . Then and . If then is satisfiable. We require all the connectives to be computable and to be decidable. In Lean, we model EBAs as a type class of the type where is represented by denote . Here we refer to also as the EBA and say that is decidable if satisfiability in is decidable.
An alternative perspective is that an EBA consists of two Boolean algebras and a homomorphism between them. There is a Boolean algebra of syntax, corresponding to in the definition above, for representing predicates (e.g., in regular expressions) and a Boolean algebra of subsets of . The homomorphism interprets each predicate as a subset of (where denotes the set of Booleans).
Regular expressions with lookarounds
Here we formally define extended regular expressions with lookarounds modulo an EBA . The class of regular expressions, or regexes, is here given by the following abstract grammar
where and .
The operators from alternation (or union) to complement ˜ appear in the order of precedence with complement having the highest precedence. There are four lookaround expressions: is lookahead, is lookbehind, is negative lookahead, and is negative lookbehind, where is the body of the lookaround. We refer to these collectively as . The remaining operators are the empty-word regex , intersection , concatenation , and Kleene star *. Concatenation is often implicit by using juxtaposition when this is unambiguous. The regex matching nothing is the predicate in .
The lookaround height of , , is defined as the nesting level of lookarounds in . If does not contain any lookarounds then , otherwise where is the body of the lookaround in with the largest height.
4 Symbolic location derivatives
The intuition behind using locations and spans is that matching with lookarounds is context dependent. Classically, nullability of a regex is defined as a static property that holds when the language of matches the empty word. However, in the presence of lookarounds, nullability is a dynamic property that depends on the location in the string. We will make use of the key definitions of and from [32] where is a location. These definitions are used without modification in our setting. In particular, plays a central role in defining the semantics of transition regexes.
In this section we formally define the framework of symbolic location based derivatives. The definitions build on transition regexes from [24] and use them for the class of regexes with lookarounds and location based derivatives given in [29]. We highlight two key differences from [24]: the first is that we use nullability of regexes (with lookaheads) as conditions in , rather than membership in predicates. The second is that we lift all operations on to and avoid explicit lifting rules as in [24, Section 4.1]; the latter choice makes sense in the SMT context where ITE expressions exist natively.
4.1 Transition regexes revisited
Transition regexes are the underlying representation of symbolic derivatives. The abstract grammar of is as follows, for :
A transition regex is a binary classification tree: it is either a leaf or an ITE term with condition , then-case and else-case . The key idea is that the regular expression in the condition acts as a symbolic predicate: the ITE evaluates to if is nullable at the current location, and to otherwise.
All binary operations and unary operations are lifted to as follows, let and :
The symbolic location derivative of is defined as a transition regex in which, intuitively, represents a partial evaluation of with respect to whose nullability tests have been postponed. Let , :
where , , and ˜ are lifted to . We simplify .
4.2 Semantics of transition regexes
A transition regex encodes a function from locations (in the input string) to regular expressions. That is, given and , we define as the evaluation of at . Let and :
While the evaluation function of transition regexes is well-defined for all locations, transition regexes are used to represent functions from nonfinal locations to . For example, the derivative of a predicate is the transition regex using the lookahead of as its condition. Thus, for all nonfinal locations ,
because, using that is nonfinal,
The following lifting lemma of transition regexes is used frequently in many contexts where and are lifted binary and unary operations.
Lemma 1.
-
1.
-
2.
The above lemma justifies certain useful transformations. For example, we can propagate complement into the leaves of a transition regex without affecting its conditions. This is the justification for the last step in the following equations
where , and . In this case the resulting transition terms can directly be viewed as the Symbolic Finite Automaton [10] given in Figure 1 where the states and are always nullable, hence unconditionally accepting, and is never nullable, hence unconditionally non-accepting.
The regex is, without using ˜, equivalent to the regex that is also clear from Figure 1. Propagation of ˜ into leaves is a superpower of transition regexes that enables lazy propagation of complement as in [24] by avoiding eager powerset constructions for determinization.
Finally, we prove that the symbolic derivative (evaluated at a location ) is the same as the classical derivative:
Theorem 1.
4.3 Iterated derivatives
Recall that the symbolic location derivative is a transition regex (i.e., a tree). Note that Theorem 1 implies that it is enough in the finiteness proof to care about the leaves of as an overapproximation of (i.e., the set of all states reachable in one derivative step from ). To reason about iterated applications of the derivative, we repeatedly collect the leaves from the transition regex tree and apply to each of them. This process captures all states reachable through successive derivations and forms the basis of our finiteness argument.
The following function collects all of the leaves of a transition regex:
We lift unary operations and binary operations on regexes to lists of regexes .
Lifting satisfies the following properties for any :
The unary case is straightforward in Lean, but lifting binary operations requires the helper function productWith, which computes the Cartesian product of two lists and applies a given binary function to each pair. This function differs from zipWith which instead combines corresponding elements pointwise.
Next, we define a function that computes an overapproximation of the (immediate) derivatives of a regular expression as . The following lemmas describe how the step function is well-behaved with respect to the extended regular expression operations.
Lemma 2.
Let , then:
To reason about the set of all derivatives of an expression we need a way to iterate the symbolic location derivative. We define the function steps that systematically explores all possible derivatives of the initial regular expression by using a symbolic approach which at each step expands the current leaves.
Thus steps r n is an overapproximation of derivatives of r wrt. words of length n. Here |> is (reverse) function application and flatten merges nested lists.
Recall that the symbolic location derivative δ can introduce unreachable leaves when constructing the ITE trees. The consequence of these unreachable leaves is that the step function does not respect idempotence.
The following is a concrete example where the idempotence property is violated. Consider the predicate representing the character , denoted by
Taking a single step with a and a ⋓ a produces the following lists of expressions:
Observe that the expressions a and a ⋓ a are similar but the lists of expressions that we get are not. In Figure 2, which is the transition regex δ (a ⋓ a), one can clearly see that evaluating it with a concrete location can never reach the leaves ε ⋓ ⊥ and ⊥ ⋓ ε as the nullability conditions on those paths are contradictory. Consider the path from the leaf ε ⋓ ⊥ to the root: it contains one dashed line which represents the false branch of the nullability test and also one solid line which represents the true branch of the nullability test of the same expression so we reach a contradiction. We see that the unreachable leaves of symbolic location derivatives may be expressions that are not reachable even up to similarity. In a similar manner, the deduplication property is also violated.
For the finiteness proof it is sufficient to consider all possible states without actually pruning unreachable ones.
5 Finiteness of Derivation State Space
The main result of this paper is that the set of all possible derivatives up to similarity is finite. Intuitively, we would like to reason about the following set: for some notion of similarity .
As an intermediate result, we associate with any regular expression a finite set of expressions pieces(R). We show that itself can be represented as the sum of a subset of pieces(R). Moreover, we show that any iterated derivative of can be represented in the same way using pieces of . For example, the pieces of the expression are . The Brzozowski derivative of with respect to is the expression , which can be represented as the sum of a subset of the pieces given above.
5.1 Brzozowski’s proof
Let us first review the main idea of Brzozowski’s proof of finiteness [6]. As main finiteness statement [6, Theorem 5.2], Brzozowski proves that any regular expression has a finite number of dissimilar derivatives. The notion of similarity refers to the associativity, commutativity and idempotence of the union operator. Interestingly, the original proof by Brzozowski does not work without the additional equations and , as first pointed out by Salomaa [23]. Both Salomaa [23] and Aanderaa [1] relied on Brzozowski’s proof of finiteness to establish the completeness of their systems. Later, Antimirov [2] proposed a modification to the original definition of derivatives, ensuring the finiteness result by Brzozowski holds just with the three original equations. This modified approach was subsequently widely adopted as the canonical definition [9, 18, 22].
Brzozowski first proves that every expression has a finite number of “types” of derivatives, where types are equivalence classes of expressions under language equivalence [6, Theorem 4.3(a)]. The proof is by induction and relies on arguments about the shapes of the resulting derivatives. He also gives a method [6, Theorem 4.3(b)] to construct the set of all types of derivatives of by lexicographically enumerating words and iteratively computing their derivatives until a fixpoint is reached under language equivalence. Brzozowski then proves that the number of dissimilar derivatives is also finite [6, Theorem 5.2] by showing that the process of constructing the set of all dissimilar derivatives must terminate. This argument relies on a careful analysis of the proof of [6, Theorem 4.3(a)].
As a comparison, we first note that Brzozowski’s proof technique relies on the finiteness of the underlying alphabet, our strategy does not. The other key difference is that Brzozowski proves the existence of a bound on the number of derivatives whereas our proof simply constructs an overapproximation of the set of derivatives. However, the underlying analysis when constructing the set of pieces of is somewhat similar to the arguments in Brzozowski’s proof of finiteness.
5.2 Similarity
The set of iterated derivatives is clearly not finite in general; to achieve finiteness, we define a relation which we call similarity, which contains the following cases for :
| (associativity) | ||||
| (right deduplication) | ||||
| (idempotence) |
In the Lean formalization, we define the similarity relation Sim as an inductive predicate that is almost a congruence relation; we do not require a congruence rule for lookarounds and Kleene star, as derivatives do not occur under lookarounds or star. Moreover, the congruence rule for concatenation keeps the second argument fixed, i.e., if then .
Both deduplication and idempotence are used to ensure duplicate-free expressions. We conjecture that the finiteness proof would still hold without idempotence, at the cost of a larger state space. Instead of assuming commutativity, we adopt a right-biased deduplication rule: when eliminating duplicates in expressions like , we retain the leftmost occurrence of . This choice aligns with PCRE-style greedy semantics [19], where the first matching alternative has priority. Removing the earlier occurrence would break this behavior, hence the asymmetry. For the same reason, we cannot allow commutativity in the equivalence relation. Ignoring order – e.g., by treating alternation as a set operation – would further reduce the state space, but our finiteness result would still hold. While additional quotienting rules could coarsen the equivalence relation, they are not necessary for proving finiteness.
Since we work modulo an equivalence relation, it is natural for the theorems about regular expressions in this setting not to hold strictly, but up to equivalence under similarity. We therefore begin by setting up essential infrastructure for reasoning about lists of regular expressions modulo an equivalence relation. We define weaker “up to” definitions of membership of an element inside a list, inclusion, and equality (of lists as sets, i.e., mutual inclusions) w.r.t. any given relation R:
-
The notation x ∈[ R ] ys expresses that there is an element in ys which is related to x by R;
-
The notation xs ⊆[ R ] ys expresses that every element of xs is an element of ys up to equivalence under R (i.e., for all x ∈ xs, x ∈[ R ] ys);
-
Finally, xs =[ R ] ys expresses that xs and ys are equal up to equivalence under R.
5.3 Pieces of regular expressions
In this section we define the function pieces which, given a regular expression , computes a list of regular expressions. The expressions in this list can then be combined together with the union operator to reconstruct any derivative of , as discussed in the beginning of Section 5.
To illustrate the underlying idea, we begin with the so-called partial derivatives which were introduced by Antimirov [2]. The main difference with Brzozowski derivatives is that, given an input regular expression and a character, a Brzozowski derivative is simply a single regular expression (naturally leading to a DFA construction) while there is a set of Antimirov derivatives (naturally leading to an NFA construction). An advantage of Antimirov’s approach is that the finiteness of partial derivatives is significantly easier to prove than the finiteness of Brzozowski derivatives.
Interestingly, Mirkin [16] and Antimirov [2] independently proposed a similar method for constructing an NFA from a regular expression. Later, Champarnaud and Ziadi [8] proved that these constructions are equivalent, unifying the idea and presenting the notion of a support of a regular expression. It is defined as the following function where , for some alphabet and denotes the empty language:
In the definition above, the operation is simply concatenation lifted to set of regular expressions on the left and a single expression on the right.
It can be shown that all iterated Antimirov derivatives of are contained in the following set: . It is easy to see that Antimirov derivatives are not of the form . In order to show finiteness of the set of Antimirov derivatives of , it is sufficient to show that the is finite. Note that it is not necessary to consider any similarity relation for this, as the set representation inherently accounts for ACI. The proof is by induction on .
Our proof of finiteness follows a similar strategy, where we approximate the set of all iterated derivatives in a similar fashion to the support function above. Note that Brzozowski derivatives (and their finiteness) are not directly related to Antimirov derivatives, because additional equations would be needed to relate the two. In particular, one needs to include distributivity of concatenation over union in the similarity relation. For example, the Brzozowski derivative of an expression of the form leads to expressions of the form (assuming that or is nullable), where , and represent the derivatives of , and respectively. For Antimirov derivatives, this leads to a set of expressions of the form , or (again assuming that or is nullable).
Since we are working with Brzozowski-style derivatives, which do not compute sets of expressions, we can take the definition of support only as an inspiration. We need to tailor the definition of pieces to our setting which has extended regular expressions, deterministic derivatives and similarity consisting of associativity, deduplication and idempotence.
5.3.1 Definitions
We start by defining the function toSum which creates a union from a list of regular expressions. This function is used as a building block to combine the pieces of an expression together.
Note that in our setting, we never use toSum on empty lists, and we add an extra case for the singleton list which differentiates
the one-element case from the empty list case: this is a non-trivial optimization, since we want
to keep the number of equivalence laws for regular expressions to a minimum, and unnecessarily introducing ⊥ would force us to assume that the union of
regular expressions is unital (on at least one side) with respect to ⊥.
We define the neSubsets function which computes all permutations of non-empty sublists:
The final ingredient necessary to define pieces is the following composition:
We split the definitions of neSubsets and toSumSubsets in order to modularly prove correctness and completeness of the first function, which does not fundamentally rely on properties of regular expressions. We introduce the notation ⊕ to represent toSumSubsets. The function ⊕xs computes all permutations of non-empty sublists of xs, where permutations are represented as sums of expressions. For example, . The need to consider all permutations of a list (of regular expressions) arises from the fact that we eliminate commutativity of union as part of the similarity relation. Otherwise, we could simply take all sublists without considering permutations e.g., . Crucially, we prove that the toSumSubsets function preserves the property of inclusion up-to similarity:
The type class DecidableEq is required in some proofs involving the ⊕ operator as it can lead to simpler proofs in certain cases.
A characterizing property for the above function is the fact that any element of
toSumSubsets can be represented as a non-empty subset of the original list.
We now have all the ingredients to define the pieces function:
It turns out that for any expression r, pieces r contains the parts necessary to represent any iterated derivative of r. We highlight the case for concatenation; for example, the derivative of (e ⋓ f) ☀ g takes one of two forms: (e’ ⋓ f’) ☀ g, when e or f is nullable, or g’ (which corresponds to an element in pieces g). We will prove later that an expression like e’ ⋓ f’ can be built from the pieces of e ⋓ f, by which we mean that e’ ⋓ f’ is equivalent to an element in ⊕(pieces(e ⋓ f)). Consequently, this ensures that (e’ ⋓ f’) ☀ g can be built from the pieces of (e ⋓ f) ☀ g.
As with the support function defined earlier, the union operator does not appear as the top-most operator in any individual pieces:
Using the definition of pieces, we can finally define our overapproximation of the derivatives of a regular expression r: ⊕(pieces r). This list contains all permutations of all non-empty subsets of pieces of r (where each subset is joined together by toSum). Next, we describe properties of pieces and ⊕ that are used for proving finiteness.
The pieces function satisfies the following reflexivity-like property:
The following theorem establishes that similar expressions have the same pieces up-to similarity:
The pieces function also satisfies the following transitivity-like property, in the sense that all pieces of a piece of r are already contained in pieces r (up-to similarity):
To prove this form of transitivity, we make use of the properties toSumSubsets_monotone and toSumSubset_to_neSubset.
So far we have only considered pieces (i.e., elements of pieces r) but the set of all iterated derivatives of r is approximated by ⊕(pieces r). The reflexivity and transitivity properties also extend to ⊕(pieces r):
5.4 Finiteness
From the previous section, we know that for every expression r there is a finite set ⊕(pieces r). Now we will show that every derivative of r is represented by an element in this set.
We first prove that every one-step derivative of f (given by the step function) is similar to a sum of pieces of f. The following theorem captures precisely this property:
It follows that the set of one-step derivatives is contained in the overapproximation. This fact is captured by the following theorem:
To show that all derivatives are contained in the overapproximation ⊕(pieces r), it suffices to prove that any n-step derivative is in the overapproximation (up-to similarity).
The proof proceeds by induction on the number of derivative steps n. The base case follows from toSumSubsets_pieces_refl. For the inductive step, let e be any (n - 1)-step derivative. By the inductive hypothesis, we have that e is in the overapproximation of r, ⊕(pieces r). Then, we have that any single-step derivative f (of e) is in the overapproximation of e, ⊕(pieces e), by lemma step_to_toSumSubsets_pieces. The theorem is concluded by applying toSumSubsets_pieces_trans to show that f is contained in the overapproximation of r.
The main theorem asserts the existence of a list of regular expressions that contains the set of all iterated derivatives of a given regular expression r up-to similarity. We choose ⊕(pieces r) as the overapproximation.
In our formalization, the state space captured by the steps function only accounts for the parts of the expression at lookaround depth (i.e., not under a lookaround). This is because the derivative of a lookaround is always and the actual complexity of lookarounds lies in the nullability check (which is only done at matching time). A transition regex encodes all possible transitions from with respect to any location. For a location , the concrete transition is into and the computation of this target state may involve nullability checks. If the nullability check is for a subexpression of that is a lookaround, then this spawns an automaton for the lookaround body at location .
Let us now informally consider the finiteness of the global state space for , which includes the states induced by nullability checks of lookarounds. Finiteness follows by induction on lookaround height. For the base case , i.e., when contains no lookarounds, the global state space is the set of derivatives that we approximate with ⊕(pieces r) (which is finite). For the inductive case , we assume that the global state space is finite at lookaround height . The set of top-level derivatives of is finite by the finiteness theorem. If a top-level state is an expression that has a lookaround as a subexpression, then computing its successor may spawn an automaton for the body of that lookaround (which has lookaround height at most ). By inductive hypothesis, the automaton corresponding to an expression with lookaround height has a finite global state space.
5.5 pieces as a closure operator
An alternative perspective on pieces is to view it as a closure operator when lifted to lists (sets) of expressions. This lifted function, piecesS, satisfies the standard properties of a closure operator: it is extensive, monotonic, and idempotent. In Lean, we define it as follows, where ∘ denotes function composition:
In order to prove finiteness in this setting, we show that
This proceeds by induction on the number of steps n. The base case follows from extensivity, since steps e 0 is just the singleton [e]. For the inductive case, we assume that
From the theorem step_to_pieces, it follows that
By applying monotonicity and idempotence on the inductive hypothesis, we get that
Finally, by transitivity we can conclude that
In the reasoning above, exactly the three properties of piecesS being a closure operator are needed. We can now prove the main finiteness theorem with piecesS [e] as the overapproximation since piecesS [e] contains all derivatives of e.
5.6 On the computational complexity of the approximation
We briefly comment on the complexity aspects of our approximation for the derivatives of a regular expression. First of all, in no way is pieces r meant to be a (reasonably) precise approximation of the set of derivatives of r. It is just an upper bound of this set. The main feature is that pieces is a total function and thus pieces r is finite for any r.
Our definition for the overapproximation focuses on being easy to conceptualize, and is defined in terms of the composition of simpler components in order to make it more amenable to formal verification. In particular, the overapproximation is given in terms of the ⊕ function, which computes all non-empty subsets of permutations of a list, and the pieces function to actually traverse the regular expression. It can be shown that the computational complexity of ⊕ is already in the size of the input list. The worst case of the pieces function materializes in the case for intersection, where one takes Cartesian products of the approximations of the two subexpressions.
5.7 Simplifications
We have shown that, for every n, every n-step derivative r’ of r is made of pieces of r. By this we mean that every element r’ of steps n r is ADI-equivalent to an element of ⊕(pieces r) (recall that ⊕(pieces r) contains every permutation of every subset of pieces r as a sum). Finiteness of the set of derivatives of r (modulo ADI) follows from the fact that the set pieces r is finite. We now describe how to integrate simplifications into the derivative computation without sacrificing finiteness. The key idea is to ensure that simplifications do not introduce any new pieces.
Our first observation is that ⊕ is monotone: if xs ⊆ ys, then ⊕xs ⊆ ⊕ys. More specifically, if an expression p’ is ADI-equivalent to an element of ⊕(pieces p), then it is also ADI-equivalent to an element of ⊕(pieces r) provided that pieces p ⊆ pieces r. This is the condition that determines allowed simplifications: r can be simplified to p if p does not introduce any new pieces (which is necessary for preserving finiteness). For us, a simplification is a function RE α → RE α that is applied after every derivative step.
We incorporate such a simplification f by defining a modified step function which applies f to every element in the result of the original step.
We define allowed simplifications f to be those that satisfy the following predicate.
Finally, we prove the following property which suffices to keep the rest of the finiteness proof almost the same as it was with just the function step. Every one-step derivative of r, after simplifications with a suitable f, is ADI-equivalent to an element of ⊕(pieces r).
Here we write r ⇝ s to say that r simplifies to s. For example, it is easy to see from the definition of pieces that the simplification rule r ⋓ s ⇝ r is nonincreasing since pieces r ⊆ pieces (r ⋓ s). It is not a valid simplification rule in the sense that it does not preserve the language. However, we can restrict ourselves to the cases where it does: r ⋓ ⊥ ⇝ r and r ⋓ ˜⊥ ⇝ ˜⊥. It is also easy to see that associativity, deduplication and idempotence can also be made into allowed simplification rules.
The simplification rule r ☀ s ⇝ s is also nonincreasing and thus is allowed. Again, this is not a valid rule but we can restrict to the cases where it is: ε ☀ s ⇝ s and r ☀ ⊥ ⇝ ⊥. Note, however, that we are not allowed to simplify r ☀ s to r as that would not be nonincreasing. The following simplification rules are allowed.
We can also compose any sequence of nonincreasing simplifications: the identity function is nonincreasing; if and are nonincreasing, then so is . Also, nonincreasing simplifications satisfy certain congruences. If f is nonincreasing, then so are the following rules:
Note how the last rule keeps s fixed.
Furthermore, as the allowed simplifications are determined by pieces, we could allow more simplifications by extending the set of pieces of an expression. For example, if every expression had ⊥ as a piece, then ⊥ ☀ s ⇝ ⊥ would be allowed (or, we could ensure that pieces r is a subset of pieces (r ☀ s) which would also allow r ☀ ε ⇝ r). Similarly, more simplifications would be allowed if pieces r and pieces s were subsets of pieces (r ⋒ s). We suspect it would be relatively easy to extend pieces in this way so that the following simplifications are also allowed:
6 Future work
A follow-up to the problem of proving finiteness of the state space is to formally show decidability of nonemptiness of modulo , i.e., decidability of . Observe that, for all , , so decidability of is a prerequisite in this case, while it is not needed in the formalization of finiteness of the state space. A related question is to formally study the size complexity of the state space with respect to the size of under various restrictions, such as disallowing intersection and/or complement. The finiteness of the state space is also fundamental in being able to formally reason about complexity results of matching , either for PCRE semantics, where union is noncommutative, or POSIX semantics.
We have described how the function pieces determines certain simplification rules (on regular expressions) that preserve finiteness of the set of derivatives. A question for further investigations is how far can this be pushed in the sense that what simplifications can be allowed and which cannot by varying the definition of pieces. A further distinction could be made by considering PCRE semantics instead where is not equivalent to .
Finally, it would be interesting to see if it is possible to define a one-way derivative operation (that processes the input from left to right) for the class of expressions considered in this paper.
References
- [1] Stål Aanderaa. On the algebra of regular expressions. Technical Report, Applied Mathematics, Harvard University, 1965.
- [2] Valentin Antimirov. Partial derivatives of regular expressions and finite automata constructions. Theoretical Computer Science, 155:291–319, 1996. doi:10.1007/3-540-59042-0_96.
- [3] Fahad Ausaf, Roy Dyckhoff, and Christian Urban. POSIX lexing with derivatives of regular expressions. In Archive of Formal Proofs, 2016. URL: http://www.isa-afp.org/entries/Posix-Lexing.shtml.
- [4] Aurèle Barrière and Clément Pit-Claudel. Linear matching of javascript regular expressions. Proc. ACM Program. Lang., 8(PLDI), June 2024. doi:10.1145/3656431.
- [5] Martin Berglund, Brink van der Merwe, and Steyn van Litsenborgh. Regular expressions with lookahead. Journal of Universal Computer Science, 27(4):324–340, 2021. doi:10.3897/jucs.66330.
- [6] Janusz A. Brzozowski. Derivatives of regular expressions. J. ACM, 11(4):481–494, 1964. doi:10.1145/321239.321249.
- [7] Pascal Caron, Jean-Marc Champarnaud, and Ludovic Mignot. Partial derivatives of an extended regular expression. In Language and Automata Theory and Applications, LATA 2011, volume 6638 of LNCS, pages 179–191. Springer, 2011. doi:10.1007/978-3-642-21254-3_13.
- [8] Jean-Marc Champarnaud and Djelloul Ziadi. From Mirkin’s prebases to Antimirov’s word partial derivatives. Fundam. Inf., 45(3):195–205, January 2001. URL: http://content.iospress.com/articles/fundamenta-informaticae/fi45-3-03.
- [9] Thierry Coquand and Vincent Siles. A decision procedure for regular expression equivalence in type theory. In Certified Programs and Proofs, pages 119–134. Springer Berlin Heidelberg, 2011. doi:10.1007/978-3-642-25379-9_11.
- [10] Loris D’Antoni and Margus Veanes. Automata modulo theories. Communications of the ACM, 64(5):86–95, May 2021. doi:10.1145/3419404.
- [11] Derek Egolf, Sam Lasser, and Kathleen Fisher. Verbatim: A verified lexer generator. In 2021 IEEE Security and Privacy Workshops (SPW), pages 92–100, 2021. doi:10.1109/SPW53761.2021.00022.
- [12] Derek Egolf, Sam Lasser, and Kathleen Fisher. Verbatim++: Verified, optimized, and semantically rich lexing with derivatives. In Proceedings of the 11th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP 2022), pages 27–39. ACM, 2022. doi:10.1145/3497775.3503694.
- [13] Gérard Huet. The zipper. Journal of Functional Programming, 7(5):549–554, 1997. doi:10.1017/S0956796897002864.
- [14] Vladimir Komendantsky. Reflexive toolbox for regular expression matching: verification of functional programs in Coq+Ssreflect. In Proceedings of the Sixth Workshop on Programming Languages Meets Program Verification, PLPV ’12, pages 61–70, New York, NY, USA, 2012. Association for Computing Machinery. doi:10.1145/2103776.2103784.
- [15] Konstantinos Mamouras and Agnishom Chattopadhyay. Efficient matching of regular expressions with lookaround assertions. Proc. ACM Program. Lang., 8(POPL), January 2024. doi:10.1145/3632934.
- [16] B. G. Mirkin. An algorithm for constructing a base in a language of regular expressions. Journal of Symbolic Logic, 36(4):694–694, 1971. doi:10.2307/2272532.
- [17] Takayuki Miyazaki and Yasuhiko Minamide. Derivatives of regular expressions with lookahead. J. Inf. Process., 27:422–430, 2019. doi:10.2197/ipsjjip.27.422.
- [18] Nelma Moreira, David Pereira, and Simão Melo de Sousa. Deciding regular expressions (in-)equivalence in Coq. In Wolfram Kahl and Timothy G. Griffin, editors, Relational and Algebraic Methods in Computer Science, pages 98–113, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. doi:10.1007/978-3-642-33314-9_7.
- [19] Dan Moseley, Mario Nishio, Jose Perez Rodriguez, Olli Saarikivi, Stephen Toub, Margus Veanes, Tiki Wan, and Eric Xu. Derivative based nonbacktracking real-world regex matching with backtracking semantics. In Nate Foster et al., editor, PLDI ’23: 44th ACM SIGPLAN International Conference on Programming Language Design and Implementation, Florida, USA, June 17-21, 2023, pages 1–2. ACM, 2023. doi:10.1145/3591262.
- [20] Tobias Nipkow and Dmitriy Traytel. Unified decision procedures for regular expression equivalence. In Gerwin Klein and Ruben Gamboa, editors, Interactive Theorem Proving, pages 450–466, Cham, 2014. Springer International Publishing. doi:10.1007/978-3-319-08970-6_29.
- [21] POSIX. IEEE standard for information technology - portable operating system interface (POSIX(R)). IEEE Std 1003.1-2008 (Revision of IEEE Std 1003.1-2004), pages 1–3874, 2008. doi:10.1109/IEEESTD.2008.4694976.
- [22] Olli Saarikivi, Margus Veanes, Tiki Wan, and Eric Xu. Symbolic regex matcher. In Tomáš Vojnar and Lijun Zhang, editors, Tools and Algorithms for the Construction and Analysis of Systems, volume 11427 of LNCS, pages 372–378. Springer, 2019. doi:10.1007/978-3-030-17462-0_24.
- [23] Arto Salomaa. Two complete axiom systems for the algebra of regular events. J. ACM, 13(1):158–169, January 1966. doi:10.1145/321312.321326.
- [24] Caleb Stanford, Margus Veanes, and Nikolaj Bjørner. Symbolic boolean derivatives for efficiently solving extended regular expression constraints. In PLDI’21, pages 620–635. ACM, 2021. doi:10.1145/3453483.3454066.
- [25] Martin Sulzmann and Kenny Zhuo Ming Lu. Regular expression sub-matching using partial derivatives. In Proceedings of the 14th Symposium on Principles and Practice of Declarative Programming (PPDP’12), pages 79–90, New York, NY, USA, 2012. ACM. doi:10.1145/2370776.2370788.
- [26] Martin Sulzmann and Kenny Zhuo Ming Lu. Posix regular expression parsing with derivatives. In M. Codish and E. Sumii, editors, FLOPS 2014, volume 8475 of LNCS, pages 203–220. Springer, 2014. doi:10.1007/978-3-319-43144-4_5.
- [27] Chengsong Tan and Christian Urban. POSIX lexing with bitcoded derivatives. In A. Naumowicz and R. Thiemann, editors, 14th International Conference on Interactive Theorem Proving, number 26 in LIPICS, pages 26:1–26:18. Dagstuhl Publishing, 2023. doi:10.4230/LIPICS.ITP.2023.27.
- [28] Christian Urban. POSIX lexing with derivatives of regular expressions. Journal of Automated Reasoning, 67:1–24, July 2023. doi:10.1007/s10817-023-09667-1.
- [29] Ian Erik Varatalu, Margus Veanes, and Juhan Ernits. RE#: High performance derivative-based regex matching with intersection, complement, and restricted lookarounds. Proc. ACM Program. Lang., 9(POPL), January 2025. doi:10.1145/3704837.
- [30] Margus Veanes, Thomas Ball, Gabriel Ebner, and Ekaterina Zhuchko. Symbolic automata: Omega-regularity modulo theories. Proc. ACM Program. Lang., 9(POPL), January 2025. doi:10.1145/3704838.
- [31] Ekaterina Zhuchko. Lean sources for this paper, 2025. URL: https://github.com/ezhuchko/finiteness-derivatives.
- [32] Ekaterina Zhuchko, Margus Veanes, and Gabriel Ebner. Lean formalization of extended regular expression matching with lookarounds. In Proceedings of the 13th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP ’24), pages 279–292. ACM, 2024. doi:10.1145/3636501.3636959.
