Minimization of Deterministic Finite Automata Modulo the Edit Distance
Abstract
We propose a novel approach to minimization of deterministic finite automata (DFA), in which the DFA is further minimized at the expense of relaxing equality of languages to merely a similarity. As the notion of similarity of languages, we consider the edit distance between languages , i.e., the minimal number of edits necessary to transform any word from to some word from and vice versa.
In this paper we address two problems: minimization up to a predetermined edit distance given in the input, and minimization up to a bounded edit distance, in which there has to be an upper bound on the number of edits, but it is not specified. We show the first problem to be PSpace-complete and that the second problem is in , and both NP-hard and coNP-hard. We show that if we limit how many strongly connected components can be visited by a single run (i.e., bounded SCC-depth), the problem becomes NP-complete. We also establish maximal subclasses of DFA over which minimization up to a bounded edit distance can be performed in polynomial time.
Additionally, we provide a succinct overview of alternative metrics for assessing language similarity.
Keywords and phrases:
automata theory, automata minimization, edit distanceCopyright and License:
2012 ACM Subject Classification:
Theory of computation Formal languages and automata theoryFunding:
This work was supported by the National Science Centre (NCN), Poland under grant 2020/39/B/ST6/00521.Editors:
Paweł Gawrychowski, Filip Mazowiecki, and Michał SkrzypczakSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Finite automata are one of the most fundamental models of computation, playing a central role in theoretical computer science and numerous practical applications, including pattern recognition [27, 4], rule-based systems [9], planning [1], language recognition [24] or multi-agent systems [6, 23]. Among finite automata, Deterministic Finite Automata (DFA) provide one of the simplest formalisms for representing and reasoning about structured systems, patterns, rules, and processes.
Minimization of DFA is crucial for two reasons: complexity and interpretability. First, in applications involving DFA, it is often advantageous performance-wise to minimize DFA before proceeding with the actual algorithm.
Second, DFA are used as a formalism to approximate more complex models such as neural networks [26, 15]. Such applications operate in environments where inputs are inherently noisy or incomplete. In such cases, it is often sufficient to approximate the behavior of a system rather than achieving perfect fidelity. In this scenario the exact language equivalence is neither feasible nor necessary, and furthermore DFA of smaller size are typically easier to comprehend, which leads to better interpretability of the underlying model.
Furthermore, as DFA already approximate the underlying models, preserving the exact language of a given DFA in the minimization process is not imperative anymore. Instead, to boost state reduction, we can settle for a similarity notion between languages rather than the rigid language equivalence.
Various notions of similarity between languages have been considered. For instance, two languages are similar if their symmetric difference is finite. Minimization of DFA under this similarity notion is called hyper-minimization of DFA [10, 18]. However, this is a rather generic notion as it applies to any set of objects.
Another idea, which relies on the fact that languages are sets of words, is to take a distance on words and lift it to languages in the canonical way: the distance between a word and a language , denoted by , is the minimal over all . Then, the distance between and is the maximum of numbers , over all . This approach allows infinitely many words to be slightly changed, and is more specialized than the finite symmetric difference. We follow the latter approach employing the edit distance as the distance .
The edit distance [17], also called the Levenshtein distance, between words is the minimal number of single letter insertions, deletions and substitutions necessary to transform into . It is one of the fundamental string metrics used in natural language processing [14], DNA analysis [7] or analysis of neural networks [26]. Furthermore, it is closely related to dynamic time warping, which is a basic measure of similarity between timed series [14]. As a similarity measure between languages, edit distance has been considered on regular languages [3] and context-free languages [8].
1.1 Contributions
This paper aims to provide a fresh perspective to the field of DFA minimization, which balances the conflicting objectives of state reduction and similarity of languages.
We introduce the problem of minimization of DFA modulo the edit distance, a novel approach that balances compactness with approximation flexibility. Unlike classical minimization, which focuses on preserving exact language equivalence, this approach allows for slight deviations between the original and minimized automata, measured by the edit distance, which enables significant state reduction while maintaining practical usability.
The problem is considered in two variants. In the fixed threshold variant (MIN-ED[]), the goal is to find a DFA with the minimal number of states such that the edit distance between the languages of the original DFA and is bounded by a given threshold .
In the bounded threshold variant (MIN-BED), the objective is to find a minimal DFA such that the edit distance between the languages of and is finite. This variant further loosens the constraints, allowing for more aggressive state reduction while still capturing the core structure of the original language.
These two perspectives provide a flexible framework for tailoring DFA minimization to the needs of specific applications, particularly in noisy or approximate environments.
Our main technical results are for the bounded threshold variant MIN-BED. We show that this problem is in general and both NP-hard and coNP-hard, but we identify subclasses of DFA, which enjoy better complexity. We consider automata with a bounded depth of strongly connected components. We prove that for any , restricting the problem to automata whose accepting runs visit at most different (maximal) strongly connected components reduces the complexity to NP. For cases , we provide polynomial-time algorithms and show NP-hardness for .
For the fixed threshold variant MIN-ED[], we prove that the problem is robustly PSpace-complete, in the sense that it is unclear how to restrict the class of DFA to gain better complexity results. In particular, we discuss that the hardness proof can be adapted to automata in which accepting runs visit at most one (maximal) strongly connected component.
1.2 Related work
Minimization of finite automata has been extensively studied [22, Chapter 10]. The foundational algorithms in this area were designed to reduce the state complexity of DFA while preserving their exact language equivalence. Notable among these are the algorithms proposed by Hopcroft [11], which works in , the algorithm by Moore [20], which is quadratic in the worst case but enjoys better average-case complexity, and the algorithm by Brzozowski [5], which is exponential in the worst case, but often outperforms other algorithms in practice [2].
Building on these foundational methods, researchers have explored extensions and generalizations of DFA minimization. One significant direction is hyper-minimization, which allows for a relaxation of exact language equivalence by permitting finite differences between the languages of the original and minimized DFA. This approach has been studied extensively, with algorithms achieving complexity [10, 18]. Hyper-minimization has also been adapted to other automata models, such as deterministic tree automata [13] and deterministic weighted tree automata [19].
Language inclusion modulo the edit distance has been studied between regular languages represented by DFA and NFA [3], and regular and context-free languages represented by DFA, NFA, and DPDA [8].
Despite extensive research on DFA minimization, the problem of minimization modulo the edit distance has not been addressed in the literature. Our work fills this gap.
2 Preliminaries
A word over a finite alphabet of letters is a finite sequence of letters. The set of all finite words over is denoted by .
A deterministic finite-state automaton (DFA) is a tuple consisting of the alphabet , a finite set of states , the initial state , a subset of accepting states, and a transition function . The size of an automaton , denoted by , is its number of states.
Given a word , the run of a DFA on is a sequence of states such that is the initial state and for each we have . The run on is accepting if its last state belongs to . The language of , denoted by , is the set of words whose runs are accepting. DFA and are equivalent if and only if .
Let be a DFA and be its state. We denote by the DFA obtained from by changing the initial state to . The DFA is minimal if and only if (a) all states of are reachable from the initial state, and (b) for all states of , if , then . Due to Myhill-Nerode theorem [12], every regular language is recognized by a minimal DFA , which is unique (up to state names), and every DFA language equivalent to other than has more states than . Furthermore, given a DFA , we can compute a minimal DFA equivalent to in time .
For convenience, throughout the paper, we assume that input DFAs are already minimal – the minimization can be done in polynomial time if needed. This in particular means that all the states are reachable.
A state of DFA is a sink if it is not accepting and all the transitions from lead to . In a minimal DFA there is at most one sink, which is the only state from which no accepting state is reachable.
We extend to all words by defining (where is the empty string) and for each state , word and letter . A subset of states of a DFA is a (maximal) strongly connected component (SCC) if for any two states , there is a non-empty word such that , and there is no with this property. We say that has the SCC-depth if is the maximal number such that there is an accepting run visiting different (maximal) SCC of . Importantly, we only consider accepting runs and hence the sink state does not influence the SCC-depth; for example, both automata presented in Figure 2 (page 2) have the SCC-depth 3.
For convenience, we use regular expressions defined in the usual manner, i.e., expressions defined using the BNF , where is a letter and stands for the empty word. The language of is defined inductively: , for each letter , ; ; ; , and .
2.1 Distance between languages and decision problems
The edit distance between words , denoted by , is the minimal number of operations: insertions of a letter, deletions of a letter, and substitutions of a letter, necessary to transform into . For example, (one substitution and one insertion).
For a natural number , we say that a language is contained in a language up to the edit distance , denoted as , if and only if for each there is such that . We extend this notion to by stating that if and only if for some we have . The relation is transitive only for .
We say that DFA are equivalent up to the edit distance , denoted as , if and only if and . Notice that is an equivalence relation if and only if ; in other cases, it is symmetric and reflexive, but not necessarily transitive.
We study decision problems called MIN-ED[] for and MIN-BED defined as follows:
MIN-ED[]: minimization modulo a fixed edit distance
Input: A DFA and a number .
Output: Is there a DFA with at most states such that ?
MIN-BED: minimization modulo a bounded edit distance
Input: A DFA and a number .
Output: Is there a DFA with at most states such that ?
2.2 Containment modulo a bounded edit distance
We recall a condition for checking containment modulo a bounded edit distance presented in [3, Theorem 4.1]. This condition is used in Sections 3 and 4.
Following [3], we consider minimal DFA, which recognize infinite languages. Note that all DFA that recognize finite non-empty languages are equivalent modulo a bounded edit distance, while the DFA that recognizes the empty language is only equivalent to itself. Conversely, a DFA that recognizes an infinite language can be equivalent modulo a bounded edit distance only to a DFA recognizing infinite language. Therefore, considering only DFA that recognize infinite languages removes only trivial cases.
For a DFA (minimal and recognizing an infinite language), let be the set of all (maximal) strongly connected components of containing an accepting state or from which an accepting state is reachable.
We define as a directed acyclic graph whose set of nodes is , and there is an edge from to if and only if and there is a transition from some state of to some state in . Let be the reflexive and transitive closure of .
For , we define as the language of words such that if, starting from some state in , the automaton can read the whole word without leaving . More precisely, it is the language of words such that there are satisfying, for all , .
Now, consider two DFA and . We say that a path in is covered by a path in if and only if and for every we have . Checking whether a path is covered by a path can be decided in polynomial time [3, Lemma 4.4].
It has been shown in [3, Theorem 4.1] that if and only if every path in is covered by some path in (the result from [3, Theorem 4.1] holds even for non-deterministic finite automata). Furthermore, due to [3, Theorem 4.1], we have if and only if for any .
Remark 1.
The upper bound on the edit distance threshold in MIN-BED is , i.e., an instance of MIN-BED is positive if and only if it is a positive instance of MIN-ED[].
Note that a DFA may contain a rejecting sink state, which has no counterpart in . Furthermore, some states of may not belong to any SCC (for example, the initial state of if it is not reachable from any other state). In particular, if has no root, i.e., the node from which all other nodes are reachable, then the initial state of does not belong to any SCC. We will use this fact further on.
Fact 2.
Let be a DFA such that (1) its language is non-empty, (2) for some word , its language does not contain any extension of (), and (3) does not have a root, i.e., a node from which all other nodes are reachable. Then, is at least the number of nodes of plus .
2.3 Example
Consider a program that receives some data, computes something and sends the response, after which it can start over or terminate. Traces of such a program can be represented over the alphabet as depicted in Figure 1 (left).
Observe that is a sink state; the sink state is present in all automata, but we do not draw it or discuss it for better clarity.
Assume that we want to learn the (target) automaton depicted in Figure 1 (left) from noisy data, which contains as a positive example the word instead of . It is possible that the learning procedure produces the automaton that is depicted in Figure 1 (right).
Observe that the automaton from Figure 1 (right) is minimal in the classical sense, and also minimal w.r.t. hyper-minimization [10, 18] as there is no language that differ from the language of on some finite set of words whose automaton has less states.
Observe that . Clearly because . To see , observe that any word accepted by but not by is of the form for some , and it can be modified to be accepted by by appending . In this case, the automaton is the minimal automaton equivalent to up to edit distance . Therefore, minimizing allows to overcome the problem introduced by the noised data.
Now consider an automaton which is a copy of , but with the state accepting. In this case, this automaton accepts the word , which cannot be transformed into any word accepted by with edit distance . Therefore, . It can be checked, however, that and .
3 Bounded edit distance
In this section, we study the minimization problem for DFA modulo a bounded edit distance. We show that the MIN-BED problem is in and it is both coNP-hard and NP-hard. We also identify a class of DFA for which the problem is NP-complete.
3.1 Membership in and coNP-hardness
We first show that the MIN-BED problem is in . Recall that a language is in if there is a language in coNP such that if and only if there exists of polynomial size in such that .
Given two DFA , checking whether amounts to checking two inclusions: and . Since the inclusions can be checked in coNP [3, Theorem 5.2], checking equivalence can also be done in coNP.
To solve MIN-BED for an instance , assume that (otherwise the answer is yes with ). Then, non-deterministically pick a DFA of the size at most and return whether . It follows that the problem is in .
Proposition 3.
The MIN-BED problem is in .
Recall the condition for checking containment modulo a bounded edit distance (Section 2.2), which involves checking coverage of all paths in . As we strive for better complexity results, we will restrict the class of DFA to those with polynomially many paths in dag. Later on, we consider DFA such that has bounded depth.
It has been shown in [3] that deciding language containment up to a bounded edit distance is coNP-complete. This does not directly imply that the MIN-BED problem or even deciding the equivalence up to a bounded edit distance is coNP-hard. For instance the language inclusion on languages represented by deterministic pushdown automata (DPDA) is undecidable [12], while the equivalence problem for DPDA is decidable [25]. Still, we show coNP-hardness of MIN-BED by adapting the reduction from [3].
Theorem 4.
The MIN-BED problem is coNP-hard.
3.2 MIN-ED[] over DFA of a bounded SCC-depth is NP-complete
We now restrict the class of DFA to those with polynomially many paths in dag and show an improved complexity results.
Lemma 5.
Fix . Deciding over DFA such that the number of paths of is at most , is in NP.
Notice that we do not require to have polynomially many paths.
Proof.
To check whether , for each path of , non-deterministically pick a path of that covers it. Checking is more complicated, as may have exponentially many paths. Let be the set of all the paths of . We employ the following algorithm.
For each node of , non-deterministically pick a non-empty subset of paths from such that for all , the last node of satisfies . For nodes such that is a successor of in and , we say that is a continuation of in if and is a prefix of or . The algorithm accepts if for all nodes such that is a successor of in , each has a continuation in .
The algorithm works in polynomial time, as the size of is polynomial. If the algorithm accepts, each path is covered. Indeed, consider a path in . Consider in , which exists as the subsets are non-empty. Then, inductively, let be a continuation of in . From in , we can extract a path in that covers the path .
Now we prove that if every path is covered, then the algorithm accepts. To do so, construct starting with for each node. Then, for each path of , which is covered by a path of , and for each we construct a path for as follows. First, remove repeating nodes from , obtaining . Since is a path of , one can compute a path of such that can be obtained from by possibly inserting some nodes. We add to . Then is as required.
Lemma 6.
For every , the MIN-BED problem over DFA of SCC-depth at most is in NP.
Proof.
Given a DFA and , the algorithm first non-deterministically picks a DFA with at most states (if then the algorithm accepts straight away). Notice that has at most paths as the length of a path is bounded by . Then, the algorithm returns whether using the algorithm from Lemma 5.
We now show NP-hardness of MIN-BED over DFA of SCC-depth .
For a (directed) bipartite graph , let be the minimal number of complete bipartite graphs contained in (not necessarily disjoint), which together cover all edges of .
The following problem is NP-complete [21, Theorem 8.1].
R-bicontent covering
Input: a bipartite graph without isolated nodes and
Output: whether
We employ the R-bicontent covering problem to prove the following hardness result.
Lemma 7.
The MIN-BED problem over DFA of SCC-depth at most is NP-hard.
Proof.
Let , where , be a bipartite graph without isolated nodes, and . Consider a DFA over the alphabet accepting exactly words of the form such that .
The DFA has the following states: an initial state , a sink , a state for each node , and two states , for each node .
For each , has the following transitions:
-
From to over .
-
From , a loop over and a transition to over .
-
From , a loop over and a transition to over .
-
A loop in over .
All the remaining transitions lead to the sink. The states are accepting, while all other states are rejecting. Figure 2 contains an example of this reduction with a corresponding minimal automaton.
Assume that . We claim that the minimal size of a DFA equivalent modulo a bounded edit distance to is .
Let be a minimal R-bicontent covering of . Each element of is a complete bipartite subgraph of . We construct a DFA from by removing all states and introducing states , …, . For each , the state has a loop over and for all , a transition over to . Moreover, for each , we pick one such that and we add a transition from to over . The remaining transitions lead to the sink.
For observe that ; each word is accepted by as there is a transition from to some such that , and from there is a transition to .
For , observe that accepts words of the form such that . For each such a word, there is accepted by such that .
We now show that if for some there is a DFA of the size , then is at most . Due to Fact 2, has at most nodes.
Consider . Since is equivalent modulo a bounded edit distance to , every path in is covered in and vice versa. Therefore, contains three sets of nodes for some , , and for some . Observe that and , as otherwise some paths of would not be covered. Hence .
We claim that each node in corresponds to a (possibly empty) complete bipartite subgraph of . For , let be the labels of the predecessors of from , and be the labels of the successors of from . Observe that for any , we have , because every path in is covered in , which in turn encodes edges of , and hence the complete bipartite graph is a subgraph of . Furthermore, since every path in is covered by , every edge of is contained in some complete subgraph. Thus, is a set of complete bipartite graphs which together cover all edges of . Finally, .
By combining the containment in NP from Lemma 6 and the hardness from Lemma 7, we have the following.
Theorem 8.
For any , the MIN-BED problem restricted to DFA of SCC-depth at most is NP-complete.
4 Effective algorithms for bounded edit distance
In this section, we show that the MIN-BED problem restricted to DFA of SCC-depth at most can be solved in polynomial time. In order to avoid technicalities, we allow the minimized automaton to operate over an extended alphabet with additional fresh letters. We use up to fresh letters and the number of occurrences of each fresh letter is bounded among all accepted words.
We first investigate automata, for which the minimization problem is trivial. There are two reasons for that: either is finite or (conversely) the language is equivalent modulo bounded edit distance to . In the former case, the DFA has only sink and possibly some transient states, i.e., the states that do not belong to any SCC and hence can be visited at most once along any run. In the latter case, the DFA has a state from which no sink is reachable. We argue that such automata are equivalent either to a single-state automaton accepting/rejecting all the words or to a two-state automaton recognizing the language . All finite non-empty languages are equivalent modulo bounded edit distance, but has the DFA with the least number of states.
Proposition 9.
Consider a DFA . (a) If is finite, then either or . (b) If has a state from which no sink is reachable, then .
To see (a), observe that the empty language is equivalent modulo bounded edit distance only to the empty language, and all finite non-empty languages are equivalent.
To see (b), observe that if there is a state from which no sink is reachable, then there is a state in some bottom SCC of from which no sink is reachable. Observe that since is minimal, has an accepting state – otherwise would be reduced to a sink. Let be a shortest word such that reading from ends in . For each word , we now consider the word . Let be the state of after reading , and let be a shortest word such that the state of after reading from is .
Clearly, accepts . Moreover, the sizes of and can be bounded by the number of states of (using a pumping argument). This shows that . Obviously , and thus is equivalent modulo a bounded edit distance to a single state DFA, which accepts every word.
Due to Proposition 9, in the remainder of this section we assume that a given DFA recognizes an infinite language (which can be easily checked) and from every state of the rejecting sink state is reachable. We call such DFA non-trivial. Observe that for a non-trivial DFA , the DAG is non-empty.
4.1 Single SCC with a sink
We now focus on the case of DFA such that is non-trivial and is a singleton . This means that may contain states that are not in any SCC and the sink .
Theorem 10.
For any non-trivial DFA with consisting of a single node, min-scc() returns an automaton equivalent to modulo a bounded edit distance such that any DFA equivalent to modulo a bounded edit distance has at least as many states as min-scc().
Proof sketch.
The automaton can be minimized using the following steps: (1) remove all that states except for and , (2) make all states of accepting, (3) pick any state of as initial, and (4) minimize the resulting automaton using standard DFA minimization.
The automaton obtained this way is equivalent modulo a bounded edit distance to , and by construction and min-scc() are language equivalent.
We argue that min-scc() is minimal. Let and assume that is a minimal automaton such that . Recall that stands for with the initial state changed to . One can show that there is a state such that . Since is by construction a minimal automaton recognizing , we have , and since is minimal, is minimal as well.
A heuristic for the general case
The described algorithm gives us a polynomial-time procedure to minimize SCCs in non-trivial DFA with multiple SCCs. The idea is that one can minimize each SCC independently and then connect the results according to the connections in , as described below.
Consider whose consists of nodes . For each , consider a DFA which is a restriction of to the states and the sink, where the initial state is any state of and each transition from state of that in does not lead to a state in is diverted to the sink. Moreover, let be a two-state automaton where all the transitions from the accepting initial state lead to the sink. We now consider the DFA obtained by taking the disjoint union of automata and unifying the sinks. Consider fresh letters . Now for each that has no predecessor in , we add a transition from all the states of (except the sink) to all the states of (except the sink). Moreover, for each edge of , we add a transition from all the states of (except the sink) to all the states of (except the sink). The remaining edges are to the sink.
The procedure described above constructs in polynomial time a DFA that is equivalent modulo a bounded edit distance to the initial one.
Corollary 11.
Let be a non-trivial DFA. Then, the DFA is equivalent modulo a bounded edit distance to and can be constructed in polynomial time.
This gives us a possible heuristic to reduce the size of a given automaton. The procedure described above does not guarantee that the constructed automaton is minimal – for example, it may be possible to eliminate some SCCs. This comes as no surprise as that the minimization problem is both NP-hard and coNP-hard, and the above algorithm works in polynomial time. Nevertheless, this procedure is optimal whenever the SCC structure of the input automaton matches that of any minimal automaton; in particular, this holds if each SCC recognizes a distinct language.
4.2 DFA of SCC-depth 2
For DFA of SCC-depth 3, the MIN-BED problem is NP-hard. We present a complementary result that for DFA of SCC-depth at most 2, minimization modulo a bounded edit distance can be computed in polynomial time.
Recall the example in Figure 2: the main idea there is to have many SCCs in the “middle layer” of the automaton – SCCs that have predecessors and successors in the corresponding DAG. Because of that, the states (being single-state maximal SCCs) and , that have the same language (), cannot be merged. However, for SCCs with no successors, it holds that they can be merged if they have the same language – and the same holds for SCCs with no predecessors. We build on this idea to solve the minimization modulo a bounded edit distance for DFA of SCC-depth 2, where each SCC has either no successors or no predecessors. One needs to be careful though; even if the input automaton has the SCC depth 2, it might happen that the minimal equivalent one has SCC depth 3, as we will show later on (Example 13).
For a path , by we denote the th element of (starting from 1). We say that a path is weakly covered by if there is a sequence such that covers . For example, the path is weakly covered by because of the sequence .
Let be a DFA of SCC-depth at most . A subset of paths of is:
-
noncollapsible if for each path in of length we have and .
-
irredundant if there is no proper subset of such that every path in is weakly covered by a path in .
-
a minimal cover if it is noncollapsible, irredundant, and paths from weakly cover all paths in .
Let be DFA. A path is tightly covered by a path if for every we have .
We show that if are equivalent modulo a bounded edit distance, then every path in a minimal cover of paths in is tightly covered by some path from . The intuition is that each paths in the minimal cover defines a “maximal” possible language, that has to be covered by some path of , which in turn has to be covered by some path of . The language maximality then guarantees the equality of the languages.
Lemma 12.
Let DFA be equivalent modulo a bounded edit distance, SCC depth of be 2, and be a minimal cover of paths in . Then, every path in is tightly covered by some path from .
Now, we construct a DFA equivalent modulo a bounded edit distance to with the minimal number of states. To do so, we first compute a minimal cover of paths in . This is done iteratively in polynomial time, by starting from all the noncollapsible paths of , and removing paths that can be covered by other paths. Then, we split the SCCs of the minimal cover into two categories: starters, which are the first SCCs of the paths (including paths of length 1), and finishers, which are the second SCCs of the paths of lengths 2. We remove duplicates in each category, minimize each SCC, and construct the target automaton that consists of a disjoint union of all the remaining starters and finishers.
For each path of length 2 in the minimal cover, we find a corresponding starter and a finisher, and connect all the states of the starter to all the states of the finisher (connecting any state from the starter to any state of the finisher would also work). This may require extending the alphabet with additional letters; we need at most such letters.
If it is possible to pick a state of some starter as an initial state, then we do; otherwise we add one additional starting state. We connect the initial state to all the starters.
Example 13.
Consider the automaton depicted in Figure 3. Without the red dashed edge, this automaton is minimal, and a separate initial state is needed. However, if we consider with with the red dashed edge, then it is not minimal – one can remove , make initial and connect to over , obtaining a smaller equivalent automaton. Interestingly, in this case the SCC-depth of the minimal automaton is .
We add transitions from the initial state to all the starters (possibly using another letter). The detailed algorithm is presented in Algorithm 1. A quick check shows that it works in polynomial time.
The DFA min-depth-2() is constructed in such a way that consists of nodes from with edges from to if and only if and the path is tightly covered by some path from .
The automaton min-depth-2() is not unique as the transitions between different SCCs may use different letters; the choice of letters is irrelevant as it can be fixed with a bounded edit distance.
Theorem 14.
Given a non-trivial DFA of SCC-depth at most , we can construct in polynomial time a DFA min-depth-2() equivalent to modulo bounded edit distance such that any DFA equivalent to modulo a bounded edit distance has at least as many states as min-depth-2().
Proof.
Observe that consists of nodes from with edges from to if and only if and the path is tightly covered by some path from . Hence each path in is tightly covered by some path in and hence it is covered by some path in . Conversely, every path in is exactly covered by a path in and hence every path in is covered by a path in . It follows that and min-depth-2() are equivalent modulo a bounded edit distance.
Minimality of min-depth-2() follows from Lemma 12. Indeed, if is a DFA, which is equivalent modulo a bounded edit distance to , then Lemma 12 implies that for each SCC from , the DFA contains an SCC with . Therefore, all SCCs present in min-depth-2() need to have their counterparts in with the same language. It follows that has at least as many states as min-depth-2().
5 Fixed threshold edit distance
In this section we show that for any fixed the MIN-ED[] problem is PSpace-complete. The hardness result is robust in the sense that the subclasses of DFA discussed earlier do not admit a better complexity of the MIN-ED[] problem, and it is not clear how to restrict the class of DFA to obtain better complexity results.
For membership, one can non-deterministically pick a target automaton and verify its equivalence modulo the edit distance in polynomial space using the algorithm presented in [3, Theorem 5.12]. For PSpace-hardness we reduce the following problem.
Automata universality
Input: DFA over an alphabet ,
such that contains one sink state and it accepts all words of length at most 1, and
for all we have that .
Output: whether
This problem is PSpace-complete, which follows from the classic result [16] by adding an appropriate automaton .
Theorem 15.
For each , the MIN-ED[k] problem is PSpace-hard.
Proof.
Fix . We reduce the automata universality problem. Let be disjoint from . We construct a DFA over the alphabet that accepts exactly words of the form such that .
The automaton over the alphabet consists of the disjoint union of the automata and fresh non-accepting states (initial) and for .
The automaton has the following additional transitions:
-
1.
for any ;
-
2.
for any ;
-
3.
is the initial state of , for any .
The remaining transitions, including transitions over from the states of , are to the sink state of . Figure 4 contains an example of such an automaton.
We conclude the reduction by fixing , i.e., we ask whether the input DFA is equivalent modulo the edit distance to some automaton with at most states.
Let be the minimal automaton over recognizing the language . It has three states: (initial), (accepting) and (sink). From the transition over is to , for each the transition from over is to , and all the remaining transitions are to .
Lemma 16.
For all three-state automata such that we have (up to state renaming).
Lemma 17.
if and only if .
Proof.
Assume and let . Since accepts , for some , accepts , which means that accepts . Thus .
Now assume . Each word accepted by is of the form , where . Then, accepts and .
On the other hand, each word accepted by is of the form , where . Since , there is such that . Therefore, for we have that accepts and . Thus .
Lemmas 16 and 17 together imply that the instance of the MIN-ED[k] problem is positive iff we have . This concludes the proof of PSpace-hardness.
Remark 18.
One can assume that the input automata of the automata universality problem consists of one SCC, for example, by adding an additional reset letter to each automaton (i.e., a letter such that the transition from each state over this letter leads to the initial state). This shows that the problem is PSpace-hard even for automata of SCC-depth .
6 Other metrics
The paper focuses on minimizing DFA modulo similarity of languages, where the similarity measure is based on the edit distance between words. We briefly discuss alternative approaches to define similarity of languages: the Hamming distance and asymmetric relations.
6.1 Minimization modulo the Hamming distance
Hamming distance calculates the minimum number of single-character substitutions required to transform one word into another, assuming both words are of the same length. In contrast, edit distance also allows insertions and deletions of characters, making it applicable to words of different lengths. Hamming distance could be more advantageous in scenarios where fixed-length words are present, or situations where insertions and deletions are unrealistic, like comparing fixed-length DNA sequences. However, its limitation to equal-length words makes it less suitable than edit distance in cases where inputs may vary in length or include additional elements.
We now briefly discuss how our results could be adapted for the Hamming distance metrics. For fixed , the PSpace-hardness result from Theorem 15 can be easily adapted to the Hamming distance. For the bounded Hamming distance case, i.e, , the result would be different from those obtained for equivalence modulo a bounded edit distance. First, our results for a bounded edit distance rely on the condition from [3, Theorem 4.1] presented in Section 2.2 for DFA to be equivalent modulo a bounded edit distance. Up to our best knowledge, there is no counterpart of this condition for the Hamming distance. Second, due to Proposition 9, we have focused on non-universal DFA, for which a rejecting sink state is reachable from every state, as all universal DFA are equivalent modulo finite edit distance to a single-state DFA. Proposition 9 does not extend to language equivalence modulo a bounded Hamming distance. For example, the language consisting of words of even length is equivalent modulo a bounded edit distance to the language of all words, whereas it is not equivalent modulo a bounded Hamming distance. For these reasons, we leave the study of minimization modulo a bounded Hamming distance as future work.
6.2 Asymmetric similarity relations
While we have discussed symmetric similarity measure, in some applications only one-sided errors are allowed. For instance, in formal verification, abstraction methods reduce a given model size by constructing an over-approximation, which is verified against a specification. In such a case, there are no false positives, i.e., the answers that the specification is met are correct, while there may be false negatives due to over-approximation. Over-approximation should balance the size of the minimized model and the false-negatives rate. We can address it with similarity relations.
Consider the following relation: if and . For , this is a transitive relation. This leads to the following minimization problem: given a DFA , find a minimal such that . This approach gives us a close over-approximation; an automaton that accepts all the words that accepts and perhaps some additional words, but only if the additional words are close (w.r.t. the edit distance) to some accepted words.
Again, for fixed , the PSpace-hardness result from Theorem 15 can be easily adapted to the asymmetric distance . For , most results can be easily adapted to the asymmetric distance : containment in (Proposition 3), coNP-hardness (Theorem 4), NP-completeness (Theorem 8), triviality of minimization with unrachable sink state (Proposition 9), and minimization of a DFA being a single SCC with a sink (Theorem 10). However, minimization of DFA of SCC-depth does not extend easily to the asymmetric distance .
While potentially useful in applications like formal verification, these asymmetric relations exhibit challenges such as counter-intuitive behaviour, and require further investigation into their properties and practical applications.
7 Conclusions
We investigated the problem of minimizing DFA modulo the edit distance. We proved that this problem is PSpace-complete for fixed edit distance and in for a bounded edit distance. Then, we proposed a restriction of the latter problem which improves the complexity: we showed that for a fixed SCC-depth, the minimization problem is in NP, and for SCC-depth at most 2 the problem can be solved in polynomial time.
An interesting future work are possible heuristics and approximations for the fixed-edit-distance case. In particular, if the edit distance is fixed, but it is not strictly enforced (for example, it can be slightly violated by some fraction of words).
The bounded edit distance minimization is at least as good as hyperminimization in terms of the number of states (except in some corner cases of finite languages). In practice, we observed that it sometimes minimize too much (see, e.g., Proposition 9). To mitigate this, and also preserve the modular structure of the automata, one could restrict the minimization algorithm to work at some subsets of automata states. In our future work we want to explore algorithms for this scenario.
References
- [1] Diego Aineto, Sergio Jimenez, and Eva Onaindia. Generalized Temporal Inference via Planning. In Proceedings of the 18th International Conference on Principles of Knowledge Representation and Reasoning, pages 22–31, November 2021. doi:10.24963/kr.2021/3.
- [2] Marco Almeida, Nelma Moreira, and Rogério Reis. On the performance of automata minimization algorithms. In Proceedings of the 4th Conference on Computation in Europe: Logic and Theory of Algorithms, pages 3–14, 2007.
- [3] Michael Benedikt, Gabriele Puppis, and Cristian Riveros. Bounded repairability of word languages. J. Comput. Syst. Sci., 79(8):1302–1321, 2013. doi:10.1016/j.jcss.2013.06.001.
- [4] Anat Bremler-Barr, David Hay, and Yaron Koral. Compactdfa: Scalable pattern matching using longest prefix match solutions. IEEE/Acm Transactions On Networking, 22(2):415–428, 2013. doi:10.1109/TNET.2013.2253119.
- [5] Janusz A Brzozowski. Canonical regular expressions and minimal state graphs for definite events. In Proc. Symposium of Mathematical Theory of Automata, pages 529–561, 1962.
- [6] David Carmel and Shaul Markovitch. Opponent modeling in multi-agent systems. In International Joint Conference on Artificial Intelligence, pages 40–52. Springer, 1995. doi:10.1007/3-540-60923-7_18.
- [7] William I. Chang and Eugene L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4/5):327–344, 1994. doi:10.1007/BF01185431.
- [8] Krishnendu Chatterjee, Thomas A. Henzinger, Rasmus Ibsen-Jensen, and Jan Otop. Edit distance for pushdown automata. Log. Methods Comput. Sci., 13(3), 2017. doi:10.23638/LMCS-13(3:23)2017.
- [9] Massimiliano de Leoni, Paolo Felli, and Marco Montali. Strategy Synthesis for Data-Aware Dynamic Systems with Multiple Actors. In Proceedings of the 17th International Conference on Principles of Knowledge Representation and Reasoning, pages 315–325, September 2020. doi:10.24963/kr.2020/32.
- [10] Pawel Gawrychowski and Artur Jeż. Hyper-minimisation made efficient. In MFCS 2009, pages 356–368, 2009. doi:10.1007/978-3-642-03816-7_31.
- [11] John Hopcroft. An n log n algorithm for minimizing states in a finite automaton. In Theory of machines and computations, pages 189–196. Elsevier, 1971.
- [12] John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages and Computation, Second Edition. Addison-Wesley, 2000.
- [13] Artur Jeż and Andreas Maletti. Hyper-minimization for deterministic tree automata. In CIAA 2012, pages 217–228, 2012. doi:10.1007/978-3-642-31606-7_19.
- [14] Dan Jurafsky and James H. Martin. Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 2nd Edition. Prentice Hall series in artificial intelligence. Prentice Hall, Pearson Education International, 2009. URL: https://www.worldcat.org/oclc/315913020.
- [15] Igor Khmelnitsky, Daniel Neider, Rajarshi Roy, Xuan Xie, Benoît Barbot, Benedikt Bollig, Alain Finkel, Serge Haddad, Martin Leucker, and Lina Ye. Analysis of recurrent neural networks via property-directed verification of surrogate models. International Journal on Software Tools for Technology Transfer, 25(3):341–354, 2023. doi:10.1007/s10009-022-00684-w.
- [16] Dexter Kozen. Lower bounds for natural proof systems. In 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 254–266. IEEE, 1977. doi:10.1109/SFCS.1977.16.
- [17] Vladimir I Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet physics doklady, 10(8):707–710, 1966.
- [18] Andreas Maletti and Daniel Quernheim. Optimal hyper-minimization. Int. J. Found. Comput. Sci., 22(8):1877–1891, 2011. doi:10.1142/S0129054111009094.
- [19] Andreas Maletti and Daniel Quernheim. Hyper-minimization for deterministic weighted tree automata. In AFL 2014, pages 314–326, 2014. doi:10.4204/EPTCS.151.22.
- [20] Edward F Moore et al. Gedanken-experiments on sequential machines. Automata studies, 34:129–153, 1956.
- [21] James Orlin. Contentment in graph theory: Covering graphs with cliques. Indagationes Mathematicae (Proceedings), 80(5):406–424, 1977. URL: https://www.sciencedirect.com/science/article/pii/1385725877900555.
- [22] Jean-Éric Pin, editor. Handbook of Automata Theory. European Mathematical Society Publishing House, Zürich, Switzerland, 2021. doi:10.4171/Automata.
- [23] Senthil Rajasekaran and Moshe Y. Vardi. Verification and Realizability in Finite-Horizon Multiagent Systems. In Proceedings of the 19th International Conference on Principles of Knowledge Representation and Reasoning, pages 278–287, August 2022. doi:10.24963/kr.2022/28.
- [24] Emmanuel Roche and Yves Schabes. Deterministic part-of-speech tagging with finite state transducers. Computational linguistics, 21(2):227–253, 1995.
- [25] Géraud Sénizergues. The equivalence problem for deterministic pushdown automata is decidable. In Automata, Languages and Programming, 24th International Colloquium, ICALP’97, Bologna, Italy, 7-11 July 1997, Proceedings, pages 671–681, 1997. doi:10.1007/3-540-63165-8_221.
- [26] Qinglong Wang, Kaixuan Zhang, Xue Liu, and C Lee Giles. Verification of recurrent neural networks through rule extraction. arXiv preprint arXiv:1811.06029, 2018. doi:10.48550/arXiv.1811.06029.
- [27] Erkang Zhu, Silu Huang, and Surajit Chaudhuri. High-performance row pattern recognition using joins. Proc. VLDB Endow., 16(5):1181–1195, January 2023. doi:10.14778/3579075.3579090.
