Abstract 1 Introduction 2 Preliminaries 3 Bounded edit distance 4 Effective algorithms for bounded edit distance 5 Fixed threshold edit distance 6 Other metrics 7 Conclusions References

Minimization of Deterministic Finite Automata Modulo the Edit Distance

Jakub Michaliszyn ORCID University of Wrocław, Poland Jan Otop ORCID University of Wrocław, Poland
Abstract

We propose a novel approach to minimization of deterministic finite automata (DFA), in which the DFA is further minimized at the expense of relaxing equality of languages to merely a similarity. As the notion of similarity of languages, we consider the edit distance between languages ,, i.e., the minimal number of edits necessary to transform any word from to some word from and vice versa.

In this paper we address two problems: minimization up to a predetermined edit distance given in the input, and minimization up to a bounded edit distance, in which there has to be an upper bound on the number of edits, but it is not specified. We show the first problem to be PSpace-complete and that the second problem is in Σ2p, and both NP-hard and coNP-hard. We show that if we limit how many strongly connected components can be visited by a single run (i.e., bounded SCC-depth), the problem becomes NP-complete. We also establish maximal subclasses of DFA over which minimization up to a bounded edit distance can be performed in polynomial time.

Additionally, we provide a succinct overview of alternative metrics for assessing language similarity.

Keywords and phrases:
automata theory, automata minimization, edit distance
Copyright and License:
[Uncaptioned image] © Jakub Michaliszyn and Jan Otop; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Formal languages and automata theory
Funding:
This work was supported by the National Science Centre (NCN), Poland under grant 2020/39/B/ST6/00521.
Editors:
Paweł Gawrychowski, Filip Mazowiecki, and Michał Skrzypczak

1 Introduction

Finite automata are one of the most fundamental models of computation, playing a central role in theoretical computer science and numerous practical applications, including pattern recognition [27, 4], rule-based systems [9], planning [1], language recognition [24] or multi-agent systems [6, 23]. Among finite automata, Deterministic Finite Automata (DFA) provide one of the simplest formalisms for representing and reasoning about structured systems, patterns, rules, and processes.

Minimization of DFA is crucial for two reasons: complexity and interpretability. First, in applications involving DFA, it is often advantageous performance-wise to minimize DFA before proceeding with the actual algorithm.

Second, DFA are used as a formalism to approximate more complex models such as neural networks [26, 15]. Such applications operate in environments where inputs are inherently noisy or incomplete. In such cases, it is often sufficient to approximate the behavior of a system rather than achieving perfect fidelity. In this scenario the exact language equivalence is neither feasible nor necessary, and furthermore DFA of smaller size are typically easier to comprehend, which leads to better interpretability of the underlying model.

Furthermore, as DFA already approximate the underlying models, preserving the exact language of a given DFA in the minimization process is not imperative anymore. Instead, to boost state reduction, we can settle for a similarity notion between languages rather than the rigid language equivalence.

Various notions of similarity between languages have been considered. For instance, two languages are similar if their symmetric difference is finite. Minimization of DFA under this similarity notion is called hyper-minimization of DFA [10, 18]. However, this is a rather generic notion as it applies to any set of objects.

Another idea, which relies on the fact that languages are sets of words, is to take a distance d(w,u) on words and lift it to languages in the canonical way: the distance between a word w and a language , denoted by d(w,), is the minimal d(w,u) over all u. Then, the distance between and is the maximum of numbers d(w,), d(w,) over all w,w. This approach allows infinitely many words to be slightly changed, and is more specialized than the finite symmetric difference. We follow the latter approach employing the edit distance as the distance d.

The edit distance [17], also called the Levenshtein distance, between words w,w is the minimal number of single letter insertions, deletions and substitutions necessary to transform w into w. It is one of the fundamental string metrics used in natural language processing [14], DNA analysis [7] or analysis of neural networks [26]. Furthermore, it is closely related to dynamic time warping, which is a basic measure of similarity between timed series [14]. As a similarity measure between languages, edit distance has been considered on regular languages [3] and context-free languages [8].

1.1 Contributions

This paper aims to provide a fresh perspective to the field of DFA minimization, which balances the conflicting objectives of state reduction and similarity of languages.

We introduce the problem of minimization of DFA modulo the edit distance, a novel approach that balances compactness with approximation flexibility. Unlike classical minimization, which focuses on preserving exact language equivalence, this approach allows for slight deviations between the original and minimized automata, measured by the edit distance, which enables significant state reduction while maintaining practical usability.

The problem is considered in two variants. In the fixed threshold variant (MIN-ED[k]), the goal is to find a DFA with the minimal number of states such that the edit distance between the languages of the original DFA 𝒜 and is bounded by a given threshold k.

In the bounded threshold variant (MIN-BED), the objective is to find a minimal DFA such that the edit distance between the languages of 𝒜 and is finite. This variant further loosens the constraints, allowing for more aggressive state reduction while still capturing the core structure of the original language.

These two perspectives provide a flexible framework for tailoring DFA minimization to the needs of specific applications, particularly in noisy or approximate environments.

Our main technical results are for the bounded threshold variant MIN-BED. We show that this problem is in general Σ2P and both NP-hard and coNP-hard, but we identify subclasses of DFA, which enjoy better complexity. We consider automata with a bounded depth of strongly connected components. We prove that for any c, restricting the problem to automata whose accepting runs visit at most c different (maximal) strongly connected components reduces the complexity to NP. For cases c{1,2}, we provide polynomial-time algorithms and show NP-hardness for c=3.

For the fixed threshold variant MIN-ED[k], we prove that the problem is robustly PSpace-complete, in the sense that it is unclear how to restrict the class of DFA to gain better complexity results. In particular, we discuss that the hardness proof can be adapted to automata in which accepting runs visit at most one (maximal) strongly connected component.

1.2 Related work

Minimization of finite automata has been extensively studied [22, Chapter 10]. The foundational algorithms in this area were designed to reduce the state complexity of DFA while preserving their exact language equivalence. Notable among these are the algorithms proposed by Hopcroft [11], which works in O(nlogn), the algorithm by Moore [20], which is quadratic in the worst case but enjoys better average-case complexity, and the algorithm by Brzozowski [5], which is exponential in the worst case, but often outperforms other algorithms in practice [2].

Building on these foundational methods, researchers have explored extensions and generalizations of DFA minimization. One significant direction is hyper-minimization, which allows for a relaxation of exact language equivalence by permitting finite differences between the languages of the original and minimized DFA. This approach has been studied extensively, with algorithms achieving O(nlogn) complexity [10, 18]. Hyper-minimization has also been adapted to other automata models, such as deterministic tree automata [13] and deterministic weighted tree automata [19].

Language inclusion modulo the edit distance has been studied between regular languages represented by DFA and NFA [3], and regular and context-free languages represented by DFA, NFA, and DPDA [8].

Despite extensive research on DFA minimization, the problem of minimization modulo the edit distance has not been addressed in the literature. Our work fills this gap.

2 Preliminaries

A word w over a finite alphabet Σ of letters is a finite sequence of letters. The set of all finite words over Σ is denoted by Σ.

A deterministic finite-state automaton (DFA) is a tuple Σ,Q,q0,F,δ consisting of the alphabet Σ, a finite set of states Q, the initial state q0Q, a subset FQ of accepting states, and a transition function δ:Q×ΣQ. The size of an automaton 𝒜, denoted by |𝒜|, is its number of states.

Given a word w1wn, the run of a DFA 𝒜 on w is a sequence of states q0q1qn such that q0 is the initial state and for each i we have δ(qi1,wi)=qi. The run on w is accepting if its last state belongs to F. The language of 𝒜, denoted by (𝒜), is the set of words whose runs are accepting. DFA 𝒜 and are equivalent if and only if (𝒜)=().

Let 𝒜 be a DFA and q be its state. We denote by 𝒜q the DFA obtained from 𝒜 by changing the initial state to q. The DFA 𝒜 is minimal if and only if (a) all states of 𝒜 are reachable from the initial state, and (b) for all states q,q of 𝒜, if qq, then (𝒜q)(𝒜q). Due to Myhill-Nerode theorem [12], every regular language is recognized by a minimal DFA 𝒜, which is unique (up to state names), and every DFA language equivalent to 𝒜 other than 𝒜 has more states than 𝒜. Furthermore, given a DFA 𝒜, we can compute a minimal DFA equivalent to 𝒜 in time O(|𝒜|log|𝒜|).

For convenience, throughout the paper, we assume that input DFAs are already minimal – the minimization can be done in polynomial time if needed. This in particular means that all the states are reachable.

A state s of DFA is a sink if it is not accepting and all the transitions from s lead to s. In a minimal DFA there is at most one sink, which is the only state from which no accepting state is reachable.

We extend δ to all words by defining δ(q,ϵ)=q (where ϵ is the empty string) and δ(q,wx):=δ(δ(q,w),x) for each state q, word w and letter x. A subset of states C of a DFA 𝒜 is a (maximal) strongly connected component (SCC) if for any two states s1,s2C, there is a non-empty word w such that δ(s1,w)=s2, and there is no CC with this property. We say that 𝒜 has the SCC-depth c if c is the maximal number such that there is an accepting run visiting c different (maximal) SCC of 𝒜. Importantly, we only consider accepting runs and hence the sink state does not influence the SCC-depth; for example, both automata presented in Figure 2 (page 2) have the SCC-depth 3.

For convenience, we use regular expressions defined in the usual manner, i.e., expressions defined using the BNF e:=ϵxeee+eee+, where x is a letter and ϵ stands for the empty word. The language (e) of e is defined inductively: (ϵ)={ϵ}, for each letter x, (x)={x}; (e1e2)={vwv(e1),w(e2)}; (e1+e2)={vv(e1)v(e2)}; (e)={v1vnivi(e)}, and (e+)=(ee).

2.1 Distance between languages and decision problems

The edit distance between words w,w, denoted by ed(w,w), is the minimal number of operations: insertions of a letter, deletions of a letter, and substitutions of a letter, necessary to transform w into w. For example, ed(ab,acd)=2 (one substitution and one insertion).

For a natural number k, we say that a language is contained in a language up to the edit distance k, denoted as edk, if and only if for each w there is w such that ed(w,w)k. We extend this notion to ed by stating that ed if and only if for some k we have edk. The relation edk is transitive only for k=.

We say that DFA 𝒜, are equivalent up to the edit distance k{}, denoted as 𝒜edk, if and only if (𝒜)edk() and ()edk(𝒜). Notice that 𝒜edk is an equivalence relation if and only if k=; in other cases, it is symmetric and reflexive, but not necessarily transitive.

We study decision problems called MIN-ED[k] for k and MIN-BED defined as follows:
MIN-ED[k]: minimization modulo a fixed edit distance k
Input
: A DFA 𝒜 and a number m.  
Output: Is there a DFA with at most m states such that 𝒜edk?
MIN-BED: minimization modulo a bounded edit distance
Input
: A DFA 𝒜 and a number m.  
Output: Is there a DFA with at most m states such that 𝒜ed?

2.2 Containment modulo a bounded edit distance

We recall a condition for checking containment modulo a bounded edit distance presented in [3, Theorem 4.1]. This condition is used in Sections 3 and 4.

Following [3], we consider minimal DFA, which recognize infinite languages. Note that all DFA that recognize finite non-empty languages are equivalent modulo a bounded edit distance, while the DFA that recognizes the empty language is only equivalent to itself. Conversely, a DFA that recognizes an infinite language can be equivalent modulo a bounded edit distance only to a DFA recognizing infinite language. Therefore, considering only DFA that recognize infinite languages removes only trivial cases.

For a DFA 𝒜 (minimal and recognizing an infinite language), let SCC(𝒜) be the set of all (maximal) strongly connected components of 𝒜 containing an accepting state or from which an accepting state is reachable.

We define dag(𝒜) as a directed acyclic graph whose set of nodes is SCC(𝒜), and there is an edge from C1 to C2 if and only if C1C2 and there is a transition from some state of C1 to some state in C2. Let dag(𝒜) be the reflexive and transitive closure of dag(𝒜).

For CSCC(𝒜), we define (𝒜|C) as the language of words such that w(𝒜|C) if, starting from some state in C, the automaton can read the whole word w without leaving C. More precisely, it is the language of words w1wn such that there are q0qnC satisfying, for all i, δ(qi1,wi)=qi.

Now, consider two DFA 𝒜 and . We say that a path π=C1Cn in dag(𝒜) is covered by a path π=D1Dm in dag() if and only if n=m and for every 1in we have (𝒜|Ci)(|Di). Checking whether a path is covered by a path can be decided in polynomial time [3, Lemma 4.4].

It has been shown in [3, Theorem 4.1] that (𝒜)ed() if and only if every path in dag(𝒜) is covered by some path in dag() (the result from [3, Theorem 4.1] holds even for non-deterministic finite automata). Furthermore, due to [3, Theorem 4.1], we have (𝒜)ed() if and only if (𝒜)edk() for any k(|𝒜|+1)(||+1).

 Remark 1.

The upper bound on the edit distance threshold in MIN-BED is (|𝒜|+1)m, i.e., an instance (𝒜,m) of MIN-BED is positive if and only if it is a positive instance of MIN-ED[(|𝒜|+1)m].

Note that a DFA 𝒜 may contain a rejecting sink state, which has no counterpart in dag(𝒜). Furthermore, some states of 𝒜 may not belong to any SCC (for example, the initial state of 𝒜 if it is not reachable from any other state). In particular, if dag(𝒜) has no root, i.e., the node from which all other nodes are reachable, then the initial state of 𝒜 does not belong to any SCC. We will use this fact further on.

Fact 2.

Let 𝒜 be a DFA such that (1) its language is non-empty, (2) for some word w, its language does not contain any extension of w ((𝒜)(wΣ)=), and (3) dag(𝒜) does not have a root, i.e., a node from which all other nodes are reachable. Then, |𝒜| is at least the number of nodes of dag(𝒜) plus 2.

2.3 Example

Consider a program that receives some data, computes something and sends the response, after which it can start over or terminate. Traces of such a program can be represented over the alphabet {r,c,s,t} as depicted in Figure 1 (left).

Observe that 𝚜 is a sink state; the sink state is present in all automata, but we do not draw it or discuss it for better clarity.

Assume that we want to learn the (target) automaton 𝒜T depicted in Figure 1 (left) from noisy data, which contains as a positive example the word rcsrcsrcs instead of rcsrcsrcst. It is possible that the learning procedure produces the automaton that is depicted in Figure 1 (right).

Figure 1: An example automaton 𝒜T (left) and the automaton 𝒜L (right) constructed from noised data. The remaining transitions of the right automaton lead to the sink state, which is also present in the automaton but not drawn here for better clarity.

Observe that the automaton 𝒜L from Figure 1 (right) is minimal in the classical sense, and also minimal w.r.t. hyper-minimization [10, 18] as there is no language that differ from the language of 𝒜L on some finite set of words whose automaton has less states.

Observe that 𝒜Ted1𝒜L. Clearly (𝒜T)ed1(𝒜L) because (𝒜T)(𝒜L). To see (𝒜L)ed1(𝒜T), observe that any word accepted by 𝒜L but not by 𝒜T is of the form (rcs)4k+3 for some k, and it can be modified to be accepted by 𝒜T by appending t. In this case, the automaton 𝒜L is the minimal automaton equivalent to 𝒜T up to edit distance 1. Therefore, minimizing 𝒜T allows to overcome the problem introduced by the noised data.

Now consider an automaton 𝒜K which is a copy of 𝒜L, but with the state q2 accepting. In this case, this automaton accepts the word rc, which cannot be transformed into any word accepted by 𝒜T with edit distance 1. Therefore, 𝒜Ted1𝒜K. It can be checked, however, that 𝒜Ted2𝒜K and 𝒜Ted𝒜K.

3 Bounded edit distance

In this section, we study the minimization problem for DFA modulo a bounded edit distance. We show that the MIN-BED problem is in Σ2P and it is both coNP-hard and NP-hard. We also identify a class of DFA for which the problem is NP-complete.

3.1 Membership in 𝚺𝟐𝑷 and coNP-hardness

We first show that the MIN-BED problem is in Σ2P. Recall that a language P is in Σ2P if there is a language P in coNP such that xP if and only if there exists y of polynomial size in |x| such that (x,y)P.

Given two DFA 𝒜,, checking whether 𝒜ed amounts to checking two inclusions: (𝒜)ed() and ()ed(𝒜). Since the inclusions can be checked in coNP [3, Theorem 5.2], checking equivalence can also be done in coNP.

To solve MIN-BED for an instance (𝒜,m), assume that m<|𝒜| (otherwise the answer is yes with =𝒜). Then, non-deterministically pick a DFA of the size at most m and return whether 𝒜ed. It follows that the problem is in Σ2P.

Proposition 3.

The MIN-BED problem is in Σ2P.

Recall the condition for checking containment modulo a bounded edit distance (Section 2.2), which involves checking coverage of all paths in dag(𝒜). As we strive for better complexity results, we will restrict the class of DFA to those with polynomially many paths in dag. Later on, we consider DFA 𝒜 such that dag(𝒜) has bounded depth.

It has been shown in [3] that deciding language containment up to a bounded edit distance is coNP-complete. This does not directly imply that the MIN-BED problem or even deciding the equivalence up to a bounded edit distance is coNP-hard. For instance the language inclusion on languages represented by deterministic pushdown automata (DPDA) is undecidable [12], while the equivalence problem for DPDA is decidable [25]. Still, we show coNP-hardness of MIN-BED by adapting the reduction from [3].

Theorem 4.

The MIN-BED problem is coNP-hard.

3.2 MIN-ED[] over DFA of a bounded SCC-depth is NP-complete

We now restrict the class of DFA to those with polynomially many paths in dag and show an improved complexity results.

Lemma 5.

Fix l. Deciding 𝒜ed over DFA 𝒜 such that the number of paths of dag(𝒜) is at most |𝒜|l, is in NP.

Notice that we do not require dag() to have polynomially many paths.

Proof.

To check whether 𝒜ed, for each path of dag(𝒜), non-deterministically pick a path of dag() that covers it. Checking ed𝒜 is more complicated, as may have exponentially many paths. Let Π be the set of all the paths of dag(𝒜). We employ the following algorithm.

For each node C of dag(), non-deterministically pick a non-empty subset covers(C) of paths from Π such that for all πcovers(C), the last node D of π satisfies (|C)(𝒜|D). For nodes C,C such that C is a successor of C in dag() and πcovers(C), we say that π is a continuation of π in C if πcovers(C) and π is a prefix of π or π=π. The algorithm accepts if for all nodes C,C such that C is a successor of C in dag(), each πcovers(C) has a continuation in covers(C).

The algorithm works in polynomial time, as the size of Π is polynomial. If the algorithm accepts, each path is covered. Indeed, consider a path C1Cn in dag(). Consider π1 in covers(C1), which exists as the subsets are non-empty. Then, inductively, let πi+1 be a continuation of πi in Ci. From πn in dag(𝒜), we can extract a path in dag(𝒜) that covers the path C1Cn.

Now we prove that if every path is covered, then the algorithm accepts. To do so, construct covers() starting with covers(C)= for each node. Then, for each path C1Cn of dag(), which is covered by a path D1Dn of dag(𝒜), and for each i we construct a path πi for covers(Ci) as follows. First, remove repeating nodes from D1,,Di, obtaining πi. Since πi is a path of dag(𝒜), one can compute a path πi of dag(𝒜) such that πi can be obtained from πi by possibly inserting some nodes. We add πi to covers(Ci). Then covers() is as required.

Lemma 6.

For every c, the MIN-BED problem over DFA of SCC-depth at most c is in NP.

Proof.

Given a DFA 𝒜 and m, the algorithm first non-deterministically picks a DFA with at most m states (if m>|𝒜| then the algorithm accepts straight away). Notice that dag(𝒜) has at most |𝒜|c paths as the length of a path is bounded by c. Then, the algorithm returns whether 𝒜ed using the algorithm from Lemma 5.

We now show NP-hardness of MIN-BED over DFA of SCC-depth 3. For a (directed) bipartite graph G, let RB(G) be the minimal number of complete bipartite graphs contained in G (not necessarily disjoint), which together cover all edges of G. The following problem is NP-complete [21, Theorem 8.1].
R-bicontent covering
Input
: a bipartite graph G without isolated nodes and m  
Output: whether RB(G)m

Figure 2: A bipartite graph G (bottom left), the corresponding automaton 𝒜G (top left) and a minimal automaton equivalent modulo a bounded edit distance (top right) and the corresponding cover of G (bottom right). For readability, we omit sinks and the transitions leading to the sinks. Notice that the SCC-depth of both automata is 3, as the initial state is not in an SCC, and the sink does not count.

We employ the R-bicontent covering problem to prove the following hardness result.

Lemma 7.

The MIN-BED problem over DFA of SCC-depth at most 3 is NP-hard.

Proof.

Let G=(Vin,Vout,E), where EVin×Vout, be a bipartite graph without isolated nodes, and m. Consider a DFA 𝒜G over the alphabet VinVout{e} accepting exactly words of the form v+ueu+ such that (v,u)E.

The DFA 𝒜G has the following states: an initial state s0, a sink sr, a state vT for each node vVin, and two states vM, vB for each node vVout.

For each (v,u)E, 𝒜G has the following transitions:

  • From s0 to vT over v.

  • From vT, a loop over v and a transition to uM over u.

  • From uM, a loop over e and a transition to uB over u.

  • A loop in uB over u.

All the remaining transitions lead to the sink. The states vB are accepting, while all other states are rejecting. Figure 2 contains an example of this reduction with a corresponding minimal automaton.

Assume that l=RB(G). We claim that the minimal size of a DFA equivalent modulo a bounded edit distance to 𝒜G is |Vin|+|Vout|+l+2.

Let S={(Cin1,Cout1,E1),,(Cinl,Coutl,El)} be a minimal R-bicontent covering of G. Each element of S is a complete bipartite subgraph of G. We construct a DFA 𝒜S from 𝒜G by removing all states vM and introducing states s1, …, sl. For each i, the state si has a loop over e and for all uCouti, a transition over u to uB. Moreover, for each (v,u)E, we pick one i such that (v,u)Ei and we add a transition from vT to si over u. The remaining transitions lead to the sink.

For (𝒜G)ed(𝒜S) observe that (𝒜G)(𝒜S); each word vxueyuz(𝒜G) is accepted by 𝒜S as there is a transition from vT to some si such that (v,u)Ei, and from si there is a transition to uB.

For (𝒜S)ed(𝒜G), observe that 𝒜S accepts words of the form w=vxueyuz such that (v,u)E. For each such a word, there is w=vxueyuz accepted by 𝒜G such that ed(w,w)1.

We now show that if for some l there is a DFA ed𝒜S of the size |Vin|+|Vout|+l+2, then RB(G) is at most l. Due to Fact 2, dag() has at most |Vin|+|Vout|+l nodes.

Consider dag(). Since is equivalent modulo a bounded edit distance to 𝒜G, every path in dag() is covered in dag(𝒜G) and vice versa. Therefore, dag() contains three sets of nodes V1={D(|D)=(v) for some vVin}, V2={D(|D)=(e)}, and V3={D(|D)=(u) for some uVout}. Observe that |V1||Vin| and |V3||Vout|, as otherwise some paths of dag(𝒜G) would not be covered. Hence |V2|l.

We claim that each node in V2 corresponds to a (possibly empty) complete bipartite subgraph of G. For DV2, let DinVin be the labels of the predecessors of D from V1, and DoutVout be the labels of the successors of D from V3. Observe that for any vDin, uDout we have (v,u)E, because every path in dag() is covered in dag(𝒜G), which in turn encodes edges of G, and hence the complete bipartite graph (Din,Dout,Din×Dout) is a subgraph of G. Furthermore, since every path in dag(𝒜G) is covered by dag(), every edge of G is contained in some complete subgraph. Thus, X={(Din,Dout,Din×Dout)DV2} is a set of complete bipartite graphs which together cover all edges of G. Finally, RB(G)|X||V2|l.

By combining the containment in NP from Lemma 6 and the hardness from Lemma 7, we have the following.

Theorem 8.

For any c3, the MIN-BED problem restricted to DFA of SCC-depth at most c is NP-complete.

4 Effective algorithms for bounded edit distance

In this section, we show that the MIN-BED problem restricted to DFA of SCC-depth at most 2 can be solved in polynomial time. In order to avoid technicalities, we allow the minimized automaton to operate over an extended alphabet with additional fresh letters. We use up to |𝒜| fresh letters and the number of occurrences of each fresh letter is bounded among all accepted words.

We first investigate automata, for which the minimization problem is trivial. There are two reasons for that: either (𝒜) is finite or (conversely) the language is equivalent modulo bounded edit distance to Σ. In the former case, the DFA has only sink and possibly some transient states, i.e., the states that do not belong to any SCC and hence can be visited at most once along any run. In the latter case, the DFA has a state from which no sink is reachable. We argue that such automata are equivalent either to a single-state automaton accepting/rejecting all the words or to a two-state automaton recognizing the language {ϵ}. All finite non-empty languages are equivalent modulo bounded edit distance, but {ϵ} has the DFA with the least number of states.

Proposition 9.

Consider a DFA 𝒜. (a) If (𝒜) is finite, then either (𝒜)= or (𝒜)ed{ϵ}. (b) If 𝒜 has a state from which no sink is reachable, then (𝒜)edΣ.

To see (a), observe that the empty language is equivalent modulo bounded edit distance only to the empty language, and all finite non-empty languages are equivalent.

To see (b), observe that if there is a state from which no sink is reachable, then there is a state q in some bottom SCC C of 𝒜 from which no sink is reachable. Observe that since 𝒜 is minimal, C has an accepting state qf – otherwise C would be reduced to a sink. Let w0 be a shortest word such that 𝒜 reading w0 from q0 ends in q. For each word wΣ, we now consider the word w0w. Let q be the state of 𝒜 after reading w0w, and let wf be a shortest word such that the state of 𝒜 after reading wf from q is qf.

Clearly, 𝒜 accepts w0wwf. Moreover, the sizes of w0 and wf can be bounded by the number of states of 𝒜 (using a pumping argument). This shows that Σed2|𝒜|(𝒜). Obviously (𝒜)ed0Σ, and thus 𝒜 is equivalent modulo a bounded edit distance to a single state DFA, which accepts every word.

Due to Proposition 9, in the remainder of this section we assume that a given DFA 𝒜 recognizes an infinite language (which can be easily checked) and from every state of 𝒜 the rejecting sink state is reachable. We call such DFA non-trivial. Observe that for a non-trivial DFA 𝒜, the DAG SCC(𝒜) is non-empty.

4.1 Single SCC with a sink

We now focus on the case of DFA 𝒜 such that 𝒜 is non-trivial and SCC(𝒜) is a singleton {C}. This means that 𝒜 may contain states that are not in any SCC and the sink {sr}.

Theorem 10.

For any non-trivial DFA 𝒜 with dag(𝒜) consisting of a single node, min-scc(𝒜) returns an automaton equivalent to 𝒜 modulo a bounded edit distance such that any DFA equivalent to 𝒜 modulo a bounded edit distance has at least as many states as min-scc(𝒜).

Proof sketch.

The automaton 𝒜 can be minimized using the following steps: (1) remove all that states except for C and sr, (2) make all states of C accepting, (3) pick any state of C as initial, and (4) minimize the resulting automaton using standard DFA minimization.

The automaton 𝒜 obtained this way is equivalent modulo a bounded edit distance to 𝒜, and by construction 𝒜 and min-scc(𝒜) are language equivalent.

We argue that min-scc(𝒜) is minimal. Let =min-scc(𝒜) and assume that 𝒞 is a minimal automaton such that 𝒞ed𝒜. Recall that 𝒜q stands for 𝒜 with the initial state changed to q. One can show that there is a state s such that ()=(𝒞s). Since is by construction a minimal automaton recognizing (q0), we have |||𝒞|, and since 𝒞 is minimal, =min-scc(𝒜) is minimal as well.

A heuristic for the general case

The described algorithm gives us a polynomial-time procedure to minimize SCCs in non-trivial DFA with multiple SCCs. The idea is that one can minimize each SCC independently and then connect the results according to the connections in dag(𝒜), as described below.

Consider 𝒜 whose dag(𝒜) consists of nodes C1,Cn. For each Ci, consider a DFA 𝒜i which is a restriction of 𝒜 to the states Ci and the sink, where the initial state is any state of Ci and each transition from state of Ci that in 𝒜 does not lead to a state in Ci is diverted to the sink. Moreover, let 𝒜0 be a two-state automaton where all the transitions from the accepting initial state lead to the sink. We now consider the DFA 𝒜S obtained by taking the disjoint union of automata 𝒜0,𝒜1,,𝒜n and unifying the sinks. Consider fresh letters c1,,cn. Now for each Ci that has no predecessor in dag(𝒜), we add a transition from all the states of 𝒜0 (except the sink) to all the states of 𝒜i (except the sink). Moreover, for each edge (Ci,Cj) of dag(𝒜), we add a transition from all the states of 𝒜i (except the sink) to all the states of 𝒜j (except the sink). The remaining edges are to the sink.

The procedure described above constructs in polynomial time a DFA 𝒜S that is equivalent modulo a bounded edit distance to the initial one.

Corollary 11.

Let 𝒜 be a non-trivial DFA. Then, the DFA 𝒜S is equivalent modulo a bounded edit distance to 𝒜 and can be constructed in polynomial time.

This gives us a possible heuristic to reduce the size of a given automaton. The procedure described above does not guarantee that the constructed automaton is minimal – for example, it may be possible to eliminate some SCCs. This comes as no surprise as that the minimization problem is both NP-hard and coNP-hard, and the above algorithm works in polynomial time. Nevertheless, this procedure is optimal whenever the SCC structure of the input automaton matches that of any minimal automaton; in particular, this holds if each SCC recognizes a distinct language.

4.2 DFA of SCC-depth 2

For DFA of SCC-depth 3, the MIN-BED problem is NP-hard. We present a complementary result that for DFA of SCC-depth at most 2, minimization modulo a bounded edit distance can be computed in polynomial time.

Recall the example in Figure 2: the main idea there is to have many SCCs in the “middle layer” of the automaton – SCCs that have predecessors and successors in the corresponding DAG. Because of that, the states (being single-state maximal SCCs) s1 and s2, that have the same language ((e)), cannot be merged. However, for SCCs with no successors, it holds that they can be merged if they have the same language – and the same holds for SCCs with no predecessors. We build on this idea to solve the minimization modulo a bounded edit distance for DFA of SCC-depth 2, where each SCC has either no successors or no predecessors. One needs to be careful though; even if the input automaton has the SCC depth 2, it might happen that the minimal equivalent one has SCC depth 3, as we will show later on (Example 13).

For a path π, by π[i] we denote the ith element of π (starting from 1). We say that a path π is weakly covered by π if there is a sequence i1ik such that π[i1]π[ik] covers π. For example, the path CCC is weakly covered by CC because of the sequence 1,1,1.

Let 𝒜 be a DFA of SCC-depth at most 2. A subset E of paths of dag(𝒜) is:

  • noncollapsible if for each path C,D in dag(𝒜) of length 2 we have (𝒜|C)(𝒜|D) and (𝒜|C)(𝒜|D).

  • irredundant if there is no proper subset E of E such that every path in E is weakly covered by a path in E.

  • a minimal cover if it is noncollapsible, irredundant, and paths from E weakly cover all paths in dag(𝒜).

Let 𝒜, be DFA. A path C1,,Ck is tightly covered by a path D1,,Dk if for every i we have (𝒜|Ci)=(|Di).

We show that if 𝒜, are equivalent modulo a bounded edit distance, then every path in a minimal cover of paths in dag(𝒜) is tightly covered by some path from dag(). The intuition is that each paths in the minimal cover defines a “maximal” possible language, that has to be covered by some path of dag(), which in turn has to be covered by some path of dag(𝒜). The language maximality then guarantees the equality of the languages.

Lemma 12.

Let DFA 𝒜, be equivalent modulo a bounded edit distance, SCC depth of 𝒜 be 2, and P be a minimal cover of paths in dag(𝒜). Then, every path in P is tightly covered by some path from dag().

Algorithm 1 Function min-depth-2(𝒜) computing a minimal equivalent automaton in the 2 SCC-depth case. All transitions that are not explicitly defined lead to the sink.

Now, we construct a DFA equivalent modulo a bounded edit distance to 𝒜 with the minimal number of states. To do so, we first compute a minimal cover of paths in dag(𝒜). This is done iteratively in polynomial time, by starting from all the noncollapsible paths of dag(𝒜), and removing paths that can be covered by other paths. Then, we split the SCCs of the minimal cover into two categories: starters, which are the first SCCs of the paths (including paths of length 1), and finishers, which are the second SCCs of the paths of lengths 2. We remove duplicates in each category, minimize each SCC, and construct the target automaton 𝒜~ that consists of a disjoint union of all the remaining starters and finishers.

For each path of length 2 in the minimal cover, we find a corresponding starter and a finisher, and connect all the states of the starter to all the states of the finisher (connecting any state from the starter to any state of the finisher would also work). This may require extending the alphabet with additional letters; we need at most |P| such letters.

If it is possible to pick a state of some starter as an initial state, then we do; otherwise we add one additional starting state. We connect the initial state to all the starters.

Figure 3: Example of the initial state selection process. For readability, the sink and the transitions that lead to sinks are omitted.
Example 13.

Consider the automaton 𝒜 depicted in Figure 3. Without the red dashed edge, this automaton is minimal, and a separate initial state q0 is needed. However, if we consider 𝒜 with with the red dashed edge, then it is not minimal – one can remove q0, make q1 initial and connect q1 to q3 over c, obtaining a smaller equivalent automaton. Interestingly, in this case the SCC-depth of the minimal automaton is 3.

We add transitions from the initial state to all the starters (possibly using another letter). The detailed algorithm is presented in Algorithm 1. A quick check shows that it works in polynomial time.

The DFA min-depth-2(𝒜) is constructed in such a way that dag(min-depth-2(𝒜)) consists of nodes from L1L2 with edges from C to D if and only if CL1,DL2 and the path C,D is tightly covered by some path from P.

The automaton min-depth-2(𝒜) is not unique as the transitions between different SCCs may use different letters; the choice of letters is irrelevant as it can be fixed with a bounded edit distance.

Theorem 14.

Given a non-trivial DFA 𝒜 of SCC-depth at most 2, we can construct in polynomial time a DFA min-depth-2(𝒜) equivalent to 𝒜 modulo bounded edit distance such that any DFA equivalent to 𝒜 modulo a bounded edit distance has at least as many states as min-depth-2(𝒜).

Proof.

Observe that dag(min-depth-2(𝒜)) consists of nodes from L1L2 with edges from C to D if and only if CL1,DL2 and the path C,D is tightly covered by some path from P. Hence each path in dag(min-depth-2(𝒜)) is tightly covered by some path in P and hence it is covered by some path in dag(𝒜). Conversely, every path in P is exactly covered by a path in dag() and hence every path in dag(𝒜) is covered by a path in dag(min-depth-2(𝒜)). It follows that 𝒜 and min-depth-2(𝒜) are equivalent modulo a bounded edit distance.

Minimality of min-depth-2(𝒜) follows from Lemma 12. Indeed, if is a DFA, which is equivalent modulo a bounded edit distance to 𝒜, then Lemma 12 implies that for each SCC C from L1L2, the DFA contains an SCC D with (𝒜|C)=(|D). Therefore, all SCCs present in min-depth-2(𝒜) need to have their counterparts in with the same language. It follows that has at least as many states as min-depth-2(𝒜).

5 Fixed threshold edit distance

In this section we show that for any fixed k the MIN-ED[k] problem is PSpace-complete. The hardness result is robust in the sense that the subclasses of DFA discussed earlier do not admit a better complexity of the MIN-ED[k] problem, and it is not clear how to restrict the class of DFA to obtain better complexity results.

For membership, one can non-deterministically pick a target automaton and verify its equivalence modulo the edit distance k in polynomial space using the algorithm presented in [3, Theorem 5.12]. For PSpace-hardness we reduce the following problem.
Automata universality
Input
: DFA 𝒜1,,𝒜n over an alphabet Σ, such that 𝒜1 contains one sink state and it accepts all words of length at most 1, and for all i we have that (𝒜i)Σ.  
Output: whether L(𝒜1)L(𝒜n)=Σ

This problem is PSpace-complete, which follows from the classic result [16] by adding an appropriate automaton 𝒜1.

Theorem 15.

For each k, the MIN-ED[k] problem is PSpace-hard.

Proof.

Fix k1. We reduce the automata universality problem. Let Σ={b1,,bn,} be disjoint from Σ. We construct a DFA 𝒜 over the alphabet ΣΣ that accepts exactly words of the form (bi)kw such that w(𝒜i).

The automaton 𝒜 over the alphabet ΣΣ consists of the disjoint union of the automata 𝒜1,,𝒜n and fresh non-accepting states q0 (initial) and qij for i,j{1,,k}.

The automaton 𝒜 has the following additional transitions:

  1. 1.

    δ(q0,bi)=qi1 for any i;

  2. 2.

    δ(qij,bi)=qij+1 for any j<k;

  3. 3.

    δ(qik,) is the initial state of 𝒜i, for any j.

The remaining transitions, including transitions over Σ from the states of 𝒜1,,𝒜n, are to the sink state of 𝒜1. Figure 4 contains an example of such an automaton.

Figure 4: Example of the reduction in Theorem 15 for k=3. For readability, sinks and the transitions that lead to sinks are omitted.

We conclude the reduction by fixing m=3, i.e., we ask whether the input DFA is equivalent modulo the edit distance k to some automaton with at most 3 states.

Let be the minimal automaton over ΣΣ recognizing the language {wwΣ}. It has three states: q0 (initial), q (accepting) and s (sink). From q0 the transition over is to q, for each xΣ the transition from q over x is to q, and all the remaining transitions are to s.

Lemma 16.

For all three-state automata 𝒞 such that 𝒜edk𝒞 we have 𝒞= (up to state renaming).

Lemma 17.

𝒜edk if and only if L(𝒜1)L(𝒜n)=Σ.

Proof.

Assume 𝒜edk and let wΣ. Since accepts w, for some i, 𝒜 accepts bikw, which means that 𝒜i accepts w. Thus L(𝒜1)L(𝒜n)=Σ.

Now assume L(𝒜1)L(𝒜n)=Σ. Each word accepted by 𝒜 is of the form bikw, where wΣ. Then, accepts w and ed(bikw,w)k.

On the other hand, each word accepted by is of the form w, where wΣ. Since L(𝒜1)L(𝒜n)=Σ, there is i such that wL(𝒜i). Therefore, for w=bikw we have that 𝒜 accepts w and ed(w,w)k. Thus 𝒜edk .

Lemmas 16 and 17 together imply that the instance (𝒜,3) of the MIN-ED[k] problem is positive iff we have L(𝒜1)L(𝒜n)=Σ. This concludes the proof of PSpace-hardness.

 Remark 18.

One can assume that the input automata of the automata universality problem consists of one SCC, for example, by adding an additional reset letter to each automaton (i.e., a letter such that the transition from each state over this letter leads to the initial state). This shows that the problem is PSpace-hard even for automata of SCC-depth 1.

6 Other metrics

The paper focuses on minimizing DFA modulo similarity of languages, where the similarity measure is based on the edit distance between words. We briefly discuss alternative approaches to define similarity of languages: the Hamming distance and asymmetric relations.

6.1 Minimization modulo the Hamming distance

Hamming distance calculates the minimum number of single-character substitutions required to transform one word into another, assuming both words are of the same length. In contrast, edit distance also allows insertions and deletions of characters, making it applicable to words of different lengths. Hamming distance could be more advantageous in scenarios where fixed-length words are present, or situations where insertions and deletions are unrealistic, like comparing fixed-length DNA sequences. However, its limitation to equal-length words makes it less suitable than edit distance in cases where inputs may vary in length or include additional elements.

We now briefly discuss how our results could be adapted for the Hamming distance metrics. For fixed k, the PSpace-hardness result from Theorem 15 can be easily adapted to the Hamming distance. For the bounded Hamming distance case, i.e, k=, the result would be different from those obtained for equivalence modulo a bounded edit distance. First, our results for a bounded edit distance rely on the condition from [3, Theorem 4.1] presented in Section 2.2 for DFA to be equivalent modulo a bounded edit distance. Up to our best knowledge, there is no counterpart of this condition for the Hamming distance. Second, due to Proposition 9, we have focused on non-universal DFA, for which a rejecting sink state is reachable from every state, as all universal DFA are equivalent modulo finite edit distance to a single-state DFA. Proposition 9 does not extend to language equivalence modulo a bounded Hamming distance. For example, the language consisting of words of even length is equivalent modulo a bounded edit distance to the language of all words, whereas it is not equivalent modulo a bounded Hamming distance. For these reasons, we leave the study of minimization modulo a bounded Hamming distance as future work.

6.2 Asymmetric similarity relations

While we have discussed symmetric similarity measure, in some applications only one-sided errors are allowed. For instance, in formal verification, abstraction methods reduce a given model size by constructing an over-approximation, which is verified against a specification. In such a case, there are no false positives, i.e., the answers that the specification is met are correct, while there may be false negatives due to over-approximation. Over-approximation should balance the size of the minimized model and the false-negatives rate. We can address it with similarity relations.

Consider the following relation: edk if and edk. For k=, this is a transitive relation. This leads to the following minimization problem: given a DFA 𝒜, find a minimal such that (𝒜)ed(). This approach gives us a close over-approximation; an automaton that accepts all the words that 𝒜 accepts and perhaps some additional words, but only if the additional words are close (w.r.t. the edit distance) to some accepted words.

Again, for fixed k, the PSpace-hardness result from Theorem 15 can be easily adapted to the asymmetric distance ed. For k=, most results can be easily adapted to the asymmetric distance ed: containment in Σ2P (Proposition 3), coNP-hardness (Theorem 4), NP-completeness (Theorem 8), triviality of minimization with unrachable sink state (Proposition 9), and minimization of a DFA being a single SCC with a sink (Theorem 10). However, minimization of DFA of SCC-depth 2 does not extend easily to the asymmetric distance ed.

While potentially useful in applications like formal verification, these asymmetric relations exhibit challenges such as counter-intuitive behaviour, and require further investigation into their properties and practical applications.

7 Conclusions

We investigated the problem of minimizing DFA modulo the edit distance. We proved that this problem is PSpace-complete for fixed edit distance and in Σ2P for a bounded edit distance. Then, we proposed a restriction of the latter problem which improves the complexity: we showed that for a fixed SCC-depth, the minimization problem is in NP, and for SCC-depth at most 2 the problem can be solved in polynomial time.

An interesting future work are possible heuristics and approximations for the fixed-edit-distance case. In particular, if the edit distance is fixed, but it is not strictly enforced (for example, it can be slightly violated by some fraction of words).

The bounded edit distance minimization is at least as good as hyperminimization in terms of the number of states (except in some corner cases of finite languages). In practice, we observed that it sometimes minimize too much (see, e.g., Proposition 9). To mitigate this, and also preserve the modular structure of the automata, one could restrict the minimization algorithm to work at some subsets of automata states. In our future work we want to explore algorithms for this scenario.

References

  • [1] Diego Aineto, Sergio Jimenez, and Eva Onaindia. Generalized Temporal Inference via Planning. In Proceedings of the 18th International Conference on Principles of Knowledge Representation and Reasoning, pages 22–31, November 2021. doi:10.24963/kr.2021/3.
  • [2] Marco Almeida, Nelma Moreira, and Rogério Reis. On the performance of automata minimization algorithms. In Proceedings of the 4th Conference on Computation in Europe: Logic and Theory of Algorithms, pages 3–14, 2007.
  • [3] Michael Benedikt, Gabriele Puppis, and Cristian Riveros. Bounded repairability of word languages. J. Comput. Syst. Sci., 79(8):1302–1321, 2013. doi:10.1016/j.jcss.2013.06.001.
  • [4] Anat Bremler-Barr, David Hay, and Yaron Koral. Compactdfa: Scalable pattern matching using longest prefix match solutions. IEEE/Acm Transactions On Networking, 22(2):415–428, 2013. doi:10.1109/TNET.2013.2253119.
  • [5] Janusz A Brzozowski. Canonical regular expressions and minimal state graphs for definite events. In Proc. Symposium of Mathematical Theory of Automata, pages 529–561, 1962.
  • [6] David Carmel and Shaul Markovitch. Opponent modeling in multi-agent systems. In International Joint Conference on Artificial Intelligence, pages 40–52. Springer, 1995. doi:10.1007/3-540-60923-7_18.
  • [7] William I. Chang and Eugene L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4/5):327–344, 1994. doi:10.1007/BF01185431.
  • [8] Krishnendu Chatterjee, Thomas A. Henzinger, Rasmus Ibsen-Jensen, and Jan Otop. Edit distance for pushdown automata. Log. Methods Comput. Sci., 13(3), 2017. doi:10.23638/LMCS-13(3:23)2017.
  • [9] Massimiliano de Leoni, Paolo Felli, and Marco Montali. Strategy Synthesis for Data-Aware Dynamic Systems with Multiple Actors. In Proceedings of the 17th International Conference on Principles of Knowledge Representation and Reasoning, pages 315–325, September 2020. doi:10.24963/kr.2020/32.
  • [10] Pawel Gawrychowski and Artur Jeż. Hyper-minimisation made efficient. In MFCS 2009, pages 356–368, 2009. doi:10.1007/978-3-642-03816-7_31.
  • [11] John Hopcroft. An n log n algorithm for minimizing states in a finite automaton. In Theory of machines and computations, pages 189–196. Elsevier, 1971.
  • [12] John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages and Computation, Second Edition. Addison-Wesley, 2000.
  • [13] Artur Jeż and Andreas Maletti. Hyper-minimization for deterministic tree automata. In CIAA 2012, pages 217–228, 2012. doi:10.1007/978-3-642-31606-7_19.
  • [14] Dan Jurafsky and James H. Martin. Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 2nd Edition. Prentice Hall series in artificial intelligence. Prentice Hall, Pearson Education International, 2009. URL: https://www.worldcat.org/oclc/315913020.
  • [15] Igor Khmelnitsky, Daniel Neider, Rajarshi Roy, Xuan Xie, Benoît Barbot, Benedikt Bollig, Alain Finkel, Serge Haddad, Martin Leucker, and Lina Ye. Analysis of recurrent neural networks via property-directed verification of surrogate models. International Journal on Software Tools for Technology Transfer, 25(3):341–354, 2023. doi:10.1007/s10009-022-00684-w.
  • [16] Dexter Kozen. Lower bounds for natural proof systems. In 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 254–266. IEEE, 1977. doi:10.1109/SFCS.1977.16.
  • [17] Vladimir I Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet physics doklady, 10(8):707–710, 1966.
  • [18] Andreas Maletti and Daniel Quernheim. Optimal hyper-minimization. Int. J. Found. Comput. Sci., 22(8):1877–1891, 2011. doi:10.1142/S0129054111009094.
  • [19] Andreas Maletti and Daniel Quernheim. Hyper-minimization for deterministic weighted tree automata. In AFL 2014, pages 314–326, 2014. doi:10.4204/EPTCS.151.22.
  • [20] Edward F Moore et al. Gedanken-experiments on sequential machines. Automata studies, 34:129–153, 1956.
  • [21] James Orlin. Contentment in graph theory: Covering graphs with cliques. Indagationes Mathematicae (Proceedings), 80(5):406–424, 1977. URL: https://www.sciencedirect.com/science/article/pii/1385725877900555.
  • [22] Jean-Éric Pin, editor. Handbook of Automata Theory. European Mathematical Society Publishing House, Zürich, Switzerland, 2021. doi:10.4171/Automata.
  • [23] Senthil Rajasekaran and Moshe Y. Vardi. Verification and Realizability in Finite-Horizon Multiagent Systems. In Proceedings of the 19th International Conference on Principles of Knowledge Representation and Reasoning, pages 278–287, August 2022. doi:10.24963/kr.2022/28.
  • [24] Emmanuel Roche and Yves Schabes. Deterministic part-of-speech tagging with finite state transducers. Computational linguistics, 21(2):227–253, 1995.
  • [25] Géraud Sénizergues. The equivalence problem for deterministic pushdown automata is decidable. In Automata, Languages and Programming, 24th International Colloquium, ICALP’97, Bologna, Italy, 7-11 July 1997, Proceedings, pages 671–681, 1997. doi:10.1007/3-540-63165-8_221.
  • [26] Qinglong Wang, Kaixuan Zhang, Xue Liu, and C Lee Giles. Verification of recurrent neural networks through rule extraction. arXiv preprint arXiv:1811.06029, 2018. doi:10.48550/arXiv.1811.06029.
  • [27] Erkang Zhu, Silu Huang, and Surajit Chaudhuri. High-performance row pattern recognition using joins. Proc. VLDB Endow., 16(5):1181–1195, January 2023. doi:10.14778/3579075.3579090.