Abstract 1 Introduction 2 Preliminaries 3 Infimum and Supremum Walks 4 Inf and Sup Conflicts 5 Data Structure 6 Existence of a Leftmost Walk References

Encoding Co-Lex Orders of Finite-State Automata in Linear Space

Ruben Becker ORCID DAIS, Ca’ Foscari University of Venice, Italy Nicola Cotumaccio ORCID University of Helsinki, Finland Sung-Hwan Kim ORCID DAIS, Ca’ Foscari University of Venice, Italy Nicola Prezza ORCID DAIS, Ca’ Foscari University of Venice, Italy Carlo Tosoni ORCID DAIS, Ca’ Foscari University of Venice, Italy
Abstract

The Burrows-Wheeler transform (BWT) is a string transformation that enhances string indexing and compressibility. Cotumaccio and Prezza [SODA ’21] extended this transformation to nondeterministic finite automata (NFAs) through co-lexicographic partial orders, i.e., by sorting the states of an NFA according to the co-lexicographic order of the strings reaching them. As the BWT of an NFA shares many properties with its original string variant, the transformation can be used to implement indices for locating specific patterns on the NFA itself. The efficiency of the resulting index is influenced by the width of the partial order on the states: the smaller the width, the faster the index. The most efficient index for arbitrary NFAs currently known in the literature is based on the coarsest forward-stable co-lex (CFS) order of Becker et al. [SPIRE ’24]. In this paper, we prove that this CFS order can be encoded within linear space in the number of states in the automaton. The importance of this result stems from the fact that encoding such an order in linear space represents a big first step in the direction of building the index based on this order in near-linear time – the biggest open research question in this context. The currently most efficient known algorithm for this task run in quadratic time in the number of transitions in the NFA and are thus infeasible to run on very large graphs (e.g., pangenome graphs). At this point, a near-linear time algorithm is solely known for the simpler case of deterministic automata [Becker et al., ESA ’23] and, in fact, this algorithmic result was enabled by a linear space encoding for deterministic automata [Kim et al., CPM ’23].

Keywords and phrases:
Burrows-Wheeler Transform, Co-Lexicographic Orders, Nondeterministic Finite Automata, Graph Walks
Copyright and License:
[Uncaptioned image] © Ruben Becker, Nicola Cotumaccio, Sung-Hwan Kim, Nicola Prezza, and Carlo Tosoni; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Formal languages and automata theory
; Theory of computation Data structures design and analysis ; Theory of computation Pattern matching ; Theory of computation Graph algorithms analysis
Funding:
Ruben Becker, Sung-Hwan Kim, Nicola Prezza, Carlo Tosoni: funded by the European Union (ERC, REGINDEX, 101039208). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. Nicola Cotumaccio: funded by the Helsinki Institute for Information Technology.
Editors:
Paola Bonizzoni and Veli Mäkinen

1 Introduction

The Burrows-Wheeler transform (BWT) [4] is a renowned reversible string transformation that rearranges a string’s characters to improve compressibility, while at the same time allowing the implementation of efficient indices. Although the original BWT was designed for strings, Gagie et al. extended this transformation to a particular class of nondeterministic finite automata (NFAs), which they termed Wheeler NFAs [11]. Subsequently, Cotumaccio et al. [10, 8] managed to extend the transformation to arbitrary NFAs through the concept of co-lexicographic orders (abbreviated to co-lex orders), yielding a natural extension of the BWT to NFAs. More precisely, co-lex orders are particular partial orders on an NFA’s states such that, if uv, with u,v being two states, then the strings reaching u and not reaching v are smaller than or equal to the strings reaching v and not reaching u. Such co-lex orders exist for every NFA and can be used to implement indices on the recognized regular language. The efficiency of the index depends on the width of the used co-lex order (a parameter being equal to 1 on Wheeler NFAs and always upper-bounded by the number of states). Specifically, the smaller the width of the co-lex order, the faster and smaller the resulting index. Computing the co-lex order of minimal width is however an NP-hard problem [12]. This issue has been addressed by Becker et al. [2, 3], who introduced coarsest forward-stable co-lex (CFS) orders, a new category of partial preorders that are as useful as co-lex orders for indexing purposes. Such CFS orders are guaranteed to exist for every NFA and, furthermore, are unique and can be computed in polynomial time. Moreover, the width of the CFS order is never larger than that of any co-lex order and, in some cases, is asymptotically smaller than the minimum-width co-lex order. As a result, CFS orders enable the implementation of indices in polynomial time, which are never slower than those based on co-lex orders, and that in some cases are asymptotically faster and smaller. However, the state-of-the-art algorithm for computing such CFS orders has quadratic time complexity with respect to the number of transitions in the automaton [3, Corollary 1]. This quadratic time complexity makes the application of such CFS orders infeasible in practice, e.g., in bioinformatics, where pangenome graphs (i.e., graphs encoding the DNA of a population) are used more and more frequently [16]. Such pangenome graphs fall within the category of big data for which only near-linear time algorithms can be considered feasible [15]. For this reason, the current main open research problem in this realm is to find an efficient, i.e., near-linear time, algorithm for computing co-lex orders of small width for arbitrary NFAs. For the special case of deterministic finite automata such a near-linear time algorithm is known [2, Algorithm 2] and its discovery was preceded by an encoding of this order that takes linear space with respect to the number of states of the automaton [13]. A similar linear space representation for the general case of nondeterministic finite automata is however not known for any of the candidate co-lex orders in the literature. In this paper, we resolve this main problem that hinders us to find an efficient algorithm for computing co-lex orders for arbitrary NFAs. We do so by giving an efficient data structure for the following problem.

Problem 1.

Given a forward-stable NFA, find a data structure for its maximum co-lex order FS that supports queries of the form: given two states u and v, determine if uFSv.

Here a forward-stable NFA is an NFA for which the coarsest forward stable partition [14] is equal to the partition consisting of all singleton sets. The CFS order on an arbitrary NFA [3, Definition 6] is defined as the maximum co-lex order on the corresponding forward-stable NFA. The forward stable NFA is a quotient automaton of the original automaton (thus its size is at most the size of the original automaton). As a result, a data structure for Problem 1 permits to represent the CFS order of an arbitrary NFA and is thus general enough to represent a co-lex order for an arbitrary NFA.

Contribution and Main Techniques.

In this article we give a data structure for Problem 1 stated above. More precisely, we prove the following theorem. In what follows, we denote with n the number of states and with m the number of transitions of the NFA at hand.

Theorem 2.

Given a forward-stable NFA with n states, there exists a data structure for Problem 1 taking O(n) space and supporting queries in O(n) time.

Here space is measured in RAM words of Θ(logn) bits. In order to better put our result into context we observe that there are two trivial solutions to Problem 1. (1) Explicitly storing the n2 pairs of the co-lex order yields a data structure for Problem 1 that takes O(n2) bits and supports queries in O(1) time. (2) Storing the input NFA takes O(m) space and the NFA inherently represents its maximum co-lex order FS. It is however unclear how to support queries efficiently in this case. Both of these approaches require Ω(n2) bits to be stored in the worst-case as the number of transitions may be Θ(n2).

Our data structure that satisfies the properties in Theorem 2 relies on three main techniques. (i) We assume to have computed a co-lex extension , i.e., a total order that is a superset of the maximum co-lex order of a forward-stable NFA (see Definition 11 for the formal definition). Such a co-lex extension can be obtained by running the ordered partition refinement algorithm of Becker et al. [2]. (ii) For each state, we store a left-minimal infimum walk Puinf and a right-maximal supremum walk Pusup to u. An infimum (supremum) walk to a state u is a walk encoding the lexicographic smallest (largest) string reaching u from the initial state. An infimum (supremum) walk Pu to a state u is left-minimal (right-maximal) if, whenever it intersects with another infimum (supremum) walk Pu, then the predecessor in Pu is smaller (larger) than or equal to the predecessor in Pu according to the co-lex extension. We study this type of walks in Sections 3 and 6. (iii) For each state u, we furthermore store two integers ϕ(u,Puinf) and γ(u,Pusup) that we use in order to encode the deepest states on the infimum walk Puinf (supremum walk Pusup) that is in infimum (supremum) conflict with Puinf (Pusup). We introduce this concept of infimum and supremum conflicts in Section 4. Intuitively, a state u¯ in Puinf is in infimum conflict with Puinf if there exists a state u^ that is incomparable with u¯ according to the maximum co-lex order and both u¯ and u^ can reach u with the same string α.

Figure 1 shows the decision tree used by our data structure in order to determine comparability with respect to the CFS order FS between two states u and v based on the above three concepts. Let u and v be two states. If u=v, then uFSv holds by reflexivity, case (a) in Figure 1. Otherwise, if v<u (here < means vu with respect to the co-lex extension and vu), we can conclude ¬(uFSv), case (b) in Figure 1. Otherwise, let Pusup be the right-maximal supremum walk to u and let Pvinf be the left-minimal infimum walk to v. Imagine traversing the two walks from u and v backwards yielding a sequence of pairs (ui,vi)i1. While traversing the walks we can construct the strings supIu and infIv. If supIuinfIv, we know that uFSv, case (c) in Figure 1. Otherwise, when comparing the pairs ui,vi with respect to the co-lex extension , we are guaranteed to find a pair uj,vj such that vj<uj. We distinguish two cases, if for each i[j1], we have ui<vi as in case (d) of Figure 1, then ¬(uFSv). Otherwise, there exists a minimal integer jj such that uj=vj. In this case, the comparability of u and v can be decided using the infimum/supremum conflicts. Consider the maximum integer h such that either uh is in sup conflict with Pusup or vh is in inf conflict with Pvinf. If hj (case (e) in Figure 1), we know ¬(uFSv). In fact, the opposite would imply that either Pusup is not right-maximal or Puinf is not left-minimal. Finally, if h<j (case (f) in Figure 1), we conclude that uFSv. The details of the data structure are presented in Section 5. We remark that the existence of our left-minimal infimum and right-maximal supremum walks is actually proved independently of the labels in the graph through an unlabeled analogue that we call leftmost/rightmost walks. These walks represent a combinatorial object in unlabeled directed graphs that may be of independent interest. As a central ingredient of our proof, we show constructively (i.e., through an algorithm) that such leftmost/rightmost walks are guaranteed to exist for any (unlabeled) directed graph and can be represented by a linear space function that encodes the predecessor of each node in their leftmost/rightmost walk. This is shown in Section 6.

Figure 1: Let FS and be the maximum co-lex order (see Definition 6) and a co-lex extension (see Definition 11) of a forward-stable NFA 𝒜, respectively. Let u and v be any two states in 𝒜. Denote with Pusup=(ui)i1 a supremum right-maximal walk to the state u and with Pvinf=(vi)i1 an infimum left-minimal walk to v (see Definitions 5 and 19). The figure shows the decision tree representing all possible cases that may arise when determining whether uFSv. Here, j is the smallest integer such that vj<uj, while j is the smallest integer such that uj=vj. Functions ϕj and γj represent the deepest states in infimum/supremum conflict with the walks Pusup and Pvinf.

2 Preliminaries

Strings and NFAs.

Given an alphabet Σ, we denote by Σ the set of all finite strings over Σ, where εΣ is the empty string. Moreover, we define Σω as the set containing all strings formed by an infinite enumerable concatenation of characters from Σ (i.e., strings of infinite length). In particular, we consider right-infinite strings, meaning that αΣω is constructed from ε by appending an infinite sequence of characters to its end. Therefore, the operation of prepending a character aΣ of αΣω is well defined and yields the string aα. The notation αω, with αΣ, denotes the concatenation of an infinite (enumerable) number of copies of α. In this paper, we assume to have a fixed total order over Σ. We extend to ΣΣω in order to obtain the lexicographic order on strings. For each αΣΣω, |α|=l denotes the length of α, where l= if αΣω. In addition, for each integer i with 1i<l+1, αi denotes the i-th character of α, starting from the left. Finally, we denote with α[i,j], where i,j are two integers such that 1ij<l+1, the string of Σ formed by the concatenation of characters αiαi+1αj.

A nondeterministic finite automaton (NFA) is a 4-tuple (Q,δ,Σ,s), where Q represents the set of the states, δ:Q×Σ2Q is the automaton’s transition function, Σ is the alphabet, and sQ is the initial state. The standard definition of NFAs also includes a set of final states; however, we omitted them since we are not interested in distinguishing between final states and non-final states. Given an NFA 𝒜=(Q,δ,Σ,s), a state uQ, and a character aΣ, we may use the shortcut δa(u) for δ(u,a). We make the following assumptions on NFAs: (i) We assume the alphabet Σ to be effective; each character of Σ labels at least one edge of the transition function. (ii) We assume that every state is reachable from the initial state. (iii) We assume that s has only one incoming edge, sδ(s,#), where #Σ does not label any other edge of 𝒜. (iv) We do not require each state to have an outgoing edge for all possible characters of Σ. (v) We assume that our NFAs are input-consistent; an NFA is said to be input-consistent if all edges reaching the same state have the same label of Σ. This assumption is not restrictive since any NFA can be transformed into an input-consistent NFA by replacing each state with at most |Σ| copies of itself, without changing its regular language. Given an NFA 𝒜=(Q, δ, Σ, s), and a state uQ, with us, we denote with λ(u) the (unique) character of Σ that labels the incoming edges of u, thus λ(s)=#.

Definition 3 (Forward-Stability).

Given an NFA 𝒜=(Q,δ,Σ,s) and two sets of states S, TQ, we say that S is forward-stable with respect to T, if, for all aΣ, one of the following conditions holds: (i) for every uS, there exists vT such that uδa(v). (ii) For every uS and vT, it holds that uδa(v). A partition 𝒬 of 𝒜’s states is forward-stable for 𝒜, if, for any two parts S, T𝒬, it holds that S is forward-stable with respect to T.

For each NFA there exists a unique coarsest forward-stable partition, i.e., the forward-stable partition with the lowest cardinality [2]. Henceforth, we call an NFA forward-stable if its coarsest forward-stable partition is formed by singleton sets.

Infimum and supremum walks.

Given an integer j, we denote by [j] the set {1,,j}. Given an NFA 𝒜=(Q,δ,Σ,s), and a state uQ, we say that a walk to u, denoted as Pu=(ui)i=1l, is a non-empty sequence of l states from Q that satisfies the following conditions: (i) u1=u, and (ii) for each i[l1], uiδa(ui+1) where a=λ(ui). Moreover, we denote by Pu=(ui)i1 a walk of infinite length to uQ. We define Iu as the set of all strings αΣω, for which there exists a walk Pu=(ui)i1 such that α=λ(u1)λ(u2)λ(u3). We now provide the definition of supremum and infimum strings of an NFA’s state, which was introduced by Conte et al. [5] and Kim et al. [13, Definition 7].

Definition 4 (Infimum and supremum strings).

Let 𝒜 be an NFA, and let u be a state of 𝒜. Then, the infimum string of u, denoted infIu, is the lexicographically smallest string in Iu. The supremum string of u, denoted supIu, instead is the lexicographically largest string Iu.

It is easy to demonstrate that for each uQ, the strings infIu and supIu exist (for instance, it follows from Observation 8 in the work of Kim et al. [13], see also [7]). We now provide the definition of infimum and supremum walks.

Definition 5 (infimum and supremum walks).

Let 𝒜 be an NFA, and let u be a state in 𝒜. Consider α=infIu and β=supIu and a walk Pu=(ui)i1, then we say that:

  • Pu is an infimum walk to u, denoted as Puinf, if for each integer i1, λ(ui)=αi.

  • Pu is a supremum walk to u, denoted as Pusup, if for each integer i1, λ(ui)=βi.

In other words, a walk to u is an infimum (supremum) walk, if it is labeled with the infimum (supremum) string of u.

Co-lex orders.

A partial order on a set U is a reflexive, antisymmetric and transitive relation on U. Given a partial order on a set U, for each u,vU, we write u<v if uv and uv. We now report the formal definition of co-lex order of an NFA, which was first introduced by Cotumaccio and Prezza [10, Definition 3.1].

Definition 6 (Co-lex order).

Let 𝒜 be an NFA. A co-lex order of 𝒜 is a partial order over Q that satisfies the following two axioms:

  1. 1.

    For each u,vQ, if uv, then λ(u)λ(v).

  2. 2.

    For each pair uδa(u) and vδa(v), if u<v, then uv.

Note that every NFA 𝒜=(Q,δ,Σ,s) admits a co-lex order; in fact, the partial order {(u,u):uQ} trivially satisfies the two axioms of Definition 6. We say that a co-lex order of an automaton 𝒜 is the maximum co-lex order of 𝒜 if is equal to the union of all co-lex orders of 𝒜. Becker et al. [3, Lemma 3] showed that every forward-stable NFA admits a maximum co-lex order. Hereafter, we will denote by FS the maximum co-lex order of an NFA.

Observation 7.

Let 𝒜 be a forward-stable NFA, and let u,v be two states of 𝒜 such that λ(u)<λ(v). Then u<FSv.

Proof.

Consider the partial order defined as follows; {(z,z):zQ}{(u,v)}. is a co-lex order of 𝒜, as it does not violate the axioms of Definition 6. Therefore, since FS is defined as the union of every co-lex order of 𝒜, it follows that u<FSv holds.

Preceding pairs.

We start with the definition of preceding pairs [6, Definition 6].

Definition 8 (Preceding pairs).

Let 𝒜 be an NFA and let (u¯,v¯),(u,v)Q×Q be pairs of distinct states. We say that (u¯,v¯) precedes (u,v), denoted by (u¯,v¯)(u,v), if there exist two walks, Pu=(ui)i=1l and Pv=(vi)i=1l, such that (i) ul=u¯ and vl=v¯, (ii) for each i[l], uivi, and (iii) for each i[l1], λ(ui)=λ(vi)=a, for some aΣ.

Note that if u,vQ are distinct states, then the pair (u,v) trivially precedes itself. The following observation directly follows from Definition 8.

Observation 9.

The relation (u¯,v¯)(u,v) is transitive.

Preceding pairs characterize the maximum co-lex order of a forward-stable NFA as follows.

Corollary 10.

Let 𝒜 be a forward-stable NFA and let FS be its maximum co-lex order. Then u<FSv holds for two distinct states u,v in of 𝒜 if and only if, for each pair (u¯,v¯) with (u¯,v¯)(u,v), it holds that λ(u¯)λ(v¯).

Proof.

Let u and v be two distinct states. Following [6, Lemma 7], (u,v) is contained in the maximum co-lex relation (see [6, Definition 4]) if and only if λ(u¯)λ(v¯) for every pair (u¯,v¯) with (u¯,v¯)(u,v). The maximum co-lex order FS of a forward-stable NFA 𝒜 is equal to its maximum co-lex relation according to [3, Lemma 3].

Note that for any distinct states u,v of 𝒜, the statement ¬(u<FSv) holds if and only if there exists (u¯,v¯)(u,v) with λ(u¯)>λ(v¯). We now define co-lex extensions.

Definition 11 (Co-lex extension).

Let 𝒜 be a forward-stable NFA and Q its set of states. Consider the maximum co-lex order FS of 𝒜. Then, a total order on Q is a co-lex extension of 𝒜, if FS.

Due to Lemma 11 of the work of Becker et al. [2], it follows that the ordered partition refinement [2, Algorithm 1] represents a feasible algorithm for computing a co-lex extension.

Lemma 12.

Consider a forward-stable NFA 𝒜=(Q,δ,Σ,s). Moreover, let FS and be the maximum co-lex order and a co-lex extension of 𝒜, respectively. Then, for every u,vQ such that u<v, there exists a pair (u¯,v¯) with (u¯,v¯)(u,v) such that λ(u¯)<λ(v¯).

Proof.

According to Definition 11 u<v implies ¬(v<FSu). Corollary 10 then yields that there exists (u¯,v¯) with (u¯,v¯)(u,v) such that λ(u¯)<λ(v¯).

3 Infimum and Supremum Walks

In this section we prove structural properties of infimum and supremum walks in a forward-stable NFA 𝒜. At the end of the section, we define left-minimal infimum and right-maximal supremum walks, the special type of infimum/supremum walks that our data structure is based on. As before, we let FS be the maximum co-lex order of 𝒜, and be a co-lex extension of 𝒜. We further denote by n the number of states of 𝒜. The following lemma states that infimum and supremum can be compared using their first 2n1 characters (see [5, 1, 9] for similar results).

Lemma 13.

Let 𝒜 be an NFA with set of states Q, where |Q|=n, and let u,vQ. If supIuinfIv, then supIu[1,2n1]infIv[1,2n1].

Proof.

Let C be the set consisting of all strings supIu’s and all strings infIu’s, and write C={γ1,γ2,,γn} where γ1<γ2<<γn. Then, n2n. Notice that for every uQ, by the maximality of supIu there exist c1Σ and u1Q such that supIu=c1supIu1, and by the minimality of infIv there exist c2Σ and u2Q such that infIu=c2infIu2. This implies that for every 2in there exist cΣ and 2in such that γi=cγi. We define the array 𝖫𝖢𝖯[2,n] such that 𝖫𝖢𝖯[i]=𝗅𝖼𝗉(γi1,γi), where 𝗅𝖼𝗉(α,β) is the length of the longest common prefix between α and β. To prove the lemma it is sufficient to show that 𝖫𝖢𝖯[i]n2 for every i with 2in. To prove this it is sufficient to show that if 𝖫𝖢𝖯[i]=d for some d1 and i with 2in, then there exists j with 2jn such that 𝖫𝖢𝖯[j]=d1, because then we obtain that 𝖫𝖢𝖯 has at least d+1 distinct entries, so d+1n1, which implies dn2. Assume that 𝖫𝖢𝖯[i]=d, with d1. Let c1,c2Σ and 2i,i′′n be such that γi=c1γi and γi1=c2γi′′. Since 1d=𝖫𝖢𝖯[i]=𝗅𝖼𝗉(γi1,γi)=𝗅𝖼𝗉(c1γi,c2γi′′), we have c1=c2 and from γi1<γi we obtain γi′′<γi, which implies i′′<i. Then, 𝖫𝖢𝖯[i]=1+𝗅𝖼𝗉(γi,γi′′)=1+mini+1ji′′𝗅𝖼𝗉(γj1,γj)=1+mini+1ji′′𝖫𝖢𝖯[j], so there exists i+1ji′′ such that d=𝖫𝖢𝖯[i]=1+𝖫𝖢𝖯[j], which implies 𝖫𝖢𝖯[j]=d1.

The following lemma treats the case in which the relation between the supremum and infimum of two states already implies their respective order with respect to FS.

Lemma 14.

Let u,v be two distinct states. Then, supIuinfIv implies u<FSv.

Proof.

We show the contrapositive. Assume that ¬(u<FSv). By Corollary 10, this implies the existence of a pair (u¯,v¯) with (u¯,v¯)(u,v) such that λ(u¯)>λ(v¯). By Definition 8, this implies that there exist two walks Pu=(ui)i=1l and Pv=(vi)i=1l with u1=u,v1=v,ul=u¯,vl=v¯ such that α=β where α:=λ(u1)λ(u2)λ(ul1) and β:=λ(v1)λ(v2)λ(vl1). Now recall that every state is reachable from the initial state s and sδ(s,#). Thus there exist infinite strings α′′Iu¯ and β′′Iv¯. Note that α′′>β′′ because λ(u¯)>λ(v¯). Since α=αα′′Iu and β=ββ′′Iv, we obtain supIuα=αα′′=βα′′>ββ′′=βinfIv, completing the proof.

With the next lemma we show a case in which we can determine if ¬(u<FSv) holds.

Lemma 15.

Let u,v be two states with u<v and supIu>infIv. Furthermore, let Pusup=(ui)i1 and Pvinf=(vi)i1. Then there exists an integer j with 1<j<2n such that vj<uj and uivi as well as λ(ui)=λ(vi) for each i[j1]. In addition, if ui<vi holds for all i[j1], then ¬(u<FSv).

Proof.

Consider α=supIu and β=infIv. Let k be the maximal integer such that α[1,k1]=β[1,k1]. Since α>β, it holds that λ(uk)>λ(vk). By Lemma 13, k<2n. Moreover, by Observation 7, λ(uk)>λ(vk) implies vk<FSuk which, by Definition 11, in turn implies vk<uk. Therefore, if we define j>1 as the smallest integer for which vj<uj, then clearly for each i[j1], uivi. Moreover, since jk, and α[1,k1]=β[1,k1], for each i[j1], it holds that λ(ui)=λ(vi). For the second part of the lemma, if ui<vi (implying uivi) for each i[j1], then (uj,vj)(u,v) by definition. Since vj<uj, by Lemma 12, there exists (u¯,v¯)(uj,vj) with λ(u¯)>λ(v¯). Thus, by Observation 9 it holds that (u¯,v¯)(u,v) which by Corollary 10 proves that ¬(u<FSv).

We summarize what we achieved so far. Given two distinct states u,v, to understand whether or not u<FSv holds we can proceed as follows. If according to a co-lex extension , the statement v<u holds, by Definition 11, we know ¬(u<FSv). If u<v, then if supIuinfIv, by Lemma 14, we know that u<FSv. Otherwise, consider a supremum walk to u, (ui)i1, and an infimum walk to v, (vi)i1. If supIu>infIv, and there exists j such that vj<uj and ui<vi for each i[j1], then by Lemma 15 ¬(u<FSv). See Figure 2 for an example. The only remaining case to address is the existence of an integer j[j1] such that uj=vj. To study this case, we introduce left-minimal (right-maximal) walks.

Left-minimal/right-maximal walks.

We proceed with the definition of left-minimal infimum walks and right-maximal supremum walks. We first define infimum and supremum graphs.

Definition 16 (Infimum and supremum graphs).

The infimum (supremum) graph G=(Q,E) of 𝒜 is the directed unlabeled graph defined as follows: (i) The node set of G is identical to the one of 𝒜. (ii) For each u,vQ, we let (u,v)E if and only if there exists an infimum (supremum) walk (ui)i1 to a state uQ and an integer j such that uj+1=u and uj=v.

We now prove a preliminary result concerning infimum/supremum graphs.

Observation 17.

Let 𝒜 be an NFA and let G be its infimum (supremum) graph. Then, every walk of infinite length in G is an infimum (supremum) walk in 𝒜.

Proof.

Let u be a state of 𝒜, and let Pu=(ui)i1 be an arbitrary walk of infinite length. We prove this result for infimum walks, the proof for the supremum walks is analogous. Suppose for the sake of a contradiction that Pu is not an infimum walk. We consider the largest integer j for which there exists an infimum walk Puinf=(u¯i)i1, such that, for each i[j], u¯i=ui. Thus, u¯j+1uj+1. By Definition 16, there exists a state v, an infimum walk Pvinf=(vi)i1, and an integer k for which vk=uj and vk+1=uj+1. Consider α=λ(u¯1)λ(u¯j), γ=λ(u¯j+1)λ(u¯j+2), β=λ(v1)λ(vk), γ=λ(vk+1)λ(vk+2), we know infIu=αγ, infIv=βγ. Moreover, consider the walks Pu=(u^i)i1, where for each i[j], u^i=u¯i, and for each ij, u^i=vi+kj, and Pv=(v^i)i1, where, for each i[k], v^i=vi, and, for each ik, v^i=u¯i+jk. We know that αγIu and βγIv. By Definition 5, αγαγ which implies γγ, and βγβγ which implies γγ. Thus γ=γ. Consequently, Pu is an infimum walk. However, u^j=uj and u^j+1=uj+1, a contradiction.

We now introduce leftmost/rightmost walks in directed (unlabeled) graphs.

Definition 18 (Leftmost/rightmost walk).

Let G=(V,E) be a directed graph, let be a total order on V and let uV. A walk Pu=(ui)i1 is a leftmost (rightmost) walk to u, if for each walk P¯u=(u¯i)i1 and integer j>1, uj=u¯j implies uj1u¯j1 (u¯j1uj1).

We are now ready to introduce left-minimal and right-maximal walks.

Definition 19 (Left-minimal and right-maximal walks).

Let u be a state of 𝒜. We say that an infimum (supremum) walk Pu=(ui)i1 is left-minimal (right-maximal) if Pu is a leftmost (rightmost) walk to u in the infimum (supremum) graph of 𝒜 according to .

The proof of existences of left-minimal and right-maximal walks in directed graphs is given in Theorem 25 in Section 6. The next corollary then follows together with Observation 17.

Corollary 20.

Let Q be the states of 𝒜. There exists p:QQ, such that, for each uQ, the sequence (pi(u))i0 is a left-minimal infimum (right-maximal supremum) walk to u.

4 Inf and Sup Conflicts

We still consider 𝒜 to be a fixed forward-stable NFA and FS and be the maximum co-lex order and a co-lex extension of 𝒜, respectively. We now define inf/sup conflicts.

Definition 21 (inf sup conflicts).

Let u be a state of 𝒜 and let Pu=(ui)i1 be an infimum (supremum) walk. For j>1, we say that uj is in inf (sup) conflict with Pu, denoted as ujPu (ujPu), if there exists P¯u=(u¯i)i=1j satisfying: (i) For each i with 1<ij, it holds that u¯iui and λ(u¯i)=λ(ui). (ii) It holds that ¬(uj<FSu¯j) (¬(u¯j<FSuj)).

At this point, we present a first result concerning inf and sup conflicts.

Lemma 22.

Let u be a state of 𝒜, and Pu=(ui)i1 be an infimum (supremum) walk. If for some j>1, ujPu (ujPu) holds, then for any integer j with 1<jj, it holds that ujPu (ujPu).

Proof.

We prove the lemma for infimum walks and inf conflicts; the proof for supremum walks and sup conflicts is analogous. Consider an integer j such that ujPu and an arbitrary integer j with 1<j<j. Moreover, let (u¯i)i=1j be the walk for which state uj is in inf conflict with Pu. Clearly, (u¯i)i=1j satisfies condition (i) of Definition 21 for ujPu. It remains to prove that ¬(uj<FSu¯j). Note that for each i with jij, we have u¯iui and λ(u¯i)=λ(ui). Hence every preceding pair of (uj,u¯j) is a preceding pair of (uj,u¯j) and it follows from Corollary 10 that ¬(uj<FSu¯j) implies ¬(uj<FSu¯j). This proves the lemma.

We introduce now two functions ϕ and γ which will be at the basis of our data structure.

Definition 23 (Functions ϕ and γ).

Let u be a state of 𝒜 and let Puinf=(ui)i1 and Pusup=(ui)i1 be an infimum walk and a supremum walk to u, respectively. We define two functions ϕ and γ as

ϕ(u,Puinf)max({i<2n:uiPuinf}{1}),γ(u,Pusup)max({i<2n:uiPusup}{1}).

Furthermore, for every integer j>1, we define two functions ϕj and γj as

ϕj(u,Puinf)maxi[j1]{ϕ(ui,Puiinf)+i1},γj(u,Pusup)maxi[j1]{γ(ui,Puisup)+i1},

where Puiinf=(ui)ii and Puisup=(ui)ii.

In other words, the functions ϕj(u,Puinf) and γj(u,Pusup) intuitively represent the largest integer k for which either ukPuiinf or ukPuisup holds for an integer i with i<j.

Lemma 24.

Let u,v be two states of 𝒜 with u<v and supIu>infIv and let Pusup=(ui)i1 and Pvinf=(vi)i1 be a right-maximal supremum walk and a left-minimal infimum walk according to the co-lex extension , respectively. Furthermore, assume there exists an integer j>1 such that uj=vj and ui<vi for each i[j1]. Then ¬(u<FSv) if and only if max{γj(u,Pusup),ϕj(v,Pvinf)}j.

Proof.

() By Corollary 10, ¬(u<FSv) implies the existence of a pair (u¯,v¯) with (u¯,v¯)(u,v) such that λ(u¯)>λ(v¯). Let (u¯i)i=1j and (v¯i)i=1j be the walks for which (u¯,v¯)(u,v) holds, respectively. Lemma 15 implies that λ(ui)=λ(vi) for each i[j] and j<2n. Furthermore, we have λ(u¯i)λ(ui) and λ(vi)λ(v¯i) for each i[j] by the properties of supremum and infimum walks. By the properties of preceding pairs (see Definition 8), for each i[j], it holds that λ(u¯i)=λ(v¯i) and thus λ(u¯i)=λ(ui) and λ(vi)=λ(v¯i). Moreover, since λ(u¯)>λ(v¯), it follows that j>j. Due to the walks (u¯i)i=jj and (u¯i)i=jj, it holds that (u¯,v¯)(u¯j,v¯j), which by Corollary 10 and λ(u¯)>λ(v¯) implies ¬(u¯j<FSv¯j). Now, define h:=max{i[j1]:u¯i=ui} and h:=max{i[j1]:v¯i=vi} and consider Puhsup=(ui)ih and Pvhinf=(vi)ih. In order to prove max{γj(u,Pusup),ϕj(v,Pvinf)}j, since j<2n, it is thus sufficient to prove that A:=ujPuhsupvjPvhinf holds. Due to the previous considerations, (u¯i)i=hj satisfies condition (i) of Definition 21 for ujPuhsup, while (v¯i)i=hj satisfies condition (i) of Definition 21 for vjPvhinf. Hence, A holds if and only if ¬(u¯j<FSuj)¬(vj<FSv¯j) or equivalently ¬(u¯j<FSujvj<FSv¯j). Since uj=vj and FS is transitive, the conclusion is implied by ¬(u¯j<FSv¯j), which we argued above.

() Consider the case where the maximum is attained by γj(u,Pusup) (the other case is analogous) and call that value h. We know hj by hypothesis. By Corollary 10, we have to prove the existence of a pair (u¯,v¯) with (u¯,v¯)(u,v) such that λ(u¯)>λ(v¯). Let k[j1] be such that γ(uk,Puksup)+k1=h. Due to Observation 17, the walk Puksup=(ui)ik is a supremum walk to uk. Since uhPuksup and jh, by Lemma 22 it follows that ujPuksup. Therefore, by Definition 21 there exists a walk (u¯i)i=kj such that u¯iui and λ(u¯i)=λ(ui) for each i with k<ij as well as ¬(u¯j<FSuj). Now define u¯i:=ui for i[k] and consider the two walks (u¯i)i=1j and (vi)i=1j. Note that u¯k=uk. From Definition 21 and Lemma 15 it follows that λ(u¯i)=λ(vi) for each i[j]. Suppose now that u¯i=vi for some i with k<i<j. We can then define a new supremum walk (u^i)i1 to u as follows. We define (i) u^i:=u¯i for i[i], (ii) u^i:=vi for i with i<i<j, and (iii) u^i:=ui for ij. Since uj=u^j and uj1<u^j1, we conclude that Pusup is not a right-maximal supremum walk to u, a contradiction. Hence, u¯ivi for all i[j] and consequently (u¯j,vj)(u,v). Note that uj=vj holds by hypothesis. Finally, since ¬(u¯j<FSuj), by Corollary 10, there exists (u¯,v¯) with (u¯,v¯)(u¯j,uj) such that λ(u¯)>λ(v¯). Observation 9 then implies (u¯,v¯)(u,v).

5 Data Structure

In this section, we present our data structure and finally prove Theorem 2, the main theorem of this article. Motivated by the above lemma and our previous observations, the data structure that enjoys the properties promised in Theorem 2 can be defined as follows.

Store.

For each state u of 𝒜, we store (i) a left-minimal infimum walk Puinf to u, (ii) a right-maximal supremum walk Pusup to u, and (iii) the integers ϕ(u,Puinf) and γ(u,Pusup). Furthermore, we store a co-lex extension of 𝒜. A possible co-lex extension of 𝒜 can be computed in O(mlogn) time using the ordered partition refinement algorithm [2, Algorithm 1], where n is the number of states of 𝒜, and m the number of transitions.

Query.

The procedure of a query on two arbitrary states u,v is illustrated in Figure 1. If u=v, then uFSv by reflexivity. If v<u, then ¬(u<FSv) by Definition 11. If u<v, then four possible cases may arise. (i) Let us consider supIu and infIv, which can be reconstructed from Pusup and Pvinf. If supIuinfIv by Lemma 14 u<FSv. (ii) Otherwise, supIu>infIv holds. We then check, if there exists j such that vj<uj and ui<vi for all i[j1]. If this is the case, then ¬(u<FSv) according to Lemma 15. Otherwise, we let j with 1<j<j be such that uj=vj and ui<vi for all i[j1]. (iii) By Lemma 24, if h:=max{γj(u,Pusup),ϕj(v,Pvinf)}<j, we know that u<FSv, (iv) while if hj, we know that ¬(u<FSv).

We refer the reader to Figure 2 for an example of this data structure. The correctness of our data structure follows immediately from the description above. In order to establish Theorem 2, it remains to argue why the space and query time bounds hold.

Proof of Theorem 2.

By Corollary 20, we know that this information can be stored in O(n) space. By Lemma 13, we can check supIuinfIv in O(n) time, and if this is the case by Lemma 14 u<FSv. Otherwise, by Lemma 15, we can identify in O(n) time the integer j<2n such that vj<uj and uivi for all i[j1]. If ui<vj for all i[j1], then ¬(u<FSv) by Lemma 15. On the other hand, if there exists j with (i) j<j, (ii) uj=vj, and (iii) ui<vi for all i[j1], by Definition 23, since j<2n, we can compute γj(u,Pusup) and ϕj(v,Pvinf) in O(n) time. Hence the total time complexity is O(n).

(a)
u infIu supIu ϕ γ
1 #ω #ω 1 1
2 a#ω abω 1 1
3 aa#ω aabω 1 1
4 aab#ω aabω 1 1
5 ab#ω ab#ω 1 1
6 ab#ω abω 1 1
7 aab#ω abb#ω 1 1
8 b#ω b#ω 1 1
9 b#ω bω 1 1
10 baa#ω baabω 1 25
11 baab#ω babb#ω 2 1
12 bb#ω bb#ω 1 1
13 bb#ω bω 1 1
(b)
Figure 2: Consider the forward-stable NFA 𝒜 in Figure (a). Each state is assigned an integer i indicating its position in the co-lex extension . We denote by ui the i-th state according to . Figures (b) and (c) show the NFAs encoding a left-minimal infimum walk and a right-maximal supremum walk, respectively, for each state. The table on the right shows for each state u the values of infIu, supIu, ϕ(u,Puinf), and γ(u,Pusup), where Puinf and Pusup are the walks shown in Figures (b) and (c). Our data structure comprises , the walks in Figures (b) and (c), and the two columns ϕ, γ from the table. We sketch the four cases that arise when determining whether u<FSv holds, assuming u<v. (i) By Lemma 14, since supIu3infIu5, it follows that u3<FSu5. (ii) Consider Pu2sup=u2,u13 and Pu6inf=u6,u9, since supIu2>infIu6, and u2<u6,u13>u9, by Lemma 15, ¬(u2<FSu6). (iii) Consider now Pu4sup=u4,u6 and Pu7inf=u7,u6. Since, supIu4>infIu7, u4<u7,u6=u6, and max{γ2(u4,Pu4sup),ϕ2(u7,Pu7inf)}=1<2, by Lemma 24, we can conclude u4<FSu7. (iv) Finally, consider Pu10sup=u10,u4,u6 and P11inf=u11,u7,u6, due to the fact that supIu10>infIu11, u10<u11, u4<u7, u6=u6, and max{γ3(u10,Pu10sup),ϕ3(u11,Pu11inf)}=263, by Lemma 24, we conclude that ¬(u10<FSu11).

6 Existence of a Leftmost Walk

In this section, we consider an (unlabeled) directed graph G=(V,E), and a total order on V. We always assume that every node of G has at least one incoming edge. Moreover, a walk to a node uV, denoted by Pu=(ui)i=1l, is a sequence of l nodes such that: (i) u1=u, and (ii) (ui+1,ui)E for each i[l1]. We denote by Pu=(ui)i1 a walk to u of infinite length. We fix some further graph-related notation. The induced subgraph of G on VV is the graph G[V]:=(V,E(V×V)). The subgraph of G reachable from a subset SV, denoted by δG(S), is the induced subgraph G[V] on the nodes V={vV:Pv=(vi)i=1 with vS}. For two sets of nodes S,TV, we call N(S):={vV:uS and (u,v)E} the neighbors of S, E(S):=E(S×V) the edges from S to N(S), and E(S,T):=E(S×T) the edges from S to T. For two directed graphs G1=(V1,E1) and G2=(V2,E2) their union is defined as G1G2:=(V1V2,E1E2).

The rest of the section is devoted to proving the following theorem.

Theorem 25.

Let G=(V,E) be a directed graph, and let be a total order on V. Then, there exists a function p:VV such that Pu=(pi(u))i0 is a leftmost (rightmost) walk to u for every uV.

Hereafter, we only consider the leftmost case since the rightmost is analogous. We prove Theorem 25 constructively, i.e., we give an algorithm, termed Forward Visit, that computes the function p on input a directed graph G=(V,E) and a total order on V. At a very high level, the algorithm consists of DFS and BFS visits that alternate with each other. Here the DFS visit has the purpose of computing a cycle C. Once this cycle C is computed, we start two BFS-like searches starting from C on two different subgraphs GL and GR of G. These two BFS-like searches each compute walks starting from nodes in C that when concatenated with C form the leftmost infinite walks for all nodes on those walks. These walks will then be represented by the function p. More precisely, the BFS-like searches work as follows: Let V be nodes for which we have not computed a value for p yet. The graph GL contains the nodes that can be reached in G[V] from some node uC through a neighbor node vC that is on the left of u’s cycle successor with respect to the total order . The subgraph GR is defined symmetrically as reachable through right neighbors. The BFS-like searches are then a multi-source shortest path search in GR and a multi-source longest path search in GL. The intuition why we compute shortest paths for GR and longest paths for GL is as follows. Let zC be a node in GL that is reachable by a length-d path from some cycle node uC through a left neighbor vC of u. Now, consider some other length-d path from u to z that follows the cycle longer, i.e., goes through the successor of u in the cycle. Such a walk is not left-most by definition and in fact our algorithm will never output such a walk as it is constructed using a strictly shorter path from C to z than the one going through v. By a symmetric argument nodes in GR should be connected to C via shortest path.

Figure 3: (a) A directed graph G=(V,E) and a total order over V represented by the integer names of nodes. (b) In green the first cycle C that is found by the DFS of Algorithm 1, if we start a DFS from node 2; in red and blue the subgraphs GL and GR corresponding to C, respectively. Here, L={3,4} and R={11,13}. (c) Leftmost walks represented by p (indicated by the shown edges).

The Forward Visit Algorithm.

We proceed with a detailed description of the algorithm, a pseudocode implementation can be found in Algorithm 1. For each uV, the algorithm maintains three values, p(u) (initially null), u.color (initially white), and u.next (initially null). The algorithm iterates over the nodes in V in an arbitrary order. For each uV with u.color = white, the algorithm starts a DFS visit from u. This DFS first changes the color of u to gray, and eventually to black once the DFS from u terminated. The DFS visit from u is a classical DFS up to two caveats. (1) For a node uV, its neighbors N(u) are processed in increasing order with respect to . (2) The color v.color of a neighbor vN(u) determines our action when processing v as follows: (a) if v.color= black, the node v is ignored, (b) if v.color= white, we process the node normally, i.e., we launch a new DFS from v and set u.next=v, and (c) if v.color= gray, we still set u.next=v but we temporarily interrupt the DFS from u. Case (c) implies that the DFS from u found a cycle C in G that can be reconstructed from the next-values. We now call the function ComputeWalks with input C.

Algorithm 1 Forward Visit.

Let us denote with V{vV:p(v)=null} all the nodes for which we have not yet constructed a leftmost walk. The function ComputeWalks computes two (possibly overlapping) subgraphs GR and GL of G[V] and constructs walks from C to the nodes in these two graphs. The graphs GR and GL are defined as follows. First, we compute two (possibly overlapping) subsets R and L of N:=N(C)V as follows. We let R{vV:(u,v)E(C) and v>u.next} and L{vV:(u,v)E(C) and v<u.next}, as well as EC,RE(C,R) and EC,LE(C,L). In other words, the set R (the set L) consists of those neighbors in V of nodes uC that are larger (smaller) than u.next with respect to . We then define GR=(VR,ER):=(C,EC,R)δG[V](R) and GL=(VL,EL):=(C,EC,L)δG[V](L). In order to compute the function p(u) for nodes uVRVLC, we now launch two BFS-like visits in a specific (and essential) order. We first launch a multi-source shortest path search in GR and then a multi-source longest path search in GL, in both cases from C. In both of these subroutines we set p(v)=u whenever v is discovered to be the next node on a shortest (respectively longest) path starting from the cycle. In both of these BFS-like visits, we process a node’s neighbors in increasing order with respect to (recall that we are searching for leftmost walks), we give more details in the paragraph below. It is essential that we first run the shortest path search on GR and then the longest path search on GL as the two graphs possibly overlap and we hence reset the p-values for nodes that are in both graphs, prioritizing paths in GL. We then compute p(u) also for the nodes uC by setting p(u)=v where vC is such that v.next=u. Lastly, we set u.color= black for each uVLVR. Then the function ComputeWalks terminates and the DFS is resumed. We refer the reader to Figure 3 for an example run of the algorithm.

Details on the BFS searches.

The multi-source shortest path search is implemented as a classical BFS on GR with a queue initially containing the nodes C. When a node u is dequeued, nodes vN(u)VR are processed in increasing order with respect to . If v is newly discovered by the BFS, we set p(v)=u and enqueue v.

The multi-source longest path search is implemented as follows. It is essential that GL is acyclic, see Lemma 27 (i). Hence, we can simply compute the maximum distance d(C,u) from C to every node u in GL by traversing GL in a topological order. We then perform a BFS-like search on GL with a queue initially containing the nodes C. When a node u is dequeued, nodes vN(u)VL are processed in increasing order with respect to . If v is newly discovered by the BFS and d(C,v)=d(C,u)+1, we set p(v)=u and enqueue v.

Analysis.

We start with the following simple observation.

Observation 26.

(i) At termination of ComputeWalks(C) it holds that p(u)null for every uV that can be reached from C. (ii) Before and after each run of ComputeWalks, if (ui)i=1l is a walk in G and p(uj)null for some j[l], then p(u1)null.

Proof.

(i) Let V={uV:p(u)=null} as in the algorithm. ComputeWalks(C) inserts every node uV that can be reached from C to the subgraphs GL or GR. Then it computes function p for all and only nodes in GLGR. (ii) When the DFS outputs a cycle C, all nodes u that are reachable from C and that satisfy p(u)= null are added to one of the subgraphs GL or GR. This holds as RL=N(C)V, where V were all nodes v with p(v)= null. Hence ComputeWalks always calculates p(u) for all and only those nodes in GL and GR.

We proceed with the lemma about the acyclicity of GL.

Lemma 27.

Let GL=(VL,EL) and C be the subgraph of G=(V,E) and the cycle computed during an arbitrary execution of ComputeWalks in Forward Visit. Then, (iGL is acyclic and (ii) it holds that vC for each v such that there exists (u,v)EL.

Proof.

Let LVL be as in the algorithm. Suppose for the purpose of contradiction that (i) or (ii) does not hold. In both cases there exists zVL such that zC for some cycle C in G different from C. Let (u,v)E(C,L) be the edge such that v can reach z in GL. As zVL, we know that p(z)=null and, by Observation 26 (ii), it follows that p(z¯)=null for each z¯C, and, for an analogous argument, also the nodes v¯ that can reach C must satisfy p(v¯)null. Thus, at this point of the algorithm execution, the function ComputeWalks cannot have colored those nodes black that can reach C, as it colors exactly the nodes for which it computes the function p (i.e., the nodes in GLGR). The only remaining part in which Forward Visit may color a node v¯ black that can reach C is line 18, i.e., when DFS(v¯) has terminated. Notice however that v¯ can trivially reach C and thus DFS(v¯) has to find a cycle C′′ that can reach the cycle C in G before terminating (possibly C′′=C). Thus, by Observation 26 (i), when DFS(u) is launched, DFS(v¯) has not terminated yet, otherwise p(z)null would hold. When DFS(u) was started, the nodes in N(u) were processed in increasing order with respect to . In addition, as v<u.next by definition of L, it follows that v was visited by DFS(u) before u.next. Finally, since v can reach z, it follows that DFS(z) has to terminate before DFS(v). As for each node v¯ reaching C the function DFS(v¯) has not terminated yet, it holds that DFS(v) has to output a cycle C′′ (possibly C′′=C) different from C before terminating, contradicting the assumption that C is the output cycle.

We next observe that Forward Visit indeed computes a complete function p on V.

Observation 28.

Upon termination Forward Visit has computed p(u) for every node uV.

Proof.

Let uV. By assumption, every node has an incoming edge and thus there exists a cycle Cu in G containing u. If before or after any run of ComputeWalks there exists vCu with p(v)null, then Observation 26 (ii) shows that also p(u)null. Otherwise, before and after every run of ComputeWalks it holds that p(v)=null for all vCu. In this case, ComputeWalks has never colored the nodes in Cu black, as by line 9, it colors black all and only those nodes for which it has computed the function p. By line 20, the algorithm starts a DFS visit from every node in V (either in line 13 or in line 20). Therefore, let v be the first node of Cu for which a DFS visit is started. As for each zCu it holds that p(v)=null, before DFS(v) starts z.color= white for each zCu. However, since vCu the DFS visit cannot terminate without finding a cycle C with vC. Thus, since v can reach u, by property (i) of Observation 26, we can conclude that the next run of ComputeWalks calculates p(u).

We are now ready to prove Theorem 25.

Proof of Theorem 25.

By Observation 28, it holds that p(v) null for all vV upon termination of the algorithm. It remains to prove that Pu=(p(u)i)i0 is a leftmost walk to u according to for any node uV. Let ui+1p(u)i for i0 (i.e., u0=u) and suppose for the purpose of contradiction that Pu is not a leftmost walk to u. Then by Definition 18 there exists a walk Pu=(u¯i)i0 to u such that u¯j=uj for some j>1 and u¯j1<uj1. Let j be minimal with that property. Furthermore, let k with 0k<j1 be maximal such that u¯k=uk (such k exists as u¯0=u0=u). Consider now the execution of ComputeWalks that has computed p(u) and let the cycle C and the subgraphs GL=(VL,EL) and GR=(VR,ER) be the instances of those objects in that execution of the function. By Observation 26 (ii), at the beginning of this execution p(v)= null for each v in Pu and Pu. Consider the walk P^uk=(u^i)ik to uk, where u^i=u¯i for i with kij and u^i=ui for ij. We distinguish three cases: (1)  ukC, (2) ukVLC, and (3) ukVR(CVL).

(1) Let ukC. Then, uiC for all ik from how the algorithm sets p in line 8. Hence, u^iC for all ij. Now, consider the largest m with km<j such that u^m=u^m for some m<m. By assumption, this integer m exists since k<j and u^kC. Thus, if m is the smallest integer satisfying this property, the sequence C=u^m,u^m+1,,u^m is a cycle different from C in G. Since u^j1<uj1, the DFS has to visit u^j1 before uj1 and as u^j1 can reach u^m and consequently the cycle C, it contradicts that the DFS output C.

(2) Now, let ukVLC and consider the smallest h with h>k such that uhC and observe that then also uiC for ih. Observe that Puk=(ui)i=kh is a longest path from C to uk in GL. We distinguish two sub-cases. (a) Suppose that h<j. Then uj1C and u^j1L as u^j1<uj1. This implies (u^i+1,u^i)EL for each i with ki<j by the definition of EL. By Lemma 27 (i), the graph GL is acyclic and thus (u^i)i=kj is a path from C to u^k in GL that is longer than Puk, a contradiction. (b) Now assume that hj. This implies u^j1VLC and trivially also u^iVL for i with ki<j. Lemma 27 (ii) implies u^iVLC and hence h is also minimal such that u^hC. Due to the previous considerations d(C,u^i)=d(C,ui) for each i with ki<j, where d is as in the algorithm. As the multi-source longest path search processes the neighbors of uj in increasing order with respect to and since u^j1<uj1, this contradicts p(uk)=uk+1.

(3) Now assume ukVR(CVL). Note that u^kVLC for all k>k as the opposite would imply ukVL. As in (2) let h be minimal with h>k such that uhC and observe that then also uiC for each ih. Observe that Puk=(ui)i=kh is a shortest path from C to uk in GR. We again distinguish two sub-cases. (a) Suppose that h<j. Then uj1C and u^j1L as u^j1<uj1. This implies u^j1VLC as p(uj1)=uj, a contradiction to our observation above. (b) Now assume that hj. This implies u^j1VRC and u^iVR for each i with ki<j. Furthermore, u^iC for each i with ki<j as (ui)i=kh is a shortest path from C to uk. It follows that (u^i)i=kh is another shortest path from C to uk in GR. As the multi-source shortest path search processes the neighbors of uj in increasing order with respect to and since u^j1<uj1, this contradicts p(uk)=uk+1.

References

  • [1] Jarno N. Alanko, Davide Cenzato, Nicola Cotumaccio, Sung-Hwan Kim, Giovanni Manzini, and Nicola Prezza. Computing the LCP array of a labeled graph. In Shunsuke Inenaga and Simon J. Puglisi, editors, 35th Annual Symposium on Combinatorial Pattern Matching, CPM 2024, June 25-27, 2024, Fukuoka, Japan, volume 296 of LIPIcs, pages 1:1–1:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPICS.CPM.2024.1.
  • [2] Ruben Becker, Manuel Cáceres, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Francisco Olivares, and Nicola Prezza. Sorting finite automata via partition refinement. In Inge Li Gørtz, Martin Farach-Colton, Simon J. Puglisi, and Grzegorz Herman, editors, 31st Annual European Symposium on Algorithms, ESA 2023, September 4-6, 2023, Amsterdam, The Netherlands, volume 274 of LIPIcs, pages 15:1–15:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.ESA.2023.15.
  • [3] Ruben Becker, Sung-Hwan Kim, Nicola Prezza, and Carlo Tosoni. Indexing finite-state automata using forward-stable partitions. In Zsuzsanna Lipták, Edleno Silva de Moura, Karina Figueroa, and Ricardo Baeza-Yates, editors, String Processing and Information Retrieval - 31st International Symposium, SPIRE 2024, Puerto Vallarta, Mexico, September 23-25, 2024, Proceedings, volume 14899 of Lecture Notes in Computer Science, pages 26–40. Springer, 2024. doi:10.1007/978-3-031-72200-4_3.
  • [4] Michael Burrows and David Wheeler. A block-sorting lossless data compression algorithm. SRS Research Report, 124, 1994.
  • [5] Alessio Conte, Nicola Cotumaccio, Travis Gagie, Giovanni Manzini, Nicola Prezza, and Marinella Sciortino. Computing matching statistics on wheeler dfas. In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A. Storer, editors, Data Compression Conference, DCC 2023, Snowbird, UT, USA, March 21-24, 2023, pages 150–159. IEEE, 2023. doi:10.1109/DCC55655.2023.00023.
  • [6] Nicola Cotumaccio. Graphs can be succinctly indexed for pattern matching in O(|E|2+|V|5/2) time. In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A. Storer, editors, Data Compression Conference, DCC 2022, Snowbird, UT, USA, March 22-25, 2022, pages 272–281. IEEE, 2022. doi:10.1109/DCC52660.2022.00035.
  • [7] Nicola Cotumaccio. Prefix sorting dfas: A recursive algorithm. In Satoru Iwata and Naonori Kakimura, editors, 34th International Symposium on Algorithms and Computation, ISAAC 2023, December 3-6, 2023, Kyoto, Japan, volume 283 of LIPIcs, pages 22:1–22:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.ISAAC.2023.22.
  • [8] Nicola Cotumaccio, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza. Co-lexicographically ordering automata and regular languages - part i. J. ACM, 70(4), August 2023. doi:10.1145/3607471.
  • [9] Nicola Cotumaccio, Travis Gagie, Dominik Köppl, and Nicola Prezza. Space-time trade-offs for the LCP array of wheeler dfas. In Franco Maria Nardini, Nadia Pisanti, and Rossano Venturini, editors, String Processing and Information Retrieval - 30th International Symposium, SPIRE 2023, Pisa, Italy, September 26-28, 2023, Proceedings, volume 14240 of Lecture Notes in Computer Science, pages 143–156. Springer, 2023. doi:10.1007/978-3-031-43980-3_12.
  • [10] Nicola Cotumaccio and Nicola Prezza. On indexing and compressing finite automata. In Dániel Marx, editor, Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, Virtual Conference, January 10 - 13, 2021, pages 2585–2599. SIAM, 2021. doi:10.1137/1.9781611976465.153.
  • [11] Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for bwt-based data structures. Theor. Comput. Sci., 698:67–78, 2017. doi:10.1016/J.TCS.2017.06.016.
  • [12] Daniel Gibney and Sharma V. Thankachan. On the hardness and inapproximability of recognizing wheeler graphs. In Michael A. Bender, Ola Svensson, and Grzegorz Herman, editors, 27th Annual European Symposium on Algorithms, ESA 2019, September 9-11, 2019, Munich/Garching, Germany, volume 144 of LIPIcs, pages 51:1–51:16. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPICS.ESA.2019.51.
  • [13] Sung-Hwan Kim, Francisco Olivares, and Nicola Prezza. Faster prefix-sorting algorithms for deterministic finite automata. In Laurent Bulteau and Zsuzsanna Lipták, editors, 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 16:1–16:16. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.CPM.2023.16.
  • [14] Robert Paige and Robert Endre Tarjan. Three partition refinement algorithms. SIAM J. Comput., 16(6):973–989, 1987. doi:10.1137/0216062.
  • [15] Shang-Hua Teng. Scalable algorithms for data and network analysis. Found. Trends Theor. Comput. Sci., 12(1-2):1–274, 2016. doi:10.1561/0400000051.
  • [16] The Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics, 19(1):118–135, October 2016. doi:10.1093/bib/bbw089.