Deciding probabilistic bisimilarity distance one for probabilistic automata

Abstract Probabilistic bisimilarity, due to Segala and Lynch, is an equivalence relation that captures which states of a probabilistic automaton behave exactly the same. Deng et al. proposed a robust quantitative generalization of probabilistic bisimilarity. Their probabilistic bisimilarity distances of states of a probabilistic automaton capture the similarity of their behaviour. The smaller the distance, the more alike the states behave. In particular, states are probabilistic bisimilar if and only if their distance is zero. Although the complexity of computing probabilistic bisimilarity distances for probabilistic automata has already been studied, we are not aware of any practical algorithms to compute those distances. In this paper, we provide several key results towards algorithms to compute probabilistic bisimilarity distances for probabilistic automata. In particular, we present a polynomial time algorithm that decides distance one. Furthermore, we give an alternative characterization of the probabilistic bisimilarity distances as a basis for a policy iteration algorithm.


Introduction
Behavioural equivalences, such as bisimilarity due to Milner [2] and Park [3], are one of the cornerstones of concurrency theory. Recall that a behavioural equivalence ∼ ⊆ S × S, where S is the set of states of the model, is reflexive, symmetric, and transitive: s ∼ s, if s ∼ t then t ∼ s, and if s ∼ t and t ∼ u then s ∼ u, for all s, t, u ∈ S. The fact that states s and t behave the same is captured by s ∼ t.
As first observed by Giacalone, Jou and Smolka [4], behavioural equivalences are not robust for models that contain quantitative information such as probabilities. This lack of robustness is caused by the discrepancy between the discrete nature of behavioural equivalence and the continuous nature of the quantitative information on which the behavioural equivalence relies. In particular, even small changes to the quantitative information may cause behaviourally equivalent states to become inequivalent or vice versa. To obtain robustness, a behavioural equivalence can be generalized to a behavioural pseudometric, that is, a function d : S × S → [0, 1] satisfying d(s, s) = 0, d(s, t) = d(t, s), and d(s, u) ≤ d(s, t) + d(t, u) for all s, t, u ∈ S. The distance d(s, t) measures the similarity of the behaviour of states s and t. The smaller this distance, the more alike the states behave. Distance zero captures that states are behaviourally equivalent.
In this paper, we focus on probabilistic automata. This model was first studied in the context of concurrency by Segala in [5]. It captures both nondeterminism (and, hence, concurrency) and probabilities. Consider the probabilistic automaton depicted in Figure 1. The states of a probabilistic automaton are labelled. These labels provide a partition of the states so that states satisfying the same basic properties of interest are in the same partition. In Figure 1, the labels are represented by colours. Each state has one or more probabilistic transitions. For example, the state t has a single probabilistic transition that takes state t to itself with probability one. State f has two probabilistic transitions. One takes state f to state h with probability one. The other represents a fair coin toss, that is, it transitions to state h with probability 1/2 and to state t with probability 1/2. Also state b has two transitions, one of which represents a biased coin toss. Segala and Lynch [6] introduced probabilistic bisimilarity. This behavioural equivalence for probabilistic automata generalizes the one introduced by Larsen and Skou [7]. The latter is applicable to models without nondeterminism. Deng, Chothia, Palamidessi and Pang [8] introduced a behavioural pseudometric for probabilistic automata that generalizes probabilistic bisimilarity.
Their pseudometric also generalizes the one introduced for labelled Markov chains by Desharnais, Gupta, Jagadeesan and Panangaden in [9]. The Hausdorff metric [10] and the Kantorovich metric [11] are key ingredients of the definition of the pseudometric of Deng et al. The former is used to capture nondeterminism. This idea dates back to the work of De Bakker and Zucker [12]. The latter was first used by Van Breugel and Worrell [13] to capture probabilistic behaviour. On the one hand, the behaviours of the states h and t of the above example are very different since their labels are different.
As a result, their probabilistic bisimilarity distance is one. On the other hand, the behaviours of the states f and b are very similar, which is reflected by the fact that these states have probabilistic bisimilarity distance 1/100. Tracol, Desharnais and Zhioua [14] also introduced a behavioural pseudometric for probabilistic automata. Their probabilistic bisimilarity distances generalize probabilistic bisimilarity as well, but are different from the ones introduced by Deng et al. An example showing the difference can be found in [14, Example 5]. To compute their probabilistic bisimilarity distances, they developed an iterative algorithm. In each iteration, a maximum flow problem needs to be solved. The resulting algorithm runs in polynomial time.
Before discussing other algorithms to approximate or compute probabilistic bisimilarity distances à la Deng et al. for probabilistic automata, let us first review some of the algorithms for the special case of labelled Markov chains. Van Breugel, Sharma and Worrell [15] presented an algorithm that approximates the probabilistic bisimilarity distances of a labelled Markov chain. Since the statement that the distance of states s and t is less than some rational q can be expressed in the existential fragment of the first order theory over the reals, and this theory is decidable as shown by Tarski [16], one can use binary search to approximate the distance of s and t. The satisfiability problem for the existential fragment of the first order theory over the reals has been shown to be in PSPACE by Canny [17] (see also [18]).

Behavioural pseudometrics are usually defined as a least fixed point. This approach was first employed by Desharnais et al. [20] and is discussed in detail in Sections 2 and 3. In [21], Chen, Van Breugel and Worrell provided an alternative characterization of the probabilistic bisimilarity distances for labelled Markov chains. This characterization provides a bridge to reinforcement learning (see, for example, [22] for an introduction to reinforcement learning) that we will discuss in more detail below. Bacci, Bacci, Larsen and Mardare [23] used this characterization as the basis for an algorithm to compute the probabilistic bisimilarity distances. This algorithm is much more efficient than the algorithm of Van Breugel et al. [15], for example on the labelled Markov chain representing an instance of randomized quicksort.¹

¹ The labelled Markov chain is obtained from an implementation of randomized quicksort in Java by means of the model checker Java PathFinder [19] and its extension jpf-probabilistic. The latter can be found at https://bitbucket.org/discoveri/jpf-probabilistic.
An implementation of the algorithm of Van Breugel, Sharma, and Worrell can be found at https://bitbucket.org/discoveri/first-order. As we have shown in [24], in order to compute the probabilistic bisimilarity distances correctly, one has to decide distance zero before running the algorithm of Bacci et al. [23]. As shown by Desharnais et al. [9], states of a labelled Markov chain are probabilistic bisimilar if and only if they have distance zero. Since probabilistic bisimilarity can be decided in polynomial time, as shown by Baier [25], distance zero can be decided in polynomial time as well.
In [26] we presented a polynomial time algorithm that decides distance one for labelled Markov chains. As these results show, our addition of deciding distance one moved the computation of probabilistic bisimilarity distances from a theoretical curiosity to a potential ingredient for practical verification tools.²

² Implementations of these algorithms can be found at https://bitbucket.org/discoveri/probabilistic-bisimilarity-distances.
As we already mentioned above, the alternative characterization of the probabilistic bisimilarity distances of labelled Markov chains due to Chen et al. [21] provides a bridge to reinforcement learning. Each labelled Markov chain can be mapped to a Markov decision process such that the states of the Markov decision process are pairs of states of the original labelled Markov chain. Furthermore, this mapping is such that the value of the state (s, t) of the constructed Markov decision process is the probabilistic bisimilarity distance of the states s and t of the labelled Markov chain. The details can be found in [27, Section 5.3]. The values of a Markov decision process can be computed by policy iteration, an algorithm due to Howard [28]. In [24] we have shown that the algorithm of Bacci et al. can be seen as policy iteration (see also [27, Chapter 6] for more details). The correspondence with reinforcement learning is summarized in Table 1.

The complexity of computing the probabilistic bisimilarity distances for probabilistic automata à la Deng et al. was first studied by Fu [29]. He showed that these probabilistic bisimilarity distances are rational. Furthermore, he proved that the problem of deciding whether the distance of two states is smaller than a given rational is in NP ∩ coNP. The proof has been adapted to show that the decision problem is in UP ∩ coUP [30]. Recall that UP contains those problems in NP with a unique accepting computation. Note that the complexity of computing the probabilistic bisimilarity distances for labelled Markov chains has been shown to be in P by Chen et al. [21].
Van Breugel and Worrell [31] have shown that the problem of computing the probabilistic bisimilarity distances of probabilistic automata is in PPAD.
This complexity class, which is short for polynomial parity argument in a directed graph, was introduced by Papadimitriou in [32]. It lies between the search problem versions of P and NP. The class captures the basic principles of path-following algorithms like those of Lemke and Howson [33] and Scarf [34]. Finding Nash equilibria of two player games is PPAD-complete, as shown by Chen and Deng in [35]. Kintali et al. [36] present several other PPAD-complete problems. Etessami and Yannakakis [37] have shown that computing the value of a simple stochastic game is in PPAD, which relates PPAD to the complexity classes mentioned above.

The algorithm to approximate the probabilistic bisimilarity distances for probabilistic automata by Chen, Han and Lu [38] generalizes the algorithm of Van Breugel et al. [15] and uses the first order theory over the reals. As we already mentioned above, such algorithms are not practical. We have not implemented the algorithms underlying the above mentioned complexity results from [29,31]. However, we anticipate that these are not practical either.
As shown by Deng et al. [8], states of a probabilistic automaton are probabilistic bisimilar if and only if they have distance zero. Since probabilistic bisimilarity can be decided in polynomial time, as shown by Baier [25], distance zero for probabilistic automata can be decided in polynomial time as well. As we have already discussed above, being able to decide distance one in polynomial time has a significant impact on computing probabilistic bisimilarity distances for labelled Markov chains. In Section 5 we present a polynomial time algorithm that decides distance one for probabilistic automata.
This is the main contribution of this paper. We anticipate that this decision procedure will also impact the computation of probabilistic bisimilarity distances for probabilistic automata. We have implemented our algorithm in Java.³ In Section 7 we discuss some initial experimental results.
To prove our decision procedure correct, we use an alternative characterization of the probabilistic bisimilarity distances for probabilistic automata.
This characterization generalizes the alternative characterization of the probabilistic bisimilarity distances for labelled Markov chains that we discussed above.³

³ The source code is available at https://github.com/qiyitang71/distance-one-probabilistic-automata.

In [31], Van Breugel and Worrell presented a mapping from probabilistic automata to simple stochastic games. Those games were introduced by Condon [39]. The vertices of the resulting simple stochastic game are pairs of states of the original probabilistic automaton. Their mapping is such that the value of a vertex (s, t) is the probabilistic bisimilarity distance of the states s and t. Our alternative characterization in Section 4 is based on a very similar mapping. As we will see, the size of the resulting game may be exponential in the size of the probabilistic automaton. Despite this potential exponential blow up, this correspondence between behavioural pseudometrics and game theory seems a promising avenue for further research.

Orders and Distances
In this section, we provide some definitions and results from the literature about orders and distances that we will use in the remainder of this paper.
For more details we refer the reader to, for example, [40] and [41].

Ordered sets
Given a set S, the set of distance functions on S, that is, functions from S × S to [0, 1], is denoted by [0, 1]^{S×S}. Let ⟨X, ≤⟩ be an ordered set and let f : X → X. Following [40, Definition 8.14], we recall three notions: f is monotone if x ≤ y implies f(x) ≤ f(y); x is a least fixed point of f if f(x) = x and x ≤ y for every fixed point y of f; and x is a greatest fixed point of f if f(x) = x and y ≤ x for every fixed point y of f.
The following result is known as the Knaster-Tarski fixed point theorem [42,43].
Theorem 2. Let X be a complete lattice and let f : X → X be a monotone function.
(a) f has a greatest fixed point.
(b) The greatest fixed point of f is the greatest post-fixed point of f, that is, if x ≤ f(x) then x is below the greatest fixed point.
(c) f has a least fixed point.
(d) The least fixed point of f is the least pre-fixed point of f, that is, if f(x) ≤ x then the least fixed point is below x.

We denote the greatest and least fixed point of a function f by νf and µf, respectively. Given a set X, we denote the set of subsets of X by 2^X.
The correctness of our iterative algorithm to decide distance one relies on the following theorem.
Theorem 4. Let X be a finite set and let Φ : 2^X → 2^X be a monotone function.
(a) µΦ = Φ^n(∅) for some n ∈ N, that is, the least fixed point can be obtained by iterating Φ starting from ∅.
(b) νΦ = Φ^n(X) for some n ∈ N, that is, the greatest fixed point can be obtained by iterating Φ starting from X.
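For a finite set X, Theorem 4 yields a direct way to compute both fixed points: iterate Φ from ∅ for the least fixed point and from X for the greatest. A minimal Python sketch (the reachability operator used as an example is illustrative, not from the paper):

```python
def least_fixed_point(phi):
    """Iterate a monotone operator on subsets upward from the empty set."""
    current = frozenset()
    while True:
        nxt = phi(current)
        if nxt == current:
            return current
        current = nxt

def greatest_fixed_point(phi, top):
    """Iterate a monotone operator on subsets downward from the full set."""
    current = frozenset(top)
    while True:
        nxt = phi(current)
        if nxt == current:
            return current
        current = nxt

# Example: the states reachable from state 1 form a least fixed point.
edges = {1: {2}, 2: {3}, 3: {3}, 4: {1}}
phi = lambda s: frozenset({1}) | frozenset(v for u in s for v in edges[u])
print(sorted(least_fixed_point(phi)))  # [1, 2, 3]
```

Since Φ is monotone and the lattice is finite, both iterations stabilize after at most |X| steps.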

Metric spaces
The set [0, 1]^{S×S} of distance functions on S also carries the following natural metric: the distance of d, e ∈ [0, 1]^{S×S} is sup_{s,t∈S} |d(s, t) − e(s, t)|. Let ⟨X, d⟩ be a metric space and c ∈ (0, 1]. A function f : X → X is c-Lipschitz if d(f(x), f(y)) ≤ c · d(x, y) for all x, y ∈ X, and f is contractive if it is c-Lipschitz for some c ∈ (0, 1). The following result is known as Banach's fixed point theorem [45]: a contractive function on a nonempty complete metric space has a unique fixed point.

The following construction, due to Hausdorff [10], lifts a distance function d on a set X to a distance function on the set 2^X: the distance of nonempty finite sets A, B ⊆ X is max(max_{a∈A} min_{b∈B} d(a, b), max_{b∈B} min_{a∈A} d(a, b)).

Given a nonempty finite set X, we denote the set of probability distributions on X by Distr(X). For µ ∈ Distr(X), we define its support by support(µ) = { x ∈ X | µ(x) > 0 }. A construction due to Kantorovich [11] lifts a distance function on X to a distance function on Distr(X). To define this lifting, we need the notion of a coupling due to Doeblin [47]: a coupling ω of µ, ν ∈ Distr(X) is a probability distribution on X × X whose marginals are µ and ν, and Ω(µ, ν) denotes the set of couplings of µ and ν. The Kantorovich lifting assigns to µ and ν the minimum, over ω ∈ Ω(µ, ν), of Σ_{x,y∈X} d(x, y) · ω(x, y).
In general, the set Ω(µ, ν) is infinite. The set of vertices of the convex polytope Ω(µ, ν) is denoted by V (Ω(µ, ν)). The latter set is finite (see, for example, [48, page 259]). In the lifting, we can restrict to the vertices. This fact will be crucial in the proof of Theorem 12.
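To make the two liftings concrete, the following Python sketch (illustrative, not from the paper) implements the Hausdorff lifting for finite nonempty sets and a check that a distribution on pairs is a coupling, that is, has the prescribed marginals:

```python
def hausdorff(d, A, B):
    """Hausdorff lifting of a distance function d to finite nonempty sets."""
    forward = max(min(d(a, b) for b in B) for a in A)
    backward = max(min(d(a, b) for a in A) for b in B)
    return max(forward, backward)

def is_coupling(omega, mu, nu, tol=1e-9):
    """Does omega, a distribution on pairs, have marginals mu and nu?"""
    xs = set(mu) | {x for (x, _) in omega}
    ys = set(nu) | {y for (_, y) in omega}
    row_ok = all(abs(sum(omega.get((x, y), 0.0) for y in ys) - mu.get(x, 0.0)) < tol for x in xs)
    col_ok = all(abs(sum(omega.get((x, y), 0.0) for x in xs) - nu.get(y, 0.0)) < tol for y in ys)
    return row_ok and col_ok

d = lambda x, y: abs(x - y)
print(hausdorff(d, {0, 2}, {1}))  # 1
mu = {0: 0.5, 1: 0.5}
print(is_coupling({(0, 0): 0.5, (1, 1): 0.5}, mu, mu))  # True
```

The Kantorovich lifting then minimizes the expected distance Σ d(x, y) · ω(x, y) over all couplings ω ∈ Ω(µ, ν); for finite supports this is a small linear program, and by the remark above the minimum is attained at a vertex of the polytope Ω(µ, ν).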
The above described liftings due to Hausdorff and Kantorovich are key ingredients of the definition of the probabilistic bisimilarity distances, as we will see in the next section.

Probabilistic Automata
Also in this section, we recall some definitions and results from the literature. In particular, we introduce the model of interest, probabilistic automata, its best known behavioural equivalence, probabilistic bisimilarity, and its quantitative generalization. Probabilistic automata were first studied in the context of concurrency by Segala [5].

Definition 6.
A probabilistic automaton is a tuple ⟨S, L, →, ℓ⟩ consisting of
• a nonempty finite set S of states,
• a nonempty finite set L of labels,
• a finitely branching transition relation → ⊆ S × Distr(S), and
• a labelling function ℓ : S → L.

Instead of (s, µ) ∈ →, we write s → µ. A transition relation is finitely branching if for all s ∈ S, the set { µ ∈ Distr(S) | s → µ } is nonempty and finite. For the remainder of this paper we fix a probabilistic automaton ⟨S, L, →, ℓ⟩.
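A probabilistic automaton of this kind has a direct representation as plain data. The sketch below encodes the coin example from the introduction; the label names, the transition of state h, and the 0.51/0.49 bias of state b are assumptions for illustration (the paper only states that f and b end up at distance 1/100):

```python
# States, their labels, and for each state a nonempty finite list of
# probability distributions: the transition relation of Definition 6.
label = {"h": "heads", "t": "tails", "f": "coin", "b": "coin"}
transitions = {
    "h": [{"h": 1.0}],                          # assumed: h loops
    "t": [{"t": 1.0}],                          # t loops with probability one
    "f": [{"h": 1.0}, {"h": 0.5, "t": 0.5}],    # a sure move and a fair coin
    "b": [{"h": 1.0}, {"h": 0.51, "t": 0.49}],  # assumed bias, for illustration
}

def is_finitely_branching(transitions):
    """Every state has a nonempty (finite) set of outgoing distributions."""
    return all(len(ds) >= 1 for ds in transitions.values())

def well_formed(transitions, tol=1e-9):
    """Every outgoing distribution sums to one."""
    return all(abs(sum(d.values()) - 1.0) < tol
               for ds in transitions.values() for d in ds)

print(is_finitely_branching(transitions) and well_formed(transitions))  # True
```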
In order to define probabilistic bisimilarity, we first show how a relation on states can be lifted to a relation on probability distributions over states.
This notion of lifting is due to Jonsson and Larsen [49,Definition 4.3].
Probabilistic bisimilarity, a notion due to Segala and Lynch [6], is introduced next. States are probabilistic bisimilar if they have the same label and each probabilistic transition of the one state can be matched by a probabilistic transition of the other state, and vice versa. Two probabilistic transitions match if they transition with exactly the same probability to states that behave exactly the same.
For a proof that a largest probabilistic bisimulation exists, we refer the reader to, for example, [50, Proposition 4.3]. Relying on exact matching is the cause of the lack of robustness. To address this shortcoming, we define a quantitative generalization of probabilistic bisimilarity, the probabilistic bisimilarity distances, as the least fixed point of the function ∆_1. To prove an alternative characterization of the probabilistic bisimilarity distances in the next section, we also introduce a family of discounted versions of ∆_1, namely ∆_c with c ∈ (0, 1).
Probabilistic automata combine probabilistic and nondeterministic choices.
The Hausdorff distance is usually employed to model nondeterminism (see, for example, [12]). For a discussion of the suitability of the Kantorovich distance to handle probabilistic choices, we refer the reader to [50].
Proposition 9. For all c ∈ (0, 1], the function ∆_c is monotone.

Since ⟨[0, 1]^{S×S}, ⊑⟩ is a complete lattice according to Proposition 1 and ∆_c is a monotone function by Proposition 9, we can conclude from Theorem 2(c) that ∆_c has a least fixed point µ∆_c. The probabilistic bisimilarity distances µ∆_1 form a behavioural pseudometric and the fact that they provide a quantitative generalization of probabilistic bisimilarity is captured by the following theorem due to Deng et al. [8].

An Alternative Characterization
In the previous section, we defined the probabilistic bisimilarity distances as a least fixed point. Next, we present an alternative characterization. This generalizes the characterization of probabilistic bisimilarity distances for labelled Markov chains due to Chen et al. [21, Theorem 8]. First, we partition the set of state pairs as follows: S^2_1 contains the state pairs with different labels, S^2_0 contains the probabilistic bisimilar state pairs, and S^2_? contains the remaining state pairs.
Note that, due to Theorem 10, the state pairs in S^2_0 have distance zero. From Definition 9 we can infer that the state pairs in S^2_1 have distance one. The state pairs in S^2_? cannot have distance zero, again due to Theorem 10, but can have any distance in the interval (0, 1], including distance one.
As we will discuss below, the alternative characterization of the probabilistic bisimilarity distances can be viewed in terms of a stochastic game.
These games were introduced by Shapley [51]. We focus here on a simplified version of these games, called simple stochastic games, which were first studied by Condon [39]. We use the more general definition of Zwick and Paterson [52]. The more general simple stochastic game can be converted to an ordinary simple stochastic game as defined in [39] in polynomial time [52, page 355].
A simple stochastic game is played with a single token by two players, called min and max, on a finite directed graph. The graph has five types of vertices: min, max and random vertices, 0-sinks and 1-sinks. The min, max, and random vertices have several outgoing edges, whereas the 0-sinks and 1-sinks have no outgoing edges. Whenever the token is in a min (max) vertex, the token is moved to one of the successors of the vertex, chosen by the min (max) player. If the token is in a random vertex, the successor is chosen randomly. The min (max) player's objective is to minimize (maximize) the probability of reaching a 1-sink.
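The probability of reaching a 1-sink under optimal play can be approximated by iterating the obvious one-step operator from below, in the least fixed point spirit of this paper; for stopping games this iteration converges to the value. A minimal Python sketch (the tiny game at the end is illustrative, not from the paper):

```python
def value_iteration(kind, succ, prob, iters=1000):
    """Approximate the values of a simple stochastic game from below.

    kind maps each vertex to 'min', 'max', 'rand', '0' or '1'; succ maps
    each vertex to its successors; prob gives, for each random vertex,
    the probabilities of its successors (in the same order).
    """
    v = {x: (1.0 if kind[x] == '1' else 0.0) for x in kind}
    for _ in range(iters):
        new = dict(v)
        for x in kind:
            if kind[x] == 'min':
                new[x] = min(v[y] for y in succ[x])
            elif kind[x] == 'max':
                new[x] = max(v[y] for y in succ[x])
            elif kind[x] == 'rand':
                new[x] = sum(p * v[y] for y, p in zip(succ[x], prob[x]))
        v = new
    return v

# A max vertex choosing between a fair random vertex and the 0-sink.
kind = {'a': 'max', 'r': 'rand', 's0': '0', 's1': '1'}
succ = {'a': ['r', 's0'], 'r': ['s1', 's0'], 's0': [], 's1': []}
prob = {'r': [0.5, 0.5]}
print(value_iteration(kind, succ, prob)['a'])  # 0.5
```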
Our alternative characterization of probabilistic bisimilarity distances can be viewed in terms of a simple stochastic game, similar to the one presented in [31]. The game can be considered a quantitative generalization of the game that characterizes bisimilarity (see [53]). In this turn-based game, starting in a pair of states (s, t), the max player chooses a probabilistic transition from either s or t. Subsequently, the min player chooses a probabilistic transition from the other state and also chooses a coupling. In [31], the latter move by the min player is split into two moves by the min player. For example, if the max player picks s → µ and the min player picks t → ν, then the min player also has to choose ω ∈ V(Ω(µ, ν)). This will be formalized in Definition 10. Recall that such a coupling ω is a probability distribution on S × S. From a coupling ω the stochastic game moves to state pair (u, v) with probability ω(u, v). The objective of the max player is to maximize the probability of reaching a state pair with different labels. The min player tries to minimize this probability. In the above example, the max player tries to reach a state pair with different labels, whereas the min player tries to avoid that from happening. The policies, also known as strategies, for the max and min player are introduced next.
Definition 10 introduces the set A of max policies and the set I of min policies. Given a policy A for the max player and a policy I for the min player, we define the value function as the least fixed point of the function Γ^{A,I}_1. This least fixed point captures the probability of reaching a state pair with different labels if both players use the given policies. We also introduce a family of discounted versions of Γ^{A,I}_1, namely Γ^{A,I}_c with c ∈ (0, 1), that we will use later in this section.
We collect some properties of Γ^{A,I}_c in the following proposition.

Proof. Let A ∈ A, I ∈ I and c ∈ (0, 1].
(a) Let d, e ∈ [0, 1]^{S×S} with d ⊑ e. We distinguish three cases. If (s, t) ∈ S^2_0 or (s, t) ∈ S^2_1, the claim is immediate. Otherwise, (s, t) ∈ S^2_?.
(b) We distinguish the same three cases. If (s, t) ∈ S^2_1, the claim is immediate. Otherwise, (s, t) ∈ S^2_?.

From Theorem 2(c) we can conclude that Γ^{A,I}_c has a least fixed point, which we denote by µΓ^{A,I}_c. In the remainder of this section we will show that there exist an optimal max policy A* and an optimal min policy I* such that the corresponding value function captures the probabilistic bisimilarity distances. In the game graph of Figure 4, the red edge represents the optimal max policy and the blue edges represent the optimal min policy. The proof of µ∆_1 = µΓ^{A*,I*}_1 consists of two parts. First, we prove that there exists an optimal min policy.

Proof. Towards the construction of I* ∈ I, let s ∈ S and ν ∈ Distr(S).
• Otherwise, (s, t) ∈ S^2_?.

In the remainder of this paper, we denote the optimal min policy constructed in the above proof by I*. It remains to prove that there exists an optimal max policy. The proof of this second part turns out to be more involved than the proof of the first part contained in the above theorem. The proof has the following three major components.
• For all A ∈ A and I ∈ I, the value function µΓ^{A,I}_1 is the limit of its discounted counterparts µΓ^{A,I}_c as c tends to one.
• Similarly, the probabilistic bisimilarity distances captured by µ∆_1 are the limit of their discounted counterparts represented by µ∆_c.
• There exists an optimal max policy in the discounted setting.
Combining the above three components, we arrive at an optimal max policy.
The next few propositions lead up to the proof that for all A ∈ A and I ∈ I, the value function µ(Γ^{A,I}_1) is the limit of the discounted value functions µ(Γ^{A,I}_c), that is, lim_{c↑1} µ(Γ^{A,I}_c) = µ(Γ^{A,I}_1). The proof distinguishes the cases (s, t) ∈ S^2_1 and (s, t) ∈ S^2_?. To prove lim_{c↑1} µ(Γ^{A,I}_c) = µ(Γ^{A,I}_1), we define a function γ^{A,I}_1.

Proof. Let A ∈ A and I ∈ I. It suffices to prove that µ(Γ^{A,I}_c) ⊑ µ(Γ^{A,I}_1) for all c ∈ (0, 1).

Definition 12. The function γ^{A,I}_1 ∈ [0, 1]^{S×S} is defined as the supremum of the discounted value functions µ(Γ^{A,I}_c) over c ∈ (0, 1).
From Proposition 13 and the definition of γ^{A,I}_1 we can conclude that for all A ∈ A and I ∈ I, lim_{c↑1} µ(Γ^{A,I}_c) = γ^{A,I}_1.

Proposition 15. For all A ∈ A and I ∈ I, γ^{A,I}_1 ⊑ µ(Γ^{A,I}_1).
Proof. Let A ∈ A and I ∈ I. From Proposition 13 we can conclude that for all c ∈ (0, 1), µ(Γ^{A,I}_c) ⊑ µ(Γ^{A,I}_1). From this we can immediately deduce that γ^{A,I}_1 ⊑ µ(Γ^{A,I}_1).
Proof. Let A ∈ A and I ∈ I. To conclude that µ(Γ^{A,I}_1) ⊑ γ^{A,I}_1, it suffices to show that Γ^{A,I}_1(γ^{A,I}_1) = γ^{A,I}_1 according to Theorem 2(d). Let s, t ∈ S. We distinguish three cases.
Otherwise, we define A*_c(s, t) as follows.

• Otherwise, (s, t) ∈ S^2_?. Without any loss of generality, assume that A*_c(s, t) = (t, µ). This assumption implies (3), and hence the desired result.

Combining the above three components, we obtain the second part of the proof of the alternative characterization.
Since the set A is finite, the sequence (A_n)_{n∈N} has a subsequence (A_{σ(n)})_{n∈N} that is constant, that is, there exists A* ∈ A such that for all n ∈ N, A_{σ(n)} = A*. From (5) we can deduce that ∀I ∈ I : µ∆_1 ⊑ µΓ^{A*,I}_1.
In the remainder of this paper, we use A* to denote an optimal max policy, which exists according to Theorem 20. Combining the above results, we arrive at the following alternative characterization of the probabilistic bisimilarity distances.

Deciding Distance One
In this section, we present an algorithm to compute the set D_1 of state pairs that have distance one, that is, D_1 = { (s, t) ∈ S × S | µ∆_1(s, t) = 1 }. The key ingredient of our algorithm is the following function.
The set Λ(X, Y ) contains all state pairs with different labels and those state pairs for which there exists a move by the max player so that every subsequent move of the min player always ends up in X and with some positive probability in Y . We use the notation λZ.Λ(X, Z) to denote the function that maps the set Z to the set Λ(X, Z). We denote the least and greatest fixed point of this function by µZ.Λ(X, Z) and νZ.Λ(X, Z), respectively.
The function Λ has the following monotonicity properties.
(a) Follows immediately from the definition of Λ and the fact that support(ω) ⊆ X and X ⊆ Y imply support(ω) ⊆ Y .
(b) Follows immediately from the definition of Λ and the fact that support(ω) ∩ X ≠ ∅ and X ⊆ Y imply support(ω) ∩ Y ≠ ∅.
(c) Since µZ.Λ(X, Z) is the least pre-fixed point of λZ.Λ(X, Z) according to Theorem 2(d), we can conclude that µZ.Λ(X, Z) ⊆ µZ.Λ(Y, Z).

The set µY.Λ(X, Y) contains all state pairs (s, t) that can reach a state pair with different labels and all state pairs reachable from (s, t) are an element of X.
The set νX.µY.Λ(X, Y) contains all state pairs (s, t) for which there exists a max policy such that for all min policies, all state pairs reachable from (s, t) can reach a state pair with different labels. In the next section, we will prove that νX.µY.Λ(X, Y) captures the set D_1. According to Theorem 4(a) and (b), these greatest and least fixed points can be obtained iteratively as follows.

Proposition 23. For all µ, ν ∈ Distr(S) and X ⊆ S × S, the quantification over couplings in the definition of Λ can be restricted to the vertices V(Ω(µ, ν)).

Proof. Let µ, ν ∈ Distr(S) and X ⊆ S × S. The property holds for all ω ∈ Ω(µ, ν). Furthermore, according to Proposition 8, there exists π ∈ V(Ω(µ, ν)) with the required property. The proof consists of several parts.
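The nested fixed point νX.µY.Λ(X, Y) can be computed with exactly these two iterations: an inner least fixed point from ∅ for each value of X, inside an outer greatest fixed point from the full set. A Python sketch with Λ passed in as a function (the toy Λ below is illustrative and much simpler than the Λ of this paper):

```python
def mu(f):
    """Least fixed point of a monotone operator on subsets: iterate from the empty set."""
    y = frozenset()
    while True:
        nxt = f(y)
        if nxt == y:
            return y
        y = nxt

def nu_mu(Lam, universe):
    """Compute nu X. mu Y. Lam(X, Y): iterate the outer map from the full set."""
    x = frozenset(universe)
    while True:
        nxt = mu(lambda y: Lam(x, y))
        if nxt == x:
            return x
        x = nxt

# Toy Lam: keep the elements of X that are a "base" element or step into Y.
step = {2: 1, 3: 2}
Lam = lambda X, Y: frozenset(s for s in X if s == 1 or step.get(s) in Y)
print(sorted(nu_mu(Lam, {1, 2, 3, 4})))  # [1, 2, 3]
```

Since Λ(X, Y) ⊆ X for the toy operator, each outer iterate shrinks, so both loops terminate on a finite universe.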

Correctness Proof
In this section, we will prove the algorithm presented in the previous section correct. That is, we will show that D_1 = νX.µY.Λ(X, Y). Intuitively, a pair of states in D_1 has the property that either the states have a different label or there exists a max policy that always ends up, no matter how the min player plays, in state pairs for which the same particular property holds.
Let us first provide a high level overview of the proof. We will define a sequence of sets of state pairs X_0, X_1, ..., X_m in terms of Λ in Definition 14.
We will show that X_m = νX.µY.Λ(X, Y) in Lemma 29 and D_1 ⊆ X_m in Proposition 27. To conclude that D_1 = νX.µY.Λ(X, Y), it remains to show that X_m \ D_1 = ∅. In Definition 15, we will partition X_m \ D_1 into the sets of state pairs Z_0, Z_1, ..., Z_{n−1} as depicted in Figure 6. In Definition 16, we will construct a max policy A′ that only differs from the optimal max policy A* on X_m \ D_1. This max policy A′ will be such that if the min player plays according to the optimal min policy I* and the max player uses A′ starting from (s, t) ∈ Z_i, then the game will stay within X_m according to Proposition 32 and the game will reach some (u, v) ∈ D_1 ∪ ⋃_{0≤j<i} Z_j with positive probability. In the proof of Theorem 34, towards a contradiction, we will assume that X_m \ D_1 is nonempty. Let Z_i contain a state pair with minimal value, for the policies I* and A′, among the state pairs in X_m \ D_1.
We will prove that this minimal value is smaller than one. As we will show, Z_j, for some 0 ≤ j < i, contains a state pair with minimal value as well. As a result, Z_0 contains a state pair with minimal value, which will lead to a contradiction.

To prove the second fixed point property of Λ, we first show that if the game starts in D_1 and the max player uses the optimal max policy A*, then the game stays in D_1.
Secondly, we show that D_1 is a fixed point of λX.µY.Λ(X, Y).
Proof. Let Y = µY.Λ(D_1, Y). From Proposition 24, we can conclude Y ⊆ D_1. It remains to prove that D_1 ⊆ Y. Towards a contradiction, assume that D_1 \ Y is nonempty. First, we show that a suitable I ∈ I exists. Next, we prove that the corresponding values are zero. Finally, we show that there exists (s, t) ∈ D_1 \ Y such that µ∆_1(s, t) = 0, which is the desired contradiction.
Next, we prove that for all (s, t) ∈ D_1 \ Y, µ(Γ^{A*,I}_1)(s, t) = 0. To prove this, it suffices to show that for all n ∈ N, (Γ^{A*,I}_1)^n(0)(s, t) = 0, according to Proposition 11 and a minor variation on Theorem 7. We prove this by induction on n. The base case n = 0 is immediate. Let n > 0. Then, by assumption and the induction hypothesis, the claim follows. Hence, (s, t) ∈ Y, which contradicts (s, t) ∈ D_1 \ Y.

Iterative Characterization
To conclude that the algorithm presented in the previous section is correct, it remains to show that νX.µY.Λ(X, Y) equals D_1. We start by providing an iterative characterization of νX.µY.Λ(X, Y).

Definition 14.
For each i ∈ N, the set X_i ⊆ S × S and, for each i, j ∈ N, the set Y^j_i ⊆ S × S are defined iteratively in terms of Λ. The above definition differs from the iterative algorithm presented in the previous section in that Y^0_i = D_1, whereas the algorithm starts its iteration towards the least fixed point from ∅.
Next, we prove a key property of the sets X_i.
Proof. We prove this proposition by induction on i. We distinguish the following two cases.
• If i = 0, the base case holds.
• If i > 0, the claim follows from the induction hypothesis.

The proposition below collects two properties of Y^j_i, which will be used later.

Proof.
(a) Let i ∈ N. We prove this proposition by induction on j. The base case, j = 0, is vacuously true. Let j > 0.
(b) We prove this proposition by induction on j. We distinguish the following two cases.
- If j = 0, the base case holds.
- If j > 0, the claim follows from the induction hypothesis.

Next, we prove some other key properties of the sets X_i and Y^j_i.
Proof. (c) The equality follows directly from the definitions.

Max Policy A′
In this section, we will construct a max policy A′. The construction of A′ relies on partitioning X_m \ D_1 into n disjoint subsets Z_0, Z_1, ..., Z_{n−1}, as depicted in Figure 6. Note that m and n are the constants from Lemma 29.
Definition 15. For each 0 ≤ i < n, the set Z_i ⊆ S × S is defined in terms of the sets Y^j_m. We collect some properties of the sets Z_i in the proposition below.
(a) Let 0 ≤ i < n. The claim follows by Proposition 28(a).
(c) We first show that the claim holds for all 1 ≤ j ≤ n. We distinguish the following two cases.
- If j = 1, the base case holds.
- If j > 1, the claim follows from the induction hypothesis.

The result now follows from the observation that X_m = Y^n_m (Lemma 29(c)).
(d) We prove that for all i, Y^i_m = D_1 ∪ ⋃_{0≤j<i} Z_j by induction on i. We distinguish the following two cases.
- If i = 0, then Y^0_m = D_1 by definition.
- If i > 0, the claim follows from the induction hypothesis.

Proof. Let 0 ≤ i < n and (s, t) ∈ Z_i.

Based on the above proposition, we construct a max policy A′.
Definition 16. The function A′ : S^2_? → (S × Distr(S)) is defined by A′(s, t) = (s, ν) if (s, t) ∈ Z_i, with ν as in the proposition above, and A′(s, t) = A*(s, t) otherwise.

Note that the max policy A′ only differs from the optimal max policy A* on X_m \ D_1. If the game starts in X_m and the max player uses the max policy A′ and the min player plays according to the optimal min policy I*, then the game stays in X_m.
If the max player uses the max policy A′ and the min player plays according to the optimal min policy I*, then the value of (s, t) is one if and only if s and t have probabilistic bisimilarity distance one.

Proof. Let s, t ∈ S. We prove two implications. Assume that (s, t) ∈ D_1.
We will show that, for all i ∈ N, (16) holds. From this fact we can conclude the claim [Proposition 11 and Theorem 7].

Next, we prove (16) by induction on i. The base case, i = 0, is vacuously true. Let i > 0.
We distinguish two cases.
Combining the above results, we arrive at the following.
Towards a contradiction, assume that X_m \ D_1 ≠ ∅. Let

Since X_m \ D_1 ≠ ∅, we can conclude that min exists and M ≠ ∅.
Next, we will prove that

Let 0 ≤ i < n and assume that Z_i ∩ M ≠ ∅. Let (s, t) ∈ Z_i with µΓ^{A,I*}_1(s, t) = min. Then µΓ^{A,I*}_1(u, v) ≥ min for all (u, v) ∈ X_m \ D_1. Since min < 1, from the above we can deduce that support(I*(A(s, t))) ⊆ X_m \ D_1 and µΓ^{A,I*}_1(u, v) = min for all (u, v) ∈ support(I*(A(s, t))).

Experiments
In this section, we evaluate the performance of the algorithm for deciding distance one on several probabilistic automata. These probabilistic automata model probabilistic protocols that are part of the distribution of the probabilistic model checker PRISM [57].
We implemented the iterative algorithm for deciding distance one presented in Section 5 in Java. As S^2_? = (S × S) \ (S^2_0 ∪ S^2_1) is required when applying the Λ operator, we need to compute S^2_0 and S^2_1. The set S^2_1 is simply the set of state pairs with different labels and can be computed by comparing the labels of each state pair, whereas S^2_0 is the set of state pairs at distance zero. As mentioned earlier, distance zero coincides with probabilistic bisimilarity. We therefore also implemented in Java the two-phased partitioning algorithm for deciding probabilistic bisimilarity of probabilistic automata by Baier et al. [58]. This algorithm runs in time O(mn(log m + log n)), where m is the number of probabilistic transitions and n is the number of states of the probabilistic automaton.
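The computation of S^2_1 described above is a straightforward pairwise label comparison. The following sketch illustrates it in Python (our implementation is in Java; the names here are illustrative, not the actual code):

```python
from itertools import product

def pairs_with_different_labels(states, labels):
    """Return S^2_1: the set of state pairs (s, t) whose labels differ."""
    return {(s, t) for s, t in product(states, repeat=2)
            if labels[s] != labels[t]}

# A toy 3-state automaton: states s and t share label 'a', u has label 'b'.
states = ["s", "t", "u"]
labels = {"s": "a", "t": "a", "u": "b"}
print(pairs_with_different_labels(states, labels))
```

For this toy example, S^2_1 consists of the four pairs that mix u with s or t.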
In our experiments, we keep track of |S^2_0|, |S^2_1| and |D_1 \ S^2_1|. As a consequence, we can determine the number of distances that are non-trivial, that is, greater than zero and smaller than one: |S|^2 − (|S^2_0| + |S^2_1| + |D_1 \ S^2_1|). The distances of the state pairs in S^2_? can be computed using, for example, policy iteration [59]. This algorithm runs in exponential time in the worst case [24]. Note that the state pairs in D_1 \ S^2_1 can be computed in polynomial time. Hence, the number of state pairs whose distance remains to be computed has been reduced by |D_1 \ S^2_1|.

We applied our implementation to probabilistic automata obtained from PRISM. We compute the above mentioned reduction of the number of non-trivial distances for the following models: the randomized consensus shared coin protocol due to Aspnes and Herlihy [60] and the IPv4 zeroconf protocol.

The IPv4 zeroconf protocol is a dynamic configuration protocol for IPv4 addresses. The protocol modelled in PRISM has three parameters: N denotes the number of abstract hosts, K the number of probes each host sends, and a boolean value reset. When reset is set to true, each host should delete the previously received messages before choosing a new IP address. For details of the model, we refer the interested reader to [62]. The smallest instance of this model has 451 states, obtained when K = 1 and reset = true; this is the only instance we can handle for this protocol. We consider two systems, corresponding to two different values of N: 20 and 1000. The results are shown in Table 3.
Both systems have the same number of states and also the same numbers of state pairs at distance zero and at distance one.
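The bookkeeping behind the reduction reported in the experiments is simple arithmetic. A minimal sketch, with hypothetical counts for illustration:

```python
def non_trivial_count(num_states, s2_zero, s2_one, d1_minus_s2_one):
    """Number of state pairs whose distance is strictly between 0 and 1:
    |S|^2 - (|S^2_0| + |S^2_1| + |D_1 \\ S^2_1|)."""
    return num_states ** 2 - (s2_zero + s2_one + d1_minus_s2_one)

# Hypothetical counts for a 10-state automaton: 40 pairs at distance zero,
# 30 pairs with different labels, and 10 further pairs at distance one.
print(non_trivial_count(10, s2_zero=40, s2_one=30, d1_minus_s2_one=10))  # 20
```

Only these 20 remaining pairs would have to be handed to the (worst-case exponential) policy iteration.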

Conclusion
In the introduction, we alluded to the connection between reinforcement learning and behavioural pseudometrics for labelled Markov chains (see Table 1). Here, we briefly discuss a connection between game theory and behavioural pseudometrics for probabilistic automata (see Table 4).
  behavioural pseudometrics    game theory
  probabilistic automaton      simple stochastic game
  distances                    values
  policy iteration             policy iteration

Chen et al. [21] have provided an alternative characterization of the probabilistic bisimilarity distances for a labelled Markov chain as the values of a Markov decision process. This characterization forms the basis for the algorithm to compute the probabilistic bisimilarity distances for labelled Markov chains by Bacci et al. [23]. Their algorithm is similar to Howard's policy iteration algorithm [28]. In this paper we have presented an alternative characterization of the probabilistic bisimilarity distances for a probabilistic automaton as the values of a simple stochastic game. Bacci et al. [59] have recently used a similar characterization as the foundation for an algorithm to compute the probabilistic bisimilarity distances for probabilistic automata, based on the policy iteration algorithm from game theory due to Hoffman and Karp [63].
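Policy iteration alternates exact evaluation of the current policy with a greedy improvement step until the policy is stable. As an illustration of this general scheme (a Howard-style sketch on a toy two-state Markov decision process, not the algorithm of Bacci et al. or of Hoffman and Karp; all numbers are made up):

```python
import numpy as np

# Toy discounted MDP: P[a][s][s'] are transition probabilities for
# action a from state s to s'; R[a][s] are immediate rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor

def policy_iteration(P, R, gamma):
    n = R.shape[1]
    policy = np.zeros(n, dtype=int)  # start with action 0 everywhere
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = np.array([P[policy[s], s] for s in range(n)])
        r_pi = np.array([R[policy[s], s] for s in range(n)])
        v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
        # Policy improvement: greedy one-step lookahead on v.
        q = R + gamma * (P @ v)          # q[a][s]
        new_policy = np.argmax(q, axis=0)
        if np.array_equal(new_policy, policy):
            return policy, v             # policy is stable, hence optimal
        policy = new_policy

policy, v = policy_iteration(P, R, gamma)
print(policy, v)
```

The Hoffman-Karp algorithm for simple stochastic games follows the same evaluate-then-improve pattern, but improves the policy of one player while solving optimally against the other.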
As shown by Baier [25], probabilistic bisimilarity distance zero for probabilistic automata can be decided in polynomial time. In this paper we have shown that distance one can also be decided in polynomial time. As a consequence, we can determine in polynomial time how many, if any, distances are non-trivial, that is, greater than zero and smaller than one. As we have already shown in [26] in the context of labelled Markov chains, being able to decide distance zero and distance one in polynomial time has a significant impact on computing probabilistic bisimilarity distances. The algorithm by Bacci et al. [23], which does not decide distance one before computing the non-trivial distances using policy iteration, can compute distances for labelled Markov chains of up to 150 states; for one such labelled Markov chain, their algorithm takes more than 49 hours. The algorithm we present in [26] decides distance zero and distance one before using policy iteration to compute the non-trivial distances, and takes 13 milliseconds instead of 49 hours. For more details, we refer the reader to [27, Chapter 9].
Consider the probabilistic automaton in Figure 7. This probabilistic automaton induces the game graph depicted in Figure 8. If µ and ν are both the uniform distribution on n elements, then the vertices of Ω(µ, ν) can be viewed as permutations (see [64]). As a result, from the state pair (s, t), after one move by the max player and one move by the min player, n! vertices can be reached. Hence, we may encounter an exponential blow-up when we transform a probabilistic automaton into a simple stochastic game. As a consequence, it is not immediately obvious which results from game theory can be transferred to behavioural pseudometrics.

Let us briefly discuss one paper about games that contains seemingly related results that deserve further study. In [65], de Alfaro, Henzinger and Kupferman study reachability games. These games have two players and the strategies can be randomized.
They present an algorithm to compute the almost-sure reachable states, that is, those states for which one of the players has a strategy to reach a particular set of states with probability one. Similarly, the function Λ induces a strategy of the max player and is related to reachability of the set S^2_1.
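The factorial blow-up discussed above can be checked concretely. For µ = ν uniform on n elements, each vertex of Ω(µ, ν) corresponds to a permutation π, namely the coupling that pairs i with π(i) with probability 1/n. A small illustrative sketch (names are ours, not from the paper):

```python
from itertools import permutations
from math import factorial

def coupling_vertices_uniform(n):
    """Vertices of Omega(mu, nu) for mu = nu = uniform on n elements:
    one coupling per permutation pi, pairing i with pi(i) with mass 1/n."""
    return [{(i, pi[i]): 1.0 / n for i in range(n)}
            for pi in permutations(range(n))]

# For n = 3 there are 3! = 6 vertices.
print(len(coupling_vertices_uniform(3)))  # 6
```

Already for modest n this enumeration is infeasible, which is exactly why the transformation into a simple stochastic game may blow up.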
To prove Theorem 20, which provides the second part of the proof of the alternative characterization of the probabilistic bisimilarity distances, we rely on the discounted functions ∆_c and Γ^{A*_c, I_c} for c ∈ (0, 1). In particular, in the proof of Lemma 19 we use the fact that Γ^{A*_c, I_c} has a unique fixed point. If we were able to prove that Γ^{A*, I}_1 has a unique fixed point, then we would be able to give a proof of Theorem 20 that does not rely on discounted functions.
We also leave that for future research.