
The Value Problem for Multiple-Environment MDPs with Parity Objective

Krishnendu Chatterjee, IST Austria, Klosterneuburg, Austria · Laurent Doyen, CNRS & LMF, ENS Paris-Saclay, France · Jean-François Raskin, Université Libre de Bruxelles, Belgium · Ocan Sankur, Université de Rennes, CNRS, Inria, France
Mitsubishi Electric R&D Centre Europe, Rennes, France
Abstract

We consider multiple-environment Markov decision processes (MEMDP), which consist of a finite set of MDPs over the same state space, representing different scenarios of transition structure and probability. The value of a strategy is the probability to satisfy the objective, here a parity objective, in the worst-case scenario, and the value of an MEMDP is the supremum of the values achievable by a strategy.

We show that deciding whether the value is 1 is a PSPACE-complete problem, and is in P when the number of environments is fixed, and we give new insights into the almost-sure winning problem, which asks whether there exists a strategy with value 1. Pure strategies are sufficient for these problems, whereas randomization is necessary in general when the value is smaller than 1. We present an algorithm to approximate the value, running in double exponential space. Our results are in contrast to the related model of partially-observable MDPs, where all these problems are known to be undecidable.

Keywords and phrases:
Markov decision processes, imperfect information, randomized strategies, limit-sure winning
Category:
Track B: Automata, Logic, Semantics, and Theory of Programming
Funding:
Krishnendu Chatterjee: ERC CoG 863818 (ForM-SMArt) and Austrian Science Fund (FWF) 10.55776/COE12.
Jean-François Raskin: PDR Weave project FORM-LEARN-POMDP funded by FNRS and DFG, and the support of the Fondation ULB.
Ocan Sankur: ANR BisoUS (ANR-22-CE48-0012) and ANR EpiRL (ANR-22-CE23-0029).
Copyright and License:
© Krishnendu Chatterjee, Laurent Doyen, Jean-François Raskin, and Ocan Sankur; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Logic and verification; Theory of computation → Probabilistic computation
Related Version:
Full Version: https://arxiv.org/abs/2504.15960 [8]
Editors:
Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis

1 Introduction

We consider Markov decision processes (MDP), a well-established state-transition model for decision making in a stochastic environment. The decisions involve choosing an action from a finite set, which together with the current state determine a probability distribution over the successor state. The question of constructing a strategy that maximizes the probability to satisfy a logical specification is a classical synthesis problem with a wide range of applications [20, 12, 3, 21].

The stochastic transitions in MDPs capture the uncertainty in the effect of an action. Another form of uncertainty arises when the states are (partially) hidden from the decision-maker, as in the classical model of partially-observable MDPs (POMDP) [16, 19]. Recently, an alternative model of MDPs with partial information has attracted attention, the multiple-environment MDP (MEMDP) [22], which consists of a finite set of MDPs over the same state space. Each MDP represents a possible environment, but the decision-maker does not know in which environment they are operating. The synthesis problem is then to construct a single strategy that can be executed in all environments and ensures that the objective is satisfied independently of the environment. This model is natural in applications where the structure of the transitions and their probabilities are uncertain, such as in robust planning or population models with individual variability [4, 6, 1, 24, 23].

In contrast to what previous work suggests, the two models of POMDPs and MEMDPs are (syntactically) incomparable: the choice of the environment in an MEMDP is adversarial, which cannot be expressed in a POMDP, and the partial observability of a POMDP can occur throughout the execution, whereas the uncertainty in an MEMDP is only initial. In particular, MEMDPs are not a subclass of POMDPs, since pure strategies are sufficient in POMDPs [18, 7] while randomization is necessary in general in MEMDPs [22, Lemma 3].

The synthesis problem has been considered for traditional ω-regular objectives, defined as parity [22] or Rabin [23] conditions, in three variants: the almost-sure problem is to decide whether there exists a strategy that is winning with probability 1 in all environments; the limit-sure problem is to decide whether, for every ε > 0, there exists a strategy that is winning with probability at least 1−ε in all environments; and the gap problem is an approximate version of the quantitative problem: given a threshold 0 < λ ≤ 1, decide whether there exists a strategy that is winning with probability at least λ in all environments. The limit-sure problem is also called the value-1 problem, where the value of an MEMDP is defined as the supremum of the values achievable by a strategy. The value is 1 if and only if the answer to the limit-sure problem is Yes.

Figure 1: Multiple-environment MDP for the missing card (over 3-card deck). Each M[ei] represents the behavior of the MEMDP under environment ei where card i has been removed. The environment can be identified almost-surely (with probability 1).
Figure 2: Multiple-environment MDP for the duplicate card (over 3-card deck). Each M[ei] represents the behavior of the MEMDP under environment ei where card i has been duplicated. The environment can be identified limit-surely (with probability arbitrarily close to 1).

A classical example to illustrate the difference between almost-sure and limit-sure winning is to consider an environment consisting of 51 cards, obtained by removing one card from a standard 52-card deck (see Figure 1). The decision-maker has two possible actions: the action sample reveals the top card of the deck and then shuffles the cards (including the top card, which remains in the deck); the action guess(x), where 1 ≤ x ≤ 52 is a card, stops the game with a win if x is the missing card, and a loss otherwise. If no guess is ever made, the game is also losing. An almost-sure winning strategy is to sample until each of the 51 cards has been revealed at least once, and then to make the (now certain) correct guess. It is easy to see that the strategy wins with probability 1, even if there exist scenarios (though with probability 0) where some of the 51 cards are never revealed and no correct guess is made. Hence the MEMDP is almost-sure winning, and we say that it is not sure winning because a losing scenario exists under every strategy.

Consider now an environment consisting of 53 cards, obtained by adding one duplicate card c to the standard deck, with the same action set and rules of the game, except that a correct guess is now the duplicate card, x = c (see Figure 2). The strategy that samples for a long time and then makes a guess based on the most frequent card wins with probability close to 1 – and closer to 1 as the sampling time gets longer – but not equal to 1, since no matter how long the sampling phase is, there is always a nonzero probability that the duplicate card does not have the highest frequency at the time of the guess. In this case, the MEMDP is limit-sure winning, but not almost-sure winning. Intuitively, the solution of almost-sure winning relies on the analysis of revealing transitions, which give sure information allowing us to exclude some environment (seeing card c is a guarantee that we are not in the environment where c is missing); the solution of limit-sure winning involves learning by sampling, which also allows us to exclude some environment, but possibly with a nonzero probability of being mistaken.
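To make the learning-by-sampling intuition concrete, here is a minimal Monte-Carlo sketch (ours, not from the paper) of the limit-sure strategy for the duplicate-card game: sample n times, then guess the most frequent card. The empirical winning probability approaches 1 as n grows, but never reaches 1.

```python
import random
from collections import Counter

def duplicate_card_game(n_samples, deck_size=52, rng=random):
    """One play of the duplicate-card game: sample n_samples times, then
    guess the most frequent card.  Returns True iff the guess is correct."""
    # The hidden environment is which card is duplicated.  It is adversarial in
    # the MEMDP, but the guessing strategy is symmetric, so drawing it uniformly
    # here gives the same success probability.
    duplicate = rng.randrange(deck_size)
    deck = list(range(deck_size)) + [duplicate]        # 53 cards, one duplicated
    counts = Counter(rng.choice(deck) for _ in range(n_samples))
    guess, _ = counts.most_common(1)[0]
    return guess == duplicate

def estimated_value(n_samples, trials=2000):
    return sum(duplicate_card_game(n_samples) for _ in range(trials)) / trials

# The longer the sampling phase, the closer the winning probability gets to 1,
# but it never reaches 1.
for n in (100, 1_000, 10_000):
    print(n, estimated_value(n))
```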

For MEMDPs with two environments, it is known that the almost-sure and limit-sure problems for parity objectives are solvable in polynomial time [22, Theorem 33, Theorem 40], while the gap problem is decidable in double exponential space [22, Theorem 30] and is NP-hard, already for acyclic MEMDPs with two environments [22, Theorem 26]. With an arbitrary number of environments, the almost-sure problem becomes PSPACE-complete [23, Theorem 41], even for reachability objectives [24, Lemma 11]. For comparison, in the closely related model of POMDPs, the decidability frontier lies between almost-sure and limit-sure winning: with reachability objectives, the almost-sure problem is decidable (and EXPTIME-complete [2]), whereas the limit-sure problem is undecidable [13]. The gap problem is also undecidable [17].

In this paper, we consider the limit-sure problem and the gap problem for parity objectives in MEMDPs with an arbitrary number of environments. Our main results are that (a) the limit-sure problem is PSPACE-complete and can be solved in polynomial time for a fixed number of environments, and (b) the gap problem can be solved in double exponential space. Our algorithms significantly extend the solutions known for two environments, relying on a non-trivial recursive (inductive) analysis.

The PSPACE upper bound is obtained by a characterization of limit-sure winning for a subclass of MEMDPs, in terms of almost-sure winning conditions (Lemma 12). A pre-processing phase transforms general MEMDPs into the subclass. We present a PSPACE algorithm to compute the pre-processing and verify the characterization. Since our algorithm relies on almost-sure winning, we also give a new characterization of almost-sure winning for parity objectives in MEMDPs (Lemma 3), which gives a conceptually simple alternative algorithm to the known solution [23]. The PSPACE lower bound straightforwardly follows from the same reduction as for almost-sure winning [23, Theorem 7]. A corollary of our characterizations is a refined strategy complexity: pure (non-randomized) strategies are sufficient for both limit-sure and almost-sure winning, which was known only for acyclic MEMDPs and almost-sure reachability objectives [24, Lemma 12], and exponential memory is sufficient. In the last part of the paper, we present an algorithm running in double exponential space for solving the gap problem, by computing an approximation of the value of the MEMDP. To win with probability at least λ in all environments, randomized strategies are more powerful [22, Lemma 3], and thus need to be considered for solving the gap problem.

In conclusion, the model of MEMDP is a valuable alternative to POMDPs, from a theoretical perspective since the limit-sure problem and gap problem are undecidable for POMDPs whereas our results establish decidability for MEMDPs, and from a practical perspective since many applications of POMDPs can be expressed by MEMDPs, as was observed previously [1, 24].

Omitted proofs and material are provided in the extended version [8].

2 Definitions

A probability distribution on a finite set Q is a function d : Q → [0,1] such that ∑_{q∈Q} d(q) = 1. The support of d is Supp(d) = {q ∈ Q ∣ d(q) > 0}. A Dirac distribution assigns probability 1 to some q ∈ Q. We denote by 𝒟(Q) the set of all probability distributions on Q.

2.1 Markov Decision Processes

A Markov decision process (MDP) over a finite set A of actions is a tuple M = ⟨Q, (A_q)_{q∈Q}, δ⟩ consisting of a finite set Q of states, a nonempty set A_q ⊆ A of actions for each state q ∈ Q, and a partial probabilistic transition function δ : Q × A → 𝒟(Q). We say that (q,a,q′) is a transition if δ(q,a)(q′) > 0. A state q ∈ Q is a sink if δ(q,a)(q) = 1 for all a ∈ A_q.

A run of M from an initial state q_0 ∈ Q is an infinite sequence π = q_0 a_0 q_1 a_1 … of interleaved states and actions such that a_i ∈ A_{q_i} and δ(q_i,a_i)(q_{i+1}) > 0 for all i ≥ 0. Finite prefixes ρ = q_0 a_0 … q_n of runs ending in a state are called histories, and we denote by 𝗅𝖺𝗌𝗍(ρ) = q_n the last state of ρ. We denote by 𝖧𝗂𝗌𝗍^ω(M) (resp., 𝖧𝗂𝗌𝗍(M)) the set of all runs (resp., histories) of M, and by Inf(π) the set of states that occur infinitely often along the run π.

A sub-MDP of M is an MDP M′ = ⟨Q′, (A′_q)_{q∈Q′}, δ⟩ such that Q′ ⊆ Q, A′_q ⊆ A_q, and Supp(δ(q,a)) ⊆ Q′ for all states q ∈ Q′ and actions a ∈ A′_q (recall the requirement that A′_q ≠ ∅). Consider a set Q′ ⊆ Q such that for all q ∈ Q′, there exists a ∈ A_q with Supp(δ(q,a)) ⊆ Q′. We define the sub-MDP of M induced by Q′, denoted by M|_{Q′}, as the sub-MDP M′ = ⟨Q′, (A′_q)_{q∈Q′}, δ⟩ where A′_q = {a ∈ A_q ∣ Supp(δ(q,a)) ⊆ Q′} for all q ∈ Q′.
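For readers who prefer code, the following hypothetical data structure (not part of the paper) mirrors the definitions above: an MDP is stored as a map from state-action pairs to successor distributions, and M|_{Q′} keeps exactly the actions whose support stays inside Q′.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Set, Tuple

State = Hashable
Action = Hashable
Dist = Dict[State, float]      # successor distribution, values sum to 1

@dataclass
class MDP:
    # delta[(q, a)] is the distribution over successors of q under action a;
    # the actions available in q are exactly the a's with (q, a) in delta.
    delta: Dict[Tuple[State, Action], Dist]

    def actions(self, q: State) -> Set[Action]:
        return {a for (s, a) in self.delta if s == q}

    def support(self, q: State, a: Action) -> Set[State]:
        return {s for s, p in self.delta[(q, a)].items() if p > 0}

def induced_sub_mdp(mdp: MDP, q_prime: Set[State]) -> MDP:
    """Sub-MDP M|Q' keeping, in each state of Q', the actions whose support
    stays inside Q' (assumes every state of Q' has at least one such action)."""
    delta = {(q, a): d for (q, a), d in mdp.delta.items()
             if q in q_prime and mdp.support(q, a) <= q_prime}
    assert all(any(s == q for (s, _) in delta) for q in q_prime), \
        "some state of Q' has no action whose support stays in Q'"
    return MDP(delta)
```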

End-components.

An end-component of M = ⟨Q, (A_q)_{q∈Q}, δ⟩ is a pair (Q′, (A′_q)_{q∈Q′}) such that ⟨Q′, (A′_q)_{q∈Q′}, δ′⟩ is a sub-MDP of M, where δ′ denotes the restriction of δ to {(q,a) ∣ q ∈ Q′, a ∈ A′_q}, and where the graph ⟨Q′, E⟩ with E = {(q,q′) ∈ Q′ × Q′ ∣ ∃a ∈ A′_q : δ(q,a)(q′) > 0} is strongly connected [10, 3]. We often identify an end-component with the set Q′ ∪ {(q,a) ∣ q ∈ Q′, a ∈ A′_q} of states and state-action pairs, and we say that it is supported by the set Q′ of states. The (componentwise) union of two end-components with nonempty intersection is an end-component, thus one can define maximal end-components. We denote by MEC(M) the set of maximal end-components of M, which is computable in polynomial time [10], and by EC(M) the set of all end-components of M.
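A sketch of the classical maximal-end-component decomposition [10], assuming the MDP representation above and using networkx for strongly connected components: state-action pairs whose support leaves the SCC of their state are removed until stabilization.

```python
import networkx as nx

def maximal_end_components(mdp):
    """Maximal end-components of an MDP in the representation above, following
    the classical iterative algorithm [10]: repeatedly drop state-action pairs
    whose support leaves the strongly connected component of their state."""
    delta = dict(mdp.delta)
    while True:
        graph = nx.DiGraph()
        graph.add_nodes_from(q for (q, _) in delta)
        for (q, a), dist in delta.items():
            for succ, p in dist.items():
                if p > 0:
                    graph.add_edge(q, succ)
        scc_of = {}
        for scc in nx.strongly_connected_components(graph):
            for q in scc:
                scc_of[q] = frozenset(scc)
        pruned = {(q, a): dist for (q, a), dist in delta.items()
                  if all(succ in scc_of[q] for succ, p in dist.items() if p > 0)}
        if pruned == delta:
            break
        delta = pruned
    mecs = {}
    for (q, a) in delta:
        mecs.setdefault(scc_of[q], []).append((q, a))
    # Each MEC is returned as (set of states, list of state-action pairs).
    return [(set(states), pairs) for states, pairs in mecs.items()]
```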

Histories and Strategies.

A strategy is a function σ : 𝖧𝗂𝗌𝗍(M) → 𝒟(A) such that Supp(σ(ρ)) ⊆ A_q for all histories ρ ∈ 𝖧𝗂𝗌𝗍(M) ending in 𝗅𝖺𝗌𝗍(ρ) = q. A strategy is pure if all histories are mapped to Dirac distributions. A strategy σ is memoryless if σ(ρ) = σ(ρ′) for all histories ρ, ρ′ such that 𝗅𝖺𝗌𝗍(ρ) = 𝗅𝖺𝗌𝗍(ρ′). We sometimes view memoryless strategies as functions σ : Q → 𝒟(A). A strategy σ uses finite memory (of size k) if there exists a right congruence ≈ over 𝖧𝗂𝗌𝗍(M) (i.e., such that if ρ ≈ ρ′, then ρ·a·q ≈ ρ′·a·q for all ρ, ρ′ ∈ 𝖧𝗂𝗌𝗍(M) and (a,q) ∈ A × Q) of finite index k such that σ(ρ) = σ(ρ′) for all histories ρ ≈ ρ′ with 𝗅𝖺𝗌𝗍(ρ) = 𝗅𝖺𝗌𝗍(ρ′).

Objectives.

An objective φ is a Borel set of runs. We denote by ℙ^σ_q(M,φ) the standard probability measure on the sigma-algebra over the set of (infinite) runs of M with initial state q, generated by the cylinder sets spanned by the histories [3]. Given a history ρ = q_0 a_0 q_1 … q_k, the cylinder set Cyl(ρ) = ρ·(A·Q)^ω has probability ℙ^σ_q(M, Cyl(ρ)) = ∏_{i=0}^{k−1} σ(q_0 a_0 q_1 … q_i)(a_i) · δ(q_i, a_i)(q_{i+1}) if q_0 = q, and probability 0 otherwise. We say that a run ρ is compatible with strategy σ if ℙ^σ_q(M, Cyl(ρ)) > 0.
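As a sanity check, the cylinder-set probability can be transcribed directly (a hypothetical helper using the MDP representation sketched above): multiply, along the history, the probability that the strategy chooses a_i with the probability of the transition to q_{i+1}.

```python
def cylinder_probability(mdp, strategy, history, q0):
    """P^sigma_q0(Cyl(rho)) for a history rho given as an alternating list
    [q0, a0, q1, a1, ..., qk]; `strategy(prefix)` returns a dict mapping
    actions to probabilities (a distribution in D(A))."""
    if not history or history[0] != q0:
        return 0.0
    prob = 1.0
    for i in range(0, len(history) - 1, 2):
        prefix, a, succ = history[:i + 1], history[i + 1], history[i + 2]
        prob *= strategy(prefix).get(a, 0.0)
        prob *= mdp.delta.get((history[i], a), {}).get(succ, 0.0)
    return prob
```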

We consider the following standard objectives for an MDP M:

  • safety objective: given a set T ⊆ Q of states, let Safe(T) = {q_0 a_0 q_1 a_1 … ∈ 𝖧𝗂𝗌𝗍^ω(M) ∣ ∀i ≥ 0 : q_i ∈ T};

  • reachability objective: given a set T ⊆ Q of states, let Reach(T) = {q_0 a_0 q_1 a_1 … ∈ 𝖧𝗂𝗌𝗍^ω(M) ∣ ∃i ≥ 0 : q_i ∈ T};

  • parity objective: given a priority function p : Q → ℕ, let Parity(p) = {π ∈ 𝖧𝗂𝗌𝗍^ω(M) ∣ min{p(q) ∣ q ∈ Inf(π)} is even}.

It is standard to cast safety and reachability objectives as special cases of parity objectives, using sink states. Given an objective φ, we denote by ¬φ = 𝖧𝗂𝗌𝗍^ω(M) ∖ φ the complement of φ. We say that a run π ∈ 𝖧𝗂𝗌𝗍^ω(M) satisfies φ if π ∈ φ, and that it violates φ otherwise.

It is known that under arbitrary strategies, with probability 1 the set Inf(π) of states occurring infinitely often along a run π is the support of an end-component [9, 10].

Lemma 1 ([9, 10]).

Given an MDP M, for all states q ∈ Q and all strategies σ, we have ℙ^σ_q(M, {π ∣ Inf(π) is the support of an end-component}) = 1.

An end-component D ∈ EC(M) is positive under strategy σ from q if ℙ^σ_q(M, {π ∣ Inf(π) = D}) > 0. By Lemma 1, we have ∑_{D∈EC(M)} ℙ^σ_q(M, {π ∣ Inf(π) = D}) = 1.

Value and qualitative satisfaction.

A strategy σ is winning for objective φ from q with probability (at least) α if ℙ^σ_q(M,φ) ≥ α. We denote by Val_q(M,φ) = sup_σ ℙ^σ_q(M,φ) the value of objective φ from state q. A strategy σ is optimal if ℙ^σ_q(M,φ) = Val_q(M,φ).

We consider the following classical qualitative modes of winning. Given an objective φ, a state q is:

  • almost-sure winning if there exists a strategy σ that is winning with probability 1, that is, ℙ^σ_q(M,φ) = 1;

  • limit-sure winning if Val_q(M,φ) = 1, or equivalently, for all ε > 0 there exists a strategy σ such that ℙ^σ_q(M,φ) ≥ 1−ε.

We denote by AS(M,φ) and LS(M,φ) the set of almost-sure and limit-sure winning states, respectively. In MDPs, it is known that AS(M,φ)=LS(M,φ) and pure memoryless optimal strategies exist for parity objectives φ [20, 9].

We recall that the value of a parity objective φ = Parity(p) is the same from every state of an end-component D, and is either 0 or 1; it does not depend on the precise value of the (non-zero) transition probabilities, but only on the supports Supp(δ(q,a)) of the transition function at the state-action pairs (q,a) in D [10]. When the value is 1, there exists a pure memoryless strategy σ such that ℙ^σ_q(M,φ) = 1 for all states q ∈ D. If such a strategy exists, then D is said to be φ-winning, and otherwise φ-losing.

2.2 Multiple-Environment MDP

A multiple-environment MDP (MEMDP) over a finite set E of environments is a tuple M = ⟨Q, (A_q)_{q∈Q}, (δ_e)_{e∈E}⟩, where M[e] = ⟨Q, (A_q)_{q∈Q}, δ_e⟩ is an MDP that models the behaviour of the system in the environment e ∈ E. The state space is identical in all M[e] (e ∈ E), only the transition probabilities may differ. We sometimes refer to the environments of M as the MDPs {M[e] ∣ e ∈ E} rather than the set E itself. For E′ ⊆ E, let M[E′] be the MEMDP M over the set E′ of environments. We denote by M[¬e] the MEMDP M over environments E ∖ {e}, and by ⋂_{e∈E} M[e] the MDP ⟨Q, (A_q)_{q∈Q}, δ⟩ such that δ(q,a) is the uniform distribution over ⋂_{e∈E} Supp(δ_e(q,a)) for all q ∈ Q and a ∈ A.

A transition t = (q,a,q′) is revealing in M if K_t = {e ∈ E ∣ q′ ∈ Supp(δ_e(q,a))} is a strict subset of E (K_t ⊊ E). We say that K_t, which is the set of environments where the transition t = (q,a,q′) is possible, is the knowledge after observing transition t. An MEMDP is in revealed form if for all revealing transitions t = (q,a,q′), the state q′ is a sink in all environments, that is, Supp(δ_e(q′,a)) = {q′} for all environments e ∈ E and all actions a ∈ A_{q′}. By extension, we call knowledge after a history ρ the set of environments in which all transitions of ρ are possible.
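Assuming an MEMDP is given as a dictionary mapping each environment to an MDP in the representation sketched in Section 2.1, the knowledge K_t and the revealing check can be transcribed as follows (hypothetical helpers, not from the paper).

```python
def knowledge(memdp, q, a, succ):
    """Set K_t of environments in which the transition t = (q, a, succ) is
    possible; `memdp` maps each environment name to its MDP."""
    return {e for e, mdp in memdp.items()
            if mdp.delta.get((q, a), {}).get(succ, 0.0) > 0}

def is_revealing(memdp, q, a, succ):
    # t is revealing iff observing it rules out at least one environment.
    return knowledge(memdp, q, a, succ) != set(memdp)

def knowledge_after_history(memdp, history):
    """Environments consistent with every transition of an alternating
    history [q0, a0, q1, ..., qk]."""
    envs = set(memdp)
    for i in range(0, len(history) - 1, 2):
        envs &= knowledge(memdp, history[i], history[i + 1], history[i + 2])
    return envs
```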

Decision Problems.

We are interested in synthesizing a single strategy σ with guarantees in all environments, without knowing in which environment σ is executing. We consider reachability, safety, and parity objectives.

A state q is almost-sure winning in M for objective φ if there exists a strategy σ such that in all environments e ∈ E, we have ℙ^σ_q(M[e],φ) = 1, and we call such a strategy σ almost-sure winning. A state q is limit-sure winning in M for objective φ if for all ε > 0, there exists a strategy σ such that in all environments e ∈ E we have ℙ^σ_q(M[e],φ) ≥ 1−ε, and we say that such a strategy σ is (1−ε)-winning.

We denote by AS(M,φ) (resp., LS(M,φ)) the set of all almost-sure (resp., limit-sure) winning states in M for objective φ. We consider the membership problem for almost-sure (resp., limit-sure) winning, which asks whether a given state q is almost-sure (resp., limit-sure) winning in M for objective φ. We refer to these membership problems as qualitative problems.

We are also interested in the quantitative problems. Given an MEMDP M, a parity objective φ, and a probability threshold α > 0, we are interested in the existence of a strategy σ satisfying ℙ^σ_q(M[e],φ) ≥ α for all e ∈ E. We present an approximation algorithm for the quantitative problem, solving the gap problem: given an MEMDP M, a state q, a parity objective φ, and thresholds 0 < α < 1 and ε > 0, answer

  • Yes if there exists a strategy σ such that for all e ∈ E, we have ℙ^σ_q(M[e],φ) ≥ α,

  • No if for all strategies σ, there exists e ∈ E with ℙ^σ_q(M[e],φ) < α − ε,

  • and arbitrarily otherwise.

The gap problem is an instance of a promise problem, which guarantees a correct answer on two disjoint sets of inputs, namely the positive and negative instances (which do not necessarily cover all inputs), while giving no guarantee on the remaining inputs [11, 14].

Results.

We solve the membership problem for limit-sure winning with parity objectives φ (i.e., deciding whether a given state q is limit-sure winning, that is, q ∈ LS(M,φ)), providing a PSPACE algorithm with a matching complexity lower bound, and showing that the problem is solvable in polynomial time when the number of environments is fixed. Our solution relies on the solution of almost-sure winning, which is known to be PSPACE-complete for reachability [24] and Rabin objectives [23]. We revisit the solution of almost-sure winning and give a simple characterization for safety objectives (which is also PSPACE-complete), that can easily be extended to parity objectives. A corollary of our characterization is that pure (non-randomized) strategies are sufficient for both limit-sure and almost-sure winning, which was known only for acyclic MEMDPs and reachability objectives [24, Lemma 12].

For the gap problem, we present a double-exponential-space procedure to approximate the value α that can be achieved in all environments, up to an arbitrary precision ε.

3 Almost-Sure Winning

It is known that the membership problem for almost-sure winning in MEMDPs is PSPACE-complete with reachability objectives [24] as well as with Rabin objectives [23], which are expressively equivalent to parity objectives. We revisit the membership problem for almost-sure winning with parity objectives, as it will be instrumental in the solution of limit-sure winning. We present a conceptually simple characterization of the winning region for almost-sure winning, from which we derive a PSPACE algorithm, thus matching the known complexity for almost-sure Rabin objectives. A corollary of our characterization is that pure (non-randomized) strategies are sufficient for both limit-sure and almost-sure winning, which was known only for acyclic MEMDPs and reachability objectives [24, Lemma 12].

Theorem 2 ([24],[23]).

The membership problem for almost-sure winning in MEMDPs with a reachability, safety, or Rabin objective is PSPACE-complete.

To solve the membership problem for a parity objective φ, we first convert M into an MEMDP M′ in revealed form with state space Q ∪ {q𝗐𝗂𝗇, q𝗅𝗈𝗌𝖾}, and each revealing transition t = (q,a,q′) in M is redirected in M′ to q𝗐𝗂𝗇 if q′ ∈ AS(M[K_t],φ) is almost-sure winning when the set of environments is the knowledge K_t after observing transition t, and to q𝗅𝗈𝗌𝖾 otherwise. In order to decide if q′ ∈ AS(M[K_t],φ), we need to solve the membership problem for an MEMDP with strictly fewer environments than in M since K_t ⊊ E, which will lead to a recursive algorithm. The base case of the solution is MEMDPs with one environment, which are equivalent to plain MDPs.

It is easy to see that AS(M,φ) ∪ {q𝗐𝗂𝗇} = AS(M′, φ ∪ Reach(q𝗐𝗂𝗇)) for all prefix-independent objectives φ, and we can transform the objective φ ∪ Reach(q𝗐𝗂𝗇) into an objective of the same type as φ (for example, if φ is a parity objective, then assigning the smallest even priority to q𝗐𝗂𝗇 turns the objective φ ∪ Reach(q𝗐𝗂𝗇) into a pure parity objective).

Hence, the main difficulty is to solve the membership problem for MEMDPs in revealed form.

By definition, the almost-sure winning region W = AS(M, Parity(p)) for a parity objective in an MEMDP M is such that there exists a strategy σ that is almost-sure winning for the parity objective from every state q ∈ W in every MDP M[e] (where e is an environment of M). In contrast, we show the following characterization (note the order of the quantifiers).

Lemma 3.

Given an MEMDP M in revealed form with state space Q, if W ⊆ Q is such that in every environment e, from every state q ∈ W, there exists a strategy σ_e that is almost-sure winning for the parity objective Parity(p) in M|_W[e] from q, then W ⊆ AS(M, Parity(p)). Moreover, for all q ∈ W, there exists a pure strategy with memory of size |Q|·|E| ensuring Parity(p) almost-surely from q in M.

The characterization in the first part of Lemma 3 holds simply because parity objectives are prefix-independent (runs that differ by a finite prefix are either both winning or both losing), and thus the characterization holds for all prefix-independent objectives.

Following the characterization in Lemma 3, we can construct an algorithm that computes the revealed form (using recursive calls to determine the winning regions of MEMDPs with smaller sets of environments), and then computes the winning region by a fixpoint iteration, computable in polynomial time for a fixed set of environments. The depth of the recursive calls is bounded by the number of environments, giving a PSPACE algorithm. In addition, we show that pure strategies with exponential memory O(|Q|·|E|·2^{|E|}) are sufficient. It is easy to show that exponential memory is necessary, using the construction establishing the PSPACE lower bound of the problem [23, Theorem 7].

Theorem 4.

The membership problem for almost-sure parity in MEMDPs is PSPACE-complete. Pure exponential-memory strategies are sufficient for almost-sure winning in MEMDPs with parity (thus also reachability and safety) objectives. When the number of environments is fixed, the problem is solvable in polynomial time.

4 Limit-Sure Winning

We refer to the examples of the duplicate card and the missing card in Section 1 to illustrate the difference between limit-sure and almost-sure winning. We present in Section 4.1 two other scenarios where limit-sure winning and almost-sure winning do not coincide, which will be useful to illustrate the key ideas in the algorithmic solution.

4.1 Examples

Figure 3: An end-component {q1,q2} with different transition probabilities in environments e1 and e2.

In the example of Figure 3, the set D = {q1,q2} is an end-component in both environments e1 and e2 (the actions are shown in the figures only when relevant, that is, in q2). However, the transition probabilities from q1 are different in the two environments e1 and e2, and intuitively we can learn (with high probability) in which environment we are by playing c for a long enough (but finite) time and collecting the frequency of the visits to q1 and q2. Then, in order to reach the target q3, if there are more q1’s than q2’s in the history we play a in q2, otherwise b. The intuition is that the histories with more q1’s than q2’s have a high probability (more than 1−ε) in M[e1] and a small probability (less than ε) in M[e2], where ε can be made arbitrarily small (however not 0) by playing c for sufficiently long. Hence q1 is limit-sure winning, but not almost-sure winning.

In the second scenario (Figure 4), the transition probabilities do not matter. The objective is to visit some state in {q3,q4,q5} infinitely often (those states have priority 0, the other states have priority 1). The state q1 is limit-sure winning, but not almost-sure winning. To win with probability 1−ε, a strategy can play a (in q2) for a sufficiently long time, then switch to playing b (unless q5 was reached before that). The crux is that playing a does not harm, as it does not leave the limit-sure winning region, but it ensures in at least one environment (namely, e1) that the objective is satisfied with probability 1 (by reaching q5). This allows us to “discard” the environment e1 if q5 was not reached, and to switch to a strategy that is winning with probability at least 1−ε in e2, namely by playing b. With an arbitrary number of environments, the difficulty is to determine in which order the environments can be “discarded”.

Note that the transition (q2,a,q1) is not revealing, since it is present in both environments. However, after crossing this transition a large number of times, we can still learn that the environment is e2 (and be mistaken with arbitrarily small probability). In contrast, the transition (q2,a,q5) is revealing and the environment is e1 with certainty upon crossing that transition.

To solve the membership problem for limit-sure parity, we first convert M into a revealed-form MEMDP M′, similar to the case of almost-sure winning, with the obvious difference that revealing transitions t = (q,a,q′) of M[e] are redirected in M′[e] to q𝗐𝗂𝗇 if q′ ∈ LS(M[K_t],φ) is limit-sure winning when the set of environments is the knowledge K_t after observing transition t. Thus, we aim for a recursive algorithm, where the base case is limit-sure winning in MEMDPs with one environment, which are equivalent to plain MDPs, for which limit-sure and almost-sure parity coincide. Note that the examples of Figure 3 and Figure 4 are in revealed form.

Figure 4: The set {q1,q2} is an end-component in e2, not in e1.

4.2 Common End-Components and Learning

A common end-component (CEC) of an MEMDP M = ⟨Q, (A_q)_{q∈Q}, (δ_e)_{e∈E}⟩ is a pair (Q′, A′) that is an end-component in M[e] for all environments e ∈ E. A CEC D is trivial if it contains a single state. D is said to be winning for a parity condition Parity(p) if for all e ∈ E, there is a strategy in M[e] which, when started inside D, ensures Parity(p) with probability 1. Notice that since D is a common end-component, such a strategy ensures Parity(p) with probability 1 in M[e] if and only if it does in M[e′] for any other environment e′ ∈ E.

We note that the common end-components of an MEMDP are the end-components of the MDP ⋂_{e∈E} M[e], assuming M is in revealed form, and thus can be computed using the standard algorithm for end-components [10].

Lemma 5.

Consider an MEMDP M in revealed form. The common end-components of M are exactly the end-components of ⋂_{e∈E} M[e].
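Under the same assumed representation, Lemma 5 suggests the following way to compute common end-components: build the MDP ⋂_{e∈E} M[e], whose transition at (q,a) keeps the successors possible in every environment, and run the MEC routine sketched earlier on it (a sketch, not the paper's implementation).

```python
def intersection_mdp(memdp):
    """The MDP whose transition at (q, a) is the uniform distribution over the
    successors possible in *every* environment (pairs with an empty
    intersection are dropped); assumes all environments share the sets A_q."""
    delta = {}
    some_env = next(iter(memdp.values()))
    for (q, a) in some_env.delta:
        common = set.intersection(*(mdp.support(q, a) for mdp in memdp.values()))
        if common:
            delta[(q, a)] = {succ: 1.0 / len(common) for succ in common}
    return MDP(delta)

# By Lemma 5 (for MEMDPs in revealed form), the maximal common end-components
# can then be obtained with the MEC routine sketched earlier:
#     maximal_end_components(intersection_mdp(memdp))
```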

A CEC may have different transition probabilities in different environments. We call a CEC distinguishing if it contains a transition (q,a,q′) (called a distinguishing transition) such that δ_e(q,a)(q′) ≠ δ_f(q,a)(q′) for some environments e, f ∈ E. Given a distinguishing transition (q,a,q′) and an environment e, define K_1 = {f ∈ E ∣ δ_f(q,a)(q′) = δ_e(q,a)(q′)} and K_2 = E ∖ K_1. We say that (K_1,K_2) is a distinguishing partition of D that is induced by the distinguishing transition (q,a,q′) and environment e.

Distinguishing transitions can be used to learn the partition (K_1,K_2), that is, to guess (correctly with high probability) whether the current environment is in K_1 or K_2, as in the example of Figure 3, where the set D = {q1,q2} is a distinguishing end-component with distinguishing transition (q1, ·, q2) and partition ({e1},{e2}). A distinguishing CEC may have several distinguishing transitions and induced partitions.

We formalize how a strategy can distinguish between K1 and K2 with high probability inside a distinguishing CEC. First let us recall Hoeffding’s inequality.

Theorem 6 (Hoeffding’s Inequality [15]).

Let X_1, X_2, …, X_n be a sequence of independent and identically distributed Bernoulli variables with ℙ[X_i = 1] = p, and write S_n = X_1 + ⋯ + X_n. For all t > 0, ℙ[S_n − 𝔼[S_n] ≥ t] ≤ e^{−2t²/n}, and ℙ[𝔼[S_n] − S_n ≥ t] ≤ e^{−2t²/n}.

Given a distinguishing CEC with distinguishing partition (K_1,K_2) induced by a transition (q,a,q′), a strategy can sample the distribution δ_e(q,a) by repeating the following two phases: first, use a pure memoryless strategy to almost-surely visit q, then play action a; by repeating this long enough (the precise bound depends on a given ε and is derived from Theorem 6) while storing the frequency of visits to q′ in the second phase, we can learn and guess to which block K_i the environment belongs, with a sufficiently small probability of mistake to ensure winning with probability 1−ε.
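A worked instance of how Theorem 6 yields the length of the sampling phase, under the simplest symmetric reading (classify by whichever candidate probability is closer to the empirical frequency, so an error requires a deviation of at least η/2); the constants here are illustrative, not the paper's exact bound.

```python
import math

def samples_needed(eta, eps):
    """Samples n of the distinguishing transition such that the empirical
    frequency deviates from its mean by eta/2 or more with probability at
    most eps (one-sided bound of Theorem 6):
        exp(-2 * n * (eta/2)**2) <= eps   iff   n >= 2 * ln(1/eps) / eta**2."""
    return math.ceil(2 * math.log(1 / eps) / eta ** 2)

def classify(successes, n, p1, p2):
    """Guess the block (K1 with candidate probability p1, or K2 with p2)
    whose candidate probability is closer to the observed frequency."""
    freq = successes / n
    return "K1" if abs(freq - p1) <= abs(freq - p2) else "K2"

# Example: candidate probabilities differing by eta = 0.2, error at most 1e-6.
print(samples_needed(0.2, 1e-6))   # 691 samples suffice
```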

Lemma 7.

Given an MEMDP M containing a distinguishing common end-component D with partition (K_1,K_2) induced by a distinguishing transition, and a parity objective φ, for all states q_0 in D, all pairs of strategies σ_1, σ_2, and all ε > 0, there exists a strategy σ such that:

ℙ^σ_{q_0}(M[e],φ) ≥ (1−ε)·ℙ^{σ_1}_{q_0}(M[e],φ) for all e ∈ K_1,
ℙ^σ_{q_0}(M[e],φ) ≥ (1−ε)·ℙ^{σ_2}_{q_0}(M[e],φ) for all e ∈ K_2.

Moreover, the strategy σ is pure if both σ_i are pure; and if each strategy σ_i uses a memory of size m_i, then σ uses finite memory of size m_1 + m_2 + ⌈8·log(1/ε)/η²⌉², where η = min({|δ_e(q,a)(q′) − δ_f(q,a)(q′)| ∣ e, f ∈ E, q, q′ ∈ Q, a ∈ A} ∖ {0}).

It follows that the membership problem for limit-sure winning can be decomposed into subproblems where the set of environments is one of the blocks Ki in the partition.

Lemma 8.

Given an MEMDP M containing a distinguishing common end-component D with a partition (K_1,K_2) induced by a distinguishing transition, and a parity objective φ, the following equivalence holds: D ⊆ LS(M,φ) if and only if D ⊆ LS(M[K_1],φ) and D ⊆ LS(M[K_2],φ).

4.3 Characterization and Algorithm

Here, we assume that MEMDPs are in revealed form with sink states q𝗐𝗂𝗇 and q𝗅𝗈𝗌𝖾.

We show that the winning region W=LS(M,φ) for limit-sure parity is a closed set: from every state qW, there exists an action a ensuring in all environments that all successors of q are in W. We call such actions limit-sure safe for q. We show in Lemma 9 that a limit-sure safe action always exists in limit-sure winning states. Note that playing actions that are not limit-sure safe may be useful for limit-sure winning, as in the example of Figure 3 where action a is limit-sure safe, but action b is not (from q2).
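Given a candidate winning region W, checking which actions are limit-sure safe in a state is immediate (a hypothetical helper, reusing the dict-of-MDPs representation assumed in the earlier sketches): every successor, in every environment, must stay in W.

```python
def limit_sure_safe_actions(memdp, q, W):
    """Actions of q all of whose successors stay in W, in every environment
    (Lemma 9 guarantees at least one such action when q is limit-sure winning)."""
    some_env = next(iter(memdp.values()))
    return {a for (s, a) in some_env.delta
            if s == q and all(mdp.support(q, a) <= W
                              for mdp in memdp.values())}
```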

By definition of limit-sure winning, if a state q is not limit-sure winning, there exists ε_q > 0 such that for all strategies σ, there exists an environment e ∈ E such that ℙ^σ_q(M[e],φ) < 1 − ε_q. We denote by ε_0 = min{ε_q ∣ q ∈ Q ∖ LS(M,φ)} a uniform bound.

Lemma 9.

Given an MEMDP M (in revealed form) over environments E, a parity objective φ, and a state q, if q ∈ LS(M,φ) is limit-sure winning, then there exists an action a such that for all environments e ∈ E, all successors of q are limit-sure winning, i.e., Supp(δ_e(q,a)) ⊆ LS(M,φ).

Given an MEMDP M, consider the limit-sure winning region W=LS(M,φ) for φ=Parity(p). For the purpose of the analysis, consider the (memoryless) randomized strategy σLS that plays uniformly at random all limit-sure safe actions in every state qW, which is well-defined by Lemma 9.

Consider an arbitrary environment e, and an end-component D in M[e] that is positive under σLS (recall Lemma 1 and the definition afterward). There are three possibilities:

  1. D is not a common end-component (as in the example of Figure 4, for D = {q1,q2} in M[e2]), that is, D is not an end-component in some environment e′ (in the example, e′ = e1); then we can learn (and be mistaken with arbitrarily small probability) that we are not in e′, reducing the problem to an MEMDP with fewer environments (namely, M[¬e′]);

  2. D is a common end-component and is distinguishing (as in the example of Figure 3, for D = {q1,q2}); then we can also learn a distinguishing partition (K_1,K_2) and reduce the problem to MEMDPs with fewer environments (namely, M[K_1] and M[K_2]);

  3. D is a common end-component and is non-distinguishing; then we show in Lemma 10 below that D is almost-sure winning (D ⊆ AS(M,φ)), obviously in all environments.

Lemma 10.

Given an MEMDP M over environments E (in revealed form), a parity objective φ, and a state q, if q ∈ LS(M,φ), then all non-distinguishing common end-components D that are positive under strategy σLS from q in M[e] (for some e ∈ E) are almost-sure winning for φ (that is, D ⊆ AS(M,φ)).

Our approach to compute the limit-sure winning states is to first identify the distinguishing CECs that are limit-sure winning. We can compute the maximal CECs using Lemma 5, and note that a maximal CEC containing a distinguishing CEC is itself distinguishing, so it is sufficient to consider maximal CECs. By Lemma 8, we can decide if a given distinguishing CEC is limit-sure winning using a recursive procedure on MEMDPs with fewer environments. We show in Lemma 11 below that we can replace the limit-sure CECs by a sink state q𝗐𝗂𝗇.

Lemma 11.

Given an MEMDP M with parity objective φ and a set T ⊆ LS(M,φ) of limit-sure winning states, we have LS(M,φ) = LS(M, φ ∪ Reach(T)).

We can now assume that MEMDPs contain no limit-sure winning distinguishing CEC, and present a characterization for the remaining possibility, illustrated by the scenario of Figure 4, where playing the action a (in q2, forever) ensures, in some environment (namely, e1), almost-sure satisfaction of the parity objective while remaining inside the limit-sure winning region in all other environments.

Lemma 12.

Consider an MEMDP M (in revealed form) over environments E with |E| ≥ 2, that contains no limit-sure winning distinguishing common end-component, and a parity objective φ. Writing T_e = LS(M[¬e],φ), we have the following: LS(M,φ) = AS(M, Reach(⋃_{e∈E} AS(M[e], φ ∩ Safe(T_e)))).

Algorithm 1 𝖫𝖲_𝖯𝖺𝗋𝗂𝗍𝗒(M,p).

Algorithm Overview.

Given an MEMDP M = ⟨Q, (A_q)_{q∈Q}, (δ_e)_{e∈E}⟩, the algorithm proceeds by recursion on the size of the environment set E (Algorithm 1). The base case is that of a singleton set E, where LS(M,φ) = AS(M,φ), which can be computed in polynomial time.

Assume |E| ≥ 2. We first convert M into an MEMDP M′ in revealed form with state space Q ∪ {q𝗐𝗂𝗇, q𝗅𝗈𝗌𝖾}, and each revealing transition t = (q,a,q′) in M is redirected in M′ to q𝗐𝗂𝗇 if q′ ∈ LS(M[K_t],φ) is limit-sure winning when the set of environments is the knowledge K_t after observing transition t, and to q𝗅𝗈𝗌𝖾 otherwise. Notice that each query q′ ∈ LS(M[K_t],φ) uses a set K_t that is strictly smaller than E.

We now assume that M is in revealed form and we compute the maximal end-components of the MDP ⋂_{e∈E} M[e]; these are the maximal common end-components of M by Lemma 5. For each distinguishing maximal CEC D, we determine whether it is limit-sure winning using the condition of Lemma 8, namely that D ⊆ LS(M[K_i],φ) (for i = 1,2) where (K_1,K_2) is a partition of E induced by a distinguishing transition of D, which is computed by recursive calls to the algorithm. We replace D by q𝗐𝗂𝗇 if it is limit-sure winning, which yields an MEMDP without limit-sure winning distinguishing CECs, and we can apply Lemma 12: for each environment e ∈ E, we compute T_e = LS(M[¬e],φ), which is done by |E| separate recursive calls, and we compute the sets AS(M[e], φ ∩ Safe(T_e)) using standard MDP algorithms (we restrict the state space to T_e and compute the almost-sure winning states for φ). We then solve the almost-sure reachability problem in M for the target set ⋃_{e∈E} AS(M[e], φ ∩ Safe(T_e)).

Thus each recursive step takes polynomial time (besides the recursive calls), and because each recursive call decreases the size of E, the depth of the recursion is bounded by |E|. It follows that the procedure runs in polynomial space. The PSPACE lower bound follows from the same reduction as for almost-sure winning [23, Theorem 7], since the MEMDP constructed in the reduction is acyclic, thus almost-sure and limit-sure winning coincide.
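Since the listing of Algorithm 1 is not reproduced here, the following skeleton summarizes the recursion just described. It is a reading aid only: the helper functions (revealed_form, restrict, max_cecs, distinguishing_partition, collapse_to_win, as_parity_mdp, as_parity_safe_mdp, as_reach_memdp) are placeholders for the constructions of the text, not runnable library calls.

```python
def ls_parity(memdp, priority):
    """Limit-sure winning region LS(M, Parity(p)), schematically.  D denotes
    the state set of a maximal common end-component."""
    envs = list(memdp)
    if len(envs) == 1:                                   # base case: plain MDP,
        return as_parity_mdp(memdp[envs[0]], priority)   # where LS = AS
    # 1. Revealed form: redirect each revealing transition t to q_win iff its
    #    target is limit-sure winning for the knowledge K_t (recursive call).
    M = revealed_form(memdp, priority,
                      solve=lambda K: ls_parity(restrict(memdp, K), priority))
    # 2. Distinguishing maximal CECs are limit-sure winning iff both blocks of
    #    a distinguishing partition are (Lemma 8); collapse winners to q_win.
    for D in max_cecs(M):
        K1, K2 = distinguishing_partition(M, D)
        if K1 is not None \
                and D <= ls_parity(restrict(M, K1), priority) \
                and D <= ls_parity(restrict(M, K2), priority):
            M = collapse_to_win(M, D)
    # 3. Lemma 12: with T_e = LS(M[not e], phi), the target is the union over e
    #    of the states almost-sure winning for phi while staying in T_e, and the
    #    result is the almost-sure winning region for reaching that target in M.
    target = set()
    for e in envs:
        T_e = ls_parity(restrict(M, set(envs) - {e}), priority)
        target |= as_parity_safe_mdp(M[e], priority, T_e)
    return as_reach_memdp(M, target)
```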

Note that the proof of Lemma 12 constructs a pure strategy that achieves the objective with probability at least 1−ε from the limit-sure winning states, and that the strategies constructed in Lemmas 8 and 11 to witness limit-sure winning are also pure (in Lemma 8, the construction assumes that pure strategies are sufficient for fewer environments, which allows a proof by induction since pure strategies are sufficient in MDPs, i.e., in a single environment).

Theorem 13.

The membership problem for limit-sure parity objectives in MEMDPs is PSPACE-complete and pure exponential-memory strategies are sufficient, i.e., if a state q is limit-sure winning, then for all ε > 0 there exists a pure exponential-memory strategy that ensures the objective is satisfied with probability at least 1−ε from q. When the number of environments is fixed, the problem is solvable in polynomial time.

The time complexity of Algorithm 1 is established as follows. Let us consider a single recursive call. The maximal end-components of ⋂_{e∈E} M[e] can be computed in O(|Q|·|δ|) where |δ| denotes the number of transitions. Then, determining whether each MEC is distinguishing, and replacing them with sink states, can be done in time O(|δ|·|E|) since one needs to go over each transition and check whether its probability differs in two environments. The last step requires solving almost-sure parity and safety for MDPs defined for each e ∈ E, which can be done in time O(|E|·|Q|·|δ|) (similarly as in the discussion following Theorem 4). The most costly operation is almost-sure reachability for the MEMDP M, which by Theorem 4 takes O(|Q|⁴·|E|·|A|·2^{|E|}). There are 2^{|E|} recursive calls (the algorithm can be run once for each subset of E using memoization), so overall we get O(|Q|⁴·|E|·|A|·2^{2|E|}).

5 The Gap Problem

The goal of this section is to give a procedure to solve the gap problem for parity objectives. For this, we show that an arbitrary strategy in M can be imitated by a finite-memory one (with a computable bound on the memory size) while achieving the same probability of winning up to ε in all environments. Once this is established, we show how to guess a finite-memory strategy of appropriate size, and check that it is a solution to the gap problem.

To establish the memory bound for such an ε-approximation, we need a few intermediate lemmas. First, we define a transformation on MEMDPs consisting in collapsing non-distinguishing maximal CECs (MCECs) of the MEMDP M; the resulting MEMDP is denoted purge(M). We show that M and purge(M) have the same probabilities of satisfaction of the considered parity objective under all environments.

Intuitively, removing non-distinguishing MCECs ensures that in purge(M), under all strategies, with high probability, within a fixed number of steps, either a maximal CEC is reached (which is either distinguishing, or non-distinguishing but trivial – recall that a trivial CEC contains a single absorbing state) or enough samples are gathered to improve the knowledge about the current environment. This observation will help us construct the finite-memory strategy inductively, since in each case the knowledge can be improved correctly with high probability: in trivial MCECs, the strategy is extended arbitrarily; inside distinguishing MCECs, the strategy can be extended so that it stays inside the MCEC while sampling distinguishing transitions with any desired precision as in Lemma 7; last, if no MCEC is reached but enough samples are gathered along the way, we prove that the knowledge can also be improved with high probability. The final strategy is obtained by combining finite-memory strategies constructed inductively for smaller sets of environments.

Purge: Removing Non-Distinguishing MCECs.

We present key properties of the transformation purge(M) in Lemma 14 below, showing that strategies of M can be imitated in purge(M), and vice versa. Hence the value of parity objectives in M and purge(M) coincide.

Lemma 14.

Consider an MEMDP M = ⟨Q, A, (δ_e)_{e∈E}⟩ and an objective φ = Parity(p). There exists an MEMDP purge(M) = ⟨Q′, A, (δ′_e)_{e∈E}⟩, with a map f : Q → Q′ relating the states of M and those of purge(M), with the following properties:

  • The only non-distinguishing MCECs of purge(M) are the trivial q𝗐𝗂𝗇 and q𝗅𝗈𝗌𝖾.

  • For all states q ∈ Q and strategies σ for M, there exists σ′ with ℙ^σ_q(M,φ) ≤ ℙ^{σ′}_{f(q)}(purge(M),φ).

  • For all states q′ ∈ Q′ and strategies σ′ for purge(M), and all q ∈ f^{−1}(q′), there exists a strategy σ with ℙ^σ_q(M,φ) = ℙ^{σ′}_{q′}(purge(M),φ). Furthermore, if σ′ is an m-memory strategy, then σ can be chosen to be an (m + |Q|·|A|)-memory strategy.

Note that while purge(M) is computable in polynomial time, our results only need that purge(M) can be constructed to establish the existence of finite-memory strategies (Lemma 16).

Learning While Playing.

We now show that after collapsing non-distinguishing MCECs, within n steps (for n large enough), with high probability, we either reach an MCEC (which is either distinguishing or trivial) or collect a large number of samples of distinguishing transitions whose empirical average is close to their mean. Intuitively, this means that either the knowledge can be improved after n steps using the collected samples while bounding the probability of error, or an MCEC is reached.

For a history h, let |h|_{q,a} denote the number of occurrences of the state-action pair (q,a), and |h|_{q,a,q′} the number of times these are followed by q′, where q′ ∈ Supp(δ(q,a)). For a distinguishing transition t = (q,a,q′), we say that a history h is a bad (t,η)-classification in the MDP M[e] if | |h|_{q,a,q′}/|h|_{q,a} − δ_e(q,a)(q′) | ≥ η/2, that is, the measured and theoretical frequencies of t are too far apart. It is a good (t,η)-classification otherwise. Intuitively, over long histories, good classifications have high probability.

Given an MEMDP M, let T_M be a set obtained by selecting one distinguishing transition (q,a,q′) for each state-action pair (q,a) for which such a transition exists (i.e., such that δ_e(q,a) ≠ δ_f(q,a) for some environments e, f). Let |h|_{T_M} = ∑_{(q,a,q′)∈T_M} |h|_{q,a}, and fix η such that η < ½·min({|δ_e(q,a)(q′) − δ_f(q,a)(q′)| ∣ e, f ∈ E, q, q′ ∈ Q, a ∈ A} ∖ {0}).

Let us define the set of good histories with n_0 samples, denoted by Good_{n_0}, as the set of histories h satisfying the two conditions below (a concrete membership check is sketched after the definition):

  1. |h|_{T_M} ≥ n_0,

  2. for all t = (q,a,q′) ∈ T_M satisfying |h|_{q,a} = max_{(q″,a″,q‴)∈T_M} |h|_{q″,a″}, the history h is a good (t,η)-classification.
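The membership test for Good_{n_0} can be transcribed literally from the definition (a hypothetical helper; delta_e is the transition function of the environment against which classifications are checked).

```python
def is_good_history(history, T_M, delta_e, eta, n0):
    """Membership test for Good_n0 (assumes n0 >= 1).  `history` alternates
    states and actions, `T_M` is the set of selected distinguishing transitions
    (q, a, succ), and `delta_e[(q, a)][succ]` is the transition probability of
    the environment against which classifications are checked."""
    n_qa, n_qas = {}, {}
    for i in range(0, len(history) - 1, 2):
        q, a, succ = history[i], history[i + 1], history[i + 2]
        n_qa[(q, a)] = n_qa.get((q, a), 0) + 1
        n_qas[(q, a, succ)] = n_qas.get((q, a, succ), 0) + 1
    # Condition 1: at least n0 samples of selected distinguishing transitions.
    if sum(n_qa.get((q, a), 0) for (q, a, _) in T_M) < n0:
        return False
    # Condition 2: the most-sampled selected transitions are well classified.
    most_sampled = max(n_qa.get((q, a), 0) for (q, a, _) in T_M)
    for (q, a, succ) in T_M:
        if n_qa.get((q, a), 0) == most_sampled:
            freq = n_qas.get((q, a, succ), 0) / n_qa[(q, a)]
            if abs(freq - delta_e[(q, a)][succ]) >= eta / 2:
                return False
    return True
```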

Lemma 15.

Consider an MEMDP M whose only non-distinguishing MCECs are trivial, and fix ε > 0. Let n_0 = 2(|Q||A|)³/(εη²), and n ≥ (2/p^{2|Q|})·max(log(4/ε), n_0), where p is the smallest nonzero probability that appears in M. Then, from any starting state, and under any strategy, with probability at least 1−ε, within n steps, the history either visits an MCEC (distinguishing or trivial), or belongs to Good_{n_0}.

Lemma 15 shows that long histories that do not enter MCECs have high probability of being good classifications, which means that the frequency of the distinguishing transitions is close to their actual probability. Technically, this result is nontrivial to prove. In fact, successive visits of distinguishing transitions are not independent trials so classical reasoning based, for instance, on Hoeffding’s inequality cannot be applied. It turns out that although these trials can be dependent, their covariances are 0. We exploit this fact to prove Lemma 15 by carefully modeling the successive trials, and relying on Chebyshev’s inequality.

Constructing Approximate Finite-Memory Strategies.

We are now ready to construct a finite-memory strategy that approximates an arbitrary strategy σ. We first construct a finite-memory strategy for purge(M) and then transfer it to M using Lemma 14. The finite-memory strategy imitates σ for n steps, where n is defined in Lemma 15. By this lemma, because all nontrivial MCECs of purge(M) are distinguishing, when we play for n steps, with high probability, we either visit a trivial MCEC (which is either winning for all environments or losing for all environments), or reach a distinguishing MCEC, or observe enough samples of distinguishing transitions. The strategy is extended arbitrarily in trivial MCECs. Inside distinguishing MCECs, it gathers samples of distinguishing transitions as in Lemma 7, which improves the knowledge (with an arbitrarily small probability of error). The knowledge is also correctly improved with high probability if enough samples are gathered outside of MCECs. In both cases, the strategy switches to a finite-memory strategy for the improved knowledge constructed recursively for smaller sets of environments.

Lemma 16 formalizes this reasoning and gives a bound N on the memory of the resulting strategy. In the memory bound, the term ⌈8·log(8/ε)/η²⌉² comes from the application of Lemma 7 for distinguishing MCECs for each subset of E; and the term (2|Q|)^{n(|E|+1)} corresponds to the recursive analysis, since the strategy is defined inductively for each subset of E. Notice that n is exponential, and that N is doubly exponential in the size of the input.

Lemma 16.

Consider an MEMDP M = ⟨Q, A, (δ_e)_{e∈E}⟩, a state q ∈ Q, and a parity objective φ. For all strategies σ and ε > 0, there exists a strategy σ′ using at most N = (2|Q|)^{n(|E|+1)}·|A|·⌈8·log(8/ε)/η²⌉² memory, where n = (2/p^{2|Q|})·max(8(|Q||A|)³/(εη²), log(16/ε)), with p the smallest nonzero probability in M, and that satisfies ℙ^{σ′}_q(M,φ) ≥ ℙ^σ_q(M,φ) − ε.

Approximation Algorithm.

We now sketch a procedure for solving the gap problem with threshold α for parity objectives in MEMDPs. Informally, when the memory size is fixed, one can define a system of polynomial equations of linear size that describe the reachability probabilities for a target set in MDPs [3]; and this can be extended easily to MEMDPs. Here, given any ε>0, we consider the bound N of Lemma 16, and solve the polynomial equation system whose solutions correspond to finite-memory strategies with N memory states. Solving such equations can be done in polynomial space [5] in the size of the equations, which means a double exponential-space procedure here because N is double exponential.
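One possible shape of such a system, under the assumption that the guessed strategy is described by action-probability variables x_{q,m,a} over state-memory pairs with a memory-update function up that is also part of the guess (a sketch, not the paper's exact encoding): one copy of the value variables per environment, sharing the strategy variables.

$$
\begin{aligned}
v^e_{q,m} &= \sum_{a \in A_q} x_{q,m,a} \sum_{q' \in Q} \delta_e(q,a)(q')\; v^e_{q',\mathrm{up}(m,a,q')} && \text{for } q \notin T,\ m \in \{1,\dots,N\},\ e \in E,\\
v^e_{q,m} &= 1 \ \text{ for } q \in T, \qquad x_{q,m,a} \ge 0, \qquad \sum_{a \in A_q} x_{q,m,a} = 1, \qquad v^e_{q_0,m_0} \ \ge\ \alpha - \varepsilon \quad \text{for all } e \in E .
\end{aligned}
$$

Any solution of such a system witnesses a strategy, but, as usual with reachability equations, spurious solutions must be ruled out, for instance by additionally forcing v^e_{q,m} = 0 for state-memory pairs from which the target is unreachable under the support of x.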

More precisely, the equation system is written so that any solution corresponds to a strategy achieving the probability threshold αε. If some strategy achieves the threshold α, by Lemma 16, we know that there is one with a memory size N, and the equation system has a solution. Conversely, if no strategy achieves the threshold αε, then, in particular, no finite-memory strategy achieves this threshold and the equation system has no solution.

The case of parity objectives is solved by enumerating the supports of the strategies that are being guessed since the supports determine bottom strongly connected components, and the problem can then be reduced to reachability.

Theorem 17.

The gap problem can be solved in double exponential space for MEMDPs with parity objectives.

Note that the double exponential complexity remains even if |E| is fixed. Whether the complexity can be improved when |E| is fixed, and the decidability of the value problem (and not just the ε-gap problem) are interesting directions for future research.

References

  • [1] T. S. Badings, T. D. Simão, M. Suilen, and N. Jansen. Decision-making under uncertainty: beyond probabilities. Int. J. Softw. Tools Technol. Transf., 25(3):375–391, 2023. doi:10.1007/S10009-023-00704-3.
  • [2] C. Baier, M. Größer, and N. Bertrand. Probabilistic ω-automata. J. ACM, 59(1):1, 2012.
  • [3] C. Baier and J.-P. Katoen. Principles of Model Checking. MIT Press, 2008.
  • [4] P. Buchholz and D. Scheftelowitsch. Computation of weighted sums of rewards for concurrent MDPs. Math. Methods Oper. Res., 89(1):1–42, 2019. doi:10.1007/S00186-018-0653-1.
  • [5] J. Canny. Some algebraic and geometric computations in PSPACE. In Proc. of STOC: Symposium on Theory of Computing, pages 460–467. ACM, 1988. doi:10.1145/62212.62257.
  • [6] K. Chatterjee, M. Chmelík, D. Karkhanis, P. Novotný, and A. Royer. Multiple-environment Markov decision processes: Efficient analysis and applications. In Proc. of ICAPS: Automated Planning and Scheduling, pages 48–56. AAAI Press, 2020. URL: https://ojs.aaai.org/index.php/ICAPS/article/view/6644.
  • [7] K. Chatterjee, L. Doyen, H. Gimbert, and T. A. Henzinger. Randomness for free. Information and Computation, 245:3–16, 2017.
  • [8] K. Chatterjee, L. Doyen, J.-F. Raskin, and O. Sankur. The value problem for multiple-environment MDPs with parity objective. CoRR, abs/2504.15960, 2025.
  • [9] C. Courcoubetis and M. Yannakakis. The complexity of probabilistic verification. J. ACM, 42(4):857–907, July 1995. doi:10.1145/210332.210339.
  • [10] L. de Alfaro. Formal verification of probabilistic systems. PhD thesis, Stanford University, 1997.
  • [11] S. Even, A. L. Selman, and Y. Yacobi. The complexity of promise problems with applications to public-key cryptography. Information and Control, 61(2):159–173, 1984. doi:10.1016/S0019-9958(84)80056-X.
  • [12] E. A. Feinberg and A. Shwartz, editors. Handbook of Markov Decision Processes - Methods and Applications. Kluwer, 2002.
  • [13] H. Gimbert and Y. Oualhadj. Probabilistic automata on finite words: Decidable and undecidable problems. In Proc. of ICALP (2), LNCS 6199, pages 527–538. Springer, 2010. doi:10.1007/978-3-642-14162-1_44.
  • [14] O. Goldreich. On promise problems (a survey in memory of Shimon Even [1935-2004]). Manuscript, 2005.
  • [15] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. URL: http://www.jstor.org/stable/2282952.
  • [16] K. J. Åström. Optimal control of Markov processes with incomplete state information I. Journal of Mathematical Analysis and Applications, 10:174–205, 1965.
  • [17] O. Madani, S. Hanks, and A. Condon. On the undecidability of probabilistic planning and related stochastic optimization problems. Artif. Intell., 147(1-2):5–34, 2003. doi:10.1016/S0004-3702(02)00378-8.
  • [18] D. A. Martin. The determinacy of Blackwell games. The Journal of Symbolic Logic, 63(4):1565–1581, 1998. doi:10.2307/2586667.
  • [19] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processes. Math. Oper. Res., 12(3):441–450, 1987. doi:10.1287/MOOR.12.3.441.
  • [20] M. L. Puterman. Markov decision processes. John Wiley and Sons, 1994.
  • [21] N. M. van Dijk and R. J. Boucherie, editors. Markov Decision Processes in Practice. Springer, 2017.
  • [22] J.-F. Raskin and O. Sankur. Multiple-environment Markov decision processes. In Proc. of FSTTCS: Foundation of Software Technology and Theoretical Computer Science, volume 29 of LIPIcs, pages 531–543. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2014. doi:10.4230/LIPICS.FSTTCS.2014.531.
  • [23] M. Suilen, M. van der Vegt, and S. Junges. A PSPACE algorithm for almost-sure Rabin objectives in multi-environment MDPs. In Proc. of CONCUR: Concurrency Theory, volume 311 of LIPIcs, pages 40:1–40:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPICS.CONCUR.2024.40.
  • [24] M. van der Vegt, N. Jansen, and S. Junges. Robust almost-sure reachability in multi-environment MDPs. In Proc. of TACAS: Tools and Algorithms for the Construction and Analysis of Systems, LNCS 13993, pages 508–526. Springer, 2023. doi:10.1007/978-3-031-30823-9_26.