The Value Problem for Multiple-Environment MDPs with Parity Objective
Abstract
We consider multiple-environment Markov decision processes (MEMDP), which consist of a finite set of MDPs over the same state space, representing different scenarios of transition structure and probability. The value of a strategy is the probability to satisfy the objective, here a parity objective, in the worst-case scenario, and the value of an MEMDP is the supremum of the values achievable by a strategy.
We show that deciding whether the value is 1 is a PSPACE-complete problem, and even in P when the number of environments is fixed, along with new insights into the almost-sure winning problem, which is to decide if there exists a strategy with value 1. Pure strategies are sufficient for these problems, whereas randomization is necessary in general when the value is smaller than 1. We present an algorithm to approximate the value, running in double exponential space. Our results are in contrast to the related model of partially-observable MDPs where all these problems are known to be undecidable.
Keywords and phrases: Markov decision processes, imperfect information, randomized strategies, limit-sure winning
Category: Track B: Automata, Logic, Semantics, and Theory of Programming
Funding: Krishnendu Chatterjee: ERC CoG 863818 (ForM-SMArt) and Austrian Science Fund (FWF) 10.55776/COE12.
2012 ACM Subject Classification: Theory of computation → Logic and verification; Theory of computation → Probabilistic computation
Editors: Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis

1 Introduction
We consider Markov decision processes (MDP), a well-established state-transition model for decision making in a stochastic environment. The decisions involve choosing an action from a finite set, which together with the current state determines a probability distribution over the successor state. The question of constructing a strategy that maximizes the probability to satisfy a logical specification is a classical synthesis problem with a wide range of applications [20, 12, 3, 21].
The stochastic transitions in MDPs capture the uncertainty in the effect of an action. Another form of uncertainty arises when the states are (partially) hidden from the decision-maker, as in the classical model of partially-observable MDPs (POMDP) [16, 19]. Recently, an alternative model of MDPs with partial information has attracted attention, the multiple-environment MDP (MEMDP) [22], which consists of a finite set of MDPs over the same state space. Each MDP represents a possible environment, but the decision-maker does not know in which environment they are operating. The synthesis problem is then to construct a single strategy that can be executed in all environments and ensures that the objective is satisfied independently of the environment. This model is natural in applications where the structure of the transitions and their probability are uncertain, such as robust planning or population models with individual variability [4, 6, 1, 24, 23].
In contrast to what previous work suggests, the two models of POMDPs and MEMDPs are (syntactically) incomparable: the choice of the environment in an MEMDP is adversarial, which cannot be expressed in a POMDP, and the partial observability of a POMDP can occur throughout the execution, whereas the uncertainty in an MEMDP is only initial. In particular, MEMDPs are not a subclass of POMDPs, since pure strategies are sufficient in POMDPs [18, 7] while randomization is necessary in general in MEMDPs [22, Lemma 3].
The synthesis problem has been considered for traditional $\omega$-regular objectives, defined as parity [22] or Rabin [23] conditions, in three variants: the almost-sure problem is to decide whether there exists a strategy that is winning with probability 1 in all environments, the limit-sure problem is to decide whether, for every $\varepsilon > 0$, there exists a strategy that is winning with probability at least $1-\varepsilon$ in all environments, and the gap problem, which is an approximate version of the quantitative problem to decide, given a threshold $\alpha$, whether there exists a strategy that is winning with probability at least $\alpha$ in all environments. The limit-sure problem is also called the value-1 problem, where the value of an MEMDP is defined as the supremum of the values achievable by a strategy. The value is 1 if and only if the answer to the limit-sure problem is Yes.
A classical example to illustrate the difference between almost-sure and limit-sure winning is to consider an environment consisting of 51 cards, obtained by removing one card from a standard 52-card deck (see Figure 1). The decision-maker has two possible actions: the action sample reveals the top card of the deck and then shuffles the cards (including the top card, which remains in the deck); the action guess($c$), where $c$ is a card, stops the game with a win if $c$ is the missing card, and a loss otherwise. If no guess is ever made, the game is also losing. An almost-sure winning strategy is to sample until each of the 51 cards has been revealed at least once, then to make a correct guess. It is easy to see that the strategy wins with probability 1, even if there exist scenarios (though with probability 0) where some of the cards are never revealed and no correct guess is made. Hence the MEMDP is almost-sure winning, and we say that it is not sure winning because a losing scenario exists under every strategy. Consider now an environment consisting of 53 cards, obtained by adding one duplicate card to the standard deck, with the same action set and rules of the game, except that a correct guess is now the duplicate card (see Figure 2). The strategy that samples for a long time and then makes a guess based on the most frequent card wins with probability close to 1 – and closer to 1 as the sampling time is longer – but not equal to 1, since no matter how long the sampling phase is, there is always a nonzero probability that the duplicate card does not have the highest frequency at the time of the guess. In this case, the MEMDP is limit-sure winning, but not almost-sure winning. Intuitively, the solution of almost-sure winning relies on the analysis of revealing transitions, which give sure information allowing us to exclude some environments (seeing card $c$ is a guarantee that we are not in the environment where $c$ is missing); the solution of limit-sure winning involves learning by sampling, which also allows us to exclude some environments, but possibly with a nonzero probability of being mistaken.
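As an illustration of the duplicate-card scenario, the following Python sketch (with an assumed deck encoding and a guess based on the most frequent card) estimates how the error probability of the frequency-based strategy decreases with the sampling time, without ever reaching 0.

```python
import random

def guess_error_rate(sample_steps, trials=2000, seed=0):
    """Estimate the probability that guessing the most frequent card fails
    to identify the duplicate, after a fixed number of sampling steps."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        duplicate = rng.randrange(52)            # the (hidden) environment
        deck = list(range(52)) + [duplicate]     # 53 cards, one duplicated
        counts = [0] * 52
        for _ in range(sample_steps):
            counts[rng.choice(deck)] += 1        # action sample: reveal and reshuffle
        guess = max(range(52), key=lambda c: counts[c])
        errors += (guess != duplicate)
    return errors / trials

# The error probability decreases as the sampling time grows, but is never 0:
for steps in (100, 1000, 10000):
    print(steps, guess_error_rate(steps))
```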
For MEMDPs with two environments, it is known that the almost-sure and limit-sure problems for parity objectives are solvable in polynomial time [22, Theorem 33, Theorem 40], while the gap problem is decidable in 2-fold exponential space [22, Theorem 30] and is NP-hard, even for acyclic MEMDPs with two environments [22, Theorem 26]. With an arbitrary number of environments, the almost-sure problem becomes PSPACE-complete [23, Theorem 41], even for reachability objectives [24, Lemma 11]. For comparison, in the closely related model of POMDPs, the decidability frontier lies between limit-sure winning and almost-sure winning: with reachability objectives, the almost-sure problem is decidable (and EXPTIME-complete [2]), whereas the limit-sure problem is undecidable [13]. The gap problem is also undecidable [17].
In this paper, we consider the limit-sure problem and the gap problem for parity objectives in MEMDPs with an arbitrary number of environments. Our main result is to show that the limit-sure problem is PSPACE-complete and can be solved in polynomial time for a fixed number of environments, and the gap problem can be solved in double exponential space. Correspondingly, our algorithms significantly extend the solutions that are known for two environments, relying on a non-trivial recursive (inductive) analysis.
The PSPACE upper bound is obtained by a characterization of limit-sure winning for a subclass of MEMDPs, in terms of almost-sure winning conditions (Lemma 12). A pre-processing phase transforms general MEMDPs into the subclass. We present a PSPACE algorithm to compute the pre-processing and verify the characterization. Since our algorithm relies on almost-sure winning, we also give a new characterization of almost-sure winning for parity objectives in MEMDPs (Lemma 3), which gives a conceptually simple alternative algorithm to the known solution [23]. The PSPACE lower bound straightforwardly follows from the same reduction as for almost-sure winning [23, Theorem 7]. A corollary of our characterizations is a refined strategy complexity: pure (non-randomized) strategies are sufficient for both limit-sure and almost-sure winning, which was known only for acyclic MEMDPs and almost-sure reachability objectives [24, Lemma 12], and exponential memory is sufficient. In the last part of the paper, we present an algorithm running in double exponential space for solving the gap problem, by computing an approximation of the value of the MEMDP. To win with probability at least a given threshold $\alpha < 1$ in all environments, randomized strategies are more powerful than pure ones [22, Lemma 3], and thus need to be considered for solving the gap problem.
In conclusion, the model of MEMDP is a valuable alternative to POMDPs, from a theoretical perspective since the limit-sure problem and gap problem are undecidable for POMDPs whereas our results establish decidability for MEMDPs, and from a practical perspective since many applications of POMDPs can be expressed by MEMDPs, as was observed previously [1, 24].
Omitted proofs and material are provided in the extended version [8].
2 Definitions
A probability distribution on a finite set $S$ is a function $d : S \to [0,1]$ such that $\sum_{s \in S} d(s) = 1$. The support of $d$ is $\mathsf{Supp}(d) = \{s \in S \mid d(s) > 0\}$. A Dirac distribution assigns probability 1 to some $s \in S$. We denote by $\mathcal{D}(S)$ the set of all probability distributions on $S$.
2.1 Markov Decision Processes
A Markov decision process (MDP) over a finite set $A$ of actions is a tuple $M = \langle Q, (A(q))_{q \in Q}, \delta \rangle$ consisting of a finite set $Q$ of states, a nonempty set $A(q) \subseteq A$ of actions for each state $q \in Q$, and a partial probabilistic transition function $\delta : Q \times A \to \mathcal{D}(Q)$, defined for the pairs $(q,a)$ with $a \in A(q)$. We say that $(q,a,q')$ is a transition if $\delta(q,a)(q') > 0$. A state $q$ is a sink if $\delta(q,a)(q) = 1$ for all $a \in A(q)$.
A run of $M$ from an initial state $q_0$ is an infinite sequence $\rho = q_0 a_0 q_1 a_1 \ldots$ of interleaved states and actions such that $a_k \in A(q_k)$ and $\delta(q_k,a_k)(q_{k+1}) > 0$ for all $k \geq 0$. Finite prefixes of runs ending in a state are called histories, and we denote by $\mathsf{Last}(h)$ the last state of a history $h$. We denote by $\mathsf{Runs}(M)$ (resp., $\mathsf{Hist}(M)$) the set of all runs (resp., histories) of $M$, and by $\mathsf{Inf}(\rho)$ the set of states that occur infinitely often along the run $\rho$.
A sub-MDP of $M$ is an MDP $M' = \langle Q', (A'(q))_{q \in Q'}, \delta' \rangle$ such that $Q' \subseteq Q$, $\emptyset \neq A'(q) \subseteq A(q)$, and $\delta'(q,a) = \delta(q,a)$ for all states $q \in Q'$ and actions $a \in A'(q)$ (recall the requirement that $\mathsf{Supp}(\delta(q,a)) \subseteq Q'$). Consider a set $Q' \subseteq Q$ such that for all $q \in Q'$, there exists $a \in A(q)$ with $\mathsf{Supp}(\delta(q,a)) \subseteq Q'$. We define the sub-MDP of $M$ induced by $Q'$, denoted by $M \restriction Q'$, as the sub-MDP with state space $Q'$ where $A'(q) = \{a \in A(q) \mid \mathsf{Supp}(\delta(q,a)) \subseteq Q'\}$ for all $q \in Q'$.
End-components.
An end-component of $M$ is a pair $(S, A')$ with $S \subseteq Q$ and $A'(q) \subseteq A(q)$ for all $q \in S$, such that $\langle S, (A'(q))_{q \in S}, \delta \restriction S \rangle$ is a sub-MDP of $M$, where $\delta \restriction S$ denotes the restriction of $\delta$ to $S$, and such that the graph $(S, E)$ with $E = \{(q,q') \mid \exists a \in A'(q) : \delta(q,a)(q') > 0\}$ is strongly connected [10, 3]. We often identify an end-component with its set of states and state-action pairs, and we say that it is supported by the set of states $S$. The (componentwise) union of two end-components with nonempty intersection is an end-component, thus one can define maximal end-components. We denote by $\mathsf{MEC}(M)$ the set of maximal end-components of $M$, which is computable in polynomial time [10], and by $\mathcal{E}(M)$ the set of all end-components of $M$.
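For illustration, here is a sketch of the standard maximal end-component decomposition (following the recursive SCC-based scheme of [10]), under an assumed encoding of the MDP by its enabled actions and transition supports; networkx is used for strongly connected components.

```python
import networkx as nx

def maximal_end_components(states, enabled, succ):
    """Maximal end-component decomposition of an MDP.

    states : iterable of states
    enabled: dict state -> set of enabled actions
    succ   : dict (state, action) -> set of possible successors (the support)
    Returns a list of MECs, each as (frozenset of states, dict state -> set of actions).
    """
    mecs, work = [], [frozenset(states)]
    while work:
        S = work.pop()
        # Keep only actions whose support stays inside S; drop states left
        # without actions (they cannot belong to an end-component inside S).
        act = {q: {a for a in enabled[q] if succ[(q, a)] <= S} for q in S}
        while True:
            alive = frozenset(q for q in S if act[q])
            if alive == S:
                break
            S = alive
            act = {q: {a for a in act[q] if succ[(q, a)] <= S} for q in S}
        if not S:
            continue
        # Split S into SCCs of the induced graph; a stable single SCC is a MEC.
        g = nx.DiGraph()
        g.add_nodes_from(S)
        for q in S:
            for a in act[q]:
                g.add_edges_from((q, r) for r in succ[(q, a)])
        components = [frozenset(c) for c in nx.strongly_connected_components(g)]
        if len(components) == 1:
            mecs.append((S, act))
        else:
            work.extend(components)
    return mecs
```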
Histories and Strategies.
A strategy is a function $\sigma : \mathsf{Hist}(M) \to \mathcal{D}(A)$ such that $\mathsf{Supp}(\sigma(h)) \subseteq A(\mathsf{Last}(h))$ for all histories $h$. A strategy is pure if all histories are mapped to Dirac distributions. A strategy $\sigma$ is memoryless if $\sigma(h) = \sigma(h')$ for all histories $h, h'$ such that $\mathsf{Last}(h) = \mathsf{Last}(h')$. We sometimes view memoryless strategies as functions $\sigma : Q \to \mathcal{D}(A)$. A strategy $\sigma$ uses finite memory (of size $k$) if there exists a right congruence $\approx$ over $\mathsf{Hist}(M)$ (i.e., such that if $h \approx h'$, then $h \cdot a\,q \approx h' \cdot a\,q$ for all $a \in A$ and $q \in Q$) of finite index $k$ such that $\sigma(h) = \sigma(h')$ for all histories $h \approx h'$.
Objectives.
An objective $\varphi$ is a Borel set of runs. We denote by $\Pr^{\sigma}_{M,q_0}$ the standard probability measure on the sigma-algebra over the set of (infinite) runs of $M$ with initial state $q_0$, generated by the cylinder sets spanned by the histories [3]. Given a history $h = q_0 a_0 q_1 \ldots q_k$, the cylinder set of the runs with prefix $h$ has probability $\prod_{j=0}^{k-1} \sigma(q_0 a_0 \ldots q_j)(a_j) \cdot \delta(q_j,a_j)(q_{j+1})$ if $h$ starts in the initial state $q_0$, and probability 0 otherwise. We say that a run $\rho$ is compatible with strategy $\sigma$ if every prefix of $\rho$ has nonzero probability under $\sigma$.
We consider the following standard objectives for an MDP $M$:
- safety objective: given a set $T \subseteq Q$ of states, let $\mathsf{Safe}(T) = \{q_0 a_0 q_1 a_1 \ldots \in \mathsf{Runs}(M) \mid \forall k : q_k \in T\}$;
- reachability objective: given a set $T \subseteq Q$ of states, let $\mathsf{Reach}(T) = \{q_0 a_0 q_1 a_1 \ldots \in \mathsf{Runs}(M) \mid \exists k : q_k \in T\}$;
- parity objective: given a priority function $p : Q \to \mathbb{N}$, let $\mathsf{Parity}(p) = \{\rho \in \mathsf{Runs}(M) \mid \min\{p(q) \mid q \in \mathsf{Inf}(\rho)\} \text{ is even}\}$.
It is standard to cast safety and reachability objectives as special cases of parity objectives, using sink states. Given an objective $\varphi$, we denote by $\overline{\varphi}$ the complement of $\varphi$. We say that a run $\rho$ satisfies $\varphi$ if $\rho \in \varphi$, and that it violates $\varphi$ otherwise.
Lemma 1 ([9, 10]).
Under arbitrary strategies, with probability 1 the set of states occurring infinitely often along a run is the support of an end-component.
An end-component $D$ is positive under strategy $\sigma$ from $q_0$ if $\Pr^{\sigma}_{M,q_0}(\{\rho \mid \mathsf{Inf}(\rho) = D\}) > 0$ (identifying $D$ with its support). By Lemma 1, we have $\sum_{D} \Pr^{\sigma}_{M,q_0}(\{\rho \mid \mathsf{Inf}(\rho) = D\}) = 1$, where the sum is over the supports $D$ of the end-components of $M$.
Value and qualitative satisfaction.
A strategy $\sigma$ is winning for objective $\varphi$ from $q_0$ with probability (at least) $\lambda$ if $\Pr^{\sigma}_{M,q_0}(\varphi) \geq \lambda$. We denote by $\mathsf{Val}_M(\varphi)(q_0) = \sup_{\sigma} \Pr^{\sigma}_{M,q_0}(\varphi)$ the value of objective $\varphi$ from state $q_0$. A strategy $\sigma$ is optimal if $\Pr^{\sigma}_{M,q_0}(\varphi) = \mathsf{Val}_M(\varphi)(q_0)$.
We consider the following classical qualitative modes of winning. Given an objective $\varphi$, a state $q_0$ is:
- almost-sure winning if there exists a strategy $\sigma$ that is winning with probability 1, that is $\Pr^{\sigma}_{M,q_0}(\varphi) = 1$;
- limit-sure winning if $\mathsf{Val}_M(\varphi)(q_0) = 1$, or equivalently for all $\varepsilon > 0$ there exists a strategy $\sigma$ such that $\Pr^{\sigma}_{M,q_0}(\varphi) \geq 1 - \varepsilon$.
We denote by $\mathsf{AS}_M(\varphi)$ and $\mathsf{LS}_M(\varphi)$ the sets of almost-sure and limit-sure winning states, respectively. In MDPs, it is known that $\mathsf{AS}_M(\varphi) = \mathsf{LS}_M(\varphi)$ and that pure memoryless optimal strategies exist for parity objectives [20, 9].
We recall that the value of a parity objective $\varphi$ from every state of an end-component $D$ (within the sub-MDP induced by $D$) is the same, and is either 0 or 1; it does not depend on the precise value of the (non-zero) transition probabilities, but only on the supports of the transition function at the state-action pairs in $D$ [10]. When the value is 1, there exists a pure memoryless strategy $\sigma$ such that $\Pr^{\sigma}_{M,q}(\varphi) = 1$ for all states $q$ of $D$. If such a strategy exists, then $D$ is said to be $\varphi$-winning, and otherwise $\varphi$-losing.
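Since whether an end-component is $\varphi$-winning depends only on the supports and the priorities, it can be checked by the classical criterion that the end-component contains a sub-end-component whose minimal priority is even; a sketch (reusing the maximal_end_components helper above, with the same assumed encoding) is given below.

```python
def end_component_is_winning(S, act, succ, priority):
    """Check whether the end-component (S, act) is winning for the parity
    objective given by `priority` (minimal priority visited infinitely often
    is even): this holds iff some sub-end-component has an even minimal priority."""
    d = min(priority[q] for q in S)
    if d % 2 == 0:
        return True
    # A winning sub-end-component must avoid the states of (odd) priority d:
    # recurse on the maximal end-components of the remaining states.
    rest = {q for q in S if priority[q] > d}
    sub_enabled = {q: act[q] for q in rest}
    for (S2, act2) in maximal_end_components(rest, sub_enabled, succ):
        if end_component_is_winning(S2, act2, succ, priority):
            return True
    return False
```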
2.2 Multiple-Environment MDP
A multiple-environment MDP (MEMDP) over a finite set $I$ of environments is a tuple $M = \langle Q, (A(q))_{q \in Q}, (\delta_i)_{i \in I} \rangle$, where $M_i = \langle Q, (A(q))_{q \in Q}, \delta_i \rangle$ is an MDP that models the behaviour of the system in environment $i$. The state space is identical in all $M_i$ ($i \in I$); only the transition probabilities may differ. We sometimes refer to the environments of $M$ as the MDPs $M_i$ rather than the set $I$ itself. For $J \subseteq I$, let $M_J$ be the MEMDP over the set $J$ of environments. We denote by $M_{-i}$ the MEMDP over environments $I \setminus \{i\}$, and by $\widehat{M}_J$ the MDP in which $\widehat{\delta}_J(q,a)$ is the uniform distribution over $\bigcup_{i \in J} \mathsf{Supp}(\delta_i(q,a))$, for all $q \in Q$ and $a \in A(q)$.
A transition $(q,a,q')$ is revealing in $M$ if the set $K = \{i \in I \mid q' \in \mathsf{Supp}(\delta_i(q,a))\}$ is a strict subset of $I$ ($K \subsetneq I$). We say that $K$, which is the set of environments where the transition is possible, is the knowledge after observing transition $(q,a,q')$. An MEMDP is in revealed form if for all revealing transitions $(q,a,q')$, the state $q'$ is a sink in all environments, that is $\delta_i(q',b)(q') = 1$ for all environments $i \in I$ and all actions $b \in A(q')$. By extension, we call knowledge after a history $h$ the set of environments in which all transitions of $h$ are possible.
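The knowledge after a history is easy to maintain online; a minimal sketch, assuming the transition functions are encoded as dictionaries indexed by environment, state and action:

```python
def knowledge(history, delta, environments):
    """Set of environments in which every transition of the history is possible.

    history     : alternating sequence [q0, a0, q1, a1, ..., qn]
    delta       : dict (env, state, action) -> dict successor -> probability
    environments: iterable of environment identifiers
    """
    K = set(environments)
    for k in range(0, len(history) - 2, 2):
        q, a, q_next = history[k], history[k + 1], history[k + 2]
        # A transition shrinking K is, by definition, a revealing transition.
        K = {i for i in K if delta[(i, q, a)].get(q_next, 0.0) > 0.0}
    return K
```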
Decision Problems.
We are interested in synthesizing a single strategy with guarantees in all environments, without knowing in which environment it is executing. We consider reachability, safety, and parity objectives.
A state $q_0$ is almost-sure winning in $M$ for objective $\varphi$ if there exists a strategy $\sigma$ such that in all environments $i \in I$, we have $\Pr^{\sigma}_{M_i,q_0}(\varphi) = 1$, and we call such a strategy almost-sure winning. A state $q_0$ is limit-sure winning in $M$ for objective $\varphi$ if for all $\varepsilon > 0$, there exists a strategy $\sigma$ such that in all environments $i \in I$ we have $\Pr^{\sigma}_{M_i,q_0}(\varphi) \geq 1 - \varepsilon$, and we say that such a strategy is $(1-\varepsilon)$-winning.
We denote by $\mathsf{AS}_M(\varphi)$ (resp., $\mathsf{LS}_M(\varphi)$) the set of all almost-sure (resp., limit-sure) winning states in $M$ for objective $\varphi$. We consider the membership problem for almost-sure (resp., limit-sure) winning, which asks whether a given state is almost-sure (resp., limit-sure) winning in $M$ for objective $\varphi$. We refer to these membership problems as qualitative problems.
We are also interested in the quantitative problems. Given an MEMDP $M$, a parity objective $\varphi$, and a probability threshold $\alpha$, we are interested in the existence of a strategy $\sigma$ satisfying $\Pr^{\sigma}_{M_i,q_0}(\varphi) \geq \alpha$ for all $i \in I$. We present an approximation algorithm for the quantitative problem, solving the gap problem which consists, given an MEMDP $M$, a state $q_0$, a parity objective $\varphi$, and thresholds $\alpha$ and $\varepsilon > 0$, in answering:
- Yes if there exists a strategy $\sigma$ such that for all $i \in I$, we have $\Pr^{\sigma}_{M_i,q_0}(\varphi) \geq \alpha$,
- No if for all strategies $\sigma$, there exists $i \in I$ with $\Pr^{\sigma}_{M_i,q_0}(\varphi) < \alpha - \varepsilon$,
- and arbitrarily otherwise.
Results.
We solve the membership problem for limit-sure winning with parity objectives (i.e., deciding whether a given state $q$ is limit-sure winning, that is $q \in \mathsf{LS}_M(\varphi)$), providing a PSPACE algorithm with a matching complexity lower bound, and showing that the problem is solvable in polynomial time when the number of environments is fixed. Our solution relies on the solution of almost-sure winning, which is known to be PSPACE-complete for reachability [24] and Rabin objectives [23]. We revisit the solution of almost-sure winning and give a simple characterization for safety objectives (which is also PSPACE-complete), which can easily be extended to parity objectives. A corollary of our characterization is that pure (non-randomized) strategies are sufficient for both limit-sure and almost-sure winning, which was known only for acyclic MEMDPs and reachability objectives [24, Lemma 12].
For the gap problem, we present a double-exponential-space procedure to approximate the value that can be achieved in all environments, up to an arbitrary precision $\varepsilon > 0$.
3 Almost-Sure Winning
It is known that the membership problem for almost-sure winning in MEMDPs is PSPACE-complete with reachability objectives [24] as well as with Rabin objectives [23], which are expressively equivalent to parity objectives. We revisit the membership problem for almost-sure winning with parity objectives, as it will be instrumental in the solution of limit-sure winning. We present a conceptually simple characterization of the winning region for almost-sure winning, from which we derive a PSPACE algorithm, thus matching the known complexity for almost-sure Rabin objectives. A corollary of our characterization is that pure (non-randomized) strategies are sufficient for both limit-sure and almost-sure winning, which was known only for acyclic MEMDPs and reachability objectives [24, Lemma 12].
Theorem 2 ([24],[23]).
The membership problem for almost-sure winning in MEMDPs with a reachability, safety, or Rabin objective is PSPACE-complete.
To solve the membership problem for a parity objective $\varphi$, we first convert $M$ into an MEMDP $\widetilde{M}$ in revealed form with state space $Q \cup \{q_\top, q_\bot\}$, where each revealing transition $(q,a,q')$ in $M$ is redirected in $\widetilde{M}$ to $q_\top$ if $q'$ is almost-sure winning when the set of environments is the knowledge $K$ after observing the transition, and to $q_\bot$ otherwise. In order to decide if $q'$ is almost-sure winning in $M_K$, we need to solve the membership problem for an MEMDP with strictly fewer environments than in $M$, as $K \subsetneq I$, which will lead to a recursive algorithm. The base case of the solution is MEMDPs with one environment, which are equivalent to plain MDPs.
It is easy to see that, for all prefix-independent objectives $\varphi$, the almost-sure winning states of $M$ and of $\widetilde{M}$ coincide on $Q$ when reaching $q_\top$ is considered winning in $\widetilde{M}$, and we can transform the resulting objective into an objective of the same type as $\varphi$ (for example, if $\varphi$ is a parity objective, then assigning the smallest even priority to $q_\top$ turns the objective into a pure parity objective).
Hence, the main difficulty is to solve the membership problem for MEMDPs in revealed form.
By definition, the almost-sure winning region for a parity objective $\varphi$ in an MEMDP $M$ is the set of states $q$ such that there exists a strategy that is almost-sure winning for $\varphi$ from $q$ in every MDP $M_i$ (where $i$ is an environment of $M$). In contrast, we show the following characterization (note the order of the quantifiers).
Lemma 3.
Given an MEMDP $M$ in revealed form with state space $Q$, if $W \subseteq Q$ is such that in every environment $i \in I$, from every state $q \in W$, there exists a strategy that is almost-sure winning for the parity objective $\varphi$ in $M_i$ from $q$ while staying in $W$ with probability 1 in every environment, then $W \subseteq \mathsf{AS}_M(\varphi)$. Moreover, for all $q \in W$, there exists a pure finite-memory strategy ensuring $\varphi$ from $q$ in $M$.
The characterization in the first part of Lemma 3 holds simply because parity objectives are prefix-independent (runs that differ by a finite prefix are either both winning or both losing), and thus the characterization holds for all prefix-independent objectives.
Following the characterization in Lemma 3, we can construct an algorithm that computes the revealed form (using recursive calls to determine the winning region of MEMDPs with smaller sets of environments), and then computes the winning region by a fixpoint iteration, computable in polynomial time for a fixed set of environments. The depth of the recursive calls is bounded by the number of environments, giving a PSPACE algorithm. Correspondingly, we show that pure strategies with exponential memory are sufficient. It is easy to show that exponential memory is necessary using the construction establishing the PSPACE lower bound for the problem [23, Theorem 7].
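A minimal sketch of the fixpoint iteration suggested above, under our reading of Lemma 3: the per-environment almost-sure computation is delegated to a caller-supplied function as_states(i, W), assumed to return the states of W from which the parity objective is almost-surely winnable in $M_i$ while never leaving W (in any environment); this helper and the exact restriction convention are assumptions, not the paper's formulation.

```python
def almost_sure_region_revealed(states, environments, as_states):
    """Greatest-fixpoint computation of a candidate almost-sure winning region
    for an MEMDP in revealed form, given a per-environment oracle as_states."""
    W = set(states)
    while True:
        # Keep the states that are almost-sure winning in every environment
        # while confined to the current candidate set W.
        W_next = set(W)
        for i in environments:
            W_next &= set(as_states(i, W))
        if W_next == W:
            return W
        W = W_next
```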
Theorem 4.
The membership problem for almost-sure parity in MEMDPs is PSPACE-complete. Pure exponential-memory strategies are sufficient for almost-sure winning in MEMDPs with parity (thus also reachability and safety) objectives. When the number of environments is fixed, the problem is solvable in polynomial time.
4 Limit-Sure Winning
We refer to the examples of the duplicate card and the missing card in Section 1 to illustrate the difference between limit-sure and almost-sure winning. We present in Section 4.1 two other scenarios where limit-sure winning and almost-sure winning do not coincide, which will be useful to illustrate the key ideas in the algorithmic solution.
4.1 Examples
In the example of Figure 3, a set of states forms an end-component in both environments 1 and 2 (the actions are shown in the figures only when relevant). However, the transition probabilities inside this end-component are different in the two environments, and intuitively we can learn (with high probability) in which environment we are by playing inside it for a long enough (but finite) time and collecting the frequencies of visits to two of its states. Then, in order to reach the target, if one of the two states was visited more often than the other, we play the action that reaches the target in environment 1, and otherwise the action that reaches it in environment 2. The intuition is that the histories in which that state is the more frequent one have a high probability (more than $1-\varepsilon$) in environment 1 and a small probability (less than $\varepsilon$) in environment 2, where $\varepsilon$ can be made arbitrarily small (however not 0) by playing for sufficiently long. Hence the initial state is limit-sure winning, but not almost-sure winning.
In the second scenario (Figure 4), the transition probabilities do not matter. The objective is to visit some state in the target set infinitely often (those states have the even priority 0, the other states have priority 1). The initial state is limit-sure winning, but not almost-sure winning. To win with probability at least $1-\varepsilon$, a strategy can play an action $b$ for a sufficiently long time, then switch to playing another action $a$ (unless the target was reached before that). The crux is that playing $b$ does not harm, as it does not leave the limit-sure winning region, but it ensures in at least one environment (say environment 2) that the objective is satisfied with probability 1 (by reaching the target). This allows us to "discard" environment 2 if the target was not reached, and to switch to a strategy that is winning with probability at least $1-\varepsilon$ in the remaining environment, namely by playing $a$. With an arbitrary number of environments, the difficulty is to determine in which order the environments can be "discarded".
Note that the transition taken by action $b$ when the target is not reached is not revealing, since it is present in both environments. However, after crossing this transition a large number of times, we can still learn that the environment is 1 (and be mistaken with arbitrarily small probability). In contrast, the transition under $b$ that reaches the target is revealing, and the environment is 2 with certainty upon crossing that transition.
To solve the membership problem for limit-sure parity, we first convert $M$ into a revealed-form MEMDP $\widetilde{M}$, similarly to the case of almost-sure winning, with the obvious difference that revealing transitions $(q,a,q')$ of $M$ are redirected in $\widetilde{M}$ to $q_\top$ if $q'$ is limit-sure winning when the set of environments is the knowledge after observing the transition, and to $q_\bot$ otherwise. Thus, we aim for a recursive algorithm, where the base case is limit-sure winning in MEMDPs with one environment, which are equivalent to plain MDPs, for which limit-sure and almost-sure parity coincide. Note that the examples of Figure 3 and Figure 4 are in revealed form.
4.2 Common End-Components and Learning
A common end-component (CEC) of an MEMDP $M$ is a pair $(S, A')$ that is an end-component in $M_i$ for all environments $i \in I$. A CEC is trivial if it contains a single state. A CEC $C$ is said to be winning for a parity condition $\varphi$ if for all $i \in I$, there is a strategy in $M_i$ which, when started inside $C$, ensures $\varphi$ with probability 1. Notice that since $C$ is a common end-component, such a strategy ensures $\varphi$ with probability 1 in $M_i$ if and only if it does in $M_j$, for all $i, j \in I$.
We note that the common end-components of an MEMDP $M$ are the end-components of the MDP $\widehat{M}_I$, assuming $M$ is in revealed form, and thus can be computed using standard algorithms for end-components [10].
Lemma 5.
Consider an MEMDP $M$ in revealed form. The common end-components of $M$ are exactly the end-components of $\widehat{M}_I$.
A CEC may have different transition probabilities in different environments. We call a CEC distinguishing if it contains a transition $(q,a,q')$ (called a distinguishing transition) such that $\delta_i(q,a)(q') \neq \delta_j(q,a)(q')$ for some environments $i, j \in I$. Given a distinguishing transition $(q,a,q')$ and an environment $i$, define $I_1 = \{j \in I \mid \delta_j(q,a)(q') = \delta_i(q,a)(q')\}$ and $I_2 = I \setminus I_1$. We say that $\{I_1, I_2\}$ is a distinguishing partition of $I$ that is induced by the distinguishing transition $(q,a,q')$ and environment $i$.
Distinguishing transitions can be used to learn the partition $\{I_1, I_2\}$, that is, to guess (correctly with high probability) whether the current environment is in $I_1$ or in $I_2$, as in the example of Figure 3, where the end-component containing the initial state is a distinguishing end-component whose distinguishing transition induces the partition separating the two environments. A distinguishing CEC may have several distinguishing transitions and induced partitions.
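Finding a distinguishing transition of a CEC, and the partition it induces, amounts to comparing successor probabilities across environments; a small sketch under the same dictionary encoding of $\delta$ as above (the reference environment is chosen arbitrarily):

```python
def distinguishing_partitions(cec_pairs, delta, environments, tol=1e-12):
    """For each state-action pair of a common end-component, report every
    distinguishing transition together with the induced partition {I1, I2}.

    cec_pairs   : iterable of (state, action) pairs of the CEC
    delta       : dict (env, state, action) -> dict successor -> probability
    environments: list of environment identifiers
    """
    i0 = environments[0]                      # reference environment
    results = []
    for (q, a) in cec_pairs:
        successors = set().union(*(delta[(i, q, a)].keys() for i in environments))
        for q_next in successors:
            probs = {i: delta[(i, q, a)].get(q_next, 0.0) for i in environments}
            I1 = {i for i in environments if abs(probs[i] - probs[i0]) <= tol}
            I2 = set(environments) - I1
            if I2:                            # (q, a, q_next) is distinguishing
                results.append(((q, a, q_next), I1, I2))
    return results
```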
We formalize how a strategy can distinguish between $I_1$ and $I_2$ with high probability inside a distinguishing CEC. First let us recall Hoeffding's inequality.
Theorem 6 (Hoeffding’s Inequality [15]).
Let $X_1, \ldots, X_n$ be a sequence of independent and identically distributed Bernoulli variables with $\Pr(X_k = 1) = p$, and write $\bar{X} = \frac{1}{n}\sum_{k=1}^{n} X_k$. For all $\epsilon > 0$, we have $\Pr(\bar{X} - p \geq \epsilon) \leq e^{-2n\epsilon^2}$ and $\Pr(\bar{X} - p \leq -\epsilon) \leq e^{-2n\epsilon^2}$.
Given a distinguishing CEC with distinguishing partition $\{I_1, I_2\}$ induced by a transition $(q,a,q')$, a strategy can sample the distribution of the successor of $(q,a)$ by repeating the following two phases: first, use a pure memoryless strategy to almost-surely visit $q$, then play action $a$; by repeating this long enough (the precise bound depends on a given $\eta > 0$ and is derived from Theorem 6) while storing the frequency of visits to $q'$ in the second phase, we can learn and guess in which block of the partition the environment belongs, with a sufficiently small probability of mistake to ensure winning with probability at least $1 - \eta$.
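In the simplest case where the distinguishing transition has only two candidate probabilities (one per block), Hoeffding's inequality yields an explicit sufficient number of samples; the following sketch (thresholding at the midpoint is our choice of decision rule) computes it.

```python
import math

def samples_needed(p1, p2, eta):
    """Number of samples of the distinguishing transition that suffice, by
    Theorem 6, to tell probability p1 from p2 with error at most eta, when
    classifying by comparing the empirical frequency to the midpoint."""
    gap = abs(p1 - p2) / 2.0      # distance from each candidate to the threshold
    return math.ceil(math.log(1.0 / eta) / (2.0 * gap * gap))

def classify(frequency, p1, p2):
    """Guess the block of the partition from the observed empirical frequency."""
    return 1 if abs(frequency - p1) <= abs(frequency - p2) else 2

# Example: distinguishing 0.5 from 0.6 with misclassification probability 1e-6.
print(samples_needed(0.5, 0.6, 1e-6))
```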
Lemma 7.
Given an MEMDP $M$ containing a distinguishing common end-component $C$ with partition $\{I_1, I_2\}$ induced by a distinguishing transition, and a parity objective $\varphi$, for all states $q$ in $C$, all pairs of strategies $\sigma_1, \sigma_2$, and all $\eta > 0$, there exists a strategy $\sigma$ such that for each $k \in \{1,2\}$ and each environment $i \in I_k$:
$\Pr^{\sigma}_{M_i,q}(\varphi) \geq \Pr^{\sigma_k}_{M_i,q}(\varphi) - \eta$.
Moreover, the strategy $\sigma$ is pure if both $\sigma_1, \sigma_2$ are pure; and if each strategy $\sigma_k$ uses a memory of size $m_k$, then $\sigma$ uses finite memory, of size bounded in terms of $m_1$, $m_2$, $|Q|$, and the number of samples prescribed by Theorem 6 for $\eta$.
It follows that the membership problem for limit-sure winning can be decomposed into subproblems where the set of environments is one of the blocks in the partition.
Lemma 8.
Given an MEMDP $M$ containing a distinguishing common end-component $C$ with a partition $\{I_1, I_2\}$ induced by a distinguishing transition, and a parity objective $\varphi$, the following equivalence holds for every state $q$ of $C$: $q \in \mathsf{LS}_{M_I}(\varphi)$ if and only if $q \in \mathsf{LS}_{M_{I_1}}(\varphi)$ and $q \in \mathsf{LS}_{M_{I_2}}(\varphi)$.
4.3 Characterization and Algorithm
Here, we assume that MEMDPs are in revealed form, with a winning sink state $q_\top$ and a losing sink state $q_\bot$.
We show that the winning region $\mathsf{LS}_M(\varphi)$ for limit-sure parity is a closed set: from every state $q \in \mathsf{LS}_M(\varphi)$, there exists an action ensuring in all environments that all successors of $q$ are in $\mathsf{LS}_M(\varphi)$. We call such actions limit-sure safe for $q$. We show in Lemma 9 that a limit-sure safe action always exists in limit-sure winning states. Note that playing actions that are not limit-sure safe may be useful for limit-sure winning, as in the example of Figure 3, where the action that remains inside the end-component is limit-sure safe, but the actions that guess the environment (and may lead to a losing sink upon a wrong guess) are not.
By definition of limit-sure winning, if a state $q$ is not limit-sure winning, there exists $\varepsilon_q > 0$ such that for all strategies $\sigma$, there exists an environment $i \in I$ such that $\Pr^{\sigma}_{M_i,q}(\varphi) \leq 1 - \varepsilon_q$. We denote by $\varepsilon_{\min} = \min\{\varepsilon_q \mid q \notin \mathsf{LS}_M(\varphi)\}$ a uniform bound.
Lemma 9.
Given an MEMDP $M$ (in revealed form) over environments $I$, a parity objective $\varphi$, and a state $q$, if $q$ is limit-sure winning, then there exists an action $a \in A(q)$ such that for all environments $i \in I$, all successors of $q$ under $a$ are limit-sure winning, i.e., $\mathsf{Supp}(\delta_i(q,a)) \subseteq \mathsf{LS}_M(\varphi)$.
Given an MEMDP $M$, consider the limit-sure winning region $\mathsf{LS}_M(\varphi)$ for $\varphi$. For the purpose of the analysis, consider the (memoryless) randomized strategy $\sigma_{\mathsf{unif}}$ that plays uniformly at random all limit-sure safe actions in every state $q \in \mathsf{LS}_M(\varphi)$, which is well-defined by Lemma 9.
Consider an arbitrary environment $i \in I$, and an end-component $D$ in $M_i$ that is positive under $\sigma_{\mathsf{unif}}$ (recall Lemma 1 and the definition afterward). There are three possibilities:
1. $D$ is not a common end-component (as in the example of Figure 4, for the end-component formed under action $b$ in environment 1), that is, $D$ is not an end-component in some environment $j$ (in the example, environment 2); then, by playing inside $D$ long enough, we can learn (and be mistaken with arbitrarily small probability) that we are not in environment $j$, reducing the problem to an MEMDP with fewer environments (namely, $M_{I \setminus \{j\}}$);
2. $D$ is a common end-component and is distinguishing (as in the example of Figure 3); then we can also learn a distinguishing partition $\{I_1, I_2\}$ and reduce the problem to MEMDPs with fewer environments (namely, $M_{I_1}$ and $M_{I_2}$);
3. $D$ is a common end-component and is non-distinguishing; then we show in Lemma 10 below that $D$ is almost-sure winning for $\varphi$ (that is, its states belong to $\mathsf{AS}_M(\varphi)$), obviously in all environments.
Lemma 10.
Given an MEMDP $M$ over environments $I$ (in revealed form), a parity objective $\varphi$, and a state $q$, if $q \in \mathsf{LS}_M(\varphi)$, then all non-distinguishing common end-components that are positive under the strategy $\sigma_{\mathsf{unif}}$ from $q$ in $M_i$ (for some $i \in I$) are almost-sure winning for $\varphi$ (that is, their states belong to $\mathsf{AS}_M(\varphi)$).
Our approach to compute the limit-sure winning states is to first identify the distinguishing CECs that are limit-sure winning. We can compute the maximal CECs using Lemma 5, and note that a maximal CEC containing a distinguishing CEC is itself distinguishing, so it is sufficient to consider maximal CECs. By Lemma 8, we can decide if a given distinguishing CEC is limit-sure winning using a recursive procedure on MEMDPs with fewer environments. We show in Lemma 11 below that we can replace the limit-sure winning CECs by the winning sink state $q_\top$.
Lemma 11.
Given an MEMDP $M$ with parity objective $\varphi$ and a set $W \subseteq \mathsf{LS}_M(\varphi)$ of limit-sure winning states, we have $\mathsf{LS}_M(\varphi) = \mathsf{LS}_{M'}(\varphi)$, where $M'$ is the MEMDP obtained from $M$ by replacing every state of $W$ by the winning sink $q_\top$.
We can now assume that MEMDPs contain no limit-sure winning distinguishing CEC, and present a characterization for the remaining possibility, illustrated by the scenario of Figure 4, where playing the action $b$ forever ensures, in some environment (namely, environment 2), almost-sure satisfaction of the parity objective while remaining inside the limit-sure winning region in all other environments.
Lemma 12.
Consider an MEMDP $M$ (in revealed form) over environments $I$ with $|I| \geq 2$, that contains no limit-sure winning distinguishing common end-component, and a parity objective $\varphi$. Writing $W_i = \mathsf{AS}_{M_i \restriction L_i}(\varphi)$ where $L_i = \mathsf{LS}_{M_{-i}}(\varphi)$ for each $i \in I$, we have the following:
$\mathsf{LS}_M(\varphi) = \mathsf{AS}_M\bigl(\mathsf{Reach}\bigl(\bigcup_{i \in I} W_i\bigr)\bigr)$.
Algorithm Overview.
Given an MEMDP $M$ over environments $I$, the algorithm proceeds by recursion on the size of the environment set $I$ (Algorithm 1). The base case is that of a singleton set $I = \{i\}$, where $\mathsf{LS}_M(\varphi) = \mathsf{AS}_{M_i}(\varphi)$, and this can be computed in polynomial time.
Assume $|I| \geq 2$. We first convert $M$ into an MEMDP $\widetilde{M}$ in revealed form with state space $Q \cup \{q_\top, q_\bot\}$, where each revealing transition $(q,a,q')$ in $M$ is redirected in $\widetilde{M}$ to $q_\top$ if $q'$ is limit-sure winning when the set of environments is the knowledge $K$ after observing the transition, and to $q_\bot$ otherwise. Notice that each such query uses a set $K$ of environments that is strictly smaller than $I$.
We now assume that $M$ is in revealed form and we compute the maximal end-components of the MDP $\widehat{M}_I$; these are the maximal common end-components of $M$ by Lemma 5. For each distinguishing maximal CEC $C$, we determine whether it is limit-sure winning using the condition of Lemma 8, namely that $C \subseteq \mathsf{LS}_{M_{I_k}}(\varphi)$ for $k \in \{1,2\}$, where $\{I_1, I_2\}$ is a partition of $I$ induced by a distinguishing transition of $C$, which is computed by recursive calls to the algorithm. We replace $C$ by $q_\top$ if it is limit-sure winning, which yields an MEMDP without limit-sure winning distinguishing CECs, and we can apply Lemma 12: for each environment $i \in I$, we compute $L_i = \mathsf{LS}_{M_{-i}}(\varphi)$, which is done by separate recursive calls, and we compute the sets $W_i$ using standard MDP algorithms (we restrict the state space to $L_i$ and compute the almost-sure winning states for $\varphi$ in $M_i$). We then solve the almost-sure reachability problem in $M$ for the target set $\bigcup_{i \in I} W_i$.
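The recursion just described can be summarized by the following skeleton (our paraphrase of Algorithm 1; every primitive is an assumed helper supplied through the toolbox argument, not an API defined in the paper).

```python
def limit_sure_states(M, I, toolbox):
    """Sketch of the recursive limit-sure algorithm over the environment set I.

    Assumed helpers in `toolbox`:
      as_parity(M, i, S=None)     -- almost-sure parity states of M_i (restricted to S)
      revealed_form(M, I, oracle) -- revealed-form construction, using `oracle`
                                     on strictly smaller environment sets
      max_cecs(M, I)              -- maximal common end-components (sets of states)
      partition_of(M, I, C)       -- a partition (I1, I2) induced by a
                                     distinguishing transition of C, or None
      collapse(M, C)              -- replace the CEC C by the winning sink
      as_reach(M, I, target)      -- almost-sure reachability states of the MEMDP
    """
    I = frozenset(I)
    if len(I) == 1:
        (i,) = I
        return toolbox.as_parity(M, i)                        # base case: plain MDP
    oracle = lambda M2, J: limit_sure_states(M2, J, toolbox)
    M = toolbox.revealed_form(M, I, oracle)                   # pre-processing
    for C in toolbox.max_cecs(M, I):
        partition = toolbox.partition_of(M, I, C)
        if partition is None:
            continue                                          # non-distinguishing CEC
        I1, I2 = partition
        if all(C <= limit_sure_states(M, J, toolbox) for J in (I1, I2)):
            M = toolbox.collapse(M, C)                        # Lemmas 8 and 11
    target = set()
    for i in I:
        L = limit_sure_states(M, I - {i}, toolbox)            # recursive call
        target |= toolbox.as_parity(M, i, L)                  # the sets of Lemma 12
    return toolbox.as_reach(M, I, target)
```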
Thus each recursive step takes polynomial time (besides the recursive calls), and because each recursive call decreases the size of $I$, the depth of the recursion is bounded by $|I|$. It follows that the procedure runs in polynomial space. The PSPACE lower bound follows from the same reduction as for almost-sure winning [23, Theorem 7], since the MEMDP constructed in the reduction is acyclic, thus almost-sure and limit-sure winning coincide.
Note that Lemma 12 constructs a pure strategy that achieves the objective with probability at least $1 - \varepsilon$ (for any given $\varepsilon > 0$) from the limit-sure winning states, and that the strategies constructed in Lemmas 8 and 11 to witness limit-sure winning are also pure (in Lemma 8, the construction assumes that pure strategies are sufficient for fewer environments, which allows a proof by induction since pure strategies are sufficient in MDPs, i.e., in a single environment).
Theorem 13.
The membership problem for limit-sure parity objectives in MEMDPs is PSPACE-complete and pure exponential-memory strategies are sufficient, i.e., if a state $q$ is limit-sure winning, then for all $\varepsilon > 0$ there exists a pure exponential-memory strategy that ensures the objective is satisfied with probability at least $1-\varepsilon$ from $q$. When the number of environments is fixed, the problem is solvable in polynomial time.
The time complexity of Algorithm 1 is established as follows. Let us consider a single recursive call. The maximal end-components of $\widehat{M}_I$ can be computed in time $O(|Q| \cdot m)$, where $m$ denotes the number of transitions. Then, determining whether each MEC is distinguishing, and replacing the limit-sure winning ones with sink states, can be done in time $O(|I| \cdot m)$, since one needs to go over each transition and check whether its probability differs in two environments. The last step requires solving almost-sure parity and safety for the MDPs $M_i \restriction L_i$ defined for each $i \in I$, which can be done in polynomial time (similarly as in the discussion following Theorem 4). The most costly operation is almost-sure reachability for the MEMDP $M$, which is solved with the algorithm of Theorem 4. There are at most $2^{|I|}$ recursive calls (the algorithm can be run once for each subset of $I$ using memoization), so overall we obtain a running time that is polynomial in the size of $M$ and exponential in $|I|$.
5 The Gap Problem
The goal of this section is to give a procedure to solve the gap problem for parity objectives. For this, we show that an arbitrary strategy in $M$ can be imitated by a finite-memory one (with a computable bound on the memory size) while achieving the same probability of winning up to $\varepsilon$ in all environments. Once this is established, we show how to guess a finite-memory strategy of appropriate size, and check that it is a solution to the gap problem.
To establish the memory bound for such an $\varepsilon$-approximation, we need a few intermediate lemmas. First, we define a transformation on MEMDPs consisting in collapsing the non-distinguishing maximal CECs (MCECs) of the MEMDP $M$; the resulting MEMDP is denoted $\mathsf{Purge}(M)$. We show that $M$ and $\mathsf{Purge}(M)$ have the same probabilities of satisfaction of the considered parity objective under all environments.
Intuitively, removing non-distinguishing MCECs ensures that in $\mathsf{Purge}(M)$, under all strategies, with high probability, within a fixed number of steps, either a maximal CEC is reached (which is either distinguishing, or non-distinguishing but trivial – recall that a trivial CEC contains a single absorbing state) or enough samples are gathered to improve the knowledge about the current environment. This observation will help us construct the finite-memory strategy inductively, since in each case the knowledge can be improved correctly with high probability: in trivial MCECs, the strategy is extended arbitrarily; inside distinguishing MCECs, the strategy can be extended so that it stays inside the MCEC while sampling distinguishing transitions with any desired precision, as in Lemma 7; last, if no MCEC is reached but enough samples are gathered along the way, we prove that the knowledge can also be improved with high probability. The final strategy is obtained by combining finite-memory strategies constructed inductively for smaller sets of environments.
Purge: Removing Non-Distinguishing MCECs.
We present key properties of the transformation in Lemma 14 below, showing that strategies of $M$ can be imitated in $\mathsf{Purge}(M)$, and vice versa. Hence the values of parity objectives in $M$ and $\mathsf{Purge}(M)$ coincide.
Lemma 14.
Consider an MEMDP $M$ and a parity objective $\varphi$. There exists an MEMDP $\mathsf{Purge}(M)$, with a map $f$ relating the states of $M$ and those of $\mathsf{Purge}(M)$, with the following properties:
- The only non-distinguishing MCECs of $\mathsf{Purge}(M)$ are the trivial ones $\{q_\top\}$ and $\{q_\bot\}$.
- For all states $q$ and strategies $\sigma$ for $M$, there exists a strategy $\sigma'$ for $\mathsf{Purge}(M)$ with $\Pr^{\sigma'}_{\mathsf{Purge}(M)_i, f(q)}(\varphi) \geq \Pr^{\sigma}_{M_i, q}(\varphi)$ for all $i \in I$.
- For all states $q$ and strategies $\sigma'$ for $\mathsf{Purge}(M)$, and all $\varepsilon > 0$, there exists a strategy $\sigma$ for $M$ with $\Pr^{\sigma}_{M_i, q}(\varphi) \geq \Pr^{\sigma'}_{\mathsf{Purge}(M)_i, f(q)}(\varphi) - \varepsilon$ for all $i \in I$. Furthermore, if $\sigma'$ is a finite-memory strategy, then $\sigma$ can be chosen to be a finite-memory strategy (with a computable bound on its memory size).
Note that while $\mathsf{Purge}(M)$ is computable in polynomial time, our results only need that it can be constructed, in order to establish the existence of finite-memory strategies (Lemma 16).
Learning While Playing.
We now show that after collapsing non-distinguishing MCECs, within $N$ steps (for $N$ large enough), with high probability, we either reach an MCEC (which is either distinguishing or trivial) or collect a large number of samples of distinguishing transitions whose empirical average is close to their mean. Intuitively, this means that either the knowledge can be improved after $N$ steps using the collected samples while bounding the probability of error, or an MCEC is reached.
For a history $h$, let $n_{q,a}(h)$ denote the number of occurrences of the state-action pair $(q,a)$ in $h$, and $n_{q,a,q'}(h)$ the number of times these are followed by $q'$, where $(q,a,q')$ is a transition. For a distinguishing transition $(q,a,q')$, we say that a history $h$ is a bad $\epsilon$-classification in MDP $M_i$ if $\bigl|\tfrac{n_{q,a,q'}(h)}{n_{q,a}(h)} - \delta_i(q,a)(q')\bigr| > \epsilon$, that is, the measured and theoretical frequencies of $(q,a,q')$ are too far apart. It is a good $\epsilon$-classification otherwise. Intuitively, over long histories, good classifications have high probability.
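Checking whether a history is a good $\epsilon$-classification of a given distinguishing transition is a direct frequency comparison; a sketch under the same encodings as before (treating a history with no sample of the pair as vacuously good is our convention):

```python
def is_good_classification(history, transition, delta_i, eps):
    """Good eps-classification test of a distinguishing transition (q, a, q2)
    against the transition function delta_i of one environment.

    history: alternating sequence [q0, a0, q1, a1, ..., qn]
    delta_i: dict (state, action) -> dict successor -> probability
    """
    q, a, q2 = transition
    n_qa = n_qaq2 = 0
    for k in range(0, len(history) - 2, 2):
        if history[k] == q and history[k + 1] == a:
            n_qa += 1
            if history[k + 2] == q2:
                n_qaq2 += 1
    if n_qa == 0:
        return True  # no sample observed: vacuously good
    return abs(n_qaq2 / n_qa - delta_i[(q, a)].get(q2, 0.0)) <= eps
```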
Given an MEMDP $M$, let $\Delta$ be a set obtained by selecting one distinguishing transition for each state-action pair for which such transitions exist (i.e., such that $\delta_i(q,a) \neq \delta_j(q,a)$ for some environments $i, j \in I$). Let $\epsilon_\Delta$ be the smallest nonzero difference $|\delta_i(q,a)(q') - \delta_j(q,a)(q')|$ over transitions $(q,a,q') \in \Delta$ and environments $i, j \in I$, and fix $\epsilon$ such that $\epsilon < \epsilon_\Delta / 2$.
Let us define the set of good histories with $n$ samples, denoted by $\mathsf{Good}_n$, as the set of histories $h$ satisfying:
1. $n_{q,a}(h) \geq n$ for at least one state-action pair $(q,a)$ with a distinguishing transition in $\Delta$,
2. for all transitions $(q,a,q') \in \Delta$ satisfying $n_{q,a}(h) \geq n$, the history $h$ is a good $\epsilon$-classification in the MDP of the current environment.
Lemma 15.
Consider an MEMDP $M$ whose only non-distinguishing MCECs are trivial, and fix $\gamma > 0$ and $n \in \mathbb{N}$. Let $N$ be a sufficiently large bound, computable from $n$, $\gamma$, $|Q|$, and $p_{\min}$, where $p_{\min}$ is the smallest nonzero probability that appears in $M$. Then, from any starting state, and under any strategy, with probability at least $1 - \gamma$, within $N$ steps, the history either visits an MCEC (distinguishing or trivial), or belongs to $\mathsf{Good}_n$.
Lemma 15 shows that long histories that do not enter MCECs have high probability of being good classifications, which means that the frequency of the distinguishing transitions is close to their actual probability. Technically, this result is nontrivial to prove. In fact, successive visits of distinguishing transitions are not independent trials, so classical reasoning based, for instance, on Hoeffding's inequality cannot be applied. It turns out that although these trials can be dependent, their covariances are zero. We exploit this fact to prove Lemma 15 by carefully modeling the successive trials, and relying on Chebyshev's inequality.
Constructing Approximate Finite-Memory Strategies.
We are now ready to construct a finite-memory strategy that approximates an arbitrary strategy $\sigma$. We first construct a finite-memory strategy for $\mathsf{Purge}(M)$ and then transfer it to $M$ using Lemma 14. The finite-memory strategy imitates (the counterpart in $\mathsf{Purge}(M)$ of) $\sigma$ for $N$ steps, where $N$ is defined in Lemma 15. By this lemma, because all nontrivial MCECs of $\mathsf{Purge}(M)$ are distinguishing, when we play for $N$ steps, with high probability, we either visit a trivial MCEC (which is either winning for all environments or losing for all environments), or reach a distinguishing MCEC, or observe enough samples of distinguishing transitions. The strategy is extended arbitrarily in trivial MCECs. Inside distinguishing MCECs, it gathers samples of distinguishing transitions as in Lemma 7, which improves the knowledge (with an arbitrarily small probability of error). The knowledge is also correctly improved with high probability if enough samples are gathered outside of MCECs. In both cases, the strategy switches to a finite-memory strategy for the improved knowledge, constructed recursively for smaller sets of environments.
Lemma 16 formalizes this reasoning and gives a bound on the memory of the resulting strategy. In the memory bound, one factor comes from the application of Lemma 7 for distinguishing MCECs for each subset of $I$; another factor corresponds to the recursive analysis, since the strategy is defined inductively for each subset of $I$. Notice that the number of steps $N$ is exponential, and that the resulting memory bound is doubly exponential in the size of the input.
Lemma 16.
Consider an MEMDP $M$, a state $q$, and a parity objective $\varphi$. For all strategies $\sigma$ and all $\varepsilon > 0$, there exists a strategy $\sigma'$ using at most $K$ memory, where $K$ is doubly exponential in the size of $M$ and computable from $\varepsilon$ and the smallest nonzero probability $p_{\min}$ in $M$, and that satisfies $\Pr^{\sigma'}_{M_i,q}(\varphi) \geq \Pr^{\sigma}_{M_i,q}(\varphi) - \varepsilon$ for all $i \in I$.
Approximation Algorithm.
We now sketch a procedure for solving the gap problem with thresholds $\alpha$ and $\varepsilon$ for parity objectives in MEMDPs. Informally, when the memory size is fixed, one can define a system of polynomial equations of linear size that describes the reachability probabilities for a target set in MDPs [3]; and this can be extended easily to MEMDPs. Here, given any $\varepsilon > 0$, we consider the bound $K$ of Lemma 16, and solve the polynomial equation system whose solutions correspond to finite-memory strategies with $K$ memory states. Solving such equations can be done in polynomial space [5] in the size of the equations, which means a double-exponential-space procedure here because $K$ is doubly exponential.
More precisely, the equation system is written so that any solution corresponds to a strategy achieving the probability threshold $\alpha - \varepsilon$ in all environments. If some strategy achieves the threshold $\alpha$, by Lemma 16, we know that there is one with a memory size at most $K$ achieving $\alpha - \varepsilon$, and the equation system has a solution. Conversely, if no strategy achieves the threshold $\alpha - \varepsilon$, then, in particular, no finite-memory strategy achieves this threshold and the equation system has no solution.
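The actual procedure treats the strategy as unknowns of a polynomial equation system; as a complementary illustration of the underlying relation, the following sketch (with assumed dictionary encodings) evaluates the reachability probability of one fixed finite-memory strategy in a single environment, by value iteration on the product of states and memory modes — this is the quantity that the equation system constrains simultaneously in all environments.

```python
def reachability_value(delta, strategy, update, initial, memory0, target,
                       max_iters=100000, tol=1e-12):
    """Probability of reaching `target` under a fixed finite-memory strategy.

    delta   : dict (state, action) -> dict successor -> probability
    strategy: dict (state, memory) -> dict action -> probability
    update  : dict (memory, state, action, successor) -> memory
    target  : set of MDP states (made absorbing for the computation)
    """
    # Collect the reachable product states (state, memory mode).
    nodes, frontier = set(), [(initial, memory0)]
    while frontier:
        s = frontier.pop()
        if s in nodes:
            continue
        nodes.add(s)
        q, m = s
        if q in target:
            continue
        for a, pa in strategy[(q, m)].items():
            if pa <= 0:
                continue
            for q2, p in delta[(q, a)].items():
                if p > 0:
                    frontier.append((q2, update[(m, q, a, q2)]))
    # Value iteration converging to the reachability probabilities.
    val = {s: 1.0 if s[0] in target else 0.0 for s in nodes}
    for _ in range(max_iters):
        new = {}
        for (q, m) in nodes:
            if q in target:
                new[(q, m)] = 1.0
                continue
            v = 0.0
            for a, pa in strategy[(q, m)].items():
                if pa <= 0:
                    continue
                for q2, p in delta[(q, a)].items():
                    if p > 0:
                        v += pa * p * val[(q2, update[(m, q, a, q2)])]
            new[(q, m)] = v
        if max(abs(new[s] - val[s]) for s in nodes) < tol:
            val = new
            break
        val = new
    return val[(initial, memory0)]
```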
The case of parity objectives is solved by enumerating the supports of the strategies that are being guessed since the supports determine bottom strongly connected components, and the problem can then be reduced to reachability.
Theorem 17.
The gap problem can be solved in double exponential space for MEMDPs with parity objectives.
Note that the double-exponential complexity remains even if $|I|$ is fixed. Whether the complexity can be improved when $|I|$ is fixed, and the decidability of the value problem (and not just the gap problem), are interesting directions for future research.
References
- [1] T. S. Badings, T. D. Simão, M. Suilen, and N. Jansen. Decision-making under uncertainty: beyond probabilities. Int. J. Softw. Tools Technol. Transf., 25(3):375–391, 2023. doi:10.1007/S10009-023-00704-3.
- [2] C. Baier, M. Größer, and N. Bertrand. Probabilistic ω-automata. J. ACM, 59(1):1, 2012.
- [3] C. Baier and J.-P. Katoen. Principles of Model Checking. MIT Press, 2008.
- [4] P. Buchholz and D. Scheftelowitsch. Computation of weighted sums of rewards for concurrent MDPs. Math. Methods Oper. Res., 89(1):1–42, 2019. doi:10.1007/S00186-018-0653-1.
- [5] J. Canny. Some algebraic and geometric computations in PSPACE. In Proc. of STOC: Symposium on Theory of Computing, pages 460–467. ACM, 1988. doi:10.1145/62212.62257.
- [6] K. Chatterjee, M. Chmelík, D. Karkhanis, P. Novotný, and A. Royer. Multiple-environment Markov decision processes: Efficient analysis and applications. In Proc. of ICAPS: Automated Planning and Scheduling, pages 48–56. AAAI Press, 2020. URL: https://ojs.aaai.org/index.php/ICAPS/article/view/6644.
- [7] K. Chatterjee, L. Doyen, H. Gimbert, and T. A. Henzinger. Randomness for free. Information and Computation, 245:3–16, 2017.
- [8] K. Chatterjee, L. Doyen, J.-F. Raskin, and O. Sankur. The value problem for multiple-environment MDPs with parity objective. CoRR, abs/2504.15960, 2025.
- [9] C. Courcoubetis and M. Yannakakis. The complexity of probabilistic verification. J. ACM, 42(4):857–907, July 1995. doi:10.1145/210332.210339.
- [10] L. de Alfaro. Formal verification of probabilistic systems. Ph.d. thesis, Stanford University, 1997.
- [11] S. Even, A. L. Selman, and Y. Yacobi. The complexity of promise problems with applications to public-key cryptography. Information and Control, 61(2):159–173, 1984. doi:10.1016/S0019-9958(84)80056-X.
- [12] E. A. Feinberg and A. Shwartz, editors. Handbook of Markov Decision Processes - Methods and Applications. Kluwer, 2002.
- [13] H. Gimbert and Y. Oualhadj. Probabilistic automata on finite words: Decidable and undecidable problems. In Proc. of ICALP (2), LNCS 6199, pages 527–538. Springer, 2010. doi:10.1007/978-3-642-14162-1_44.
- [14] O. Goldreich. On promise problems (a survey in memory of Shimon Even [1935-2004]). Manuscript, 2005.
- [15] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. URL: http://www.jstor.org/stable/2282952.
- [16] K. J. Åström. Optimal control of Markov processes with incomplete state information I. Journal of Mathematical Analysis and Applications, 10:174–205, 1965.
- [17] O. Madani, S. Hanks, and A. Condon. On the undecidability of probabilistic planning and related stochastic optimization problems. Artif. Intell., 147(1-2):5–34, 2003. doi:10.1016/S0004-3702(02)00378-8.
- [18] D. A. Martin. The determinacy of Blackwell games. The Journal of Symbolic Logic, 63(4):1565–1581, 1998. doi:10.2307/2586667.
- [19] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processes. Math. Oper. Res., 12(3):441–450, 1987. doi:10.1287/MOOR.12.3.441.
- [20] M. L. Puterman. Markov decision processes. John Wiley and Sons, 1994.
- [21] R. J. Boucherie and N. M. van Dijk, editors. Markov Decision Processes in Practice. Springer, 2017.
- [22] J.-F. Raskin and O. Sankur. Multiple-environment Markov decision processes. In Proc. of FSTTCS: Foundation of Software Technology and Theoretical Computer Science, volume 29 of LIPIcs, pages 531–543. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2014. doi:10.4230/LIPICS.FSTTCS.2014.531.
- [23] M. Suilen, M. van der Vegt, and S. Junges. A PSPACE algorithm for almost-sure Rabin objectives in multi-environment MDPs. In Proc. of CONCUR: Concurrency Theory, volume 311 of LIPIcs, pages 40:1–40:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPICS.CONCUR.2024.40.
- [24] M. van der Vegt, N. Jansen, and S. Junges. Robust almost-sure reachability in multi-environment MDPs. In Proc. of TACAS: Tools and Algorithms for the Construction and Analysis of Systems, LNCS 13993, pages 508–526. Springer, 2023. doi:10.1007/978-3-031-30823-9_26.