Taming Infinity One Chunk at a Time: Concisely Represented Strategies in One-Counter MDPs
Abstract
Markov decision processes (MDPs) are a canonical model to reason about decision making within a stochastic environment. We study a fundamental class of infinite MDPs: one-counter MDPs (OC-MDPs). They extend finite MDPs via an associated counter taking natural values, thus inducing an infinite MDP over the set of configurations (current state and counter value). We consider two characteristic objectives: reaching a target state (state-reachability), and reaching a target state with counter value zero (selective termination). The synthesis problem for the latter is not known to be decidable and is connected to major open problems in number theory. Furthermore, even seemingly simple strategies (e.g., memoryless ones) in OC-MDPs might be impossible to build in practice (due to the underlying infinite configuration space): we need finite, and preferably small, representations.
To overcome these obstacles, we introduce two natural classes of concisely represented strategies based on a (possibly infinite) partition of counter values into intervals. For both classes, and both objectives, we study the verification problem (does a given strategy ensure a high enough probability for the objective?), and two synthesis problems (does there exist such a strategy?): one where the interval partition is fixed as input, and one where it is only parameterised. We develop a generic approach based on a compression of the induced infinite MDP that yields decidability in all cases, with all complexities within PSPACE.
Keywords and phrases: one-counter Markov decision processes, randomised strategies, termination, reachability
Category: Track B: Automata, Logic, Semantics, and Theory of Programming
2012 ACM Subject Classification: Theory of computation → Probabilistic computation; Theory of computation → Logic and verification
Acknowledgements: We thank Sougata Bose for pointing out the reference [6].
Funding: James C. A. Main: Research Fellow of the Fonds de la Recherche Scientifique – FNRS and member of the TRAIL institute. This work has been supported by the Fonds de la Recherche Scientifique – FNRS under Grant n° T.0188.23 (PDR ControlleRS) and by the Czech Science Foundation grant no. GA23-06963S.
Editors: Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis
1 Introduction
A Markov decision process (MDP) models the continuous interaction of a controllable system with a stochastic environment. MDPs are used notably in the fields of formal methods (e.g., [4, 27]) and reinforcement learning (e.g., [41]). In each step of an MDP, the system selects an action that is available in the current state, and the state is updated randomly following a distribution depending only on the current state and chosen action. This interaction goes on forever and yields a play (i.e., an execution) of the MDP. The resolution of non-determinism in an MDP is done via a strategy: strategies prescribe choices depending on the current history of the play and may use randomisation. Strategies are the formal counterpart of controllers for the modelled system [38, 10]. A strategy is pure if it does not use randomisation.
A classical problem in MDPs is synthesis, which asks to construct a good strategy with respect to a specification. Specifications can be defined, e.g., via payoff functions that quantify the quality of plays, and via objectives, which are sets of well-behaved plays. The goal is thus to synthesise a strategy that, in the former case, has a high expected payoff, and, in the latter case, enforces the objective with high probability. If possible, it is desirable to construct an optimal strategy, i.e., a strategy whose performance is not (strictly) exceeded by another. The decision variant of the synthesis problem, which asks whether a well-performing strategy exists, is called realisability. We focus on variants of reachability objectives, asking that a target set of states be visited. These canonical objectives are central in synthesis [21].
Strategies and their representations.
Traditionally, strategies are represented by Mealy machines, i.e., finite automata with outputs (see, e.g., [12, 33]). Strategies that admit such a representation are called finite-memory strategies. A special subclass is that of memoryless strategies, i.e., strategies whose decisions depend only on the current state.
In finite MDPs, pure memoryless strategies often suffice to play optimally with a single payoff or objective, e.g., for reachability [4] and parity objectives [22], and for mean [8] and discounted-sum [40] payoffs. In countable MDPs, for reachability objectives, pure memoryless strategies are also as powerful as general ones, though optimal strategies need not exist in general [35, 31]. In the multi-objective setting, memory and randomisation are necessary in general (e.g., [39, 21]). Nonetheless, finite memory often suffices in finite MDPs, e.g., when considering several ω-regular objectives [25].
The picture is more complicated in infinite-state MDPs, where even pure memoryless strategies need not admit finite representations. Our contribution focuses on small (and particularly, finite) counter-based representations of memoryless strategies in a fundamental class of infinite-state MDPs: one-counter MDPs.
One-counter Markov decision processes.
One-counter MDPs (OC-MDPs, [16]) are finite MDPs augmented with a counter that can be incremented (by one), decremented (by one) or left unchanged on each transition. (Considering such counter updates is not restrictive for modelling: any integer counter update can be obtained with several transitions. However, this impacts the complexity of decision problems.) Such a finite OC-MDP induces an infinite MDP over a set of configurations given by states of the underlying MDP and counter values. In this induced MDP, any play that reaches counter value zero is interrupted; this event is called termination. We consider two variants of the model: unbounded OC-MDPs, where counter values can grow arbitrarily large, and bounded OC-MDPs, in which plays are interrupted when a fixed counter upper bound is reached.
The counter in OC-MDPs can, e.g., model resource consumption along plays [16], or serve as an abstraction of unbounded data types and structures [20]. It can also model the passage of time: indeed, OC-MDPs generalise finite-horizon MDPs, in which a bound is imposed on the number of steps (see, e.g., [5]). OC-MDPs can be also seen as an extension of one-counter Markov chains with non-determinism. One-counter Markov chains are equivalent to (discrete-time) quasi-birth-death processes [26], a model studied in queuing theory.
Termination is the canonical objective in OC-MDPs [16]. Also relevant is the more general selective termination objective, which requires terminating in a target set of states. In this work, we study both the selective termination objective and the state-reachability objective, which requires visiting a target set of states regardless of the counter value.
Optimal strategies need not exist in unbounded OC-MDPs for these objectives [15]. The general synthesis problem in unbounded OC-MDPs for selective termination is not known to be decidable, and it is at least as hard as the positivity problem for linear recurrence sequences [37], whose decidability would yield a major breakthrough in number theory [36]. Optimal strategies exist in bounded OC-MDPs: the induced MDP is finite and we consider reachability objectives. However, constructing optimal strategies is already EXPTIME-hard for reachability in finite-horizon MDPs [5].
In this work, we propose to tame the inherent complexity of analysing OC-MDPs by restricting our analysis to a class of succinctly representable (yet natural and expressive) strategies called interval strategies.
Interval strategies.
The monotonic structure of OC-MDPs makes them amenable to analysis through strategies of a special structure. For instance, in their study of solvency games (a stateless variant of OC-MDPs with binary-encoded counter updates), Berger et al. [7] identify a natural class of rich man's strategies, which, whenever the counter value is above some threshold, always select the same action in the same state. They argue that the question of the existence of optimal rich man's strategies "gets at a real phenomenon which … should be reflected in any good model of risk-averse investing." While [7] shows that optimal rich man's strategies do not always exist, when they do, their existence substantially simplifies the analysis of the model.
In this work, we make such finitely-represented strategies the primary object of study. We consider so-called interval strategies: each such strategy is based on some (finite, or infinite but finitely-representable) partition of $\mathbb{N}$ into intervals, and the strategy's decision depends on the current state and on the interval of the partition containing the current counter value.
More precisely, we focus on two classes of these strategies: On the one hand, in bounded and unbounded OC-MDPs, we consider open-ended interval strategies (OEISs): here, the underlying partitioning is finite, and thus contains an open-ended interval on which the strategy behaves memorylessly, akin to rich man’s strategies.
On the other hand, in unbounded OC-MDPs, we also consider cyclic interval strategies (CISs): strategies for which there exists a (positive integer) period such that, for any two counter values that differ by the period, we take the same decisions. A CIS can be represented similarly to an OEIS using a partition of the set of counter values up to the period.
Our contributions.
We formally introduce the class of interval strategies. We show that OEISs in bounded OC-MDPs and CISs in unbounded OC-MDPs can be exponentially more concise than equivalent Mealy machines (Sect. 3), and OEISs in unbounded OC-MDPs can even represent infinite-memory strategies.
For selective termination and state reachability, we consider the interval strategy verification problem and two realisability problems for structure-constrained interval strategies. On the one hand, the fixed-interval realisability problem asks, given an interval partition, whether there is an interval strategy built on this partition that ensures the objective with a probability greater than a given threshold. Intuitively, in this case, the system designer specifies the desired structure of the controller. On the other hand, the parameterised realisability problem (for interval strategies) asks whether there exists a well-performing strategy built on a partition with at most $n$ intervals such that no finite interval is larger than a second parameter $\ell$. We consider two variants of the realisability problem: one for checking the existence of a suitable pure strategy and another for randomised strategies. Randomisation allows for better performance when imposing structural constraints on strategies (Ex. 3), but pure strategies are often preferred for synthesis [24].
Our complexity results are summarised in Table 1. Analysing the performance of a memoryless strategy amounts to studying the Markov chain it induces on the MDP. The Markov chain induced by an interval strategy is infinite in unbounded OC-MDPs and of exponential size in bounded OC-MDPs, and thus cannot be analysed in polynomial space with classical approaches. Our results rely on the analysis of a compressed Markov chain derived from this large induced Markov chain. We remove certain configurations and aggregate several transitions into one (Sect. 4). This compressed Markov chain preserves the probability of selective termination and of hitting counter upper bounds (Thm. 6). However, its transition probabilities may require exponential-size representations or even be irrational. To represent these probabilities, we characterise them as the least solutions of quadratic systems of equations (Thm. 8), similarly to the result of [32] for termination probabilities in probabilistic pushdown automata. Compressed Markov chains are finite for OEISs and are induced by a one-counter Markov chain for CISs (Lem. 10).
The crux of our algorithmic results is the aforementioned compression. For verification, we reduce the problem to checking the validity of a universal formula in the theory of the reals, by exploiting our characterisation of transition probabilities in compressed Markov chains. This induces a PSPACE upper bound. For bounded OC-MDPs, we can do better: verification can be solved in polynomial time in the unit-cost arithmetic RAM model of computation of Blum, Shub and Smale [11], by computing transition and reachability probabilities of the compressed Markov chain. This yields a complexity of polynomial time with a PosSLP oracle in the Turing model (see [2]).
Both realisability variants exploit the verification approach through the theory of the reals. For fixed-interval realisability for pure strategies, we exploit non-determinism to select good strategies and then verify them with the approach above. In the randomised case, in essence, we build on the verification formulae and existentially quantify over the probabilities of actions under the sought strategy. Finally, for parameterised realisability, we build on our algorithms for the fixed-interval case by first non-deterministically building an appropriate partition.
We also provide complexity lower bounds. We show that all of our considered problems are hard for the square-root-sum problem, a problem that is not known to be solvable in polynomial time but that is solvable in polynomial time in the Blum-Shub-Smale model [42]. We also prove NP-hardness for our realisability problems for selective termination, already when checking the existence of good single-interval strategies.
Impact.
Our results provide a natural class of strategies for which realisability is decidable (whereas the general case remains open and known to be difficult [37]), and with arguably low complexity (for synthesis). Furthermore, the class of interval strategies is of practical interest thanks to their concise representation and their inherently understandable structure (in contrast to the corresponding Mealy machine representation).
Related work.
In addition to the main references cited previously, we mention some (non-exhaustive) related work. The closest is [9]: interval strategies are similar to the counter selector strategies studied in consumption MDPs. These differ from OC-MDPs in key aspects: all transitions consume resources (i.e., bear negative weights), and recharging can only be done in special reload states, where it is considered as an atomic action up to a given capacity. Consumption and counter-based (or energy) models have different behaviours (e.g., [19]). The authors of [9] also study incomparable objectives: almost-sure (i.e., probability-one) Büchi objectives.
Restricting oneself to subclasses of strategies that prove to be of practical interest is a common endeavour in synthesis: e.g., strategies represented by decision trees [17, 18, 3, 30], pure strategies [29, 24, 13], finite-memory strategies [23, 14, 13], or strategies using limited forms of randomness [33, 34].
Outline.
Due to space constraints, we only present an overview of our results. Technical details and proofs can be found in the full version of this paper [1]. In Sect. 2, we introduce all prerequisite notions. We define interval strategies and formalise our decision problems in Sect. 3. The compressed Markov chain construction, which is central in our verification and realisability algorithms, is presented in Sect. 4. Complexity results are overviewed in Sect. 5.
2 Preliminaries
Set and probability theory.
We write $\mathbb{N}$, $\mathbb{Q}$ and $\mathbb{R}$ for the sets of non-negative integers, rational numbers and real numbers respectively. We let $\mathbb{N}_{>0}$ denote the set of positive integers. We let $\overline{\mathbb{N}} = \mathbb{N} \cup \{\infty\}$ and $\overline{\mathbb{N}}_{>0} = \mathbb{N}_{>0} \cup \{\infty\}$. Given $a \in \mathbb{N}$ and $b \in \overline{\mathbb{N}}$, we let $[\![a, b]\!]$ denote the set $\{n \in \mathbb{N} \mid a \leq n \leq b\}$, and if $a = 0$, we shorten the notation to $[\![b]\!]$.
Let $X$ be a finite or countable set. We write $\mathcal{D}(X)$ for the set of distributions over $X$, i.e., of functions $\mu\colon X \to [0, 1]$ such that $\sum_{x \in X} \mu(x) = 1$. Given a distribution $\mu \in \mathcal{D}(X)$, we let $\mathsf{supp}(\mu) = \{x \in X \mid \mu(x) > 0\}$ denote the support of $\mu$.
Markov decision processes.
Formally, a Markov decision process (MDP) is a tuple $\mathcal{M} = (S, A, \delta)$ where $S$ is a countable set of states, $A$ is a finite set of actions and $\delta\colon S \times A \to \mathcal{D}(S)$ is a (partial) probabilistic transition function. Given $s \in S$, we write $A(s)$ for the set of actions $a \in A$ such that $\delta(s, a)$ is defined. We say that an action $a$ is enabled in $s$ if $a \in A(s)$. We require, without loss of generality, that there are no deadlocks in MDPs, i.e., that for all states $s \in S$, $A(s)$ is non-empty. We say that $\mathcal{M}$ is finite if its state space is finite.
A Markov chain can be seen as an MDP with a single enabled action per state. We omit actions in Markov chains and denote them as tuples $(S, \delta)$ with $\delta\colon S \to \mathcal{D}(S)$.
Let $\mathcal{M} = (S, A, \delta)$ be an MDP. A play of $\mathcal{M}$ is an infinite sequence $\pi = s_0 a_0 s_1 a_1 \ldots$ such that for all $j \in \mathbb{N}$, $a_j \in A(s_j)$ and $s_{j+1} \in \mathsf{supp}(\delta(s_j, a_j))$. A history is a finite prefix of a play ending in a state. We let $\mathsf{Plays}(\mathcal{M})$ and $\mathsf{Hist}(\mathcal{M})$ respectively denote the set of plays and of histories of $\mathcal{M}$. Given a play $\pi = s_0 a_0 s_1 \ldots$ and $j \in \mathbb{N}$, we let $\pi_{|j} = s_0 a_0 \ldots s_j$ and extend this notation to histories. For any given history $h = s_0 a_0 \ldots s_j$, we let $\mathsf{first}(h) = s_0$ and $\mathsf{last}(h) = s_j$. The cylinder $\mathsf{Cyl}(h)$ of a history $h$ is the set of plays extending $h$. For any set of histories $H$, we let $\mathsf{Cyl}(H) = \bigcup_{h \in H} \mathsf{Cyl}(h)$.
In a Markov chain, due to the lack of actions, plays and histories are simply sequences of states coherent with transitions. We extend the notation introduced above for plays and histories of MDPs to plays and histories of Markov chains.
Strategies.
A strategy describes the action choices made throughout a play of an MDP. In general, strategies can use randomisation and all previous knowledge to make decisions. Formally, a strategy is a function $\sigma\colon \mathsf{Hist}(\mathcal{M}) \to \mathcal{D}(A)$ such that for any history $h$, $\mathsf{supp}(\sigma(h)) \subseteq A(\mathsf{last}(h))$, i.e., a strategy can only prescribe actions that are available in the last state of a history. A play $\pi = s_0 a_0 s_1 \ldots$ is consistent with a strategy $\sigma$ if for all $j \in \mathbb{N}$, $a_j \in \mathsf{supp}(\sigma(\pi_{|j}))$. Consistency of a history with respect to a strategy is defined similarly.
A strategy is pure if it does not use any randomisation, i.e., if it maps histories to Dirac distributions. We view pure strategies as functions $\mathsf{Hist}(\mathcal{M}) \to A$. A strategy $\sigma$ is memoryless if the distribution it suggests depends only on the last state of a history, i.e., if for all histories $h$ and $h'$, $\mathsf{last}(h) = \mathsf{last}(h')$ implies $\sigma(h) = \sigma(h')$. We view memoryless strategies and pure memoryless strategies as functions $S \to \mathcal{D}(A)$ and $S \to A$ respectively.
Given a strategy $\sigma$ and an initial state $s$, we define a distribution $\mathbb{P}^\sigma_{\mathcal{M}, s}$ over measurable sets of plays in the usual way (e.g., [4]). For a Markov chain, due to the absence of non-determinism, we drop the strategy from this notation. If the relevant MDP or Markov chain is clear from the context, we also omit them.
In the following, we focus on memoryless strategies. If $\sigma$ is a memoryless strategy of $\mathcal{M}$, we consider the Markov chain induced by $\sigma$ on $\mathcal{M}$ as a Markov chain over $S$. Formally, it is defined as the Markov chain $\mathcal{M}^\sigma = (S, \delta^\sigma)$ such that, for all $s, s' \in S$, $\delta^\sigma(s)(s') = \sum_{a \in A(s)} \sigma(s)(a) \cdot \delta(s, a)(s')$.
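To make this construction concrete, the following minimal sketch (ours; the dictionary-based encoding and all identifiers are illustrative, not notation from the paper) computes the induced chain $\mathcal{M}^\sigma$ of a memoryless strategy:

```python
from collections import defaultdict

# delta[s][a] maps a successor state to its probability; sigma[s] maps an
# action to the probability that the memoryless strategy plays it in s.
def induced_chain(delta, sigma):
    """Collapse a memoryless strategy into the Markov chain M^sigma:
    delta_sigma(s)(s') = sum over a of sigma(s)(a) * delta(s, a)(s')."""
    chain = {}
    for s, actions in delta.items():
        out = defaultdict(float)
        for a, prob_a in sigma[s].items():
            for succ, p in actions[a].items():
                out[succ] += prob_a * p
        chain[s] = dict(out)
    return chain

# Toy MDP: in state "s", action "a" flips a coin, action "b" surely hits "t".
delta = {"s": {"a": {"s": 0.5, "t": 0.5}, "b": {"t": 1.0}}, "t": {"a": {"t": 1.0}}}
sigma = {"s": {"a": 0.5, "b": 0.5}, "t": {"a": 1.0}}
print(induced_chain(delta, sigma))  # {'s': {'s': 0.25, 't': 0.75}, 't': {'t': 1.0}}
```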
One-counter Markov decision processes.
One-counter MDPs (OC-MDPs) extend MDPs with a counter that can be incremented, decremented or left unchanged on each transition. Formally, a one-counter MDP is a tuple $\mathcal{Q} = (S, A, \delta, w)$ where $(S, A, \delta)$ is a finite MDP and $w\colon S \times A \to \{-1, 0, 1\}$ is a (partial) weight function that assigns an integer weight from $\{-1, 0, 1\}$ to state-action pairs. For all $s \in S$ and $a \in A$, we require that $w(s, a)$ be defined whenever $a \in A(s)$. A configuration of $\mathcal{Q}$ is a pair $(s, k)$ where $s \in S$ and $k \in \mathbb{N}$. In the sequel, by plays, histories, strategies etc. of $\mathcal{Q}$, we refer to the corresponding notion with respect to the MDP underlying $\mathcal{Q}$.
We remark that the weight function is not subject to randomisation, i.e., the weight of any transition is determined by the outgoing state and chosen action. In particular, counter values can be inferred from histories and be taken into account to make decisions in $\mathcal{Q}$.
Let $\mathcal{Q}$ be an OC-MDP. The OC-MDP $\mathcal{Q}$ induces an MDP over the infinite countable space of configurations. Transitions in this induced MDP are defined using $\delta$ for the probability of updating the state and $w$ for the deterministic change of counter value. We interrupt any play whenever a configuration with counter value zero is reached. Intuitively, such configurations can be seen as situations in which we have run out of an energy resource. We may also impose an upper bound on the counter value, and interrupt any plays that reach this counter upper bound. We refer to OC-MDPs with a finite upper bound on counter values as bounded OC-MDPs and to OC-MDPs with no upper bound as unbounded OC-MDPs.
We provide a unified definition for both semantics. Let $\mathcal{Q} = (S, A, \delta, w)$ be an OC-MDP and let $b \in \overline{\mathbb{N}}_{>0}$ be an upper bound on the counter value. We define the MDP $\mathcal{Q}[b] = (S \times [\![b]\!], A, \delta[b])$ where $\delta[b]$ is defined, for all configurations $(s, k)$, actions $a \in A(s)$ and states $s' \in S$, by $\delta[b]((s, k), a)((s', k + w(s, a))) = \delta(s, a)(s')$ if $0 < k < b$, and by making $(s, k)$ absorbing otherwise. The state space of $\mathcal{Q}[b]$ is finite if and only if $b \neq \infty$.
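As an illustration, here is a possible encoding of the transition function of $\mathcal{Q}[b]$ (a sketch; the identifiers and the use of `math.inf` for the unbounded semantics are our own conventions):

```python
import math

def step_distribution(delta, w, config, action, bound=math.inf):
    """Distribution over successor configurations of (state, counter) in Q[bound].
    Plays are interrupted at counter value 0 and at the upper bound, which we
    model by making such configurations absorbing."""
    state, counter = config
    if counter == 0 or counter >= bound:
        return {config: 1.0}  # interrupted play: the configuration is absorbing
    new_counter = counter + w[(state, action)]
    return {(succ, new_counter): p for succ, p in delta[state][action].items()}
```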
For technical reasons, we also introduce one-counter Markov chains. In one-counter Markov chains, we authorise stochastic counter updates, i.e., counter updates are integrated in the transition function. This contrasts with OC-MDPs, where deterministic counter updates are used to allow strategies to observe counter updates. We only require the unbounded semantics in this case. Formally, a one-counter Markov chain is a tuple $\mathcal{A} = (S, \delta)$ where $S$ is a finite set of states and $\delta\colon S \to \mathcal{D}(S \times \{-1, 0, 1\})$ is a probabilistic transition and counter update function. The one-counter Markov chain $\mathcal{A}$ induces a Markov chain over $S \times \mathbb{N}$ such that, for any configuration $(s, k)$, any $s' \in S$ and any $u \in \{-1, 0, 1\}$, the probability of moving from $(s, k)$ to $(s', k + u)$ is $\delta(s)((s', u))$ if $k > 0$, and $(s, k)$ is absorbing otherwise.
Objectives.
Let $\mathcal{M}$ be an MDP. An objective is a measurable set of plays that represents a specification. We focus on variants of reachability. Given a set of target states $T \subseteq S$, the reachability objective for $T$ is defined as $\mathsf{Reach}(T) = \{s_0 a_0 s_1 \ldots \in \mathsf{Plays}(\mathcal{M}) \mid \exists j \in \mathbb{N},\ s_j \in T\}$. If $T = \{t\}$ is a singleton, we write $\mathsf{Reach}(t)$ for $\mathsf{Reach}(\{t\})$. We assume familiarity with linear systems for reachability probabilities in finite Markov chains (see, e.g., [4]).
Let $\mathcal{Q}$ be an OC-MDP, let $b$ be a counter upper bound and let $T \subseteq S$. We consider two variants of reachability on $\mathcal{Q}[b]$. The first goal we consider requires visiting $T$ regardless of the counter value and the second requires visiting $T$ with a counter value of zero. The first objective is called state-reachability and we denote it by $\mathsf{Reach}(T)$ instead of $\mathsf{Reach}(T \times [\![b]\!])$. The second objective is called selective termination, which is formalised as $\mathsf{Term}(T) = \mathsf{Reach}(T \times \{0\})$. Like above, if $T = \{t\}$, we write $\mathsf{Reach}(t)$ and $\mathsf{Term}(t)$ instead of $\mathsf{Reach}(\{t\})$ and $\mathsf{Term}(\{t\})$.
The Blum-Shub-Smale model of computation.
Some of our complexity bounds rely on a model of computation introduced by Blum, Shub and Smale [11]. A Blum-Shub-Smale (BSS) machine, intuitively, is a random-access memory machine with registers storing real numbers. Arithmetic computations on the contents of registers take constant time in this model.
The class of decision problems that can be solved in polynomial time in the BSS model coincides with the class of decision problems that can be solved, in the Turing model, in polynomial time with a PosSLP oracle [2]. The PosSLP problem asks, given a division-free straight-line program (intuitively, an arithmetic circuit), whether its output is positive. The PosSLP problem lies in the counting hierarchy and can be solved in polynomial space [2].
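For intuition, a division-free straight-line program is a sequence of assignments built from integer constants, addition, subtraction and multiplication; the toy evaluator below (our own illustration) shows why the problem is non-trivial, as computed values can grow exponentially large in the program size:

```python
def eval_slp(program):
    """Evaluate a division-free straight-line program, given as a list of
    instructions over registers: ("c", v) appends the constant v, and
    (op, i, j) appends regs[i] op regs[j] for op in {"+", "-", "*"}."""
    regs = []
    for ins in program:
        if ins[0] == "c":
            regs.append(ins[1])
        else:
            op, i, j = ins
            regs.append(regs[i] + regs[j] if op == "+" else
                        regs[i] - regs[j] if op == "-" else
                        regs[i] * regs[j])
    return regs[-1]

# Repeated squaring: a program with n product instructions computes 2**(2**n).
prog = [("c", 2)] + [("*", i, i) for i in range(5)]
print(eval_slp(prog) > 0)  # PosSLP instance: True, the value is 2**32
```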
Theory of the reals.
The theory of the reals refers to the set of sentences (i.e., fully quantified first-order logical formulae) in the signature of ordered fields that hold in $\mathbb{R}$. The problem of deciding whether a sentence is in the theory of the reals is decidable; if the number of quantifier blocks is fixed, this can be done in PSPACE [6, Rmk. 13.10]. Furthermore, the problem of deciding the validity of an existential (resp. universal) formula is NP-hard (resp. co-NP-hard) [6, Rmk. 13.9]. Our complexity bounds refer to the complexity classes ETR (existential theory of the reals) and co-ETR, which contain the problems that can be reduced in polynomial time to checking the membership in the theory of the reals of an existential and of a universal sentence respectively.
3 Interval strategies and decision problems
We now introduce interval strategies and compare them to finite-memory strategies of the MDP underlying the OC-MDP. We close this section by formulating our decision problems. We fix an OC-MDP $\mathcal{Q} = (S, A, \delta, w)$ and a counter upper bound $b \in \overline{\mathbb{N}}_{>0}$.
Interval strategies.
Interval strategies are a subclass of memoryless strategies of $\mathcal{Q}[b]$ that admit finite compact representations based on families of intervals. Intuitively, a strategy is an interval strategy if there exists an interval partition (i.e., a partition containing only intervals) of the set of counter values such that the decisions taken in a state for two counter values in the same interval coincide. We require that this partition has a finite representation to formulate verification and synthesis problems for interval strategies.
The set $[\![1, b-1]\!]$ (with the convention $b - 1 = \infty$ if $b = \infty$) contains all counter values for which decisions are relevant. Let $\mathcal{I}$ be an interval partition of $[\![1, b-1]\!]$. A memoryless strategy $\sigma$ of $\mathcal{Q}[b]$ is based on the partition $\mathcal{I}$ if for all $I \in \mathcal{I}$, all $s \in S$ and all $k, k' \in I$, we have $\sigma((s, k)) = \sigma((s, k'))$. All memoryless strategies are based on the trivial partition of $[\![1, b-1]\!]$ into singleton sets. In practice, we are interested in strategies based on partitions with a small number of large intervals.
We define two classes of interval strategies: strategies that are based on finite partitions and, in unbounded OC-MDPs, strategies that are based on periodic partitions. An interval partition $\mathcal{I}$ of $\mathbb{N}_{>0}$ is periodic if there exists a period $\rho \in \mathbb{N}_{>0}$ such that for all $I \in \mathcal{I}$, $\{k + \rho \mid k \in I\} \in \mathcal{I}$. A periodic interval partition with period $\rho$ induces the interval partition $\{I \cap [\![1, \rho]\!] \mid I \in \mathcal{I}\} \setminus \{\emptyset\}$ of $[\![1, \rho]\!]$. Conversely, for any $\rho \in \mathbb{N}_{>0}$, an interval partition $\mathcal{I}$ of $[\![1, \rho]\!]$ generates the periodic partition $\{\{k + j\rho \mid k \in I\} \mid I \in \mathcal{I},\ j \in \mathbb{N}\}$.
Let $\sigma$ be a memoryless strategy of $\mathcal{Q}[b]$. The strategy $\sigma$ is an open-ended interval strategy (OEIS) if it is based on a finite interval partition of $[\![1, b-1]\!]$. The qualifier open-ended follows from there being an unbounded interval in any finite interval partition of $\mathbb{N}_{>0}$ (for the unbounded case). When $b = \infty$, $\sigma$ is a cyclic interval strategy (CIS) if there exists a period $\rho \in \mathbb{N}_{>0}$ such that for all $s \in S$ and all $k \in \mathbb{N}_{>0}$, we have $\sigma((s, k + \rho)) = \sigma((s, k))$. A strategy is a CIS with period $\rho$ if and only if it is based on a periodic interval partition of $\mathbb{N}_{>0}$ with period $\rho$. (We do not consider strategies based on ultimately periodic interval partitions. However, our techniques can be adapted to analyse such strategies, and we can show that our complexity bounds for the decision problems we study extend to this case.)
We represent interval strategies as follows. First, assume that $\sigma$ is an OEIS and let $\mathcal{I}$ be the coarsest finite interval partition of $[\![1, b-1]\!]$ on which $\sigma$ is based. We can represent $\sigma$ by a table that lists, for each $I \in \mathcal{I}$, the bounds of $I$ and a memoryless strategy of the underlying MDP dictating the choices to be made when the current counter value lies in $I$. Next, assume that $b = \infty$ and that $\sigma$ is a CIS with period $\rho$. Let $\mathcal{I}$ be an interval partition of $[\![1, \rho]\!]$ such that $\sigma$ is based on the partition generated by $\mathcal{I}$. We represent $\sigma$ by $\rho$ and an OEIS based on $\mathcal{I}$ that specifies the behaviour of $\sigma$ for counter values up to $\rho$.
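A possible concrete encoding of this table representation (a sketch; the field names and the bisect-based lookup are our own choices):

```python
import bisect

class IntervalStrategy:
    """Table of interval lower bounds, each paired with a memoryless strategy
    (a map state -> action, or state -> distribution over actions). With
    period=None this encodes an OEIS; with period=p it encodes a CIS whose
    decisions are read modulo the period p."""
    def __init__(self, lower_bounds, strategies, period=None):
        self.lows = lower_bounds      # sorted lower bounds, e.g. [1, 5, 100]
        self.strategies = strategies  # one memoryless strategy per interval
        self.period = period

    def decision(self, state, counter):
        if self.period is not None:            # CIS: reduce to one period
            counter = (counter - 1) % self.period + 1
        idx = bisect.bisect_right(self.lows, counter) - 1
        return self.strategies[idx][state]

# OEIS over intervals [1,4], [5,99], [100,inf): play "a" below 100, else "b".
oeis = IntervalStrategy([1, 5, 100], [{"s": "a"}, {"s": "a"}, {"s": "b"}])
print(oeis.decision("s", 7), oeis.decision("s", 10**9))  # a b
```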
Interval strategies subsume counter-oblivious strategies, i.e., memoryless strategies that make choices based only on the state and disregard the current counter value. Counter-oblivious strategies can be viewed as memoryless strategies of the MDP underlying $\mathcal{Q}$.
Substituting counters with memory.
Memoryless strategies of $\mathcal{Q}[b]$ can be seen as strategies of the underlying MDP when an initial counter value is fixed. The idea is to use memory to keep track of the counter instead of observing it. We can thus compare our representation of interval strategies to the classical Mealy machine representation of the corresponding strategies of the underlying MDP. The following discussion is formalised in the full version of this paper [1].
If $b \neq \infty$, the counterpart in the underlying MDP of any OEIS has finite memory: there are only finitely many counter values. However, a Mealy machine for this strategy can require a size exponential in the binary encoding of $b$ (which occurs in the OEIS representation). Below, we illustrate this claim and that, if $b = \infty$, then counterparts of OEISs may require infinite memory.
Example 1.
We consider the OC-MDP illustrated in Fig. 1(a) with a counter bound $b$. We consider the OEIS $\sigma$ that switches actions between two intervals of counter values (the precise bounds and choices are given in Fig. 1(a)). The strategy $\sigma$ maximises the probability of terminating in the target from the initial configuration of the example.
We let $\tau$ denote the counterpart of $\sigma$ in the underlying MDP for the fixed initial counter value. Intuitively, any automaton-based memory structure must distinguish histories that lead to different counter values, and thus a number of memory states linear in $b$ (hence exponential in the encoding of $\sigma$) is necessary to implement $\tau$ via a Mealy machine. As $\sigma$ requires a specific action for each of its two intervals, we must be able to distinguish the corresponding histories from one another. In particular, if $b = \infty$, then $\tau$ is not finite-memory.
We now assume that $b = \infty$ and that $\sigma$ is a CIS. Its counterpart in the underlying MDP is a finite-memory strategy: by periodicity of $\sigma$, it suffices to keep track of the remainder of the division of the counter value by the period. By adapting Ex. 1, we can show that a CIS representation may be exponentially more succinct than any Mealy machine for its counterpart in the underlying MDP.
Decision problems.
We formulate the decision problems studied in the following sections. The common inputs of these problems are an OC-MDP $\mathcal{Q}$ with rational transition probabilities, a counter bound $b$ (encoded in binary if it is finite), a target $T \subseteq S$, an objective $E \in \{\mathsf{Reach}(T), \mathsf{Term}(T)\}$, an initial configuration $(s_{\mathsf{init}}, k_{\mathsf{init}})$ and a threshold $\alpha$ against which we compare the probability of $E$. Problems that are related to CISs assume that $b = \infty$, as all strategies in the bounded case are OEISs. We lighten the probability notation below by omitting $\mathcal{Q}[b]$ from it.
First, we are concerned with the verification of interval strategies. The interval strategy verification problem asks to decide, given an interval strategy $\sigma$, whether $\mathbb{P}^\sigma_{(s_{\mathsf{init}}, k_{\mathsf{init}})}(E) \geq \alpha$. We assume that the encoding of the input interval strategy matches the description above.
The other two problems relate to interval strategy synthesis. The corresponding decision problem is called realisability. We provide algorithms checking the existence of well-performing structurally-constrained interval strategies. We formulate two variants of this problem.
For the first variant, we fix an interval partition $\mathcal{I}$ beforehand and ask to check if there is a good strategy based on $\mathcal{I}$. Formally, the fixed-interval OEIS realisability problem asks whether, given a finite interval partition $\mathcal{I}$ of $[\![1, b-1]\!]$, there exists an OEIS $\sigma$ based on $\mathcal{I}$ such that $\mathbb{P}^\sigma_{(s_{\mathsf{init}}, k_{\mathsf{init}})}(E) \geq \alpha$. The variant for CISs is defined similarly: the fixed-interval CIS realisability problem asks whether, given a period $\rho$ and an interval partition $\mathcal{I}$ of $[\![1, \rho]\!]$, there exists a CIS $\sigma$ based on the partition generated by $\mathcal{I}$ such that $\mathbb{P}^\sigma_{(s_{\mathsf{init}}, k_{\mathsf{init}})}(E) \geq \alpha$.
For the second variant, we parameterise the number of intervals in the partition and the size of bounded intervals. The parameterised OEIS realisability problem asks, given a bound $n$ on the number of intervals and a bound $\ell$ on the size of bounded intervals, whether there exists an OEIS $\sigma$ such that $\mathbb{P}^\sigma_{(s_{\mathsf{init}}, k_{\mathsf{init}})}(E) \geq \alpha$ and $\sigma$ is based on an interval partition $\mathcal{I}$ of $[\![1, b-1]\!]$ with $|\mathcal{I}| \leq n$ and, for all bounded $I \in \mathcal{I}$, $|I| \leq \ell$ (in particular, the parameter $\ell$ does not constrain the required infinite interval in the unbounded setting $b = \infty$). The parameterised CIS realisability problem asks, given a bound $n$ on the number of intervals and a bound $\ell$ on the size of intervals, whether there exists a CIS $\sigma$ such that $\mathbb{P}^\sigma_{(s_{\mathsf{init}}, k_{\mathsf{init}})}(E) \geq \alpha$ and $\sigma$ is based on a periodic interval partition whose period $\rho$ induces a partition of $[\![1, \rho]\!]$ with at most $n$ intervals, each of size at most $\ell$. In both cases, we assume that the number $\ell$ is given in unary. This ensures that witness strategies, when they exist, are based on interval partitions that have a representation of size polynomial in the size of the inputs.
Remark 2.
In bounded OC-MDPs, instances of the parameterised OEIS realisability problem such that no partition of $[\![1, b-1]\!]$ is compatible with the input parameters $n$ and $\ell$ are trivially negative. This issue does not arise for OEISs in unbounded OC-MDPs or for CISs, as counter-oblivious strategies are always possible witnesses no matter the parameters.
For both realisability problems, we consider two variants, depending on whether we want the answer with respect to the set of pure or randomised interval strategies. For many objectives in MDPs (e.g., reachability, parity objectives [4, 13]), the maximum probability of satisfying the objective is the same for pure and randomised strategies. As highlighted by the following example, when we restrict the structure of the sought interval strategy, there may exist randomised strategies that perform better than all pure ones. Intuitively, randomisation can somewhat alleviate the loss in flexibility caused by structural restrictions [23].
Example 3.
The fixed-interval and parameterised realisability problems subsume the realisability problem for counter-oblivious strategies. The goal of this example is to provide an OC-MDP in which there exists a randomised counter-oblivious strategy that performs better than any pure counter-oblivious strategy from a given configuration.
We consider the OC-MDP depicted in Fig. 1(b), in which all weights are $-1$, the objective $\mathsf{Reach}(t)$ for the target state $t$ of the figure, a counter bound and an initial configuration with counter value $2$. To maximise the probability of reaching $t$ from this configuration with an unrestricted strategy, we must select one action (say $a$) on the first step and then the other (say $a'$). Intuitively, $a$ is used when two steps remain as it allows another attempt, and we switch to $a'$ when one step is left because it is the action that maximises the probability of reaching $t$ in a single step.
We now limit our attention to counter-oblivious strategies. For pure strategies, no matter whether action $a$ or $a'$ is always chosen, $t$ is reached from the initial configuration with the same probability. However, when playing both actions uniformly at random, the resulting reachability probability is strictly greater. This shows that randomised counter-oblivious strategies can achieve better reachability (resp. selective termination) probabilities than pure strategies.
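To make the phenomenon concrete, here is a small computation with hypothetical transition probabilities (ours, not those of Fig. 1(b)): an action a succeeds with probability 0.5 and otherwise allows a retry, while an action b succeeds with probability 0.75 and otherwise fails.

```python
# Hypothetical numbers, not those of Fig. 1(b): from s with k steps left,
# "a" reaches t w.p. 0.5 and otherwise returns to s, while "b" reaches t
# w.p. 0.75 and otherwise moves to a dead end.
def reach(choice, k):
    """Probability of reaching t within k steps, where choice(k) is the
    probability of playing "a" when k steps remain."""
    if k == 0:
        return 0.0
    p_a = choice(k)
    return p_a * (0.5 + 0.5 * reach(choice, k - 1)) + (1 - p_a) * 0.75

best = reach(lambda k: 1.0 if k == 2 else 0.0, 2)  # counter-aware: a then b
pure_a = reach(lambda k: 1.0, 2)                   # oblivious pure: always a
pure_b = reach(lambda k: 0.0, 2)                   # oblivious pure: always b
uniform = reach(lambda k: 0.5, 2)                  # oblivious randomised
print(round(best, 3), round(pure_a, 3), round(pure_b, 3), round(uniform, 3))
# 0.875 0.75 0.75 0.781: both pure oblivious choices tie, the mixture beats them
```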
4 Compressing induced Markov chains
This section introduces the construction underlying our algorithms for the decision problems presented in Sect. 3. To analyse the (possibly infinite) Markov chains induced by memoryless strategies over the space of configurations of an OC-MDP, we build a compressed Markov chain, i.e., a Markov chain with fewer configurations and adjusted transitions. This construction is generic, in the sense that it can formally be defined for any memoryless strategy.
We introduce the compression construction in Sect. 4.1. Sect. 4.2 overviews its relevant properties for our algorithms.
For the whole section, we fix an OC-MDP $\mathcal{Q}$, a bound $b$ on counter values and a memoryless strategy $\sigma$ of $\mathcal{Q}[b]$ based on an interval partition $\mathcal{I}$ of $[\![1, b-1]\!]$.
4.1 Defining compressed Markov chains
We introduce the compressed Markov chain $\mathcal{K}_{\mathcal{I}}^\sigma$ derived from the Markov chain induced on $\mathcal{Q}[b]$ by $\sigma$ and the partition $\mathcal{I}$. We write $\mathcal{K}$ instead of $\mathcal{K}_{\mathcal{I}}^\sigma$ whenever the context is clear. Intuitively, we keep only some configurations in the state space of $\mathcal{K}$ and, to balance this, transitions of $\mathcal{K}$ aggregate several histories of the induced Markov chain. We also apply compression for bounded intervals: interval bounds are represented in binary and thus the size of an interval is exponential in its encoding size. We open with an example.
Example 4.
We consider the OC-MDP depicted in Fig. 2(a) and the counter upper bound $b = \infty$. Let $\sigma$ denote the randomised OEIS based on a partition into a bounded and an unbounded interval, mixing the available actions with the probabilities indicated in Fig. 2(a).
We define the compressed Markov chain depicted in Fig. 2(b) by processing each interval individually. First, we consider the bounded interval $I$. When we enter $I$ from its minimum or maximum, we only consider counter jumps by powers of $2$, starting with $1$. If a counter value in $I$ is reached by jumping by $2^j$, we consider counter updates of $\pm 2^j$ from it. This explains the two counter progressions of Fig. 2(b), one from each bound of $I$. The state space of $\mathcal{K}$ with respect to $I$ contains the configurations whose counter values can be reached via this scheme. Transitions aggregate several histories of the induced Markov chain: the probability of a transition between two retained configurations is the probability under $\sigma$ of all histories between them along which the counter avoids the relevant successor counter values (defined below) until the last step. The encoding of transition probabilities may be exponential in the number of retained configurations; this is highlighted in Fig. 2(b) by the progression of the denominators of the probabilities.
For the unbounded interval, we only retain configurations with its minimal counter value $m$ and consider transitions to configurations with counter value $m - 1$. In this case, for instance, the transition probability from a retained configuration to a configuration with counter value $m - 1$ corresponds to the probability under $\sigma$ in $\mathcal{Q}[b]$ of hitting counter value $m - 1$ for the first time in the corresponding state. This example illustrates that this probability can be irrational. Here, the probability $p$ of such a one-level descent is a solution of a quadratic equation of the form $p = c + d \cdot p^2$ (see [32]): $c$ is the probability of directly moving down one counter value and $d \cdot p^2$ is the probability of moving down by first going up through an intermediate configuration and then descending twice.
Finally, we comment on the absorbing state $\bot$. The rules making up transitions of $\mathcal{K}$ outlined above require a change in counter value. We redirect the probability of never seeing such a change to $\bot$. In this example, this occurs because $\sigma$ does not allow a counter decrease from one of the retained configurations.
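The least solution of such quadratic fixed-point equations can be approximated by Kleene iteration from $0$; a sketch with generic coefficients (the probabilities `c` and `d` below are placeholders, not the values of Fig. 2):

```python
def descent_probability(c, d, tol=1e-12):
    """Least non-negative solution of p = c + d * p**2, approximated by
    iterating the monotone map x -> c + d * x**2 from 0 (Kleene iteration)."""
    p = 0.0
    while True:
        nxt = c + d * p * p
        if abs(nxt - p) < tol:
            return nxt
        p = nxt

# Random-walk flavour: direct descent w.p. c = 1/3, ascent w.p. d = 2/3.
# p = 1/3 + 2/3 * p**2 has solutions 1/2 and 1; the least solution 1/2 is
# the actual probability of ever descending by one counter value.
print(descent_probability(1/3, 2/3))  # ~0.5
```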
Ex. 4 outlines the main ideas to construct $\mathcal{K}$. To guarantee that the doubling scheme covers bounded intervals of $\mathcal{I}$ (entered via one of their bounds) in general, we require that, for all bounded $I \in \mathcal{I}$, $|I| = 2^{d_I}$ for some $d_I \in \mathbb{N}$. This is not prohibitive: given a bounded interval, we can partition it into sub-intervals satisfying this constraint in polynomial time (see the full version of this paper [1]).
We now formalise $\mathcal{K}$. We start by defining its state space, which does not depend on $\sigma$. We first formalise the configurations that are retained for each interval.
Let $I \in \mathcal{I}$. First, let us assume that $I$ is unbounded and let $m$ denote its minimum. We set $C_I = S \times \{m\}$, i.e., we only retain the configurations with minimal counter value in $I$.
Next, let us assume that $I$ is bounded and of the form $[\![m, M]\!]$. Let $d_I \in \mathbb{N}$ be such that $|I| = 2^{d_I}$ (i.e., we have $M = m + 2^{d_I} - 1$). The set of retained configurations for $I$ is given by $C_I = S \times \big(\{m, M\} \cup \{m + 2^j \mid 0 \leq j < d_I\} \cup \{M - 2^j \mid 0 \leq j < d_I\}\big)$.
Finally, we consider absorbing configurations and the new state $\bot$. We let $C_{\mathsf{abs}} = S \times \{0, b\}$ if $b \neq \infty$ and $C_{\mathsf{abs}} = S \times \{0\}$ otherwise. We define the state space of $\mathcal{K}$ as $\{\bot\} \cup C_{\mathsf{abs}} \cup \bigcup_{I \in \mathcal{I}} C_I$.
We now define the transition function $\kappa$ of $\mathcal{K}$. For all $c \in C_{\mathsf{abs}} \cup \{\bot\}$, we let $\kappa(c)(c) = 1$, i.e., these states are absorbing. For configurations whose counter value lies in one of the intervals of $\mathcal{I}$, we provide a unified definition based on a notion of successor counter values, generalising the ideas of Ex. 4.
Let $k \in I$ be a retained counter value for some $I \in \mathcal{I}$. If $I$ is unbounded with minimum $m$ (so $k = m$), we define the successor of $k$ to be $m - 1$. We now assume that $I = [\![m, M]\!]$ is bounded; the bounds $m$ and $M$ have respective successors $m + 1$ and $m - 1$, and $M - 1$ and $M + 1$. For $k = m + 2^j$ with $0 \leq j < d_I$, the successors of $k$ are $m + 2^{j+1}$ and $m$. Symmetrically, for $k = M - 2^j$, its successors are $M - 2^{j+1}$ and $M$. Both cases entail a counter change by $2^j$. The size constraints on the intervals of $\mathcal{I}$ ensure that all successor counter values appear among the retained and absorbing configurations.
Let $(s, k)$ and $(s', k')$ be states of $\mathcal{K}$ with $k \in I$ for some $I \in \mathcal{I}$. If $k'$ is not a successor of $k$, we set $\kappa((s, k))((s', k')) = 0$. Assume now that $k'$ is a successor of $k$. We let $H$ be the set of histories $h$ of $\mathcal{Q}[b]$ such that $\mathsf{first}(h) = (s, k)$, $\mathsf{last}(h) = (s', k')$ and, for all configurations along $h$ besides $\mathsf{last}(h)$, the counter value of the configuration is not a successor of $k$; outside of its endpoints, along $h$, the counter remains, in the bounded case, strictly between the two successors of $k$, and, in the unbounded case, strictly above the successor of $k$. We set $\kappa((s, k))((s', k')) = \mathbb{P}^\sigma_{(s, k)}(\mathsf{Cyl}(H))$. To ensure that $\kappa((s, k))$ is a distribution, we let $\kappa((s, k))(\bot) = 1 - \sum_{(s', k')} \kappa((s, k))((s', k'))$; this transition captures the probability of the counter never hitting a successor of $k$.
Remark 5.
Although we have formalised compressed Markov chains for OC-MDPs, the construction can be applied to one-counter Markov chains. In particular, the properties outlined below transfer to the compression of a one-counter Markov chain.
4.2 Properties of compressed Markov chains
Validity of the compression approach.
Let $(s, k)$ be a state of $\mathcal{K}$. The probability of terminating or of reaching the counter bound from $(s, k)$ in a given state is the same in $\mathcal{Q}[b]$ under $\sigma$ and in $\mathcal{K}$. Intuitively, histories of $\mathcal{K}$ group histories of the induced Markov chain together and preserve their probability under $\sigma$. By design of the transitions in $\mathcal{K}$, all histories of the induced Markov chain from $(s, k)$ to configurations of the form $(s', 0)$ or $(s', b)$ are abstracted in this way. This implies the following theorem, in which, for the sake of clarity, we indicate the relevant MDP or Markov chain for each objective.
Theorem 6.
Let $(s, k)$ be a state of $\mathcal{K}$. For all $s' \in S$, $\mathbb{P}^\sigma_{\mathcal{Q}[b], (s, k)}(\mathsf{Reach}((s', 0))) = \mathbb{P}_{\mathcal{K}, (s, k)}(\mathsf{Reach}((s', 0)))$ and, if $b \neq \infty$, $\mathbb{P}^\sigma_{\mathcal{Q}[b], (s, k)}(\mathsf{Reach}((s', b))) = \mathbb{P}_{\mathcal{K}, (s, k)}(\mathsf{Reach}((s', b)))$.
Remark 7.
Thm. 6 does not enable the use of $\mathcal{K}$ to analyse state-reachability objectives. In this case, we first modify $\mathcal{Q}$ so targets become absorbing and have a negative-weighted self-loop. Probabilities for state-reachability in $\mathcal{Q}[b]$ are equal to those of reaching a target with counter value $0$ or $b$ in the compression of the modified OC-MDP with respect to $\sigma$ and $\mathcal{I}$.
Representing and computing transition probabilities.
Transition probabilities in $\mathcal{K}$ can be irrational for unbounded intervals or require large representations for bounded intervals. For unbounded intervals, these probabilities can be shown to be termination probabilities in a one-counter Markov chain. It follows that they are the least solution of a quadratic system of equations [32].
For bounded intervals, the transition probabilities of $\mathcal{K}$ are also the least solution of a quadratic system of equations. We provide intuition on how to derive this system for a bounded interval $I = [\![m, M]\!]$ via Markov chains.
Let us first consider transitions from configurations whose counter value is a bound of $I$, say $m$. We illustrate the situation in Fig. 3(a): we consider a Markov chain over the configurations with counter values between the two successors $m - 1$ and $m + 1$ of $m$, where configurations with a successor counter value are absorbing and transitions from the other configurations are inherited from the Markov chain induced by $\sigma$ on $\mathcal{Q}[b]$. For any state $s' \in S$, the probability of reaching a configuration $(s', m + 1)$ or $(s', m - 1)$ from $(s, m)$ in this Markov chain is the corresponding transition probability of $\mathcal{K}$ by definition.
Next, we let $0 \leq j < d_I$ and consider configurations whose successors lie at distance $2^{j+1}$. The situation is depicted in Fig. 3(b). We divide a counter change by $2^{j+1}$ into two counter changes by $2^j$ and, thus, rely on the transition probabilities for counter changes by $2^j$ in $\mathcal{K}$. In this case, we can see transition probabilities from these configurations in $\mathcal{K}$ as reachability probabilities in a finite Markov chain whose transitions are the distance-$2^j$ transition probabilities of $\mathcal{K}$. We obtain the following.
Theorem 8.
Let $I \in \mathcal{I}$. Transition probabilities of $\mathcal{K}$ from configurations with counter value in $I$ are the least solution of a quadratic system of equations. The equation system for $I$ is of size polynomial in the binary encoding of the bounds of $I$ and the encoding of $\sigma$.
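At each level of this doubling scheme, the required probabilities are reachability probabilities in a finite Markov chain, which reduce to a linear system; a generic sketch of that standard computation (our own encoding) follows.

```python
import numpy as np

def reach_probabilities(P, targets, losing):
    """Probabilities of reaching `targets` in a finite Markov chain with
    transition matrix P, where `losing` contains the states that cannot
    reach the targets. Solves (I - Q) x = r on the remaining states, with
    Q the restriction of P to those states and r the one-step target mass."""
    n = P.shape[0]
    rest = [i for i in range(n) if i not in targets and i not in losing]
    Q = P[np.ix_(rest, rest)]
    r = P[np.ix_(rest, sorted(targets))].sum(axis=1)
    probs = np.zeros(n)
    probs[sorted(targets)] = 1.0
    probs[rest] = np.linalg.solve(np.eye(len(rest)) - Q, r)
    return probs

# Symmetric random walk on {0, 1, 2, 3} with absorbing ends, target state 3.
P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 0.0, 1.0]])
print(reach_probabilities(P, {3}, {0}))  # [0.  0.333...  0.666...  1.]
```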
By determining the states from which targets are unreachable, systems for Markov chain reachability probabilities can be made to have a unique solution. For bounded OC-MDPs, we can reason similarly due to the structure of the systems of Thm. 8 to obtain the following.
Theorem 9.
Assume that $b \neq \infty$. Then we can modify the systems of Thm. 8 in polynomial time such that the least solutions of the original systems are the unique solutions of the modified ones. This process only relies on the supports of the distributions in the image of $\delta$ and of $\sigma$, and not on the exact probabilities. The resulting systems can be solved in polynomial time in the BSS model.
Finite representations of compressed Markov chains.
A compressed Markov chain $\mathcal{K}_{\mathcal{I}}^\sigma$ is finite if and only if $\mathcal{I}$ is finite. In particular, for OEISs, the compression technique allows us to reduce to the analysis of a finite Markov chain. We now assume that $\sigma$ is a CIS and that $\mathcal{I}$ is periodic with period $\rho$. In contrast to the above, $\mathcal{K}$ is an infinite Markov chain. Nonetheless, we can show that $\mathcal{K}$ is induced by a one-counter Markov chain. Due to the periodic structure of $\mathcal{K}$, transitions from a configuration $(s, k)$ mainly depend on $s$ and on the remainder of the division of $k$ by $\rho$. We construct a one-counter Markov chain by encoding these remainders into states and using the counter to keep track of the quotient of the counter value by $\rho$. We obtain the following.
Lemma 10.
Assume that $\mathcal{I}$ is periodic with period $\rho$. The compression $\mathcal{K}_{\mathcal{I}}^\sigma$ is generated by a one-counter Markov chain built from $\sigma$ and $\mathcal{I}'$ (where $\mathcal{I}'$ is the interval partition of $[\![1, \rho]\!]$ induced by $\mathcal{I}$).
5 Algorithms and complexity
We now overview our complexity results, summarised in Table 1, for the decision problems presented in Sect. 3.
We fix the following inputs for the whole section: an OC-MDP $\mathcal{Q}$, a counter upper bound $b$, an initial configuration $(s_{\mathsf{init}}, k_{\mathsf{init}})$, a set of targets $T$, an objective $E \in \{\mathsf{Reach}(T), \mathsf{Term}(T)\}$ and a threshold $\alpha$. We assume that, if $E = \mathsf{Reach}(T)$, then the modification outlined in Rmk. 7 has been applied to $\mathcal{Q}$.
Verification.
Let $\sigma$ be an interval strategy. First, assume that $\sigma$ is an OEIS. We comment on the CIS case below. Let $\mathcal{I}$ be a refinement of the interval partition from the encoding of $\sigma$ such that $\mathcal{K}_{\mathcal{I}}^\sigma$ is well-defined and $k_{\mathsf{init}}$ is an interval bound in $\mathcal{I}$ (thus $(s_{\mathsf{init}}, k_{\mathsf{init}})$ is a state of $\mathcal{K}_{\mathcal{I}}^\sigma$). We can construct this partition in polynomial time with respect to the input size.
By Thm. 6 and Rmk. 7, $\mathbb{P}^\sigma_{(s_{\mathsf{init}}, k_{\mathsf{init}})}(E)$ is equal to a reachability probability from $(s_{\mathsf{init}}, k_{\mathsf{init}})$ in $\mathcal{K}_{\mathcal{I}}^\sigma$. To analyse $\mathcal{K}_{\mathcal{I}}^\sigma$, we use different approaches for bounded and unbounded OC-MDPs.
In the bounded case, we can decide the problem in polynomial time in the BSS model. By Thm. 9, we can explicitly compute the transition probabilities of $\mathcal{K}_{\mathcal{I}}^\sigma$. These can be used to compute the relevant reachability probabilities (by solving a linear system) and conclude.
In general, we can show that the verification problem is in co-ETR. From the characterisation of the transition probabilities of $\mathcal{K}_{\mathcal{I}}^\sigma$ in Thm. 8, we can derive a formula $\Phi_{\mathsf{tr}}(x, z)$, where $x$ is a vector of variables representing transition probabilities of $\mathcal{K}_{\mathcal{I}}^\sigma$ and $z$ is a vector of variables for the strategy probabilities of $\sigma$, such that the least satisfying assignment of the $x$-variables in $\Phi_{\mathsf{tr}}(x, \sigma)$ contains the transition probabilities of $\mathcal{K}_{\mathcal{I}}^\sigma$. Similarly, we can build a formula $\Phi_{\mathsf{reach}}(x, y)$ from the linear system for reachability probabilities in $\mathcal{K}_{\mathcal{I}}^\sigma$, where $y$ is a vector of variables for the relevant reachability probabilities with respect to $E$. By construction, any vectors $x$ and $y$ satisfying $\Phi_{\mathsf{tr}}(x, \sigma) \wedge \Phi_{\mathsf{reach}}(x, y)$ are over-estimations of the probabilities of interest. Thus, the verification problem amounts to checking whether $\forall x\,\forall y\,\big((\Phi_{\mathsf{tr}}(x, \sigma) \wedge \Phi_{\mathsf{reach}}(x, y)) \Rightarrow y_{\mathsf{init}} \geq \alpha\big)$ holds, where $y_{\mathsf{init}}$ is the variable for the reachability probability from $(s_{\mathsf{init}}, k_{\mathsf{init}})$ in $\mathcal{K}_{\mathcal{I}}^\sigma$.
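As a toy instance (ours), consider verifying the one-level descent probability of the random-walk equation discussed after Ex. 4, with down-probability $c$ and up-probability $d$: the universal sentence reads
\[
\forall p\;\forall y\;\Big(\big(p \geq 0 \,\wedge\, p = c + d\,p^{2} \,\wedge\, y = p\big) \;\Rightarrow\; y \geq \alpha\Big),
\]
which holds if and only if every root of $p = c + d\,p^{2}$, and in particular the least one (the true probability), is at least $\alpha$.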
We now informally comment on how to adapt the technique for OEISs to obtain a co-ETR upper bound when $\sigma$ is a CIS. The main technical difference is that we apply the compression technique twice. We apply the compression first to obtain the infinite Markov chain $\mathcal{K}_{\mathcal{I}}^\sigma$. We then consider the one-counter Markov chain that generates it (Lem. 10) and use a compression to analyse it (Rmk. 5). In the end, we construct a formula similarly to above, with an additional sub-formula for the transition probabilities of the additional compression.
Fixed-interval realisability.
We consider the fixed-interval realisability problem. We first discuss the variant for pure strategies. It suffices to non-deterministically guess a pure interval strategy based on the input interval partition and verify this strategy. In our complexity analysis, we treat verification as a part of the procedure and not as an oracle. We thus obtain non-deterministic counterparts of the verification upper bounds, both for OEISs in bounded OC-MDPs and for OEISs and CISs in unbounded OC-MDPs. In particular, we have a PSPACE upper bound.
For randomised strategies, we quantify existentially over strategy variables in the verification formulae. We illustrate the case of OEISs with an interval partition $\mathcal{I}$. We introduce a new sub-formula $\Phi_{\mathsf{strat}}(z)$ that holds if and only if the interpretation of the variables $z$ describes strategy probabilities. Solving the fixed-interval realisability problem can be reduced to checking whether $\exists z\,\big(\Phi_{\mathsf{strat}}(z) \wedge \forall x\,\forall y\,((\Phi_{\mathsf{tr}}(x, z) \wedge \Phi_{\mathsf{reach}}(x, y)) \Rightarrow y_{\mathsf{init}} \geq \alpha)\big)$ holds. For CISs, we can follow the same reasoning. We obtain a PSPACE algorithm (see Sect. 2).
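For a concrete shape of the new sub-formula (in our notation, with one variable $z_{s, I, a}$ per state $s$, interval $I$ and enabled action $a$), one could take
\[
\Phi_{\mathsf{strat}}(z) \;=\; \bigwedge_{s \in S,\, I \in \mathcal{I}} \Big( \sum_{a \in A(s)} z_{s, I, a} = 1 \;\wedge\; \bigwedge_{a \in A(s)} z_{s, I, a} \geq 0 \Big),
\]
which states exactly that each pair $(s, I)$ is assigned a probability distribution over the enabled actions.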
We refine this approach for bounded OC-MDPs. Assume that the supports of the distributions of the strategy to be synthesised, i.e., which of the variables of $z$ have a positive assignment, are fixed. With this assumption, we can modify the conjunction $\Phi_{\mathsf{tr}}(x, z) \wedge \Phi_{\mathsf{reach}}(x, y)$ of the verification formula by using Thm. 9 so that, for any valuation of $z$ compatible with the supports, the only valuation of $x$ and $y$ satisfying this conjunction is made of the probabilities represented by these variables. We can refine the system underlying $\Phi_{\mathsf{tr}}$ because Thm. 9 indicates the transitions of the compressed chain that have a positive probability. The universal quantification can then be replaced by an existential one, and we obtain an ETR complexity as follows: we non-deterministically guess some strategy distribution supports and then check if $\exists z\,\exists x\,\exists y\,(\Phi'_{\mathsf{strat}}(z) \wedge \Phi'_{\mathsf{tr}}(x, z) \wedge \Phi'_{\mathsf{reach}}(x, y) \wedge y_{\mathsf{init}} \geq \alpha)$ holds, where $\Phi'_{\mathsf{strat}}$ only allows strategies with the guessed supports and $\Phi'_{\mathsf{tr}}$ and $\Phi'_{\mathsf{reach}}$ have been adapted by applying the result of Thm. 9.
Parameterised realisability.
We consider the input parameters $n$ for the number of intervals and $\ell$ for the bound on the size of finite intervals. Intuitively, we non-deterministically construct an interval partition compatible with $n$ and $\ell$ and then use the algorithms for the fixed case. Due to the assumption that $\ell$ is given in unary, the guessed partition can be represented in polynomial space. To obtain the same complexity upper bounds for parameterised realisability as for fixed-interval realisability, we consider the algorithms for fixed-interval realisability as a part of the overall procedure and not as oracles in our complexity analyses.
Lower bounds.
All of the problems discussed above are square-root-sum-hard. The square-root-sum problem asks, given integers $a_1, \ldots, a_n$ and $b$, whether $\sum_{i=1}^{n} \sqrt{a_i} \leq b$. The square-root-sum problem can be solved in polynomial time in the BSS model [42] but is not known to lie in NP. In the unbounded case, the hardness can be obtained by relying on an existing reduction for one-counter Markov chains [26]. For the bounded setting, we build on the same construction. The main hurdle is to establish that we can, in polynomial time, compute a large enough counter bound so that the bounded one-counter Markov chain approximates the one used in the unbounded reduction well enough for the reduction to still be valid. We also provide a reduction from the Hamiltonian cycle problem to the counter-oblivious realisability problem for selective termination, implying its NP-hardness [28]. This implies the NP-hardness of our two realisability problems.
References
- [1] Michal Ajdarów, James C. A. Main, Petr Novotný, and Mickael Randour. Taming infinity one chunk at a time: Concisely represented strategies in one-counter MDPs. CoRR, abs/2503.00788, 2025. doi:10.48550/arXiv.2503.00788.
- [2] Eric Allender, Peter Bürgisser, Johan Kjeldgaard-Pedersen, and Peter Bro Miltersen. On the complexity of numerical analysis. SIAM Journal on Computing, 38(5):1987–2006, 2009. doi:10.1137/070697926.
- [3] Pranav Ashok, Mathias Jackermeier, Jan Kretínský, Christoph Weinhuber, Maximilian Weininger, and Mayank Yadav. dtControl 2.0: Explainable strategy representation via decision tree learning steered by experts. In Jan Friso Groote and Kim Guldstrand Larsen, editors, Proceedings (Part II) of the 27th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2021, Held as Part of ETAPS 2021, Luxemburg City, Luxemburg, March 27–April 1, 2021, volume 12652 of Lecture Notes in Computer Science, pages 326–345. Springer, 2021. doi:10.1007/978-3-030-72013-1_17.
- [4] Christel Baier and Joost-Pieter Katoen. Principles of model checking. MIT Press, 2008.
- [5] Nikhil Balaji, Stefan Kiefer, Petr Novotný, Guillermo A. Pérez, and Mahsa Shirmohammadi. On the complexity of value iteration. In Christel Baier, Ioannis Chatzigiannakis, Paola Flocchini, and Stefano Leonardi, editors, Proceedings of the 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, Patras, Greece, July 9–12, 2019, volume 132 of LIPIcs, pages 102:1–102:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPICS.ICALP.2019.102.
- [6] Saugata Basu, Richard Pollack, and Marie-Françoise Roy. Algorithms in Real Algebraic Geometry. Springer, 2nd edition, 2006. doi:10.1007/3-540-33099-2.
- [7] Noam Berger, Nevin Kapur, Leonard J. Schulman, and Vijay V. Vazirani. Solvency games. In Ramesh Hariharan, Madhavan Mukund, and V. Vinay, editors, IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2008, December 9-11, 2008, Bangalore, India, volume 2 of LIPIcs, pages 61–72. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2008. doi:10.4230/LIPICS.FSTTCS.2008.1741.
- [8] Karl-Josef Bierth. An expected average reward criterion. Stochastic Processes and their Applications, 26:123–140, 1987.
- [9] Frantisek Blahoudek, Tomás Brázdil, Petr Novotný, Melkior Ornik, Pranay Thangeda, and Ufuk Topcu. Qualitative controller synthesis for consumption Markov decision processes. In Shuvendu K. Lahiri and Chao Wang, editors, Proceedings (Part II) of the 32nd International Conference on Computer Aided Verification, CAV 2020, Los Angeles, CA, USA, July 21–24, 2020, volume 12225 of Lecture Notes in Computer Science, pages 421–447. Springer, 2020. doi:10.1007/978-3-030-53291-8_22.
- [10] Roderick Bloem, Krishnendu Chatterjee, and Barbara Jobstmann. Graph games and reactive synthesis. In Edmund M. Clarke, Thomas A. Henzinger, Helmut Veith, and Roderick Bloem, editors, Handbook of Model Checking, pages 921–962. Springer, 2018. doi:10.1007/978-3-319-10575-8_27.
- [11] Lenore Blum, Mike Shub, and Steve Smale. On a theory of computation and complexity over the real numbers: NP-completeness, recursive functions and universal machines. Bulletin of the American Mathematical Society, 21(1):1–46, 1989.
- [12] Patricia Bouyer, Stéphane Le Roux, Youssouf Oualhadj, Mickael Randour, and Pierre Vandenhove. Games where you can play optimally with arena-independent finite memory. Logical Methods in Computer Science, 18(1), 2022. doi:10.46298/lmcs-18(1:11)2022.
- [13] Patricia Bouyer, Youssouf Oualhadj, Mickael Randour, and Pierre Vandenhove. Arena-independent finite-memory determinacy in stochastic games. Log. Methods Comput. Sci., 19(4), 2023. doi:10.46298/LMCS-19(4:18)2023.
- [14] Patricia Bouyer, Mickael Randour, and Pierre Vandenhove. The true colors of memory: A tour of chromatic-memory strategies in zero-sum games on graphs (invited talk). In Anuj Dawar and Venkatesan Guruswami, editors, Proceedings of the 42nd IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2022, IIT Madras, Chennai, India, December 18–20, 2022, volume 250 of LIPIcs, pages 3:1–3:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPIcs.FSTTCS.2022.3.
- [15] Tomás Brázdil, Václav Brozek, Kousha Etessami, and Antonín Kucera. Approximating the termination value of one-counter MDPs and stochastic games. Information and Computation, 222:121–138, 2013. doi:10.1016/J.IC.2012.01.008.
- [16] Tomás Brázdil, Václav Brozek, Kousha Etessami, Antonín Kucera, and Dominik Wojtczak. One-counter Markov decision processes. In Moses Charikar, editor, Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 863–874. SIAM, 2010. doi:10.1137/1.9781611973075.70.
- [17] Tomás Brázdil, Krishnendu Chatterjee, Martin Chmelik, Andreas Fellner, and Jan Kretínský. Counterexample explanation by learning small strategies in Markov decision processes. In Daniel Kroening and Corina S. Pasareanu, editors, Proceedings (Part I) of the 27th International Conference on Computer Aided Verification, CAV 2015, San Francisco, CA, USA, July 18–24, 2015, volume 9206 of Lecture Notes in Computer Science, pages 158–177. Springer, 2015. doi:10.1007/978-3-319-21690-4_10.
- [18] Tomás Brázdil, Krishnendu Chatterjee, Jan Kretínský, and Viktor Toman. Strategy representation by decision trees in reactive synthesis. In Dirk Beyer and Marieke Huisman, editors, Proceedings (Part I) of the 24th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2018, Held as Part of ETAPS 2018, Thessaloniki, Greece, April 14–20, 2018, volume 10805 of Lecture Notes in Computer Science, pages 385–407. Springer, 2018. doi:10.1007/978-3-319-89960-2_21.
- [19] Tomás Brázdil, Krishnendu Chatterjee, Antonín Kucera, and Petr Novotný. Efficient controller synthesis for consumption games with multiple resource types. In P. Madhusudan and Sanjit A. Seshia, editors, Computer Aided Verification - 24th International Conference, CAV 2012, Berkeley, CA, USA, July 7-13, 2012 Proceedings, volume 7358 of Lecture Notes in Computer Science, pages 23–38. Springer, 2012. doi:10.1007/978-3-642-31424-7_8.
- [20] Tomás Brázdil, Stefan Kiefer, and Antonín Kucera. Efficient analysis of probabilistic programs with an unbounded counter. In Ganesh Gopalakrishnan and Shaz Qadeer, editors, Computer Aided Verification - 23rd International Conference, CAV 2011, Snowbird, UT, USA, July 14-20, 2011. Proceedings, volume 6806 of Lecture Notes in Computer Science, pages 208–224. Springer, 2011. doi:10.1007/978-3-642-22110-1_18.
- [21] Thomas Brihaye, Aline Goeminne, James C. A. Main, and Mickael Randour. Reachability games and friends: A journey through the lens of memory and complexity (invited talk). In Patricia Bouyer and Srikanth Srinivasan, editors, Proceedings of the 43rd IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2023, IIIT Hyderabad, Telangana, India, December 18–20, 2023, volume 284 of LIPIcs, pages 1:1–1:26. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.FSTTCS.2023.1.
- [22] Krishnendu Chatterjee, Marcin Jurdzinski, and Thomas A. Henzinger. Quantitative stochastic parity games. In J. Ian Munro, editor, Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004, New Orleans, Louisiana, USA, January 11–14, 2004, pages 121–130. SIAM, 2004. URL: http://dl.acm.org/citation.cfm?id=982792.982808.
- [23] Krishnendu Chatterjee, Mickael Randour, and Jean-François Raskin. Strategy synthesis for multi-dimensional quantitative objectives. Acta Informatica, 51(3-4):129–163, 2014. doi:10.1007/s00236-013-0182-6.
- [24] Florent Delgrange, Joost-Pieter Katoen, Tim Quatmann, and Mickael Randour. Simple strategies in multi-objective MDPs. In Armin Biere and David Parker, editors, Proceedings (Part I) of the 26th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2020, Held as Part of ETAPS 2020, Dublin, Ireland, April 25–30, 2020, volume 12078 of Lecture Notes in Computer Science, pages 346–364. Springer, 2020. doi:10.1007/978-3-030-45190-5_19.
- [25] Kousha Etessami, Marta Z. Kwiatkowska, Moshe Y. Vardi, and Mihalis Yannakakis. Multi-objective model checking of Markov decision processes. Logical Methods in Computer Science, 4(4), 2008. doi:10.2168/LMCS-4(4:8)2008.
- [26] Kousha Etessami, Dominik Wojtczak, and Mihalis Yannakakis. Quasi-birth-death processes, tree-like QBDs, probabilistic 1-counter automata, and pushdown systems. Performance Evaluation, 67(9):837–857, 2010. doi:10.1016/J.PEVA.2009.12.009.
- [27] Nathanaël Fijalkow, Nathalie Bertrand, Patricia Bouyer-Decitre, Romain Brenguier, Arnaud Carayol, John Fearnley, Hugo Gimbert, Florian Horn, Rasmus Ibsen-Jensen, Nicolas Markey, Benjamin Monmege, Petr Novotný, Mickael Randour, Ocan Sankur, Sylvain Schmitz, Olivier Serre, and Mateusz Skomra. Games on graphs. CoRR, abs/2305.10546, 2023. doi:10.48550/arXiv.2305.10546.
- [28] M. R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.
- [29] Hugo Gimbert. Pure stationary optimal strategies in Markov decision processes. In Wolfgang Thomas and Pascal Weil, editors, Proceedings of the 24th Annual Symposium on Theoretical Aspects of Computer Science, STACS 2007, Aachen, Germany, February 22–24, 2007, volume 4393, pages 200–211. Springer, 2007. doi:10.1007/978-3-540-70918-3_18.
- [30] Florian Jüngermann, Jan Kretínský, and Maximilian Weininger. Algebraically explainable controllers: decision trees and support vector machines join forces. International Journal on Software Tools for Technology Transfer, 25(3):249–266, 2023. doi:10.1007/S10009-023-00716-Z.
- [31] Stefan Kiefer, Richard Mayr, Mahsa Shirmohammadi, Patrick Totzke, and Dominik Wojtczak. How to play in infinite MDPs (invited talk). In Artur Czumaj, Anuj Dawar, and Emanuela Merelli, editors, Proceedings of the 47th International Colloquium on Automata, Languages, and Programming, ICALP 2020, Saarbrücken, Germany, July 8–11, 2020, volume 168 of LIPIcs, pages 3:1–3:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPICS.ICALP.2020.3.
- [32] Antonín Kucera, Javier Esparza, and Richard Mayr. Model checking probabilistic pushdown automata. Logical Methods in Computer Science, 2(1), 2006. doi:10.2168/LMCS-2(1:2)2006.
- [33] James C. A. Main and Mickael Randour. Different strokes in randomised strategies: Revisiting Kuhn’s theorem under finite-memory assumptions. Information and Computation, 301:105229, 2024. doi:10.1016/J.IC.2024.105229.
- [34] James C. A. Main and Mickael Randour. Mixing any cocktail with limited ingredients: On the structure of payoff sets in multi-objective MDPs and its impact on randomised strategies. CoRR, abs/2502.18296, 2025. doi:10.48550/arXiv.2502.18296.
- [35] Donald Ornstein. On the existence of stationary optimal strategies. Proceedings of the American Mathematical Society, 20(2):563–569, 1969. URL: http://www.jstor.org/stable/2035700.
- [36] Joël Ouaknine and James Worrell. Positivity problems for low-order linear recurrence sequences. In Chandra Chekuri, editor, Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5–7, 2014, pages 366–379. SIAM, 2014. doi:10.1137/1.9781611973402.27.
- [37] Jakob Piribauer and Christel Baier. Positivity-hardness results on Markov decision processes. TheoretiCS, 3, 2024. doi:10.46298/THEORETICS.24.9.
- [38] Mickael Randour. Automated synthesis of reliable and efficient systems through game theory: A case study. In Proc. of ECCS 2012, Springer Proceedings in Complexity XVII, pages 731–738. Springer, 2013. doi:10.1007/978-3-319-00395-5_90.
- [39] Mickael Randour, Jean-François Raskin, and Ocan Sankur. Percentile queries in multi-dimensional Markov decision processes. Formal methods in system design, 50(2-3):207–248, 2017. doi:10.1007/s10703-016-0262-7.
- [40] Lloyd S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953. doi:10.1073/pnas.39.10.1095.
- [41] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
- [42] Prasoon Tiwari. A problem that is easier to solve on the unit-cost algebraic RAM. Journal of Complexity, 8(4):393–397, 1992. doi:10.1016/0885-064X(92)90003-T.