
Taming Infinity One Chunk at a Time: Concisely Represented Strategies in One-Counter MDPs

Michal Ajdarów, Masaryk University, Brno, Czech Republic; James C. A. Main, F.R.S.-FNRS and UMONS – Université de Mons, Mons, Belgium; Petr Novotný, Masaryk University, Brno, Czech Republic; Mickael Randour, F.R.S.-FNRS and UMONS – Université de Mons, Mons, Belgium
Abstract

Markov decision processes (MDPs) are a canonical model to reason about decision making within a stochastic environment. We study a fundamental class of infinite MDPs: one-counter MDPs (OC-MDPs). They extend finite MDPs via an associated counter taking natural values, thus inducing an infinite MDP over the set of configurations (current state and counter value). We consider two characteristic objectives: reaching a target state (state-reachability), and reaching a target state with counter value zero (selective termination). The synthesis problem for the latter is not known to be decidable and is connected to major open problems in number theory. Furthermore, even seemingly simple strategies (e.g., memoryless ones) in OC-MDPs might be impossible to build in practice (due to the underlying infinite configuration space): we need finite, and preferably small, representations.

To overcome these obstacles, we introduce two natural classes of concisely represented strategies based on a (possibly infinite) partition of counter values into intervals. For both classes, and both objectives, we study the verification problem (does a given strategy ensure a high enough probability for the objective?), and two synthesis problems (does there exist such a strategy?): one where the interval partition is fixed as input, and one where it is only parameterised. We develop a generic approach based on a compression of the induced infinite MDP that yields decidability in all cases, with all complexities within PSPACE.

Keywords and phrases:
one-counter Markov decision processes, randomised strategies, termination, reachability
Category:
Track B: Automata, Logic, Semantics, and Theory of Programming
Funding:
James C. A. Main: Research Fellow of the Fonds de la Recherche Scientifique – FNRS and member of the TRAIL institute.
Mickael Randour: Research Associate of the Fonds de la Recherche Scientifique – FNRS and member of the TRAIL institute.
Copyright and License:
© Michal Ajdarów, James C. A. Main, Petr Novotný, and Mickael Randour; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Probabilistic computation; Theory of computation → Logic and verification
Related Version:
Full Version: https://arxiv.org/abs/2503.00788 [1]
Acknowledgements:
We thank Sougata Bose for pointing out the reference [6].
Funding:
This work has been supported by the Fonds de la Recherche Scientifique - FNRS under Grant n° T.0188.23 (PDR ControlleRS) and by the Czech Science Foundation grant no. GA23-06963S.
Editors:
Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis

1 Introduction

A Markov decision process (MDP) models the continuous interaction of a controllable system with a stochastic environment. MDPs are used notably in the fields of formal methods (e.g., [4, 27]) and reinforcement learning (e.g., [41]). In each step of an MDP, the system selects an action that is available in the current state, and the state is updated randomly following a distribution depending only on the current state and chosen action. This interaction goes on forever and yields a play (i.e., an execution) of the MDP. The resolution of non-determinism in an MDP is done via a strategy: strategies prescribe choices depending on the current history of the play and may use randomisation. Strategies are the formal counterpart of controllers for the modelled system [38, 10]. A strategy is pure if it does not use randomisation.

A classical problem in MDPs is synthesis, which asks to construct a good strategy with respect to a specification. Specifications can be defined, e.g., via payoff functions that quantify the quality of plays, and via objectives, which are sets of well-behaved plays. The goal is thus to synthesise a strategy that, in the former case, has a high expected payoff, and, in the latter case, enforces the objective with high probability. If possible, it is desirable to construct an optimal strategy, i.e., a strategy whose performance is not (strictly) exceeded by another. The decision variant of the synthesis problem, which asks whether a well-performing strategy exists, is called realisability. We focus on variants of reachability objectives, asking that a target set of states be visited. These canonical objectives are central in synthesis [21].

Strategies and their representations.

Traditionally, strategies are represented by Mealy machines, i.e., finite automata with outputs (see, e.g., [12, 33]). Strategies that admit such a representation are called finite-memory strategies. A special subclass is that of memoryless strategies, i.e., strategies whose decisions depend only on the current state.

In finite MDPs, pure memoryless strategies often suffice to play optimally with a single payoff or objective, e.g., for reachability [4] and parity objectives [22], and mean [8] and discounted-sum payoffs [40]. In countable MDPs, for reachability objectives, pure memoryless strategies are also as powerful as general ones, though optimal strategies need not exist in general [35, 31]. In the multi-objective setting, memory and randomisation are necessary in general (e.g., [39, 21]). Nonetheless, finite memory often suffices in finite MDPs, e.g., when considering several ω-regular objectives [25].

The picture is more complicated in infinite-state MDPs, where even pure memoryless strategies need not admit finite representations. Our contribution focuses on small (and particularly, finite) counter-based representations of memoryless strategies in a fundamental class of infinite-state MDPs: one-counter MDPs.

One-counter Markov decision processes.

One-counter MDPs (OC-MDPs, [16]) are finite MDPs augmented with a counter that can be incremented (by one), decremented (by one) or left unchanged on each transition. (Considering such counter updates is not restrictive for modelling: any integer counter update can be obtained with several transitions. However, this impacts the complexity of decision problems.) Such a finite OC-MDP induces an infinite MDP over a set of configurations given by states of the underlying MDP and counter values. In this induced MDP, any play that reaches counter value zero is interrupted; this event is called termination. We consider two variants of the model: unbounded OC-MDPs, where counter values can grow arbitrarily large, and bounded OC-MDPs, in which plays are interrupted when a fixed counter upper bound is reached.

The counter in OC-MDPs can, e.g., model resource consumption along plays [16], or serve as an abstraction of unbounded data types and structures [20]. It can also model the passage of time: indeed, OC-MDPs generalise finite-horizon MDPs, in which a bound is imposed on the number of steps (see, e.g., [5]). OC-MDPs can be also seen as an extension of one-counter Markov chains with non-determinism. One-counter Markov chains are equivalent to (discrete-time) quasi-birth-death processes [26], a model studied in queuing theory.

Termination is the canonical objective in OC-MDPs [16]. Also relevant is the more general selective termination objective, which requires terminating in a target set of states. In this work, we study both the selective termination objective and the state-reachability objective, which requires visiting a target set of states regardless of the counter value.

Optimal strategies need not exist in unbounded OC-MDPs for these objectives [15]. The general synthesis problem in unbounded OC-MDPs for selective termination is not known to be decidable, and it is at least as hard as the positivity problem for linear recurrence sequences [37], whose decidability would yield a major breakthrough in number theory [36]. Optimal strategies exist in bounded OC-MDPs: the induced MDP is finite and we consider reachability objectives. However, constructing optimal strategies is already EXPTIME-hard for reachability in finite-horizon MDPs [5].

In this work, we propose to tame the inherent complexity of analysing OC-MDPs by restricting our analysis to a class of succinctly representable (yet natural and expressive) strategies called interval strategies.

Interval strategies.

The monotonic structure of OC-MDPs makes them amenable to analysis through strategies of special structure. For instance, in their study of solvency games (a stateless variant of OC-MDPs with binary counter updates), Berger et al. [7] identify a natural class of rich man's strategies, which, whenever the counter value is above some threshold, always select the same action in the same state. They argue that the question of existence of optimal rich man's strategies “gets at a real phenomenon which […] should be reflected in any good model of risk-averse investing.” While [7] shows that optimal rich man's strategies do not always exist, when they do, their existence substantially simplifies the analysis of the model.

In this work, we make such finitely-represented strategies the primary object of study. We consider so-called interval strategies: each such strategy is based on some (finite, or infinite but finitely-representable) partitioning of ℕ into intervals, and the strategy's decision depends on the current state and on the interval containing the current counter value.

More precisely, we focus on two classes of these strategies. On the one hand, in bounded and unbounded OC-MDPs, we consider open-ended interval strategies (OEISs): here, the underlying partition is finite, and thus contains an open-ended interval [n, ∞) on which the strategy behaves memorylessly, akin to rich man's strategies.

On the other hand, in unbounded OC-MDPs, we also consider cyclic interval strategies (CISs): strategies for which there exists a (positive integer) period such that the strategy takes the same decisions for any two counter values that differ by the period. A CIS can be represented similarly to an OEIS using a partition of the set of counter values up to the period.

We collectively refer to OEISs and CISs as interval strategies. While interval strategies are not sufficient to play optimally in unbounded OC-MDPs [7], they can be used to approximate the supremum probability for the objectives we consider [15].

Our contributions.

We formally introduce the class of interval strategies. We show that OEISs in bounded OC-MDPs and CISs in unbounded OC-MDPs can be exponentially more concise than equivalent Mealy machines (Sect. 3), and that OEISs in unbounded OC-MDPs can even represent infinite-memory strategies.

For selective termination and state reachability, we consider the interval strategy verification problem and two realisability problems for structure-constrained interval strategies. On the one hand, the fixed-interval realisability problem asks, given an interval partition, whether there is an interval strategy built on this partition that ensures the objective with a probability greater than a given threshold. Intuitively, in this case, the system designer specifies the desired structure of the controller. On the other hand, the parameterised realisability problem (for interval strategies) asks whether there exists a well-performing strategy built on a partition of size no more than a parameter d such that no finite interval is larger than a second parameter n. We consider two variants of the realisability problem: one for checking the existence of a suitable pure strategy and another for randomised strategies. Randomisation allows for better performance when imposing structural constraints on strategies (Ex. 3), but pure strategies are often preferred for synthesis [24].

Table 1: Complexity bound summary for our problems. All bounds are within PSPACE. Square-root-sum-hardness results are derived from instances of the form ∑_i √x_i ≤ y. The entries of this table are annotated with the associated results in the full version of this paper [1].

Our complexity results are summarised in Table 1. Analysing the performance of a memoryless strategy amounts to studying the Markov chain it induces on the MDP. The Markov chain induced by an interval strategy is infinite in unbounded OC-MDPs and of exponential size in bounded OC-MDPs, and thus cannot be analysed in polynomial space with classical approaches. Our results rely on the analysis of a compressed Markov chain derived from this large induced Markov chain. We remove certain configurations and aggregate several transitions into one (Sect. 4). This compressed Markov chain preserves the probability of selective termination and of hitting counter upper bounds (Thm. 6). However, its transition probabilities may require exponential-size representations or even be irrational. To represent these probabilities, we characterise them as the least solutions of quadratic systems of equations (Thm. 8), similarly to the result of [32] for termination probabilities in probabilistic pushdown automata. Compressed Markov chains are finite for OEISs and are induced by a one-counter Markov chain for CISs (Lem. 10).

The crux of our algorithmic results is the aforementioned compression. For verification, we reduce the problem to checking the validity of a universal formula in the theory of the reals, by exploiting our characterisation of transition probabilities in compressed Markov chains. This induces a PSPACE upper bound. For bounded OC-MDPs, we can do better: verification can be solved in polynomial time in the unit-cost arithmetic RAM model of computation of Blum, Shub and Smale [11], by computing transition and reachability probabilities of the compressed Markov chain. This yields a P^PosSLP complexity in the Turing model (see [2]).

Both realisability variants exploit the verification approach through the theory of the reals. For fixed-interval realisability for pure strategies, we exploit non-determinism to select good strategies and then verify them with the above. In the randomised case, in essence, we build on the verification formulae and existentially quantify over the probabilities of actions under the sought strategy. Finally, for parameterised realisability, we build on our algorithms for the fixed-interval case by first non-deterministically building an appropriate partition.

We also provide complexity lower bounds. We show that all of our considered problems are hard for the square-root-sum problem, a problem that is not known to be solvable in polynomial time but that is solvable in polynomial time in the Blum-Shub-Smale model [42]. We also prove NP-hardness for our realisability problems for selective termination, already when checking the existence of good single-interval strategies.

Impact.

Our results provide a natural class of strategies for which realisability is decidable (whereas the general case remains open and known to be difficult [37]), and with arguably low complexity (for synthesis). Furthermore, the class of interval strategies is of practical interest thanks to their concise representation and their inherently understandable structure (in contrast to the corresponding Mealy machine representation).

Related work.

In addition to the main references cited previously, we mention some (non-exhaustive) related work. The closest is [9]: interval strategies are similar to the counter selector strategies studied in consumption MDPs. These differ from OC-MDPs in key aspects: all transitions consume resources (i.e., bear negative weights), and recharge can only be done in special reload states, where it is considered as an atomic action up to a given capacity. Consumption and counter-based (or energy) models have different behaviours (e.g., [19]). The authors of [9] also study incomparable objectives: almost-sure (i.e., probability one) Büchi objectives.

Restricting oneself to subclasses of strategies that prove to be of practical interest is a common endeavour in synthesis: e.g., strategies represented by decision trees [17, 18, 3, 30], pure strategies [29, 24, 13], finite-memory strategies [23, 14, 13], or strategies using limited forms of randomness [33, 34].

Outline.

Due to space constraints, we only present an overview of our results. Technical details and proofs can be found in the full version of this paper [1]. In Sect. 2, we introduce all prerequisite notions. We define interval strategies and formalise our decision problems in Sect. 3. The compressed Markov chain construction, which is central in our verification and realisability algorithms, is presented in Sect. 4. Complexity results are overviewed in Sect. 5.

2 Preliminaries

Set and probability theory.

We write ℕ, ℚ and ℝ for the sets of non-negative integers, rational numbers and real numbers respectively. We let ℕ_{>0} denote the set of positive integers. We let ℕ̄ = ℕ ∪ {∞} and ℕ̄_{>0} = ℕ_{>0} ∪ {∞}. Given n, n′ ∈ ℕ̄, we let ⟦n, n′⟧ denote the set {k ∈ ℕ ∣ n ≤ k ≤ n′}, and if n = 0, we shorten the notation to ⟦n′⟧.

Let A be a finite or countable set. We write 𝒟(A) for the set of distributions over A, i.e., of functions μ: A → [0, 1] such that ∑_{a∈A} μ(a) = 1. Given a distribution μ ∈ 𝒟(A), we let 𝗌𝗎𝗉𝗉(μ) = {a ∈ A ∣ μ(a) > 0} denote the support of μ.

Markov decision processes.

Formally, a Markov decision process (MDP) is a tuple ℳ = (S, A, δ) where S is a countable set of states, A is a finite set of actions and δ: S × A ⇀ 𝒟(S) is a (partial) probabilistic transition function. Given s ∈ S, we write A(s) for the set of actions a ∈ A such that δ(s, a) is defined. We say that an action a is enabled in s if a ∈ A(s). We require, without loss of generality, that there are no deadlocks in MDPs, i.e., that for all states s, A(s) is non-empty. We say that ℳ is finite if its state space is finite.

A Markov chain can be seen as an MDP with a single enabled action per state. We omit actions in Markov chains and denote them as tuples 𝒞 = (S, δ) with δ: S → 𝒟(S).

Let ℳ = (S, A, δ). A play of ℳ is an infinite sequence π = s₀a₀s₁a₁… ∈ (SA)^ω such that for all ℓ ∈ ℕ, s_{ℓ+1} ∈ 𝗌𝗎𝗉𝗉(δ(s_ℓ, a_ℓ)). A history is a finite prefix of a play ending in a state. We let 𝖯𝗅𝖺𝗒𝗌(ℳ) and 𝖧𝗂𝗌𝗍(ℳ) respectively denote the set of plays and of histories of ℳ. Given a play π = s₀a₀s₁a₁… and n ∈ ℕ, we let π|_n = s₀a₀…a_{n−1}s_n and extend this notation to histories. For any given history h = s₀a₀…a_{n−1}s_n, we let 𝖿𝗂𝗋𝗌𝗍(h) = s₀ and 𝗅𝖺𝗌𝗍(h) = s_n. The cylinder of a history h = s₀a₀…a_{n−1}s_n is the set 𝖢𝗒𝗅(h) = {π ∈ 𝖯𝗅𝖺𝗒𝗌(ℳ) ∣ π|_n = h} of plays extending h. For any set of histories H, we let 𝖢𝗒𝗅(H) = ⋃_{h∈H} 𝖢𝗒𝗅(h).

In a Markov chain, due to the lack of actions, plays and histories are simply sequences of states coherent with transitions. We extend the notation introduced above for plays and histories of MDPs to plays and histories of Markov chains.

Strategies.

A strategy describes the action choices made throughout a play of an MDP. In general, strategies can use randomisation and all previous knowledge to make decisions. Formally, a strategy is a function σ: 𝖧𝗂𝗌𝗍(ℳ) → 𝒟(A) such that for any history h ∈ 𝖧𝗂𝗌𝗍(ℳ), 𝗌𝗎𝗉𝗉(σ(h)) ⊆ A(𝗅𝖺𝗌𝗍(h)), i.e., a strategy can only prescribe actions that are available in the last state of a history. A play π = s₀a₀s₁… is consistent with a strategy σ if for all ℓ ∈ ℕ, σ(π|_ℓ)(a_ℓ) > 0. Consistency of a history with respect to a strategy is defined similarly.

A strategy is pure if it does not use any randomisation, i.e., if it maps histories to Dirac distributions. We view pure strategies as functions σ: 𝖧𝗂𝗌𝗍(ℳ) → A. A strategy σ is memoryless if the distribution it suggests depends only on the last state of a history, i.e., if for all histories h and h′, 𝗅𝖺𝗌𝗍(h) = 𝗅𝖺𝗌𝗍(h′) implies σ(h) = σ(h′). We view memoryless strategies and pure memoryless strategies as functions S → 𝒟(A) and S → A respectively.

Given a strategy σ and an initial state s_𝗂𝗇𝗂𝗍 ∈ S, we define a distribution ℙ^σ_{ℳ,s_𝗂𝗇𝗂𝗍} over measurable sets of plays in the usual way (e.g., [4]). For a Markov chain 𝒞, due to the absence of non-determinism, we drop the strategy from this notation. If the relevant MDP or Markov chain is clear from the context, we also omit them.

In the following, we focus on memoryless strategies. If σ is a memoryless strategy of ℳ, we consider the Markov chain induced by σ on ℳ as a Markov chain over S. Formally, it is defined as the Markov chain (S, δ′) such that for all s, s′ ∈ S, δ′(s)(s′) = ∑_{a∈A(s)} σ(s)(a)·δ(s, a)(s′).
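For finite MDPs, this induced chain is straightforward to compute. The following minimal sketch (in Python; the data structures and names are ours, chosen for illustration) implements the formula above.

```python
from collections import defaultdict

def induced_chain(delta, sigma):
    """Markov chain induced by a memoryless strategy sigma on an MDP.

    delta: dict (state, action) -> dict {successor: probability}
    sigma: dict state -> dict {action: probability}
    Returns a dict state -> dict {successor: probability} realising
    delta'(s)(s') = sum over a in A(s) of sigma(s)(a) * delta(s, a)(s').
    """
    chain = {}
    for s, choice in sigma.items():
        out = defaultdict(float)
        for a, p_a in choice.items():
            for t, p_t in delta[(s, a)].items():
                out[t] += p_a * p_t
        chain[s] = dict(out)
    return chain

# Toy MDP: in state q, action a flips a fair coin, action b surely reaches t.
delta = {("q", "a"): {"q": 0.5, "t": 0.5}, ("q", "b"): {"t": 1.0},
         ("t", "a"): {"t": 1.0}}
sigma = {"q": {"a": 0.5, "b": 0.5}, "t": {"a": 1.0}}
print(induced_chain(delta, sigma))  # {'q': {'q': 0.25, 't': 0.75}, 't': {'t': 1.0}}
```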

One-counter Markov decision processes.

One-counter MDPs (OC-MDPs) extend MDPs with a counter that can be incremented, decremented or left unchanged on each transition. Formally, a one-counter MDP is a tuple 𝒬 = (Q, A, δ, w) where (Q, A, δ) is a finite MDP and w: Q × A ⇀ {−1, 0, 1} is a (partial) weight function that assigns an integer weight from {−1, 0, 1} to state-action pairs. For all q ∈ Q and a ∈ A, we require that w(q, a) be defined whenever a ∈ A(q). A configuration of 𝒬 is a pair (q, k) where q ∈ Q and k ∈ ℕ. In the sequel, by plays, histories, strategies etc. of 𝒬, we refer to the corresponding notion with respect to the MDP (Q, A, δ) underlying 𝒬.

We remark that the weight function is not subject to randomisation, i.e., the weight of any transition is determined by the outgoing state and the chosen action. In particular, counter values can be inferred from histories and taken into account to make decisions in 𝒬.

Let 𝒬 = (Q, A, δ, w) be an OC-MDP. The OC-MDP 𝒬 induces an MDP over the infinite countable space of configurations. Transitions in this induced MDP are defined using δ for the probability of updating the state and w for the deterministic change of counter value. We interrupt any play whenever a configuration with counter value 0 is reached. Intuitively, such configurations can be seen as situations in which we have run out of an energy resource. We may also impose an upper bound on the counter value, and interrupt any play that reaches this counter upper bound. We refer to OC-MDPs with a finite upper bound on counter values as bounded OC-MDPs and to OC-MDPs with no upper bound as unbounded OC-MDPs.

We provide a unified definition for both semantics. Let 𝒬 = (Q, A, δ, w) be an OC-MDP and let r ∈ ℕ̄_{>0} be an upper bound on the counter value. We define the MDP ℳ_r(𝒬) = (Q × ⟦r⟧, A, δ_r) where δ_r is defined, for all configurations s = (q, k) ∈ Q × ⟦r⟧, actions a ∈ A(q) and states p ∈ Q, by δ_r(s, a)(p, k + w(q, a)) = δ(q, a)(p) if k ∉ {0, r}, and δ_r(s, a)(s) = 1 otherwise. The state space of ℳ_r(𝒬) is finite if and only if r ∈ ℕ.

For technical reasons, we also introduce one-counter Markov chains. In one-counter Markov chains, we authorise stochastic counter updates, i.e., counter updates are integrated in the transition function. This contrasts with OC-MDPs, where deterministic counter updates are used to allow strategies to observe counter updates. We only require the unbounded semantics in this case. Formally, a one-counter Markov chain is a tuple 𝒫 = (Q, δ) where Q is a finite set of states and δ: Q → 𝒟(Q × {−1, 0, 1}) is a probabilistic transition and counter update function. The one-counter Markov chain 𝒫 induces a Markov chain 𝒞(𝒫) = (Q × ℕ, δ′) such that for any configuration s = (q, k) ∈ Q × ℕ, any p ∈ Q and any u ∈ {−1, 0, 1}, we have δ′(s)((p, k + u)) = δ(q)(p, u) if k ≠ 0 and δ′(s)(s) = 1 otherwise.
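As a concrete illustration (an illustrative sketch, not part of the paper), termination in the induced chain 𝒞(𝒫) can be estimated by sampling plays; truncating each play at a step budget only under-approximates the termination probability.

```python
import random

def terminates(P, q0, k0, max_steps=100_000):
    """Simulate one play of C(P); True iff counter value 0 is reached.

    P: dict state -> list of ((successor, counter_update), probability).
    """
    q, k = q0, k0
    for _ in range(max_steps):
        if k == 0:                       # configurations (q, 0) are absorbing
            return True
        moves, weights = zip(*P[q])
        q, u = random.choices(moves, weights=weights, k=1)[0]
        k += u
    return False

# Downward-biased walk: counter -1 with probability 3/4, +1 with probability 1/4.
P = {"p": [(("p", -1), 0.75), (("p", +1), 0.25)]}
runs = 10_000
print(sum(terminates(P, "p", 1) for _ in range(runs)) / runs)  # close to 1.0
```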

Objectives.

Let ℳ = (S, A, δ) be an MDP. An objective is a measurable set of plays that represents a specification. We focus on variants of reachability. Given a set of target states T ⊆ S, the reachability objective for T is defined as 𝖱𝖾𝖺𝖼𝗁(T) = {s₀a₀s₁a₁… ∈ 𝖯𝗅𝖺𝗒𝗌(ℳ) ∣ ∃ℓ ∈ ℕ, s_ℓ ∈ T}. If T = {s}, we write 𝖱𝖾𝖺𝖼𝗁(s) for 𝖱𝖾𝖺𝖼𝗁({s}). We assume familiarity with linear systems for reachability probabilities in finite Markov chains (see, e.g., [4]).
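For concreteness, the standard linear system can be solved as in the following minimal sketch (ours, not the paper's). For simplicity it assumes that every state reaches the target with positive probability, so the system is non-singular; in general, states that cannot reach T must first be identified and their value set to 0.

```python
import numpy as np

def reach_probs(chain, targets):
    """Reachability probabilities in a finite Markov chain.

    chain: dict state -> dict {successor: probability}; targets: set of states.
    Solves x_s = 1 for s in T and x_s = sum_t P(s, t) * x_t otherwise.
    """
    states = sorted(chain)
    idx = {s: i for i, s in enumerate(states)}
    A, b = np.eye(len(states)), np.zeros(len(states))
    for s in states:
        if s in targets:
            b[idx[s]] = 1.0
        else:
            for t, p in chain[s].items():
                A[idx[s], idx[t]] -= p
    return dict(zip(states, np.linalg.solve(A, b)))

print(reach_probs({"q": {"q": 0.25, "t": 0.75}, "t": {"t": 1.0}}, {"t"}))
# {'q': 1.0, 't': 1.0}
```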

Let 𝒬 = (Q, A, δ, w) be an OC-MDP, let r ∈ ℕ̄_{>0} be a counter upper bound and let T ⊆ Q. We consider two variants of reachability on ℳ_r(𝒬). The first goal we consider requires visiting T regardless of the counter value and the second requires visiting T with a counter value of zero. The first objective is called state-reachability and we denote it by 𝖱𝖾𝖺𝖼𝗁(T) instead of 𝖱𝖾𝖺𝖼𝗁(T × ⟦r⟧). The second objective is called selective termination, which is formalised as 𝖳𝖾𝗋𝗆(T) = 𝖱𝖾𝖺𝖼𝗁(T × {0}). Like above, if T = {q}, we write 𝖱𝖾𝖺𝖼𝗁(q) and 𝖳𝖾𝗋𝗆(q) instead of 𝖱𝖾𝖺𝖼𝗁({q}) and 𝖳𝖾𝗋𝗆({q}).

The Blum-Shub-Smale model of computation.

Some of our complexity bounds rely on a model of computation introduced by Blum, Shub and Smale [11]. A Blum-Shub-Smale (BSS) machine, intuitively, is a random-access memory machine with registers storing real numbers. Arithmetic operations on the content of registers take constant time in this model.

The class of decision problems that can be solved in polynomial time in the BSS model coincides with the class of decision problems that can be solved, in the Turing model, in polynomial time with a PosSLP oracle [2]. The PosSLP problem asks, given a division-free straight-line program (intuitively, an arithmetic circuit), whether its output is positive. The PosSLP problem lies in the counting hierarchy and can be solved in polynomial space [2].

Theory of the reals.

The theory of the reals refers to the set of sentences (i.e., fully quantified first-order logical formulae) in the signature of ordered fields that hold in ℝ. The problem of deciding whether a sentence is in the theory of the reals is decidable; if the number of quantifier blocks is fixed, this can be done in PSPACE [6, Rmk. 13.10]. Furthermore, deciding the validity of an existential (resp. universal) formula is NP-hard (resp. coNP-hard) [6, Rmk. 13.9]. Our complexity bounds refer to the complexity classes ETR (existential theory of the reals) and co-ETR, which contain the problems that can be reduced in polynomial time to checking the membership in the theory of the reals of an existential sentence and of a universal sentence respectively.

3 Interval strategies and decision problems

We now introduce interval strategies and compare them to finite-memory strategies of the MDP underlying the OC-MDP. We close this section by formulating our decision problems. We fix an OC-MDP 𝒬 = (Q, A, δ, w) and a counter upper bound r ∈ ℕ̄_{>0}.

Interval strategies.

Interval strategies are a subclass of memoryless strategies of ℳ_r(𝒬) that admit finite compact representations based on families of intervals. Intuitively, a strategy is an interval strategy if there exists an interval partition (i.e., a partition containing only intervals) of the set of counter values such that decisions taken in a state for two counter values in the same interval coincide. We require that this partition has a finite representation to formulate verification and synthesis problems for interval strategies.

The set ⟦1, r − 1⟧ contains all counter values for which decisions are relevant. Let ℐ be an interval partition of ⟦1, r − 1⟧. A memoryless strategy σ of ℳ_r(𝒬) is based on the partition ℐ if for all q ∈ Q, all I ∈ ℐ and all k, k′ ∈ I, we have σ(q, k) = σ(q, k′). All memoryless strategies are based on the trivial partition of ⟦1, r − 1⟧ into singleton sets. In practice, we are interested in strategies based on partitions with a small number of large intervals.

We define two classes of interval strategies: strategies that are based on finite partitions and, in unbounded OC-MDPs, strategies that are based on periodic partitions. An interval partition ℐ of ℕ_{>0} is periodic if there exists a period ρ ∈ ℕ_{>0} such that for all I ∈ ℐ, I + ρ = {k + ρ ∣ k ∈ I} ∈ ℐ. A periodic interval partition ℐ with period ρ induces the interval partition 𝒥 = {I ∈ ℐ ∣ I ⊆ ⟦1, ρ⟧} of ⟦1, ρ⟧. Conversely, for any ρ ∈ ℕ_{>0}, an interval partition 𝒥 of ⟦1, ρ⟧ generates the periodic partition ℐ = {I + kρ ∣ I ∈ 𝒥, k ∈ ℕ}.

Let σ be a memoryless strategy of ℳ_r(𝒬). The strategy σ is an open-ended interval strategy (OEIS) if it is based on a finite interval partition of ⟦1, r − 1⟧. The qualifier open-ended follows from there being an unbounded interval in any finite interval partition of ℕ_{>0} (for the unbounded case). When r = ∞, σ is a cyclic interval strategy (CIS) if there exists a period ρ ∈ ℕ_{>0} such that for all q ∈ Q and all k ∈ ℕ_{>0}, we have σ(q, k) = σ(q, k + ρ). A strategy is a CIS with period ρ if and only if it is based on a periodic interval partition of ℕ_{>0} with period ρ. (We do not consider strategies based on ultimately periodic interval partitions. However, our techniques can be adapted to analyse such strategies, and we can show that our complexity bounds for the decision problems we study extend to this case.)

We represent interval strategies as follows. First, assume that σ is an OEIS and let ℐ be the coarsest finite interval partition of ⟦1, r − 1⟧ on which σ is based. We can represent σ by a table that lists, for each I ∈ ℐ, the bounds of I and a memoryless strategy of 𝒬 dictating the choices to be made when the current counter value lies in I. Next, assume that r = ∞ and that σ is a CIS with period ρ. Let 𝒥 be an interval partition of ⟦1, ρ⟧ such that σ is based on the partition generated by 𝒥. We represent σ by ρ and an OEIS of ℳ_{ρ+1}(𝒬) based on 𝒥 that specifies the behaviour of σ for counter values up to ρ.

Interval strategies subsume counter-oblivious strategies, i.e., memoryless strategies that make choices based only on the state and disregard the current counter value. Counter-oblivious strategies can be viewed as memoryless strategies σ: Q → 𝒟(A) of 𝒬.
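To make the table representations above concrete, here is a minimal sketch of lookup structures for OEISs and CISs (the class and field names are ours): an OEIS is a table of interval bounds with one memoryless strategy per row, and a CIS reduces a counter value modulo its period before the lookup.

```python
import math
from bisect import bisect_right

class OEIS:
    """Table representation of an open-ended interval strategy."""
    def __init__(self, rows):
        # rows: list of (lo, hi, {state: {action: probability}}); the
        # intervals are assumed to partition the relevant counter values.
        self.rows = sorted(rows)
        self.lows = [lo for lo, _, _ in self.rows]

    def __call__(self, q, k):
        lo, hi, memoryless = self.rows[bisect_right(self.lows, k) - 1]
        assert lo <= k <= hi, "counter value outside the partition"
        return memoryless[q]

class CIS:
    """Cyclic interval strategy: an interval table over [1, period],
    consulted modulo the period."""
    def __init__(self, period, table):
        self.period, self.table = period, table

    def __call__(self, q, k):
        return self.table(q, (k - 1) % self.period + 1)

# The OEIS of Ex. 1 below: play b at counter value 1 and a above it.
sigma = OEIS([(1, 1, {"q1": {"b": 1.0}}), (2, math.inf, {"q1": {"a": 1.0}})])
print(sigma("q1", 1), sigma("q1", 42))  # {'b': 1.0} {'a': 1.0}
```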

Substituting counters with memory.

Memoryless strategies of ℳ_r(𝒬) can be seen as strategies of 𝒬 when an initial counter value is fixed. The idea is to use memory to keep track of the counter instead of observing it. We can thus compare our representation of interval strategies to the classical Mealy machine representation of the corresponding strategies of 𝒬. The following discussion is formalised in the full version of this paper [1].

If r ∈ ℕ, the counterpart in 𝒬 of any OEIS has finite memory: there are only finitely many counter values. However, a Mealy machine for this strategy can require a size exponential in the binary encoding of r (which occurs in the OEIS representation). Below, we illustrate this claim and that, if r = ∞, then counterparts in 𝒬 of OEISs may require infinite memory.

Example 1.

We consider the OC-MDP 𝒬 illustrated in Fig. 1(a). Let r ∈ ℕ̄_{>0} with r ≥ 3. We consider the OEIS σ defined by σ(q₁, k) = a for all k ∈ ⟦2, r − 1⟧ and σ(q₁, 1) = b. The strategy σ maximises the probability of terminating in q₂ from the configuration (q₀, 1).

(a) An OC-MDP. We indicate the weight of a state-action pair next to each action.
(b) An OC-MDP where all weights are −1 and are omitted from the figure.
Figure 1: OC-MDP examples. Edges that split after an action indicate probabilistic transitions.

We let τ₁^σ denote the counterpart of σ in 𝒬 for the initial counter value 1. Intuitively, any automaton-based memory structure must distinguish r − 1 different histories, and thus at least r − 1 memory states are necessary to implement τ₁^σ via a Mealy machine. As σ requires a specific action for (q₁, 1), for all 1 ≤ k < r − 1, we must be able to distinguish the histories (q₀a)^k q₁ from one another. In particular, if r = ∞, then τ₁^σ is not finite-memory.

We now assume that r = ∞ and that σ is a CIS. Its counterpart in 𝒬 is a finite-memory strategy: by periodicity of σ, it suffices to keep track of the remainder of the division of the counter value by the period. By adapting Ex. 1, we can show that a CIS representation may be exponentially more succinct than any Mealy machine for its counterpart in 𝒬.

Decision problems.

We formulate the decision problems studied in the following sections. The common inputs of these problems are an OC-MDP 𝒬 with rational transition probabilities, a counter bound r ∈ ℕ̄_{>0} (encoded in binary if it is finite), a target set T ⊆ Q, an objective Ω ∈ {𝖱𝖾𝖺𝖼𝗁(T), 𝖳𝖾𝗋𝗆(T)}, an initial configuration s_𝗂𝗇𝗂𝗍 = (q_𝗂𝗇𝗂𝗍, k_𝗂𝗇𝗂𝗍) and a threshold θ ∈ [0, 1] against which we compare the probability of Ω. Problems that are related to CISs assume that r = ∞, as all strategies in the bounded case are OEISs. We lighten the probability notation below by omitting ℳ_r(𝒬) from it.

First, we are concerned with the verification of interval strategies. The interval strategy verification problem asks to decide, given an interval strategy σ, whether ℙ^σ_{s_𝗂𝗇𝗂𝗍}(Ω) ≥ θ. We assume that the encoding of the input interval strategy matches the description above.

The other two problems relate to interval strategy synthesis. The corresponding decision problem is called realisability. We provide algorithms checking the existence of well-performing structurally-constrained interval strategies. We formulate two variants of this problem.

For the first variant, we fix an interval partition of ⟦1, r − 1⟧ beforehand and ask whether there is a good strategy based on it. Formally, the fixed-interval OEIS realisability problem asks whether, given a finite interval partition ℐ of ⟦1, r − 1⟧, there exists an OEIS σ based on ℐ such that ℙ^σ_{s_𝗂𝗇𝗂𝗍}(Ω) ≥ θ. The variant for CISs is defined similarly: the fixed-interval CIS realisability problem asks whether, given a period ρ ∈ ℕ_{>0} and an interval partition 𝒥 of ⟦1, ρ⟧, there exists a CIS σ based on the partition generated by 𝒥 such that ℙ^σ_{s_𝗂𝗇𝗂𝗍}(Ω) ≥ θ.

For the second variant, we parameterise the number of intervals in the partition and the size of bounded intervals. The parameterised OEIS realisability problem asks, given a bound d ∈ ℕ_{>0} on the number of intervals and a bound n ∈ ℕ_{>0} on the size of bounded intervals, whether there exists an OEIS σ such that ℙ^σ_{s_𝗂𝗇𝗂𝗍}(Ω) ≥ θ and σ is based on an interval partition ℐ of ⟦1, r − 1⟧ with |ℐ| ≤ d and |I| ∈ ⟦1, n⟧ for all bounded I ∈ ℐ (in particular, the parameter n does not constrain the required infinite interval in the unbounded setting r = ∞). The parameterised CIS realisability problem asks, given a bound d ∈ ℕ_{>0} on the number of intervals and a bound n ∈ ℕ_{>0} on the size of intervals, whether there exists a CIS σ such that ℙ^σ_{s_𝗂𝗇𝗂𝗍}(Ω) ≥ θ and σ is based on an interval partition ℐ of ℕ_{>0} with period ρ such that |I| ≤ n for all I ∈ ℐ and ℐ induces a partition of ⟦1, ρ⟧ with at most d intervals. In both cases, we assume that the number d is given in unary. This ensures that witness strategies, when they exist, are based on interval partitions that have a representation of size polynomial in the size of the inputs.

 Remark 2.

In bounded OC-MDPs, instances of the parameterised OEIS realisability problem such that no partitions are compatible with the input parameters d and n are trivially negative. This issue does not arise for OEISs in unbounded OC-MDPs or for CISs, as counter-oblivious strategies are always possible witnesses no matter the parameters.  

For both realisability problems, we consider two variants, depending on whether we want the answer with respect to the set of pure or randomised interval strategies. For many objectives in MDPs (e.g., reachability, parity objectives [4, 13]), the maximum probability of satisfying the objective is the same for pure and randomised strategies. As highlighted by the following example, when we restrict the structure of the sought interval strategy, there may exist randomised strategies that perform better than all pure ones. Intuitively, randomisation can somewhat alleviate the loss in flexibility caused by structural restrictions [23].

Example 3.

The fixed-interval and parameterised realisability problems subsume the realisability problem for counter-oblivious strategies. The goal of this example is to provide an OC-MDP in which there exists a randomised counter-oblivious strategy that performs better than any pure counter-oblivious strategy from a given configuration.

We consider the OC-MDP 𝒬 depicted in Fig. 1(b) in which all weights are −1, the objective 𝖱𝖾𝖺𝖼𝗁(t) = 𝖳𝖾𝗋𝗆(t), a counter bound r ≥ 3 and the initial configuration (q, 2). To maximise the probability of reaching t from (q, 2) with an unrestricted strategy, we must select action a in (q, 2) and then b in (q, 1). Intuitively, a is used when two steps remain, as it allows another attempt, and we switch to b when one step is left, because it is the action that maximises the probability of reaching t in a single step.

We now limit our attention to counter-oblivious strategies. For pure strategies, no matter whether action a or b is chosen in q, t is reached with probability 3/4 from (q, 2). However, when playing both actions uniformly at random in q, the resulting reachability probability from (q, 2) is 25/32 > 3/4. This shows that randomised counter-oblivious strategies can achieve better reachability (resp. selective termination) probabilities than pure strategies.
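Since Fig. 1(b) is not reproduced here, the following sketch assumes one transition structure consistent with the probabilities reported in Ex. 3 (the actual figure may differ): from q, action a reaches t with probability 1/2 and otherwise returns to q, while action b reaches t with probability 3/4 and otherwise moves to a losing sink; all weights are −1, so at most two steps remain from (q, 2).

```python
from fractions import Fraction as F

def reach_t(p_a, steps):
    """Probability of reaching t from (q, steps) when q plays a with
    probability p_a and b with probability 1 - p_a (assumed structure:
    a hits t w.p. 1/2 and otherwise loops back to q; b hits t w.p. 3/4)."""
    if steps == 0:
        return F(0)
    hit = p_a * F(1, 2) + (1 - p_a) * F(3, 4)        # reach t in this step
    retry = p_a * F(1, 2) * reach_t(p_a, steps - 1)  # a looped back to q
    return hit + retry

print(reach_t(F(1), 2), reach_t(F(0), 2), reach_t(F(1, 2), 2))
# 3/4 (pure a), 3/4 (pure b), 25/32 (uniform randomisation)
```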

4 Compressing induced Markov chains

This section introduces the construction underlying our algorithms for the decision problems presented in Sect. 3. To analyse the (possibly infinite) Markov chains induced by memoryless strategies over the space of configurations of an OC-MDP, we build a compressed Markov chain, i.e., a Markov chain with fewer configurations and adjusted transitions. This construction is generic, in the sense that it can formally be defined for any memoryless strategy.

We introduce the compression construction in Sect. 4.1. Sect. 4.2 overviews its relevant properties for our algorithms.

For the whole section, we fix an OC-MDP 𝒬 = (Q, A, δ, w), a bound r ∈ ℕ̄_{>0} on counter values and a memoryless strategy σ of ℳ_r(𝒬) based on an interval partition ℐ of ⟦1, r − 1⟧.

4.1 Defining compressed Markov chains

We introduce the compressed Markov chain 𝒞_ℐ^σ(𝒬) derived from the Markov chain induced on ℳ_r(𝒬) by σ and the partition ℐ. We write 𝒞_ℐ^σ instead of 𝒞_ℐ^σ(𝒬) whenever 𝒬 is clear from the context. Intuitively, we keep some configurations in the state space of 𝒞_ℐ^σ and, to balance this, transitions of 𝒞_ℐ^σ aggregate several histories of the induced Markov chain. We also apply compression for bounded intervals: interval bounds are represented in binary and thus the size of an interval is exponential in its encoding size. We open with an example.

Example 4.

We consider the OC-MDP depicted in Fig. 2(a) and the counter upper bound r = ∞. Let σ denote the randomised OEIS based on ℐ = {⟦1, 7⟧, ⟦8, ∞⟧} such that σ(q, 1)(a) = σ(p, 1)(a) = σ(q, 8)(c) = 1 and σ(p, 8)(a) = σ(p, 8)(b) = 1/2.

(a) An OC-MDP. Weights are written next to actions.
(b) Fragment of the compressed Markov chain of Ex. 4 reachable from (q,1). Configuration parentheses are omitted to lighten the figure.
Figure 2: An illustration of an OC-MDP and its compression for a specific strategy.

We define the compressed Markov chain 𝒞_ℐ^σ depicted in Fig. 2(b) by processing each interval individually. First, we consider the bounded interval I = ⟦1, 7⟧. When we enter I from its minimum or maximum, we only consider counter jumps by powers of 2, starting with 1 = 2⁰. If a counter value in I is reached by jumping by 2^β, we consider counter updates of 2^{β+1} from it. This explains the counter progressions from (q, 1) to (q, 8) and from (t, 7) to (t, 0). The state space of 𝒞_ℐ^σ with respect to I contains the configurations whose counter values can be reached via this scheme. Transitions aggregate several histories of ℳ_∞(𝒬), e.g., the probability from s = (q, 2) to s′ = (p, 4) is the probability under σ of all histories of ℳ_∞(𝒬) from s to s′ along which counter values elsewhere than in s′ remain between min I = 1 and 3 (i.e., the counter value before the last step). The encoding of transition probabilities may be exponential in the number of retained configurations; this is highlighted by the progression of probability denominators between (q, 1) and (q, 8).

For the unbounded interval I = ⟦8, ∞⟧, we only consider configurations with counter value min I = 8 and consider transitions to configurations with counter value min I − 1 = 7. In this case, for instance, the transition probability from (p, 8) to (p, 7) corresponds to the probability under σ in ℳ_∞(𝒬) of hitting counter value 7 for the first time in p from (p, 8). This example illustrates that this probability can be irrational. Here, the probability of moving from (p, 8) to (p, 7) is a solution of the quadratic equation x = 1/4 + (1/2)·x² (see [32]): 1/4 is the probability of directly moving from (p, 8) to (p, 7) and (1/2)·x² is the probability of moving from (p, 8) to (p, 7) by first going through the intermediate configuration (p, 9).

Finally, we comment on the absorbing state ⊥. The rules making up transitions of 𝒞_ℐ^σ outlined above require a change in counter value. We redirect the probability of never seeing such a change to ⊥. In this example, σ does not allow a counter decrease from (q, 8).
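Least solutions of such equations can be approximated by Kleene (fixed-point) iteration from 0, since the underlying operator is monotone; a minimal sketch (ours) for the equation of Ex. 4:

```python
import math

def kleene(f, tol=1e-12, max_iter=100_000):
    """Iterate a monotone operator on [0, 1] from 0; this converges to
    its least fixed point."""
    x = 0.0
    for _ in range(max_iter):
        nxt = f(x)
        if abs(nxt - x) <= tol:
            return nxt
        x = nxt
    return x

# Probability of moving from (p, 8) to (p, 7) in Ex. 4: least solution of
# x = 1/4 + (1/2) x^2, namely 1 - sqrt(2)/2, an irrational number.
print(kleene(lambda x: 0.25 + 0.5 * x * x))  # 0.29289321...
print(1 - math.sqrt(2) / 2)                  # 0.29289321...
```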

Ex. 4 outlines the main ideas to construct 𝒞_ℐ^σ. To guarantee that bounded intervals of ℐ can only be entered by one of their bounds in general, we require that, for all bounded I ∈ ℐ, |I| = 2^{β_I} − 1 for some β_I ∈ ℕ. This is not prohibitive: given a bounded interval, we can partition it into sub-intervals satisfying this constraint in polynomial time (see the full version of this paper [1]).

We now formalise 𝒞_ℐ^σ = (S_ℐ, δ_ℐ^σ). We start by defining its state space S_ℐ, which does not depend on σ. We first formalise the configurations that are retained for each interval.

Let I ∈ ℐ. First, let us assume that I is unbounded and let b_I denote its minimum. We set S_I = Q × {b_I}, i.e., we only retain the configurations with minimal counter value in I.

Next, let us assume that I is bounded and of the form ⟦b_I⁻, b_I⁺⟧. Let β_I = log₂(|I| + 1) (i.e., we have |I| = 2^{β_I} − 1). The set of retained configurations for I is given by

S_I = Q × ({b_I⁻ + 2^α − 1 ∣ α ∈ ⟦β_I − 1⟧} ∪ {b_I⁺ − (2^α − 1) ∣ α ∈ ⟦β_I − 1⟧}).

Finally, we consider absorbing configurations and the new state ⊥. We let S_⊥ = {⊥} ∪ (Q × {0, r}) if r ∈ ℕ and S_⊥ = {⊥} ∪ (Q × {0}) otherwise. We define S_ℐ = S_⊥ ∪ ⋃_{I∈ℐ} S_I.
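The retained counter values for a bounded interval are easily computed; a minimal sketch (the function name is ours) reproducing the configurations of Ex. 4:

```python
def retained_values(lo, hi):
    """Counter values of S_I for a bounded interval [lo, hi] with
    hi - lo + 1 = 2**beta - 1: offsets 2**a - 1 from both bounds."""
    beta = (hi - lo + 2).bit_length() - 1
    assert hi - lo + 1 == 2**beta - 1, "|I| must be of the form 2**beta - 1"
    return sorted({lo + 2**a - 1 for a in range(beta)}
                  | {hi - (2**a - 1) for a in range(beta)})

print(retained_values(1, 7))  # [1, 2, 4, 6, 7], matching Ex. 4
```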

We now define the transition function δ_ℐ^σ. For all s ∈ S_⊥, we let δ_ℐ^σ(s)(s) = 1. For configurations whose counter value lies in one of the intervals I ∈ ℐ, we provide a unified definition based on a notion of successor counter values, generalising the ideas of Ex. 4.

Let I ∈ ℐ. If I is unbounded, we define the successor of b_I = min I to be b_I − 1. We now assume that I = ⟦b_I⁻, b_I⁺⟧ is bounded and let β_I = log₂(|I| + 1) and α ∈ ⟦β_I − 1⟧. The successors of b_I⁻ + 2^α − 1 are b_I⁻ − 1 and b_I⁻ + 2^{α+1} − 1. Symmetrically, for b_I⁺ − (2^α − 1), its successors are b_I⁺ + 1 and b_I⁺ − (2^{α+1} − 1). Both cases entail a counter change by 2^α. The size constraints on the intervals of ℐ ensure that all successor counter values appear in S_ℐ.

Let s = (q, k) ∈ S_ℐ ∖ S_⊥ and s′ = (q′, k′) ∈ S_ℐ ∖ {⊥}. If k′ is not a successor of k, we set δ_ℐ^σ(s)(s′) = 0. Assume now that k′ is a successor of k. We let 𝗌𝗎𝖼𝖼(s, s′) ⊆ 𝖧𝗂𝗌𝗍(ℳ_r(𝒬)) be the set of histories h such that 𝖿𝗂𝗋𝗌𝗍(h) = s, 𝗅𝖺𝗌𝗍(h) = s′ and, for all configurations s″ along h besides s′, the counter value of s″ is not a successor of k; outside of s′ along h, the counter thus remains, in the bounded case, strictly between the two successors of k, and, in the unbounded case, strictly above the successor k − 1 of k. We set δ_ℐ^σ(s)(s′) = ℙ^σ_{ℳ_r(𝒬),s}(𝖢𝗒𝗅(𝗌𝗎𝖼𝖼(s, s′))). To ensure that δ_ℐ^σ(s) is a distribution, we let δ_ℐ^σ(s)(⊥) = 1 − ∑_{s″ ∈ S_ℐ∖{⊥}} δ_ℐ^σ(s)(s″); this transition captures the probability of the counter never hitting a successor of k.

 Remark 5.

Although we have formalised compressed Markov chains for OC-MDPs, the construction can be applied to one-counter Markov chains. In particular, the properties outlined below transfer to the compression of a one-counter Markov chain.  

4.2 Properties of compressed Markov chains

Validity of the compression approach.

Let s ∈ S_ℐ. The probability of terminating or of reaching counter value r from s in a given state is the same in ℳ_r(𝒬) under σ and in 𝒞_ℐ^σ. Intuitively, histories of 𝒞_ℐ^σ group histories of ℳ_r(𝒬) together and preserve their probability under σ. By design of the transitions of 𝒞_ℐ^σ, all histories of ℳ_r(𝒬) from s to configurations of the form (q, 0) or (q, r) are abstracted in 𝒞_ℐ^σ this way. This implies the following theorem, in which, for the sake of clarity, we indicate the relevant MDP or Markov chain for each objective.

Theorem 6.

Let s ∈ S_ℐ ∖ {⊥}. For all q ∈ Q, ℙ^σ_{ℳ_r(𝒬),s}(𝖳𝖾𝗋𝗆(q)) = ℙ_{𝒞_ℐ^σ,s}(𝖱𝖾𝖺𝖼𝗁_{𝒞_ℐ^σ}((q, 0))) and, if r ∈ ℕ, ℙ^σ_{ℳ_r(𝒬),s}(𝖱𝖾𝖺𝖼𝗁_{ℳ_r(𝒬)}((q, r))) = ℙ_{𝒞_ℐ^σ,s}(𝖱𝖾𝖺𝖼𝗁_{𝒞_ℐ^σ}((q, r))).

 Remark 7.

Thm. 6 does not enable the use of 𝒞_ℐ^σ to analyse state-reachability objectives. In this case, we first modify 𝒬 so that targets become absorbing and have a negative-weighted self-loop. Probabilities for state-reachability in 𝒬 are equal to those of reaching a target with counter value 0 or r in the compression of the modified OC-MDP with respect to ℐ and σ.

Representing and computing transition probabilities.

Transition probabilities in 𝒞_ℐ^σ can be irrational for unbounded intervals or require large representations for bounded intervals. For unbounded intervals, these probabilities can be shown to be termination probabilities in a one-counter Markov chain. It follows that they are the least solution of a quadratic system of equations [32].

For bounded intervals, the transition probabilities of 𝒞_ℐ^σ are also the least solution of a quadratic system of equations. We provide intuition on how to derive this system for an interval I = ⟦1, 2^β − 1⟧ via Markov chains.

Let us first consider transitions from Q × {1}. We illustrate the situation in Fig. 3(a): we consider a Markov chain over Q × {0, 1, 2} where states in Q × {0, 2} are absorbing and transitions from other states are inherited from the Markov chain induced by σ on ℳ_r(𝒬). We represent transitions in this Markov chain from a configuration (q, 1) ∈ Q × {1} to configurations with a state p ∈ Q. For any q ∈ Q, the probability of reaching a configuration s ∈ Q × {0, 2} from (q, 1) in this Markov chain is δ_ℐ^σ((q, 1))(s) by definition.

(a) Markov chain transition scheme for the characterisation of transitions from Q × {1}, where δ_I(q)(p, u) = ∑_{a∈A(q), w(q,a)=u} σ((q, 1))(a)·δ(q, a)(p).
(b) Markov chain transition scheme for the characterisation of transitions from Q × {2^α} for 0 < α < β.
Figure 3: Markov chain fragments used to derive a characterisation of the transition probabilities of 𝒞_ℐ^σ.

Next, we let α ∈ ⟦1, β − 1⟧ and consider configurations in Q × {2^α}. The situation is depicted in Fig. 3(b). We divide a counter change by 2^α into counter changes by 2^{α−1} and, thus, rely on the transition probabilities from Q × {2^{α−1}} in 𝒞_ℐ^σ. In this case, we can see transition probabilities from Q × {2^α} in 𝒞_ℐ^σ as reachability probabilities in a Markov chain over Q × {0, 2^{α−1}, 2^α, 3·2^{α−1}, 2^{α+1}}. We obtain the following.
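Numerically, this doubling scheme can be run level by level: probabilities of jumps by 2^α are absorption probabilities in a five-level chain whose steps are the already-computed jumps by 2^{α−1}. The sketch below (floating-point, so only an approximation of the exact algebraic computation behind Thm. 9; the matrix encoding is ours) performs one such step. It assumes I − P is invertible, i.e., that from every transient state some absorbing level is reached with positive probability (missing mass corresponds to the state ⊥ of the compression).

```python
import numpy as np

def next_level(down, up):
    """One doubling step: given matrices down[q, p], up[q, p] of the
    probabilities of jumping down/up by 2**(a-1), return the matrices for
    jumps by 2**a, i.e., absorption probabilities from the middle level of
    a chain over counter levels {0, 2**(a-1), 2**a, 3*2**(a-1), 2**(a+1)}
    (the scheme of Fig. 3(b))."""
    n = down.shape[0]
    # Transient levels (indices 0..3n-1): low-mid, middle, high-mid.
    P = np.zeros((3 * n, 3 * n))    # transient -> transient
    R = np.zeros((3 * n, 2 * n))    # transient -> absorbing (bottom, top)
    P[n:2*n, :n] = down;   P[n:2*n, 2*n:] = up   # middle: to low-mid / high-mid
    P[:n, n:2*n] = up;     R[:n, :n] = down      # low-mid: to middle / bottom
    P[2*n:, n:2*n] = down; R[2*n:, n:2*n] = up   # high-mid: to middle / top
    absorb = np.linalg.solve(np.eye(3 * n) - P, R)
    return absorb[n:2*n, :n], absorb[n:2*n, n:]  # rows of the middle level

# Single-state sanity check: a +-1 walk with down-probability 3/4 gives
# gambler's-ruin values 0.9 (down by 2**a) and 0.1 (up by 2**a).
print(next_level(np.array([[0.75]]), np.array([[0.25]])))
```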

Theorem 8.

Let I ∈ ℐ. Transition probabilities of 𝒞_ℐ^σ from configurations with counter value in I are the least solution of a quadratic system of equations. The equation system for I is of size polynomial in the binary encoding of the bounds of I and the encoding of 𝒬.

By determining the states from which targets are unreachable, systems for reachability probabilities in Markov chains can be made to have a unique solution. For bounded OC-MDPs, we can reason similarly, due to the structure of the systems of Thm. 8, to obtain the following.

Theorem 9.

Assume that r ∈ ℕ. Then we can modify the systems of Thm. 8 in polynomial time such that the least solutions of the original systems are the unique solutions of the modified ones. This process relies only on the supports of the distributions in the image of σ and not on the exact probabilities. The resulting system can be solved in polynomial time in the BSS model.

Finite representations of compressed Markov chains.

A compressed Markov chain is finite if and only if ℐ is finite. In particular, for OEISs, the compression technique allows us to reduce to the analysis of a finite Markov chain. We now assume that σ is a CIS and that ℐ has period ρ. In contrast to the above, 𝒞_ℐ^σ is an infinite Markov chain. Nonetheless, we can show that 𝒞_ℐ^σ is induced by a one-counter Markov chain. Due to the periodic structure of ℐ, transitions from (q, k) ∈ S_ℐ mainly depend on k mod ρ. We construct a one-counter Markov chain by encoding these remainders into states and using the counter to keep track of the quotient of the counter value by ρ. We obtain the following.

Lemma 10.

Assume that ℐ is periodic with period ρ. The compression 𝒞_ℐ^σ is generated by a one-counter Markov chain 𝒫_𝒥^σ (where 𝒥 is the interval partition of ⟦1, ρ⟧ induced by ℐ).

5 Algorithms and complexity

We now overview the complexity results, summarised in Table 1, for the decision problems presented in Sect. 3.

We fix the following inputs for the whole section: an OC-MDP 𝒬 = (Q, A, δ, w), a counter upper bound r ∈ ℕ̄_{>0}, an initial configuration s_𝗂𝗇𝗂𝗍 = (q_𝗂𝗇𝗂𝗍, k_𝗂𝗇𝗂𝗍) ∈ Q × ⟦r⟧, a set of targets T ⊆ Q, an objective Ω ∈ {𝖳𝖾𝗋𝗆(T), 𝖱𝖾𝖺𝖼𝗁(T)} and a threshold θ ∈ [0, 1]. We assume that, if Ω = 𝖱𝖾𝖺𝖼𝗁(T), then the modification outlined in Rmk. 7 has been applied to 𝒬.

Verification.

Let σ be an interval strategy. First, assume that σ is an OEIS. We comment on the CIS case below. Let ℐ be a refinement of the interval partition from the encoding of σ such that 𝒞_ℐ^σ is well-defined and k_𝗂𝗇𝗂𝗍 is an interval bound in ℐ (thus s_𝗂𝗇𝗂𝗍 ∈ S_ℐ). We can construct this partition in polynomial time with respect to the input size.

By Thm. 6 and Rmk. 7, ℙ^σ_{ℳ_r(𝒬),s_𝗂𝗇𝗂𝗍}(Ω) is equal to a reachability probability from s_𝗂𝗇𝗂𝗍 in 𝒞_ℐ^σ. To analyse 𝒞_ℐ^σ, we use different approaches for bounded and unbounded OC-MDPs.

In the bounded case, we can decide the problem in polynomial time in the BSS model. By Thm. 9, we can explicitly compute the transition probabilities of 𝒞_ℐ^σ. These can be used to compute the relevant reachability probabilities (by solving a linear system) and conclude.

In general, we can show that the verification problem is in co-ETR. From the characterisation of the transition probabilities of 𝒞_ℐ^σ in Thm. 8, we can derive a formula Φ_δ(𝐱, 𝐳(σ)), where 𝐱 are variables representing transition probabilities of 𝒞_ℐ^σ, 𝐳(σ) are the strategy probabilities of σ, and the least satisfying assignment of the 𝐱-variables in Φ_δ(𝐱, 𝐳(σ)) contains the transition probabilities of 𝒞_ℐ^σ. Similarly, we can build a formula Φ_Ω(𝐱, 𝐲) from the linear system for reachability probabilities in 𝒞_ℐ^σ, where 𝐲 are the variables for the relevant reachability probabilities with respect to Ω. By construction, any vectors 𝐱 and 𝐲 such that Φ_δ(𝐱, 𝐳(σ)) ∧ Φ_Ω(𝐱, 𝐲) holds are over-estimations of the probabilities of interest. Thus, the verification problem amounts to checking whether ∀𝐱∀𝐲((Φ_δ(𝐱, 𝐳(σ)) ∧ Φ_Ω(𝐱, 𝐲)) → y_{s_𝗂𝗇𝗂𝗍} ≥ θ), where y_{s_𝗂𝗇𝗂𝗍} is the variable for the reachability probability from s_𝗂𝗇𝗂𝗍 in 𝒞_ℐ^σ.

We now informally comment on how to adapt the technique for OEISs to obtain a co-ETR upper bound when σ is a CIS. The main technical difference is that we apply the compression technique twice. We apply the compression first to obtain the infinite Markov chain 𝒞_ℐ^σ. We then consider the one-counter Markov chain 𝒫_𝒥^σ that generates it (Lem. 10) and use a compression to analyse it (Rmk. 5). In the end, we construct a formula similar to the one above, with an additional sub-formula for the transition probabilities of the additional compression.

Fixed-interval realisability.

We consider the fixed-interval realisability problem. We first discuss the variant for pure strategies. It suffices to non-deterministically guess a pure interval strategy based on the input interval partition of ⟦1, r − 1⟧ and verify this strategy. In our complexity analysis, we treat verification as a part of the procedure and not as an oracle. We obtain that the fixed-interval realisability problem for pure strategies is in NP^PosSLP for OEISs in bounded OC-MDPs and in NP^co-ETR = NP^ETR for OEISs and CISs in unbounded OC-MDPs. In particular, we have a PSPACE upper bound.

For randomised strategies, we quantify existentially over strategy variables in the verification formulae. We illustrate the case of OEISs with an interval partition ℐ. We introduce a new sub-formula Φ_σ(𝐳) that holds if and only if the interpretation of the variables 𝐳 is a family of strategy probabilities. Solving the fixed-interval realisability problem can be reduced to checking whether ∃𝐳∀𝐱∀𝐲(Φ_σ(𝐳) ∧ ((Φ_δ(𝐱, 𝐳) ∧ Φ_Ω(𝐱, 𝐲)) → y_{s_𝗂𝗇𝗂𝗍} ≥ θ)). For CISs, we can follow the same reasoning. We obtain a PSPACE algorithm (see Sect. 2).

We refine this approach for bounded OC-MDPs. Assume that the supports of the distributions of the strategy to be synthesised, i.e., which of the variables of 𝐳 have a positive assignment, are fixed. With this assumption, we can modify the conjunction Φ_δ(𝐱, 𝐳) ∧ Φ_Ω(𝐱, 𝐲) of the verification formula by using Thm. 9 so that, for any valuation of 𝐳 compatible with the supports, the only valuation of 𝐱 and 𝐲 satisfying this conjunction is made of the probabilities represented by these variables. We can refine the system underlying Φ_Ω(𝐱, 𝐲) because Thm. 9 indicates the transitions of the compressed chain that have a positive probability. We obtain an NP^ETR complexity as follows: we non-deterministically guess some strategy distribution supports and then check whether ∃𝐳∃𝐱∃𝐲(Φ_σ(𝐳) ∧ Φ_δ(𝐱, 𝐳) ∧ Φ_Ω(𝐱, 𝐲) ∧ y_{s_𝗂𝗇𝗂𝗍} ≥ θ), where Φ_σ(𝐳) only allows strategies with the guessed supports and Φ_δ(𝐱, 𝐳) and Φ_Ω(𝐱, 𝐲) have been adapted by applying the result of Thm. 9.

Parameterised realisability.

We consider the input parameters d for the number of intervals and n for the bound on the size of finite intervals. Intuitively, we non-deterministically construct an interval partition compatible with d and n and then use the algorithms for the fixed-interval case. Due to the assumption that d is given in unary, the guessed partition can be represented in polynomial space. To obtain the same complexity upper bounds for parameterised realisability as for fixed-interval realisability, we consider the algorithms for fixed-interval realisability as a part of the overall procedure and not as oracles in our complexity analyses.
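For bounded r, the non-deterministic guess ranges over a finite set of partitions; the following minimal sketch (a deterministic enumeration standing in for the guess, with names of our choosing) makes the candidate space explicit — each candidate has a representation of size polynomial in d and in the encoding of n and r.

```python
def partitions(r, d, n):
    """Yield interval partitions of [1, r - 1] (bounded case) with at most
    d intervals, each of size at most n."""
    def rec(lo, budget):
        if lo > r - 1:
            yield []
            return
        if budget == 0:
            return
        for size in range(1, min(n, r - lo) + 1):
            for rest in rec(lo + size, budget - 1):
                yield [(lo, lo + size - 1)] + rest
    yield from rec(1, d)

for p in partitions(r=6, d=3, n=2):
    print(p)  # the 3 partitions of [1, 5] into at most 3 intervals of size <= 2
```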

Lower bounds.

All of the problems discussed above are square-root-sum-hard. The square-root-sum problem asks, given integers x₁, …, x_n and y, whether ∑_{i=1}^n √x_i ≤ y. The square-root-sum problem can be solved in polynomial time in the BSS model [42] but is not known to lie in NP. In the unbounded case, the hardness can be obtained by relying on an existing reduction for one-counter Markov chains [26]. For the bounded setting, we build on the same construction. The main hurdle is to establish that we can, in polynomial time, compute a counter bound large enough that the bounded one-counter Markov chain approximates the one used in the unbounded reduction well enough for the reduction to remain valid. We also provide a reduction from the Hamiltonian cycle problem to the counter-oblivious realisability problem for selective termination, implying its NP-hardness [28]. This implies the NP-hardness of our two realisability problems.

References

  • [1] Michal Ajdarów, James C. A. Main, Petr Novotný, and Mickael Randour. Taming infinity one chunk at a time: Concisely represented strategies in one-counter MDPs. CoRR, abs/2503.00788, 2025. doi:10.48550/arXiv.2503.00788.
  • [2] Eric Allender, Peter Bürgisser, Johan Kjeldgaard-Pedersen, and Peter Bro Miltersen. On the complexity of numerical analysis. SIAM Journal on Computing, 38(5):1987–2006, 2009. doi:10.1137/070697926.
  • [3] Pranav Ashok, Mathias Jackermeier, Jan Kretínský, Christoph Weinhuber, Maximilian Weininger, and Mayank Yadav. dtControl 2.0: Explainable strategy representation via decision tree learning steered by experts. In Jan Friso Groote and Kim Guldstrand Larsen, editors, Proceedings (Part II) of the 27th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2021, Held as Part of ETAPS 2021, Luxemburg City, Luxemburg, March 27–April 1, 2021, volume 12652 of Lecture Notes in Computer Science, pages 326–345. Springer, 2021. doi:10.1007/978-3-030-72013-1_17.
  • [4] Christel Baier and Joost-Pieter Katoen. Principles of model checking. MIT Press, 2008.
  • [5] Nikhil Balaji, Stefan Kiefer, Petr Novotný, Guillermo A. Pérez, and Mahsa Shirmohammadi. On the complexity of value iteration. In Christel Baier, Ioannis Chatzigiannakis, Paola Flocchini, and Stefano Leonardi, editors, Proceedings of the 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, Patras, Greece, July 9–12, 2019, volume 132 of LIPIcs, pages 102:1–102:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPICS.ICALP.2019.102.
  • [6] Saugata Basu, Richard Pollack, and Marie-Françoise Roy. Algorithms in Real Algebraic Geometry. Springer, 2nd edition, 2006. doi:10.1007/3-540-33099-2.
  • [7] Noam Berger, Nevin Kapur, Leonard J. Schulman, and Vijay V. Vazirani. Solvency games. In Ramesh Hariharan, Madhavan Mukund, and V. Vinay, editors, IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2008, December 9-11, 2008, Bangalore, India, volume 2 of LIPIcs, pages 61–72. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2008. doi:10.4230/LIPICS.FSTTCS.2008.1741.
  • [8] Karl-Josef Bierth. An expected average reward criterion. Stochastic Processes and their Applications, 26:123–140, 1987.
  • [9] Frantisek Blahoudek, Tomás Brázdil, Petr Novotný, Melkior Ornik, Pranay Thangeda, and Ufuk Topcu. Qualitative controller synthesis for consumption Markov decision processes. In Shuvendu K. Lahiri and Chao Wang, editors, Proceedings (Part II) of the 32nd International Conference on Computer Aided Verification, CAV 2020, Los Angeles, CA, USA, July 21–24, 2020, volume 12225 of Lecture Notes in Computer Science, pages 421–447. Springer, 2020. doi:10.1007/978-3-030-53291-8_22.
  • [10] Roderick Bloem, Krishnendu Chatterjee, and Barbara Jobstmann. Graph games and reactive synthesis. In Edmund M. Clarke, Thomas A. Henzinger, Helmut Veith, and Roderick Bloem, editors, Handbook of Model Checking, pages 921–962. Springer, 2018. doi:10.1007/978-3-319-10575-8_27.
  • [11] Lenore Blum, Mike Shub, and Steve Smale. On a theory of computation and complexity over the real numbers: NP-completeness, recursive functions and universal machines. Bulletin of the American Mathematical Society, 21(1), 1989.
  • [12] Patricia Bouyer, Stéphane Le Roux, Youssouf Oualhadj, Mickael Randour, and Pierre Vandenhove. Games where you can play optimally with arena-independent finite memory. Logical Methods in Computer Science, 18(1), 2022. doi:10.46298/lmcs-18(1:11)2022.
  • [13] Patricia Bouyer, Youssouf Oualhadj, Mickael Randour, and Pierre Vandenhove. Arena-independent finite-memory determinacy in stochastic games. Logical Methods in Computer Science, 19(4), 2023. doi:10.46298/LMCS-19(4:18)2023.
  • [14] Patricia Bouyer, Mickael Randour, and Pierre Vandenhove. The true colors of memory: A tour of chromatic-memory strategies in zero-sum games on graphs (invited talk). In Anuj Dawar and Venkatesan Guruswami, editors, Proceedings of the 42nd IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2022, IIT Madras, Chennai, India, December 18–20, 2022, volume 250 of LIPIcs, pages 3:1–3:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPIcs.FSTTCS.2022.3.
  • [15] Tomás Brázdil, Václav Brozek, Kousha Etessami, and Antonín Kucera. Approximating the termination value of one-counter MDPs and stochastic games. Information and Computation, 222:121–138, 2013. doi:10.1016/J.IC.2012.01.008.
  • [16] Tomás Brázdil, Václav Brozek, Kousha Etessami, Antonín Kucera, and Dominik Wojtczak. One-counter Markov decision processes. In Moses Charikar, editor, Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17–19, 2010, pages 863–874. SIAM, 2010. doi:10.1137/1.9781611973075.70.
  • [17] Tomás Brázdil, Krishnendu Chatterjee, Martin Chmelik, Andreas Fellner, and Jan Kretínský. Counterexample explanation by learning small strategies in Markov decision processes. In Daniel Kroening and Corina S. Pasareanu, editors, Proceedings (Part I) of the 27th International Conference on Computer Aided Verification, CAV 2015, San Francisco, CA, USA, July 18–24, 2015, volume 9206 of Lecture Notes in Computer Science, pages 158–177. Springer, 2015. doi:10.1007/978-3-319-21690-4_10.
  • [18] Tomás Brázdil, Krishnendu Chatterjee, Jan Kretínský, and Viktor Toman. Strategy representation by decision trees in reactive synthesis. In Dirk Beyer and Marieke Huisman, editors, Proceedings (Part I) of the 24th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2018, Held as Part of ETAPS 2018, Thessaloniki, Greece, April 14–20, 2018, volume 10805 of Lecture Notes in Computer Science, pages 385–407. Springer, 2018. doi:10.1007/978-3-319-89960-2_21.
  • [19] Tomás Brázdil, Krishnendu Chatterjee, Antonín Kucera, and Petr Novotný. Efficient controller synthesis for consumption games with multiple resource types. In P. Madhusudan and Sanjit A. Seshia, editors, Proceedings of the 24th International Conference on Computer Aided Verification, CAV 2012, Berkeley, CA, USA, July 7–13, 2012, volume 7358 of Lecture Notes in Computer Science, pages 23–38. Springer, 2012. doi:10.1007/978-3-642-31424-7_8.
  • [20] Tomás Brázdil, Stefan Kiefer, and Antonín Kucera. Efficient analysis of probabilistic programs with an unbounded counter. In Ganesh Gopalakrishnan and Shaz Qadeer, editors, Proceedings of the 23rd International Conference on Computer Aided Verification, CAV 2011, Snowbird, UT, USA, July 14–20, 2011, volume 6806 of Lecture Notes in Computer Science, pages 208–224. Springer, 2011. doi:10.1007/978-3-642-22110-1_18.
  • [21] Thomas Brihaye, Aline Goeminne, James C. A. Main, and Mickael Randour. Reachability games and friends: A journey through the lens of memory and complexity (invited talk). In Patricia Bouyer and Srikanth Srinivasan, editors, Proceedings of the 43rd IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, FSTTCS 2023, IIIT Hyderabad, Telangana, India, December 18–20, 2023, volume 284 of LIPIcs, pages 1:1–1:26. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.FSTTCS.2023.1.
  • [22] Krishnendu Chatterjee, Marcin Jurdzinski, and Thomas A. Henzinger. Quantitative stochastic parity games. In J. Ian Munro, editor, Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004, New Orleans, Louisiana, USA, January 11–14, 2004, pages 121–130. SIAM, 2004. URL: http://dl.acm.org/citation.cfm?id=982792.982808.
  • [23] Krishnendu Chatterjee, Mickael Randour, and Jean-François Raskin. Strategy synthesis for multi-dimensional quantitative objectives. Acta Informatica, 51(3-4):129–163, 2014. doi:10.1007/s00236-013-0182-6.
  • [24] Florent Delgrange, Joost-Pieter Katoen, Tim Quatmann, and Mickael Randour. Simple strategies in multi-objective MDPs. In Armin Biere and David Parker, editors, Proceedings (Part I) of the 26th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2020, Held as Part of ETAPS 2020, Dublin, Ireland, April 25–30, 2020, volume 12078 of Lecture Notes in Computer Science, pages 346–364. Springer, 2020. doi:10.1007/978-3-030-45190-5_19.
  • [25] Kousha Etessami, Marta Z. Kwiatkowska, Moshe Y. Vardi, and Mihalis Yannakakis. Multi-objective model checking of Markov decision processes. Logical Methods in Computer Science, 4(4), 2008. doi:10.2168/LMCS-4(4:8)2008.
  • [26] Kousha Etessami, Dominik Wojtczak, and Mihalis Yannakakis. Quasi-birth-death processes, tree-like QBDs, probabilistic 1-counter automata, and pushdown systems. Performance Evaluation, 67(9):837–857, 2010. doi:10.1016/J.PEVA.2009.12.009.
  • [27] Nathanaël Fijalkow, Nathalie Bertrand, Patricia Bouyer-Decitre, Romain Brenguier, Arnaud Carayol, John Fearnley, Hugo Gimbert, Florian Horn, Rasmus Ibsen-Jensen, Nicolas Markey, Benjamin Monmege, Petr Novotný, Mickael Randour, Ocan Sankur, Sylvain Schmitz, Olivier Serre, and Mateusz Skomra. Games on graphs. CoRR, abs/2305.10546, 2023. doi:10.48550/arXiv.2305.10546.
  • [28] M. R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.
  • [29] Hugo Gimbert. Pure stationary optimal strategies in Markov decision processes. In Wolfgang Thomas and Pascal Weil, editors, Proceedings of the 24th Annual Symposium on Theoretical Aspects of Computer Science, STACS 2007, Aachen, Germany, February 22–24, 2007, volume 4393 of Lecture Notes in Computer Science, pages 200–211. Springer, 2007. doi:10.1007/978-3-540-70918-3_18.
  • [30] Florian Jüngermann, Jan Kretínský, and Maximilian Weininger. Algebraically explainable controllers: decision trees and support vector machines join forces. International Journal on Software Tools for Technology Transfer, 25(3):249–266, 2023. doi:10.1007/S10009-023-00716-Z.
  • [31] Stefan Kiefer, Richard Mayr, Mahsa Shirmohammadi, Patrick Totzke, and Dominik Wojtczak. How to play in infinite MDPs (invited talk). In Artur Czumaj, Anuj Dawar, and Emanuela Merelli, editors, Proceedings of the 47th International Colloquium on Automata, Languages, and Programming, ICALP 2020, Saarbrücken, Germany, July 8–11, 2020, volume 168 of LIPIcs, pages 3:1–3:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPICS.ICALP.2020.3.
  • [32] Antonín Kucera, Javier Esparza, and Richard Mayr. Model checking probabilistic pushdown automata. Logical Methods in Computer Science, 2(1), 2006. doi:10.2168/LMCS-2(1:2)2006.
  • [33] James C. A. Main and Mickael Randour. Different strokes in randomised strategies: Revisiting Kuhn’s theorem under finite-memory assumptions. Information and Computation, 301:105229, 2024. doi:10.1016/J.IC.2024.105229.
  • [34] James C. A. Main and Mickael Randour. Mixing any cocktail with limited ingredients: On the structure of payoff sets in multi-objective MDPs and its impact on randomised strategies. CoRR, abs/2502.18296, 2025. doi:10.48550/arXiv.2502.18296.
  • [35] Donald Ornstein. On the existence of stationary optimal strategies. Proceedings of the American Mathematical Society, 20(2):563–569, 1969. URL: http://www.jstor.org/stable/2035700.
  • [36] Joël Ouaknine and James Worrell. Positivity problems for low-order linear recurrence sequences. In Chandra Chekuri, editor, Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5–7, 2014, pages 366–379. SIAM, 2014. doi:10.1137/1.9781611973402.27.
  • [37] Jakob Piribauer and Christel Baier. Positivity-hardness results on Markov decision processes. TheoretiCS, 3, 2024. doi:10.46298/THEORETICS.24.9.
  • [38] Mickael Randour. Automated synthesis of reliable and efficient systems through game theory: A case study. In Proceedings of the European Conference on Complex Systems, ECCS 2012, Springer Proceedings in Complexity XVII, pages 731–738. Springer, 2013. doi:10.1007/978-3-319-00395-5_90.
  • [39] Mickael Randour, Jean-François Raskin, and Ocan Sankur. Percentile queries in multi-dimensional Markov decision processes. Formal Methods in System Design, 50(2-3):207–248, 2017. doi:10.1007/s10703-016-0262-7.
  • [40] Lloyd S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953. doi:10.1073/pnas.39.10.1095.
  • [41] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
  • [42] Prasoon Tiwari. A problem that is easier to solve on the unit-cost algebraic RAM. Journal of Complexity, 8(4):393–397, 1992. doi:10.1016/0885-064X(92)90003-T.