
Expectation in Stochastic Games with Prefix-Independent Objectives

Laurent Doyen (Université Paris-Saclay, CNRS, ENS Paris-Saclay, LMF, Gif-sur-Yvette, France), Pranshu Gaba (Tata Institute of Fundamental Research, Mumbai, India), and Shibashis Guha (Tata Institute of Fundamental Research, Mumbai, India)
Abstract

Stochastic two-player games model systems with an environment that is both adversarial and stochastic. In this paper, we study the expected value of bounded quantitative prefix-independent objectives in the context of stochastic games. We show a generic reduction from the expectation problem to linearly many instances of the almost-sure satisfaction problem for threshold Boolean objectives. The result follows from partitioning the vertices of the game into so-called value classes where each class consists of vertices of the same value. Our procedure further entails that the memory required by both players to play optimally for the expectation problem is no more than the memory required by the players to play optimally for the almost-sure satisfaction problem for a corresponding threshold Boolean objective.

We show the applicability of the framework to compute the expected window mean-payoff measure in stochastic games. The window mean-payoff measure strengthens the classical mean-payoff measure by computing the mean payoff over windows of bounded length that slide along an infinite path. We show that the decision problem to check if the expected window mean-payoff value is at least a given threshold is in 𝖴𝖯 ∩ 𝖼𝗈𝖴𝖯 when the window length is given in unary.

Keywords and phrases:
Stochastic games, finitary objectives, mean payoff, reactive synthesis
Copyright and License:
© Laurent Doyen, Pranshu Gaba, and Shibashis Guha; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Mathematics of computing → Stochastic processes; Theory of computation → Probabilistic computation
Related Version:
Full Version: https://arxiv.org/abs/2405.18048 [21]
Editors:
Patricia Bouyer and Jaco van de Pol

1 Introduction

Reactive systems typically have an infinite execution where the controller continually reacts to the environment. Given a specification, the reactive controller synthesis problem [19] is concerned with synthesising a policy for the controller such that the specification is satisfied by the system for all behaviours of the environment. This problem is modelled using two-player turn-based games on graphs, where the two players are the controller (Player 1) and the environment (Player 2), the vertices and the edges of the game graph represent the states and transitions of the system, and the objective of Player 1 is to satisfy the specification. An execution of the system is then an infinite path in the game graph. The reactive controller synthesis problem corresponds to determining if there exists a strategy of Player 1 such that for all strategies of Player 2, the outcome satisfies the objective. If such a winning strategy exists, then we would also like to synthesise it. The environment is considered as an adversarial player to ensure that the specification is met even in the worst-case scenario.

Objectives are either Boolean or quantitative. Each execution either satisfies a Boolean objective ψ or does not satisfy ψ. The set of executions that satisfy ψ forms a language of infinite words over the alphabet given by the set of vertices of the graph. On the other hand, a quantitative objective φ evaluates the performance of the execution by a numerical metric, which Player 1 aims to maximise and Player 2 aims to minimise. A quantitative objective can be viewed as a real-valued function over infinite paths in the graph.

In the presence of uncertainty or probabilistic behaviour, the game graph becomes stochastic. Fixing the strategies of the two players gives a distribution over infinite paths in the game graph. For Boolean objectives ψ, the goal of Player 1 is to maximise the probability that an outcome satisfies ψ. For quantitative objectives φ, there are two possible views:

Satisfaction.

Given a threshold λ, to maximise the probability that the φ-value of the outcome is greater than λ;

Expectation.

To maximise the φ-value of the outcome in expectation.

Either view may be desirable depending on the context [8, 7, 6, 5]. The satisfaction view can be seen as a Boolean objective: the φ-value of the outcome is either greater than λ or it is not. The expectation view is more nuanced, and is the subject of study in this paper.

In this paper, we look at the expectation problem for quantitative prefix-independent objectives (also referred to as tail objectives). These are objectives that do not depend on finite prefixes of the plays, but only on the long-run behaviour of the system. In systems, we are often willing to allow undesirable behaviour in the short term, provided the long-run behaviour is desirable. Prefix-independent objectives model such requirements and are thus of interest to study [9]. Prefix-independent objectives also have the benefit that they satisfy the Bellman equations [34], which simplifies their analysis. The expectation problem for such objectives arises naturally in many scenarios. For example:

  (i)

    An algorithmic trading system is designed to generate profit by executing trades based on real-time market data. Following an initial phase of learning and unstable behaviour due to parameter tuning, the average profit over a bounded time window must always exceed a threshold, and decisions need to be made within short, well-defined intervals for them to be effective.

  (ii)

    A power plant may have different strategies to produce power (such as coal, solar, nuclear, wind) and must allocate resources among these strategies so as to maximise the power produced in expectation.

Contributions.

All of our contributions are with regard to quantitative prefix-independent objectives φ that are bounded (i.e., the image of φ is bounded between integers −W_φ and W_φ) and such that a bound den_φ on the denominators of the optimal expected φ-values of vertices in the game is known. The bound on the image ensures determinacy [35], that is, the players have optimal strategies, and the bound on the denominators of optimal values of vertices discretises the search space. These bounds often exist and are easily derivable for common objectives of interest such as mean payoff.

Our primary contribution is a reduction of the expectation problem for such an objective φ to linearly many instances of the almost-sure satisfaction problem for threshold Boolean objectives {φ>λ} for thresholds λ. Deciding the almost-sure satisfaction of {φ>λ} is conceptually simpler than computing the expected value of φ, as in the former, we only need to consider if the measure of the paths that satisfy the objective {φ>λ} is equal to one, whereas in the latter, one must take the averages of the measures of the sets of paths π weighted with the value φ(π) of the paths. Our technique is generic in the sense that when an algorithm for the almost-sure satisfaction problem for {φ>λ} is known, we directly obtain the complexity and a way to solve the expectation problem for φ.

Our reduction builds on the technique introduced in [16] for Boolean prefix-independent objectives and non-trivially extends it to quantitative prefix-independent objectives φ for which the bounds Wφ and denφ are known. The expected φ-values of vertices are nondeterministically guessed, and we present a characterisation (Theorem 7, similar to [16, Lemma 8] but with important and subtle differences) to verify the guess. We also explicitly construct strategies for both players that are optimal for the expectation of φ, in terms of almost-sure winning strategies for {φ>λ} (proof of Lemma 9). The memory requirement for the constructed optimal strategies is the same as that of the almost-sure winning strategies (Corollary 10).

Our framework gives an alternative approach to solve the expectation problem for well-studied objectives such as expected mean payoff and gives new results for not-as-well-studied objectives such as the window mean-payoff objectives introduced in [12]. As our secondary contribution, we illustrate our technique by applying it to two variants of window mean-payoff objectives: fixed (FWMP(ℓ)) and bounded (BWMP) window mean-payoff. Using our reduction, we are able to show that for both of these objectives, the expectation problem is in 𝖴𝖯 ∩ 𝖼𝗈𝖴𝖯 (Theorem 18 and Theorem 22), a result that was not known before. The 𝖴𝖯 ∩ 𝖼𝗈𝖴𝖯 upper bound for window mean-payoff objectives matches the special case of simple stochastic games [20, 13], and thus would require a major breakthrough to be improved. The lower bounds on the memory requirements for these objectives carry over from the special case of non-stochastic games [12, 22]. We summarise the complexity results and bounds on the memory requirements for the window mean-payoff objectives in Table 1.

Table 1: Complexity and bounds on memory requirement for window mean-payoff objectives.

Objective | Complexity | Memory (Player 1) (lower [22], upper) | Memory (Player 2) (lower [22], upper)
FWMP(ℓ) | 𝖴𝖯 ∩ 𝖼𝗈𝖴𝖯 | ℓ − 1, ℓ | ℓ·|V|, ℓ·|V|
BWMP | 𝖴𝖯 ∩ 𝖼𝗈𝖴𝖯 | memoryless, memoryless | infinite, infinite

Related work.

Stochastic games were introduced by Shapley [38], where these games were studied under expectation semantics for discounted-sum objectives. In [14], it was shown that solving stochastic parity games reduces to solving stochastic mean-payoff games. Further, solving stochastic parity games, stochastic mean-payoff games, and simple stochastic games (i.e., stochastic games with reachability objective) are all polynomial-time equivalent [30, 1], and thus, are all in 𝖴𝖯 ∩ 𝖼𝗈𝖴𝖯 [13]. A sub-exponential (or even quasi-polynomial) time deterministic algorithm for simple stochastic games on graphs with poly-logarithmic treewidth was proposed in [18]. In [29], sufficient conditions on the objective were shown such that optimal deterministic memoryless strategies exist for the players. In [34], value iteration to solve the expectation problem in stochastic games with reachability, safety, total-payoff, and mean-payoff objectives was studied.

Mean-payoff objectives were studied initially in two-player games, without stochasticity [24, 39], and with stochasticity in [28]. Finitary versions were introduced as window mean-payoff objectives [12]. For finitary mean-payoff objectives, the satisfaction problem [6] and the expectation problem [5] were studied in the special case of Markov decision processes (MDPs), which correspond to stochastic games with a trivial adversary. Expected mean payoff, expected discounted payoff, expected total payoff, etc., are widely studied for MDPs [36, 4]. Both the expectation problem [5] and the satisfaction problem [6] for the FWMP(ℓ) objective are in 𝖯𝖳𝖨𝖬𝖤, while they are in 𝖴𝖯 ∩ 𝖼𝗈𝖴𝖯 for the BWMP objective. Ensuring the satisfaction and expectation semantics simultaneously was studied in MDPs for the mean-payoff objective in [17] and for the window mean-payoff objectives in [26]. In both cases, the complexity was shown to be no greater than that of only expectation optimisation.

The satisfaction problem for window mean-payoff objectives has been studied for two-player stochastic games in [22]. While positive and almost-sure satisfaction of FWMP(ℓ) are in 𝖯𝖳𝖨𝖬𝖤, it follows from [22] that the problem is in 𝖴𝖯 ∩ 𝖼𝗈𝖴𝖯 for quantitative satisfaction, i.e., with threshold probabilities 0 < p < 1. Furthermore, the satisfaction problem for BWMP is in 𝖴𝖯 ∩ 𝖼𝗈𝖴𝖯 and thus has the same complexity as in the special case of MDPs [6].

Due to lack of space, some of the proofs and details have been omitted. A full version of the paper can be found in [21].

2 Preliminaries

Probability distributions.

A probability distribution over a finite non-empty set A is a function 𝖯𝗋 : A → [0,1] such that ∑_{a∈A} 𝖯𝗋(a) = 1. We denote by 𝒟(A) the set of all probability distributions over A. For the algorithmic and complexity results, we assume that probabilities are given as rational numbers.

Stochastic games.

We consider two-player turn-based zero-sum stochastic games (or simply, stochastic games in the sequel). The two players are referred to as Player 1 (she/her) and Player 2 (he/him). A stochastic game is given by 𝒢 = ((V, E), (V₁, V₂, V_◇), ℙ, w), where:

  • (V, E) is a directed graph with a finite set V of vertices and a set E ⊆ V × V of directed edges such that for all vertices v ∈ V, the set E(v) = {v′ ∈ V ∣ (v, v′) ∈ E} of out-neighbours of v is non-empty, i.e., E(v) ≠ ∅ (no deadlocks).

  • (V₁, V₂, V_◇) is a partition of V. The vertices in V₁ belong to Player 1, the vertices in V₂ belong to Player 2, and the vertices in V_◇ are called probabilistic vertices;

  • ℙ : V_◇ → 𝒟(V) is a probability function that describes the behaviour of probabilistic vertices in the game. It maps each probabilistic vertex v ∈ V_◇ to a probability distribution ℙ(v) over the set E(v) of out-neighbours of v such that ℙ(v)(v′) > 0 for all v′ ∈ E(v) (i.e., all out-neighbours have non-zero probability);

  • w : E → ℚ is a payoff function assigning a rational payoff to every edge in the game.

Stochastic games are played in rounds. The game starts by initially placing a token on some vertex. At the beginning of a round, if the token is on a vertex v with v ∈ V_i for i ∈ {1, 2}, then Player i chooses an out-neighbour v′ ∈ E(v); otherwise v ∈ V_◇, and an out-neighbour v′ ∈ E(v) is chosen with probability ℙ(v)(v′). Player 1 receives from Player 2 the amount w(v, v′) given by the payoff function, and the token moves to v′ for the next round. This continues ad infinitum, resulting in an infinite sequence π = v₀v₁v₂⋯ ∈ V^ω such that (v_i, v_{i+1}) ∈ E for all i ≥ 0, called a play. For i < j, we denote by π(i, j) the infix v_i v_{i+1} ⋯ v_j of π. Its length is |π(i, j)| = j − i, the number of edges. We denote by π(0, j) the finite prefix v₀v₁⋯v_j of π, and by π(i, ∞) the infinite suffix v_i v_{i+1} ⋯ of π. We denote by 𝖯𝗅𝖺𝗒𝗌_𝒢 and 𝖯𝗋𝖾𝖿𝗌_𝒢 the set of all plays and the set of all finite prefixes in 𝒢 respectively. We denote by 𝖫𝖺𝗌𝗍(ρ) the last vertex of a prefix ρ ∈ 𝖯𝗋𝖾𝖿𝗌_𝒢. We denote by 𝖯𝗋𝖾𝖿𝗌^i_𝒢 (i ∈ {1, 2}) the set of all prefixes ρ such that 𝖫𝖺𝗌𝗍(ρ) ∈ V_i.

A stochastic game with V_◇ = ∅ is a non-stochastic two-player game, a stochastic game with V₂ = ∅ is a Markov decision process (MDP), and a stochastic game with V₁ = V₂ = ∅ is a Markov chain. Figure 1 shows an example of a stochastic game; Player 1 vertices are shown as circles, Player 2 vertices as boxes, and probabilistic vertices as diamonds.
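To keep the definitions concrete, the following is a minimal Python sketch of one possible representation of such a game; the class and field names are ours for illustration only and do not come from the paper or any existing tool.

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass
class StochasticGame:
    v1: set        # Player 1 vertices
    v2: set        # Player 2 vertices
    vp: set        # probabilistic vertices
    edges: dict    # edges[v] = set of out-neighbours E(v); must be non-empty (no deadlocks)
    prob: dict     # prob[v][v'] = probability of moving from probabilistic vertex v to v'
    payoff: dict   # payoff[(v, v')] = rational payoff w(v, v')

    def check(self):
        """Sanity-check the conditions from the definition above."""
        for v in self.v1 | self.v2 | self.vp:
            assert self.edges[v], f"deadlock at {v}"
        for v in self.vp:
            dist = self.prob[v]
            assert set(dist) == self.edges[v], "distribution must cover exactly E(v)"
            assert all(p > 0 for p in dist.values()) and sum(dist.values()) == 1

# A tiny example: Player 1 vertex 'a' and probabilistic vertex 'b'.
g = StochasticGame(
    v1={"a"}, v2=set(), vp={"b"},
    edges={"a": {"a", "b"}, "b": {"a", "b"}},
    prob={"b": {"a": Fraction(1, 2), "b": Fraction(1, 2)}},
    payoff={("a", "a"): Fraction(0), ("a", "b"): Fraction(1),
            ("b", "a"): Fraction(-1), ("b", "b"): Fraction(2)},
)
g.check()
```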

Figure 1: A stochastic game. Player 1 vertices are denoted by circles, Player 2 vertices are denoted by boxes, and probabilistic vertices are denoted by diamonds. The payoff for each edge is shown in red, and the probability distributions out of probabilistic vertices are shown in blue.

Subgames and traps.

Given a stochastic game 𝒢 = ((V, E), (V₁, V₂, V_◇), ℙ, w), a subset V′ ⊆ V of vertices induces a subgame if (i) every vertex v ∈ V′ has an outgoing edge in V′, that is, E(v) ∩ V′ ≠ ∅, and (ii) every probabilistic vertex v ∈ V_◇ ∩ V′ has all outgoing edges in V′, that is, E(v) ⊆ V′. The induced subgame is ((V′, E′), (V₁ ∩ V′, V₂ ∩ V′, V_◇ ∩ V′), ℙ′, w′), where E′ = E ∩ (V′ × V′), and ℙ′ and w′ are the restrictions of ℙ and w respectively to (V′, E′). If T ⊆ V is such that for all v ∈ T, if v ∈ V₁ ∪ V_◇ then E(v) ⊆ T, and if v ∈ V₂ then E(v) ∩ T ≠ ∅, then T induces a subgame, and the subgame is a trap for Player 1 in 𝒢, since Player 2 can ensure that if the token reaches T, then it never escapes.

Boolean objectives.

A Boolean objective ψ is a Borel-measurable subset of 𝖯𝗅𝖺𝗒𝗌_𝒢 [35]. A play π ∈ 𝖯𝗅𝖺𝗒𝗌_𝒢 satisfies an objective ψ if π ∈ ψ. In a stochastic game 𝒢 with objective ψ, the objective of Player 1 is ψ, and since 𝒢 is a zero-sum game, the objective of Player 2 is the complement set ψ̄ = 𝖯𝗅𝖺𝗒𝗌_𝒢 ∖ ψ. An example of a Boolean objective is reachability, denoted Reach(T), the set of all plays that visit a vertex in the target set T ⊆ V. This objective is formally defined, and more examples of Boolean objectives are given, in [21].

Quantitative objectives.

A quantitative objective is a Borel-measurable function of the form φ : 𝖯𝗅𝖺𝗒𝗌_𝒢 → ℝ. In a stochastic game 𝒢 with objective φ, the objective of Player 1 is φ and the objective of Player 2 is −φ, the negative of φ. Let π = v₀v₁v₂⋯ be a play. Some common examples of quantitative objectives include the mean-payoff objective MP(π) = lim inf_{n→∞} (1/n) ∑_{i=0}^{n−1} w(v_i, v_{i+1}), and the liminf objective liminf(π) = lim inf_{n→∞} w(v_n, v_{n+1}). In this work, we also consider the window mean-payoff objective, which is defined in Section 4. Corresponding to a quantitative objective φ, we define threshold objectives, which are Boolean objectives ψ of the form {π ∈ 𝖯𝗅𝖺𝗒𝗌_𝒢 ∣ φ(π) ⋈ λ} for thresholds λ ∈ ℝ and ⋈ ∈ {<, ≤, >, ≥}. We denote this threshold objective succinctly as {φ ⋈ λ}.

Prefix independence.

An objective is said to be prefix-independent if it only depends on the suffix of a play. Formally, a Boolean objective ψ is prefix-independent if for all plays π and π′ with a common suffix (that is, π′ can be obtained from π by removing and adding a finite prefix), we have that π ∈ ψ if and only if π′ ∈ ψ. Similarly, a quantitative objective φ is prefix-independent if for all plays π and π′ with a common suffix, we have that φ(π) = φ(π′). Mean payoff and liminf are examples of prefix-independent objectives, whereas reachability and discounted payoff [2] are not.

Strategies.

A (deterministic or pure) strategy¹ for Player i ∈ {1, 2} in a game 𝒢 is a function σ_i : 𝖯𝗋𝖾𝖿𝗌^i_𝒢 → V that maps prefixes ending in a vertex v ∈ V_i to a successor of v. (¹ We only consider the satisfaction and expectation of Borel-measurable objectives, and deterministic strategies suffice for such objectives [10]. Satisfying two goals simultaneously, e.g., 𝖯𝗋(Reach(T₁)) > 0.5 ∧ 𝖯𝗋(Reach(T₂)) > 0.5, requires randomisation and is not allowed by our definition.) Strategies can be realised as the output of a (possibly infinite-state) Mealy machine [33]. The memory size of a strategy σ_i is the smallest number of states a Mealy machine defining σ_i can have. A strategy σ_i is memoryless if σ_i(ρ) only depends on the last element of the prefix ρ, that is, for all prefixes ρ, ρ′ ∈ 𝖯𝗋𝖾𝖿𝗌^i_𝒢, if 𝖫𝖺𝗌𝗍(ρ) = 𝖫𝖺𝗌𝗍(ρ′), then σ_i(ρ) = σ_i(ρ′).

A strategy profile σ = (σ₁, σ₂) is a pair of strategies σ₁ and σ₂ of Player 1 and Player 2 respectively. A play π = v₀v₁⋯ is consistent with a strategy σ_i of Player i (i ∈ {1, 2}) if for all j ≥ 0 with v_j ∈ V_i, we have v_{j+1} = σ_i(π(0, j)). A play π is an outcome of a profile σ = (σ₁, σ₂) if it is consistent with both σ₁ and σ₂. For a Boolean objective ψ, we denote by 𝖯𝗋^{σ₁,σ₂}_{𝒢,v}(ψ) the probability that an outcome of the profile (σ₁, σ₂) in 𝒢 with initial vertex v satisfies ψ.

Satisfaction probability of Boolean objectives.

Let ψ be a Boolean objective. A strategy σ₁ of Player 1 is winning with probability p from a vertex v in 𝒢 for objective ψ if 𝖯𝗋^{σ₁,σ₂}_{𝒢,v}(ψ) ≥ p for all strategies σ₂ of Player 2. A strategy σ₁ of Player 1 is positive winning (resp., almost-sure winning) from v for Player 1 in 𝒢 with objective ψ if 𝖯𝗋^{σ₁,σ₂}_{𝒢,v}(ψ) > 0 (resp., 𝖯𝗋^{σ₁,σ₂}_{𝒢,v}(ψ) = 1) for all strategies σ₂ of Player 2. In the above, if such a strategy σ₁ exists, then the vertex v is said to be positive winning (resp., almost-sure winning) for Player 1. If a vertex v is positive winning (resp., almost-sure winning) for Player 1, then Player 1 is said to play optimally from v if she follows a positive (resp., almost-sure) winning strategy from v. We omit analogous definitions for Player 2.

Expected value of quantitative objectives.

Let φ be a quantitative objective. Given a strategy profile σ = (σ₁, σ₂) and an initial vertex v, let 𝔼^σ_v(φ) denote the expected φ-value of the outcome of the strategy profile σ from v, that is, the expectation of φ over all plays with initial vertex v under the probability measure 𝖯𝗋^{σ₁,σ₂}_{𝒢,v}. We only consider objectives φ that are Borel-measurable and whose image is bounded. Thus, by determinacy of Blackwell games [35], we have that stochastic games with objective φ are determined. That is, we have sup_{σ₁} inf_{σ₂} 𝔼^σ_v(φ) = inf_{σ₂} sup_{σ₁} 𝔼^σ_v(φ). We call this quantity the expected φ-value of the vertex v and denote it by 𝔼_v(φ). We say that Player 1 plays optimally from a vertex v if she follows a strategy σ₁ such that for all strategies σ₂ of Player 2, the expected φ-value of the outcome is at least 𝔼_v(φ). Similarly, Player 2 plays optimally if he follows a strategy σ₂ such that for all strategies σ₁ of Player 1, the expected φ-value of the outcome is at most 𝔼_v(φ). If φ is a prefix-independent objective, then we have the following relation between the expected φ-value of a vertex v and the expected φ-values of its out-neighbours.

Proposition 1 (Bellman equations).

If φ is a prefix-independent objective, then the following equations hold for all v ∈ V.

$$\mathbb{E}_v(\varphi) = \begin{cases} \max_{v' \in E(v)} \mathbb{E}_{v'}(\varphi) & \text{if } v \in V_1 \\ \min_{v' \in E(v)} \mathbb{E}_{v'}(\varphi) & \text{if } v \in V_2 \\ \sum_{v' \in E(v)} \mathbb{P}(v)(v') \cdot \mathbb{E}_{v'}(\varphi) & \text{if } v \in V_\Diamond \end{cases}$$

In this paper, we consider the expectation problem for prefix-independent objectives. Our solution in turn uses the almost-sure satisfaction problem. The decision problems are defined as follows.

Decision problems.

Given a stochastic game 𝒢, a quantitative objective φ, a vertex v, and a threshold λ, the following decision problems are relevant:

  • almost-sure satisfaction problem: Is vertex v almost-sure winning for Player 1 for a threshold objective {φ>λ}?

  • expectation problem: Is 𝔼_v(φ) ≥ λ? That is, is the expected φ-value of v at least λ?

The reader is pointed to [2] and [25] for a more comprehensive discussion on the above-mentioned concepts.

3 Reducing expectation to almost-sure satisfaction

In this section, we show a reduction (Theorem 7) of the expectation problem for bounded quantitative prefix-independent objectives φ to the almost-sure satisfaction problem for the corresponding threshold objectives {φ>λ}. The reduction involves guessing a value 𝗋v for every vertex v in the game, and then verifying if the guessed values are equal to the expected φ-values of the vertices. Theorem 7 generalises [16, Lemma 8] which studies the satisfaction problem for prefix-independent Boolean objectives, as Boolean objectives can be viewed as a special case of quantitative objectives by restricting the range to {0,1}. We further discuss the difference in approaches between Theorem 7 and [16, Lemma 8] in Section 5.

Given a game 𝒢 and a bounded prefix-independent quantitative objective φ, our reduction requires the existence of an integer bound den_φ on the denominators of the expected φ-values of vertices in 𝒢. Since φ is bounded, there exists an integer W_φ such that |φ(π)| ≤ W_φ for every play π in 𝒢. Thus, for every vertex v in 𝒢, one can write 𝔼_v(φ) as p/q, where p and q are integers such that |p| ≤ W_φ·den_φ and 0 < q ≤ den_φ. The bounds W_φ and den_φ may depend on the objective and the structure of the graph, i.e., the number of vertices, edge payoffs, probability distributions in the game, etc. These bounds effectively discretise the set of possible expected φ-values of the vertices, as there are at most (2·W_φ·den_φ + 1)·den_φ distinct possible values. This directly gives a bound on the granularity of the possible expected φ-values of vertices, that is, the minimum difference between two possible values of vertices, and we represent this quantity by ε_φ. Observe that given two distinct rational numbers with denominators at most den_φ, the difference between them is at least (1/den_φ)², and thus, we let ε_φ be (1/den_φ)².
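As a quick illustration of this discretisation argument, the following snippet (with small bounds chosen purely for illustration, not taken from the paper) enumerates the candidate values p/q and checks that distinct candidates are at least ε_φ = (1/den_φ)² apart.

```python
from fractions import Fraction

# Illustrative bounds only: W bounds |phi|, den bounds the denominators of vertex values.
W, den = 2, 3
candidates = sorted({Fraction(p, q)
                     for q in range(1, den + 1)
                     for p in range(-W * den, W * den + 1)})
eps = Fraction(1, den ** 2)
# Any two distinct candidates differ by at least 1/den^2.
assert all(b - a >= eps for a, b in zip(candidates, candidates[1:]))
print(len(candidates), "candidate values, minimum gap is at least", eps)
```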

3.1 Value vectors and value classes

We first define and give notations for value vectors, which are useful in describing the reduction, and then look at some of their interesting and useful properties.

Definitions and notations.

A vector 𝗋 = (𝗋_v)_{v∈V} of reals indexed by vertices in V induces a partition of V such that all vertices with the same value in 𝗋 belong to the same part of the partition. Let k_𝗋 denote the number of parts in the partition, and let us denote the parts by {𝖱(1), 𝖱(2), …, 𝖱(k_𝗋)}. We call each part 𝖱(i) of the partition an 𝗋-class, or simply, a class if 𝗋 is clear from the context. For all 1 ≤ i ≤ k_𝗋, let 𝔯_i denote the 𝗋-value of the class 𝖱(i). Given two vectors 𝗋, 𝗌, we write 𝗋 ≥ 𝗌 if for all v ∈ V, we have 𝗋_v ≥ 𝗌_v, and we write 𝗋 > 𝗌 if we have that 𝗋 ≥ 𝗌 and there exists v ∈ V such that 𝗋_v > 𝗌_v. For a constant c, we denote by 𝗋 + c the vector obtained by adding c to each component of 𝗋.

For all 1 ≤ i ≤ k_𝗋, a vertex v ∈ 𝖱(i) is a boundary vertex if v is a probabilistic vertex and has an out-neighbour not in 𝖱(i), i.e., if v ∈ V_◇ and E(v) ⊈ 𝖱(i). Let Bnd(𝖱(i)) denote the set of boundary vertices in the class 𝖱(i). For all 1 ≤ i ≤ k_𝗋, let 𝒢↾𝖱(i) denote the restriction of 𝒢 to vertices in 𝖱(i) with all vertices in Bnd(𝖱(i)) changed to absorbing vertices with a self-loop. The edge payoffs of these self-loops are not important (we assume them to be 0) as we restrict our attention to a subgame of 𝒢↾𝖱(i) that does not contain boundary vertices.
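The partition into 𝗋-classes and the boundary vertices of each class are straightforward to compute; the following is a small illustrative sketch (the function names are ours), where the helpers take the vertex sets and edge relation directly, e.g., the vp and edges fields of the representation sketched in Section 2.

```python
from collections import defaultdict

def value_classes(r):
    """Partition the vertices by their value in the vector r (a dict vertex -> value);
    classes are indexed 1..k in increasing order of value, as in the text."""
    by_value = defaultdict(set)
    for v, x in r.items():
        by_value[x].add(v)
    return {i: by_value[x] for i, x in enumerate(sorted(by_value), start=1)}

def boundary_vertices(r_class, vp, edges):
    """Probabilistic vertices of a class that have an out-neighbour outside the class."""
    return {v for v in r_class & vp if not edges[v] <= r_class}
```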

Example 2.
Figure 2: Restrictions 𝒢↾𝖱(1), 𝒢↾𝖱(2), 𝒢↾𝖱(3), 𝒢↾𝖱(4), and 𝒢↾𝖱(5) of the game shown in Figure 1 for the vector 𝗋 = (−2, −1, −1, −1, −1, 0, 0, 0, 0, 1, 1, 2, 2, 2).

For the game in Figure 1, let 𝗋 = (−2, −1, −1, −1, −1, 0, 0, 0, 0, 1, 1, 2, 2, 2) be a value vector for vertices v₁, v₂, …, v₁₄ respectively. Since 𝗋 has five distinct values, we have k_𝗋 = 5, and the five 𝗋-classes are 𝖱(1) = {v₁}, 𝖱(2) = {v₂, v₃, v₄, v₅}, 𝖱(3) = {v₆, v₇, v₈, v₉}, 𝖱(4) = {v₁₀, v₁₁}, and 𝖱(5) = {v₁₂, v₁₃, v₁₄} with values 𝔯₁ = −2, 𝔯₂ = −1, 𝔯₃ = 0, 𝔯₄ = 1, and 𝔯₅ = 2 respectively. Out of the five probabilistic vertices v₂, v₅, v₈, v₉, and v₁₂, we see that v₂, v₅, and v₈ are boundary vertices while v₉ and v₁₂ are not. Thus, Bnd(𝖱(2)) = {v₂, v₅}, Bnd(𝖱(3)) = {v₈}, and Bnd(𝖱(1)) = Bnd(𝖱(4)) = Bnd(𝖱(5)) = ∅. We show the restrictions 𝒢↾𝖱(i) in Figure 2.

Let φ be a bounded prefix-independent quantitative objective. Analogous to the notation for a general value vector 𝗋, we describe notation for the expected φ-value vector consisting of the expected φ-values of vertices in V. For all vertices v ∈ V, let 𝗌_v denote 𝔼_v(φ), the expected φ-value of vertex v, and let 𝗌 = (𝗌_v)_{v∈V} denote the expected φ-value vector. Let 𝖲(i) denote the i-th 𝗌-class and let 𝔰_i denote the 𝗌-value of 𝖲(i).

Given a vector 𝗋, it follows from Proposition 1 that the following is a necessary (but not sufficient) condition for 𝗋 to be the expected φ-value vector 𝗌.

  • Bellman condition: for every vertex v ∈ V, the following Bellman equations hold:

    $$\mathsf{r}_v = \begin{cases} \max_{v' \in E(v)} \mathsf{r}_{v'} & \text{if } v \in V_1, \\ \min_{v' \in E(v)} \mathsf{r}_{v'} & \text{if } v \in V_2, \\ \sum_{v' \in E(v)} \mathbb{P}(v)(v') \cdot \mathsf{r}_{v'} & \text{if } v \in V_\Diamond. \end{cases}$$
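The Bellman condition is straightforward to check for a guessed vector; the following is an illustrative sketch of such a check (assuming exact rational values and the StochasticGame fields sketched in Section 2).

```python
def satisfies_bellman(game, r):
    """Check the Bellman condition for a guessed value vector r (dict vertex -> Fraction)."""
    for v in game.v1:
        if r[v] != max(r[u] for u in game.edges[v]):
            return False
    for v in game.v2:
        if r[v] != min(r[u] for u in game.edges[v]):
            return False
    for v in game.vp:
        if r[v] != sum(game.prob[v][u] * r[u] for u in game.edges[v]):
            return False
    return True
```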

Consequences of the Bellman condition.

We now see some properties of value vectors 𝗋 that satisfy the Bellman condition. Since boundary vertices are probabilistic vertices, the following is immediate.

Proposition 3.

Let 𝗋 be a value vector satisfying the Bellman condition. Then for all 1 ≤ i ≤ k_𝗋, for all v ∈ Bnd(𝖱(i)), there exists an out-neighbour of v with 𝗋-value less than 𝔯_i and there exists an out-neighbour of v with 𝗋-value greater than 𝔯_i. Formally, there exist 1 ≤ i₁, i₂ ≤ k_𝗋 such that 𝔯_{i₁} < 𝔯_i < 𝔯_{i₂}, E(v) ∩ 𝖱(i₁) ≠ ∅, and E(v) ∩ 𝖱(i₂) ≠ ∅.

A corollary of Proposition 3 is that the 𝗋-classes with the smallest and the biggest 𝗋-values have no boundary vertices. Note that there may also exist 𝗋-classes other than these that do not contain boundary vertices (see 𝖱(4) in Example 2). Next, we see that the Bellman condition entails that each restriction 𝒢𝖱(i) is a stochastic game.

Proposition 4.

If 𝗋 is a value vector that satisfies the Bellman condition, then for all 1 ≤ i ≤ k_𝗋, we have that 𝒢↾𝖱(i) is a stochastic game.

In Proposition 5, we make a crucial observation about long-run behaviours of plays in 𝒢, which is that either player can ensure with probability 1 that the token eventually reaches an 𝗋-class from which it does not exit. This follows from the Borel-Cantelli lemma [23] due to the fact that there is a positive probability to reach an 𝗋-class without boundary vertices following a finite number of edges out of boundary vertices.

Proposition 5.

Let 𝗋 be a value vector satisfying the Bellman condition. Suppose the strategy of Player i (i ∈ {1, 2}) is such that each time the token reaches a vertex v ∈ V_i, (s)he moves the token to a vertex v′ in the same 𝗋-class as v. Then, with probability 1, the token eventually reaches a class 𝖱(j) for some 1 ≤ j ≤ k_𝗋 from which it never exits.

Finally, we define the notion of trap subgames of 𝒢↾𝖱(i), which will be used in the subsequent discussion. We denote by P^1_{𝖱(i)} the Player 1 positive attractor set of Bnd(𝖱(i)), i.e., the set of vertices in 𝒢↾𝖱(i) that are positive winning for Player 1 for the Reach(Bnd(𝖱(i))) objective. The complement T^1_{𝖱(i)} = 𝖱(i) ∖ P^1_{𝖱(i)} is a trap for Player 1 in 𝒢↾𝖱(i), and with abuse of notation, we use the same symbol T^1_{𝖱(i)} to denote the subset of 𝒢↾𝖱(i) as well as the trap subgame. We note that if 𝖱(i) does not have boundary vertices, that is, if Bnd(𝖱(i)) = ∅, then it holds that P^1_{𝖱(i)} = ∅ and T^1_{𝖱(i)} = 𝖱(i). We can analogously define P^2_{𝖱(i)} and T^2_{𝖱(i)} for Player 2. Given 𝒢↾𝖱(i), these sets can be computed in polynomial time using attractor computations [25].
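The positive attractor itself is computed by the standard fixed-point iteration; the sketch below (our own helper, applied to the vertex sets of the restriction 𝒢↾𝖱(i)) adds Player 1 and probabilistic vertices that have some successor in the attractor, and Player 2 vertices all of whose successors are in it.

```python
def positive_attractor_p1(target, v1, v2, vp, edges):
    """Vertices from which Player 1 reaches `target` with positive probability.
    The trap T^1 of a class is the complement of the returned set within the class."""
    attr = set(target)
    changed = True
    while changed:
        changed = False
        for v in (v1 | v2 | vp) - attr:
            succ = edges[v]
            if (v in v2 and succ <= attr) or (v not in v2 and succ & attr):
                attr.add(v)
                changed = True
    return attr
```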

Example 6.

We compute these sets for the restrictions shown in Figure 2. For i ∈ {1, 4, 5}, since Bnd(𝖱(i)) is empty, we have that T^1_{𝖱(i)} = 𝖱(i) and P^1_{𝖱(i)} = ∅. For 𝖱(2), we have that P^1_{𝖱(2)} = 𝖱(2) and T^1_{𝖱(2)} = ∅. For 𝖱(3), we have that T^1_{𝖱(3)} = {v₆} and P^1_{𝖱(3)} = {v₇, v₈, v₉}.

3.2 Characterisation of the value vector

We describe in Theorem 7 a necessary and sufficient set of conditions for a given vector 𝗋 to be equal to the expected φ-value vector 𝗌. In addition to Bellman, Theorem 7 makes use of two more conditions, which we define before stating the theorem.

  • lower-bound condition: for all 1 ≤ i ≤ k_𝗋, Player 1 wins {φ > 𝔯_i − ε_φ} almost surely in the trap subgame T^1_{𝖱(i)} from all vertices in T^1_{𝖱(i)}.

  • upper-bound condition: for all 1 ≤ i ≤ k_𝗋, Player 2 wins {φ < 𝔯_i + ε_φ} almost surely in the trap subgame T^2_{𝖱(i)} from all vertices in T^2_{𝖱(i)}.

Theorem 7.

The only vector 𝗋 whose components all have denominators at most den_φ and that satisfies Bellman, lower-bound, and upper-bound is the expected φ-value vector 𝗌.

Proof.

We show in Lemma 8 that 𝗌 satisfies the three conditions. We show in Lemma 9 that if 𝗋 is a vector that satisfies the three conditions, then 𝗋 is less than ε_φ distance away from 𝗌, that is, 𝗌 − ε_φ < 𝗋 < 𝗌 + ε_φ. In particular, if each component of 𝗋 can be written as p/q, where p, q are both integers and q is at most den_φ, then it follows that 𝗋 is equal to 𝗌.

In the rest of the section, we prove Lemmas 8 and 9 used in the proof of Theorem 7.

Lemma 8.

The expected φ-value vector 𝗌 satisfies the three conditions in Theorem 7.

Proof.

The fact that 𝗌 satisfies Bellman follows directly from Proposition 1. We show that lower-bound holds for 𝗌. The proof for upper-bound is analogous.

Suppose for the sake of contradiction that lower-bound does not hold, that is, there exist 1 ≤ i ≤ k_𝗌 and a vertex v in T^1_{𝖲(i)} such that Player 2 has a positive winning strategy from v for the objective {φ ≤ 𝔰_i − ε_φ} in T^1_{𝖲(i)}. Since {φ ≤ 𝔰_i − ε_φ} is a prefix-independent objective, from [9, Theorem 1] we have that there exists another vertex v′ in T^1_{𝖲(i)} such that Player 2 has an almost-sure winning strategy from v′ for the same objective {φ ≤ 𝔰_i − ε_φ} in T^1_{𝖲(i)}. If Player 2 follows this strategy from v′ in the original game 𝒢, then one of the following two cases holds:

  • Player 1 always moves the token to a vertex in 𝖲(i). Since v′ is in the trap T^1_{𝖲(i)} for Player 1 in 𝒢↾𝖲(i), Player 2 can force the token to remain in T^1_{𝖲(i)} forever, and follow the almost-sure winning strategy to ensure that with probability 1, the outcome satisfies the objective {φ ≤ 𝔰_i − ε_φ}.

  • Player 1 eventually moves the token to a vertex out of 𝖲(i). Since 𝗌 satisfies Bellman, the token moves to an 𝗌-class with a smaller value than 𝔰_i.

In both cases, the expected φ-value of the outcome is less than 𝔰_i. This is a contradiction since v′ ∈ 𝖲(i), and the expected φ-value of every vertex in 𝖲(i) is equal to 𝔰_i.

Lemma 9.

If a vector 𝗋 satisfies the three conditions in Theorem 7, then 𝗌 − ε_φ < 𝗋 < 𝗌 + ε_φ. In particular, we have the following:

  • If 𝗋 satisfies the Bellman and lower-bound conditions, then 𝗌 > 𝗋 − ε_φ.

  • If 𝗋 satisfies the Bellman and upper-bound conditions, then 𝗌 < 𝗋 + ε_φ.

Proof sketch.

We prove the first case. The proof for the second case follows by symmetry, that is, we essentially replace Player 1 by Player 2, and {φ > 𝔯_i − ε_φ} by {φ < 𝔯_i + ε_φ}. We describe an optimal strategy σ₁ of Player 1 and give a sketch of its optimality.

Since 𝗋 satisfies the lower-bound condition, we have that for all 1 ≤ i ≤ k_𝗋, Player 1 has an almost-sure winning strategy σ^T_{𝖱(i)} in the trap subgame T^1_{𝖱(i)} to win the objective {φ > 𝔯_i − ε_φ} almost surely from all vertices in T^1_{𝖱(i)}. From the definition of P^1_{𝖱(i)}, Player 1 has a positive winning strategy σ^P_{𝖱(i)} in the restricted game 𝒢↾𝖱(i) from vertices in P^1_{𝖱(i)} for the Reach(Bnd(𝖱(i))) objective. By following σ^P_{𝖱(i)}, the token either reaches Bnd(𝖱(i)) with positive probability, or ends up in T^1_{𝖱(i)}, from where Player 2 can ensure that the token never leaves T^1_{𝖱(i)}. Using these strategies of Player 1 in 𝒢↾𝖱(i), we construct a strategy σ₁ of Player 1 that is optimal for the expected φ-value in the original game 𝒢: as long as the token is in the class 𝖱(i) in 𝒢, the strategy σ₁ mimics σ^T_{𝖱(i)} if the token is in T^1_{𝖱(i)}, and σ₁ mimics σ^P_{𝖱(i)} if the token is in P^1_{𝖱(i)}.

Note that whenever the token is on a vertex v ∈ V₁ in 𝖱(i), the strategy σ₁ always moves the token to a vertex v′ in the same 𝗋-class 𝖱(i) as v (i.e., a token only exits an 𝗋-class from a Player 2 vertex or from a boundary vertex), and thus, Proposition 5 applies. Whenever the token exits a class 𝖱(i) to reach a different class 𝖱(i′), then as long as the token remains in 𝖱(i′), the strategy σ₁ follows σ^T_{𝖱(i′)} if the token is in T^1_{𝖱(i′)}, and σ₁ follows σ^P_{𝖱(i′)} if the token is in P^1_{𝖱(i′)}.

Since Proposition 5 applies, we have that for all strategies of Player 2, with probability 1, the token eventually reaches an 𝗋-class 𝖱(j) from which it never exits. Moreover, the strategy σ₁ ensures that with probability 1, the token eventually reaches T^1_{𝖱(j)} in 𝖱(j), from which it never leaves. Indeed, if not, then the token would visit vertices in P^1_{𝖱(j)} infinitely often, having a fixed positive probability of reaching Bnd(𝖱(j)) from each such visit because of σ^P_{𝖱(j)}. Thus, with probability 1, the token would eventually reach Bnd(𝖱(j)), from where it could escape to a different 𝗋-class, which contradicts the fact that the token stays in 𝖱(j) forever.

Since φ is prefix-independent, the φ-value of a play only depends on the trap T^1_{𝖱(j)} it ends up in. If the game begins from a vertex v ∈ 𝖱(i), then for 1 ≤ j ≤ k_𝗋, let p_j denote the probability that the token ends up in the trap subgame T^1_{𝖱(j)} from which it never exits. Since 𝗋 satisfies Bellman, we have that ∑_j p_j · 𝔯_j = 𝔯_i. Since 𝗋 satisfies lower-bound, Player 1 has an almost-sure winning strategy for {φ > 𝔯_j − ε_φ} in T^1_{𝖱(j)}. Thus, for all strategies σ₂ of Player 2, the expected value of an outcome of (σ₁, σ₂) from v ∈ 𝖱(i) is greater than ∑_j p_j · (𝔯_j − ε_φ), which is 𝔯_i − ε_φ. This holds for all vertices v in 𝒢, giving us the desired result 𝗌 > 𝗋 − ε_φ.

We also note that the optimal strategy σ₁ always either follows an almost-sure winning strategy σ^T_{𝖱(i)} for the threshold objective {φ > 𝔯_i − ε_φ} or a positive winning strategy for a Reach objective. Since there exist memoryless positive winning strategies for the Reach objective [20], we have the following bound on the memory requirement of σ₁.

Corollary 10.

The memory requirement of σ₁ is at most the maximum over all 1 ≤ i ≤ k_𝗋 of the memory requirement of an almost-sure winning strategy σ^T_{𝖱(i)} for the threshold objective {φ > 𝔯_i − ε_φ}. Moreover, if σ^T_{𝖱(i)} is a deterministic strategy, then so is σ₁.

3.3 Bounding the denominators in the value vector

In this section, we discuss the problem of obtaining an upper bound den_φ for the denominators of the expected φ-values 𝔰_i of vertices for a bounded prefix-independent objective φ in a game 𝒢. In [16], the technique of value classes is used to compute the values of vertices for Boolean prefix-independent objectives. It is stated without proof that the probability of satisfaction of a parity or a Streett objective [2] from each vertex can be written as p/q, where q ≤ (δ̂)^{4|E|} and δ̂ is the maximum denominator over all edge probabilities in the game. As such, we were not able to directly generalise this bound to the expectation of quantitative prefix-independent objectives. Instead, we make the following observations:

  • Let 𝖲(i) be an 𝗌-class without boundary vertices. If the token is in 𝖲(i) at some point in the play, then since 𝗌 satisfies the Bellman condition, neither player has an incentive to move the token out of 𝖲(i). Since there are no boundary vertices in 𝖲(i), the token does not exit 𝖲(i) from a probabilistic vertex either, and remains in 𝖲(i) forever. Thus, the value 𝔰_i of 𝖲(i) depends only on the internal structure of 𝖲(i). We denote by den̄_φ an upper bound on the denominators of values of 𝗌-classes without boundary vertices. It is a simpler problem to find den̄_φ than to find den_φ, as each class without boundary vertices can be treated as a subgame in which each vertex has the same expected φ-value, or equivalently, the subgame consists of exactly one 𝗌-class.

  • On the other hand, suppose 𝖲(i) is an 𝗌-class containing at least one boundary vertex, and let v be a boundary vertex in 𝖲(i). Then, since 𝗌 satisfies the Bellman condition, we have 𝗌_v = ∑_{v′∈E(v)} ℙ(v)(v′) · 𝗌_{v′}, which is also the value 𝔰_i of 𝖲(i). Thus, 𝔰_i can be written in terms of the values of classes reachable from v in one step and the probabilities with which those classes are reached. In fact, we construct in the proof of Theorem 11 a system of linear equations to show that the value of each 𝗌-class with boundary vertices can be expressed solely in terms of the transition probabilities of the outgoing edges from boundary vertices and the values of 𝗌-classes without boundary vertices.

The method to calculate den̄_φ depends on the specific objective; we illustrate as an example in Section 4 a way to obtain den̄_φ for a particular kind of objective called the window mean-payoff objective. Once we know den̄_φ for an objective φ, we can use Theorem 11 to obtain den_φ in terms of den̄_φ.

Theorem 11.

The denominator of the value of each 𝗌-class in 𝒢 is at most den_φ = 2^{|V|} · δ̂^{|V|³} · (den̄_φ)^{|V|}.

We note that this theorem implies that the number of bits required to write den_φ is polynomial in the number of vertices in the game and in the number of bits required to write den̄_φ. We devote the rest of this section to the proof of Theorem 11. For ease of notation, we denote the number of 𝗌-classes in the game by k instead of k_𝗌 for the rest of this section. If no 𝗌-class has boundary vertices, then den_φ equals den̄_φ and we are done. So we assume there exists at least one class that contains boundary vertices. Let m ≥ 1 denote the number of 𝗌-classes with boundary vertices; therefore, there are k − m 𝗌-classes without boundary vertices. Since there always exists at least one 𝗌-class without boundary vertices (Proposition 3), we have that m < k. Let B = {1, 2, …, m} and C = {m+1, …, k}. We index the 𝗌-classes such that each class with boundary vertices has its index in B and each class without boundary vertices has its index in C. Furthermore, in the sets B and C, the classes are indexed in increasing order of their values. That is, for i, j both in B or both in C, we have i < j if and only if 𝔰_i < 𝔰_j. We show bounds on the denominators of the 𝗌-values of classes with boundary vertices, i.e., 𝔰_1, …, 𝔰_m, in terms of the 𝗌-values of classes without boundary vertices, i.e., 𝔰_{m+1}, …, 𝔰_k.

For all i ∈ B = {1, 2, …, m}, pick an arbitrary boundary vertex from Bnd(𝖲(i)) and call this the representative vertex u_i of Bnd(𝖲(i)). For all i ∈ B and j ∈ {1, 2, …, k}, let p_{i,j} denote the probability of reaching the class 𝖲(j) from u_i in one step. Since 𝗌 satisfies the Bellman condition, we have that ∑_{1≤j≤k} p_{i,j} · 𝔰_j = 𝔰_i. It is helpful to split this sum based on whether j ∈ B or j ∈ C, i.e., whether 1 ≤ j ≤ m or m+1 ≤ j ≤ k. We rewrite the sums as ∑_{j∈B} p_{i,j} · 𝔰_j + ∑_{j∈C} p_{i,j} · 𝔰_j = 𝔰_i for all i ∈ B, and we represent this system of equations below using matrices.

$$\begin{pmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,m} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m,1} & p_{m,2} & \cdots & p_{m,m} \end{pmatrix} \begin{pmatrix} \mathfrak{s}_1 \\ \mathfrak{s}_2 \\ \vdots \\ \mathfrak{s}_m \end{pmatrix} + \begin{pmatrix} p_{1,m+1} & p_{1,m+2} & \cdots & p_{1,k} \\ p_{2,m+1} & p_{2,m+2} & \cdots & p_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m,m+1} & p_{m,m+2} & \cdots & p_{m,k} \end{pmatrix} \begin{pmatrix} \mathfrak{s}_{m+1} \\ \mathfrak{s}_{m+2} \\ \vdots \\ \mathfrak{s}_k \end{pmatrix} = \begin{pmatrix} \mathfrak{s}_1 \\ \mathfrak{s}_2 \\ \vdots \\ \mathfrak{s}_m \end{pmatrix}$$

This system of equations is of the form Q_B · 𝔰_B + Q_C · 𝔰_C = 𝔰_B. Rearranging terms gives us (I − Q_B) · 𝔰_B = Q_C · 𝔰_C, where I is the m × m identity matrix. It follows from Proposition 12 that the equation (I − Q_B) · 𝔰_B = Q_C · 𝔰_C has a unique solution.

Proposition 12.

The matrix (I − Q_B) is invertible.

Let α denote the least common multiple (lcm) of the denominators of the p_{i,j} for 1 ≤ i ≤ m and 1 ≤ j ≤ k. We have 0 < α ≤ δ̂^{mk}, where δ̂ is the maximum denominator over all edge probabilities in 𝒢. We multiply both sides of the equation (I − Q_B) · 𝔰_B = Q_C · 𝔰_C by α to get α(I − Q_B) · 𝔰_B = αQ_C · 𝔰_C, and note that all the elements of α(I − Q_B) and αQ_C are integers. Let D be the determinant of the matrix α(I − Q_B), and for 1 ≤ i ≤ m, let N_i be the determinant of the matrix obtained by replacing the i-th column of α(I − Q_B) with the column vector αQ_C · 𝔰_C. Since α(I − Q_B) is invertible, by Cramer's rule [32], we have that 𝔰_i = N_i / D. Proposition 13 shows that |D| is an integer and is at most (2α)^m, and Proposition 14 shows that N_i has denominator at most (den̄_φ)^{k−m}.

Proposition 13.

The absolute value of the determinant of α(I − Q_B), i.e., |D|, is an integer and is at most (2α)^m, which is at most 2^{|V|} · δ̂^{|V|³}.

Proposition 14.

The denominator of N_i is at most (den̄_φ)^{k−m}, which is at most (den̄_φ)^{|V|}.

Since the denominator of 𝔰_i is at most |D| times the denominator of N_i, we obtain the bound stated in Theorem 11. ∎
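To illustrate the linear system used in the proof, the following sketch solves (I − Q_B)·𝔰_B = Q_C·𝔰_C exactly over the rationals on a toy instance that is not derived from the game in Figure 1; the probabilities are made up so that each row sums to 1, and the boundary-free classes are fixed to values 0 and 1.

```python
from fractions import Fraction

def solve_exact(A, b):
    """Solve A x = b over the rationals by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                M[r] = [x - M[r][col] * y for x, y in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

# Toy instance: m = 2 classes with boundary vertices, k - m = 2 classes without.
QB = [[Fraction(1, 2), Fraction(0)], [Fraction(0), Fraction(1, 2)]]
QC = [[Fraction(1, 4), Fraction(1, 4)], [Fraction(1, 6), Fraction(1, 3)]]
sC = [Fraction(0), Fraction(1)]
I_minus_QB = [[Fraction(i == j) - QB[i][j] for j in range(2)] for i in range(2)]
rhs = [sum(QC[i][j] * sC[j] for j in range(2)) for i in range(2)]
print(solve_exact(I_minus_QB, rhs))   # values of the two boundary classes: [1/2, 2/3]
```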

4 Expectation of window mean-payoff objectives

In this section, we apply the results from the previous section for two types of window mean-payoff objectives introduced in [12]:

  (i)

    fixed window mean-payoff (FWMP(ℓ)), in which a window length ℓ ≥ 1 is given, and

  (ii)

    bounded window mean-payoff (BWMP) in which for every play, we need a bound on window lengths.

We define these objectives below.

For a play π in a stochastic game 𝒢, the mean payoff of an infix π(i, i+n) is the average of the payoffs of the edges in the infix and is defined as 𝖬𝖯(π(i, i+n)) = (1/n) ∑_{k=i}^{i+n−1} w(v_k, v_{k+1}). Given a window length ℓ ≥ 1 and a threshold λ, a play π = v₀v₁⋯ in 𝒢 satisfies the fixed window mean-payoff objective FWMP_𝒢(ℓ, λ) if from every position after some point, it is possible to start an infix of length at most ℓ with mean payoff at least λ.

$$\mathsf{FWMP}_{\mathcal{G}}(\ell, \lambda) = \{\pi \in \mathsf{Plays}_{\mathcal{G}} \mid \exists k \geq 0 \cdot \forall i \geq k \cdot \exists j \in \{1, \ldots, \ell\} : \mathsf{MP}(\pi(i, i+j)) \geq \lambda\}$$

We omit the subscript 𝒢 when it is clear from the context. We extend the definition of windows from [12] to arbitrary thresholds. Given a threshold λ, a play π = v₀v₁⋯, and 0 ≤ i < j, we say that the λ-window π(i, j) is open if the mean payoff of π(i, k) is less than λ for all i < k ≤ j. Otherwise, the λ-window is closed. A play π satisfies FWMP(ℓ, λ) if and only if from some point on, every λ-window in π closes within at most ℓ steps. Note that FWMP(ℓ, λ) ⊆ FWMP(ℓ′, λ) for ℓ ≤ ℓ′, as a smaller window length is a stronger constraint.
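To make the window condition concrete, the following small sketch (our own, with made-up payoffs) checks whether the λ-window opened at a given position of a finite play prefix closes within ℓ steps.

```python
from fractions import Fraction

def window_closes(payoffs, i, ell, lam):
    """Does some infix of length 1..ell starting at position i have mean payoff >= lam?"""
    total = Fraction(0)
    for j in range(1, min(ell, len(payoffs) - i) + 1):
        total += payoffs[i + j - 1]
        if total >= lam * j:          # mean payoff of the length-j infix is >= lam
            return True
    return False

# Edge payoffs along a play prefix: the 0-window opened at position 0 closes in 3 steps
# (the mean of -1, -2, 3 is 0) but not within 2 steps.
payoffs = [Fraction(-1), Fraction(-2), Fraction(3), Fraction(1)]
assert window_closes(payoffs, 0, 3, Fraction(0))
assert not window_closes(payoffs, 0, 2, Fraction(0))
```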

We also consider another window mean-payoff objective, called the bounded window mean-payoff objective BWMP_𝒢(λ). A play satisfies the objective BWMP(λ) if there exists a window length ℓ ≥ 1 such that the play satisfies FWMP(ℓ, λ).

$$\mathsf{BWMP}_{\mathcal{G}}(\lambda) = \{\pi \in \mathsf{Plays}_{\mathcal{G}} \mid \exists \ell \geq 1 : \pi \in \mathsf{FWMP}_{\mathcal{G}}(\ell, \lambda)\}$$

Note that both FWMP(ℓ, λ) and BWMP(λ) are Boolean prefix-independent objectives.

Expected window mean-payoff values.

Corresponding to the Boolean objectives FWMP(ℓ, λ) and BWMP(λ), we define quantitative versions of these objectives. Given a play π in a stochastic game 𝒢 and a window length ℓ, the φ_FWMP(ℓ)-value of π is sup{λ ∈ ℝ ∣ π ∈ FWMP_𝒢(ℓ, λ)}, the supremum threshold λ such that the play satisfies FWMP_𝒢(ℓ, λ). Using the notation from Section 2, we denote the expected φ_FWMP(ℓ)-value of a vertex v by 𝔼_v(φ_FWMP(ℓ)). We define 𝔼_v(φ_BWMP), the expected φ_BWMP-value of a vertex v, analogously. If W is an integer such that the payoff w(e) of each edge e in 𝒢 satisfies |w(e)| ≤ W, then for all plays π in 𝒢, we have that φ_FWMP(ℓ)(π) and φ_BWMP(π) lie between −W and W. Thus, φ_FWMP(ℓ) and φ_BWMP are bounded objectives.

Decision problems.

Given a stochastic game 𝒢, a vertex v, and a threshold λ, we have the following expectation problems for the window mean-payoff objectives:

  • expected φ_FWMP(ℓ)-value problem: Given a window length ℓ ≥ 1, is 𝔼_v(φ_FWMP(ℓ)) ≥ λ?

  • expected φ_BWMP-value problem: Is 𝔼_v(φ_BWMP) ≥ λ?

As considered in previous works [12, 5, 6], the window length ℓ is usually small (ℓ ≤ |V|), and hence we assume that ℓ is given in unary (while the edge payoffs are given in binary).

4.1 Expected fixed window mean-payoff value

We give tight complexity bounds for the expected φ_FWMP(ℓ)-value problem. We use the characterisation from Theorem 7 to present our main result that this problem is in 𝖴𝖯 ∩ 𝖼𝗈𝖴𝖯 (Theorem 18). We show in [21] that simple stochastic games [20], which are in 𝖴𝖯 ∩ 𝖼𝗈𝖴𝖯 [13], reduce to the expected φ_FWMP(ℓ)-value problem, giving a tight lower bound.

In order to use the characterisation, we show the existence of the bound den̄_FWMP(ℓ) for the φ_FWMP(ℓ) objective. We show in Lemma 15 that the expected φ_FWMP(ℓ)-value 𝔰_i of a class 𝖲(i) without boundary vertices takes a special form, namely, 𝔰_i is the mean payoff of a sequence of at most ℓ edges in 𝖲(i). We use the fact that the φ_FWMP(ℓ)-value of every play π is the largest λ such that, eventually, every λ-window in π closes in at most ℓ steps, and that λ is the mean payoff of a sequence of at most ℓ edges in π. To complete the argument, we show that if both players play optimally, then, with probability 1, the φ_FWMP(ℓ)-value of the outcome π is equal to 𝔰_i, and thus, 𝔰_i is also of this form.

Lemma 15.

The expected φ_FWMP(ℓ)-value 𝔰_i of vertices in a class 𝖲(i) without boundary vertices is equal to the mean payoff of some sequence of ℓ or fewer edges in 𝖲(i). That is, 𝔰_i is of the form (1/j) · (w(e₁) + ⋯ + w(e_j)) for some j ≤ ℓ and edges e₁, e₂, …, e_j.

This observation gives us the bound den̄_FWMP(ℓ) on the denominators of the values of 𝗌-classes without boundary vertices. To see this, let ŵ = max{q ∣ ∃p, ∃e ∈ E : w(e) = p/q with p, q co-prime} be the maximum denominator over all edge payoffs in 𝒢. Since j ≤ ℓ, and each of w(e₁), w(e₂), …, w(e_j) is a rational number with denominator at most ŵ, the denominator of the sum w(e₁) + ⋯ + w(e_j) is at most ŵ·(ŵ−1)·(ŵ−2)⋯(ŵ−(ℓ−1)) if ℓ ≤ ŵ, and at most ŵ! if ℓ > ŵ. In both cases, this is at most ŵ^ℓ. Dividing the sum by j ≤ ℓ multiplies the denominator by a factor of at most ℓ.

Corollary 16.

The expected φ_FWMP(ℓ)-value of vertices in 𝗌-classes without boundary vertices can be written as p/q, where p and q are integers and q ≤ ℓ·ŵ^ℓ.

From Theorem 11, we get that the denominator of 𝔰_i for each class 𝖲(i) in 𝒢 is at most 2^{|V|} · δ̂^{|V|³} · (den̄_FWMP(ℓ))^{|V|}, which is at most 2^{|V|} · δ̂^{|V|³} · (ℓ·ŵ^ℓ)^{|V|}.

Lemma 17.

The expected φ_FWMP(ℓ)-value of each vertex in 𝒢 can be written as a fraction p/q, where p, q are integers, q ≤ 2^{|V|} · δ̂^{|V|³} · (ℓ·ŵ^ℓ)^{|V|}, and −W·q ≤ p ≤ W·q.

We now state the main result of this section for the expected φ_FWMP(ℓ)-value problem.

Theorem 18.

The expected φ_FWMP(ℓ)-value problem is in 𝖴𝖯 ∩ 𝖼𝗈𝖴𝖯 when ℓ is given in unary. Memory of size ℓ suffices for Player 1, while memory of size ℓ·|V| suffices for Player 2.

Proof.

To show membership of the expected φ_FWMP(ℓ)-value problem in 𝖴𝖯 ∩ 𝖼𝗈𝖴𝖯, we first guess the expected φ_FWMP(ℓ)-value vector 𝗌, that is, the expected φ_FWMP(ℓ)-value 𝗌_v of every vertex v in the game. From Lemma 17, it follows that the number of bits required to write 𝗌_v for every vertex v is polynomial in the size of the input. Thus, the vector 𝗌 can be guessed in polynomial time.

Then, to verify the guess, it is sufficient to verify the Bellman, lower-bound, and upper-bound conditions for φ_FWMP(ℓ). It is easy to see that the Bellman condition can be checked in polynomial time. Checking the lower-bound and upper-bound conditions, i.e., checking the almost-sure satisfaction of the threshold Boolean objective FWMP(ℓ, λ) for appropriate thresholds λ in the trap subgames of each 𝗌-class, can be done in polynomial time [22]. Thus, the decision problem 𝔼_v(φ_FWMP(ℓ)) ≥ λ is in 𝖭𝖯, and moreover, since there is exactly one value vector that satisfies the conditions in Theorem 7, the decision problem is, in fact, in 𝖴𝖯. Analogously, the complement decision problem 𝔼_v(φ_FWMP(ℓ)) < λ is also in 𝖴𝖯. Hence, the expected φ_FWMP(ℓ)-value problem is in 𝖴𝖯 ∩ 𝖼𝗈𝖴𝖯.

From the description of the optimal strategy in Lemma 9, it follows from Corollary 10 that the memory requirement for the expected φ_FWMP(ℓ) objective is no greater than the memory requirement for the almost-sure satisfaction of the corresponding threshold objectives, which is ℓ for Player 1 and ℓ·|V| for Player 2 [22].
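Putting the pieces of the above proof together, the verification step can be organised as in the sketch below; this is illustrative and not the paper's pseudocode, and `almost_sure_win` stands for an assumed oracle (e.g., the polynomial-time procedure of [22] for FWMP(ℓ, λ)) deciding almost-sure satisfaction of the relevant threshold objective inside a trap subgame. It reuses `satisfies_bellman` from the sketch in Section 3.

```python
def verify_value_vector(game, s, eps, traps_p1, traps_p2, almost_sure_win):
    """Verify a guessed value vector s (dict vertex -> Fraction) as in Theorem 7.
    traps_p1[i] / traps_p2[i] are the trap subgames T^1 / T^2 of the i-th s-class;
    almost_sure_win(trap, player, bound) is an assumed oracle for almost-sure
    satisfaction of {phi > bound} (player 1) or {phi < bound} (player 2)."""
    if not satisfies_bellman(game, s):
        return False
    for i, value in enumerate(sorted(set(s.values())), start=1):
        if traps_p1[i] and not almost_sure_win(traps_p1[i], 1, value - eps):
            return False   # lower-bound condition fails for class i
        if traps_p2[i] and not almost_sure_win(traps_p2[i], 2, value + eps):
            return False   # upper-bound condition fails for class i
    return True
```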

4.2 Expected bounded window mean-payoff value

We would like to apply the characterisation in Theorem 7 to φ_BWMP to show that the expected φ_BWMP-value problem is in 𝖴𝖯 ∩ 𝖼𝗈𝖴𝖯, and thus, we show the existence of the bound den_BWMP for the φ_BWMP objective. We show in Lemma 19 that the expected φ_BWMP-value 𝔰_i of a class 𝖲(i) without boundary vertices is the mean payoff of a simple cycle in 𝖲(i).

Lemma 19.

The expected φ_BWMP-value 𝔰_i of vertices in a class 𝖲(i) without boundary vertices is equal to the mean-payoff value of a simple cycle in 𝖲(i). That is, 𝔰_i is of the form (1/j) · (w(e₁) + ⋯ + w(e_j)) for some j ≤ |V| and edges e₁, e₂, …, e_j of a simple cycle.

While Lemma 19 is analogous to Lemma 15 for φ_FWMP(ℓ), the proof of Lemma 19 is more involved since the φ_BWMP objective requires one to consider windows of arbitrary lengths. In the proof, we make use of the fact that memoryless strategies suffice for Player 1 to play optimally for the almost-sure satisfaction of the BWMP objective [22]. In the resulting MDP (which has the same set of vertices as the game 𝒢↾𝖲(i)), we carefully analyse the resulting plays when Player 2 plays optimally. The following corollary of Lemma 19 states the bound den̄_BWMP for the φ_BWMP objective.

Corollary 20.

The expected φ_BWMP-value of vertices in 𝗌-classes without boundary vertices can be written as p/q, where p and q are integers and q ≤ |V|·ŵ^{|V|}.

From Theorem 11, we get that the denominator of 𝔰_i of each class 𝖲(i) in 𝒢 is at most 2^{|V|} · δ̂^{|V|³} · (den̄_BWMP)^{|V|}, which is at most 2^{|V|} · δ̂^{|V|³} · (|V|·ŵ^{|V|})^{|V|}.

Lemma 21.

The expected φ_BWMP-value of each vertex in 𝒢 can be written as p/q, where p, q are integers, q ≤ 2^{|V|} · δ̂^{|V|³} · (|V|·ŵ^{|V|})^{|V|}, and −W·q ≤ p ≤ W·q.

We now state the main result of this section for the expected φ_BWMP-value problem.

Theorem 22.

The expected φ_BWMP-value problem is in 𝖴𝖯 ∩ 𝖼𝗈𝖴𝖯. Memoryless strategies suffice for Player 1. Player 2 requires infinite memory in general.

Proof sketch.

This proof follows a similar structure to the proof of Theorem 18. As before, the Bellman condition can be checked in polynomial time. Checking the lower-bound and upper-bound conditions involves checking the almost-sure satisfaction of the Boolean objective BWMP for appropriate thresholds, which reduces to checking the satisfaction of BWMP in non-stochastic games [22], which in turn reduces to total supremum payoff [12], which is in 𝖴𝖯 ∩ 𝖼𝗈𝖴𝖯 [27]. Both of these reductions are polynomial-time reductions, and hence, the expected φ_BWMP-value problem is in 𝖴𝖯 ∩ 𝖼𝗈𝖴𝖯.

Memoryless strategies suffice for Player 1 for almost-sure satisfaction of BWMP(λ) [22]. Player 2 requires infinite memory in general for the BWMP(λ) objective even in non-stochastic games [12], which are a special case of stochastic games. Deterministic strategies suffice for both players. Hence, we get the memory requirements of an optimal strategy for the expected φ_BWMP-value problem using Corollary 10.

5 Discussion

We discuss some concluding remarks about the relation of our work to previous work [16], which deals with the satisfaction of Boolean prefix-independent objectives. We also discuss practical implementations for window mean-payoff objectives and applicability of our technique to other prefix-independent objectives.

Comparison with [16].

In [16], it suffices to check the almost-sure satisfaction of the same Boolean objective ψ in all value classes. In contrast, for quantitative objectives, the threshold Boolean objective whose almost-sure satisfaction we check depends on the guessed value of the value class (“Can Player 1 satisfy {φ > 𝔯_i − ε_φ} with probability 1?”). Another key difference is that for Boolean objectives, the value classes without boundary vertices are precisely the extremal value classes, that is, the classes with values 0 and 1. In the case of quantitative objectives, there may be multiple intermediate value classes without boundary vertices, making reasoning about the correctness of the reduction more difficult.

We note that if we apply our approach to Boolean prefix-independent objectives (such as Büchi, coBüchi, parity) by viewing them as quantitative objectives mapping each play to 0 or 1, then we recover the algorithm given in [16].

Applicability to other prefix-independent objectives.

Recall that in order to be able to apply our characterisation to a prefix-independent objective φ, we require bounds W_φ on the image of φ and den_φ on the denominators of the optimal expected φ-values of vertices in the game. These bounds can easily be derived individually for the following quantitative prefix-independent objectives, and thus, our technique applies to these objectives.

Mean-payoff objective.

The mean-payoff value of any play is bounded between the minimum and maximum edge payoffs in the game, directly giving bounds on the image of the mean-payoff objective. Since deterministic memoryless optimal strategies exist for both players [24], the expected mean-payoff value is obtained from a stationary distribution of the Markov chain induced by fixing memoryless optimal strategies of both players. This gives denominator bounds for the expected mean-payoff values of vertices in the game.

Limsup and liminf objectives.

Since the liminf objective is equivalent to the FWMP(ℓ) objective with window length ℓ = 1 [21], and the limsup objective is the dual of the liminf objective, our analysis of window mean-payoff objectives generalises the limsup and liminf objectives.

Positive frequency payoff objective [29].

Here, each vertex has a payoff, and this objective returns the maximum payoff among all vertices that are visited with positive frequency in an infinite play. This objective is prefix-independent, as the frequency of a vertex is independent of finite prefixes. The image of the objective is bounded between the minimum and maximum vertex payoffs. We observe that the value of vertices in a value class without boundary vertices is equal to the payoff of a vertex in the class (giving a denominator bound for these vertices), and Theorem 11 uses this to give a denominator bound for vertices in value classes with boundary vertices.

Practical implementation.

We discuss approaches to solve the expected φ-value problem for the window mean-payoff objectives in practice.

A trivial algorithm that works for both the φ_FWMP(ℓ) and φ_BWMP objectives is to iterate over all possible value vectors. For each value vector, we check if the conditions in Theorem 7 are satisfied, which can be done in polynomial time. Since there are exponentially many possible value vectors, this algorithm has an exponential running time in the worst case.

Another technique is value iteration [15], which has been observed to be an anytime algorithm for the standard mean-payoff objective [34]. An anytime algorithm gives better precision the longer it runs, and can be interrupted at any time. Given a game 𝒢 with |V| vertices, the expected φ_FWMP(ℓ)-value problem on 𝒢 reduces to the expected liminf-value problem on a game 𝒢′ that is exponentially larger than 𝒢. The liminf objective is a well-studied objective in the context of value iteration [15, 11]. We describe the reduction in [21], which also gives the expected φ_FWMP(ℓ)-values of vertices in 𝒢.

Since the size of the graph of 𝒢′ is much bigger than that of 𝒢, we would like to work with 𝒢′ on the fly rather than explicitly constructing the entire graph. In [34], the authors show bounded value iteration for objectives such as reachability and mean payoff. They also discuss that the algorithm can be extended to be asynchronous and to use partial exploration. As future work, we would like to look at the practicality of on-demand asynchronous value iteration for the liminf objective, or even the window mean-payoff objectives φ_FWMP(ℓ) and φ_BWMP directly. An interesting aspect would be to investigate heuristics and optimisations such as sound value iteration [37], optimistic value iteration [31], and topological value iteration [3] to speed up the practical running time.

References

  • [1] D. Andersson and P. B. Miltersen. The Complexity of Solving Stochastic Games on Graphs. In ISAAC, volume 5878 of Lecture Notes in Computer Science, pages 112–121. Springer, 2009. doi:10.1007/978-3-642-10631-6_13.
  • [2] K. R. Apt and E. Grädel. Lectures in Game Theory for Computer Scientists. Cambridge University Press, 2011.
  • [3] M. Azeem, A. Evangelidis, J. Křetínský, A. Slivinskiy, and M. Weininger. Optimistic and Topological Value Iteration for Simple Stochastic Games. In ATVA, pages 285–302. Springer, 2022. doi:10.1007/978-3-031-19992-9_18.
  • [4] C. Baier and J-P. Katoen. Principles of model checking. MIT Press, 2008.
  • [5] B. Bordais, S. Guha, and J-F. Raskin. Expected Window Mean-Payoff. In FSTTCS, volume 150 of LIPIcs, pages 32:1–32:15, 2019. doi:10.4230/LIPIcs.FSTTCS.2019.32.
  • [6] T. Brihaye, F. Delgrange, Y. Oualhadj, and M. Randour. Life is Random, Time is Not: Markov Decision Processes with Window Objectives. Logical Methods in Computer Science, Volume 16, Issue 4, December 2020. doi:10.23638/LMCS-16(4:13)2020.
  • [7] V. Bruyère, E. Filiot, M. Randour, and J-F. Raskin. Meet your expectations with guarantees: Beyond worst-case synthesis in quantitative games. Information and Computation, 254:259–295, 2017. doi:10.1016/j.ic.2016.10.011.
  • [8] T. Brázdil, V. Brožek, K. Chatterjee, V. Forejt, and A. Kučera. Markov Decision Processes with Multiple Long-run Average Objectives. Logical Methods in Computer Science, Volume 10, Issue 1, Feb 2014. doi:10.2168/LMCS-10(1:13)2014.
  • [9] K. Chatterjee. Concurrent games with tail objectives. Theoretical Computer Science, 388(1):181–198, 2007. doi:10.1016/j.tcs.2007.07.047.
  • [10] K. Chatterjee, L. Doyen, H. Gimbert, and T. A. Henzinger. Randomness for free. Information and Computation, 245:3–16, 2015. doi:10.1016/j.ic.2015.06.003.
  • [11] K. Chatterjee, L. Doyen, and T. A. Henzinger. A Survey of Stochastic Games with Limsup and Liminf Objectives. In ICALP, pages 1–15. Springer, 2009. doi:10.1007/978-3-642-02930-1_1.
  • [12] K. Chatterjee, L. Doyen, M. Randour, and J-F. Raskin. Looking at mean-payoff and total-payoff through windows. Information and Computation, 242:25–52, 2015. doi:10.1016/j.ic.2015.03.010.
  • [13] K. Chatterjee and N. Fijalkow. A reduction from parity games to simple stochastic games. EPTCS, 54:74–86, 2011. doi:10.4204/eptcs.54.6.
  • [14] K. Chatterjee and T. A. Henzinger. Reduction of stochastic parity to stochastic mean-payoff games. Information Processing Letters, 106(1):1–7, 2008. doi:10.1016/j.ipl.2007.08.035.
  • [15] K. Chatterjee and T. A. Henzinger. Value Iteration. In 25 Years of Model Checking - History, Achievements, Perspectives, LNCS 5000, pages 107–138. Springer, 2008. doi:10.1007/978-3-540-69850-0_7.
  • [16] K. Chatterjee, T. A. Henzinger, and F. Horn. Stochastic Games with Finitary Objectives. In MFCS, pages 34–54. Springer, 2009. doi:10.1007/978-3-642-03816-7_4.
  • [17] K. Chatterjee, Z. Křetínská, and J. Křetínský. Unifying Two Views on Multiple Mean-Payoff Objectives in Markov Decision Processes. Logical Methods in Computer Science, Volume 13, Issue 2, July 2017. doi:10.23638/LMCS-13(2:15)2017.
  • [18] K. Chatterjee, T. Meggendorfer, R. Saona, and J. Svoboda. Faster Algorithm for Turn-based Stochastic Games with Bounded Treewidth. In SODA, pages 4590–4605, 2023. doi:10.1137/1.9781611977554.ch173.
  • [19] A. Church. Application of Recursive Arithmetic to the Problem of Circuit Synthesis. Journal of Symbolic Logic, 28(4):289–290, 1963. doi:10.2307/2271310.
  • [20] A. Condon. The complexity of stochastic games. Information and Computation, 96(2):203–224, 1992. doi:10.1016/0890-5401(92)90048-K.
  • [21] L. Doyen, P. Gaba, and S. Guha. Expectation in Stochastic Games with Prefix-independent Objectives, 2024. doi:10.48550/arXiv.2405.18048.
  • [22] L. Doyen, P. Gaba, and S. Guha. Stochastic Window Mean-payoff Games. In FoSSaCS Part I, volume 14574 of LNCS, pages 34–54. Springer, 2024. doi:10.1007/978-3-031-57228-9_3.
  • [23] R. Durrett. Probability: Theory and Examples. Cambridge University Press, 2010.
  • [24] A. Ehrenfeucht and J. Mycielski. Positional Strategies for Mean Payoff Games. Int. Journal of Game Theory, 8(2):109–113, 1979. doi:10.1007/BF01768705.
  • [25] N. Fijalkow, N. Bertrand, P. Bouyer, R. Brenguier, A. Carayol, J. Fearnley, H. Gimbert, F. Horn, R. Ibsen-Jensen, N. Markey, B. Monmege, P. Novotny, M. Randour, O. Sankur, S. Schmitz, O. Serre, and M. Skomra. Games on Graphs. Online, 2024. doi:10.48550/arXiv.2305.10546.
  • [26] P. Gaba and S. Guha. Optimising expectation with guarantees for window mean payoff in Markov decision processes. In AAMAS, pages 820–828. International Foundation for Autonomous Agents and Multiagent Systems / ACM, 2025. doi:10.5555/3709347.3743600.
  • [27] T. M. Gawlitza and H. Seidl. Games through Nested Fixpoints. In CAV, pages 291–305. Springer, 2009. doi:10.1007/978-3-642-02658-4_24.
  • [28] D. Gillette. Stochastic games with zero stop probabilities, pages 179–188. Princeton University Press, December 1958.
  • [29] H. Gimbert and E. Kelmendi. Submixing and Shift-Invariant Stochastic Games. International Journal of Game Theory, 52(4):1179–1214, 2023. doi:10.1007/s00182-023-00860-5.
  • [30] V. Gurvich and P. B. Miltersen. On the Computational Complexity of Solving Stochastic Mean-Payoff Games. CoRR, abs/0812.0486, 2008. arXiv:0812.0486.
  • [31] A. Hartmanns and B. L. Kaminski. Optimistic Value Iteration. In CAV, pages 488–511. Springer, 2020. doi:10.1007/978-3-030-53291-8_26.
  • [32] K. Hoffman and R. Kunze. Linear Algebra. Prentice-Hall, Inc., 1971.
  • [33] J. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Co., Inc., 1979.
  • [34] J. Křetínský, T. Meggendorfer, and M. Weininger. Stopping Criteria for Value Iteration on Stochastic Games with Quantitative Objectives. In LICS, pages 1–14, 2023. doi:10.1109/LICS56636.2023.10175771.
  • [35] D. A. Martin. The Determinacy of Blackwell Games. The Journal of Symbolic Logic, 63(4):1565–1581, 1998. doi:10.2307/2586667.
  • [36] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 1994.
  • [37] T. Quatmann and J-P. Katoen. Sound Value Iteration. In CAV, pages 643–661. Springer, 2018. doi:10.1007/978-3-319-96145-3_37.
  • [38] L. S. Shapley. Stochastic Games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, October 1953. doi:10.1073/pnas.39.10.1095.
  • [39] U. Zwick and M. Paterson. The Complexity of Mean Payoff Games on Graphs. Theoretical Computer Science, 158(1&2):343–359, 1996. doi:10.1016/0304-3975(95)00188-3.