Expectation in Stochastic Games with Prefix-Independent Objectives
Abstract
Stochastic two-player games model systems with an environment that is both adversarial and stochastic. In this paper, we study the expected value of bounded quantitative prefix-independent objectives in the context of stochastic games. We show a generic reduction from the expectation problem to linearly many instances of the almost-sure satisfaction problem for threshold Boolean objectives. The result follows from partitioning the vertices of the game into so-called value classes where each class consists of vertices of the same value. Our procedure further entails that the memory required by both players to play optimally for the expectation problem is no more than the memory required by the players to play optimally for the almost-sure satisfaction problem for a corresponding threshold Boolean objective.
We show the applicability of the framework to compute the expected window mean-payoff measure in stochastic games. The window mean-payoff measure strengthens the classical mean-payoff measure by computing the mean payoff over windows of bounded length that slide along an infinite path. We show that the decision problem to check if the expected window mean-payoff value is at least a given threshold is in when the window length is given in unary.
Keywords and phrases: Stochastic games, finitary objectives, mean payoff, reactive synthesis
2012 ACM Subject Classification: Mathematics of computing → Stochastic processes; Theory of computation → Probabilistic computation
1 Introduction
Reactive systems typically have an infinite execution where the controller continually reacts to the environment. Given a specification, the reactive controller synthesis problem [19] concerns synthesising a policy for the controller such that the specification is satisfied by the system for all behaviours of the environment. This problem is modelled using two-player turn-based games on graphs, where the two players are the controller () and the environment (), the vertices and the edges of the game graph represent the states and transitions of the system, and the objective of is to satisfy the specification. An execution of the system is then an infinite path in the game graph. The reactive controller synthesis problem corresponds to determining if there exists a strategy of such that for all strategies of , the outcome satisfies the objective. If such a winning strategy exists, then we would also like to synthesise it. The environment is considered an adversarial player to ensure that the specification is met even in the worst-case scenario.
Objectives are either Boolean or quantitative. Each execution either satisfies a Boolean objective or does not satisfy . The set of executions that satisfy forms a language over infinite words with the alphabet being the set of vertices in the graph. On the other hand, a quantitative objective evaluates the performance of the execution by a numerical metric, which aims to maximise and aims to minimise. A quantitative objective can be viewed as a real-valued function over infinite paths in the graph.
In the presence of uncertainty or probabilistic behaviour, the game graph becomes stochastic. Fixing the strategies of the two players gives a distribution over infinite paths in the game graph. For Boolean objectives , the goal of is to maximise the probability that an outcome satisfies . For quantitative objectives , there are two possible views:
- Satisfaction: Given a threshold , to maximise the probability that the -value of the outcome is greater than the threshold;
- Expectation: To maximise the -value of the outcome in expectation.
Either view may be desirable depending on the context [8, 7, 6, 5]. The satisfaction view can be seen as a Boolean objective: the -value of the outcome is either greater than the threshold or it is not. The expectation view is more nuanced, and is the subject of study in this paper.
In this paper, we look at the expectation problem for quantitative prefix-independent objectives (also referred to as tail objectives). These are objectives that do not depend on finite prefixes of the plays, but only on the long-run behaviour of the system. In many systems, we are willing to allow undesirable behaviour in the short term if the long-run behaviour is desirable. Prefix-independent objectives model such requirements and thus are of interest to study [9]. Prefix-independent objectives also have the benefit that they satisfy the Bellman equations [34], which simplifies their analysis. The expectation problem for such objectives arises naturally in many scenarios. For example:
- (i) An algorithmic trading system is designed to generate profit by executing trades based on real-time market data. Following an initial phase of learning and unstable behaviour due to parameter tuning, the average profit over a bounded time window must always exceed a threshold, and decisions need to be made within short, well-defined intervals for them to be effective.
- (ii) A power plant may have different strategies to produce power (such as coal, solar, nuclear, wind) and must allocate resources among these strategies so as to maximise the power produced in expectation.
Contributions.
All of our contributions are with regard to quantitative prefix-independent objectives that are bounded (i.e., the image of is bounded between integers and ) and such that a bound on the denominators of the optimal expected -values of vertices in the game is known. The bound on the image ensures determinacy [35], that is, the players have optimal strategies, and the bound on the denominator of optimal values of vertices discretises the search space. These bounds often exist and are easily derivable for common objectives of interest such as mean payoff.
Our primary contribution is a reduction of the expectation problem for such an objective to linearly many instances of the almost-sure satisfaction problem for threshold Boolean objectives for thresholds . Deciding the almost-sure satisfaction of is conceptually simpler than computing the expected value of : in the former, we only need to check whether the measure of the paths that satisfy the objective is equal to one, whereas in the latter, one must take a weighted average of path values, each weighted by the measure of the corresponding set of paths. Our technique is generic in the sense that when an algorithm for the almost-sure satisfaction problem for is known, we directly obtain the complexity and a way to solve the expectation problem for .
Our reduction builds on the technique introduced in [16] for Boolean prefix-independent objectives and non-trivially extends it to quantitative prefix-independent objectives for which the bounds and are known. The expected -values of vertices are nondeterministically guessed, and we present a characterisation (Theorem 7, similar to [16, Lemma 8] but with important and subtle differences) to verify the guess. We also explicitly construct strategies for both players that are optimal for the expectation of , in terms of almost-sure winning strategies for (proof of Lemma 9). The memory requirement for the constructed optimal strategies is the same as that of the almost-sure winning strategies (Corollary 10).
Our framework gives an alternative approach to solve the expectation problem for well-studied objectives such as expected mean payoff and gives new results for not-as-well-studied objectives such as the window mean-payoff objectives introduced in [12]. As our secondary contribution, we illustrate our technique by applying it to two variants of window mean-payoff objectives: fixed () and bounded (BWMP) window mean-payoff. Using our reduction, we are able to show that for both of these objectives, the expectation problem is in (Theorem 18 and Theorem 22), a result that was not known before. The upper bound for window mean-payoff objectives matches that of the special case of simple stochastic games [20, 13], and thus would require a major breakthrough to be improved. The lower bounds on the memory requirements for these objectives carry over from the special case of non-stochastic games [12, 22]. We summarise the complexity results and bounds on the memory requirements for the window mean-payoff objectives in Table 1.
Related work.
Stochastic games were introduced by Shapley [38], who studied them under expectation semantics for discounted-sum objectives. In [14], it was shown that solving stochastic parity games reduces to solving stochastic mean-payoff games. Further, solving stochastic parity games, stochastic mean-payoff games, and simple stochastic games (i.e., stochastic games with reachability objective) are all polynomial-time equivalent [30, 1], and thus, are all in [13]. A sub-exponential (or even quasi-polynomial) time deterministic algorithm for simple stochastic games on graphs with poly-logarithmic treewidth was proposed in [18]. In [29], sufficient conditions on the objective were shown such that optimal deterministic memoryless strategies exist for the players. In [34], value iteration to solve the expectation problem in stochastic games with reachability, safety, total-payoff, and mean-payoff objectives was studied.
Mean-payoff objectives were studied initially in two-player games, without stochasticity [24, 39], and with stochasticity in [28]. Finitary versions were introduced as window mean-payoff objectives [12]. For finitary mean-payoff objectives, the satisfaction problem [6] and the expectation problem [5] were studied in the special case of Markov decision processes (MDPs), which correspond to stochastic games with a trivial adversary. Expected mean payoff, expected discounted payoff, expected total payoff, etc., are widely studied for MDPs [36, 4]. Both the expectation problem [5] and the satisfaction problem [6] for the objective are in , while they are in for the BWMP objective. Ensuring the satisfaction and expectation semantics simultaneously was studied in MDPs for the mean-payoff objective in [17] and for the window mean-payoff objectives in [26]. In both cases, the complexity was shown to be no greater than that of only expectation optimisation.
The satisfaction problem for window mean-payoff objectives has been studied for two-player stochastic games in [22]. While positive and almost-sure satisfaction of are in , it follows from [22] that the problem is in for quantitative satisfaction i.e., with threshold probabilities . Furthermore, the satisfaction problem of BWMP is in and thus has the same complexity as that of the special case of MDPs [6].
Due to lack of space, some of the proofs and details have been omitted. A full version of the paper can be found in [21].
2 Preliminaries
Probability distributions.
A probability distribution over a finite non-empty set is a function such that . We denote by the set of all probability distributions over . For the algorithmic and complexity results, we assume that probabilities are given as rational numbers.
Stochastic games.
We consider two-player turn-based zero-sum stochastic games (or simply, stochastic games in the sequel). The two players are referred to as (she/her) and (he/him). A stochastic game is given by , where:
-
is a directed graph with a finite set of vertices and a set of directed edges such that for all vertices , the set of out-neighbours of is non-empty, i.e., (no deadlocks).
-
is a partition of . The vertices in belong to , the vertices in belong to , and the vertices in are called probabilistic vertices;
-
is a probability function that describes the behaviour of probabilistic vertices in the game. It maps each probabilistic vertex to a probability distribution over the set of out-neighbours of such that for all (i.e., all out-neighbours have non-zero probability);
-
is a payoff function assigning a rational payoff to every edge in the game.
Stochastic games are played in rounds. The game starts by initially placing a token on some vertex. At the beginning of a round, if the token is on a vertex , and for , then chooses an out-neighbour ; otherwise , and an out-neighbour is chosen with probability . receives from the amount given by the payoff function, and the token moves to for the next round. This continues ad infinitum resulting in an infinite sequence such that for all , called a play. For , we denote by the infix of . Its length is , the number of edges. We denote by the finite prefix of , and by the infinite suffix of . We denote by and the set of all plays and the set of all finite prefixes in respectively. We denote by the last vertex of the prefix . We denote by () the set of all prefixes such that .
A stochastic game with is a non-stochastic two-player game, a stochastic game with is a Markov decision process (MDP), and a stochastic game with is a Markov chain. Figure 1 shows an example of a stochastic game; vertices are shown as circles, vertices as boxes, and probabilistic vertices as diamonds.
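To make the components of the definition concrete, the following is a minimal Python sketch of one way such a game could be represented; the field names (owner, edges, prob, payoff) and the three-vertex toy instance are our own illustrative choices and are not taken from Figure 1.

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass
class StochasticGame:
    # owner[v] is "max", "min", or "prob" (probabilistic vertices)
    owner: dict
    # edges[v] is the non-empty list of out-neighbours of v (no deadlocks)
    edges: dict
    # prob[v][u] is the probability of moving from probabilistic vertex v to u;
    # it must be positive for every out-neighbour and sum to 1 over edges[v]
    prob: dict
    # payoff[(v, u)] is the rational payoff of edge (v, u)
    payoff: dict

    def check(self):
        for v, succs in self.edges.items():
            assert succs, f"vertex {v} has no out-neighbour"
            if self.owner[v] == "prob":
                dist = self.prob[v]
                assert set(dist) == set(succs) and all(p > 0 for p in dist.values())
                assert sum(dist.values()) == 1

# A toy instance (not the game of Figure 1): a Max vertex, a Min vertex, and a
# probabilistic vertex moving to either of them with probability 1/2.
toy = StochasticGame(
    owner={"a": "max", "b": "min", "c": "prob"},
    edges={"a": ["c"], "b": ["c"], "c": ["a", "b"]},
    prob={"c": {"a": Fraction(1, 2), "b": Fraction(1, 2)}},
    payoff={("a", "c"): 1, ("b", "c"): -1, ("c", "a"): 0, ("c", "b"): 0},
)
toy.check()
```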
Subgames and traps.
Given a stochastic game , a subset of vertices induces a subgame if every vertex has an outgoing edge in , that is , and every probabilistic vertex has all outgoing edges in , that is . The induced subgame is , where , and and are restrictions of and respectively to . If is such that for all , if then and if then , then induces a subgame, and the subgame is a trap for in , since can ensure that if the token reaches , then it never escapes.
Boolean objectives.
A Boolean objective is a Borel-measurable subset of [35]. A play satisfies an objective if . In a stochastic game with objective , the objective of is , and since is a zero-sum game, the objective of is the complement set . An example of a Boolean objective is reachability, denoted , the set of all plays that visit a vertex in the target set . The reachability objective is formally defined, and more examples of Boolean objectives are given, in [21].
Quantitative objectives.
A quantitative objective is a Borel-measurable function of the form . In a stochastic game with objective , the objective of is and the objective of is , the negative of . Let be a play. Some common examples of quantitative objectives include the mean-payoff objective , and the liminf objective . In this work, we also consider the window mean-payoff objective, which is defined in Section 4. Corresponding to a quantitative objective , we define threshold objectives which are Boolean objectives of the form for thresholds and for . We denote this threshold objective succinctly as .
Prefix independence.
An objective is said to be prefix-independent if it only depends on the suffix of a play. Formally, a Boolean objective is prefix-independent if for all plays and with a common suffix (that is, can be obtained from by removing and adding a finite prefix), we have that if and only if . Similarly, a quantitative objective is prefix-independent if for all plays and with a common suffix, we have that . Mean payoff and liminf are examples of prefix-independent objectives, whereas reachability and discounted payoff [2] are not.
Strategies.
A (deterministic or pure) strategy for Player in a game is a function that maps prefixes ending in a vertex to a successor of . (We only consider the satisfaction and expectation of Borel-measurable objectives, and deterministic strategies suffice for such objectives [10]; satisfying two goals simultaneously requires randomisation and is not allowed by our definition.) Strategies can be realised as the output of a (possibly infinite-state) Mealy machine [33]. The memory size of a strategy is the smallest number of states a Mealy machine defining can have. A strategy is memoryless if only depends on the last element of the prefix , that is, for all prefixes if , then .
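As an illustration of how a finite-memory strategy can be realised by a Mealy machine, here is a small Python sketch; the class and method names are ours, and the two-state example (alternating between two successors) is purely hypothetical.

```python
class MealyStrategy:
    """A finite-memory strategy: a set of memory states, an update function on
    observed vertices, and an output function choosing the next vertex."""

    def __init__(self, initial_state, update, output):
        self.state = initial_state
        self.update = update    # (memory_state, current_vertex) -> memory_state
        self.output = output    # (memory_state, current_vertex) -> chosen successor

    def next_move(self, vertex):
        choice = self.output(self.state, vertex)
        self.state = self.update(self.state, vertex)
        return choice

# A hypothetical 2-state strategy that alternates between successors "u1" and
# "u2" on every visit to vertex "v"; its memory size is 2.
alternate = MealyStrategy(
    initial_state=0,
    update=lambda m, v: 1 - m if v == "v" else m,
    output=lambda m, v: ("u1" if m == 0 else "u2") if v == "v" else None,
)
```

A memoryless strategy corresponds to the special case of a Mealy machine with a single memory state.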
A strategy profile is a pair of strategies and of and respectively. A play is consistent with a strategy of () if for all with , we have . A play is an outcome of a profile if it is consistent with both and . For a Boolean objective , we denote by the probability that an outcome of the profile in with initial vertex satisfies .
Satisfaction probability of Boolean objectives.
Let be a Boolean objective. A strategy of is winning with probability from a vertex in for objective if for all strategies of . A strategy of is positive winning (resp., almost-sure winning) from for in with objective if (resp., ) for all strategies of . In the above, if such a strategy exists, then the vertex is said to be positive winning (resp., almost-sure winning) for . If a vertex is positive winning (resp., almost-sure winning) for , then is said to play optimally from if she follows a positive (resp., almost-sure) winning strategy from . We omit analogous definitions for .
Expected value of quantitative objectives.
Let be a quantitative objective. Given a strategy profile and an initial vertex , let denote the expected -value of the outcome of the strategy profile from , that is, the expectation of over all plays with initial vertex under the probability measure . We only consider objectives that are Borel-measurable and whose image is bounded. Thus, by determinacy of Blackwell games [35], we have that stochastic games with objective are determined. That is, we have . We call this quantity the expected -value of the vertex and denote it by . We say that plays optimally from a vertex if she follows a strategy such that for all strategies of , the expected -value of the outcome is at least . Similarly, plays optimally if he follows a strategy such that for all strategies of , the expected -value of the outcome is at most . If is a prefix-independent objective, then we have the following relation between the expected -value of a vertex and the expected -values of its out-neighbours.
Proposition 1 (Bellman equations).
If is a prefix-independent objective, then the following equations hold for all .
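In a standard formulation (written here with generic notation: $\mathrm{val}(v)$ for the expected value of vertex $v$, $V_{\max}$, $V_{\min}$, $V_P$ for the three parts of the vertex partition, and $P$ for the probability function; the paper's own symbols may differ), the Bellman equations read:

```latex
\mathrm{val}(v) =
\begin{cases}
  \max_{(v,u) \in E} \mathrm{val}(u) & \text{if } v \in V_{\max},\\[2pt]
  \min_{(v,u) \in E} \mathrm{val}(u) & \text{if } v \in V_{\min},\\[2pt]
  \sum_{(v,u) \in E} P(v)(u) \cdot \mathrm{val}(u) & \text{if } v \in V_{P}.
\end{cases}
```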
In this paper, we consider the expectation problem for prefix-independent objectives. Our solution in turn uses the almost-sure satisfaction problem. The decision problems are defined as follows.
Decision problems.
Given a stochastic game , a quantitative objective , a vertex , and a threshold , the following decision problems are relevant:
- Almost-sure satisfaction problem: Is the vertex almost-sure winning for for a threshold objective ?
- Expectation problem: Is the expected -value of the vertex at least the threshold ?
3 Reducing expectation to almost-sure satisfaction
In this section, we show a reduction (Theorem 7) of the expectation problem for bounded quantitative prefix-independent objectives to the almost-sure satisfaction problem for the corresponding threshold objectives . The reduction involves guessing a value for every vertex in the game, and then verifying if the guessed values are equal to the expected -values of the vertices. Theorem 7 generalises [16, Lemma 8] which studies the satisfaction problem for prefix-independent Boolean objectives, as Boolean objectives can be viewed as a special case of quantitative objectives by restricting the range to . We further discuss the difference in approaches between Theorem 7 and [16, Lemma 8] in Section 5.
Given a game and a bounded prefix-independent quantitative objective , our reduction requires the existence of an integer bound on the denominators of expected -values of vertices in . Since is bounded, there exists an integer such that for every play in . Thus, for every vertex in , one can write as , where and are integers such that and . The bounds and may depend on the objective and the structure of the graph, i.e., number of vertices, edge payoffs, probability distributions in the game, etc. These bounds effectively discretise the set of possible expected -values of the vertices, as there are at most distinct possible values. This directly gives a bound on the granularity of the possible expected -values of vertices, that is, the minimum difference between two possible values of vertices, and we represent this quantity by . Observe that given two rational numbers with denominators at most , the difference between them is at least , and thus, we let be .
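For concreteness, writing $Q$ for the assumed bound on denominators and $\varepsilon_Q$ for the granularity (these symbol names are ours), the last observation is the following short computation: two distinct rationals with denominators at most $Q$ satisfy

```latex
\left|\frac{a}{b} - \frac{c}{d}\right| \;=\; \frac{|ad - bc|}{bd} \;\geq\; \frac{1}{bd} \;\geq\; \frac{1}{Q^2},
\qquad 1 \le b, d \le Q,\quad \frac{a}{b} \neq \frac{c}{d},
```

so one may take $\varepsilon_Q = 1/Q^2$.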
3.1 Value vectors and value classes
We first define and give notations for value vectors, which are useful in describing the reduction, and then look at some of their interesting and useful properties.
Definitions and notations.
A vector of reals indexed by vertices in induces a partition of such that all vertices with the same value in belong to the same part in the partition. Let denote the number of parts in the partition, and let us denote the parts by . We call each part of the partition an -class, or simply, class if is clear from the context. For all , let denote the -value of the class . Given two vectors , we write if for all , we have , and we write if we have that and there exists such that . For a constant , we denote by the vector obtained by adding to each component of .
For all , a vertex is a boundary vertex if is a probabilistic vertex and has an out-neighbour not in , i.e., if and . Let denote the set of boundary vertices in the class . For all , let denote the restriction of to vertices in with all vertices in changed to absorbing vertices with a self-loop. The edge payoffs of these self loops are not important (we assume them to be ) as we restrict our attention to a subgame of that does not contain boundary vertices.
Example 2.
For the game in Figure 1, let be a value vector for vertices respectively. Since has five distinct values, we have , and the five -classes are , , , , and with values , , , , and respectively. Out of the five probabilistic vertices , , , and , we see that , , and are boundary vertices while and are not. Thus, , , and . We show the restrictions in Figure 2.
Let be a bounded prefix-independent quantitative objective. Analogous to the notation of a general value vector , we describe notations for the expected -value vector consisting of the expected -values of vertices in . For all vertices , let denote , the expected -value of vertex , and let denote the expected -value vector. Let denote the -class and let denote the -value of .
Given a vector , it follows from Proposition 1 that the following is a necessary (but not sufficient) condition for to be the expected -value vector .
-
Bellman condition: for every vertex , the following Bellman equations hold
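A minimal sketch of how the Bellman condition can be verified for a candidate value vector, reusing the game representation sketched in Section 2; the function name and the use of exact Fraction arithmetic are our own choices.

```python
from fractions import Fraction

def satisfies_bellman(game, x):
    """Check the Bellman condition for the candidate value vector x
    (a dict mapping each vertex to a Fraction)."""
    for v, succs in game.edges.items():
        values = [x[u] for u in succs]
        if game.owner[v] == "max":        # Max picks a best successor
            expected = max(values)
        elif game.owner[v] == "min":      # Min picks a worst successor
            expected = min(values)
        else:                             # probabilistic vertex: weighted average
            expected = sum(game.prob[v][u] * x[u] for u in succs)
        if x[v] != expected:
            return False
    return True
```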
Consequences of the Bellman condition.
We now see some properties of value vectors that satisfy the Bellman condition. Since boundary vertices are probabilistic vertices, the following is immediate.
Proposition 3.
Let be a value vector satisfying the Bellman condition. Then for all , for all , there exists an out-neighbour of with -value less than and there exists an out-neighbour of with -value greater than . Formally, there exist such that and and .
A corollary of Proposition 3 is that the -classes with the smallest and the biggest -values have no boundary vertices. Note that there may also exist -classes other than these that do not contain boundary vertices (see in Example 2). Next, we see that the Bellman condition entails that each restriction is a stochastic game.
Proposition 4.
If is a value vector that satisfies the Bellman condition, then for all , we have that is a stochastic game.
In Proposition 5, we make a crucial observation about long-run behaviours of plays in , which is that either player can ensure with probability that the token eventually reaches an -class from which it does not exit. This follows from the Borel-Cantelli lemma [23] due to the fact that there is a positive probability of reaching an -class without boundary vertices after following a finite number of edges out of boundary vertices.
Proposition 5.
Let be a value vector satisfying the Bellman condition. Suppose the strategy of () is such that each time the token reaches a vertex , (s)he moves the token to a vertex in the same -class as . Then, with probability 1, the token eventually reaches a class for some from which it never exits.
Finally, we define the notion of trap subgames of which will be used in the subsequent discussion. We denote by the positive attractor set of , i.e., the set of vertices in that are positive winning for for the objective. The complement is a trap for in , and with abuse of notation, we use the same symbol to denote the subset of as well as the trap subgame. We note that if does not have boundary vertices, that is, if , then it holds that and . We can analogously define and for . Given , these sets can be computed in polynomial time using attractor computations [25].
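The positive attractor used here can be computed by the standard backward fixpoint. Below is a sketch for one player (symmetric for the other), again over the representation from Section 2; the function name is our own choice.

```python
def positive_attractor(game, player, target):
    """Vertices from which `player` can reach `target` with positive probability.
    A vertex joins the attractor if it is in the target, if it belongs to
    `player` or is probabilistic and has some successor already in the
    attractor, or if it belongs to the opponent and all successors are in it."""
    attr = set(target)
    changed = True
    while changed:
        changed = False
        for v, succs in game.edges.items():
            if v in attr:
                continue
            if game.owner[v] in (player, "prob"):
                ok = any(u in attr for u in succs)
            else:   # opponent vertex: every choice must lead into the attractor
                ok = all(u in attr for u in succs)
            if ok:
                attr.add(v)
                changed = True
    return attr

# The complement of the attractor is a trap for `player`:
# trap = set(game.edges) - positive_attractor(game, "max", target)
```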
Example 6.
We compute these sets for the restrictions shown in Figure 2. For , since is empty, we have that and . For , we have that , and . For , we have that , .
3.2 Characterisation of the value vector
We describe in Theorem 7 a necessary and sufficient set of conditions for a given vector to be equal to the expected -value vector . In addition to Bellman, Theorem 7 makes use of two more conditions, which we define before stating the theorem.
- lower-bound condition: for all , wins almost surely in the trap subgame from all vertices in .
- upper-bound condition: for all , wins almost surely in the trap subgame from all vertices in .
Theorem 7.
The only vector , whose every component has denominator at most , that satisfies Bellman, lower-bound, and upper-bound is the expected -value vector .
Proof.
We show in Lemma 8 that satisfies the three conditions. We show in Lemma 9 that if is a vector that satisfies the three conditions, then is less than distance away from , that is, . In particular, if each component of can be written as , where , are both integers and is at most , then it follows that is equal to .
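Putting the three conditions together, the verification step of the reduction can be sketched as follows. The helpers value_classes and boundary_vertices are small functions written here, satisfies_bellman and positive_attractor are the sketches above, and restriction and almost_sure_wins are assumed oracles (the subgame with boundary vertices made absorbing, and a decision procedure for almost-sure satisfaction of a threshold objective); the sketch only illustrates the shape of the procedure, not the paper's formal construction.

```python
def value_classes(x):
    """Group vertices by their value in the candidate vector x."""
    classes = {}
    for v, val in x.items():
        classes.setdefault(val, set()).add(v)
    return classes

def boundary_vertices(game, x, cls):
    """Probabilistic vertices of the class with an out-neighbour of a different value."""
    return {v for v in cls
            if game.owner[v] == "prob" and any(x[u] != x[v] for u in game.edges[v])}

def is_expected_value_vector(game, x, restriction, almost_sure_wins):
    """Check the characterisation of Theorem 7 for a candidate vector x.
    `almost_sure_wins(subgame, player, threshold, vertices)` is assumed to
    return True iff `player` wins the threshold objective (at least the
    threshold for "max", at most the threshold for "min") almost surely from
    every vertex in `vertices`."""
    if not satisfies_bellman(game, x):                 # Bellman condition
        return False
    for cls_value, cls in value_classes(x).items():
        boundary = boundary_vertices(game, x, cls)
        sub = restriction(game, cls, boundary)         # boundary made absorbing
        # lower-bound condition: Max wins almost surely in her trap subgame
        trap_max = cls - positive_attractor(sub, "max", boundary)
        if trap_max and not almost_sure_wins(sub, "max", cls_value, trap_max):
            return False
        # upper-bound condition: Min wins almost surely in his trap subgame
        trap_min = cls - positive_attractor(sub, "min", boundary)
        if trap_min and not almost_sure_wins(sub, "min", cls_value, trap_min):
            return False
    return True
```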
Lemma 8.
The expected -value vector satisfies the three conditions in Theorem 7.
Proof.
The fact that satisfies Bellman follows directly from Proposition 1. We show that lower-bound holds for . The proof for upper-bound is analogous.
Suppose for the sake of contradiction that lower-bound does not hold, that is, there exists and a vertex in such that has a positive winning strategy from for the objective in . Since is a prefix-independent objective, from [9, Theorem 1] we have that there exists another vertex in such that has an almost-sure winning strategy from for the same objective in . If follows this strategy from in the original game , then one of the following two cases holds:
- always moves the token to a vertex in . Since is in the trap for in , can force the token to remain in forever, and follow the almost-sure winning strategy to ensure that with probability 1, the outcome satisfies the objective .
- eventually moves the token to a vertex out of . Since satisfies Bellman, the token moves to an -class with a smaller value than .
In both cases, the expected -value of the outcome is less than . This is a contradiction since , and the expected -value of every vertex in is equal to .
Lemma 9.
If a vector satisfies the three conditions in Theorem 7, then . In particular, we have the following:
- If satisfies the Bellman and lower-bound conditions, then .
- If satisfies the Bellman and upper-bound conditions, then .
Proof sketch.
We prove the first case. The proof for the second case follows by symmetry, that is, we essentially replace by , and by . We describe an optimal strategy of and give a sketch of its optimality.
Since satisfies the lower-bound condition, we have that for all , has an almost-sure winning strategy in the trap subgame to win the objective in almost surely from all vertices in . From the definition of , has a positive winning strategy in the restricted game from vertices in for the objective. By following , the token either reaches with positive probability, or ends up in from where can ensure that the token never leaves . Using these strategies of in , we construct a strategy of that is optimal for expected -value in the original game : As long as the token is in the class in , the strategy mimics if the token is in and mimics if the token is in .
Note that whenever the token is on a vertex in , the strategy always moves the token to a vertex in the same -class as (i.e., the token only exits an -class from a vertex or from a boundary vertex), and thus, Proposition 5 holds. Whenever the token exits a class to reach a different class , then as long as the token remains in , the strategy follows if the token is in , and follows if the token is in .
Since Proposition 5 holds, we have that for all strategies of , with probability , the token eventually reaches an -class from which it never exits. Moreover, the strategy ensures that with probability , the token eventually reaches in from which it never leaves. If this were not the case, then the token would visit vertices in infinitely often, each time with a fixed positive probability of reaching because of . Thus, with probability , the token would eventually reach from where it could escape to a different -class, which contradicts the fact that the token stays in forever.
Since is prefix-independent, the -value of a play only depends on the trap it ends up in. If the game begins from a vertex , then for , let denote the probability that the token ends up in the trap subgame from which it never exits. Since satisfies Bellman, we have that . Since satisfies lower-bound, has an almost-sure winning strategy for in . Thus, for all strategies of , the expected value of an outcome of from is greater than , which is . This holds for all vertices in , giving us the desired result .
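The strategy constructed in this proof can be pictured as the following sketch, which switches between an almost-sure winning strategy inside the current trap and a positive reachability strategy towards it. The dictionaries trap, as_strategy, and reach_strategy, indexed by value classes, are assumed to come from the lower-bound condition and the attractor computation; the strategy objects follow the MealyStrategy interface sketched in Section 2.

```python
def optimal_max_move(vertex, current_class, trap, as_strategy, reach_strategy):
    """One move of the composed strategy for Max inside the value class
    `current_class`: follow the almost-sure winning strategy while the token
    is in the trap, and the positive reachability strategy (towards the trap)
    otherwise. Either way, the chosen successor stays in the same value class,
    so Proposition 5 applies."""
    if vertex in trap[current_class]:
        return as_strategy[current_class].next_move(vertex)
    return reach_strategy[current_class].next_move(vertex)
```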
We also note that the optimal strategy always either follows an almost-sure winning strategy for the threshold objective or a positive winning strategy for a Reach objective. Since there exist memoryless positive winning strategies for the Reach objective [20], we have the following bound on the memory requirement of .
Corollary 10.
The memory requirement of is at most the maximum over all of the memory requirement of an almost-sure winning strategy for the threshold objective . Moreover, if is a deterministic strategy, then so is .
3.3 Bounding the denominators in the value vector
In this section, we discuss the problem of obtaining an upper bound for the denominators of the expected -values of vertices for a bounded prefix-independent objective in a game . In [16], the technique of value classes is used to compute the values of vertices for Boolean prefix-independent objectives. There, it is stated without proof that the probability of satisfaction of a parity or a Streett objective [2] from each vertex can be written as where and is the maximum denominator over all edge probabilities in the game. As such, we were not able to directly generalise this bound to the expectation of quantitative prefix-independent objectives. Instead, we make the following observations:
-
Let be an -class without boundary vertices. If the token is in at some point in the play, then since satisfies the Bellman condition, neither player has an incentive to move the token out of . Since there are no boundary vertices in , the token does not exit from a probabilistic vertex either, and remains in forever. Thus, the value of depends only on the internal structure of . We denote by an upper bound on the denominators of values of -classes without boundary vertices. It is a simpler problem to find than to find , as each class without boundary vertices can be treated as a subgame in which each vertex has the same expected -value, or equivalently, the subgame consists of exactly one -class.
-
On the other hand, suppose is an -class containing at least one boundary vertex, and let be a boundary vertex in . Then, since satisfies the Bellman condition, we have , which is also the value of . Thus, can be written in terms of the values of classes reachable from in one step and the probabilities with which those classes are reached. In fact, we construct in the proof of Theorem 11 a system of linear equations to show that the value of each -class with boundary vertices can be expressed solely in terms of transition probabilities of the outgoing edges from boundary vertices and values of -classes without boundary vertices.
The method to calculate depends on the specific objective; we illustrate as an example in Section 4 a way to obtain for a particular kind of objective called the window mean-payoff objective. Once we know for an objective , we can use Theorem 11 to obtain in terms of .
Theorem 11.
The denominator of the value of each -class in is at most .
We note that this theorem implies that the number of bits required to write is polynomial in the number of vertices in the game and in the number of bits required to write . We devote the rest of this section to the proof of Theorem 11. For ease of notation, we denote the number of -classes in the game by instead of for the rest of this section. If every -class has no boundary vertices, then we have equal to and we are done. So we assume there exists at least one class that contains boundary vertices. Let denote the number of -classes with boundary vertices, and therefore, there are -classes without boundary vertices. Since there always exists at least one -class without boundary vertices (Proposition 3), we have that . Let and . We index the -classes such that each class with boundary vertices has its index in and each class without boundary vertices has its index in . Furthermore, in the sets and , the classes are indexed in increasing order of their values. That is, for both in or both in , we have if and only if . We show bounds on the denominators of -values of classes with boundary vertices, i.e., in terms of -values of classes without boundary vertices, i.e., .
For all , pick an arbitrary boundary vertex from and call this the representative vertex of . For all and , let denote the probability of reaching the class from in one step. Since satisfies the Bellman condition, we have that . It is helpful to split this sum based on whether or , i.e., whether or . We rewrite the sums as for all , and we represent this system of equations below using matrices.
This system of equations is of the form . Rearranging terms gives us where is the identity matrix. It follows from Proposition 12 that the equation has a unique solution.
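In a generic notation (ours, chosen only for illustration: $r_i$ for the value of the $i$-th class with boundary vertices, $w_j$ for the value of the $j$-th class without boundary vertices, and $p_{ij}$ for the one-step probability of moving from the representative vertex of class $i$ to class $j$), the system described above has the shape

```latex
r_i \;=\; \sum_{j \in I} p_{ij}\, r_j \;+\; \sum_{j \in J} p_{ij}\, w_j
\qquad \text{for all } i \in I,
```

that is, $r = P\,r + c$ in matrix form with $P = (p_{ij})_{i,j \in I}$ and $c_i = \sum_{j \in J} p_{ij}\, w_j$, which rearranges to $(\mathrm{Id} - P)\, r = c$.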
Proposition 12.
The matrix is invertible.
Let denote the least common multiple (lcm) of the denominators of for and . We have , where is the maximum denominator over all edge probabilities in . We multiply both sides of the equation by to get and note that all the elements of and are integers. Let be the determinant of the matrix , and for , let be the determinant of the matrix obtained by replacing the column of with the column vector . Since is invertible, by Cramer’s rule [32], we have that . Proposition 13 shows that is an integer and is at most and Proposition 14 shows that has denominator at most .
Proposition 13.
The absolute value of the determinant of , i.e., , is an integer and is at most , which is at most .
Proposition 14.
The denominator of is at most , which is at most .
Since the denominator of is at most times the denominator of , we obtain the bound stated in Theorem 11. ∎
4 Expectation of window mean-payoff objectives
In this section, we apply the results from the previous section to two types of window mean-payoff objectives introduced in [12]:
- (i) the fixed window mean-payoff objective (), in which a window length is given, and
- (ii) the bounded window mean-payoff objective (BWMP), in which, for every play, we need a bound on window lengths.
We define these objectives below.
For a play in a stochastic game , the mean payoff of an infix is the average of the payoffs of the edges in the infix and is defined as . Given a window length and a threshold , a play in satisfies the fixed window mean-payoff objective if from every position after some point, it is possible to start an infix of length at most with mean payoff at least .
We omit the subscript when it is clear from the context. We extend the definition of windows from [12] to arbitrary thresholds. Given a threshold , a play , and , we say that the -window is open if the mean payoff of is less than for all . Otherwise, the -window is closed. A play satisfies if and only if from some point on, every -window in closes within at most steps. Note that for as a smaller window length is a stronger constraint.
We also consider another window mean-payoff objective called the bounded window mean-payoff objective . A play satisfies the objective if there exists a window length such that the play satisfies .
Note that both and are Boolean prefix-independent objectives.
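To illustrate the window mechanism on a finite prefix, the following sketch checks whether every window opened along a finite sequence of edge payoffs closes within a given number of steps for a given threshold; the function names are ours, and the fragment only mirrors the definition on a finite prefix rather than deciding the (infinite-horizon) objective.

```python
from fractions import Fraction

def window_closes(payoffs, start, max_len, threshold):
    """Does the window opened at position `start` close within `max_len` steps,
    i.e., is there a prefix of length j <= max_len whose mean payoff is at
    least `threshold`?"""
    total = Fraction(0)
    for j in range(1, max_len + 1):
        if start + j > len(payoffs):
            break
        total += payoffs[start + j - 1]
        if total >= threshold * j:      # mean of the first j payoffs >= threshold
            return True
    return False

def all_windows_close(payoffs, max_len, threshold):
    """Check the window condition at every position from which a full window
    of length `max_len` fits inside the finite payoff sequence."""
    return all(window_closes(payoffs, i, max_len, threshold)
               for i in range(len(payoffs) - max_len + 1))

# Example: with window length 2 and threshold 0, every window of
# [1, -1, 2, -1, 1] closes, but with threshold 1 the window at position 1 stays open.
assert all_windows_close([1, -1, 2, -1, 1], 2, 0)
assert not all_windows_close([1, -1, 2, -1, 1], 2, 1)
```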
Expected window mean-payoff values.
Corresponding to the Boolean objectives and , we define quantitative versions of these objectives. Given a play in a stochastic game and a window length , the -value of is , the supremum threshold such that the play satisfies . Using notations from Section 2, we denote the expected -value of a vertex by . We define , the expected -value of a vertex analogously. If W is an integer such that the payoff of each edge in satisfies , then for all plays in , we have that and lie between and W. Thus, and are bounded objectives.
Decision problems.
Given a stochastic game , a vertex , and a threshold , we have the following expectation problems for the window mean-payoff objectives:
- Expected -value problem: Given a window length , is ?
- Expected -value problem: Is ?
As considered in previous works [12, 5, 6], the window length is usually small (), and hence we assume that is given in unary (while the edge-payoffs are given in binary).
4.1 Expected fixed window mean-payoff value
We give tight complexity bounds for the expected -value problem. We use the characterisation from Theorem 7 to present our main result that this problem is in (Theorem 18). We show in [21] that simple stochastic games [20], which are in [13], reduce to the expected -value problem, giving a tight lower bound.
In order to use the characterisation, we show the existence of the bound for the objective. We show in Lemma 15 that the expected -value of a class without boundary vertices takes a special form, that is, is the mean-payoff of a sequence of at most edges in . We use the fact that the -value of every play is the largest such that, eventually, every -window in closes in at most steps, and that is the mean payoff of a sequence of at most edges in . To complete the argument, we show that if both players play optimally, then, with probability , the -value of the outcome is equal to and thus, is also of this form.
Lemma 15.
The expected -value of vertices in a class without boundary vertices is equal to the mean payoff of some sequence of or fewer edges in . That is, is of the form for some and edges .
This observation gives us the bound on the denominators of the values of -classes without boundary vertices. To see this, let be the maximum denominator over all edge-payoffs in . Since , and each is a rational number with denominator at most , the denominator of the sum is at most if , and at most if . In both cases, this is at most .
Corollary 16.
The expected -value of vertices in -classes without boundary vertices can be written as where and are integers and .
From Theorem 11, we get that the denominator of for each class in is at most , which is at most .
Lemma 17.
The expected -value of each vertex in can be written as a fraction , where are integers, and , and .
We now state the main result of this section for the expected -value problem.
Theorem 18.
The expected -value problem is in when is given in unary. Memory of size suffices for , while memory of size suffices for .
Proof.
To show membership of the expected -value problem in , we first guess the expected -value vector , that is, the expected -value of every vertex in the game. From Lemma 17, it follows that the number of bits required to write for every vertex is polynomial in the size of the input. Thus, the vector can be guessed in polynomial time.
Then, to verify the guess, it is sufficient to verify the Bellman, lower-bound, and upper-bound conditions for . It is easy to see that the Bellman condition can be checked in polynomial time. Checking the lower-bound and upper-bound conditions, i.e., checking the almost-sure satisfaction of the threshold Boolean objective for appropriate thresholds in trap subgames in each -class can be done in polynomial time [22]. Thus, the decision problem of is in , and moreover, since there is exactly one value vector that satisfies the conditions in Theorem 7, the decision problem is, in fact, in . Analogously, the complement decision problem of is also in . Hence, the expected -value problem is in .
From the description of the optimal strategy in Lemma 9, it follows from Corollary 10 that the memory requirement for the expected objective is no greater than the memory requirement for the almost-sure satisfaction of the corresponding threshold objectives, which are and for and respectively [22].
4.2 Expected bounded window mean-payoff value
We would like to apply the characterisation in Theorem 7 to to show that the expected -value problem is in , and thus, we show the existence of the bound for the objective. We show in Lemma 19 that the expected -value of a class without boundary vertices is the mean payoff of a simple cycle in .
Lemma 19.
The expected -value of vertices in a class without boundary vertices is equal to the mean-payoff value of a simple cycle in . That is, is of the form for some and edges of a simple cycle.
While Lemma 19 is analogous to Lemma 15 for , the proof of Lemma 19 is more involved since the objective requires one to consider windows of arbitrary lengths. In the proof, we make use of the fact that memoryless strategies suffice for to play optimally for the almost-sure satisfaction of the BWMP objective [22]. In the resulting MDP (which has the same set of vertices as the game ), we carefully analyse the resulting plays when plays optimally. The following corollary of Lemma 19 states the bound for the objective.
Corollary 20.
The expected -value of vertices in -classes without boundary vertices can be written as where and are integers and .
From Theorem 11, we get that the denominator of for each class in is at most , which is at most .
Lemma 21.
The expected -value of each vertex in can be written as , where are integers, and , and .
We now state the main result of this section for the expected -value problem.
Theorem 22.
The expected -value problem is in . Memoryless strategies suffice for . requires infinite memory in general.
Proof sketch.
This proof follows a similar structure to the proof of Theorem 18. As before, the Bellman condition can be checked in polynomial time. Checking the lower-bound and upper-bound conditions involves checking almost-sure satisfaction of the Boolean objective BWMP for appropriate thresholds, which reduces to checking the satisfaction of BWMP in non-stochastic games [22], which in turn reduces to total supremum payoff [12], which is in [27]. Both of these reductions are polynomial-time reductions, and hence, the expected -value problem is in .
Memoryless strategies suffice for for almost-sure satisfaction of [22]. requires infinite memory in general for the objective even in non-stochastic games [12], which are a special case of stochastic games. Deterministic strategies suffice for both players. Hence, we get the memory requirements of an optimal strategy for the expected -value problem using Corollary 10.
5 Discussion
We discuss some concluding remarks about the relation of our work to previous work [16], which deals with the satisfaction of Boolean prefix-independent objectives. We also discuss practical implementations for window mean-payoff objectives and applicability of our technique to other prefix-independent objectives.
Comparison with [16].
In [16], it suffices to check the almost-sure satisfaction of the same Boolean objective in all value classes. In contrast, for quantitative objectives, the threshold Boolean objective for which we check the almost-sure satisfaction depends on the guessed value of the value class (“Can satisfy with probability ?”). Another key difference is that for Boolean objectives, the value classes without boundary vertices are precisely the extremal value classes, that is classes with values and . In the case of quantitative objectives, there may be multiple intermediate value classes without boundary vertices, making reasoning about the correctness of the reduction more difficult.
We note that if we apply our approach to Boolean prefix-independent objectives (such as Büchi, coBüchi, parity) by viewing them as quantitative objectives mapping each play to or , then we recover the algorithm given in [16].
Applicability to other prefix-independent objectives.
Recall that in order to be able to apply our characterisation to a prefix-independent objective , we require bounds on the image of and on the denominators of the optimal expected -values of vertices in the game. These bounds can easily be derived individually for the following quantitative prefix-independent objectives, and thus, our technique applies to these objectives.
- Mean-payoff objective.
-
The mean-payoff value of any play is bounded between the minimum and maximum edge payoffs in the game, directly giving bounds on the image of the mean-payoff objective. Since deterministic memoryless optimal strategies exist for both players [24], the expected mean-payoff value can be computed from the stationary distribution of the Markov chain obtained by fixing memoryless strategies of both players. This gives denominator bounds for the expected mean-payoff values of vertices in the game.
- Limsup and liminf objectives.
-
Since the liminf objective is equivalent to objective with window length [21], and the limsup objective is the dual of the liminf objective, our analysis with window mean-payoff objectives generalises limsup and liminf objectives.
- Positive frequency payoff objective [29].
-
Here, each vertex has a payoff, and this objective returns the maximum payoff among all vertices that are visited with positive frequency in an infinite play. This objective is prefix-independent, as the frequency of a vertex is independent of finite prefixes. The image of the objective is bounded between the minimum and maximum vertex payoffs. We observe that the value of vertices in a value class without boundary vertices is equal to the payoff of a vertex in the class (giving a denominator bound for these vertices), and Theorem 11 uses this to give a denominator bound for vertices in value classes with boundary vertices.
Practical implementation.
We discuss approaches to solve the expected -value problem for the window mean-payoff objectives in practice.
A trivial algorithm that works for both and objectives is to iterate over all possible value vectors, as sketched below. For each value vector, we check if the conditions in Theorem 7 are satisfied, which can be done in polynomial time. Since there are exponentially many possible value vectors, this algorithm has an exponential running time in the worst case.
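A direct rendering of this brute-force procedure, under the assumption that the expected values lie in $[-B, B]$ and have denominators at most $Q$ (symbol names as used in Section 3), and reusing is_expected_value_vector as sketched in Section 3.2; the function names are ours.

```python
from fractions import Fraction
from itertools import product

def candidate_values(B, Q):
    """All rationals in [-B, B] with denominator at most Q."""
    return sorted({Fraction(p, q)
                   for q in range(1, Q + 1)
                   for p in range(-B * q, B * q + 1)})

def solve_by_enumeration(game, B, Q, check_vector):
    """Iterate over all candidate value vectors (exponentially many in the
    number of vertices) and return the unique one passing the check of
    Theorem 7, here supplied as the callable `check_vector`."""
    vertices = list(game.edges)
    for combo in product(candidate_values(B, Q), repeat=len(vertices)):
        x = dict(zip(vertices, combo))
        if check_vector(game, x):
            return x
    return None
```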
Another technique is value iteration [15], which has been shown to be an anytime algorithm for the standard mean-payoff objective [34]. An anytime algorithm gives better precision the longer it is run, and can be interrupted any time. Given a game with vertices, the expected -value problem on reduces to the expected liminf-value problem on a game with vertices (that is, on an exponentially larger game graph). The liminf objective is a well-studied objective in the context of value iteration [15, 11]. We describe the reduction in [21], which also gives the expected -values of vertices in .
Since the size of the graph is much bigger than that of , we would like to work with on-the-fly rather than explicitly constructing the entire graph. In [34], the authors show bounded value iteration for objectives such as reachability and mean-payoff. They also discuss that the algorithm can be extended to be asynchronous and use partial exploration. As future work, we would like to look at the practicality of on-demand asynchronous value iteration for the liminf objective, or even the window mean-payoff objectives and directly. An interesting aspect of it would be to investigate heuristics and optimisations such as sound value iteration [37], optimistic value iteration [31], and topological value iteration [3] to speed up the practical running time.
References
- [1] D. Andersson and P. B. Miltersen. The Complexity of Solving Stochastic Games on Graphs. In ISAAC, volume 5878 of Lecture Notes in Computer Science, pages 112–121. Springer, 2009. doi:10.1007/978-3-642-10631-6_13.
- [2] K. R. Apt and E. Grädel. Lectures in Game Theory for Computer Scientists. Cambridge University Press, 2011.
- [3] M. Azeem, A. Evangelidis, J. Křetínský, A. Slivinskiy, and M. Weininger. Optimistic and Topological Value Iteration for Simple Stochastic Games. In ATVA, pages 285–302. Springer, 2022. doi:10.1007/978-3-031-19992-9_18.
- [4] C. Baier and J-P. Katoen. Principles of model checking. MIT Press, 2008.
- [5] B. Bordais, S. Guha, and J-F. Raskin. Expected Window Mean-Payoff. In FSTTCS, volume 150 of LIPIcs, pages 32:1–32:15, 2019. doi:10.4230/LIPIcs.FSTTCS.2019.32.
- [6] T. Brihaye, F. Delgrange, Y. Oualhadj, and M. Randour. Life is Random, Time is Not: Markov Decision Processes with Window Objectives. Logical Methods in Computer Science, Volume 16, Issue 4, December 2020. doi:10.23638/LMCS-16(4:13)2020.
- [7] V. Bruyère, E. Filiot, M. Randour, and J-F. Raskin. Meet your expectations with guarantees: Beyond worst-case synthesis in quantitative games. Information and Computation, 254:259–295, 2017. doi:10.1016/j.ic.2016.10.011.
- [8] T. Brázdil, V. Brožek, K. Chatterjee, V. Forejt, and A. Kučera. Markov Decision Processes with Multiple Long-run Average Objectives. Logical Methods in Computer Science, Volume 10, Issue 1, Feb 2014. doi:10.2168/LMCS-10(1:13)2014.
- [9] K. Chatterjee. Concurrent games with tail objectives. Theoretical Computer Science, 388(1):181–198, 2007. doi:10.1016/j.tcs.2007.07.047.
- [10] K. Chatterjee, L. Doyen, H. Gimbert, and T. A. Henzinger. Randomness for free. Information and Computation, 245:3–16, 2015. doi:10.1016/j.ic.2015.06.003.
- [11] K. Chatterjee, L. Doyen, and T. A. Henzinger. A Survey of Stochastic Games with Limsup and Liminf Objectives. In ICALP, pages 1–15. Springer, 2009. doi:10.1007/978-3-642-02930-1_1.
- [12] K. Chatterjee, L. Doyen, M. Randour, and J-F. Raskin. Looking at mean-payoff and total-payoff through windows. Information and Computation, 242:25–52, 2015. doi:10.1016/j.ic.2015.03.010.
- [13] K. Chatterjee and N. Fijalkow. A reduction from parity games to simple stochastic games. EPTCS, 54:74–86, 2011. doi:10.4204/eptcs.54.6.
- [14] K. Chatterjee and T. A. Henzinger. Reduction of stochastic parity to stochastic mean-payoff games. Information Processing Letters, 106(1):1–7, 2008. doi:10.1016/j.ipl.2007.08.035.
- [15] K. Chatterjee and T. A. Henzinger. Value Iteration. In 25 Years of Model Checking - History, Achievements, Perspectives, LNCS 5000, pages 107–138. Springer, 2008. doi:10.1007/978-3-540-69850-0_7.
- [16] K. Chatterjee, T. A. Henzinger, and F. Horn. Stochastic Games with Finitary Objectives. In MFCS, pages 34–54. Springer, 2009. doi:10.1007/978-3-642-03816-7_4.
- [17] K. Chatterjee, Z. Křetínská, and J. Křetínský. Unifying Two Views on Multiple Mean-Payoff Objectives in Markov Decision Processes. Logical Methods in Computer Science, Volume 13, Issue 2, July 2017. doi:10.23638/LMCS-13(2:15)2017.
- [18] K. Chatterjee, T. Meggendorfer, R. Saona, and J. Svoboda. Faster Algorithm for Turn-based Stochastic Games with Bounded Treewidth. In SODA, pages 4590–4605, 2023. doi:10.1137/1.9781611977554.ch173.
- [19] A. Church. Application of Recursive Arithmetic to the Problem of Circuit Synthesis. Journal of Symbolic Logic, 28(4):289–290, 1963. doi:10.2307/2271310.
- [20] A. Condon. The complexity of stochastic games. Information and Computation, 96(2):203–224, 1992. doi:10.1016/0890-5401(92)90048-K.
- [21] L. Doyen, P. Gaba, and S. Guha. Expectation in Stochastic Games with Prefix-independent Objectives, 2024. doi:10.48550/arXiv.2405.18048.
- [22] L. Doyen, P. Gaba, and S. Guha. Stochastic Window Mean-payoff Games. In FoSSaCS Part I, volume 14574 of LNCS, pages 34–54. Springer, 2024. doi:10.1007/978-3-031-57228-9_3.
- [23] R. Durrett. Probability: Theory and Examples. Cambridge University Press, 2010.
- [24] A. Ehrenfeucht and J. Mycielski. Positional Strategies for Mean Payoff Games. Int. Journal of Game Theory, 8(2):109–113, 1979. doi:10.1007/BF01768705.
- [25] N. Fijalkow, N. Bertrand, P. Bouyer, R. Brenguier, A. Carayol, J. Fearnley, H. Gimbert, F. Horn, R. Ibsen-Jensen, N. Markey, B. Monmege, P. Novotny, M. Randour, O. Sankur, S. Schmitz, O. Serre, and M. Skomra. Games on Graphs. Online, 2024. doi:10.48550/arXiv.2305.10546.
- [26] P. Gaba and S. Guha. Optimising expectation with guarantees for window mean payoff in Markov decision processes. In AAMAS, pages 820–828. International Foundation for Autonomous Agents and Multiagent Systems / ACM, 2025. doi:10.5555/3709347.3743600.
- [27] T. M. Gawlitza and H. Seidl. Games through Nested Fixpoints. In CAV, pages 291–305. Springer, 2009. doi:10.1007/978-3-642-02658-4_24.
- [28] D. Gillette. Stochastic games with zero stop probabilities, pages 179–188. Princeton University Press, December 1958.
- [29] H. Gimbert and E. Kelmendi. Submixing and Shift-Invariant Stochastic Games. International Journal of Game Theory, 52(4):1179–1214, 2023. doi:10.1007/s00182-023-00860-5.
- [30] V. Gurvich and P. B. Miltersen. On the Computational Complexity of Solving Stochastic Mean-Payoff Games. CoRR, abs/0812.0486, 2008. arXiv:0812.0486.
- [31] A. Hartmanns and B. L. Kaminski. Optimistic Value Iteration. In CAV, pages 488–511. Springer, 2020. doi:10.1007/978-3-030-53291-8_26.
- [32] K. Hoffman and R. Kunze. Linear Algebra. Prentice-Hall, Inc., 1971.
- [33] J. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Co., Inc., 1979.
- [34] J. Křetínský, T. Meggendorfer, and M. Weininger. Stopping Criteria for Value Iteration on Stochastic Games with Quantitative Objectives. In LICS, pages 1–14, 2023. doi:10.1109/LICS56636.2023.10175771.
- [35] D. A. Martin. The Determinacy of Blackwell Games. The Journal of Symbolic Logic, 63(4):1565–1581, 1998. doi:10.2307/2586667.
- [36] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 1994.
- [37] T. Quatmann and J-P. Katoen. Sound Value Iteration. In CAV, pages 643–661. Springer, 2018. doi:10.1007/978-3-319-96145-3_37.
- [38] L. S. Shapley. Stochastic Games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, October 1953. doi:10.1073/pnas.39.10.1095.
- [39] U. Zwick and M. Paterson. The Complexity of Mean Payoff Games on Graphs. Theoretical Computer Science, 158(1&2):343–359, 1996. doi:10.1016/0304-3975(95)00188-3.
