Finding Equilibria: Simpler for Pessimists, Simplest for Optimists
Abstract
We consider equilibria in multiplayer stochastic graph games with terminal-node rewards. In such games, Nash equilibria are defined assuming that each player seeks to maximise their expected payoff, ignoring their aversion or tolerance to risk. We therefore study risk-sensitive equilibria (RSEs), where the expected payoff is replaced by a risk measure. A classical risk measure in the literature is the entropic risk measure, where each player has a real-valued parameter capturing their risk-averseness. We introduce the extreme risk measure, which corresponds to the extreme cases of the entropic risk measure, where players are either extreme optimists or extreme pessimists. Under the extreme risk measure, every player is an extremist: an extreme optimist perceives their reward as the maximum payoff that can be achieved with positive probability, while an extreme pessimist expects the minimum payoff achievable with positive probability. We argue that the extreme risk measure, especially in multiplayer graph-based settings, is particularly relevant, as it can model several real-life situations such as interactions between secure systems and potential security threats, or distributed controllers for safety-critical systems. We prove that RSEs defined with the extreme risk measure are guaranteed to exist when all rewards are non-negative. Furthermore, we prove that the problem of deciding whether a given game contains an RSE that generates risk measures within specified intervals is decidable and -complete for our extreme risk measure, and even -complete when all players are extreme optimists, while that same problem is undecidable under the entropic risk measure or even the classical expected payoff. This establishes, to our knowledge, the first decidable fragment for equilibria in simple stochastic games without restrictions on strategy types or number of players.
Keywords and phrases:
Nash equilibria, stochastic games, graph games, risk-sensitive equilibria
Copyright and License:
2012 ACM Subject Classification:
Theory of computation → Models of computation
Acknowledgements:
We thank anonymous reviewers for pointing us to the Hurwicz criterion and to the work of Gallego-Hernández and Mansutti [13]. We thank Marie van den Bogaard for her valuable feedback on the first author’s PhD dissertation, which helped improve the quality of this work.
Funding:
This work is a part of project VAMOS that has received funding from the European Research Council (ERC), grant agreement No 101020093.
Editors:
Paweł Gawrychowski, Filip Mazowiecki, and Michał Skrzypczak
Series and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Stochastic systems have been used extensively in several areas, including verification [11], learning theory [1], and epidemic processes [20], to name a few. However, many real-world systems do not operate under centralised control, and modelling them as stochastic systems with multiple agents yields more faithful abstractions. Fields in which multi-agent stochastic modelling is used include cyber-physical systems [29], distributed and probabilistic computer programs [8], and probabilistic planning [30]. In such cases, reasoning about multiple agents with several, often orthogonal, objectives becomes important. For multi-agent systems modelled with stochasticity on the underlying arena, a fundamental question is the existence or computation of an equilibrium – typically, an important decision problem in such games is the constrained existence problem, which asks whether a given game contains an equilibrium where each player’s (perceived) payoff lies within a given interval. The most popular equilibria in the literature are Nash equilibria (NEs) [23], where each player plays optimally to maximise their expected payoff, with regard to the other players’ strategies. However, Nash equilibria come with their own downsides. In terms of computational complexity, the constrained existence problem for NEs in stochastic games is prohibitively expensive, or even undecidable in the general case [31]. More importantly, NEs do not faithfully model agents in real-world settings, since they do not consider their tolerance or averseness to risk.
An example.
We illustrate this claim with the help of a simple game, depicted in Figure 1(a). Each vertex other than the start vertex is owned by one player among the players and a play in this game refers to a sequence of moves of a token along the edges of the graph. The initial vertex denotes a stochastic vertex, which moves the token to the next vertex with probability at each step. When the token is on a vertex owned by a player , that player decides where the token goes next, and when a terminal vertex (vertex , or ) is reached, the numbers at the terminal vertex indicate the payoff obtained by each player.
If the token moves out of vertex , player then chooses between a green or an orange action . If the action was selected, the player must make a similar choice (between actions and ). If both players selected the action , then the player is given the opportunity to reach a terminal that offers negative reward to either player or , while the other player gets an additional profit (payoff instead of ).
A strategy profile is a Nash equilibrium (NE) if no player can increase their expected payoff by changing their strategy as long as the other players stick to theirs. If player does not use randomisation, there is no NE in this game where she gets an expected payoff , since along any play that reaches deterministically either the vertex or , one of the two players or gets the expected payoff and has an incentive to deviate to the action at their vertex. However, using randomisation, there is such an NE if both player and choose action , and player chooses at random to go either to vertex or , each with probability . Then, both player and player have expected payoff , the same as if they reach the terminal vertices or .
However, if both players and were modelling, say, safety-critical systems, such a Nash equilibrium is dubious because neither player can afford even a low probability of the reward , as they do not have this tolerance to risk. Such a strategy profile would not be stable even if only one of the two players or , let alone both, were risk-averse. However, the strategy where player chooses the action , player the action , and player randomises between her choices with equal probability, is a potential equilibrium in the case where player is risk-averse. On the other hand, if all players , , and were risk-loving, say they model entities like hackers of a system that are happy with some chance at a small positive reward, the same strategy would not be an equilibrium, as player would rather gamble for a chance at the reward , despite the risk of a negative reward, thus deviating from the action to the “risky” alternative .
This example thus generalises a line of reasoning that is often made in finance or decision theory: consider a participant in a certain lottery where they lose $100 with probability , and win $10000 with probability . While the expected payoff is positive ($1), expectation alone is insufficient as a decision criterion for participating in the lottery.
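A quick numeric sketch of this point. The lottery's probabilities are elided above, so the values below are our own assumption, chosen only to be consistent with the stated $1 expectation:

```python
# Hypothetical lottery: the probabilities are our assumption, chosen to
# match the stated expected payoff of $1.
lottery = [(-100.0, 0.99), (10000.0, 0.01)]  # (payoff, probability)

expected = sum(x * p for x, p in lottery)
worst = min(x for x, p in lottery if p > 0)  # what a pessimist perceives

print(f"expected payoff: {expected:.2f}")  # approximately 1.00
print(f"worst outcome:   {worst:.2f}")
```

Despite the positive expectation, a risk-averse participant who focuses on the worst positive-probability outcome sees a loss of $100.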
Risk measures.
A risk measure captures the perception that a player has of what their payoff will be. Thus, it generalises the notion of expected payoff. Various risk measures exist in the literature, and have been used extensively in the fields of economics and finance. They include expected shortfall (ES), value at risk (VaR) [2], variance [4], and the entropic risk measure (ER) [10]. In terms of graph modelling, these risk measures have been studied mainly over Markov decision processes (MDPs), using variance (along with mean) as a risk measure [9, 27, 21], ES [28, 19, 22] (also referred to as conditional value at risk (CVaR), average value at risk (AVaR), expected tail loss (ETL), and superquantile in the literature), and ER [15, 7, 3]. Studying the entropic risk measure in MDPs appears more practical than ES or variance-penalised risk measures, since the latter require exponential memory [14] and intractable time to compute optimal strategies [27], even in MDPs. On the other hand, when the risk measure used is ER, players have optimal positional strategies in MDPs [16].
Entropic risk measure.
The entropic risk measure is computed by assigning a risk parameter to an agent, i.e., a value . The entropic risk measure of a random variable is then defined as [12] for . Assume is a player’s payoff. If the risk parameter is positive, the corresponding players are risk-averse and therefore more weight is given to worse payoffs. Conversely, players with a negative are risk-loving. When tends to , the entropic risk measure converges to the classical expectation .
Consider again the game in Figure 1(a), and the strategy profile in which player and player choose action , and player goes to the vertex with probability , and to vertex with probability . Then, player ’s perception of such a strategy profile, depending on her risk parameter , is defined by the entropic risk measure . Figure 1(b) illustrates this formula, where each curve captures the perceived payoff for a fixed probability , the thick blue line corresponds to the case , the thick red line to , and mixtures of blue and red curves to intermediary cases – the thick purple curve corresponds to . Note that this purple curve reaches exactly the value when . This is because is the expected payoff of player when , and the entropic risk measure when – defined as the limit when approaches zero – corresponds exactly to the expected payoff. For higher values of , the entropic risk measure is lower, since it corresponds to cases where player is more risk-averse and focuses on the event of reaching the vertex ; for symmetric reasons, for lower values of , the values of entropic risk measure are higher.
Extreme risk measure.
Consider the perception of risk our agents may have in the previous example. If such entities were indeed modelling safety-critical agents, where rewards below a certain threshold are unacceptable, such agents usually do not accept any strategy profile in which they have any positive probability of such rewards, disregarding what the probabilities actually are, behaving as extremely pessimistic players – they assume the worst of the cases that may happen. On the other hand, we may model certain external environment factors, or hackers of a system, as an extremely optimistic player: they are happy with a small probability of success – they can repeat such attempts with a large number of trials – and seek to maximise the best payoff they may receive with positive probability.
We define the pessimistic risk measure of a random variable as the supremum of values such that almost surely takes values above , or its essential infimum (); the optimistic risk measure is defined symmetrically (using ). In Figure 1(b), the reader may have observed that all curves (except the thick blue and the thick red ones) converge to the value when tends to , and to the value when tends to . These extreme cases correspond to the pessimistic and the optimistic risk measure, respectively, and we group them under the umbrella term extreme risk measure (XR).
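As a minimal sketch (the helper names are our own), the extreme risk measure of a finitely-supported random variable inspects only its support, never the probability values themselves:

```python
def pessimistic(outcomes):
    """Essential infimum: least value taken with positive probability."""
    return min(x for x, p in outcomes if p > 0)

def optimistic(outcomes):
    """Essential supremum: greatest value taken with positive probability."""
    return max(x for x, p in outcomes if p > 0)

# The actual probability values are irrelevant: only the support matters.
a = [(-1.0, 0.001), (5.0, 0.999)]
b = [(-1.0, 0.999), (5.0, 0.001)]
assert pessimistic(a) == pessimistic(b) == -1.0
assert optimistic(a) == optimistic(b) == 5.0
```

The two distributions above differ wildly in their probabilities yet share the same support, so an extremist perceives them identically.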
Risk-sensitive equilibria.
Risk-sensitive equilibria are equilibria in multi-player games defined using a specified risk measure for each player, where no player can improve the risk measure of their payoff by deviating unilaterally from their strategy. Risk-sensitive equilibria take into account the risk-sensitivity of each player to ensure that no player has an incentive to deviate. When the risk measure used is the entropic risk measure, we call such equilibria entropic risk-sensitive equilibria (ERSEs). When using our novel risk measure – the extreme risk measure – we refer to them as extreme risk-sensitive equilibria (XRSEs).
Example 1.
In the game depicted by Figure 1(a), if players and are extreme pessimists, and player is an extreme optimist, then in order for player to get the risk measure , it suffices that either the terminal vertex or is reached with positive probability. But such a strategy profile cannot be an XRSE since there is a play such that either player or gets the payoff with positive probability and, as a pessimist, they then get the risk measure if they deviate to using the . They both have a profitable deviation by avoiding player ’s vertex. The only XRSEs in this game are therefore the strategy profiles that almost-surely reach terminals or , then leaving player with the risk measure .
Risk-sensitive equilibria are particularly relevant in security contexts, where, for example, defenders may act conservatively while attackers behave opportunistically. Similarly, in autonomous systems, different components – such as safety controllers, efficiency optimizers, or compliance monitors – may operate under independent objectives and distinct risk profiles. We study ERSEs and XRSEs in a framework general enough to also model a variety of situations, including epidemic processes and probabilistic planning, where equilibria on graph games are considered. Incorporating risk measures while considering equilibria enables more realistic and nuanced models of such multi-agent interactions.
Our results.
We consider simple quantitative multiplayer stochastic games played on graphs, that is, games in which the payoffs that the players receive depend on the terminal vertex that is reached – and in which an infinite play is associated with the payoff zero for all players. In such games, we consider two questions in particular: the existence and the complexity of the constrained existence problem of both ERSEs and XRSEs.
In Section 3, we show that the constrained existence problem of ERSEs is undecidable, extending the corresponding result for Nash equilibria by Ummels and Wojtczak [31]. However, building on results for the two-player zero-sum case by Baier et al. [3], we identify restrictions on strategies that recover decidability. Using known results about NEs, we also show that, when the rewards are all nonnegative, such an equilibrium always exists.
We define the extreme risk measure (XR) as a novel risk measure to consider in multi-agent stochastic systems. In Section 4, we show that our new definition is robust, since it exactly captures the well-studied entropic risk measure when the risk parameters tend to . Our main technical contributions concern the extreme risk measure and extreme risk-sensitive equilibria (XRSEs). We show that the constrained existence problem of XRSEs is decidable and -complete. The technical crux of the membership proof lies in showing that if an XRSE satisfying the constraints exists, then there exists one that has finite memory and a polynomial representation. Such a succinct representation can then be guessed and verified in polynomial time. We show that the problem remains -complete when strategies are required to be stationary, pure, or positional. Finally, we show that if all players are extreme optimists, the problem is -complete.
As we do for ERSEs, we also prove the existence of XRSEs in games with nonnegative rewards – and we show that there exists such a stationary equilibrium (where players use no memory and only randomness) that can be algorithmically constructed in polynomial time.
Related work.
The Hurwicz criterion is used in decision theory in situations where probabilities are unknown: for a given random variable and a parameter , it assigns the objective of maximising the quantity [17, 5], which generalises Wald’s maximin criterion [32], itself constituting one extreme () of the Hurwicz criterion. Our definition of the pessimistic risk measure corresponds to Wald’s maximin criterion when only outcomes of positive probability are considered, while our optimistic risk measure corresponds to the other extreme of the Hurwicz criterion (). Even classical notions such as considering the worst-case scenario, obtained by abstracting any stochastic environment and treating it instead as an adversary, can be seen as Wald’s maximin criterion applied to all outcomes, including zero-probability events.
To the best of our knowledge, equilibria defined using any risk measure, and in particular the entropic risk measure, have not been studied in multiplayer graph games. Two works have considered risk in the specific case of two-player zero-sum games on stochastic arenas. The first of these works is by Bruyère, Filiot, Randour, and Raskin [6], who studied the “beyond worst-case synthesis” problem, which considers some measures that address a player’s risk-averseness in the context of synthesis. They consider “strongly risk-averse” strategies, those which avoid outcomes below a certain threshold while maximising the expected payoff, resulting in an entirely different risk measure. Baier, Chatterjee, Meggendorfer, and Piribauer considered two-player zero-sum stochastic games with total-reward objectives (the payoff is the sum of the rewards seen along the way), where one player is risk-sensitive and wants to optimise their entropic risk measure [3]. They show that computing optimal strategies can be done in (when the base is replaced by an algebraic number). If the base of the exponent is , computing optimal strategies is in due to a recent result of Gallego-Hernández and Mansutti [13].
The two-player zero-sum case is a specific case of the constrained existence problem in two-agent systems, where the payoff functions of the two agents as well as their risk parameters are exactly the negation of each other. On the other hand, another subcase of the constrained existence problem of equilibria defined using the entropic risk measure is of course the constrained existence problem of NEs, when the risk parameters of all agents are set to , which is known to be undecidable from the work of Ummels and Wojtczak [31].
2 Preliminaries
We assume that the reader is familiar with the basics of probability and graph theory. However, we define some concepts for establishing notation.
Probabilities.
Given a (finite or infinite) set of outcomes and a probability measure over , let be a random variable over , that is, a mapping . We then write , or simply , for the expectation of , when it is defined. Given a finite set , a probability distribution over is a mapping that satisfies the equality . We write for the support of the distribution , that is, the set of elements such that .
Risk measures.
Given a set of outcomes, a risk measure over is a mapping which maps a probability measure over and a random variable to a real value . Sometimes, in the literature, risk measures are expected to have the following three properties: (1) they are normalised, i.e., the risk measure of the constant random variable 0 is 0; (2) they are monotone, i.e., a pointwise inequality between two random variables implies the same inequality between their risk measures; and (3) they are translative, i.e., adding a constant c to the random variable adds c to the risk measure. In particular, the expectation of a random variable satisfies those properties. We will not need those properties, and only state them here to remark that the risk measures we consider satisfy them. Note that translativity sometimes refers to the opposite of the definition given here, i.e., the property that adding a constant c to the random variable subtracts c from the risk measure. This is a matter of whether we use a risk measure or its negation.
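As a quick self-check (our own encoding: random variables as value lists over a fixed three-outcome probability space), the expectation indeed satisfies all three properties:

```python
def expect(values, probs):
    """Expectation of a random variable given by its value on each outcome."""
    return sum(x * p for x, p in zip(values, probs))

probs = [0.2, 0.5, 0.3]   # a fixed probability measure on three outcomes
X = [0.0, 1.0, 4.0]       # random variable X, pointwise below Y
Y = [1.0, 2.0, 4.0]

assert expect([0.0, 0.0, 0.0], probs) == 0.0             # (1) normalised
assert expect(X, probs) <= expect(Y, probs)              # (2) monotone
c = 3.0
shifted = [x + c for x in X]
assert abs(expect(shifted, probs) - (expect(X, probs) + c)) < 1e-9  # (3) translative
```

Encoding both random variables over the same outcome space is what makes the pointwise comparison in the monotonicity check meaningful.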
Graph, paths, games.
A directed graph consists of a set of vertices and a set of ordered pairs of vertices, called edges, . In a directed graph , for each vertex , we write to denote the set . For simplicity, we often write for an edge . A path in the directed graph is a (finite or infinite) word over the alphabet such that for every such that and exist. We write for the set of vertices occurring along , and for those that occur infinitely often, if there are any. The prefix is written as or , and the suffix is written as or . A finite path is simple if every vertex occurs at most once along . It is a cycle if its last vertex is such that .
Definition 2 (Game).
A game is a tuple , where we have:
-
a directed graph , called the underlying graph of ;
-
a finite set of players;
-
a partition of the set , where denotes the set of vertices controlled by player , and the vertices in are called stochastic vertices;
-
a probability function , such that for each stochastic vertex , the restriction of to is a probability distribution;
-
a mapping called payoff function, where is the set of terminal vertices, i.e. vertices of the graph that have no outgoing edges. We also write , for each player , for the function that maps a terminal vertex to the coordinate of the tuple .
In a more general framework, payoffs can be assigned to all infinite paths. Here, we only focus on what is usually called simple quantitative games, i.e. games in which the underlying graph contains terminal vertices and the payoffs depend only on which terminal vertex is eventually reached. We thus extend the mapping to the set by defining , and (if no terminal vertex is reached, everyone gets the payoff ). A game is Boolean if all payoffs belong to the set .
An initialised game is a tuple , usually written , where is an initial vertex. In what follows, when the context is clear, we use the word game also for an initialised game. We often assume that we are given a game and implicitly use the same notations as in the definition above.
Example 3.
The example of initialised game given in Figure 1(a) is redrawn in Figure 2(a). There, the vertex is controlled by player , the vertex by player , and the vertex by player . The vertex is stochastic, with probability of going to and probability back to the stochastic vertex .
Definition 4 (Markov decision process, Markov chain).
An (initialised or not) Markov decision process is a game with one player. A Markov chain is a game with zero players.
Histories and plays.
We call play a path in the underlying graph that is infinite, or whose last vertex is a terminal vertex. Other paths are called histories. We will then use the notations to denote finite paths in the graph of the game, and to denote both finite and infinite paths. For a history , we write . We will also write for the set of histories whose last vertex is controlled by player . A history or play in an initialised game is a history or play in whose first vertex is .
Strategies, and strategy profiles.
In a game , a strategy for player is a mapping that maps each history to a probability distribution over . The set of possible strategies for player in is written as . A path (be it a history or a play) is compatible with the strategy if for each such that , the probability that the strategy proposes after history is positive, that is, we have . A strategy profile for a subset is a tuple . A strategy profile for the set of players is written , or simply when . We also write for where . Similarly, we use to denote the strategy profile defined by and for . We sometimes write to mean where is the player controlling , or when .
For some history , and a strategy , we define the strategy truncated to a history , written , as the strategy in the game .
A strategy profile in the game defines an initialised Markov decision process , where the vertices of the (infinite) underlying graph are the histories of and the edges are added from to each history iff . Similarly, a strategy profile for defines an initialised Markov chain . Thus, it also defines a probability measure over plays – which turns the payoff functions into random variables.
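A minimal sketch of this construction, with our own vertex names rather than those of the paper's figures: a stationary profile fixes one distribution per controlled vertex, so the game collapses to a finite Markov chain over vertices, whose terminal-reachability distribution (and hence the payoff distribution) can be approximated by unfolding the chain:

```python
# Chain induced by a hypothetical stationary profile: "v0" is a stochastic
# vertex that loops or advances with probability 1/2 each; "v1" is a player
# vertex where the fixed strategy randomises uniformly over two terminals.
chain = {
    "v0": [("v0", 0.5), ("v1", 0.5)],
    "v1": [("t1", 0.5), ("t2", 0.5)],
}
terminals = {"t1", "t2"}

def reach_probability(start, target, steps=200):
    dist = {start: 1.0}
    for _ in range(steps):
        nxt = {}
        for v, p in dist.items():
            if v in terminals:              # terminal vertices absorb mass
                nxt[v] = nxt.get(v, 0.0) + p
            else:
                for w, q in chain[v]:
                    nxt[w] = nxt.get(w, 0.0) + p * q
        dist = nxt
    return dist.get(target, 0.0)

print(reach_probability("v0", "t1"))  # approximately 0.5
```

The mass left in the loop after 200 steps is 0.5^200, so each terminal is reached with probability essentially 1/2, and the event of looping forever has probability 0.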
Pure, stationary, and positional strategies.
We say that a strategy is pure when for each history , there is a vertex such that ; then we often just write . We say that is stationary when for every two histories , we have . In that case, we sometimes assume that strategy is defined in every game and simply write for . Finally, the strategy is positional when it is pure and stationary. Those concepts are naturally generalised to strategy profiles.
Memory structures.
A memory structure for player is a tuple , where is a finite set of memory states, where is an initial state, where is a memory-update mapping that maps each pair to a memory state , and where is an output mapping that maps each pair to a distribution over . The memory-update mapping can be extended to a mapping with and for each history . The memory structure then induces a strategy defined by for each history . A strategy induced by a memory structure is called a finite-memory strategy. Note that stationary strategies are exactly the strategies that are induced by a memory structure with .
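The definition above can be sketched in code. The two-state structure below, which tracks the parity of visits to a vertex we call "a", is our own toy example and not the one from the paper's figures:

```python
class MemoryStructure:
    """(states, init, update, output): memory states, an initial state, a
    memory-update mapping, and an output mapping to action distributions."""

    def __init__(self, states, init, update, output):
        self.states, self.init = states, init
        self.update, self.output = update, output

    def strategy(self, history):
        """Induced finite-memory strategy: run the update mapping over the
        history up to its last vertex, then apply the output mapping."""
        m = self.init
        for v in history[:-1]:
            m = self.update(m, v)
        return self.output(m, history[-1])

parity = MemoryStructure(
    states={0, 1},
    init=0,
    update=lambda m, v: (m + 1) % 2 if v == "a" else m,  # flip on visits to "a"
    output=lambda m, v: {"left": 1.0} if m == 0
                        else {"left": 0.5, "right": 0.5},
)

assert parity.strategy(["a", "b"]) == {"left": 0.5, "right": 0.5}
assert parity.strategy(["a", "a", "b"]) == {"left": 1.0}
```

With a singleton state set, `strategy` would depend only on the last vertex, recovering exactly the stationary case mentioned above.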
We analogously define memory structures for a subset of players , which also defines finite-memory strategy profiles. Note that if is a finite-memory strategy profile, then each with is finite-memory – the memory structure inducing that strategy is obtained by replacing by its restriction to . Conversely, if each with is finite-memory, then the strategy profile is finite-memory – a collective memory structure can be obtained by constructing the product of individual memory structures.
Example 5.
Figure 2(b) depicts an example of a memory structure for player on the game of Figure 2(a). The induced strategy can be presented as follows: when the vertex is reached, player goes deterministically to the vertex if has been visited an even number of times. Otherwise, he goes to with probability . Note that the output only depends on the history that was seen: player does not see whether player deterministically chose to go to the vertex , or did it as the outcome of a randomised choice.
Risk-sensitive equilibria, constrained existence problem.
In multiplayer stochastic games, we wish to study generalisations of the classical Nash equilibria where the expectation is replaced by other risk measures. Such generalisations are called risk-sensitive equilibria [24]. When is a risk measure and is a strategy profile, we write for .
Definition 6 (Risk-sensitive equilibrium).
Let be a game, and let be a tuple of risk measures. Let be a strategy profile in , let be a player, and let be a strategy for player , called deviation of player from . The deviation is profitable with regards to the risk measure if we have . The strategy profile is a -risk-sensitive equilibrium, or -RSE, if no player has a profitable deviation from with regards to .
Nash equilibria in a game can then be defined as -risk-sensitive equilibria, where for each player . The following problem is the main focus throughout our paper.
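A toy illustration of Definition 6 (the one-shot two-player game and its numbers are entirely our own): the same strategy profile can be an equilibrium under one risk measure and not under another, by brute-force checking all unilateral deviations:

```python
# payoffs[(a1, a2)] = (distribution for player 0, distribution for player 1),
# each distribution a list of (payoff, probability) pairs.
payoffs = {
    ("safe", "safe"):   ([(1, 1.0)], [(1, 1.0)]),
    ("safe", "risky"):  ([(1, 1.0)], [(10, 0.5), (-5, 0.5)]),
    ("risky", "safe"):  ([(10, 0.5), (-5, 0.5)], [(1, 1.0)]),
    ("risky", "risky"): ([(0, 1.0)], [(0, 1.0)]),
}
actions = ["safe", "risky"]

def expectation(dist):
    return sum(x * p for x, p in dist)

def pessimistic(dist):
    return min(x for x, p in dist if p > 0)

def is_rse(profile, measures):
    """Definition 6: a profile is an RSE iff no player improves their own
    risk measure by a unilateral deviation."""
    for i in range(2):
        current = measures[i](payoffs[profile][i])
        for a in actions:
            deviated = list(profile)
            deviated[i] = a
            if measures[i](payoffs[tuple(deviated)][i]) > current:
                return False
    return True

# ("safe", "safe") is stable for pessimists, but expectation-maximisers would
# gamble on the risky action's higher mean (0.5*10 + 0.5*(-5) = 2.5 > 1).
print(is_rse(("safe", "safe"), (pessimistic, pessimistic)))  # True
print(is_rse(("safe", "safe"), (expectation, expectation)))  # False
```

This mirrors the introduction's point: whether a profile is stable depends on the risk measure through which each player perceives their payoff.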
Question (Constrained existence of risk-sensitive equilibria).
Given a game , a tuple of risk measures , and two payoff vectors , does there exist a -RSE in such that for each , we have ?
We mainly consider the entropic risk measure and the extreme risk measure.
3 Entropic risk measure: intractable in multiplayer games
The entropic risk measure is defined using a risk parameter, i.e. a real value : large positive values indicate risk-averseness, large negative values risk-inclination.
Definition 7 (Entropic risk measure).
Given a risk parameter , the entropic risk measure is defined for every probability measure and random variable as:
This definition is generalised by replacing Euler’s constant with some base . The entropic risk measure with base is then defined by . The probability measure is omitted when clear from the context.
To see a visual representation of the entropic risk measure, see Figure 1 in the introduction.
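A numeric sketch of Definition 7, assuming the standard entropic-risk formula −(1/β)·ln E[exp(−β·X)] for base e (the function name and the example payoffs are our own; β = 0 falls back to the known limit, the plain expectation):

```python
import math

def entropic_risk(outcomes, beta):
    """Entropic risk -(1/beta) * ln E[exp(-beta * X)] of a discrete random
    variable given as (value, probability) pairs."""
    if beta == 0:
        return sum(x * p for x, p in outcomes)  # limit case: expectation
    return -(1.0 / beta) * math.log(
        sum(p * math.exp(-beta * x) for x, p in outcomes)
    )

# A 50/50 gamble between payoffs 1 and 2: risk-averse players (beta > 0)
# perceive less than the expectation 1.5, risk-loving players (beta < 0) more.
X = [(1.0, 0.5), (2.0, 0.5)]
for beta in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(beta, entropic_risk(X, beta))
```

The printed values decrease as β grows, matching the claim that positive risk parameters put more weight on worse payoffs.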
Remark 8.
-
The generalisation to any base follows only computational goals, since algebraic bases will be easier to handle. Barring such concerns, the definitions are equivalent, since for every we have , where .
-
Definition 7 implies that for , the function is not defined. However, it is known that for all , and , the quantity converges to when tends to (see e.g. [26]). Therefore, we henceforth assume that to make risk entropy defined for all finite risk parameters .
When we are given a profile of risk parameters, we write for the tuple . Risk entropy defines a family of RSEs, namely the -RSEs, that we also call -entropic risk-sensitive equilibria, or -ERSEs. We now study the constrained existence problem of -ERSEs. Unfortunately, it is undecidable in the general case.
Theorem 9.
The constrained existence problem of -ERSEs with is undecidable, even for any fixed value of , for , and with only nonnegative payoffs.
Proof.
The constrained existence problem of Nash equilibria is undecidable [31, Theorem 4.9]. Since Nash equilibria are ERSEs with for each player , our result follows.
We therefore briefly study cases where the class of strategies considered is restricted.
Theorem 10.
The constrained existence problem of -ERSEs, in quantitative simple stochastic games, with rational risk parameters:
1. remains undecidable when players are restricted to pure strategies;
2. is decidable when players are restricted to stationary strategies, or positional strategies:
   (a) in if ;
   (b) in if the base is algebraic: it is -hard when restricted to positional strategies, and -complete when restricted to stationary strategies.
We end this section with this result: ERSEs are guaranteed to exist, if all rewards are nonnegative.
Theorem 11.
Let be a simple stochastic game with only nonnegative payoffs. Then, there exists a (pure) -ERSE over .
We conjecture that this result remains true when negative rewards are involved.
4 Extreme risk measure: limit of entropic risk measure
This section introduces a new risk measure that provides a tractable alternative to existing risk measures available in the literature. Let be a random variable ranging over . The pessimistic risk measure of is the highest value such that almost-surely takes a value above . When takes finitely many values, that corresponds to the least value that it takes with positive probability. In probability theory, that measure is sometimes referred to as essential infimum, written . The definition of optimistic risk measure is symmetric.
Definition 12 (Optimistic, pessimistic risk measure).
The pessimistic risk measure of a random variable is defined by . Analogously, the optimistic risk measure of is .
Note that the values of probabilities do not influence the optimistic or the pessimistic risk measures, which depend only on which events have zero or nonzero probability. Thus, those risk measures are well-suited to model players that do not care about probabilities because they need certainties, or simply because they do not know them – which is often the case in real-world stochastic processes.
Given a game , we can assign a risk measure for each player by defining a partition of , where the set represents the set of players that are pessimists, whose perceived payoffs are defined by the pessimistic risk measure, while represents the optimists, who intend to maximise their optimistic risk measure. For convenience, we group both measures under the umbrella term extreme risk measure (XR), and often assume that is given; then, we write for when , and for when . Since each player is interested only in the risk measure of their own payoff, we also write for the quantity . We define extreme risk-sensitive equilibria, or XRSEs for short, as -RSEs.
Our definition, which considers only outcomes of positive probability, is slightly different from a situation where each player treats every other player or stochastic vertex as an adversary. Indeed, in Figure 1, such an adversarial treatment would mean that player assumes that the zero-probability event of the play staying in the vertex could also be realised, whereas under the extreme risk measure, such a zero-probability event is disregarded. The same would hold if the stochastic vertex was instead assigned to any player whose strategy randomises between staying and leaving that vertex.
Our definition of the extreme risk measure corresponds to the limit cases of the entropic risk measure. Observe that in Figure 1, when one player follows a randomised strategy, the other player's entropic risk measure converges to the pessimistic risk measure as the risk parameter tends to one extreme, and to the optimistic risk measure as it tends to the other. We now prove formally that the extreme risk measure coincides with these limit cases.
Theorem 13.
Let X be a real-valued random variable.
-
The limit of the entropic risk of X when the risk parameter tends to the pessimistic extreme exists and is equal to the pessimistic risk measure.
-
Similarly, the limit of the entropic risk of X when the risk parameter tends to the opposite extreme exists and is equal to the optimistic risk measure.
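To illustrate the convergence numerically, here is a small sketch using one common parameterisation of entropic risk, (1/γ)·ln E[exp(γX)]; the paper's sign convention may differ, but the limiting behaviour is the same:

```python
import math

def entropic_risk(dist, gamma):
    """(1/gamma) * ln E[exp(gamma * X)], for a finite-support distribution
    given as a map value -> probability. One common sign convention; the
    paper's parameterisation may differ."""
    return math.log(sum(p * math.exp(gamma * v) for v, p in dist.items())) / gamma

dist = {0: 0.5, 10: 0.5}
# Large negative gamma approaches the pessimistic measure (min value, 0);
# large positive gamma approaches the optimistic measure (max value, 10).
assert abs(entropic_risk(dist, -50) - 0) < 0.1
assert abs(entropic_risk(dist, 50) - 10) < 0.1
```

The probabilities only affect how fast the limit is approached, not the limit itself, matching the insensitivity of the extreme risk measure to exact probabilities.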
5 Constrained existence of extreme risk-sensitive equilibria
We now study the computational complexity of the constrained existence problem of XRSEs. The main result of this section shows that, contrary to the same problem for ERSEs, it is a decidable fragment of the constrained existence of RSEs, as it is NP-complete. We also study some subcases, showing that NP-completeness remains true when players are restricted to positional, stationary, or pure strategies. Finally, we show that when all players are optimists, the problem becomes P-complete.
5.1 Membership in NP
Membership in NP is a consequence of the following fact: when an XRSE exists, there also exists one with the same extreme risk measures that uses finite memory, polynomial in the size of the game. Let us first illustrate, with examples, how and why memory is required in such XRSEs.
Example 14.
We consider the following constrained existence question and analyse the same question on three example graphs.
Is there an XRSE in which every player's risk measure satisfies the stated threshold constraints?
Game in Figure 3(a).
The answer to the Question in this game is no. Indeed, if there were such an XRSE in this game, then necessarily, following it, both terminal vertices would be reached with positive probability, and the probability of following the cycle forever would be zero (see Appendix E of the full version). Intuitively, the problem here is that such a strategy profile would require randomisation: for both players to get the payoff with positive probability, there must be a random event that decides which of them actually gets that payoff. In the real world, such a situation would be solved by tossing a coin. In the game, it means that one of the players performs a randomised action. But that player can deviate from such a strategy and refuse to leave the cycle. Such refusals form undetectable deviations from the strategy, since the other player only sees a move that can be interpreted as a possible outcome of the expected randomisation. In other words, the player that tosses the coin is the only one that sees the result, and can therefore lie about it. This phenomenon can be avoided only if such randomised choices are made by other "impartial" players or by stochastic vertices.
Game in Figure 3(b).
Consider now a slight modification, as shown in Figure 3(b). There, the first player that plays is determined at random by the edge that is taken from an initial stochastic vertex. The answer to the Question for this game is yes. Indeed, the pure strategy profile in which each player's strategy maps the history where they move first to their designated terminal, and both strategies map any other history back to the cycle, is an XRSE: if any player deviates and refuses to leave the cycle, the other one will immediately refuse too, making it impossible to gain any benefit from such a deviation. Compared to the previous example, the existence of the stochastic vertex provides the players with a trustable common coin, which they can use to decide which of them will get which payoff.
Game in Figure 3(c).
Finally, consider the game depicted in Figure 3(c). Here, along every play, one player is given the opportunity of getting payoff , by going to the terminal vertex . But the answer to Question remains yes. Indeed, the pure strategy profile where from vertex , player goes from vertex to and then to , from which player leaves to , and symmetrically, from vertex , player goes to through from which player goes to is an XRSE. For instance, if player deviates and goes from to , then there is still the play that is generated with positive probability and from which player cannot deviate without triggering a punishing strategy that would make such a deviation non-profitable.
We see in this last example that to build an XRSE that generates a given tuple of risk measure values, we need, for each player, one play that anchors that player – i.e., a play in which they get their risk measure as a payoff (the lowest or highest value they obtain with positive probability), and from which they cannot deviate in a profitable way. Randomisation (or stochastic vertices) is then required to split plays into two or more potential future plays, each anchoring a different subset of players, and memory is required to remember the subset of players that are currently being anchored – or whether a player has deviated from the strategy and must be punished.
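The "remember and punish" behaviour sketched above is the kind of thing a finite memory structure captures. A minimal sketch (names and encoding are ours, not the paper's), with a two-state "punish forever once a deviation is seen" memory:

```python
# Sketch: a finite-memory strategy as a memory structure, standard in graph
# games -- a set of memory states, an update function reading the visited
# vertices, and a next-move function. Names are ours, for illustration.
from dataclasses import dataclass
from typing import Callable

@dataclass
class FiniteMemoryStrategy:
    initial: str                            # initial memory state
    update: Callable[[str, str], str]       # (memory, vertex) -> memory
    move: Callable[[str, str], str]         # (memory, vertex) -> successor

    def play(self, history):
        """Successor chosen after the given nonempty history of vertices."""
        m = self.initial
        for v in history[:-1]:
            m = self.update(m, v)
        return self.move(m, history[-1])

# Two memory states suffice to punish forever once a deviation is seen:
strat = FiniteMemoryStrategy(
    initial="normal",
    update=lambda m, v: "punish" if (m == "punish" or v == "bad") else "normal",
    move=lambda m, v: "retaliate" if m == "punish" else "cooperate",
)
assert strat.play(["s", "a"]) == "cooperate"
assert strat.play(["s", "bad", "a"]) == "retaliate"
```

The XRSEs constructed below additionally track which subset of players is currently anchored; the punishing states are of this simple absorbing kind.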
In line with this intuition, the following theorem bounds the amount of memory required by an XRSE. We further show that this bounded memory admits a succinct encoding whose size is polynomial in the number of players and vertices of the game.
Theorem 15.
Let be a game with vertices and players, let be a partition of the set , and let be an XRSE in . Then, there exists a finite-memory XRSE with at most many memory states, such that . Furthermore, if is pure, then there is such a strategy profile that is pure.
Proof sketch.
We prove this theorem by formalising the idea of anchoring plays. To do so, we define a labelling function that maps each history compatible with the strategy profile to the set of players that are, after that history, currently being anchored. In Lemma 16, we show the existence of such a labelling with the required properties. In the sequel, we write the risk measures of all players under a given strategy profile as a single tuple.
Lemma 16 (The labelling ).
There exists a labelling that maps each history compatible with to a subset of players, that is, , such that for each such , if we write , the labelling satisfies the following properties.
-
1.
If the vertex is stochastic, or belongs to some player , then the sets form a partition of .
-
2.
If the vertex belongs to some player , then the sets , …, and form a partition of , and belongs to all sets .
-
3.
For each optimistic player , we have .
-
4.
For each pessimistic , for all strategies of player , we have .
-
5.
If there is a successor such that , then all other successors are such that for each optimist , and there exists with for each pessimist .
Since the labelling maps histories to sets of players, there are potentially exponentially many subsets of players in its range. However, for a labelling that satisfies Items 1–5 of Lemma 16, we show by an inductive counting argument that only polynomially many subsets occur as labels of histories. We use those subsets to create memory states, which remember which players are being anchored and what was the last vertex visited (to detect deviations). We add one punishing state for each player, used when a deviation by that player is detected, adding up to the desired number of memory states. We then construct from the original profile an XRSE that uses only those memory states.
Using Theorem 15, we can show the following lemma.
Lemma 17.
The constrained existence problem of XRSEs is in NP. The same problem when players are restricted to pure strategies is also in NP.
Proof.
Let a game, a partition of the players, and two threshold vectors be given. By Theorem 15, if there exists a (pure) XRSE satisfying the constraints, then there exists one with polynomially many memory states in the number of players and vertices. Such a strategy profile can be guessed in polynomial time. We now show that, once such a finite-memory strategy profile is guessed, one can check in polynomial time whether it is an XRSE and satisfies the threshold constraints.
-
First, for each player, their risk measure under the guessed profile can be computed in polynomial time, since it reduces to computing that player's risk measure in the Markov chain induced by the profile (which has polynomial size).
-
Second, checking that the computed risk measures lie within the threshold vectors can be done in polynomial time.
-
Third, for each player, we check that they have no profitable deviation. This can also be done in polynomial time, by computing the best risk measure that player can get in the MDP induced by the other players' strategies (which has polynomial size).
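The first check can be sketched as a positive-probability reachability computation in the induced chain (our own simplified encoding; plays that never reach a terminal would need separate treatment and are omitted):

```python
# Sketch: once a finite-memory profile is fixed, the game induces a Markov
# chain, and a player's extreme risk measure depends only on which terminal
# vertices are reachable with positive probability. Encoding is ours.
def extreme_risk_in_chain(succ, rewards, init, optimist):
    """succ: vertex -> set of successors taken with positive probability;
    rewards: terminal vertex -> this player's reward; init: initial vertex."""
    seen, stack = {init}, [init]
    while stack:                       # plain DFS over positive-prob. edges
        v = stack.pop()
        for w in succ.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    reached = [rewards[v] for v in seen if v in rewards]
    # (Plays that reach no terminal are not handled in this sketch.)
    return max(reached) if optimist else min(reached)

succ = {"s": {"a", "b"}, "a": {"t1"}, "b": {"t2"}}
rewards = {"t1": 1, "t2": 5}
assert extreme_risk_in_chain(succ, rewards, "s", optimist=True) == 5
assert extreme_risk_in_chain(succ, rewards, "s", optimist=False) == 1
```

This runs in time linear in the chain, consistent with the polynomial bound claimed above.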
5.2 Membership in NP with restrictions on strategies
We have shown membership in NP of the constrained existence problem of XRSEs in the general case. We now consider variants where the space of strategies is restricted to stationary, positional, or pure strategies. We show that the problem is in NP for each of these cases.
Lemma 18.
The constrained existence problem, when all the players are restricted to positional, stationary, or pure strategies, is in NP.
Proof.
We show that we can still guess a strategy profile, and verify in polynomial time whether it is indeed an XRSE. For the positional and stationary cases, guessing a strategy profile is straightforward, since such a strategy profile can always be represented using polynomially many bits. We can then verify that a given strategy profile satisfies the constraints and also is an XRSE in polynomial time. For pure strategies, memory may be required. But we showed, alongside Theorem 15, that if there is a pure XRSE satisfying the constraints, there is one with polynomial memory: our result follows.
5.3 NP-hardness
We now prove hardness, for the general setting as well as when strategies are restricted.
Lemma 19.
The constrained existence problem of XRSEs is NP-hard, even when all players are pessimists and all rewards are nonnegative. It remains NP-hard when the strategies are restricted to stationary, pure, or positional ones.
Proof sketch.
We reduce the satisfiability problem to our problem. From a given formula, we construct the game depicted in Figure 4, where all players are pessimistic, and a dedicated symbol is used to mean "every other player". For each literal, we define two players. We also define one player per clause. Each of those players controls one vertex, labelled by their name. We show in the complete proof that this game contains a (positional) XRSE in which a witness player gets the target risk measure if and only if the formula is satisfiable.
Intuitively, the game consists of a sequence of gadgets where a value is given to each variable, depending on which of two edges the corresponding player takes (one setting the variable to true, the other to false). That player cannot randomise between those two edges, because she does not get the same risk measure on both sides, and would therefore have a profitable undetectable deviation. When it is decided that a literal is true, the corresponding player is given the possibility to deviate for a better payoff. In the final gadget, each clause player must choose a literal of her clause that she claims to be true under the valuation defined by the previous gadgets. The corresponding literal player will thus have a profitable deviation if and only if he was given the possibility to deviate, i.e., if the literal was actually set to false.
This lemma, along with Lemma 17 and Lemma 18, proves the following theorem. Observe that our construction did not use any negative rewards, and hence the hardness results hold already for nonnegative rewards, while our membership result works with any combination of positive and negative rewards.
Theorem 20.
The constrained existence problems for XRSEs, for pure XRSEs, for stationary XRSEs, and for positional XRSEs are all NP-complete. The lower bound holds even when all players are pessimistic and all rewards are non-negative.
We will now turn to a last subcase, where our problem turns out to be decidable in deterministic polynomial time.
5.4 Things get easier when everyone is optimistic
Our NP-hardness results required only pessimistic players. We now show that optimists are easier to deal with: the constrained existence problem of XRSEs becomes P-complete when the perceived reward of each player is defined by the optimistic risk measure.
We first show the upper bound by giving a polynomial-time algorithm. The intuition is that an optimistic player has a profitable deviation as soon as a vertex is reached, with positive probability, from which that player can get a payoff higher than their risk measure. This is in contrast to pessimists, who have a profitable deviation only when they can avoid getting their risk measure as a payoff in all plays compatible with the strategy profile. Our algorithm iteratively identifies vertices that must never be reached, and then removes from the game the edges that must consequently never be taken.
Lemma 21.
If all players are optimists, then the constrained existence problem of XRSEs is in P, and there is an algorithm that decides it in time polynomial in the number of edges of the game and the number of players. Moreover, the algorithm can be modified to output an XRSE that satisfies the constraints, if one exists, within a polynomial time bound; a sharper bound holds when all the upper thresholds are nonnegative.
Proof sketch.
We want to decide whether there exists an XRSE satisfying the given constraints. The algorithm deals separately with two cases, which we call the cycle-friendly and the cycle-averse case. In the cycle-friendly case, the lower-threshold condition is met for every player even by plays that reach no terminal vertex, so an XRSE may have positive probability of reaching no terminal vertex. That is impossible in the cycle-averse case, where some player's lower threshold rules it out. In this proof sketch, we describe only the algorithm for the cycle-friendly case.
The algorithm constructs a decreasing sequence of sets of edges until it reaches a fixed point. For each set, it considers the following strategy profile: from each non-stochastic vertex, when that vertex is seen for the first time, the profile randomises uniformly between all remaining outgoing edges; if the vertex is visited again, it repeats the same choice. If some player deviates and takes an edge that they are not supposed to take, then all the players switch to a positional strategy profile designed to minimise the deviating player's risk measure. Such a strategy profile is finite-memory, but requires many memory states to be represented as a memory structure; we therefore use the current set of edges as a succinct representation.
At each iteration, the algorithm identifies new sets of vertices that must be avoided: the terminals that give some player a payoff larger than their upper threshold, and the vertices from which some player can deviate and obtain a higher value than the one offered by the current strategy profile. If it is not possible to avoid reaching those vertices, the answer No is returned. Otherwise, the next set of edges is obtained from the current one by removing edges so that the avoided vertices are not reached with positive probability. The algorithm stops when there are no more edges to remove, and answers Yes if the resulting profile satisfies the constraints for each player, and No otherwise.
Each iteration requires polynomial time to identify and remove edges. Since each iteration removes at least one edge, the number of iterations is bounded by the number of edges, and the algorithm terminates in polynomial time.
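A heavily simplified sketch of the pruning loop (our own encoding; the deviation checks against the current profile are omitted, and only the avoidance of forbidden terminals is shown):

```python
# Sketch of the edge-pruning fixed point, simplified. Assumed encoding (ours):
# edges is a set of (u, v) pairs; owner maps each controlled vertex to its
# player; rewards maps each terminal to a per-player reward dict; upper maps
# each player to their upper threshold.
def prune(edges, owner, rewards, upper):
    edges = set(edges)
    # Terminals giving some player more than their upper threshold.
    forbidden = {t for t, r in rewards.items()
                 if any(r[i] > upper[i] for i in upper)}
    changed = True
    while changed:
        changed = False
        for (u, v) in list(edges):
            if v in forbidden:
                if u in owner:          # controlled vertex: drop the edge
                    edges.discard((u, v))
                    changed = True
                else:                   # stochastic vertex: v is unavoidable
                    return None         # answer "No"
        # A vertex left with no outgoing edge becomes forbidden itself.
        sources = {u for (u, _) in edges}
        for u in owner:
            if u not in sources and u not in forbidden:
                forbidden.add(u)
                changed = True
    return edges                        # fixed point reached

edges = {("s", "a"), ("a", "bad"), ("s", "good")}
owner = {"s": 1, "a": 1}
rewards = {"bad": {1: 5}, "good": {1: 1}}
assert prune(edges, owner, rewards, {1: 2}) == {("s", "good")}
assert prune({("c", "bad")}, {}, {"bad": {1: 5}}, {1: 2}) is None
```

Each pass either removes an edge or stops, matching the edge-count bound on the number of iterations argued above.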
Finally, we show that the problem is P-hard, even when there are only two players.
Lemma 22.
The constrained existence problem of XRSEs with optimistic players is P-hard, even with only two players.
Proof sketch.
We give a log-space reduction from the problem of deciding two-player zero-sum reachability games, which is known to be P-complete [18, Proposition 6].
We can now conclude this section with the following theorem.
Theorem 23.
The constrained existence problem of XRSEs is P-complete when all players are optimists, that is, when the set of pessimists is empty.
6 The existence of extreme risk-sensitive equilibria
This section finally answers the classical question about equilibria: are they guaranteed to exist? We show that (stationary) XRSEs exist whenever all rewards are nonnegative. Although this result is reminiscent of the one we proved for ERSEs (Theorem 11), it requires a different, constructive proof, which we discuss below.
We motivate the constructive algorithm with an example. Consider the game depicted in Figure 3(a), which involves two pessimists. Both players want to leave the cycle, but each of them would prefer that the other player leaves. Consider first the strategy profile that always randomises between all the available edges: both terminal vertices are then reached with positive probability, and it is almost sure that one of them is reached, so both players get the corresponding risk measure. Then, either player has a profitable deviation by refusing to leave the cycle and always going back to the cycle vertex. Note that the other player cannot detect such a deviation, since she does not have access to the internal coins tossed by the deviator. We then remove the leaving edge of one of the players. This results in a set of edges where one player is guaranteed their payoff and the other cannot get more than their own, ensuring that the new strategy profile is a (stationary) XRSE.
Theorem 24.
Let a game with only non-negative rewards and a partition of the players into optimists and pessimists be given. Then, there exists a stationary XRSE in that game. Moreover, there exists an algorithm that, given such a game, outputs a representation of such an XRSE in time polynomial in the number of edges and the number of pessimistic players.
Proof sketch.
Our algorithm constructs a decreasing sequence of sets of edges, and considers, for each set, the stationary strategy profile that randomises uniformly between all the outgoing edges of that set from every vertex. If this strategy profile is not an XRSE, some player has a profitable deviation. We carefully identify edges used by that player, and remove them. This process always terminates, and the set obtained at the end defines an XRSE.
As in the case of ERSEs, we conjecture that existence, and even existence of a stationary XRSE, remains true in the general case. We remark that even for Nash equilibria, existence of an NE in such simple stochastic games is known only when all the rewards are nonnegative.
7 Discussion
Our definition of extreme risk measure opens up several promising directions for future research. One immediate extension of our work would be to study games with more sophisticated objectives, such as mean payoff or discounted sum. Another extension is to study the concurrent version of such games, where players choose actions concurrently rather than in a turn-based setting. Finally, our definition of risk-sensitive equilibria is modelled after Nash equilibria and suffers from several of their limitations. Like Nash equilibria, RSEs allow irrational behaviours when one player deviates and must be punished, as in our game of Figure 3(c). Exploring alternative definitions of RSE, modelled after other equilibria concepts more suited for games on graphs [25, Section 7.1], such as subgame-perfect equilibria, could provide a relevant framework for player decision-making.
References
- [1] Alekh Agarwal, Nan Jiang, Sham M. Kakade, and Wen Sun. Reinforcement Learning: Theory and Algorithms. https://rltheorybook.github.io/, 2021.
- [2] M. Auer. Hands-On Value-at-Risk and Expected Shortfall: A Practical Primer. Management for Professionals. Springer International Publishing, 2018. URL: https://books.google.at/books?id=4EFKDwAAQBAJ.
- [3] Christel Baier, Krishnendu Chatterjee, Tobias Meggendorfer, and Jakob Piribauer. Entropic risk for turn-based stochastic games. Information and Computation, 301:105214, 2024. doi:10.1016/j.ic.2024.105214.
- [4] Hans Wolfgang Brachinger. From variance to value at risk: A unified perspective on standardized risk measures. In Wolfgang Gaul and Hermann Locarek-Junge, editors, Classification in the Information Age, pages 91–99, Berlin, Heidelberg, 1999. Springer Berlin Heidelberg.
- [5] Richard Bradley. Decision Theory: A Formal Philosophical Introduction, pages 611–655. Springer International Publishing, 2018. doi:10.1007/978-3-319-77434-3_34.
- [6] Véronique Bruyère, Emmanuel Filiot, Mickael Randour, and Jean-François Raskin. Meet your expectations with guarantees: Beyond worst-case synthesis in quantitative games. Information and Computation, 254:259–295, 2017. SR 2014. doi:10.1016/j.ic.2016.10.011.
- [7] Nicole Bäuerle and Ulrich Rieder. More risk-sensitive Markov decision processes. Mathematics of Operations Research, 39(1):105–120, 2014. doi:10.1287/MOOR.2013.0601.
- [8] Luca de Alfaro, Thomas A. Henzinger, and Ranjit Jhala. Compositional methods for probabilistic systems. In CONCUR 2001 — Concurrency Theory, pages 351–365, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg. doi:10.1007/3-540-44685-0_24.
- [9] Jerzy Filar and Lodewijk Kallenberg. Variance-penalized Markov decision process. Mathematics of Operations Research, 14, February 1989. doi:10.1287/moor.14.1.147.
- [10] Hans Föllmer and Alexander Schied. Convex measures of risk and trading constraints. Finance and Stochastics, 6(4):429–447, October 2002. doi:10.1007/s007800200072.
- [11] Vojtěch Forejt, Marta Kwiatkowska, Gethin Norman, and David Parker. Automated Verification Techniques for Probabilistic Systems, pages 53–113. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011. doi:10.1007/978-3-642-21455-4_3.
- [12] Hans Föllmer and Alexander Schied. Stochastic Finance: An Introduction in Discrete Time. Walter de Gruyter, Berlin, Boston, 4 edition, 2016.
- [13] Jorge Gallego-Hernández and Alessio Mansutti. On the existential theory of the reals enriched with integer powers of a computable number. In Olaf Beyersdorff, Michal Pilipczuk, Elaine Pimentel, and Kim Thang Nguyen, editors, 42nd International Symposium on Theoretical Aspects of Computer Science, STACS 2025, March 4-7, 2025, Jena, Germany, volume 327 of LIPIcs, pages 37:1–37:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2025. doi:10.4230/LIPICS.STACS.2025.37.
- [14] Christoph Haase and Stefan Kiefer. The odds of staying on budget. In Automata, Languages, and Programming, pages 234–246, Berlin, Heidelberg, 2015. Springer Berlin Heidelberg. doi:10.1007/978-3-662-47666-6_19.
- [15] Ronald A. Howard and James E. Matheson. Risk-sensitive Markov decision processes. Management Science, 18(7):356–369, 1972.
- [16] Ronald A. Howard and James E. Matheson. Risk-sensitive Markov decision processes. Management Science, 18(7):356–369, 1972. URL: http://www.jstor.org/stable/2629352.
- [17] Leonid Hurwicz. Optimality criteria for decision-making under ignorance. Technical report, Cowles Commission for Research in Economics, 1951. URL: https://cowles.yale.edu/sites/default/files/2024-03/s-0355.pdf.
- [18] Neil Immerman. Number of quantifiers is better than number of tape cells. J. Comput. Syst. Sci., 22:384–406, 1981. doi:10.1016/0022-0000(81)90039-8.
- [19] Jan Křetínský and Tobias Meggendorfer. Conditional value-at-risk for reachability and mean payoff in Markov decision processes. In Proceedings of the 33rd Annual ACM/IEEE Symposium on Logic in Computer Science, LICS ’18, pages 609–618. Association for Computing Machinery, 2018. doi:10.1145/3209108.3209176.
- [20] Claude Lefèvre. Optimal control of a birth and death epidemic process. Operations Research, 29(5):971–982, 1981. doi:10.1287/OPRE.29.5.971.
- [21] Shie Mannor and John N. Tsitsiklis. Mean-variance optimization in Markov decision processes. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, pages 177–184. Omnipress, 2011. URL: https://icml.cc/2011/papers/156_icmlpaper.pdf.
- [22] Tobias Meggendorfer. Risk-aware stochastic shortest path. Proceedings of the AAAI Conference on Artificial Intelligence, pages 9858–9867, 2022. doi:10.1609/AAAI.V36I9.21222.
- [23] John F. Nash. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1):48–49, 1950. doi:10.1073/pnas.36.1.48.
- [24] Andrzej S. Nowak. Notes on Risk-Sensitive Nash Equilibria, pages 95–109. Birkhäuser Boston, Boston, MA, 2005. doi:10.1007/0-8176-4429-6_5.
- [25] Martin J. Osborne. An Introduction to Game Theory. Oxford University Press, 2004.
- [26] Bernardo K. Pagnoncelli, Oscar Dowson, and David P. Morton. Multistage stochastic programs with the entropic risk measure, 2020. URL: https://optimization-online.org/2020/08/7984/.
- [27] Jakob Piribauer, Ocan Sankur, and Christel Baier. The variance-penalized stochastic shortest path problem. In 49th International Colloquium on Automata, Languages, and Programming, ICALP, volume 229 of LIPIcs, pages 129:1–129:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPICS.ICALP.2022.129.
- [28] Mickael Randour, Jean-François Raskin, and Ocan Sankur. Variations on the stochastic shortest path problem. In Verification, Model Checking, and Abstract Interpretation, pages 1–18. Springer, 2015. doi:10.1007/978-3-662-46081-8_1.
- [29] Dawei Shi, Robert J Elliott, and Tongwen Chen. On finite-state stochastic modeling and secure estimation of cyber-physical systems. IEEE Transactions on Automatic Control, 62(1):65–80, 2016. doi:10.1109/TAC.2016.2541919.
- [30] Florent Teichteil-Königsbuch, Ugur Kuter, and Guillaume Infantes. Incremental plan aggregation for generating policies in MDPs. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’10, pages 1231–1238. International Foundation for Autonomous Agents and Multiagent Systems, 2010. URL: https://dl.acm.org/citation.cfm?id=1838366.
- [31] Michael Ummels and Dominik Wojtczak. The Complexity of Nash Equilibria in Stochastic Multiplayer Games. Logical Methods in Computer Science, Volume 7, Issue 3, September 2011. doi:10.2168/LMCS-7(3:20)2011.
- [32] Abraham Wald. Statistical Decision Functions. John Wiley & Sons, New York, 1950. Classic formulation of the minimax (Wald) criterion.
