
Finding Equilibria: Simpler for Pessimists, Simplest for Optimists

Léonard Brice, Université Libre de Bruxelles, Belgium; Thomas A. Henzinger, Institute of Science and Technology Austria, Klosterneuburg, Austria; K. S. Thejaswini, Institute of Science and Technology Austria, Klosterneuburg, Austria
Abstract

We consider equilibria in multiplayer stochastic graph games with terminal-node rewards. In such games, Nash equilibria are defined assuming that each player seeks to maximise their expected payoff, ignoring their aversion or tolerance to risk. We therefore study risk-sensitive equilibria (RSEs), where the expected payoff is replaced by a risk measure. A classical risk measure in the literature is the entropic risk measure, where each player has a real-valued parameter capturing their risk-averseness. We introduce the extreme risk measure, which corresponds to the extreme cases of the entropic risk measure, where players are either extreme optimists or extreme pessimists. Under the extreme risk measure, every player is an extremist: an extreme optimist perceives their reward as the maximum payoff that can be achieved with positive probability, while an extreme pessimist expects the minimum payoff achievable with positive probability. We argue that the extreme risk measure is particularly relevant in multi-player graph-based settings, as it can model several real-life scenarios, such as interactions between secure systems and potential security threats, or distributed control of safety-critical systems. We prove that RSEs defined with the extreme risk measure are guaranteed to exist when all rewards are non-negative. Furthermore, we prove that the problem of deciding whether a given game contains an RSE that generates risk measures within specified intervals is decidable and 𝖭𝖯-complete for our extreme risk measure, and even 𝖯𝖳𝖨𝖬𝖤-complete when all players are extreme optimists, whereas the same problem is undecidable for the entropic risk measure, and even for the classical expected payoff. This establishes, to our knowledge, the first decidable fragment for equilibria in simple stochastic games without restrictions on strategy types or the number of players.

Keywords and phrases:
Nash equilibria, stochastic games, graph games, risk-sensitive equilibria
Copyright and License:
© Léonard Brice, Thomas A. Henzinger, and K. S. Thejaswini; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Models of computation
Related Version:
Full Version: https://arxiv.org/abs/2502.05316
Acknowledgements:
We thank anonymous reviewers for pointing us to the Hurwicz criterion and to the work of Gallego-Hernández and Mansutti [13]. We thank Marie van den Bogaard for her valuable feedback on the first author’s PhD dissertation, which helped improve the quality of this work.
Funding:
This work is a part of project VAMOS that has received funding from the European Research Council (ERC), grant agreement No 101020093.
Editors:
Paweł Gawrychowski, Filip Mazowiecki, and Michał Skrzypczak

1 Introduction

Stochastic systems have been used extensively in several areas, including verification [11], learning theory [1], and epidemic processes [20], to name a few. Many real-world systems, however, do not operate under centralised control, and modelling them as stochastic systems with multiple agents yields more faithful abstractions. Examples of fields in which multi-agent stochastic modelling is used include cyber-physical systems [29], distributed and probabilistic computer programs [8], and probabilistic planning [30]. In such settings, the problem of reasoning about multiple agents with several, often orthogonal, objectives becomes important. For multi-agent systems modelled with stochasticity in the underlying arena, a fundamental question is the existence or computation of an equilibrium – typically, an important decision problem in such games is the constrained existence problem, which asks whether a given game contains an equilibrium where each player's (perceived) payoff lies within a given interval. The most popular equilibria in the literature are Nash equilibria (NEs) [23], where each player plays optimally to maximise their expected payoff, given the other players' strategies. However, Nash equilibria come with their own downsides. In terms of computational complexity, the constrained existence problem for NEs in stochastic games is prohibitively expensive, or even undecidable in the general case [31]. But more importantly, NEs do not faithfully model agents in real-world settings, since they ignore the agents' tolerance of or aversion to risk.

An example.

We illustrate this claim with the help of a simple game, depicted in Figure 1(a). Each vertex other than the start vertex is owned by one of the three players A, B, and C, and a play in this game is a sequence of moves of a token along the edges of the graph. The initial vertex s is a stochastic vertex, which moves the token to the next vertex with probability 1/2 at each step. When the token is on a vertex owned by a player i, that player decides where the token goes next, and when a terminal vertex (vertex t1, t2, t3, or t4) is reached, the numbers at the terminal vertex indicate the payoff obtained by each player.

If the token moves out of vertex s, player A then chooses between a green and an orange action. If the green action was selected, player B must make a similar choice (again between a green and an orange action). If both players selected the green action, then player C is given the opportunity to reach a terminal that offers a negative reward to either player A or player B, while the other of the two gets an additional profit (payoff 1 instead of 0).

A strategy profile is a Nash equilibrium (NE) if no player can increase their expected payoff by changing their strategy as long as the other players stick to theirs. If player C does not use randomisation, there is no NE in this game where she gets an expected payoff of 1, since along any play that deterministically reaches either the vertex t3 or t4, one of the two players A or B gets the expected payoff −1 and has an incentive to deviate to the orange action at their vertex. However, using randomisation, there is such an NE: both players A and B choose the green action, and player C chooses at random to go either to vertex t3 or t4, each with probability 1/2. Then, both player A and player B have the expected payoff 0, the same as if one of the terminal vertices t1 or t2 were reached.

Figure 1: Entropic risk measure. (a) A 3-player stochastic game with a stochastic vertex. (b) Entropic risk measures for player A, for varying risk parameters.

However, if both players A and B were modelling, say, safety-critical systems, such a Nash equilibrium is dubious, because neither player can afford even a low probability of the reward −1: they do not have this tolerance to risk. Such a strategy profile would not be stable even if only one of the two players A or B, let alone both, were risk-averse. However, the strategy profile where player A chooses the green action, player B the orange action, and player C randomises between her choices with equal probability, is a potential equilibrium in the case where player B is risk-averse. On the other hand, if all players A, B, and C were risk-loving, say they model entities like hackers of a system that are happy with some chance at a small positive reward, the same strategy profile would not be an equilibrium, as player B would rather gamble for a chance at the reward +1, despite the risk of a negative reward, thus deviating from the orange action to the "risky" alternative, the green action.

This example thus generalises a line of reasoning that is often made in finance or decision theory: consider a participant in a lottery where they lose $100 with probability 99/100, and win $10,000 with probability 1/100. While the expected payoff is positive ($1), expectation alone is insufficient as a decision criterion for participating in the lottery.

Risk measures.

A risk measure captures the perception that a player has of what their payoff will be; it thus generalises the notion of expected payoff. Various risk measures exist in the literature and have been used extensively in economics and finance. They include expected shortfall (ES), value at risk (VaR) [2], variance [4], and the entropic risk measure (ER) [10]. In graph-based models, these risk measures have been studied mainly over Markov decision processes (MDPs), using variance (along with the mean) as a risk measure [9, 27, 21], ES [28, 19, 22] (also referred to in the literature as conditional value at risk (CVaR), average value at risk (AVaR), expected tail loss (ETL), and superquantile), and ER [15, 7, 3]. Studying the entropic risk measure in MDPs appears more practical than ES or variance-penalised risk measures, due to the intractable exponential memory [14] and time required to compute optimal strategies [27] for the latter measures, even in MDPs. By contrast, when the risk measure used is ER, players have optimal positional strategies in MDPs [16].

Entropic risk measure.

The entropic risk measure is parameterised by assigning a risk parameter to an agent, i.e., a real value ρ. The entropic risk measure of a random variable X is then defined as 𝕄^ρ[X] = −(1/ρ)·log_e(𝔼[e^{−ρX}]) [12] for ρ ≠ 0. Assume X is a player's payoff. If the risk parameter ρ is positive, the corresponding player is risk-averse, and more weight is given to worse payoffs. Conversely, players with a negative ρ are risk-loving. When ρ tends to 0, the entropic risk measure converges to the classical expectation 𝔼[X].
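For instance, for the lottery described above, where X takes the value −100 with probability 99/100 and the value 10,000 with probability 1/100, the definition yields the worked computation

\[
\mathbb{M}^{\rho}[X] \;=\; -\frac{1}{\rho}\,\log_e\!\Big(\tfrac{99}{100}\,e^{100\rho} \;+\; \tfrac{1}{100}\,e^{-10000\rho}\Big),
\]

which tends to the expectation 𝔼[X] = 1 as ρ tends to 0, already equals roughly −99 for the mildly risk-averse parameter ρ = 1/100, and tends to the worst outcome −100 as ρ tends to +∞.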

Consider again the game in Figure 1(a), and the strategy profile in which player A and player B choose the green action, and player C goes to the vertex t3 with probability p ∈ [0,1], and to the vertex t4 with probability 1−p. Then, player A's perception of such a strategy profile, depending on her risk parameter ρ, is given by the entropic risk measure 𝕄^ρ[μ_A] = −(1/ρ)·log_e(p·e^{−ρ} + (1−p)·e^{ρ}). Figure 1(b) illustrates this formula, where each curve captures the perceived payoff for a fixed probability p: the thick blue line corresponds to the case p=0, the thick red line to p=1, and mixtures of blue and red correspond to intermediary cases – the thick purple curve corresponds to p=1/2. Note that this purple curve reaches exactly the value 0 when ρ=0. This is because 0 is the expected payoff of player A when p=1/2, and the entropic risk measure at ρ=0 – defined as the limit when ρ approaches zero – corresponds exactly to the expected payoff. For higher values of ρ, the entropic risk measure is lower, since these correspond to cases where player A is more risk-averse and focuses on the event of reaching the vertex t4; for symmetric reasons, for lower values of ρ, the entropic risk measure is higher.

Extreme risk measure.

Consider the perception of risk our agents may have in the previous example. If such entities were indeed modelling safety-critical agents for which rewards below a certain threshold are unacceptable, such agents usually do not accept any strategy profile in which those rewards have any positive probability, disregarding what the probabilities actually are. They behave as extremely pessimistic players: they assume that the worst case that can occur with positive probability will occur. On the other hand, we may model certain external environmental factors, or hackers of a system, as extremely optimistic players: they are happy with a small probability of success – they can repeat their attempts over a large number of trials – and seek to maximise the best payoff they may receive with positive probability.

We define the pessimistic risk measure of a random variable X as the supremum of the values x such that X almost surely takes values above x, i.e., its essential infimum (essinf); the optimistic risk measure is defined symmetrically (using esssup). In Figure 1(b), the reader may have observed that all curves (except the thick blue and the thick red ones) converge to the value −1 when ρ tends to +∞, and to the value 1 when ρ tends to −∞. These extreme cases correspond to the pessimistic and the optimistic risk measure, respectively, and we group them under the umbrella term extreme risk measure (XR).

Risk-sensitive equilibria.

Risk-sensitive equilibria are equilibria in multi-player games defined using a specified risk measure for each player: no player can improve the risk measure of their payoff by deviating unilaterally from their strategy. Risk-sensitive equilibria thus take the risk-sensitivities of the players into account when determining whether a player has an incentive to deviate. When the risk measure used is the entropic risk measure, we call such equilibria entropic risk-sensitive equilibria (ERSEs). When using our novel risk measure – the extreme risk measure – we refer to them as extreme risk-sensitive equilibria (XRSEs).

Example 1.

In the game depicted by Figure 1(a), if players A and B are extreme pessimists, and player C is an extreme optimist, then in order for player C to get the risk measure 1, it suffices that either the terminal vertex t3 or t4 is reached with positive probability. But such a strategy profile cannot be an XRSE, since there is then a play along which either player A or B gets the payoff −1 with positive probability and, as a pessimist, that player gets the risk measure 0 if they deviate to the orange action. They both have a profitable deviation that avoids player C's vertex. The only XRSEs in this game are therefore the strategy profiles that almost surely reach the terminals t1 or t2, leaving player C with the risk measure 0.

Risk-sensitive equilibria are particularly relevant in security contexts, where, for example, defenders may act conservatively while attackers behave opportunistically. Similarly, in autonomous systems, different components – such as safety controllers, efficiency optimisers, or compliance monitors – may operate under independent objectives and distinct risk profiles. We study ERSEs and XRSEs in a framework general enough to model a variety of situations, including epidemic processes and probabilistic planning, where equilibria in graph games are considered. Incorporating risk measures into equilibria enables more realistic and nuanced models of such multi-agent interactions.

Our results.

We consider simple quantitative multiplayer stochastic games played on graphs, that is, games in which the payoffs that the players receive depend on the terminal vertex that is reached – and in which an infinite play is associated with the payoff zero for all players. In such games, we consider two questions in particular: the existence, and the complexity of the constrained existence problem, of both ERSEs and XRSEs.

In Section 3, we show that the constrained existence problem of ERSEs is undecidable, extending the corresponding result for Nash equilibria from the work of Ummels and Wojtczak [31]. However, building on results for the two-player zero-sum case by Baier et al. [3], we identify restrictions on strategies that recover decidability. Using known results about NEs, we also show that, when the rewards are all nonnegative, such an equilibrium always exists.

We define the extreme risk measure (XR) as a novel risk measure for multi-agent stochastic systems. In Section 4, we show that our new definition is robust, since it exactly captures the well-studied entropic risk measure when the risk parameters tend to ±∞. Our main technical contributions concern the extreme risk measure and extreme risk-sensitive equilibria (XRSEs). We show that the constrained existence problem of XRSEs is decidable and 𝖭𝖯-complete. The technical crux of the 𝖭𝖯-membership proof lies in showing that if an XRSE satisfying the constraints exists, then there exists one that has finite memory and a polynomial-size representation. Such a succinct representation can then be guessed and verified in polynomial time. We show that the problem remains 𝖭𝖯-complete when strategies are required to be stationary, pure, or positional. Finally, we show that if all players are extreme optimists, the problem is 𝖯𝖳𝖨𝖬𝖤-complete.

As for ERSEs, we also prove the existence of XRSEs in games with nonnegative rewards – and we show that there exists such a stationary equilibrium (where players use no memory, only randomness) that can be algorithmically constructed in polynomial time.

Related work.

The Hurwicz criterion is used in decision theory in situations where probabilities are unknown: given a random variable X and a parameter α ∈ [0,1], it assigns the objective of maximising the quantity α·max(X) + (1−α)·min(X) [17, 5]. It generalises Wald's maximin criterion [32], which constitutes one extreme (α=0) of the Hurwicz criterion. Our pessimistic risk measure corresponds to Wald's maximin criterion when only outcomes of positive probability are considered, while our optimistic risk measure corresponds to the other extreme of the Hurwicz criterion (α=1). Even the classical notion of the worst-case scenario, obtained by abstracting away any stochastic environment and treating it as an adversary, can be seen as Wald's maximin criterion applied to all outcomes, including probability-0 events.

To the best of our knowledge, equilibria defined using any risk measure, and in particular the entropic risk measure, have not been studied in multiplayer graph games. Two works have considered risk in the specific case of two-player zero-sum games on stochastic arenas. The first, by Bruyère, Filiot, Randour, and Raskin [6], studied the "beyond worst-case" synthesis problem, which considers measures addressing a player's risk-averseness in the context of synthesis. They consider "strongly risk-averse" strategies, which avoid outcomes below a certain threshold while maximising the expected payoff, resulting in an entirely different risk measure. Baier, Chatterjee, Meggendorfer, and Piribauer considered two-player zero-sum stochastic games with total-reward objectives (the payoff is the sum of the rewards seen along the way), where one player is risk-sensitive and wants to optimise their entropic risk measure [3]. They show that optimal strategies can be computed in 𝖯𝖲𝖯𝖠𝖢𝖤 (when the base e is replaced by an algebraic number). If the base of the exponent is e, computing optimal strategies is in 𝟥𝖤𝖷𝖯𝖳𝖨𝖬𝖤, due to a recent result of Gallego-Hernández and Mansutti [13].

The two-player zero-sum case is a special case of the constrained existence problem in two-agent systems, where the payoff functions of the two agents, as well as their risk parameters, are the negation of each other. On the other hand, another subcase of the constrained existence problem of equilibria defined with the entropic risk measure is, of course, the constrained existence problem of NEs, obtained when the risk parameters of all agents are set to 0, which is known to be undecidable from the work of Ummels and Wojtczak [31].

2 Preliminaries

We assume that the reader is familiar with the basics of probability and graph theory. However, we define some concepts for establishing notation.

Probabilities.

Given a (finite or infinite) set of outcomes Ω and a probability measure ℙ over Ω, let X be a random variable over Ω, that is, a mapping X : Ω → ℝ. We then write 𝔼_ℙ[X], or simply 𝔼[X], for the expectation of X, when it is defined. Given a finite set S, a probability distribution over S is a mapping d : S → [0,1] that satisfies the equality ∑_{x∈S} d(x) = 1. We write 𝖲𝗎𝗉𝗉(d) for the support of the distribution d, that is, the set of elements x ∈ S such that d(x) > 0.

Risk measures.

Given a set Ω of outcomes, a risk measure over Ω is a mapping M which maps a probability measure ℙ over Ω and a random variable X to a real value M_ℙ[X]. Sometimes, in the literature, risk measures are expected to have the following three properties: (1) they are normalised, i.e., we have M[0] = 0; (2) they are monotone, i.e., the pointwise inequality X ≤ Y implies M[X] ≤ M[Y]; and (3) they are translative, i.e., M[X+c] = M[X] + c for every constant c. In particular, the expectation 𝔼 satisfies those properties. We will not need those properties, and state them here only to remark that the risk measures we consider satisfy them. Note that translativity sometimes refers to the opposite of the definition given here, i.e., the property M[X+c] = M[X] − c for every c; this is a matter of whether one uses a risk measure or its negation.
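For the expectation, for instance, all three properties are one-line checks:

\[
\mathbb{E}[0] = 0, \qquad X \le Y \;\Rightarrow\; \mathbb{E}[X] \le \mathbb{E}[Y], \qquad \mathbb{E}[X+c] = \mathbb{E}[X] + c,
\]

and the essential infimum and supremum used later (Definition 12) satisfy them just as directly, since shifting a random variable by a constant c shifts every value it takes with positive probability by c.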

Graph, paths, games.

A directed graph (V,E) consists of a set V of vertices and a set E of ordered pairs of vertices, called edges. In a directed graph (V,E), for each vertex u, we write E(u) to denote the set E ∩ ({u} × V). For simplicity, we often write uv for an edge (u,v) ∈ E. A path in the directed graph (V,E) is a (finite or infinite) word π = π_0 π_1 ⋯ over the alphabet V such that π_n π_{n+1} ∈ E for every n such that π_n and π_{n+1} exist. We write 𝖮𝖼𝖼(π) for the set of vertices occurring along π, and 𝖨𝗇𝖿(π) for those that occur infinitely often, if there are any. The prefix π_0 ⋯ π_n is written π_{≤n} or π_{<n+1}, and the suffix π_n π_{n+1} ⋯ is written π_{≥n} or π_{>n−1}. A finite path π = π_0 ⋯ π_n is simple if every vertex occurs at most once along π. It is a cycle if its last vertex π_n is such that π_n π_0 ∈ E.

Definition 2 (Game).

A game is a tuple 𝒢 = (V, E, Π, (V_i)_{i∈Π}, 𝗉, μ), where we have:

  • a directed graph (V,E), called the underlying graph of 𝒢;

  • a finite set Π of players;

  • a partition (V_i)_{i∈Π⊎{?}} of the set V, where V_i denotes the set of vertices controlled by player i, and the vertices in V_? are called stochastic vertices;

  • a probability function 𝗉 : E(V_?) → [0,1], such that for each stochastic vertex s, the restriction of 𝗉 to E(s) is a probability distribution;

  • a mapping μ : T → ℝ^Π, called the payoff function, where T is the set of terminal vertices, i.e., vertices of the graph (V,E) that have no outgoing edges. For each player i, we also write μ_i for the function that maps a terminal vertex t to the i-th coordinate of the tuple μ(t).

In a more general framework, payoffs can be assigned to all infinite paths. Here, we focus only on what are usually called simple quantitative games, i.e., games in which the underlying graph contains terminal vertices and the payoffs depend only on which terminal vertex is eventually reached. We thus extend the mapping μ to the set (V∖T)^ω ∪ (V∖T)^*T by defining μ(v_1⋯v_k t) = μ(t), and μ(v_1 v_2 ⋯) = (0)_{i∈Π} (if no terminal vertex is reached, every player gets the payoff 0). A game is Boolean if all payoffs belong to the set {0,1}.

An initialised game is a tuple (𝒢,v0), usually written 𝒢v0, where v0V is an initial vertex. In what follows, when the context is clear, we use the word game also for an initialised game. We often assume that we are given a game 𝒢v0 and implicitly use the same notations as in the definition above.

Example 3.

The initialised game of Figure 1(a) is redrawn in Figure 2(a). There, the vertex a is controlled by player A, the vertex b by player B, and the vertex c by player C. The vertex s is stochastic, with probability 1/2 of going to a and probability 1/2 of going back to the stochastic vertex s.

Definition 4 (Markov decision process, Markov chain).

A Markov decision process (initialised or not) is a game with one player. A Markov chain is a game with zero players.

Histories and plays.

We call a play a path in the underlying graph that is infinite or whose last vertex is a terminal vertex; all other paths are called histories. We use the notation 𝖧𝗂𝗌𝗍(𝒢) to denote the histories in the graph of the game, and 𝖯𝗅𝖺𝗒𝗌(𝒢) to denote the plays, which may be finite or infinite. For a history h = h_0⋯h_n, we write 𝗅𝖺𝗌𝗍(h) = h_n. We also write 𝖧𝗂𝗌𝗍_i(𝒢) for the set of histories whose last vertex is controlled by player i. A history or play in an initialised game 𝒢_{v_0} is a history or play in 𝒢 whose first vertex is v_0.

Strategies, and strategy profiles.

In a game 𝒢_{v_0}, a strategy for player i is a mapping σ_i that maps each history hu ∈ 𝖧𝗂𝗌𝗍_i(𝒢_{v_0}) to a probability distribution over E(u). The set of possible strategies for player i in 𝒢_{v_0} is written 𝖲𝗍𝗋𝖺𝗍_i(𝒢_{v_0}). A path π_0 π_1 ⋯ (be it a history or a play) is compatible with the strategy σ_i if for each k such that π_k ∈ V_i, the probability that the strategy σ_i proposes π_{k+1} after the history π_{≤k} is positive, that is, we have σ_i(π_{≤k})(π_{k+1}) > 0. A strategy profile for a subset P ⊆ Π is a tuple (σ_i)_{i∈P}. A strategy profile for the set P of players is written σ̄_P, or simply σ̄ when P = Π. We also write σ̄_{−i} for σ̄_P where P = Π∖{i}. Similarly, we use (σ̄_{−i}, σ'_i) to denote the strategy profile τ̄ defined by τ_i = σ'_i and τ_j = σ_j for j ≠ i. We sometimes write σ̄(hv) to mean σ_i(hv), where i is the player controlling v, or the restriction of 𝗉 to E(v) when v ∈ V_?.

For a history hv and a strategy σ_i, we define the strategy truncated to hv, written σ_i^{hv}, as the strategy σ'_i : h' ↦ σ_i(hh') in the game 𝒢_v, where hh' denotes the concatenation of h and h', glued at the vertex v.

A strategy profile σ̄_{−i} in the game 𝒢_{v_0} defines an initialised Markov decision process 𝒢_{v_0}(σ̄_{−i}), in which the vertices of the (infinite) underlying graph are the histories of 𝒢_{v_0}, with an edge from hu to each history huv such that uv ∈ E. Similarly, a strategy profile σ̄ for Π defines an initialised Markov chain 𝒢_{v_0}(σ̄). Thus, it also defines a probability measure ℙ_σ̄ over plays – which turns the payoff functions μ_i into random variables.

Pure, stationary, and positional strategies.

We say that a strategy σ_i is pure when for each history hu, there is a vertex v such that σ_i(hu)(v) = 1; we then often simply write σ_i(hu) = v. We say that σ_i is stationary when for every two histories hu, h'u ∈ 𝖧𝗂𝗌𝗍_i(𝒢_{v_0}), we have σ_i(hu) = σ_i(h'u). In that case, we sometimes assume that the strategy σ_i is defined in every game 𝒢_u and simply write σ_i(u) for σ_i(hu). Finally, the strategy σ_i is positional when it is pure and stationary. These concepts generalise naturally to strategy profiles.

Figure 2: A game and a memory structure. (a) Figure 1(a) redrawn for convenience. (b) A memory structure.

Memory structures.

A memory structure for player i is a tuple (S_i, s_0, δ_i, ν_i), where S_i is a finite set of memory states, s_0 ∈ S_i is an initial state, δ_i is a memory-update mapping that maps each pair (s,v) ∈ S_i × V to a memory state s', and ν_i is an output mapping that maps each pair (s,v) ∈ S_i × V_i to a distribution d over E(v). The memory-update mapping can be extended to a mapping δ_i : 𝖧𝗂𝗌𝗍(𝒢_{v_0}) → S_i with δ_i(ε) = s_0 and δ_i(hu) = δ_i(δ_i(h), u) for each history hu. The memory structure then induces the strategy σ_i defined by σ_i(hu) = ν_i(δ_i(h), u) for each history hu ∈ 𝖧𝗂𝗌𝗍_i(𝒢_{v_0}). A strategy induced by a memory structure is called a finite-memory strategy. Note that stationary strategies are exactly the strategies induced by memory structures with |S_i| = 1.

We analogously define memory structures for a subset of players P ⊆ Π, which also define finite-memory strategy profiles. Note that if σ̄_P is a finite-memory strategy profile, then each σ_i with i ∈ P is finite-memory – a memory structure inducing that strategy is obtained by replacing ν by its restriction to S × V_i. Conversely, if each σ_i with i ∈ P is finite-memory, then the strategy profile σ̄_P is finite-memory – a collective memory structure can be obtained as the product of the individual memory structures.

Example 5.

Figure 2(b) depicts an example of a memory structure for player B on the game of Figure 2(a). The induced strategy can be presented as follows: when the vertex b is reached, player B goes deterministically to the vertex t1 if s has been visited an even number of times. Otherwise, he goes to c with probability 1/3. Note that the output depends only on the history that was seen: player B does not see whether player A deterministically chose to go to the vertex b, or did so as the outcome of a randomised choice.
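To make the formalism concrete, the following minimal Python sketch implements the memory structure of Example 5 and the strategy it induces. The example only partially specifies the randomised choice, so the target of the remaining 2/3 probability mass in the odd state (taken to be t1 below) is an assumption made for illustration.

# A memory structure (S, s0, delta, nu) for player B in the game of Figure 2(a):
# the memory states track the parity of the number of visits to the vertex s.
S = {"even", "odd"}
s0 = "even"

def delta(state, vertex):
    """Memory update: toggle the parity whenever the vertex s is visited."""
    if vertex == "s":
        return "odd" if state == "even" else "even"
    return state

def nu(state, vertex):
    """Output: a distribution over the successors of b, player B's vertex."""
    assert vertex == "b"
    if state == "even":
        return {"t1": 1.0}            # deterministic choice
    return {"c": 1 / 3, "t1": 2 / 3}  # randomised choice (t1 is an assumption)

def induced_strategy(history):
    """The induced strategy: sigma(h u) = nu(delta*(h), u)."""
    state = s0
    for v in history[:-1]:  # update the memory along the history before u
        state = delta(state, v)
    return nu(state, history[-1])

# After the history s a b, the vertex s has been visited once (odd parity):
print(induced_strategy(["s", "a", "b"]))  # {'c': 0.333..., 't1': 0.666...}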

Risk-sensitive equilibria, constrained existence problem.

In multiplayer stochastic games, we wish to study generalisations of classical Nash equilibria in which the expectation is replaced by other risk measures. Such generalisations are called risk-sensitive equilibria [24]. When M is a risk measure and σ̄ is a strategy profile, we write M(σ̄) for M_{ℙ_σ̄}.

Definition 6 (Risk-sensitive equilibrium).

Let 𝒢_{v_0} be a game, and let M̄ = (M_i)_{i∈Π} be a tuple of risk measures. Let σ̄ be a strategy profile in 𝒢_{v_0}, let i be a player, and let σ'_i be a strategy for player i, called a deviation of player i from σ̄. The deviation σ'_i is profitable with regard to the risk measure M_i if we have M_i(σ̄_{−i}, σ'_i)[μ_i] > M_i(σ̄)[μ_i]. The strategy profile σ̄ is an M̄-risk-sensitive equilibrium, or M̄-RSE, if no player i has a profitable deviation from σ̄ with regard to M_i.

Nash equilibria in a game 𝒢v0 can then be defined as M¯-risk-sensitive equilibria, where Mi=𝔼 for each player i. The following problem is the main focus throughout our paper.

Question (Constrained existence of risk-sensitive equilibria).

Given a game 𝒢_{v_0}, a tuple of risk measures M̄, and two payoff vectors x̄, ȳ ∈ ℝ^Π, does there exist an M̄-RSE σ̄ in 𝒢_{v_0} such that for each i ∈ Π, we have x_i ≤ M_i(σ̄)[μ_i] ≤ y_i?

We mainly consider the entropic risk measure and the extreme risk measure.

3 Entropic risk measure: intractable in multiplayer games

The entropic risk measure is defined using a risk parameter, i.e., a real value ρ ∈ ℝ∖{0}: large positive values indicate risk-averseness, large negative values risk-inclination.

Definition 7 (Entropic risk measure).

Given a risk parameter ρ, the entropic risk measure is defined for every probability measure and random variable X as:

𝕄^ρ[X] = −(1/ρ)·log_e(𝔼[e^{−ρX}]).

This definition is generalised by replacing Euler's constant with some base β > 1. The entropic risk measure with base β is then defined by 𝕄_β^ρ[X] = −(1/ρ)·log_β(𝔼[β^{−ρX}]). The probability measure ℙ is omitted when clear from the context.

To see a visual representation of the entropic risk measure, see Figure 1 in the introduction.

 Remark 8.
  • The generalisation to any base β > 1 serves only computational goals, since algebraic bases are easier to handle. Barring such concerns, the definitions are equivalent, since for every β we have 𝕄_β^ρ = 𝕄^{ρ'}, where ρ' = ρ·log_e(β) (a one-line computation, spelled out after this remark).

  • Definition 7 leaves the function undefined for ρ = 0. However, it is known that for all ℙ, β, and X, the quantity 𝕄_β^ρ[X] converges to 𝔼[X] when ρ tends to 0 (see e.g. [26]). Therefore, we henceforth set 𝕄_β^0[X] = 𝔼[X], making the entropic risk measure defined for all finite risk parameters ρ.
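The base-change identity stated in the first item is a one-line computation, using log_β(z) = log_e(z)/log_e(β) and β^{−ρX} = e^{−ρ·log_e(β)·X}:

\[
\mathbb{M}_\beta^{\rho}[X]
= -\frac{1}{\rho}\,\log_\beta\!\big(\mathbb{E}\big[\beta^{-\rho X}\big]\big)
= -\frac{1}{\rho\,\log_e\beta}\,\log_e\!\big(\mathbb{E}\big[e^{-\rho\log_e(\beta)\,X}\big]\big)
= \mathbb{M}^{\rho'}[X]
\qquad\text{with } \rho' = \rho\,\log_e\beta .
\]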

When we are given a profile ρ̄ = (ρ_i)_{i∈Π} of risk parameters, we write 𝕄_β^ρ̄[μ] for the tuple (𝕄_β^{ρ_i}[μ_i])_{i∈Π}. The entropic risk measure defines a family of RSEs, namely the (𝕄_β^{ρ_i})_{i∈Π}-RSEs, which we also call (β,ρ̄)-entropic risk-sensitive equilibria, or (β,ρ̄)-ERSEs. We now study the constrained existence problem of (β,ρ̄)-ERSEs. Unfortunately, it is undecidable in the general case.

Theorem 9.

The constrained existence problem of (β,ρ̄)-ERSEs with ρ̄ ∈ ℝ^Π is undecidable, for every fixed value of β, even for ρ̄ = (0)_{i∈Π}, and with only nonnegative payoffs.

Proof.

The constrained existence problem of Nash equilibria is undecidable [31, Theorem 4.9]. Since Nash equilibria are ERSEs with ρi=0 for each player i, our result follows.

We therefore briefly study cases where the class of strategies considered is restricted.

Theorem 10.

The constrained existence problem of (β,ρ¯)-ERSEs, in quantitative simple stochastic games, with rational risk parameters:

  1. remains undecidable when players are restricted to pure strategies;

  2. is decidable when players are restricted to stationary strategies, or to positional strategies:

    (a) in 𝟥𝖤𝖷𝖯𝖳𝖨𝖬𝖤 if β = e;

    (b) in 𝖯𝖲𝖯𝖠𝖢𝖤 if the base β is algebraic; moreover:

      • it is 𝖭𝖯-hard when restricted to positional strategies, and

      • it is ∃ℝ-complete when restricted to stationary strategies.

We end this section with the following result: ERSEs are guaranteed to exist if all rewards are nonnegative.

Theorem 11.

Let 𝒢_{v_0} be a simple stochastic game with only nonnegative payoffs. Then, there exists a (pure) (β,ρ̄)-ERSE in 𝒢_{v_0}.

We conjecture that this result remains true when negative rewards are involved.

4 Extreme risk measure: limit of entropic risk measure

This section introduces a new risk measure that provides a tractable alternative to the risk measures available in the literature. Let X be a random variable ranging over ℝ. The pessimistic risk measure of X is the highest value x such that X almost surely takes a value above x. When X takes finitely many values, this is the least value that X takes with positive probability. In probability theory, this quantity is referred to as the essential infimum, written essinf. The definition of the optimistic risk measure is symmetric.

Definition 12 (Optimistic, pessimistic risk measure).

The pessimistic risk measure of a random variable X is defined by ℙ𝕄[X] = essinf(X) = sup{x ∈ ℝ ∣ ℙ(X ≥ x) = 1}. Analogously, the optimistic risk measure of X is 𝕆𝕄[X] = esssup(X) = inf{x ∈ ℝ ∣ ℙ(X ≤ x) = 1}.

Note that the values of probabilities do not influence the optimistic or the pessimistic risk measures, which depend only on which events have zero or nonzero probability. Thus, those risk measures are well-suited to model players that do not care about probabilities because they need certainties, or simply because they do not know them – which is often the case in real-world stochastic processes.
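For instance, under the randomised strategy profile of Figure 1(a) discussed in the introduction, player A's payoff μ_A takes the value 1 with probability p and the value −1 with probability 1−p, so that

\[
\mathbb{PM}[\mu_A] = -1
\qquad\text{and}\qquad
\mathbb{OM}[\mu_A] = 1
\]

for every p ∈ (0,1): only the support of the distribution matters, not the actual probabilities.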

Given a game 𝒢_{v_0}, we can assign a risk measure to each player by defining a partition (P,O) of Π, where P is the set of pessimists, whose perceived payoffs are defined by the pessimistic risk measure, and O is the set of optimists, who intend to maximise their optimistic risk measure. For convenience, we group both measures under the umbrella term extreme risk measure (XR), and often assume that (P,O) is given; we then write 𝕏_i for ℙ𝕄 when i ∈ P, and for 𝕆𝕄 when i ∈ O. Since each player i is interested only in the risk measure of their own payoff, we also write 𝕏_i(σ̄) for the quantity 𝕏_i(σ̄)[μ_i]. We define extreme risk-sensitive equilibria, or XRSEs for short, as (𝕏_i)_{i∈Π}-RSEs.

Our definition, which considers only outcomes of positive probability, is slightly different from a situation where each player treats every other player or stochastic vertex as an adversary. Indeed, in Figure 1, such an adversarial treatment would mean that a pessimistic player assumes that the probability-0 event of the play staying in the vertex s forever could also be realised, whereas for the extreme risk measure, such a probability-0 event is disregarded. The same would hold if the stochastic vertex were instead assigned to any player whose strategy randomises between staying at and leaving that vertex.

We show that our definition of the extreme risk measure corresponds to the limit cases of the entropic risk measure. Observe in Figure 1 that, when player C follows a randomised strategy, player A's entropic risk measure converges to −1 when ρ tends to +∞, and to +1 when ρ tends to −∞. We formally prove that our definition of the extreme risk measure coincides with these limit cases of the entropic risk measure.

Theorem 13.

Let X be a random variable that ranges over ℝ, and let β > 1.

  • The limit of the entropic risk measure of X when ρ tends to +∞ exists and is equal to the pessimistic risk measure, that is, we have lim_{ρ→+∞} 𝕄_β^ρ[X] = ℙ𝕄[X].

  • Similarly, the limit of the entropic risk measure of X when ρ tends to −∞ exists and is equal to the optimistic risk measure, that is, we have lim_{ρ→−∞} 𝕄_β^ρ[X] = 𝕆𝕄[X].
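For a random variable X with finite support, the first limit can be verified directly: writing m = ℙ𝕄[X] for the least value in the support of X and factoring β^{−ρm} out of the sum, we get

\[
\mathbb{M}_\beta^{\rho}[X]
= -\frac{1}{\rho}\log_\beta\!\Big(\sum_{x}\mathbb{P}(X{=}x)\,\beta^{-\rho x}\Big)
= m - \frac{1}{\rho}\log_\beta\!\Big(\mathbb{P}(X{=}m) + \sum_{x>m}\mathbb{P}(X{=}x)\,\beta^{-\rho(x-m)}\Big)
\;\xrightarrow{\;\rho\to+\infty\;}\; m,
\]

since every exponent −ρ(x−m) with x > m tends to −∞, so the argument of the logarithm tends to ℙ(X=m) > 0. The second limit is symmetric.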

5 Constrained existence of extreme risk sensitive equilibria

We now study the computational complexity of the constrained existence problem of XRSEs. The main result of this section shows that, contrary to the same problem for ERSEs, this is a decidable fragment of the constrained existence problem of RSEs: it is 𝖭𝖯-complete. We also study several subcases, showing that 𝖭𝖯-completeness remains true when players are restricted to positional, stationary, or pure strategies. Finally, we show that when all players are optimists, the problem becomes 𝖯𝖳𝖨𝖬𝖤-complete.

5.1 Membership in 𝗡𝗣

𝖭𝖯-membership is a consequence of the following fact: when an XRSE exists, there also exists one with the same extreme risk measures that uses finite memory, polynomial in the size of the game. Let us first illustrate with examples how and why memory is required in such XRSEs.

Example 14.

We consider the following constrained existence question and analyse the same question on three example graphs.

(⋆) Is there an XRSE σ̄ such that we have 𝕏_A(σ̄) = 𝕏_B(σ̄) = 1?

Game in Figure 3(a).

The answer to Question (⋆) in this game is no. Indeed, if there were such an XRSE σ̄ in this game, then necessarily, following σ̄, both terminal vertices t1 and t2 would be reached with positive probability, and the probability of following the cycle ab forever would be 0 (see Appendix E of the full version). Intuitively, the problem is that such a strategy profile requires randomisation: for both players to get the payoff 1 with positive probability, there must be a random event that decides which of them will actually have that payoff. In the real world, such a situation would be resolved by tossing a coin. In the game, it means that one of the players, say player B, performs a randomised action. But that player can deviate from such a strategy and refuse to leave the cycle. Such refusals form undetectable deviations from the strategy, since player A only sees that player B went from b to a, which can be interpreted as a possible outcome of the expected randomisation. In other words, the player that tosses the coin is the only one that sees the result, and can therefore lie about it. This phenomenon can be avoided only if such randomised choices are made by other "impartial" players or by stochastic vertices.

Game in Figure 3(b).

Consider now a slight modification, shown in Figure 3(b), where the first player to move is determined at random by the edge taken from an initial stochastic vertex. The answer to Question (⋆) for this game is yes. Indeed, consider the pure strategy profile where the strategy of player A maps the history ca to the terminal t1, the strategy of player B maps the history cb to the terminal t2, and both strategies stay in the cycle ab after any other history. This profile is an XRSE: if either player deviates and refuses to leave the cycle, the other one immediately refuses too, making it impossible to gain any benefit from such a deviation. Compared to the previous example, the stochastic vertex c provides the players with a trustable common coin, which they can use to decide which of them will get payoff 1 and which will get payoff 2.

Game in Figure 3(c).

Finally, consider the game depicted in Figure 3(c). Here, along every play, one player is given the opportunity to get payoff 2 by going to the terminal vertex t3. But the answer to Question (⋆) remains yes. Indeed, the pure strategy profile σ̄ where, from vertex d, the token is sent to a and then to b, from which player B leaves to t2, and symmetrically, from vertex e, the token is sent through b to a, from which player A leaves to t1, is an XRSE. For instance, if the player controlling d deviates and goes from d to t3, there is still the play cebat1, generated with positive probability, from which that player cannot deviate without triggering a punishing strategy that would make the deviation non-profitable.

Figure 3: Some games involving two pessimistic players. (a) Two pessimists. (b) Two pessimists with a common coin. (c) Two pessimists with a common coin and some temptation.

We see in this last example that to build an XRSE that generates a given tuple of risk measure values, we need, for each player, one play that anchors that player – i.e., a play in which they get their risk measure as a payoff (the lowest or highest value they obtain with positive probability), and from which they cannot deviate in a profitable way. Then, randomisation (or stochastic vertices) is required to split plays into two or more potential future plays, each anchoring a different subset of players, and memory is required to remember which subset of players is currently being anchored – or whether a player has deviated from the strategy and must be punished.

In line with this intuition, the following theorem bounds the amount of memory required by an XRSE. We further show that there is a succinct encoding of this bounded memory that has size that is polynomial in the number of players and vertices in the game.

Theorem 15.

Let 𝒢_{v_0} be a game with n vertices and p players, let (P,O) be a partition of the set Π, and let σ̄ be an XRSE in 𝒢_{v_0}. Then, there exists a finite-memory XRSE σ̄' with at most 3np − 2n + p + 1 memory states, such that 𝕏(σ̄') = 𝕏(σ̄). Furthermore, if σ̄ is pure, then there is such a strategy profile σ̄' that is pure.

Proof sketch.

We prove this theorem by formalising the idea of anchoring plays. To do so, we define a labelling function Λ that maps each history h compatible with σ̄ to the set of players that are, after the history h, currently being anchored. In Lemma 16, we show the existence of such a labelling with suitable properties. In the sequel, we write z̄ for the tuple of the players' risk measures in the strategy profile σ̄, that is, z̄ = (z_i)_{i∈Π} = 𝕏(σ̄).

Lemma 16 (The labelling Λ).

There exists a labelling Λ that maps each history h ∈ 𝖧𝗂𝗌𝗍(𝒢_{v_0}) compatible with σ̄ to a subset of players Λ(h) ⊆ Π, such that for each such h, writing {v_1, …, v_k} = 𝖲𝗎𝗉𝗉(σ̄(h)), the labelling Λ satisfies the following properties.

  1. If the vertex 𝗅𝖺𝗌𝗍(h) is stochastic, or belongs to some player i ∉ Λ(h), then the sets Λ(hv_1), …, Λ(hv_k) form a partition of Λ(h).

  2. If the vertex 𝗅𝖺𝗌𝗍(h) belongs to some player i ∈ Λ(h), then the sets Λ(hv_1)∖{i}, …, Λ(hv_k)∖{i} form a partition of Λ(h)∖{i}, and i belongs to all the sets Λ(hv_1), …, Λ(hv_k).

  3. For each optimistic player i ∈ Λ(h), we have 𝕏_i(σ̄^h) = z_i.

  4. For each pessimistic player i ∈ Λ(h) and all strategies τ_i of player i, we have 𝕏_i(σ̄_{−i}^h, τ_i) ≤ z_i.

  5. If there is a successor v such that Λ(hv) = Λ(h), then every other successor v' is such that 𝕏_i(σ̄^{hv'}) < z_i for each optimist i ∈ Λ(h), and such that there exists a strategy τ_i with 𝕏_i(σ̄_{−i}^{hv'}, τ_i) > z_i for each pessimist i ∈ Λ(h).

Since Λ labels histories with sets of players, a priori exponentially many subsets of players could occur in the range of Λ. However, for a labelling Λ satisfying Items 1–5 of Lemma 16, we show by an inductive counting argument that there are at most 3p−2 subsets A such that Λ(h) = A for some history h. We use those subsets to create (3p−2)·n memory states, which remember which players are being anchored and which vertex was last visited (to detect deviations). We add one punishing state for each player, used when a deviation by that player is detected, and an initial state, adding up to the desired number. We then construct from Λ an XRSE σ̄' that uses only those memory states.
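The count of memory states then adds up as follows:

\[
\underbrace{(3p-2)\cdot n}_{\text{anchored subset}\,\times\,\text{last vertex}}
\;+\; \underbrace{p}_{\text{punishing states}}
\;+\; \underbrace{1}_{\text{initial state}}
\;=\; 3np - 2n + p + 1 .
\]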

Using Theorem 15, we can show the following lemma.

Lemma 17.

The constrained existence problem of XRSEs is in 𝖭𝖯. The same problem when players are restricted to pure strategies is still in 𝖭𝖯.

Proof.

Let 𝒢_{v_0} be a game, let (P,O) be a partition of Π, and let x̄ and ȳ be threshold vectors. By Theorem 15, if there exists a (pure) XRSE σ̄ with x̄ ≤ 𝕏(σ̄) ≤ ȳ, then there exists one with at most 3np − 2n + p + 1 memory states, where p is the number of players and n is the number of vertices. Such a strategy profile can be guessed in polynomial time. We now show that, once such a finite-memory strategy profile σ̄ is guessed, one can check in polynomial time whether it is an XRSE that satisfies the constraint x̄ ≤ 𝕏(σ̄) ≤ ȳ.

  • First, given σ̄, for each player i, the quantity 𝕏_i(σ̄) can be computed in polynomial time, since this reduces to computing player i's risk measure in the Markov chain induced by σ̄ (which has polynomial size); a sketch of this computation is given after this list.

  • Second, checking that x̄ ≤ 𝕏(σ̄) ≤ ȳ can be done in polynomial time.

  • Third, for each player i, we check that player i has no profitable deviation. This can also be done in polynomial time, by computing the best risk measure player i can achieve in the MDP induced by σ̄_{−i} (which has polynomial size).
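To illustrate the first step, here is a short Python sketch – an illustrative reconstruction, not pseudocode from the paper – of how an extreme risk measure can be computed in a finite Markov chain. Since only the support of the induced probability measure matters, it suffices to know which terminal vertices are reachable with positive probability, and whether non-termination (payoff 0) has positive probability; in a finite chain, the latter holds exactly when some reachable vertex cannot reach any terminal. The example chain mimics the randomised equilibrium of Figure 1(a), with edges and payoffs as reconstructed in the introduction.

def reachable(adj, src):
    """Vertices reachable with positive probability from src in the support graph."""
    seen, stack = {src}, [src]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def extreme_risk(adj, payoff_i, v0, optimist=False):
    """Extreme risk measure of player i in a finite Markov chain.
    adj maps each vertex to its successors in the support of the chain;
    payoff_i maps each reachable terminal to player i's payoff there."""
    reach = reachable(adj, v0)
    values = [payoff_i[t] for t in reach if not adj.get(t)]  # reachable terminals
    # Non-termination has positive probability iff some reachable vertex
    # cannot reach any terminal; an infinite play yields payoff 0.
    if any(all(adj.get(w) for w in reachable(adj, u)) for u in reach):
        values.append(0)
    return max(values) if optimist else min(values)

adj = {"s": ["s", "a"], "a": ["b"], "b": ["c"], "c": ["t3", "t4"]}
payoff_A = {"t3": 1, "t4": -1}
print(extreme_risk(adj, payoff_A, "s"))                 # -1 (pessimistic measure)
print(extreme_risk(adj, payoff_A, "s", optimist=True))  #  1 (optimistic measure)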

5.2 Membership in 𝗡𝗣 with restrictions on strategies

We have shown 𝖭𝖯-membership of the constrained existence problem of XRSEs in the general case. We now consider variants where the space of strategies is restricted to stationary, positional, or pure strategies. We show that the problem remains in 𝖭𝖯 in each of these cases.

Lemma 18.

The constrained existence problem, when all the players are restricted to positional, stationary, or pure strategies, is in 𝖭𝖯.

Proof.

We show that we can still guess a strategy profile and verify in polynomial time whether it is indeed an XRSE. For the positional and stationary cases, guessing a strategy profile is straightforward, since such a profile can always be represented using polynomially many bits; we can then verify in polynomial time that a given strategy profile σ̄ satisfies the constraints and is an XRSE. For pure strategies, memory may be required. But we showed, alongside Theorem 15, that if there is a pure XRSE satisfying the constraints, then there is one with polynomial memory; our result follows.

5.3 𝗡𝗣-hardness

We now prove hardness, for the general setting as well as when strategies are restricted.

Lemma 19.

The constrained existence problem of XRSEs is 𝖭𝖯-hard, even when all players are pessimists and all rewards are nonnegative. It remains 𝖭𝖯-hard when the strategies are restricted to stationary, pure, or positional ones.

Proof sketch.

We reduce the 𝟥𝖲𝖠𝖳 problem to our problem. From a given formula Φ, we construct the game depicted by Figure 4, where all players are pessimistic, and the symbol ∗ is used to mean "every other player". For each i ∈ {1,…,n}, we define two players, one per literal: player x_i and player ¬x_i. We also define a player C_j for each clause C_j. Each of those players controls one vertex, labelled by their name. We show in the complete proof that this game contains a (positional) XRSE in which a designated witness player gets risk measure 2 if and only if Φ is satisfiable.

Intuitively, the game consists of a sequence of gadgets in which a value is given to each variable x_i, depending on whether player x_i takes the edge to the vertex s_{x_i} (variable set to true) or to the vertex s_{¬x_i} (variable set to false). That player cannot randomise between those two edges, because she does not get the same risk measure on both sides, and would therefore have a profitable undetectable deviation. When it is decided that a literal ℓ is true, player ℓ̄ is given the possibility to deviate and get the payoff 2. In the final gadget, each clause player C_j must choose a literal ℓ of the clause C_j that she claims to be true under the valuation defined by the previous gadgets. Then, player ℓ gets risk measure 1: that player will thus have a profitable deviation if and only if he was given the possibility to deviate, i.e., if and only if the literal ℓ was actually set to false.

Figure 4: Construction of a game 𝒢_Φ from a 𝟥𝖲𝖠𝖳 formula Φ.

This lemma, along with Lemmas 17 and 18, proves the following theorem. Observe that our construction does not use any negative rewards; hence, the hardness results hold already for nonnegative rewards, while our membership results hold for any combination of positive and negative rewards.

Theorem 20.

The constrained existence problems for XRSEs, for pure XRSEs, for stationary XRSEs, and for positional XRSEs, are all 𝖭𝖯-complete. The lower bound holds even when all players are pessimistic and all rewards are non-negative.

We will now turn to a last subcase, where our problem turns out to be decidable in deterministic polynomial time.

5.4 Things get easier when everyone is optimistic

Our 𝖭𝖯-hardness results required only pessimistic players. We now show that optimists are easier to deal with. The constrained existence problem of XRSEs becomes 𝖯𝖳𝖨𝖬𝖤-complete when the perceived reward of each player is defined by the risk measure 𝕆𝕄.

We first show the upper bound by giving a polynomial-time algorithm. The intuition is that an optimistic player has a profitable deviation as soon as a vertex is reached, with positive probability, from which that player can get a payoff higher than their risk measure. This is in contrast to pessimists, who have a profitable deviation only when they can avoid getting their risk measure as a payoff in all plays compatible with the strategy profile. Our algorithm iteratively identifies vertices that must never be reached, and then removes from the game the edges that must consequently never be taken.

Lemma 21.

If all players are optimists, then the constrained existence problem of XRSEs is in 𝖯𝖳𝖨𝖬𝖤, and there is an algorithm that decides it in time 𝒪(pm²), where m is the number of edges in 𝒢 and p the number of players. Moreover, the algorithm can be modified to output an XRSE that satisfies the constraints, if one exists, in time 𝒪(pm² + m³). If all the upper thresholds y_i are nonnegative, it takes time 𝒪(pm²).

Proof Sketch..

We want to decide whether there exists an XRSE σ̄ satisfying the constraint x̄ ≤ 𝕏(σ̄) ≤ ȳ. The algorithm deals separately with two cases, which we call the cycle-friendly and the cycle-averse case. In the cycle-friendly case, we have y_i ≥ 0 for all players i; then, an XRSE may have positive probability of reaching no terminal vertex. That is impossible in the cycle-averse case, where there is a player i such that y_i < 0. In this proof sketch, we describe the algorithm only in the cycle-friendly case.

The algorithm constructs a decreasing sequence of sets of edges E_0 ⊇ E_1 ⊇ ⋯ until it reaches a fixed point. For each set E_k, it considers the strategy profile σ̄_{E_k} defined as follows: from each non-stochastic vertex v, when v is seen for the first time, the profile randomises uniformly between all edges vw ∈ E_k; if v is visited again later, it always repeats the same choice. If some player i deviates and takes an edge that they are not supposed to take, then all the players switch to a positional strategy profile designed to minimise player i's risk measure. Such a strategy profile is finite-memory, but requires 2^{|V|}·|V| + p memory states to be represented as a memory structure: we therefore use the set E_k as a succinct representation.

At each iteration k, the algorithm identifies a new set of vertices V_k that must be avoided. This includes the terminals that give some player i a payoff larger than y_i, and the vertices from which some player i can deviate and obtain a value higher than the value 𝕏_i(σ̄_{E_k}) offered by the strategy profile. If it is not possible to avoid reaching the set V_k, the answer No is returned. Otherwise, the set E_{k+1} is obtained from E_k by removing edges so as to ensure that V_k is not reached with positive probability. The algorithm stops when there are no more edges to remove; it then answers Yes if we have 𝕏_i(σ̄_{E_k}) ≥ x_i for each i, and No otherwise.

Each iteration requires time 𝒪(pm) to identify and remove edges. Since there are at most m iterations (at least one edge is removed in each), the algorithm terminates in time 𝒪(pm²).
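The avoidance step of this algorithm can be sketched in Python as follows. This is an illustrative reconstruction of that single step – the computation of the deviation values and of the sets V_k is elided – under the convention that a vertex itself must be avoided when it is stochastic with an edge into the avoid-set, or when all of its edges lead into it.

def remove_to_avoid(edges, stochastic, bad):
    """Remove edges so that the vertices in bad are reached with probability 0.
    edges is a set of pairs (u, v); stochastic is the set of stochastic vertices.
    Returns the new edge set and the closure of the avoid-set; the caller
    answers No when the initial vertex ends up in that closure."""
    avoid = set(bad)
    changed = True
    while changed:
        changed = False
        for u in {u for (u, _) in edges}:
            if u in avoid:
                continue
            succs = {v for (x, v) in edges if x == u}
            # A stochastic vertex takes every edge with positive probability,
            # so one avoided successor suffices; a player's vertex is lost
            # only when all of its successors must be avoided.
            if (u in stochastic and succs & avoid) or succs <= avoid:
                avoid.add(u)
                changed = True
    new_edges = {(u, v) for (u, v) in edges if u not in avoid and v not in avoid}
    return new_edges, avoid

# Forbidding the terminal t4 in (a reconstruction of) the game of Figure 1(a):
edges = {("s", "s"), ("s", "a"), ("a", "b"), ("a", "t1"),
         ("b", "c"), ("b", "t2"), ("c", "t3"), ("c", "t4")}
new_edges, avoid = remove_to_avoid(edges, stochastic={"s"}, bad={"t4"})
print(avoid)                     # {'t4'}: c keeps its edge to t3
print(("c", "t4") in new_edges)  # False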

Finally, we show that the problem is 𝖯𝖳𝖨𝖬𝖤-hard, even when there are only two players.

Lemma 22.

The constrained existence problem of XRSEs with optimistic players is 𝖯𝖳𝖨𝖬𝖤-hard even with only two players.

Proof sketch.

We give a log-space reduction from the problem of deciding two-player zero-sum reachability games, which is known to be 𝖯𝖳𝖨𝖬𝖤-complete [18, Proposition 6].

We can now conclude this section with the following theorem.

Theorem 23.

The constrained existence problem of XRSEs is 𝖯𝖳𝖨𝖬𝖤-complete when all players are optimists, that is, when P = ∅.

6 The existence of extreme risk sensitive equilibria

This section finally answers the classical question about equilibria: are they guaranteed to exist? We show that (stationary) XRSEs exist when all rewards are nonnegative. Although this result is reminiscent of the one we proved for ERSEs (Theorem 11), it requires a different, constructive proof, which we discuss below.

We motivate the constructive algorithm with an example. Consider the game depicted in Figure 3(a), which involves two pessimists, player A and player B. Both players want to leave the cycle, but each of them would prefer that the other player leave. If we first consider the strategy profile that always randomises between all the available edges, then both terminal vertices are reached with positive probability, and it is almost sure that one of them is reached: both players therefore get risk measure 1. Then, player B (and symmetrically player A) has a profitable deviation: refusing to leave the cycle, always going back to the vertex a. Note that player A cannot detect such a deviation, since she does not have access to the internal coins tossed by player B. We then remove the edge bt2 (or at1). This results in a set of edges in which player B gets the payoff 2 and player A cannot get more than 1, ensuring that the new strategy profile we obtain is a (stationary) XRSE.
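With the payoffs of Figure 3(a) as described above (the terminal t1 gives 1 to player A and 2 to player B, while t2 gives 2 to player A and 1 to player B), the run of the construction on this example can be summarised as follows:

\[
E_0 = E:\quad \mathbb{X}_A(\bar\sigma_{E_0}) = \mathbb{X}_B(\bar\sigma_{E_0}) = \min\{1,2\} = 1;
\qquad B\text{ deviates (never takes } b\to t_2\text{)}:\quad \mathbb{X}_B = \mu_B(t_1) = 2 > 1;
\]
\[
E_1 = E_0\setminus\{b\to t_2\}:\quad \mathbb{X}_A(\bar\sigma_{E_1}) = \mu_A(t_1) = 1,
\qquad \mathbb{X}_B(\bar\sigma_{E_1}) = \mu_B(t_1) = 2,
\]

and no profitable deviation remains: the only reachable terminal is t1, and remaining in the cycle forever yields payoff 0.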

Theorem 24.

Let 𝒢_{v_0} be a game with only non-negative rewards, and let (P,O) be a partition of Π. Then, there exists a stationary XRSE in 𝒢_{v_0}. Moreover, there exists an algorithm that, given such a game, outputs the representation of such an XRSE in time 𝒪(m²p), where m is the number of edges, and p the number of pessimistic players.

Proof sketch..

Our algorithm constructs a decreasing sequence E = E_0 ⊇ E_1 ⊇ ⋯ of sets of edges and considers, for each k, the stationary strategy profile that randomises, at every vertex, between all the outgoing edges in E_k. If this strategy profile is not an XRSE, there is a player i who has a profitable deviation. We carefully identify edges used by player i, and remove them. This process always terminates, and the set obtained defines an XRSE.

As in the case of ERSEs, we conjecture that existence – and even existence of a stationary XRSE – remains true in the general case. We remark that even for Nash equilibria, the existence of an NE in such simple stochastic games is known only when all the rewards are nonnegative.

7 Discussion

Our definition of the extreme risk measure opens up several promising directions for future research. One immediate extension of our work would be to study games with more sophisticated objectives, such as mean payoff or discounted sum. Another extension is to study the concurrent version of such games, where players choose actions concurrently rather than in a turn-based setting. Finally, our definition of risk-sensitive equilibria is modelled after Nash equilibria and suffers from several of their limitations. Like Nash equilibria, RSEs allow irrational behaviours when one player deviates and must be punished, as in our game of Figure 3(c). Exploring alternative definitions of RSEs, modelled after other equilibrium concepts better suited to games on graphs [25, Section 7.1], such as subgame-perfect equilibria, could provide a relevant framework for player decision-making.

References

  • [1] Alekh Agarwal, Nan Jiang, Sham M. Kakade, and Wen Sun. Reinforcement Learning: Theory and Algorithms. https://rltheorybook.github.io/, 2021.
  • [2] M. Auer. Hands-On Value-at-Risk and Expected Shortfall: A Practical Primer. Management for Professionals. Springer International Publishing, 2018. URL: https://books.google.at/books?id=4EFKDwAAQBAJ.
  • [3] Christel Baier, Krishnendu Chatterjee, Tobias Meggendorfer, and Jakob Piribauer. Entropic risk for turn-based stochastic games. Information and Computation, 301:105214, 2024. doi:10.1016/j.ic.2024.105214.
  • [4] Hans Wolfgang Brachinger. From variance to value at risk: A unified perspective on standardized risk measures. In Wolfgang Gaul and Hermann Locarek-Junge, editors, Classification in the Information Age, pages 91–99, Berlin, Heidelberg, 1999. Springer Berlin Heidelberg.
  • [5] Richard Bradley. Decision Theory: A Formal Philosophical Introduction, pages 611–655. Springer International Publishing, 2018. doi:10.1007/978-3-319-77434-3_34.
  • [6] Véronique Bruyère, Emmanuel Filiot, Mickael Randour, and Jean-François Raskin. Meet your expectations with guarantees: Beyond worst-case synthesis in quantitative games. Information and Computation, 254:259–295, 2017. SR 2014. doi:10.1016/j.ic.2016.10.011.
  • [7] Nicole Bäuerle and Ulrich Rieder. More risk-sensitive markov decision processes. Mathematics of Operations Research, 39(1):105–120, 2014. doi:10.1287/MOOR.2013.0601.
  • [8] Luca de Alfaro, Thomas A. Henzinger, and Ranjit Jhala. Compositional methods for probabilistic systems. In CONCUR 2001 — Concurrency Theory, pages 351–365, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg. doi:10.1007/3-540-44685-0_24.
  • [9] Jerzy Filar and Lodewijk Kallenberg. Variance-penalized markov decision process. Mathematics of Operations Research, 14, February 1989. doi:10.1287/moor.14.1.147.
  • [10] Hans Föllmer and Alexander Schied. Convex measures of risk and trading constraints. Finance and Stochastics, 6(4):429–447, October 2002. doi:10.1007/s007800200072.
  • [11] Vojtěch Forejt, Marta Kwiatkowska, Gethin Norman, and David Parker. Automated Verification Techniques for Probabilistic Systems, pages 53–113. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011. doi:10.1007/978-3-642-21455-4_3.
  • [12] Hans Föllmer and Alexander Schied. Stochastic Finance: An Introduction in Discrete Time. Walter de Gruyter, Berlin, Boston, 4 edition, 2016.
  • [13] Jorge Gallego-Hernández and Alessio Mansutti. On the existential theory of the reals enriched with integer powers of a computable number. In Olaf Beyersdorff, Michal Pilipczuk, Elaine Pimentel, and Kim Thang Nguyen, editors, 42nd International Symposium on Theoretical Aspects of Computer Science, STACS 2025, March 4-7, 2025, Jena, Germany, volume 327 of LIPIcs, pages 37:1–37:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2025. doi:10.4230/LIPICS.STACS.2025.37.
  • [14] Christoph Haase and Stefan Kiefer. The odds of staying on budget. In Automata, Languages, and Programming, pages 234–246, Berlin, Heidelberg, 2015. Springer Berlin Heidelberg. doi:10.1007/978-3-662-47666-6_19.
  • [15] Ronald A. Howard and James E. Matheson. Risk-sensitive markov decision processes. Management Science, 18(7):356–369, 1972.
  • [16] Ronald A. Howard and James E. Matheson. Risk-sensitive markov decision processes. Management Science, 18(7):356–369, 1972. URL: http://www.jstor.org/stable/2629352.
  • [17] Leonid Hurwicz. Optimality criteria for decision-making under ignorance. Technical report, Cowles Commission for Research in Economics, 1951. URL: https://cowles.yale.edu/sites/default/files/2024-03/s-0355.pdf.
  • [18] Neil Immerman. Number of quantifiers is better than number of tape cells. J. Comput. Syst. Sci., 22:384–406, 1981. doi:10.1016/0022-0000(81)90039-8.
  • [19] Jan Křetínský and Tobias Meggendorfer. Conditional value-at-risk for reachability and mean payoff in markov decision processes. In Proceedings of the 33rd Annual ACM/IEEE Symposium on Logic in Computer Science, LICS ’18, pages 609–618. Association for Computing Machinery, 2018. doi:10.1145/3209108.3209176.
  • [20] Claude Lefèvre. Optimal control of a birth and death epidemic process. Operations Research, 29(5):971–982, 1981. doi:10.1287/OPRE.29.5.971.
  • [21] Shie Mannor and John N. Tsitsiklis. Mean-variance optimization in markov decision processes. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, pages 177–184. Omnipress, 2011. URL: https://icml.cc/2011/papers/156_icmlpaper.pdf.
  • [22] Tobias Meggendorfer. Risk-aware stochastic shortest path. Proceedings of the AAAI Conference on Artificial Intelligence, pages 9858–9867, 2022. doi:10.1609/AAAI.V36I9.21222.
  • [23] John F. Nash. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1):48–49, 1950. doi:10.1073/pnas.36.1.48.
  • [24] Andrzej S. Nowak. Notes on Risk-Sensitive Nash Equilibria, pages 95–109. Birkhäuser Boston, Boston, MA, 2005. doi:10.1007/0-8176-4429-6_5.
  • [25] Martin J. Osborne. An Introduction to Game Theory. Oxford University Press, 2004.
  • [26] Bernardo K. Pagnoncelli, Oscar Dowson, and David P. Morton. Multistage stochastic programs with the entropic risk measure, 2020. URL: https://optimization-online.org/2020/08/7984/.
  • [27] Jakob Piribauer, Ocan Sankur, and Christel Baier. The variance-penalized stochastic shortest path problem. In 49th International Colloquium on Automata, Languages, and Programming, ICALP, volume 229 of LIPIcs, pages 129:1–129:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPICS.ICALP.2022.129.
  • [28] Mickael Randour, Jean-François Raskin, and Ocan Sankur. Variations on the stochastic shortest path problem. In Verification, Model Checking, and Abstract Interpretation, pages 1–18. Springer, 2015. doi:10.1007/978-3-662-46081-8_1.
  • [29] Dawei Shi, Robert J Elliott, and Tongwen Chen. On finite-state stochastic modeling and secure estimation of cyber-physical systems. IEEE Transactions on Automatic Control, 62(1):65–80, 2016. doi:10.1109/TAC.2016.2541919.
  • [30] Florent Teichteil-Königsbuch, Ugur Kuter, and Guillaume Infantes. Incremental plan aggregation for generating policies in MDPs. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Volume 1 - Volume 1, AAMAS ’10, pages 1231–1238. International Foundation for Autonomous Agents and Multiagent Systems, 2010. URL: https://dl.acm.org/citation.cfm?id=1838366.
  • [31] Michael Ummels and Dominik Wojtczak. The Complexity of Nash Equilibria in Stochastic Multiplayer Games. Logical Methods in Computer Science, Volume 7, Issue 3, September 2011. doi:10.2168/LMCS-7(3:20)2011.
  • [32] Abraham Wald. Statistical Decision Functions. John Wiley & Sons, New York, 1950.