
Mapping the Tradeoffs and Limitations of Algorithmic Fairness

Etam Benger, The Hebrew University of Jerusalem, Israel; Katrina Ligett, The Hebrew University of Jerusalem, Israel
Abstract

Sufficiency and separation are two fundamental criteria in classification fairness. For binary classifiers, these concepts correspond to subgroup calibration and equalized odds, respectively, and are known to be incompatible except in trivial cases. In this work, we explore a relaxation of these criteria based on f-divergences between distributions – essentially the same relaxation studied in the literature on approximate multicalibration – analyze their relationships, and derive implications for fair representations and downstream uses (post-processing) of representations. We show that when a protected attribute is determinable from features present in the data, the (relaxed) criteria of sufficiency and separation exhibit a tradeoff, forming a convex Pareto frontier. Moreover, we prove that when a protected attribute is not fully encoded in the data, achieving full sufficiency may be impossible. This finding not only strengthens the case against “fairness through unawareness” but also highlights an important caveat for work on (multi-)calibration.

Keywords and phrases:
Algorithmic fairness, information theory, sufficiency-separation tradeoff
Copyright and License:
© Etam Benger and Katrina Ligett; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Machine learning theory; Mathematics of computing → Information theory; Social and professional topics → Computing / technology policy
Acknowledgements:
We thank Flavio Calmon for an insightful suggestion that inspired the formulation using f-information.
Funding:
This work was supported in part by a gift to the McCourt School of Public Policy and Georgetown University, Simons Foundation Collaboration 733792, Israel Science Foundation (ISF) grant 2861/20, ERC grant 101125913, and a grant from the Israeli Council for Higher Education. Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
Editors:
Mark Bun

1 Introduction

Machine learning algorithms increasingly influence many aspects of our lives – ranging from seemingly minor decisions, such as which ads appear on a website, to critical financial determinations like loan approvals, and even life-altering outcomes in healthcare and the judicial system. Determining whether and how fairness should be taken into consideration in a specific context; examining whether a particular fairness criterion is suitable, aligns with legal standards, or reflects ethical norms; deciding what trade-offs to strike between the many desiderata one might hold – these quandaries cannot be addressed by math alone. What theoretical computer science can do and has been doing for fairness is to reveal what is possible and what is impossible – for example, what notions can be enforced computationally efficiently and when? What tradeoffs between desiderata are fundamental?

A substantial literature has focused on statistical criteria of group fairness, formalizing various notions of equality across groups or attributes. In this vein, a landmark early result of Chouldechova [3] and of Kleinberg, Mullainathan, and Raghavan [11] revealed that multiple natural statistical fairness criteria in classification are often inherently incompatible, except in trivial cases.

In this setting, a classification algorithm takes data of an individual as input (for example, her financial history) and outputs some predicted label (for example, whether she is expected to repay a loan). The utility of a classifier or predictor is measured with respect to the true label of the individual (would she actually eventually repay the loan?). Group fairness notions consider some protected attribute (for example, the individual’s ethnicity); each set of people with a particular value of the protected attribute is sometimes referred to as a group or subgroup. The main statistical criteria of fairness in classification are concerned with the statistical relationships that the algorithm induces between the true label, the algorithm’s output, and the protected attribute.

Independence, also known as statistical parity or demographic parity, requires that the predictor’s outcome be independent of the protected attribute. This notion is very restrictive and has been criticized for its seeming unfairness in settings where there exists a correlation between the true label and the protected attribute (see, for example, [5]). It is not a major focus of our study.

Separation, also known as equalized odds [8], requires that the predictor’s outcome be conditionally independent of the protected attribute, given the true label. This means that the distribution of outcomes should be the same across different groups, as long as individuals share the same true label. When the label and the prediction are both binary, this criterion is equivalent to asking for the same true positive rate and true negative rate across the different protected groups. In other words, it guarantees that an imperfect classifier would not err in either direction more on one subgroup than on others.

Sufficiency requires that the protected attribute and the true label be conditionally independent, given the classifier’s prediction. This criterion requires that, among individuals that are given the same prediction, there would be no correlation between the true label and membership in a protected group. In other words, given the outcome of the classifier for an individual, the probability that she has a certain true label should be the same, regardless of her protected attribute. When the true label is binary, sufficiency is equivalent to calibration within subgroups [1]; roughly speaking, a predictor is calibrated if its prediction can be interpreted as the probability of the true label being positive.

1.1 Our contribution

As mentioned above, it is known that separation and sufficiency are often incompatible (although, as we show in Appendix A, they are sometimes compatible in non-trivial scenarios). A common consequence of this impossibility is that researchers and practitioners have been told that they must choose one or the other – separation or sufficiency – but not both. In this work, we propose a relaxation of these statistical criteria of fairness (Section 3.2), study the combinations of these relaxed criteria that are achievable (Lemma 10 and Theorem 12), and show that, in certain scenarios, there exists a continuous, meaningful tradeoff between the relaxed notions, forming a Pareto frontier (Theorem 14). We explore the relationships between the relaxed criteria, see that full sufficiency is not always achievable (Theorem 15), and draw conclusions regarding the post-processing of fair representations – post-processing cannot degrade our notion of approximate separation, but there are conditions on representations under which full sufficiency cannot be recovered by any post-processing (Theorem 17). These results enrich our understanding of the possible fairness tradeoffs that can be achieved, giving policymakers and domain experts a richer design space from which to select.

1.2 Related Work

There has been substantial work on relaxations of the statistical criteria of group fairness. Some approaches weaken the criteria in some non-continuous fashion. This includes, for example, relaxed equalized odds [12] (weakening separation) and equal precision [3] (weakening sufficiency), both of which are motivated by the incompatibility results, in an attempt to find weaker fairness notions that are compatible with each other. A major motivation for equal opportunity [8] (a relaxation of separation), in contrast, is compatibility with higher prediction accuracy than the stronger notion of equalized odds.

More continuous relaxations have primarily appeared in the contexts of learning and optimization. They typically involve the definition of some approximation or error term that quantifies how much a predictor violates the desired criterion, and then use this term either as an objective or as a penalty or regularization term in the learning objective of the predictor; such work has not focused on the compatibilities that such relaxations enable. Examples include Zafar et al. [16] and Woodworth et al. [15], who propose moment-based approximations in order to solve the intractability of the optimization problem, as well as Zemel et al. [17] and Bechavod and Ligett [2], who define a regularization term in order to achieve independence and separation, respectively. Of particular relevance to our work are Kashima et al. [10], who use mutual information as a regularization term for learning classifiers that satisfy independence, and Gopalan et al. [6], who define the objective of multicalibration [9] in terms of average conditional covariance (as discussed in Section 3.3).

Perhaps most closely related to our work is the analysis of Hamman and Dutta [7], who adopt an information-theoretic perspective with very similar definitions to ours. However, their analysis – based on partial information decomposition – focuses only on scenarios where at least one of the fairness criteria is fully met.

2 Preliminaries

Statistical notions of fairness focus on the aggregate properties of a population rather than of any particular individual. The literature on fair classification typically considers the relationships between three variables when defining statistical criteria for fairness: the true label (or target variable) Y, the protected attribute (or subgroup) A, and the predictor’s outcome R. In principle, all relationships between these variables are possible (see Figure 1(a)), and various fairness notions can be seen as restrictions on the relationships between these variables.

Interestingly, the data upon which the predictor makes the prediction (or is learned), X, is usually not explicitly modeled. In the multicalibration literature [9], however, the data does play a central role, as all the variables Y, A, and R are defined as (possibly stochastic) functions of X (see Figure 1(b)). If this is the case, not all relationships between Y, A, and R are possible. In particular, under this model, A ⊥ Y | X, meaning that all the information about the true label that is encoded in the protected attribute is also encoded in the data X.

In this work we consider a model that generalizes these two approaches: we explicitly include the data X in our model, but we do not assume that A and Y are functions only of X, thus allowing for more complex relationships between them (see Figure 1(c)). This model can capture any joint distribution P(X,Y,A) without any further assumptions, with R determined exclusively by the conditional distribution P(R|X).

Figure 1: Diagrams of the relationships between the variables in models of fair classification: (a) a model where the data X is implicit and all relationships are permissible; (b) all variables are functions of X; (c) our model, where R is a function of X, but otherwise, all the relationships between X, Y and A are possible.

Moreover, we prefer to view R as a representation of the data X – either an explicit representation that can be passed to some end-user, or an inner state of a model – and not necessarily as the final prediction or classification, which can be made upon this representation. This view allows for greater flexibility and raises interesting points with respect to post-processing.

2.1 Statistical Criteria for Fair Classification

As mentioned above, the main statistical criteria for fair classification are defined in terms of statistical independence of the three variables Y, A and R:

Definition 1 (Statistical criteria for fair classification).

Let Y, A and R be jointly distributed random variables, representing the true label, a protected attribute and the predictor’s outcome, accordingly. Then,

Independence

requires the predictor’s outcome to be independent of the protected attribute: R ⊥ A. Specifically, for all r and a, P(r|a) = P(r).

Separation

requires the predictor’s outcome to be conditionally independent of the protected attribute, given the value of the true label: R ⊥ A | Y. Specifically, for all r, a and y, P(r|a,y) = P(r|y).

Sufficiency

requires the true label and the protected attribute to be conditionally independent of each other, given the predictor’s outcome: Y ⊥ A | R. Specifically, for all y, a and r, P(y|a,r) = P(y|r).

In this work we focus on separation and sufficiency, as independence is usually too restrictive.

2.2 Notation

Throughout this paper, we use capital letters for random variables, their respective lower case letters for their realizations, and script letters for the alphabets, as in X = x ∈ 𝒳. The notation P(X|y) is short for P(X|Y=y). For simplicity, we assume that all variables are finite; if, for example, R naturally takes values in [0,1], we consider a suitable quantization.

3 Relaxing the Statistical Criteria of Fairness

A notable result in algorithmic fairness due to Chouldechova [3] and Kleinberg, Mullainathan, and Raghavan [11] states that for binary classification, separation and sufficiency cannot simultaneously hold except in trivial cases – either when Y and A are independent (meaning that different groups share the same base rate for the true label, eliminating any inherent fairness concerns), or when the predictor R perfectly predicts Y. More generally, they show that if Y and A are not independent and the joint distribution of Y, A, and R has full support (meaning that every combination of these variables has nonzero probability), then sufficiency and separation are fundamentally incompatible. (In Appendix A we show that there are non-trivial (though limited) settings where separation and sufficiency are compatible.)

Considering this incompatibility, we may look for suitable relaxations of the notions of separation and sufficiency, and analyze the relationships between the relaxed criteria. Since both criteria are defined in terms of statistical independence, one natural relaxation is to quantify the deviation from independence using (conditional) mutual information. We adopt a slightly more general approach, using f-information – a measure of statistical association based on f-divergences, which generalizes the well-established concept of mutual information.

3.1 𝒇-Divergences and 𝒇-Information

This section provides the essential background on f-divergences and f-information, including definitions and key properties.

Definition 2 (f-Divergence).

Let f : (0,∞) → ℝ be a convex function with f(1) = 0, and let P and Q be two probability distributions over the same space. The f-divergence of P from Q is defined as

D_f[P‖Q] := 𝔼_Q f(dP/dQ) = ∑_u Q(u) f(P(u)/Q(u)),

where the second equality holds for discrete distributions, with the conventions that f(0) = lim_{t→0⁺} f(t), 0·f(0/0) = 0, and 0·f(c/0) = lim_{t→0⁺} t·f(c/t) for c > 0. For a complete definition and analysis, see [13, Chapter 7].

Some important examples of f-divergences are the Kullback-Leibler (KL) divergence, taking f(t) = t log t to obtain D_KL[P‖Q] = 𝔼_P log(P/Q) (note that the expectation is over P); the total variation (TV), with f(t) = ½|t − 1|, yielding TV(P,Q) = ½‖P − Q‖₁; and the χ²-divergence, with f(t) = (t − 1)². The Rényi divergences, although not f-divergences, are closely related and share many of their properties.
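For discrete distributions, these examples can be computed directly from the definition. The following is a minimal sketch (our own illustration, not code from the paper; the function names and the distributions are hypothetical):

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f[P||Q] = sum_u q(u) f(p(u)/q(u)) for discrete distributions.

    Assumes p(u) = 0 wherever q(u) = 0, so the 0*f(c/0) convention never
    kicks in; cells with p(u) = q(u) = 0 contribute 0 and are skipped.
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = q > 0
    return float(np.sum(q[m] * f(p[m] / q[m])))

# Generators of the three examples from the text
f_kl = lambda t: np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)), 0.0)  # t log t
f_tv = lambda t: 0.5 * np.abs(t - 1.0)                                      # |t - 1| / 2
f_chi2 = lambda t: (t - 1.0) ** 2                                           # (t - 1)^2

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
d_kl = f_divergence(p, q, f_kl)
d_tv = f_divergence(p, q, f_tv)   # equals 0.5 * ||p - q||_1
```

Note how the TV value agrees with the ½‖P − Q‖₁ formula, and how any divergence of a distribution from itself is zero.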

The following is a useful property of f-divergences.

Proposition 3.

[13, Theorem 7.5] For any f-divergence D_f, D_f[P‖Q] ≥ 0. If f is strictly convex at 1, then D_f[P‖Q] = 0 iff P = Q.

In all the examples above, f is indeed strictly convex at 1.

Inspired by the definition of mutual information, a similar measure of association between random variables can be defined using f-divergences:

Definition 4 (f-Information).

Let U and V be random variables with joint distribution P(U,V) and respective marginals P(U) and P(V). The f-information between U and V is defined as

I_f(U;V) := D_f[P(U,V) ‖ P(U)P(V)],

that is, the f-divergence between their joint distribution and the product distribution of their marginals.

Let W be another jointly distributed random variable. Then, the conditional f-information between U and V given W is defined as

I_f(U;V|W) := 𝔼_W D_f[P(U,V|W) ‖ P(U|W)P(V|W)] = ∑_w P(w) D_f[P(U,V|w) ‖ P(U|w)P(V|w)].

For the KL divergence, f-information is exactly Shannon’s mutual information. For TV, I_TV is closely related to the covariance (see Proposition 8).
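As a quick illustration (our own sketch, not code from the paper), f-information can be computed from a joint probability table by comparing it against the product of its marginals:

```python
import numpy as np

def f_information(P_uv, f):
    """I_f(U;V) = D_f[P(U,V) || P(U)P(V)] for a discrete joint table P_uv."""
    P_uv = np.asarray(P_uv, float)
    prod = P_uv.sum(axis=1, keepdims=True) * P_uv.sum(axis=0, keepdims=True)
    m = prod > 0
    return float(np.sum(prod[m] * f(P_uv[m] / prod[m])))

f_kl = lambda t: np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)), 0.0)

indep = np.array([[0.25, 0.25], [0.25, 0.25]])   # U and V independent
coupled = np.array([[0.5, 0.0], [0.0, 0.5]])     # U = V, a fair bit
i0 = f_information(indep, f_kl)    # independence gives zero information
i1 = f_information(coupled, f_kl)  # one full bit, i.e. ln 2 in nats
```

With the KL generator this is exactly Shannon’s mutual information, consistent with Proposition 5: zero iff independent (for f strictly convex at 1).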

The following property of f-information is an immediate consequence of Proposition 3 and the definition of statistical independence.

Proposition 5 (f-Information and independence).

If f is strictly convex at 1, then

I_f(U;W) = 0 ⟺ U ⊥ W,
I_f(U;W|V) = 0 ⟺ U ⊥ W | V.

(If f is not strictly convex at 1, then independence still implies zero information, but not vice versa.)

We highlight another important property of f-information – the data processing inequality (DPI):

Proposition 6 (Data processing inequality).

[13, Theorem 7.16] Let random variables U → V → W form a Markov chain in that order, meaning that P(U|V,W) = P(U|V). Then I_f(U;W) ≤ I_f(U;V).

3.2 Relaxing the Fairness Criteria

In light of Proposition 5, the statistical criteria for fair classification, which are defined in terms of conditional independence, can be relaxed using the notion of f-information.

Definition 7 (Deviations from fairness criteria).

Let Y, A and R be jointly distributed random variables, and let If be an f-information. We define

Deviation from Independence

as Δ_ind(R) := I_f(A;R),

Deviation from Separation

as Δ_sep(R) := I_f(A;R|Y), and

Deviation from Sufficiency

as Δ_suff(R) := I_f(A;Y|R).

Each of these quantities measures to what extent the joint distribution of Y, A and R diverges from the statistical independence that defines the respective criterion, where a value of zero means that the criterion is fully satisfied. When the specific f-information is of relevance, we may write Δ_sep^f for clarity. We sometimes refer to the original statistical criteria as, e.g., “full separation” and “full sufficiency” to distinguish them from these relaxed notions.
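To make Definition 7 concrete, here is a small self-contained sketch (our own illustration; the joint table is hypothetical) that evaluates the three deviations from a joint table P[a, y, r], using KL-based f-information:

```python
import numpy as np

f_kl = lambda t: np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)), 0.0)

def f_info(P2):  # I_f between the two axes of a 2-D joint table
    prod = P2.sum(axis=1, keepdims=True) * P2.sum(axis=0, keepdims=True)
    m = prod > 0
    return float(np.sum(prod[m] * f_kl(P2[m] / prod[m])))

def cond_f_info(P3):  # I_f(U;V|W) for a table with axes (U, V, W)
    out = 0.0
    for w in range(P3.shape[2]):
        pw = P3[:, :, w].sum()
        if pw > 0:
            out += pw * f_info(P3[:, :, w] / pw)  # P(w) * I_f(U;V | W=w)
    return out

# Hypothetical joint distribution P[a, y, r]
P = np.array([[[0.15, 0.05], [0.05, 0.15]],
              [[0.05, 0.20], [0.10, 0.25]]])

d_ind = f_info(P.sum(axis=1))               # Δ_ind = I_f(A;R)
d_sep = cond_f_info(P.transpose(0, 2, 1))   # Δ_sep = I_f(A;R|Y)
d_suff = cond_f_info(P)                     # Δ_suff = I_f(A;Y|R)
```

Swapping in a different generator f changes the divergence but not the structure of the computation.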

An additional relevant quantity that is closely related to the deviations above is I_f(R;Y), which expresses the amount of “information” that the predictor R conveys about the true label Y. This can be seen as a measure of the predictor’s overall performance, similar to its accuracy. It can be shown that maximizing I(R;Y) (using KL divergence) is equivalent to minimizing the expected cross-entropy loss, −𝔼_{Y,R} log P(Y|R), which is a common practice in the optimization of classifiers.¹

¹ Note that I(R;Y) = H(Y) − H(Y|R) = −𝔼_Y log P(Y) + 𝔼_R 𝔼_{Y|R} log P(Y|R), where the first term, H(Y), is the entropy of Y and does not depend on R, and the second term is the negative of the expected cross-entropy loss (see, e.g., [4]).

3.3 Interpreting the Relaxed Criteria

The definition of the relaxed criteria in terms of f-information (Definition 7) may not seem intuitive at first. However, applying Bayes’ rule and some algebra transforms these expressions into a form that more closely resembles the original definitions of the fairness criteria. Consider, for example, the deviation from separation:

Δ_sep(R) = I_f(A;R|Y) = 𝔼_Y D_f[P(A,R|Y) ‖ P(A|Y)P(R|Y)]
= ∑_y P(y) ∑_{a,r} P(a|y) P(r|y) f( P(a,r|y) / (P(a|y) P(r|y)) )
= ∑_y P(y) ∑_a P(a|y) ∑_r P(r|y) f( P(r|a,y) / P(r|y) )
= ∑_{a,y} P(a,y) ∑_r P(r|y) f( P(r|a,y) / P(r|y) )
= 𝔼_{A,Y} D_f[P(R|A,Y) ‖ P(R|Y)].

This expression represents the expected divergence between the conditional distribution of R given both A and Y, and its conditional distribution given only Y. In other words, it quantifies how much, on average (over the values of A and Y), the predictor’s outcome distribution within the subgroups differs from the average in the population, among individuals that share the same true label.

Similarly, for the deviation from sufficiency we have

Δ_suff(R) = I_f(A;Y|R) = 𝔼_{A,R} D_f[P(Y|A,R) ‖ P(Y|R)], (1)

that is, the average difference between the true label distribution within the subgroups and its average in the population, among individuals that share the same predicted value.

One potential drawback of using f-information to define relaxed fairness criteria is its average-case nature. In particular, smaller subgroups – especially those where certain labels or prediction values are rare – contribute less to the deviation. One could argue that this is an unfair approach to fairness. In any case, some of our results treat scenarios where one or more of the deviations is zero, in which case taking the average or the maximum yields the same result.

To address this drawback, some works instead use a maximum-based approach to defining relaxed group fairness criteria (see, for example, [15]). However, placing excessive weight on rare examples can introduce statistical biases and negatively impact the learning process of fair predictors and representations. This very concern has led the multicalibration literature [9, 6] to adopt a relaxed objective that is essentially equivalent to TV-information, as we demonstrate next.

Following the definition in [6], a predictor R is said to be α-approximate subgroup calibrated with respect to A if it satisfies 𝔼_R |Cov[A,Y | R]| ≤ α.

Proposition 8.

Let A, Y and R be jointly distributed random variables, and assume that A and Y take values in {0,1}. Then R is α-approximate subgroup calibrated with respect to A iff Δ_suff^TV(R) ≤ 2α.

The proof, provided in Appendix B, relies on the fact that in the binary case, 𝔼_R |Cov[A,Y | R]| = ½ Δ_suff^TV(R). More generally, for A and Y that take on numeric values, denote the ranges ΔY = max Y − min Y and ΔA = max A − min A. Then we have 𝔼_R |Cov[A,Y | R]| ≤ ½ ΔA · ΔY · Δ_suff^TV(R).

Finally, Pinsker’s inequality [13, Theorem 7.10] states that TV(P,Q) ≤ √(½ D_KL[P‖Q]) (when the KL divergence is taken with the natural logarithm), meaning that using the KL divergence for the deviation dominates the use of total variation. In particular, this gives 𝔼_R |Cov[A,Y | R]| ≤ ΔA · ΔY · √(⅛ Δ_suff^KL(R)).
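The binary identity behind the proof of Proposition 8, 𝔼_R |Cov[A,Y | R]| = ½ Δ_suff^TV(R), and the Pinsker-based bound can both be sanity-checked numerically. The sketch below is our own (not the paper’s code), over a hypothetical binary joint table, with ΔA = ΔY = 1:

```python
import numpy as np

def tv_suff(P):
    """Deviation from sufficiency with TV; P has axes (A, Y, R)."""
    out = 0.0
    for r in range(P.shape[2]):
        pr = P[:, :, r].sum()
        j = P[:, :, r] / pr                      # P(A, Y | r)
        prod = j.sum(axis=1, keepdims=True) * j.sum(axis=0, keepdims=True)
        out += pr * 0.5 * np.abs(j - prod).sum() # P(r) * TV distance
    return out

def kl_suff(P):
    """The same quantity with KL divergence (natural logarithm)."""
    out = 0.0
    for r in range(P.shape[2]):
        pr = P[:, :, r].sum()
        j = P[:, :, r] / pr
        prod = j.sum(axis=1, keepdims=True) * j.sum(axis=0, keepdims=True)
        m = j > 0
        out += pr * float(np.sum(j[m] * np.log(j[m] / prod[m])))
    return out

def avg_abs_cov(P):
    """E_R |Cov[A, Y | R]| for binary A and Y (values in {0, 1})."""
    out = 0.0
    for r in range(P.shape[2]):
        pr = P[:, :, r].sum()
        j = P[:, :, r] / pr
        out += pr * abs(j[1, 1] - j[1, :].sum() * j[:, 1].sum())
    return out

# Hypothetical binary joint P[a, y, r]
P = np.array([[[0.15, 0.05], [0.05, 0.15]],
              [[0.05, 0.20], [0.10, 0.25]]])
lhs = avg_abs_cov(P)   # equals 0.5 * tv_suff(P); at most sqrt(kl_suff(P) / 8)
```

The first check is the exact binary identity; the second follows from Pinsker together with Jensen’s inequality applied over R.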

4 Main Results

Given that the criteria of full separation and full sufficiency cannot always be satisfied simultaneously, our primary goal is to address the following question:

For any given setting (defined by X, Y, and A), what are the best combinations of deviations from sufficiency and separation that are achievable?

This question leads us to formalize the following definition:

Definition 9 (The separation-sufficiency function).

Let X, Y and A be jointly distributed finite random variables. We define the separation-sufficiency function as

Φ(ξ) := inf{ Δ_suff(R) : R is finite, (A,Y) → X → R is a Markov chain, and Δ_sep(R) = ξ },

for all ξ such that the set above is not empty.

Since Δ_suff(R) = I_f(A;Y|R), we know from Proposition 3 that the separation-sufficiency tradeoff function Φ is nonnegative. In what follows, we prove that Φ is attained as a minimum and is continuous and convex. Moreover, we show that in certain settings it is monotonically nonincreasing, and in others it is bounded away from zero – two properties with important implications.

4.1 The Achievable Region and the Separation-Sufficiency Curve

Consider any joint distribution P(X,Y,A). Any choice of a predictor R – that is, a choice of a conditional distribution P(R|X) – corresponds to a point in the nonnegative octant representing the triplet (Δ_sep(R), Δ_suff(R), I_f(R;Y)). We refer to the set of all such possible points as the achievable region. We first show that the achievable region (as illustrated in Figure 2) is convex and compact.

Lemma 10 (Convexity of the achievable region).

Let X, Y and A be jointly distributed finite random variables. The associated achievable region, that is, the set

𝒮 := {(Δ_sep(R), Δ_suff(R), I_f(R;Y)) | R is finite and (A,Y) → X → R is a Markov chain} ⊆ ℝ_{≥0}³

is convex and compact.

Proof.

Denote by 𝒫(𝒳) the set of all probability distributions over the alphabet of X. For all Q ∈ 𝒫(𝒳), define the following functions:

𝝃(Q) := ∑_{a,y} P(a|y) (∑_x P(y|x) Q(x)) · f( (∑_x P(a,y|x) Q(x)) / (P(a|y) ∑_x P(y|x) Q(x)) ),
𝜼(Q) := ∑_{a,y} (∑_x P(a|x) Q(x)) (∑_x P(y|x) Q(x)) · f( (∑_x P(a,y|x) Q(x)) / ((∑_x P(a|x) Q(x)) (∑_x P(y|x) Q(x))) ),
𝜻(Q) := ∑_y P(y) · f( (∑_x P(y|x) Q(x)) / P(y) ).

Define the function F : 𝒫(𝒳) → 𝒫(𝒳) × ℝ³ as F(Q) := (Q, 𝝃(Q), 𝜼(Q), 𝜻(Q)), and let 𝒞 := co F(𝒫(𝒳)) be the convex hull of the image of F in 𝒫(𝒳) × ℝ³. Finally, define 𝒞_{P(X)} := {(ξ, η, ζ) | (P(X), ξ, η, ζ) ∈ 𝒞} ⊆ ℝ³, that is, the “slice” of 𝒞 that corresponds to the marginal of X in the first coordinate.

Since X is finite, 𝒫(𝒳) is finite-dimensional and compact, and since F is continuous, the set 𝒞 is both convex and compact. Consequently, so is 𝒞_{P(X)}, as an intersection with a hyperplane.

We will now show that 𝒞_{P(X)} = 𝒮, that is, the achievable region of P(X,Y,A). Indeed, if (ξ, η, ζ) ∈ 𝒞_{P(X)}, then by its definition and the definition of a convex hull, there exist k ∈ ℕ, Q₁, …, Q_k ∈ 𝒫(𝒳) and α₁, …, α_k > 0 with ∑_{r=1}^k α_r = 1, such that

P(X) = ∑_{r=1}^k α_r Q_r(X),  ξ = ∑_{r=1}^k α_r 𝝃(Q_r),  η = ∑_{r=1}^k α_r 𝜼(Q_r),  ζ = ∑_{r=1}^k α_r 𝜻(Q_r). (2)

Define a predictor R ∈ [k] according to the conditional distribution P(r|x) = P(R=r|X=x) := α_r Q_r(x) / P(x), for all x ∈ 𝒳 and r ∈ [k]. This distribution is well-defined, since by (2), ∑_{r=1}^k P(r|x) = ∑_{r=1}^k α_r Q_r(x) / P(x) = P(x) / P(x) = 1. In addition, we have P(r) = ∑_x P(x) P(r|x) = α_r ∑_x Q_r(x) = α_r, and therefore by Bayes’ rule, P(x|r) = Q_r(x). Moreover, (A,Y) → X → R clearly form a Markov chain, because R is defined in terms of X alone. Now, substituting α_r and Q_r in (2), it can be shown that ξ = I_f(A;R|Y) = Δ_sep(R), η = I_f(A;Y|R) = Δ_suff(R) and ζ = I_f(R;Y) (see Appendix C for a detailed derivation).

Conversely, if there exists a finite random variable R such that (A,Y) → X → R form a Markov chain, then following the approach in Appendix C, we get that Δ_sep(R) = I_f(A;R|Y) = ∑_r P(r) 𝝃(P(X|r)), Δ_suff(R) = I_f(A;Y|R) = ∑_r P(r) 𝜼(P(X|r)) and I_f(R;Y) = ∑_r P(r) 𝜻(P(X|r)). In other words, (P(X), Δ_sep(R), Δ_suff(R), I_f(R;Y)) is a convex combination of the points (P(X|r), 𝝃(P(X|r)), 𝜼(P(X|r)), 𝜻(P(X|r))) ∈ F(𝒫(𝒳)), with the weights given by P(R), and thus by its definition, (P(X), Δ_sep(R), Δ_suff(R), I_f(R;Y)) ∈ 𝒞. Consequently, (Δ_sep(R), Δ_suff(R), I_f(R;Y)) ∈ 𝒞_{P(X)}.

It is worth noting that the definition of the achievable region can be extended to include Δ_ind(R) = I_f(A;R), while preserving its convexity and compactness. The proof is a straightforward extension of the arguments above.
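The construction at the heart of the proof – the predictor P(r|x) = α_r Q_r(x)/P(x) built from a convex decomposition of P(X) – is easy to verify numerically. A small sketch (our own; the mixture below is hypothetical):

```python
import numpy as np

# A hypothetical mixture satisfying P(X) = sum_r alpha_r Q_r(X), as in the proof
alpha = np.array([0.3, 0.7])
Q = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5]])           # Q[r] is a distribution over the alphabet of X
P_x = alpha @ Q                           # the induced marginal of X

# The predictor from the proof: P(r|x) = alpha_r * Q_r(x) / P(x)
P_r_given_x = alpha[:, None] * Q / P_x[None, :]

P_r = (P_r_given_x * P_x[None, :]).sum(axis=1)           # P(r) = sum_x P(x) P(r|x)
P_x_given_r = P_r_given_x * P_x[None, :] / P_r[:, None]  # Bayes' rule
```

As the proof asserts, the conditional sums to one for each x, P(r) recovers α_r, and P(x|r) recovers Q_r(x).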

The connection between the achievable region and the separation-sufficiency function is emphasized in the following corollary:

Corollary 11.

The separation-sufficiency function Φ is a continuous convex curve, and it is attained as a minimum.

Proof.

Let 𝒮̄ = {(ξ, η) | (ξ, η, ζ) ∈ 𝒮 for some ζ}, that is, the projection of the achievable region 𝒮 onto the first two coordinates, which correspond to Δ_sep(R) and Δ_suff(R). By Lemma 10, 𝒮 is convex and compact, implying the same for the projection 𝒮̄. The claim now follows from the observation that the curve Φ(ξ) is the lower boundary of 𝒮̄.

The next result further characterizes the function Φ and establishes its domain.

Theorem 12.

Let X, Y and A be jointly distributed finite random variables and denote by

𝒟 = {Δ_sep(R) | R is finite and (A,Y) → X → R is a Markov chain}

the set of all achievable values of deviation from separation, corresponding to all possible predictors. Then

𝒟 = [0, I_f(A;X|Y)].

For the proof, we use the following bound:

Proposition 13.

Let X, Y, A and R be jointly distributed random variables, such that (A,Y) → X → R form a Markov chain. Then Δ_sep(R) = I_f(A;R|Y) ≤ I_f(A;X|Y).

Proof.

This follows from the data processing inequality applied to the Markov chain A → X → R conditioned on each value of Y, which gives I_f(A;R|y) ≤ I_f(A;X|y) for every y.

Proof of Theorem 12.

Since Δ_sep(R) = I_f(A;R|Y), Proposition 3 implies that all ξ ∈ 𝒟 satisfy ξ ≥ 0. Let R₀ ≡ 0 be a constant predictor, that is, P(R₀ = 0) = 1. Then R₀ meets the Markov condition and Δ_sep(R₀) = I_f(A;R₀|Y) = 0. Therefore, the bound ξ ≥ 0 is tight.

On the other hand, by Proposition 13, all ξ ∈ 𝒟 satisfy ξ ≤ I_f(A;X|Y). Let R_id = X be the identity predictor. Then R_id satisfies the Markov condition and Δ_sep(R_id) = I_f(A;R_id|Y) = I_f(A;X|Y). Therefore, the bound ξ ≤ I_f(A;X|Y) is also tight.

Finally, since 𝒟 is the projection of the achievable region 𝒮 onto the first coordinate, and since 𝒮 is convex and compact by Lemma 10, 𝒟 must be a closed interval in ℝ. Consequently, following the bounds above, 𝒟 = [0, I_f(A;X|Y)].

The proof shows that the predictors R₀ and R_id attain the left and right bounds of 𝒟, respectively. Since we have Δ_suff(R₀) = I_f(A;Y|R₀) = I_f(A;Y) and Δ_suff(R_id) = I_f(A;Y|R_id) = I_f(A;Y|X), and the achievable region 𝒮 is convex, we conclude that the line segment between the points (0, I_f(A;Y)) and (I_f(A;X|Y), I_f(A;Y|X)) is contained in 𝒮̄, the projection of 𝒮 onto the first two coordinates (see Figure 2(a)). In particular, this implies the following bounds on the right endpoint of Φ:

0 ≤ Φ(I_f(A;X|Y)) ≤ I_f(A;Y|X). (3)
Figure 2: The achievable region: (a) A schematic illustration of the achievable region 𝒮 (blue; in fact, what is depicted is the projection of 𝒮 onto the first two coordinates, corresponding to Δ_sep(R) and Δ_suff(R)), as well as the separation-sufficiency curve Φ (orange). Note that, in general, I_f(A;Y) is not necessarily larger than I_f(A;Y|X). (b) When A and Y are conditionally independent given X, i.e., I_f(A;Y|X) = 0, the curve Φ is monotonically nonincreasing, attains the value zero, and represents a Pareto frontier of fairness, reflecting the tradeoff between separation and sufficiency.
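The endpoint predictors R₀ and R_id, and the segment between them, can be checked numerically using Shannon mutual information. In the sketch below (our own illustration over a hypothetical joint P[a, y, x]), the mixture predictor outputs X with probability λ and a fresh constant symbol otherwise, and its deviations land exactly on the segment:

```python
import numpy as np

def mi(P2):  # mutual information (nats) between the two axes of a joint table
    prod = P2.sum(axis=1, keepdims=True) * P2.sum(axis=0, keepdims=True)
    m = P2 > 0
    return float(np.sum(P2[m] * np.log(P2[m] / prod[m])))

def cmi(P3):  # I(U;V|W) for a table with axes (U, V, W)
    return sum(P3[:, :, w].sum() * mi(P3[:, :, w] / P3[:, :, w].sum())
               for w in range(P3.shape[2]) if P3[:, :, w].sum() > 0)

# Hypothetical joint P[a, y, x]
P = np.array([[[0.10, 0.05, 0.05], [0.02, 0.08, 0.10]],
              [[0.05, 0.10, 0.02], [0.08, 0.15, 0.20]]])

suff_R0 = mi(P.sum(axis=2))             # R0 constant: (Δ_sep, Δ_suff) = (0, I(A;Y))
sep_Rid = cmi(P.transpose(0, 2, 1))     # R = X: Δ_sep = I(A;X|Y)
suff_Rid = cmi(P)                       #        Δ_suff = I(A;Y|X)

lam = 0.4                               # output X w.p. lam, else a distinct constant
P_mix = np.concatenate([lam * P, (1 - lam) * P.sum(axis=2)[:, :, None]], axis=2)
sep_mix = cmi(P_mix.transpose(0, 2, 1))
suff_mix = cmi(P_mix)
```

The mixture gives sep_mix = λ·I(A;X|Y) and suff_mix = λ·I(A;Y|X) + (1−λ)·I(A;Y), i.e., exactly the convex combination of the two endpoints.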

4.2 The Separation-Sufficiency Tradeoff

If both A and Y are determined by (possibly stochastic) functions of X, then all the information about the true label that is encoded in the protected attribute is also encoded in the data X. This assumption is common in the multicalibration literature, where A denotes either a subset of 𝒳 [9] or, more generally, a real-valued function of X [6]. It is equivalent to saying that Y → X → A form a Markov chain, that A ⊥ Y | X, or – most importantly for our analysis – that I_f(A;Y|X) = 0. In this case, we say that the data itself satisfies sufficiency.

In this specific setting, as we prove in the following theorem, the separation-sufficiency curve is monotonically nonincreasing, and thus represents a Pareto frontier of fairness, reflecting the fundamental tradeoff between separation and sufficiency (see Figure 2(b) for a schematic, and the left panel of Figure 3 for a simulation).

Theorem 14 (The separation-sufficiency tradeoff).

Let X, Y and A be jointly distributed finite random variables, such that A ⊥ Y | X. Then the separation-sufficiency curve Φ(ξ) is convex, nonincreasing in ξ, and attains the value zero. In particular, there exist predictors that satisfy sufficiency.

Proof.

From Corollary 11 and Theorem 12, we know that Φ(ξ) is convex for all ξ ∈ [0, I_f(A;X|Y)]. In addition, the assumption that A ⊥ Y | X, meaning that I_f(A;Y|X) = 0, together with (3), implies that Φ(I_f(A;X|Y)) = 0. Since Φ is attained as a minimum, it follows that there exist predictors that satisfy sufficiency (in particular, the identity predictor R_id = X).

Let 0 ≤ ξ₁ < ξ₂ < I_f(A;X|Y). From convexity, we have

(Φ(ξ₂) − Φ(ξ₁)) / (ξ₂ − ξ₁) ≤ (Φ(I_f(A;X|Y)) − Φ(ξ₂)) / (I_f(A;X|Y) − ξ₂) = (0 − Φ(ξ₂)) / (I_f(A;X|Y) − ξ₂).

Since Φ(ξ₂) ≥ 0, I_f(A;X|Y) − ξ₂ > 0 and ξ₂ − ξ₁ > 0, we conclude that Φ(ξ₂) − Φ(ξ₁) ≤ 0, that is, Φ(ξ₂) ≤ Φ(ξ₁). This means that Φ(ξ) is monotonically nonincreasing in ξ.

This result shows that, rather than needing to abandon either separation or sufficiency due to their incompatibility, a parametric approach reveals a continuous tradeoff between the measures of deviation from those criteria, exposing a range of intermediate combinations that can be achieved.

For example, suppose that a predictor was trained to achieve no worse than a certain subgroup calibration error. This objective corresponds to a level of deviation from sufficiency, and may be compatible with quite a good level of separation. The convexity of the curve suggests that a small relaxation of the subgroup calibration objective can result in a relatively large improvement in the separation measure.

Figure 3: Simulations of the achievable region (using KL divergence) for a fixed joint distribution $P(X,Y)$ and three different scenarios of $P(A\mid X,Y)$. Left: $A_0 \perp Y \mid X$; it can be seen that the lower boundary of the achievable region, $\Phi(\xi)$, is indeed convex, monotonically nonincreasing, and attains the value zero. Middle: $A_1 \not\perp Y \mid X$, but $I(A_1;Y) < I(A_1;X)$; the lower boundary is not monotonic, but there exist predictors that achieve $\Delta_{\mathrm{suff}}^{\mathrm{KL}} = 0$. Right: $A_2 \not\perp Y \mid X$ and $I(A_2;Y) > I(A_2;X)$; the entire achievable region is bounded away from zero. See Appendix E for a detailed description of the simulation.

4.3 Full Sufficiency Is Unachievable When the Data Is Not Rich Enough

The previous result relies heavily on the assumption that the data itself satisfies sufficiency. When this is not the case, that is, when $I_f(A;Y\mid X) > 0$, the data $X$ is not expressive enough to convey all the information about the relationship between the protected attribute $A$ and the true label $Y$. This can occur if the protected attribute is not an inherent part of the data, yet it bears information about the true label. In particular, this can happen when the protected attribute is deliberately stripped from the data, as in the practice of "fairness through unawareness."

In this case, there is no guarantee that the separation-sufficiency curve is monotonic (see an example of such a situation in the middle panel of Figure 3), nor that it attains the value zero (see Figure 3, right panel). In other words, there is no guarantee that there exist predictors that satisfy perfect sufficiency. In fact, we show that in certain scenarios such predictors do not exist.

In the following analysis, we rely exclusively on Shannon’s mutual information, as it allows us to apply its chain rule (see, for example, [4]), which does not necessarily have a counterpart in other forms of f-information.

Theorem 15.

Let $X$, $Y$ and $A$ be jointly distributed finite random variables. Then for all $\xi$ in its domain, the separation-sufficiency function using mutual information satisfies $\Phi(\xi) \ge \xi + I(A;Y) - I(A;X)$.

Proof.

Let $R$ be a predictor defined by the conditional distribution $P(R\mid X)$, meaning that $(A,Y) - X - R$ form a Markov chain. In particular, $A - X - R$ form a Markov chain too, and thus by the data processing inequality we have

$$I(A;R) \le I(A;X). \tag{4}$$

Now, using the chain rule of mutual information in two different ways, we get

$$\begin{aligned} I(A;R,Y) &= I(A;R) + I(A;Y\mid R) = I(A;R) + \Delta_{\mathrm{suff}}^{\mathrm{KL}}(R) \\ &= I(A;Y) + I(A;R\mid Y) = I(A;Y) + \Delta_{\mathrm{sep}}^{\mathrm{KL}}(R). \end{aligned}$$

Therefore, $\Delta_{\mathrm{suff}}^{\mathrm{KL}}(R) = \Delta_{\mathrm{sep}}^{\mathrm{KL}}(R) + I(A;Y) - I(A;R)$, and from (4) it follows that

$$\Delta_{\mathrm{suff}}^{\mathrm{KL}}(R) \ge \Delta_{\mathrm{sep}}^{\mathrm{KL}}(R) + I(A;Y) - I(A;X). \tag{5}$$

Finally, let $\xi \in [0, I(A;X\mid Y)]$. Then by Corollary 11, $\Phi(\xi)$ is attained by some predictor $R$, such that $\xi = \Delta_{\mathrm{sep}}^{\mathrm{KL}}(R)$ and $\Phi(\xi) = \Delta_{\mathrm{suff}}^{\mathrm{KL}}(R)$. Substituting these equalities in (5) yields $\Phi(\xi) \ge \xi + I(A;Y) - I(A;X)$.
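The chain-rule identity at the heart of this proof, $\Delta_{\mathrm{suff}}^{\mathrm{KL}}(R) = \Delta_{\mathrm{sep}}^{\mathrm{KL}}(R) + I(A;Y) - I(A;R)$, can be verified numerically. The sketch below is our own illustration (helper names are assumptions); it draws a random joint distribution and a random predictor and checks the identity:

```python
import numpy as np

rng = np.random.default_rng(0)

def mi(pab):
    """I(A;B) in nats for a joint array p[a, b]."""
    p = pab / pab.sum()
    pa, pb = p.sum(1), p.sum(0)
    mask = p > 0
    return float((p[mask] * np.log((p / np.outer(pa, pb))[mask])).sum())

def cond_mi(p):
    """I(U;V|W) in nats for a joint array p[u, v, w]."""
    pw = p.sum((0, 1)); puw = p.sum(1); pvw = p.sum(0)
    total = 0.0
    for u, v, w in np.ndindex(p.shape):
        if p[u, v, w] > 0:
            total += p[u, v, w] * np.log(p[u, v, w] * pw[w] / (puw[u, w] * pvw[v, w]))
    return total

# random joint over (A, Y, X) and a random predictor P(R|X)
payx = rng.random((2, 2, 4)); payx /= payx.sum()
pr_x = rng.random((5, 4)); pr_x /= pr_x.sum(0)      # columns are P(.|x)
payr = np.einsum('ayx,rx->ayr', payx, pr_x)          # joint P(a, y, r)

d_suff = cond_mi(payr)                       # I(A;Y|R)
d_sep  = cond_mi(payr.transpose(0, 2, 1))    # I(A;R|Y)
i_ay   = mi(payr.sum(2))                     # I(A;Y)
i_ar   = mi(payr.sum(1))                     # I(A;R)

# expanding I(A;R,Y) with the chain rule in two ways gives
# Δ_suff = Δ_sep + I(A;Y) − I(A;R)
assert abs(d_suff - (d_sep + i_ay - i_ar)) < 1e-10
```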

Corollary 16.

If $I(A;X) < I(A;Y)$, then there exists no predictor that satisfies sufficiency.

Proof.

By Theorem 15 we have $\Phi(\xi) \ge \xi + I(A;Y) - I(A;X) > \xi \ge 0$. Since $\Phi$ is the lower boundary of the achievable region, this means that no predictor attains $\Delta_{\mathrm{suff}}^{\mathrm{KL}}(R) = I(A;Y\mid R) = 0$.

(Note that, although the statement of the corollary is in terms of Shannon's mutual information, the result that $\Delta_{\mathrm{suff}}(R) > 0$ holds for all $f$-information with $f$ strictly convex at 1, since in that case $\Delta_{\mathrm{suff}}(R) = I_f(A;Y\mid R) = 0$ implies sufficiency.)

It is worth noting that when $Y - X - A$ form a Markov chain, the data processing inequality guarantees that $I(A;Y) \le I(A;X)$, and thus this corollary does not contradict Theorem 14.

This result has important implications when considering downstream uses of representations (post-processing), as we explore in the next section.

4.4 Post-Processing

Our analysis of the achievable region, along with the tradeoffs and limitations of separation and sufficiency, has some straightforward implications for the downstream use of fair representations. Specifically, note that when $R$ is regarded as a representation, upon which a further prediction $R'$ will be made, then $R$ takes the place of $X$ in our model (Figure 1(c)). This is formalized in the next result.

Theorem 17 (Post-processing).

Let $X$, $Y$ and $A$ be jointly distributed finite random variables. If $R$ and $R'$ are two finite random variables such that $(A,Y) - X - R - R'$ form a Markov chain in that order, we say that the predictor $R'$ is a post-processing of $R$, or a downstream predictor based on $R$. We have the following results:

  1. If $R'$ is a post-processing of $R$, then $\Delta_{\mathrm{sep}}(R') \le \Delta_{\mathrm{sep}}(R)$, meaning that the deviation from separation cannot increase with post-processing.

  2. If $R$ satisfies sufficiency, then the separation-sufficiency curve restricted to downstream predictors is convex, nonincreasing, and attains the value zero.

  3. If $R$ satisfies $I(A;R) < I(A;Y)$, then for any post-processing $R'$ we have $\Delta_{\mathrm{suff}}(R') > 0$, meaning that no downstream predictor can satisfy sufficiency.

Proof.

Note that if $R'$ is a post-processing of $R$ then, in particular, $(A,Y) - R - R'$ form a Markov chain. Therefore,

  1. From Theorem 12 (with $R$ in place of $X$) we have $\Delta_{\mathrm{sep}}(R') \le I_f(A;R\mid Y) = \Delta_{\mathrm{sep}}(R)$.

  2. If $R$ satisfies sufficiency, then $A \perp Y \mid R$, and the claim follows from Theorem 14 (with $R$ in place of $X$).

  3. This is an immediate consequence of Corollary 16 (with $R$ in place of $X$).

Note that, as a consequence of the first result above, if $R$ satisfies full separation then any downstream predictor will necessarily satisfy full separation too – essentially collapsing the space of possible tradeoffs with sufficiency. This is especially restrictive given that separation is fundamentally at odds with $I_f(R;Y)$, the amount of relevant information that the predictor conveys about the true label (see Appendix D).

Regarding statement 3, it should be noted that by the data processing inequality, $I(A;X) \ge I(A;R) \ge I(A;R')$, meaning that the quantity $I(A;R)$ can only decrease with post-processing; repeated or aggressive post-processing of a representation can thus result in a state where full sufficiency is no longer achievable.
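Both monotonicity claims (statement 1 and the data-processing chain above) can be checked empirically. The sketch below is our own illustration (variable names are assumptions); it composes a random predictor with a random post-processing channel and verifies that neither $\Delta_{\mathrm{sep}}$ nor $I(A;\cdot)$ increases:

```python
import numpy as np

rng = np.random.default_rng(2)

def cond_mi(p):
    """I(U;V|W) in nats for a joint array p[u, v, w]."""
    pw = p.sum((0, 1)); puw = p.sum(1); pvw = p.sum(0)
    t = 0.0
    for u, v, w in np.ndindex(p.shape):
        if p[u, v, w] > 0:
            t += p[u, v, w] * np.log(p[u, v, w] * pw[w] / (puw[u, w] * pvw[v, w]))
    return t

def mi(p):
    """I(U;V) in nats for a joint array p[u, v]."""
    pu, pv = p.sum(1), p.sum(0)
    mask = p > 0
    return float((p[mask] * np.log((p / np.outer(pu, pv))[mask])).sum())

payx  = rng.random((2, 2, 4)); payx /= payx.sum()     # joint P(a, y, x)
pr_x  = rng.random((5, 4));  pr_x  /= pr_x.sum(0)     # P(r |x): the predictor
prp_r = rng.random((3, 5));  prp_r /= prp_r.sum(0)    # P(r'|r): post-processing

payr  = np.einsum('ayx,rx->ayr', payx, pr_x)          # joint P(a, y, r)
payrp = np.einsum('ayr,sr->ays', payr, prp_r)         # joint P(a, y, r')

d_sep_R  = cond_mi(payr.transpose(0, 2, 1))           # I(A;R |Y)
d_sep_Rp = cond_mi(payrp.transpose(0, 2, 1))          # I(A;R'|Y)
i_ar, i_arp = mi(payr.sum(1)), mi(payrp.sum(1))       # I(A;R), I(A;R')

assert d_sep_Rp <= d_sep_R + 1e-12   # statement 1: separation cannot worsen
assert i_arp <= i_ar + 1e-12         # DPI: I(A;R') <= I(A;R)
```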

5 Discussion

We hope that this work contributes to elucidating the possible tradeoffs between various desiderata that are achievable in the design of classification algorithms, providing a richer space of options to domain experts and decision-makers.

Although we have primarily focused on separation-sufficiency tradeoffs, the tradeoff is, in fact, three-way – between separation, sufficiency, and performance, as measured by $I_f(R;Y)$. In particular, several predictors may attain a specific point on the separation-sufficiency curve with different levels of overall performance. In Appendix D we take a first step in this direction, providing an analysis of the tension between separation and the predictor's performance. This is a promising direction for future research.

Furthermore, our focus in this work has been information theoretic – exploring the possible tradeoffs that predictors can achieve; developing algorithms to efficiently compute such predictors is an interesting open question.

References

  • [1] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning: Limitations and Opportunities. MIT Press, 2023.
  • [2] Yahav Bechavod and Katrina Ligett. Penalizing unfairness in binary classification. arXiv preprint arXiv:1707.00044, 2017.
  • [3] Alex Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153–163, 2017. doi:10.1089/BIG.2016.0047.
  • [4] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd ed. edition, 2006.
  • [5] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In 3rd Innovations in Theoretical Computer Science Conference (ITCS), pages 214–226, 2012. doi:10.1145/2090236.2090255.
  • [6] Parikshit Gopalan, Adam Tauman Kalai, Omer Reingold, Vatsal Sharan, and Udi Wieder. Omnipredictors. In 13th Innovations in Theoretical Computer Science Conference (ITCS), volume 215, pages 79:1–79:21, 2022. doi:10.4230/LIPICS.ITCS.2022.79.
  • [7] Faisal Hamman and Sanghamitra Dutta. A unified view of group fairness tradeoffs using partial information decomposition. In 2024 IEEE International Symposium on Information Theory (ISIT), pages 214–219, 2024. doi:10.1109/ISIT57864.2024.10619698.
  • [8] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29, 2016.
  • [9] Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, pages 1939–1948. PMLR, 2018.
  • [10] Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pages 643–650, 2011. doi:10.1109/ICDMW.2011.83.
  • [11] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent Trade-Offs in the Fair Determination of Risk Scores. In 8th Innovations in Theoretical Computer Science Conference (ITCS), pages 43:1–43:23, 2017. doi:10.4230/LIPICS.ITCS.2017.43.
  • [12] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. On fairness and calibration. Advances in Neural Information Processing Systems, 30, 2017.
  • [13] Yury Polyanskiy and Yihong Wu. Information Theory: From Coding to Learning. Cambridge University Press, 2025.
  • [14] Hans S. Witsenhausen and Aaron D. Wyner. A conditional entropy bound for a pair of discrete random variables. IEEE Transactions on Information Theory, 21(5):493–501, 1975.
  • [15] Blake Woodworth, Suriya Gunasekar, Mesrob I. Ohannessian, and Nathan Srebro. Learning non-discriminatory predictors. In Conference on Learning Theory, pages 1920–1953. PMLR, 2017. URL: http://proceedings.mlr.press/v65/woodworth17a.html.
  • [16] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171–1180, 2017. doi:10.1145/3038912.3052660.
  • [17] Rich Zemel, Yu Wu, Kevin Swersky, Toniann Pitassi, and Cynthia Dwork. Learning fair representations. In International Conference on Machine Learning, pages 325–333. PMLR, 2013.

Appendix A Achieving Separation and Sufficiency Together

If Y and A are not independent, then separation and sufficiency cannot simultaneously hold if the joint distribution of A, Y and R has full support [3, 11]. In the binary case, this means that separation and sufficiency can be achieved together only if R perfectly predicts Y. However, in the general case, this does not always result in trivial or degenerate conditions.

Indeed, assume that $R$ satisfies both $A \perp R \mid Y$ and $A \perp Y \mid R$. The first condition means that $A - Y - R$ form a Markov chain in that order. From the data processing inequality for mutual information we have $I(R;A) \le I(Y;A)$. The second condition means that $A - R - Y$ form a Markov chain, and thus $I(Y;A) \le I(R;A)$. Together, we get that $I(Y;A) = I(R;A)$, meaning that $R$ and $Y$ convey the same information about $A$. In other words, this implies that $R$ is a sufficient statistic of $Y$ for $A$ (see, for example, [4]).

The following is an example of such a case, where full separation and full sufficiency are not incompatible:

Example 18.

Let $A \sim \mathrm{Ber}(q)$ for some $q \in (0,1)$. For $i = 1, 2, 3$, let $X_i \mid A \sim \mathrm{Ber}(p_A)$, where $p_0 = \frac12$ and $p_1 = \frac14$. Define $Y = \sum_{i=1}^3 X_i 2^{i-1}$, that is, the value of the binary word $X_1 X_2 X_3$, and $R = \sum_{i=1}^3 X_i$. Then

$$P(Y\mid A) = \begin{bmatrix} \frac18 & \frac18 & \frac18 & \frac18 & \frac18 & \frac18 & \frac18 & \frac18 \\[3pt] \frac{27}{64} & \frac{9}{64} & \frac{9}{64} & \frac{3}{64} & \frac{9}{64} & \frac{3}{64} & \frac{3}{64} & \frac{1}{64} \end{bmatrix} \begin{matrix} A = 0 \\ A = 1 \end{matrix} \quad (\text{columns } Y = 0, 1, \dots, 7), \qquad \text{and} \qquad R = R(Y) = \begin{cases} 0, & Y = 0 \\ 1, & Y \in \{1, 2, 4\} \\ 2, & Y \in \{3, 5, 6\} \\ 3, & Y = 7 \end{cases}$$

From this definition of $R$, we see that the sequence $A - Y - R$ forms a Markov chain. In other words, $R$ is conditionally independent of $A$ given $Y$, meaning that it satisfies full separation.

A simple calculation shows that $R$ also satisfies full sufficiency. For example,

$$\begin{aligned} P(Y{=}3 \mid R{=}2, A{=}0) &= \frac{P(Y{=}3, R{=}2 \mid A{=}0)}{P(R{=}2 \mid A{=}0)} = \frac{P(Y{=}3 \mid A{=}0)}{P(Y \in \{3,5,6\} \mid A{=}0)} = \frac{1/8}{3 \cdot 1/8} = \frac13, \\ P(Y{=}3 \mid R{=}2, A{=}1) &= \frac{P(Y{=}3, R{=}2 \mid A{=}1)}{P(R{=}2 \mid A{=}1)} = \frac{P(Y{=}3 \mid A{=}1)}{P(Y \in \{3,5,6\} \mid A{=}1)} = \frac{3/64}{3 \cdot 3/64} = \frac13. \end{aligned}$$

Consequently, even though $Y$ is not independent of $A$, we have an example of a predictor that satisfies both separation and sufficiency, and yet does not allow for a perfect estimation of $Y$ (for example, if $R = 1$, then $Y$ can be either 1, 2 or 4). The predictor $R$ is indeed a sufficient statistic of $Y$ for $A$: it can be seen in the matrix above that for each value of $A$, the probability $P(Y\mid A)$ does not distinguish between values of $Y$ that correspond to binary words $X_1 X_2 X_3$ with the same number of ones, which is exactly the value of $R$.
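Example 18 can be checked mechanically. The short Python sketch below (our own verification code, not from the paper) builds $P(Y\mid A)$ from the bitwise definition and confirms that, within each level set of $R$, the conditional distribution of $Y$ is the same for both values of $A$, which is exactly sufficiency:

```python
import numpy as np

p_A = {0: 0.5, 1: 0.25}          # Bernoulli parameter for each value of A

def py_given_a(a):
    """P(Y=y | A=a) for y encoding the binary word X1 X2 X3, bits i.i.d. Ber(p_A)."""
    out = np.zeros(8)
    for y in range(8):
        ones = bin(y).count('1')
        out[y] = p_A[a] ** ones * (1 - p_A[a]) ** (3 - ones)
    return out

# R counts the ones in the binary word, i.e., R(Y) as in Example 18
R = [bin(y).count('1') for y in range(8)]

# sufficiency: P(Y | R, A) must not depend on A; here it is uniform on each
# level set of R for both values of A (and the check is independent of q)
for a in (0, 1):
    py = py_given_a(a)
    assert abs(py.sum() - 1) < 1e-12
    for r in range(4):
        ys = [y for y in range(8) if R[y] == r]
        cond = py[ys] / py[ys].sum()
        assert np.allclose(cond, 1 / len(ys))
```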

Appendix B Deviation from Sufficiency and Multicalibration

Following Definition 5.3 in [6], a predictor $R$ is said to be $\alpha$-approximate subgroup calibrated with respect to $A$ if it satisfies $\mathbb{E}_R \left| \mathrm{Cov}[A, Y \mid R] \right| \le \alpha$.

Proposition 19.

Let $A$, $Y$ and $R$ be jointly distributed random variables, and assume that $A$ and $Y$ take values in $\{0,1\}$. Then $R$ is $\alpha$-approximate subgroup calibrated with respect to $A$ iff $\Delta_{\mathrm{suff}}^{\mathrm{TV}}(R) \le 2\alpha$.

Proof.

We show that when $A$ and $Y$ are binary, $\Delta_{\mathrm{suff}}^{\mathrm{TV}}(R) = 2\, \mathbb{E}_R |\mathrm{Cov}[A, Y \mid R]|$. Indeed,

$$\begin{aligned} \mathrm{Cov}[A, Y \mid R] &= \mathbb{E}[AY \mid R] - \mathbb{E}[A \mid R]\, \mathbb{E}[Y \mid R] \\ &= \sum_{a,y \in \{0,1\}^2} P(a,y\mid R)\, ay - \sum_{a \in \{0,1\}} P(a\mid R)\, a \sum_{y \in \{0,1\}} P(y\mid R)\, y \\ &= \sum_{a,y \in \{0,1\}^2} P(a\mid R) P(y\mid a,R)\, ay - \sum_{a \in \{0,1\}} P(a\mid R)\, a \sum_{y \in \{0,1\}} P(y\mid R)\, y \\ &= \sum_{a \in \{0,1\}} P(a\mid R)\, a \left( \sum_{y \in \{0,1\}} P(y\mid a,R)\, y - \sum_{y \in \{0,1\}} P(y\mid R)\, y \right) \\ &= P(A{=}1\mid R) \big( P(Y{=}1\mid A{=}1, R) - P(Y{=}1\mid R) \big), \end{aligned}$$

so we have $|\mathrm{Cov}[A, Y \mid R]| = P(A{=}1\mid R)\, |P(Y{=}1\mid A{=}1, R) - P(Y{=}1\mid R)|$.

In a similar way, it can also be shown that

$$\begin{aligned} |\mathrm{Cov}[A, Y \mid R]| &= P(A{=}0\mid R)\, |P(Y{=}1\mid A{=}0, R) - P(Y{=}1\mid R)| \\ &= P(A{=}1\mid R)\, |P(Y{=}0\mid A{=}1, R) - P(Y{=}0\mid R)| \\ &= P(A{=}0\mid R)\, |P(Y{=}0\mid A{=}0, R) - P(Y{=}0\mid R)|. \end{aligned}$$

Therefore,

$$|\mathrm{Cov}[A, Y \mid R]| = \frac14 \sum_{a \in \{0,1\}} P(a\mid R) \sum_{y \in \{0,1\}} |P(y\mid a,R) - P(y\mid R)| = \frac14\, \mathbb{E}_{A\mid R} \sum_{y \in \{0,1\}} |P(y\mid A,R) - P(y\mid R)|.$$

Now, using total variation to measure the deviation from sufficiency, we get from (1) that

$$\begin{aligned} \Delta_{\mathrm{suff}}^{\mathrm{TV}}(R) = I_{\mathrm{TV}}(A;Y\mid R) &= \mathbb{E}_{A,R}\, \mathrm{TV}\big( P(Y\mid A,R),\, P(Y\mid R) \big) \\ &= \frac12\, \mathbb{E}_{A,R} \sum_{y \in \{0,1\}} |P(y\mid A,R) - P(y\mid R)| \\ &= \frac12\, \mathbb{E}_R\, \mathbb{E}_{A\mid R} \sum_{y \in \{0,1\}} |P(y\mid A,R) - P(y\mid R)| = 2\, \mathbb{E}_R |\mathrm{Cov}[A, Y \mid R]|. \end{aligned}$$
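This equality is easy to confirm numerically for binary $A$ and $Y$. The following sketch (our own illustration; variable names are assumptions) computes both sides for a random joint distribution over $(A, Y, R)$:

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random((2, 2, 4)); p /= p.sum()     # joint P(a, y, r), A and Y binary

pr = p.sum((0, 1))                          # P(r)
pa_r = p.sum(1) / pr                        # P(a|r)
py_r = p.sum(0) / pr                        # P(y|r)
py_ar = p / p.sum(1, keepdims=True)         # P(y|a,r)

# Δ_suff^TV = E_{A,R} TV(P(Y|A,R), P(Y|R))
d_suff_tv = 0.0
for a in range(2):
    for r in range(4):
        p_ar = p[a, :, r].sum()             # P(a, r)
        d_suff_tv += p_ar * 0.5 * np.abs(py_ar[a, :, r] - py_r[:, r]).sum()

# 2 E_R |Cov[A, Y | R]|
twice_cov = 0.0
for r in range(4):
    eay = p[1, 1, r] / pr[r]                # E[AY | r] = P(A=1, Y=1 | r)
    cov = eay - pa_r[1, r] * py_r[1, r]
    twice_cov += pr[r] * 2 * abs(cov)

assert abs(d_suff_tv - twice_cov) < 1e-12
```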

Appendix C Details for the Proof of Lemma 10

From (2) we have $\xi = \sum_{r=1}^k \alpha_r\, \boldsymbol{\xi}(Q_r)$. Substituting $P(r)$ for $\alpha_r$ and $P(X\mid r)$ for $Q_r$ we get

$$\begin{aligned} \xi &= \sum_r P(r)\, \boldsymbol{\xi}(P(X\mid r)) \\ &= \sum_r P(r) \sum_{a,y} P(a\mid y) \Big( \sum_x P(y\mid x) P(x\mid r) \Big) f\left( \frac{\sum_x P(a,y\mid x) P(x\mid r)}{P(a\mid y) \left( \sum_x P(y\mid x) P(x\mid r) \right)} \right) && (6) \\ &= \sum_r P(r) \sum_{a,y} P(a\mid y) P(y\mid r)\, f\left( \frac{P(a,y\mid r)}{P(a\mid y) P(y\mid r)} \right) && (7) \\ &= \sum_r P(r) \sum_{a,y} P(a\mid y) P(y\mid r)\, f\left( \frac{P(a,r\mid y)}{P(a\mid y) P(r\mid y)} \right) && (8) \\ &= \sum_y P(y) \sum_{a,r} P(a\mid y) P(r\mid y)\, f\left( \frac{P(a,r\mid y)}{P(a\mid y) P(r\mid y)} \right) = I_f(A;R\mid Y) = \Delta_{\mathrm{sep}}(R), && (9) \end{aligned}$$

where (6) is simply the definition of $\boldsymbol{\xi}$; (7) follows from the Markov assumption $(A,Y) - X - R$, so $P(y\mid r) = \sum_x P(y\mid x) P(x\mid r)$ and $P(a,y\mid r) = \sum_x P(a,y\mid x) P(x\mid r)$; and (8) and (9) are due to Bayes' rule, as $\frac{P(a,y\mid r)}{P(y\mid r)} = \frac{P(a,r\mid y)}{P(r\mid y)}$ and $P(r) P(y\mid r) = P(y) P(r\mid y)$.

Similar steps show that $\sum_r P(r)\, \boldsymbol{\eta}(P(X\mid r)) = I_f(A;Y\mid R) = \Delta_{\mathrm{suff}}(R)$ and $\sum_r P(r)\, \boldsymbol{\zeta}(P(X\mid r)) = I_f(R;Y)$.

Appendix D Separation Is at Odds with the Predictor’s Overall Performance

The criterion of separation has sometimes been criticized (see, for example, [1, Chapter 3]) because imposing equal error rates across subgroups often leads to suboptimal performance for some of them. In this appendix we formalize this argument: separation may conflict with the overall performance of the predictor.

Indeed, in addition to the tradeoff between separation and sufficiency, the convexity of the achievable region (Lemma 10) also implies a similar tradeoff between separation and If(R;Y). As noted earlier, this quantity can be interpreted as a measure of the predictor’s overall performance. The analysis below builds on the same reasoning as that of the separation-sufficiency tradeoff.

Definition 20 (The separation-performance function).

Let X, Y and A be jointly distributed finite random variables. We define the separation-performance function as

$$\Psi(\xi) := \sup \big\{ I_f(R;Y) \,\big|\, R \text{ is finite, } (A,Y) - X - R \text{ is Markov, and } \Delta_{\mathrm{sep}}(R) = \xi \big\},$$

for all $\xi \in [0, I_f(A;X\mid Y)]$.

Note that the domain of Ψ follows from Theorem 12.

Theorem 21 (The separation-performance tradeoff).

The separation-performance function Ψ(ξ) is continuous, concave, monotonically nondecreasing in ξ, and attained as a maximum.

Proof.

Recall the definition of the achievable region,

$$\mathcal{S} := \big\{ \big( \Delta_{\mathrm{sep}}(R),\, \Delta_{\mathrm{suff}}(R),\, I_f(R;Y) \big) \,\big|\, R \text{ is finite and } (A,Y) - X - R \text{ is Markov} \big\} \subseteq \mathbb{R}_{\ge 0}^3,$$

and denote by $\tilde{\mathcal{S}} = \{ (\xi, \zeta) \mid (\xi, \eta, \zeta) \in \mathcal{S} \}$ the projection of $\mathcal{S}$ onto the first and third coordinates, corresponding to $\Delta_{\mathrm{sep}}(R)$ and $I_f(R;Y)$. By Lemma 10, the achievable region $\mathcal{S}$ is convex and compact, implying the same for its projection $\tilde{\mathcal{S}}$. Since $\Psi(\xi)$ is the upper boundary of $\tilde{\mathcal{S}}$, it follows that it is a continuous concave curve. In addition, the compactness of $\tilde{\mathcal{S}}$ implies that $\Psi(\xi)$ is attained as a maximum.

Let $\xi \in [0, I_f(A;X\mid Y)]$. Since $\Psi(\xi)$ is attained as a maximum, there exists a predictor $R$ forming a Markov chain $(A,Y) - X - R$, such that $\Delta_{\mathrm{sep}}(R) = \xi$ and $I_f(R;Y) = \Psi(\xi)$. By the data processing inequality we have $I_f(R;Y) \le I_f(X;Y)$, and thus

$$0 \le \Psi(\xi) \le I_f(X;Y). \tag{10}$$

Now, denote by $R_{\mathrm{id}} = X$ the identity predictor. Then $\Delta_{\mathrm{sep}}(R_{\mathrm{id}}) = I_f(A;R_{\mathrm{id}}\mid Y) = I_f(A;X\mid Y)$ and $I_f(R_{\mathrm{id}};Y) = I_f(X;Y)$. Therefore, by its definition, $\Psi(I_f(A;X\mid Y)) \ge I_f(X;Y)$, and together with (10) we get that $\Psi(I_f(A;X\mid Y)) = I_f(X;Y)$.

Finally, let $0 \le \xi_1 < \xi_2 < I_f(A;X\mid Y)$. From concavity, we have

$$\frac{\Psi(\xi_2) - \Psi(\xi_1)}{\xi_2 - \xi_1} \ge \frac{\Psi(I_f(A;X\mid Y)) - \Psi(\xi_2)}{I_f(A;X\mid Y) - \xi_2} = \frac{I_f(X;Y) - \Psi(\xi_2)}{I_f(A;X\mid Y) - \xi_2}.$$

Since by (10), $I_f(X;Y) - \Psi(\xi_2) \ge 0$, $I_f(A;X\mid Y) - \xi_2 > 0$ and $\xi_2 - \xi_1 > 0$, we conclude that $\Psi(\xi_2) - \Psi(\xi_1) \ge 0$; that is, $\Psi(\xi_1) \le \Psi(\xi_2)$. This means that $\Psi(\xi)$ is monotonically nondecreasing in $\xi$.

This result is of particular concern if R serves as a representation, upon which further predictions will be made. The reason is that imposing a low deviation from separation already at an upstream stage may decrease the information that R bears on the true label Y; by the data processing inequality, this in turn will constrain all downstream predictors.

Appendix E Simulation of the Achievable Region

We provide the details and full results of the simulation that we ran in order to visualize the achievable region in concrete examples, corresponding to different settings of X, Y and A.

E.1 Method

First, we fix a joint distribution P(X,Y). We then consider three different cases of P(A|X,Y), corresponding to three distinct scenarios:

  1. $A_0 \perp Y \mid X$, so we only need to define $P(A_0\mid X)$; this corresponds to the conditions of Theorem 14.

  2. $A_1 \not\perp Y \mid X$, with $I(A_1;X) > I(A_1;Y)$.

  3. $A_2 \not\perp Y \mid X$, with $I(A_2;X) < I(A_2;Y)$; corresponding to the conditions of Corollary 16.

For each of these scenarios we draw at random 10,000 conditional distributions $P(R\mid X)$ in the following manner: denote by $n = |\mathcal{X}|$ the size of the alphabet of $X$; then by the Carathéodory-Fenchel-Eggleston theorem (see, e.g., [14, Lemma 2.2]), every point in the achievable region can be attained by a predictor $R$ taking at most $n+1$ values. Hence, for each sampling of a distribution $P(R\mid X)$, it suffices to draw $n$ distributions from $\mathcal{P}([n+1])$, the probability simplex over $n+1$ points. We do that using an $(n+1)$-dimensional Dirichlet distribution.

For each choice of $P(R\mid X)$, we calculate $\Delta_{\mathrm{sep}}(R)$, $\Delta_{\mathrm{suff}}(R)$ and $I_f(R;Y)$, using different $f$-divergences. Finally, we plot the points corresponding to the 10,000 sampled predictors in the $\Delta_{\mathrm{sep}}$–$\Delta_{\mathrm{suff}}$ plane, colored according to their respective value of $I_f(R;Y)$.
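The sampling procedure can be sketched in a few lines of Python. The code below is our own illustration of the method (it uses the scenario $A_0$ from Appendix E.2 with fewer samples than the 10,000 used in the paper; helper names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def cond_mi(p):
    """I(U;V|W) in nats for a joint array p[u, v, w]."""
    pw = p.sum((0, 1)); puw = p.sum(1); pvw = p.sum(0)
    t = 0.0
    for u, v, w in np.ndindex(p.shape):
        if p[u, v, w] > 0:
            t += p[u, v, w] * np.log(p[u, v, w] * pw[w] / (puw[u, w] * pvw[v, w]))
    return t

# fixed distribution from Appendix E.2, scenario A0 (so A ⊥ Y | X)
px  = np.array([0.1, 0.1, 0.1, 0.7])
py1 = np.array([0.8, 0.6, 0.4, 0.2])          # P(Y=1|X)
pa1 = np.array([1.0, 0.4, 0.8, 0.2])          # P(A0=1|X)
payx = np.zeros((2, 2, 4))                    # joint P(a, y, x)
for a in range(2):
    for y in range(2):
        payx[a, y] = (pa1 if a else 1 - pa1) * (py1 if y else 1 - py1) * px

pts = []
for _ in range(500):                          # the paper samples 10,000 predictors
    # n = 4 rows drawn from a symmetric Dirichlet over n + 1 = 5 points
    pr_x = rng.dirichlet(np.full(5, 0.1), size=4).T    # columns are P(.|x)
    payr = np.einsum('ayx,rx->ayr', payx, pr_x)
    pts.append((cond_mi(payr.transpose(0, 2, 1)),      # Δ_sep  = I(A;R|Y)
                cond_mi(payr)))                        # Δ_suff = I(A;Y|R)

# every sampled Δ_sep respects the bound Δ_sep(R) <= I(A;X|Y) of Theorem 12
i_axy = cond_mi(payx.transpose(0, 2, 1))               # I(A;X|Y)
assert all(sep <= i_axy + 1e-9 for sep, _ in pts)
```

Plotting the pairs in `pts` (colored by $I_f(R;Y)$) reproduces the qualitative shape of the left panel of Figure 3.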

E.2 Details

We considered $X$ taking 4 values, and binary $Y$ and $A$. Specifically, we fixed $P(X) = [0.1, 0.1, 0.1, 0.7]$ and $P(Y{=}1\mid X) = [0.8, 0.6, 0.4, 0.2]$, as well as the following conditional distributions for $A$:

  1. $P(A_0{=}1\mid X) = [1.0, 0.4, 0.8, 0.2]$ with $P(A_0\mid X,Y) = P(A_0\mid X)$, so that $A_0 \perp Y \mid X$.

  2. $P(A_1{=}1\mid X, Y{=}0) = [0.9, 1.0, 0.8, 0.2]$ and $P(A_1{=}1\mid X, Y{=}1) = [0.9, 0.6, 0.4, 0.4]$.

  3. $P(A_2{=}1\mid X, Y{=}0) = [0.9, 0.6, 0.8, 1.0]$ and $P(A_2{=}1\mid X, Y{=}1) = [0.7, 0.4, 0.3, 0.4]$.

See Table 1 for the resulting values of $I_f(A;X\mid Y)$, $I_f(A;Y\mid X)$, $I_f(A;X)$ and $I_f(A;Y)$.

Table 1: Main information values of the simulated distributions for $A_0$, $A_1$ and $A_2$, using KL divergence, total variation and $\chi^2$-divergence.

                 |   KL divergence   |  Total variation  |  χ²-divergence
                 |  A0    A1    A2   |  A0    A1    A2   |  A0    A1    A2
  I_f(A;Y|X)     | 0.000 0.036 0.074 | 0.000 0.045 0.098 | 0.000 0.041 0.107
  I_f(A;X|Y)     | 0.338 0.177 0.066 | 0.200 0.162 0.096 | 0.553 0.267 0.090
  I_f(A;X)       | 0.393 0.152 0.035 | 0.224 0.148 0.070 | 0.583 0.251 0.049
  I_f(A;Y)       | 0.055 0.011 0.044 | 0.098 0.048 0.099 | 0.081 0.016 0.055

For each P(R|X), as explained above, we sampled 4 distribution vectors (corresponding to the values of X) from a 5-dimensional symmetric Dirichlet distribution with parameter α=0.1. The resulting plots are shown in Figure 4.

Figure 4: Simulations of the achievable region using KL divergence (top row; shown also in Figure 3), total variation (middle row) and $\chi^2$-divergence (bottom row). The left plots correspond to the setting where $A \perp Y \mid X$; the middle and right plots correspond to the setting where $A \not\perp Y \mid X$, with $I(A;Y) < I(A;X)$ and $I(A;Y) > I(A;X)$, respectively.