
Kernel Multiaccuracy

Carol Xuan Long¹, Wael Alghamdi¹, Alexander Glynn¹, Yixuan Wu, and Flavio P. Calmon
John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA
¹ These authors contributed equally to this work.
Abstract

Predefined demographic groups often overlook the subpopulations most impacted by model errors, leading to a growing emphasis on data-driven methods that pinpoint where models underperform. The emerging field of multi-group fairness addresses this by ensuring models perform well across a wide range of group-defining functions, rather than relying on fixed demographic categories. We demonstrate that recently introduced notions of multi-group fairness can be equivalently formulated as integral probability metrics (IPMs). IPMs are the common information-theoretic tool that underlies definitions such as multiaccuracy, multicalibration, and outcome indistinguishability. For multiaccuracy, this connection leads to a simple, yet powerful procedure for achieving multiaccuracy with respect to an infinite-dimensional class of functions defined by a reproducing kernel Hilbert space (RKHS): first perform a kernel regression of a model’s errors, then subtract the resulting function from a model’s predictions. We combine these results to develop a post-processing method that improves multiaccuracy with respect to bounded-norm functions in an RKHS, enjoys provable performance guarantees, and, in binary classification benchmarks, achieves favorable multiaccuracy relative to competing methods.

Keywords and phrases:
algorithmic fairness, integral probability metrics, information theory
Copyright and License:
[Uncaptioned image] © Carol Xuan Long, Wael Alghamdi, Alexander Glynn, Yixuan Wu, and Flavio P. Calmon; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Mathematics of computing → Information theory
Acknowledgements:
This material is based upon work supported by the National Science Foundation under Grants No. FAI 2040880, CIF 2231707, and CIF 2312667.
Editors:
Mark Bun

1 Introduction

Machine learning (ML) models can be inaccurate or miscalibrated on underrepresented population groups defined by categorical features such as race, religion, and sex [3]. Equitable treatment of groups defined by categorical features is a central aspect of the White House’s “Blueprint for an AI Bill of Rights” [23]. Over the past decade, hundreds of fairness metrics and interventions have been introduced to quantify and control an ML model’s performance disparities across pre-defined population groups [12, 24]. Examples of group-fairness-ensuring interventions include post-processing [21, 25, 2] or retraining [1] a model.

Although common, using pre-determined categorical features for measuring “fairness” in ML poses several limitations. Crucially, we design group attributes based on our preconception of where discrimination commonly occurs and whether group-denoting information can be readily measured and obtained. A more complex structure of unfairness can easily elude group-fairness interventions. For instance, [26] demonstrates that algorithms designed to ensure fairness on binary group attributes can be maximally unfair across more complex, intersectional groups – a phenomenon termed “fairness gerrymandering.” Recently, [31] shows that group fairness interventions do not control for – and may exacerbate – arbitrary treatment at the individual and subgroup level.

The paradigm of fairness over categorical groups is an instance of embedded human bias in ML: tools are developed to fit a pre-defined metric on predefined groups, and once a contrived audit is passed, we call the algorithm “fair.” Defining groups by indicator functions over categorical groups is not expressive enough, and the most discriminated groups may not be known a priori. This fact has fueled recent calls for new data-driven methods that uncover groups where a model errs the most. In particular, the burgeoning field of multi-group fairness, and definitions such as multicalibration and multiaccuracy [22, 28, 9], are important steps towards a more holistic view of fairness in ML, requiring a model to be calibrated on a large, potentially uncountable number of group-denoting functions instead of pre-defined categorical groups [22].

Multi-group fairness notions trade the choice of pre-determined categorical features for selecting a function class over features. Here, the group most correlated with a classifier’s errors (multiaccuracy) or against which a classifier is most miscalibrated (multicalibration) is indexed by a function in this class. [22] describes the class as being computable by a circuit of a fixed size. More concretely, [28] and [15] take this class to be linear regression, ridge regression, or shallow decision trees.

We build on this line of work by considering a more general class of functions given by a reproducing kernel Hilbert space (RKHS), an infinite-dimensional function class over the feature space [38]. In fact, an RKHS with a universal kernel is a dense subset of the space of continuous functions [39]. Surprisingly, by leveraging results from information and statistical learning theory [33, 37], we show that the multi-group fairness problem in an RKHS is tractable: the most biased group has a closed form up to a proportionality constant. This leads to an exceedingly simple algorithm (KMAcc, Algorithm 1), which first identifies the function in the RKHS that correlates the most with the error $y-f(\boldsymbol{x})$ (called the witness function), and then improves multiaccuracy by subtracting this function from the original predictions. As an example, Figure 1 illustrates that the errors of a logistic regression model on the Two Moons synthetic dataset are strongly correlated with the witness function values.

Figure 1: Witness function values are highly correlated with the errors of the model. Left: Visualization of the moons dataset, with the logistic regression classifier's decision boundary displayed. Middle: Witness function values (Definition 10 with the RBF kernel) $c^\star_{k,\mathcal{D}_0,f}$ are plotted as a contour under the errors of the classifier on test samples, $y-f(\boldsymbol{x})$. The colored dots denote the error for each test sample. For samples where the model is most erroneous (dark green and dark purple dots), the predicted witness values are high (dark contour underneath). Right: The error $y-f(\boldsymbol{x})$ is plotted against the witness values $c^\star_{k,\mathcal{D}_0,f}$, with a Pearson correlation coefficient of 0.828.

The main contributions of this work include:

1. We show that multiaccuracy, multicalibration, and outcome indistinguishability are integral probability metrics (IPMs), a well-studied family of statistical distance measures. When the groups or distinguishers lie in an RKHS, these IPMs have closed-form estimators, characterized by a witness function that achieves the supremum.

2. We introduce a consistent estimator for multiaccuracy, which flags the most discriminated group in terms of a function in Hilbert space, effectively revealing the previously unknown group that suffers the most from inaccurate predictions.

3. We propose an algorithm, KMAcc (Algorithm 1), which provably corrects the given predictor’s scores against its witness function. Empirically, our algorithm improves multiaccuracy and multicalibration after applying a standard score quantization technique, without the need for the iterative updates required by competing boosting-based models.

4. We conduct extensive experiments on both synthetic and real-world tabular datasets commonly used in fairness research. We show competitive or improved performance compared to competing models, both in terms of multi-group fairness metrics and AUC.

1.1 Related Literature

Multiaccuracy and Multicalibration.

Multiaccuracy and multicalibration, which emerged from theoretical computer science, ensure fairness over the set of computationally identifiable subgroups [22]. Multiaccuracy aims to make classification errors uncorrelated with subgroups, while multicalibration additionally requires predictions to be calibrated. [22] and [28] ensure multiaccuracy and multicalibration via a two-step process: identify subgroups with accuracy disparities, then apply a transformation to the classification function to boost accuracy over those groups – a method akin to weak agnostic learning [11]. Subsequent works [18, 17, 15] connect multicalibration to the general framework of loss minimization, introducing new techniques including reducing squared multicalibration error and projection-based error corrections [15, 8]. Recent developments include online multicalibration algorithms across Lipschitz convex loss functions [13] and via a game-theoretic approach [20]. In addition, [41] adopts multicalibration for multi-dimensional outputs for fair risk control.

A common thread across work on multigroup fairness is to define subgroups in terms of function classes instead of pre-determined discrete combinations of group-denoting features [28, 9, 18, 29]. Examples of such function classes include “learnable” classes (in the usual statistical learning sense) [28] and the set of indicator functions [10]. Practical implementations of multigroup-fairness ensuring algorithms include MCBoost [28], which uses ridge regression and decision tree regression, and LSBoost [15], which uses linear regression and decision trees. Here, we use both methods as benchmarks. Unlike prior work, we consider the class of functions to be an RKHS and show that this class yields closed-form expressions for the function that correlates the most with error, allowing an efficient multiaccuracy intervention.

Kernel-Based Calibration Metrics.

Calibration ensures that probabilistic predictions are neither over- nor under-confident [40]. Prior works have formulated calibration errors for tasks such as classification [7, 34], regression [36], and beyond [40]. Calibration constraints may be directly incorporated into the training objective of a model [30]. [30, 39, 32, 6] have adopted RKHS as the class of functions to ensure calibration. We build on this prior work and develop kernel-based metrics and consistent estimators focused on multi-group fairness.

Integral Probability Metrics (IPMs).

[9] introduces outcome indistinguishability to unify multiaccuracy and multicalibration through a pseudo-randomness perspective – whether one can(not) tell apart “Nature’s” and the predictor’s predictions. We provide an alternative unifying perspective through distances between Nature’s and the predictor’s distributions. As discussed in [9], outcome indistinguishability is closely connected to statistical distance (total variation distance), which, in turn, is one instantiation of an IPM [33], an extensively studied concept in statistical theory that measures the distance between two distributions with respect to a class of functions. [37] provides estimators for IPMs defined on various classes of functions, which we apply to develop a consistent estimator for multiaccuracy.

1.2 Notation

We consider a pair of random variables $\boldsymbol{X}$ and $Y$, taking values in $\mathcal{X}$ and $\mathcal{Y}$ respectively, where $\mathcal{X}$ denotes the input feature space of a prediction task and $\mathcal{Y}$ the output space. Often, we will take $\mathcal{Y}=\{0,1\}$, i.e., binary prediction. The pair $(\boldsymbol{X},Y)$ is distributed according to a fixed unknown joint distribution (Nature's distribution) $P_{\boldsymbol{X},Y}$ with marginals $P_{\boldsymbol{X}}$ and $P_Y$. In binary prediction, we refer to a measurable function $f:\mathcal{X}\to[0,1]$ as a predictor. The predictor $f$ gives rise to a conditional distribution $Q_{Y\mid\boldsymbol{X}=\boldsymbol{x}}(1)\triangleq f(\boldsymbol{x})$. We think of $Q_{Y\mid\boldsymbol{X}}$ as an estimate of Nature's conditional distribution, i.e., $Q_{Y\mid\boldsymbol{X}=\boldsymbol{x}}(1)\approx P_{Y\mid\boldsymbol{X}=\boldsymbol{x}}(1)$. The induced joint distribution is denoted by $Q_{\boldsymbol{X},Y}\triangleq P_{\boldsymbol{X}}\,Q_{Y\mid\boldsymbol{X}}$; this joint distribution $Q_{\boldsymbol{X},Y}$ will be referred to as the predictor's distribution. The marginal distribution $P_{\boldsymbol{X}}$ is the same for both $Q_{\boldsymbol{X},Y}$ and $P_{\boldsymbol{X},Y}$; only the conditional distribution $Q_{Y\mid\boldsymbol{X}}$ changes due to using $f$.

Given a measurable function $c$ and a random variable $Z\sim P$, we interchangeably denote expectation by $\mathbb{E}_P[c]=\mathbb{E}[c(Z)]=\mathbb{E}_{Z\sim P}[c(Z)]\triangleq\int_{\mathcal{Z}}c(z)\,dP(z)$, depending on what is clearer from context. If $\mathcal{D}$ is a finite set of i.i.d. samples, then we denote the empirical average by $\mathbb{E}_{\mathcal{D}}[c]=\mathbb{E}_{Z\sim\mathcal{D}}[c(Z)]\triangleq|\mathcal{D}|^{-1}\sum_{z\in\mathcal{D}}c(z)$.

2 Multi-Group Fairness as Integral Probability Metrics

We show the connection between IPMs [33, 37] – a concept rooted in statistical learning theory – and multi-group fairness notions such as multiaccuracy, multicalibration [22], and outcome indistinguishability [10]. The key property allowing for these connections is that the multi-group fairness notions and IPMs are both variational forms of measures of deviation between probability distributions. IPMs give perhaps the most general form of such variational representations, and we recall the definition next.

Definition 1 (Integral Probability Metric [33, 37]).

Given two probability measures $P$ and $Q$ supported on $\mathcal{Z}$ and a collection of functions $\mathcal{F}\subseteq\{c:\mathcal{Z}\to\mathbb{R}\}$, we define the integral probability metric (IPM) between $P$ and $Q$ with respect to $\mathcal{F}$ by

$\gamma_{\mathcal{F}}(P,Q)\triangleq\sup_{c\in\mathcal{F}}\big|\mathbb{E}_{Z\sim P}[c(Z)]-\mathbb{E}_{Z\sim Q}[c(Z)]\big|.$ (1)
Example 2.

IPMs recover other familiar metrics on probability measures, such as the total variation (statistical distance) metric. Indeed, when $\mathcal{F}$ is the unit $L^\infty$ ball of real-valued functions, i.e., $\mathcal{F}=\{c:\mathcal{Z}\to\mathbb{R}:\sup_{z\in\mathcal{Z}}|c(z)|\le 1\}$, then $\gamma_{\mathcal{F}}(P,Q)=\mathsf{TV}(P,Q)$.

As the example above shows, the complete freedom in choosing the set $\mathcal{F}$ allows IPMs to subsume existing metrics on probability measures. We show that this expressiveness of IPMs carries through to multi-group fairness notions. Later, in Section 3, we instantiate our IPM framework for multiaccuracy in the particular case where $\mathcal{F}$ is the unit ball in an infinite-dimensional Hilbert space, which then recovers another familiar metric on probability measures called the maximum mean discrepancy (MMD) or kernel distance.

2.1 Multi-group Fairness Notions

We recall the definitions of multiaccuracy and multicalibration from [28, 29], where the guarantees are parametrized by a class of real-valued functions $\mathcal{C}\subseteq\{c:\mathcal{X}\to\mathbb{R}\}$. We call $\mathcal{C}$ herein the set of calibrating functions. Intuitively, multi-group notions ensure that $c(\boldsymbol{X})$, for every group-denoting function $c\in\mathcal{C}$, is uncorrelated with a model's errors $Y-f(\boldsymbol{X})$.

Definition 3 (Multiaccuracy [28, 29]).

Fix a collection of functions $\mathcal{C}\subseteq\{c:\mathcal{X}\to\mathbb{R}\}$ (the range is $[-1,1]$ in [28] and $\mathbb{R}_+$ in [29]; we extend the range to $\mathbb{R}$) and a distribution $P_{\boldsymbol{X},Y}$ supported on $\mathcal{X}\times\mathcal{Y}$. A predictor $f:\mathcal{X}\to[0,1]$ is $(\mathcal{C},\alpha)$-multiaccurate over $P_{\boldsymbol{X},Y}$ if for all $c\in\mathcal{C}$ the following inequality holds:

$\mu(c,f,P_{\boldsymbol{X},Y})\triangleq\big|\mathbb{E}[c(\boldsymbol{X})(f(\boldsymbol{X})-Y)]\big|\le\alpha.$ (2)

Multicalibration, proposed by [22], requires the predictor to be unbiased and calibrated against groups denoted by functions in $\mathcal{C}$.

Definition 4 (Multicalibration [22, 29, 8]).

Fix a collection of functions $\mathcal{C}\subseteq\{c:\mathcal{X}\times[0,1]\to\mathbb{R}\}$ and a distribution $P_{\boldsymbol{X},Y}$ supported on $\mathcal{X}\times\mathcal{Y}$. Fix a predictor $f:\mathcal{X}\to[0,1]$ such that $f(\boldsymbol{X})$ is a discrete random variable (alternatively, one can consider a quantization of $f(\boldsymbol{X})$, as done in [14]). We say that $f$ is $(\mathcal{C},\alpha)$-multicalibrated over $P_{\boldsymbol{X},Y}$ if for all $c\in\mathcal{C}$ and $v\in\operatorname{supp}(f(\boldsymbol{X}))$:

$\big|\mathbb{E}[c(\boldsymbol{X},f(\boldsymbol{X}))(f(\boldsymbol{X})-Y)\mid f(\boldsymbol{X})=v]\big|\le\alpha.$ (3)

As discussed in [9], multi-group fairness constraints are equivalent to a broader framework of learning called outcome indistinguishability (OI). The object of interest is the distance between the two distributions – the ones induced by the predictor and by Nature.

Definition 5 (Outcome Indistinguishability [9, 10]).

Fix a collection of functions $\mathcal{C}\subseteq\{c:\mathcal{X}\times[0,1]\times\mathcal{Y}\to\mathbb{R}\}$ and a distribution $P_{\boldsymbol{X},Y}$ supported on $\mathcal{X}\times\mathcal{Y}$. We say that a predictor $f:\mathcal{X}\to[0,1]$ is $(\mathcal{C},\alpha)$-outcome-indistinguishable if for all $c\in\mathcal{C}$,

$\big|\mathbb{E}_{(\boldsymbol{X},Y)\sim P_{\boldsymbol{X},Y}}[c(\boldsymbol{X},f(\boldsymbol{X}),Y)]-\mathbb{E}_{(\boldsymbol{X},Y)\sim Q_{\boldsymbol{X},Y}}[c(\boldsymbol{X},f(\boldsymbol{X}),Y)]\big|\le\alpha,$

where we define the distribution induced by the predictor as $Q_{\boldsymbol{X},Y}\triangleq P_{\boldsymbol{X}}\,Q_{Y\mid\boldsymbol{X}}$ with $Q_{Y\mid\boldsymbol{X}=\boldsymbol{x}}(1)\triangleq f(\boldsymbol{x})$.

Total Variation (statistical) distance, one instantiation of an IPM [33], provides sufficient conditions for OI ([9]). We establish this broader connection next.

2.2 Equivalence Between Multi-group Fairness Notions and IPMs

Since multiaccuracy, multicalibration, and outcome indistinguishability all pertain to finding the largest distance between distributions with respect to a collection of functions, we can unify them in terms of IPMs. First, we show that ensuring a predictor's multiaccuracy with respect to a set of calibrating functions is equivalent to ensuring an upper bound on the IPM between Nature's and the predictor's distribution with respect to a modified set of functions $\tilde{\mathcal{C}}$, given explicitly in the following result.

Proposition 6 (Multiaccuracy as an IPM).

Fix a collection of functions $\mathcal{C}\subseteq L^1(\mathcal{X})$, and let $\mathcal{Y}=\{0,1\}$. Fix a predictor $f:\mathcal{X}\to[0,1]$ inducing the distribution $Q_{\boldsymbol{X},Y}$. Denote the modified set of functions

$\tilde{\mathcal{C}}=\big\{\tilde{c}:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}\ \big|\ \tilde{c}(\boldsymbol{x},y)=(-1)^{1-y}c(\boldsymbol{x})/2\ \text{for } c\in\mathcal{C}\big\}.$ (4)

Then, for any $\alpha\ge 0$, the predictor $f$ is $(\mathcal{C},\alpha)$-multiaccurate if and only if the IPM between Nature's distribution and the predictor's distribution is upper bounded by $\alpha$:

$\gamma_{\tilde{\mathcal{C}}}(P_{\boldsymbol{X},Y},Q_{\boldsymbol{X},Y})\le\alpha.$ (5)

Proof.

Let $(\boldsymbol{\xi},Y)$ be an identical copy of $(\boldsymbol{X},Y)$. Using the notation in (2) in the definition of multiaccuracy (Definition 3), we have that for every $c\in\mathcal{C}$

$\mu(c,f,P_{\boldsymbol{X},Y}) = \big|\mathbb{E}[c(\boldsymbol{\xi})(f(\boldsymbol{\xi})-Y)]\big|$ (6)
$= \big|\mathbb{E}\big[\mathbb{E}[c(\boldsymbol{\xi})(Y-f(\boldsymbol{\xi}))\mid\boldsymbol{\xi}]\big]\big|$ (7)
$= \big|\mathbb{E}\big[c(\boldsymbol{\xi})\big(P_{Y\mid\boldsymbol{X}=\boldsymbol{\xi}}(1)-Q_{Y\mid\boldsymbol{X}=\boldsymbol{\xi}}(1)\big)\big]\big|$ (8)
$= \Big|\mathbb{E}\Big[\tfrac{c(\boldsymbol{\xi})}{2}P_{Y\mid\boldsymbol{X}=\boldsymbol{\xi}}(1)-\tfrac{c(\boldsymbol{\xi})}{2}P_{Y\mid\boldsymbol{X}=\boldsymbol{\xi}}(0)\Big]$ (9)
$\quad-\mathbb{E}\Big[\tfrac{c(\boldsymbol{\xi})}{2}Q_{Y\mid\boldsymbol{X}=\boldsymbol{\xi}}(1)-\tfrac{c(\boldsymbol{\xi})}{2}Q_{Y\mid\boldsymbol{X}=\boldsymbol{\xi}}(0)\Big]\Big|$ (10)
$= \big|\mathbb{E}_{P_{\boldsymbol{X},Y}}[\tilde{c}]-\mathbb{E}_{Q_{\boldsymbol{X},Y}}[\tilde{c}]\big|,$ (11)

where $\tilde{c}(\boldsymbol{x},y)\triangleq(-1)^{1-y}c(\boldsymbol{x})/2$. By definition of multiaccuracy, $f$ is $(\mathcal{C},\alpha)$-multiaccurate if and only if $\sup_{c\in\mathcal{C}}\mu(c,f,P_{\boldsymbol{X},Y})\le\alpha$. This is equivalent, by the above, to the IPM bound $\gamma_{\tilde{\mathcal{C}}}(P_{\boldsymbol{X},Y},Q_{\boldsymbol{X},Y})\le\alpha$, where $\tilde{\mathcal{C}}$ is as defined in the proposition statement, i.e., the collection of modified functions $\tilde{c}$ as $c$ ranges over $\mathcal{C}$.

Expressing multiaccuracy as an IPM bound will allow us to rigorously accomplish two goals: 1) quantifying multiaccuracy from finitely many samples of P𝑿,Y, and 2) correcting a given predictor f to be multiaccurate. These two goals are the subject of Section 3. Similarly, multicalibration and OI can be expressed as IPMs.

Proposition 7 (Multicalibration as an IPM).

Fix a collection of functions $\mathcal{C}\subseteq\{c:\mathcal{X}\to\mathbb{R}\}$, and let $\mathcal{Y}=\{0,1\}$. Fix a predictor $f:\mathcal{X}\to[0,1]$ inducing the distribution $Q_{\boldsymbol{X},Y}$. Moreover, let $\eta_y\triangleq(-1)^{1-y}$. Let $d:[0,1]\to\mathcal{V}\subseteq[0,1]$, $|\mathcal{V}|<\infty$, be a discrete, finite quantization of $[0,1]$, where $P_{\boldsymbol{X}}(d(f(\boldsymbol{X}))=v)>0$ for all $v\in\mathcal{V}$. Define the set of functions

$\tilde{\mathcal{C}}_v\triangleq\Big\{\tilde{c}:\mathcal{X}\times\mathcal{Y}\times\mathcal{V}\to\mathbb{R}\ \Big|\ \tilde{c}(\boldsymbol{x},y,v)=\dfrac{c(\boldsymbol{x})\,\mathbb{1}_{f(\boldsymbol{x})=v}\,\eta_y}{2\,P_{\boldsymbol{X}}(f(\boldsymbol{X})=v)}\ \text{for some } c\in\mathcal{C}\Big\}.$

Then $f$ is $(\mathcal{C},\alpha)$-multicalibrated if and only if $\gamma_{\tilde{\mathcal{C}}_v}(P_{\boldsymbol{X},Y},Q_{\boldsymbol{X},Y})\le\alpha$ for every $v\in\mathcal{V}$.

Proof.

Let $(\boldsymbol{\xi},Y)$ be an identical copy of $(\boldsymbol{X},Y)$. Using the notation in the definition of multicalibration (Definition 4), we have that for every $c\in\mathcal{C}$ and $v\in\mathcal{V}$

$\mathbb{E}[c(\boldsymbol{\xi})(Y-f(\boldsymbol{\xi}))\mid f(\boldsymbol{\xi})=v]$ (12)
$= \mathbb{E}\Big[\dfrac{c(\boldsymbol{\xi})\,\mathbb{1}_{f(\boldsymbol{\xi})=v}}{P_{\boldsymbol{X}}(f(\boldsymbol{\xi})=v)}\big(P_{Y\mid\boldsymbol{X}=\boldsymbol{\xi}}(1)-Q_{Y\mid\boldsymbol{X}=\boldsymbol{\xi}}(1)\big)\Big]$ (13)
$= \mathbb{E}\Big[\dfrac{c(\boldsymbol{\xi})\,\mathbb{1}_{f(\boldsymbol{\xi})=v}}{2\,P_{\boldsymbol{X}}(f(\boldsymbol{\xi})=v)}\big(P_{Y\mid\boldsymbol{X}=\boldsymbol{\xi}}(1)-P_{Y\mid\boldsymbol{X}=\boldsymbol{\xi}}(0)\big)\Big]$ (14)
$\quad-\mathbb{E}\Big[\dfrac{c(\boldsymbol{\xi})\,\mathbb{1}_{f(\boldsymbol{\xi})=v}}{2\,P_{\boldsymbol{X}}(f(\boldsymbol{\xi})=v)}\big(Q_{Y\mid\boldsymbol{X}=\boldsymbol{\xi}}(1)-Q_{Y\mid\boldsymbol{X}=\boldsymbol{\xi}}(0)\big)\Big]$ (15)
$= \mathbb{E}_{P_{\boldsymbol{X},Y}}[\tilde{c}]-\mathbb{E}_{Q_{\boldsymbol{X},Y}}[\tilde{c}],$ (16)

where $\tilde{c}(\boldsymbol{x},y,v)\triangleq\dfrac{c(\boldsymbol{x})\,\mathbb{1}_{f(\boldsymbol{x})=v}\,\eta_y}{2\,P_{\boldsymbol{X}}(f(\boldsymbol{X})=v)}$.

Proposition 8 (OI as an IPM).

Let $\mathcal{C}\subseteq\{c:\mathcal{X}\times[0,1]\times\{0,1\}\to\mathbb{R}\}$ be a collection of functions, and fix a predictor $f:\mathcal{X}\to[0,1]$ inducing the distribution $Q_{\boldsymbol{X},Y}$ on $\mathcal{X}\times\{0,1\}$ via composing with $P_{\boldsymbol{X}}$. Define the set of functions

$\tilde{\mathcal{C}}=\big\{\tilde{c}:\mathcal{X}\times\{0,1\}\to\mathbb{R}\ \big|\ \tilde{c}(\boldsymbol{x},y)=c(\boldsymbol{x},f(\boldsymbol{x}),y)\ \text{for some } c\in\mathcal{C}\big\}.$ (17)

Then, for any $\alpha\ge 0$, $f$ is $(\mathcal{C},\alpha)$-OI if and only if $\gamma_{\tilde{\mathcal{C}}}(P_{\boldsymbol{X},Y},Q_{\boldsymbol{X},Y})\le\alpha$.

Proof.

From Definition 1, if $\gamma_{\tilde{\mathcal{C}}}(P_{\boldsymbol{X},Y},Q_{\boldsymbol{X},Y})\le\alpha$, then

$\gamma_{\tilde{\mathcal{C}}}(P_{\boldsymbol{X},Y},Q_{\boldsymbol{X},Y}) \triangleq \sup_{\tilde{c}\in\tilde{\mathcal{C}}}\big|\mathbb{E}_{(\boldsymbol{X},Y)\sim P_{\boldsymbol{X},Y}}[\tilde{c}(\boldsymbol{X},Y)]-\mathbb{E}_{(\boldsymbol{X},Y)\sim Q_{\boldsymbol{X},Y}}[\tilde{c}(\boldsymbol{X},Y)]\big|$ (18)
$= \sup_{c\in\mathcal{C}}\big|\mathbb{E}_{(\boldsymbol{X},Y)\sim P_{\boldsymbol{X},Y}}[c(\boldsymbol{X},f(\boldsymbol{X}),Y)]$ (19)
$\qquad-\mathbb{E}_{(\boldsymbol{X},Y)\sim Q_{\boldsymbol{X},Y}}[c(\boldsymbol{X},f(\boldsymbol{X}),Y)]\big|$ (20)
$\le \alpha.$ (21)

By Definition 5, $f$ is $(\mathcal{C},\alpha)$-OI. The other direction is analogous.

3 Multiaccuracy in Hilbert Space

We develop a theoretical framework and an algorithm for quantifying and ensuring $(\mathcal{F}_k,\alpha)$-multiaccuracy. We consider the group-denoting functions $\mathcal{F}_k$ to be the unit ball of an infinite-dimensional Hilbert space, namely, an RKHS $\mathcal{H}_k$ defined by a given kernel $k$ (Definition 9). The proposed set of calibration functions $\mathcal{F}_k$ can easily match and exceed the expressivity of group-denoting indicator functions. Surprisingly, despite the expressiveness of $\mathcal{F}_k$, we show that the calibration function that maximizes the multiaccuracy error, i.e., the witness function $c^\star_k$ (Definition 10), has a closed form – in contrast to when $\mathcal{C}$ is, for example, a set of decision trees [15, 28]. This enables us to derive a procedure for ensuring multiaccuracy (KMAcc, Algorithm 1).

3.1 Calibration Functions in RKHS and its Witness Function for Multiaccuracy

Our choice of calibrating functions is the set of functions with bounded norm in an RKHS. First, recall that an RKHS can be defined via kernel functions, as follows (the characterizing property of a real RKHS is that it is a Hilbert space of functions $c:\mathcal{X}\to\mathbb{R}$ for which the evaluation map $c\mapsto c(\boldsymbol{x})$ is a continuous functional from $\mathcal{H}$ to $\mathbb{R}$ for each fixed $\boldsymbol{x}\in\mathcal{X}$).

Definition 9 (Reproducing kernel Hilbert space (RKHS)).

Let $\mathcal{H}\subseteq\{c:\mathcal{X}\to\mathbb{R}\}$ be a real Hilbert space with inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$, and fix a function $k:\mathcal{X}^2\to\mathbb{R}$. We say that $\mathcal{H}$ is a reproducing kernel Hilbert space with kernel $k$ if it holds that $k(\cdot,\boldsymbol{x})\in\mathcal{H}$ for all $\boldsymbol{x}\in\mathcal{X}$ and $\langle c,k(\cdot,\boldsymbol{x})\rangle_{\mathcal{H}}=c(\boldsymbol{x})$ for all $c\in\mathcal{H}$ and $\boldsymbol{x}\in\mathcal{X}$. We denote $\mathcal{H}$ by $\mathcal{H}_k$ if $k$ is given.

We use the structure of the RKHS to define our group-denoting functions. Thus, for a prescribed multiaccuracy level $\alpha$, we will need to restrict attention to elements of $\mathcal{H}_k$ whose norm satisfies a given bound. To normalize, we choose the unit ball in $\mathcal{H}_k$ as our set of calibration functions, i.e.,

$\mathcal{F}_k\triangleq\{c\in\mathcal{H}_k:\|c\|_{\mathcal{H}_k}\le 1\}.$ (22)

We note that when the class of functions is the unit ball in an RKHS, the induced IPM $\gamma_{\mathcal{F}_k}(P,Q)$ is called the maximum mean discrepancy (MMD) [37].

Of particular importance are calibration functions $c^\star\in\mathcal{C}$ that attain the maximal multiaccuracy error (the LHS of (2)). Such functions, called witness functions [27], encode the multiaccuracy definition without the need to consider the full set $\mathcal{C}$.

Definition 10 (Witness function for multiaccuracy).

For a fixed set of calibration functions $\mathcal{C}\subseteq\{c:\mathcal{X}\to\mathbb{R}\}$, predictor $f:\mathcal{X}\to[0,1]$, and distribution $P_{\boldsymbol{X},Y}$, we say that $c^\star\in\mathcal{C}$ is a witness function for multiaccuracy of $f$ with respect to $\mathcal{C}$ if it attains the maximum of the LHS in (2):

$\mu(c^\star,f,P_{\boldsymbol{X},Y})=\max_{c\in\mathcal{C}}\mu(c,f,P_{\boldsymbol{X},Y}).$ (23)

While an RKHS can encompass a broader class of functions than shallow decision trees or linear models, finding the function in the RKHS that errs the most (i.e., the witness function as per Definition 10) is surprisingly simple. Firstly, it can be shown that for the IPM $\gamma_{\mathcal{F}_k}(P,Q)$ (where $\mathcal{F}_k$ is the unit ball in $\mathcal{H}_k\subseteq\{c:\mathcal{Z}\to\mathbb{R}\}$), the function $c^\star\in\mathcal{F}_k$ that maximizes the RHS of (1) is given in closed form, up to a multiplicative constant, by [19, 27]

$c^\star(z)\propto\mathbb{E}_{\zeta\sim P}[k(z,\zeta)]-\mathbb{E}_{\zeta\sim Q}[k(z,\zeta)].$ (24)

By the connection between IPMs and multiaccuracy, we can similarly find the closed form of the witness function for multiaccuracy (Definition 10).

Proposition 11 (Witness function for multiaccuracy).

Fix a kernel function $k:\mathcal{X}^2\to\mathbb{R}$ and a distribution $P_{\boldsymbol{X},Y}$ over $\mathcal{X}\times\{0,1\}$, and assume that $\mathcal{H}_k\subseteq L^1(\mathcal{X})$, where $L^1(\mathcal{X})\triangleq\{c:\mathcal{X}\to\mathbb{R}:\mathbb{E}[|c(\boldsymbol{X})|]<\infty\}$ denotes the space of real-valued functions that are integrable against $P_{\boldsymbol{X}}$. Fix a predictor $f:\mathcal{X}\to[0,1]$ satisfying $\mathbb{E}[k(\cdot,\boldsymbol{X})f(\boldsymbol{X})]\in\mathcal{H}_k$. Then, there exists a unique (up to sign) witness function for multiaccuracy of $f$ with respect to $\mathcal{F}_k$ (as per Definition 10), and it is given by

$c^\star_{k,f}(\boldsymbol{x})\triangleq\mathbb{E}\big[\theta\,(Y-f(\boldsymbol{X}))\,k(\boldsymbol{x},\boldsymbol{X})\big],$ (25)

where $\theta$ is a normalizing constant so that $\|c^\star_{k,f}\|_{\mathcal{H}_k}=1$.

Proof.

First, write $c\in\mathcal{H}_k$ as a limit of finite linear combinations of kernel sections, $c=\sum_i c_i k(\boldsymbol{x}_i,\cdot)$; by continuity of the evaluation functionals on $\mathcal{H}_k$, we obtain that $h_n(\boldsymbol{x})\triangleq\sum_{i=1}^n c_i k(\boldsymbol{x}_i,\boldsymbol{x})\to c(\boldsymbol{x})$ pointwise for each $\boldsymbol{x}\in\mathcal{X}$ as $n\to\infty$ [4, Chapter 1, Corollary 1]. Let $h(\boldsymbol{x})\triangleq\sum_{i=1}^\infty|c_i k(\boldsymbol{x}_i,\boldsymbol{x})|$. Next, applying Proposition 6, $(\mathcal{F}_k,\alpha)$-multiaccuracy of $f$ is equivalent to the IPM bound $\gamma_{\tilde{\mathcal{F}}_k}(P_{\boldsymbol{X},Y},Q_{\boldsymbol{X},Y})\le\alpha$, where $\tilde{\mathcal{F}}_k$ and $Q_{\boldsymbol{X},Y}$ are as constructed in Proposition 6. We then use the definition of IPMs to deduce the formula for the witness function.

We rewrite the function inside the maximization in the definition of $\gamma_{\tilde{\mathcal{F}}_k}(P_{\boldsymbol{X},Y},Q_{\boldsymbol{X},Y})$ as an inner product in $\mathcal{H}_k$. Fix $c$ as above. Then, with $\tilde{c}(\boldsymbol{x},y)\triangleq(-1)^{1-y}c(\boldsymbol{x})/2$, we have that

$2\,\mathbb{E}_{P_{\boldsymbol{X},Y}}[\tilde{c}] = \mathbb{E}_{P_{\boldsymbol{X},Y}}\big[(-1)^{1-Y}c(\boldsymbol{X})\big]$ (26)
$= \mathbb{E}_{P_{\boldsymbol{X},Y}}\big[(-1)^{1-Y}\langle c,k(\cdot,\boldsymbol{X})\rangle_{\mathcal{H}}\big]$ (27)
$= \mathbb{E}_{P_{\boldsymbol{X},Y}}\big[(-1)^{1-Y}\big\langle\textstyle\sum_i c_i k(\boldsymbol{x}_i,\cdot),\,k(\cdot,\boldsymbol{X})\big\rangle_{\mathcal{H}}\big]$ (28)
$= \mathbb{E}_{P_{\boldsymbol{X},Y}}\big[(-1)^{1-Y}\textstyle\sum_i c_i\langle k(\boldsymbol{x}_i,\cdot),\,k(\cdot,\boldsymbol{X})\rangle_{\mathcal{H}}\big]$ (29)
$= \textstyle\sum_i c_i\,\mathbb{E}_{P_{\boldsymbol{X},Y}}\big[(-1)^{1-Y}k(\boldsymbol{x}_i,\boldsymbol{X})\big],$ (30)

where (29) follows by continuity of the inner product and (30) by Fubini's theorem since $\mathcal{H}_k\subseteq L^1(\mathcal{X})$. The same steps follow for $Q_{\boldsymbol{X},Y}$ in place of $P_{\boldsymbol{X},Y}$, and subtracting the ensuing two equations we obtain

$\mathbb{E}_{P_{\boldsymbol{X},Y}}[\tilde{c}]-\mathbb{E}_{Q_{\boldsymbol{X},Y}}[\tilde{c}] = \textstyle\sum_i c_i\,\mathbb{E}_{P_{\boldsymbol{X},Y}}\big[(Y-f(\boldsymbol{X}))\,k(\boldsymbol{x}_i,\boldsymbol{X})\big]$ (31)
$= \big\langle c,\ \mathbb{E}_{P_{\boldsymbol{X},Y}}\big[(Y-f(\boldsymbol{X}))\,k(\cdot,\boldsymbol{X})\big]\big\rangle_{\mathcal{H}}.$ (32)

Therefore, the maximizing function is given up to a normalizing constant by

$c^\star_{k,f}(\boldsymbol{x})\propto\mathbb{E}_{P_{\boldsymbol{X},Y}}\big[(Y-f(\boldsymbol{X}))\,k(\boldsymbol{x},\boldsymbol{X})\big].$

In the presence of finitely many samples, one must resort to numerical approximations of the witness function.

Definition 12 (Empirical Witness Function).

Let 𝒟0 be a finite set of i.i.d. samples from P𝐗,Y. We define the empirical witness function as the plug-in estimator of (25):

$c^\star_{k,\mathcal{D}_0,f}(\boldsymbol{x})=\mathbb{E}_{(\boldsymbol{X},Y)\sim\mathcal{D}_0}\big[\hat{\theta}\,(Y-f(\boldsymbol{X}))\,k(\boldsymbol{x},\boldsymbol{X})\big],$ (33)

where $\hat{\theta}$ is a normalizing constant so that $\|c^\star_{k,\mathcal{D}_0,f}\|_{\mathcal{H}_k}=1$.

Observe that, given a training dataset $\{\boldsymbol{x}_i\}_{i=1}^n$, the witness value at a new sample $\boldsymbol{x}$ is proportional to the sum of the errors at the $\boldsymbol{x}_i$ weighted by $k(\boldsymbol{x},\boldsymbol{x}_i)$ – the similarity between $\boldsymbol{x}_i$ and the new sample $\boldsymbol{x}$ in the kernel space. In other words, the witness function performs a kernel regression of the model's errors. By definition, the witness function attains the supremum in the IPM, which measures the distance between Nature's and the predictor's distributions. Hence, if a new sample $\boldsymbol{x}$ attains a high witness function value, $f(\boldsymbol{x})$ is likely erroneous.
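For concreteness, a minimal NumPy sketch of the plug-in witness function in (33), computed with an RBF kernel and omitting the normalizing constant $\hat{\theta}$ (which only rescales the values); the function and variable names are illustrative and not part of any released code.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # Pairwise RBF kernel matrix: k(x, x') = exp(-gamma * ||x - x'||_2^2).
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def empirical_witness(X_new, X0, y0, f0, gamma=1.0):
    """Unnormalized plug-in witness values: a kernel regression of the
    residuals y - f(x) over the witness dataset D0 (cf. Definition 12)."""
    K = rbf_kernel(X_new, X0, gamma)   # shape (n_new, |D0|)
    residuals = y0 - f0                # base predictor's errors on D0
    return K @ residuals / len(X0)     # residuals averaged with similarity weights
```

Samples whose kernel neighbors are systematically under- or over-predicted receive large positive or negative witness values, matching the correlation visualized in Figure 1.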

We call the multiaccuracy error when $\mathcal{C}$ is the unit ball $\mathcal{F}_k$ of an RKHS the kernel multiaccuracy error, defined via the witness function $c^\star_{k,f}(\boldsymbol{X})$, which attains the maximum error.

Definition 13 (Kernel Multiaccuracy Error (KME)).

Let $\mathcal{F}_k$ be the set of calibration functions in the RKHS $\mathcal{H}_k$ as defined in (22). Given a predictor $f$, the kernel multiaccuracy error (KME) is defined as

$\gamma_k(f,P_{\boldsymbol{X},Y})\triangleq\big|\mathbb{E}[c^\star_{k,f}(\boldsymbol{X})(Y-f(\boldsymbol{X}))]\big|.$ (34)

The empirical version uses the plug-in estimator of the witness function, $c^\star_{k,\mathcal{D}_0,f}$.

Definition 14 (Empirical KME).

Given a test set of freshly sampled i.i.d. datapoints 𝒟, we define the empirical KME by

$\gamma_k(f,\mathcal{D})\triangleq\big|\mathbb{E}_{(\boldsymbol{X},Y)\sim\mathcal{D}}\big[c^\star_{k,\mathcal{D}_0,f}(\boldsymbol{X})(Y-f(\boldsymbol{X}))\big]\big|.$ (35)
 Remark 15 (Overcoming the Curse of Dimensionality).

One important observation is that the MMD estimator depends on the dataset $\mathcal{D}$ only through the kernel $k$. Hence, once the values $k(\boldsymbol{x}_i,\boldsymbol{x}_j)$ are known, the complexity of the estimator is independent of the dimensionality of $\boldsymbol{X}$ – e.g., for $\boldsymbol{X}\in\mathbb{R}^d$, the sample complexity does not scale exponentially with $d$ (see the end of Section 2.1 in [37]).

We give the consistency and rate of convergence of the KME estimator – the finite-sample estimate of KME converges to the true value – following a direct application of [37, Corollary 3.5].

Theorem 16 (Consistency of the KME Estimator, [37, Corollary 3.5]).

Suppose the kernel $k$ is measurable and satisfies $\sup_{\boldsymbol{x}\in\mathcal{X}}k(\boldsymbol{x},\boldsymbol{x})\le C<\infty$. Then, with probability at least $1-2e^{-\tau}$ over the choice of i.i.d. samples $\mathcal{D}$ from $P_{\boldsymbol{X},Y}$ and for every predictor $f:\mathcal{X}\to[0,1]$, there is a constant $A=A(f,P_{\boldsymbol{X},Y})$ such that the following inequality holds:

$\big|\gamma_k(f,\mathcal{D})-\gamma_k(f,P_{\boldsymbol{X},Y})\big|\le A\left(\frac{1+\sqrt{\tau}}{\sqrt{|\mathcal{D}|}}+\frac{\tau}{|\mathcal{D}|}\right).$ (36)

In addition, we have the almost-sure convergence $\gamma_k(f,\mathcal{D})\to\gamma_k(f,P_{\boldsymbol{X},Y})$ as $|\mathcal{D}|\to\infty$.

Next, we present an algorithm, KMAcc, that corrects a given predictor $f$'s multiaccuracy error using the empirical witness function.

3.2 KMAcc: Proposed Algorithm for Multiaccuracy

We propose a simple algorithm, KMAcc (Algorithm 1), that corrects the original predictor's multiaccuracy error. Notably, KMAcc does not require iterative updates, unlike competing boosting- or projection-based methods [28, 15, 8]. In a nutshell, KMAcc first identifies the function in the RKHS that correlates the most with the predictor's error $y-f(\boldsymbol{x})$ (the witness function) and then corrects the original predictions with this function to obtain a multi-group fair model. The first step is surprisingly simple – as shown above, the witness function of an RKHS has a closed form up to a proportionality constant. The second step is an additive update followed by clipping.

As outlined in Algorithm 1, the algorithm takes in a pre-trained base predictor $f$, a proportionality constant $\lambda$, and a (testing) dataset $\mathcal{D}$ on which the model is evaluated. Additionally, to define the witness function and the RKHS, the algorithm is given a dataset $\mathcal{D}_0$ reserved for learning the witness function and a kernel function $k$. With these, for each sample, the algorithm first computes the witness function value and then applies it as an additive correction scaled by $\lambda$, the proportionality constant, which we learn from data (described in the next paragraph). Finally, we clip the output to fall within $[0,1]$.

Learning the Proportionality Constant.

There are multiple approaches to choosing the proportionality constant that scales the witness function appropriately. As an example, we adopt a data-driven approach to find $\lambda$: we use a validation set to perform a grid search over $[0,1]$ for the $\lambda$ that produces a predictor $g(\boldsymbol{x})=f(\boldsymbol{x})+\lambda c^\star_{k,\mathcal{D}_0,f}(\boldsymbol{x})$ that is closest to $f$ in terms of $L^2$ distance while also satisfying $\alpha$-multiaccuracy for a specified $\alpha$ (when the multiaccuracy constraint cannot be met, we output the $\lambda$ that achieves the lowest multiaccuracy error using the witness values of $f$).

Algorithm 1 KMAcc.
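A minimal Python sketch of the procedure just described (a sketch under stated assumptions, not the paper's exact pseudocode): the witness values are assumed to be precomputed, e.g., with the empirical_witness sketch above, and the grid search for $\lambda$ follows the data-driven approach outlined in the previous paragraph.

```python
import numpy as np

def kmacc(f_val, w_val, y_val, f_test, w_test, alpha=0.01,
          lambdas=np.linspace(0.0, 1.0, 101)):
    """One-step KMAcc update: pick the smallest lambda on the validation split for
    which g = clip(f + lambda * witness, 0, 1) meets the multiaccuracy target alpha
    (falling back to the lambda with the lowest error), then post-process test scores."""
    best_lam, best_err = 0.0, np.inf
    for lam in lambdas:
        g_val = np.clip(f_val + lam * w_val, 0.0, 1.0)
        err = abs(np.mean(w_val * (g_val - y_val)))  # multiaccuracy error w.r.t. the witness
        if err <= alpha:
            best_lam = lam
            break
        if err < best_err:
            best_lam, best_err = lam, err
    return np.clip(f_test + best_lam * w_test, 0.0, 1.0), best_lam
```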
 Remark 17 (One-Step Update).

For the linear kernel $k(\boldsymbol{x},\boldsymbol{x}')=\boldsymbol{x}^T\boldsymbol{x}'$, we show that KMAcc yields a $0$-multiaccurate predictor in a single step. While this property does not extend to nonlinear kernels, we observe empirically that the one-step update in KMAcc significantly reduces the empirical KME for RBF kernels. See Appendix C for a detailed discussion.

We discuss in the following section a theoretical framework that gives rise to KMAcc and the grid-search approach.

3.3 Theoretical Framework for KMAcc

We formulate an optimization that, given a predictor $f:\mathcal{X}\to[0,1]$ that is not necessarily multiaccurate, finds the "closest" predictor $g:\mathcal{X}\to[0,1]$ that is corrected for multiaccuracy with respect to the empirical witness function $c^\star_{k,\mathcal{D}_0,f}$ of $f$. Specifically, we use the mean-squared loss to obtain the problem:

$\underset{g:\mathcal{X}\to[0,1]}{\text{minimize}}\quad \tfrac{1}{2}\,\mathbb{E}_{(\boldsymbol{X},Y)\sim\mathcal{D}}\big[(f(\boldsymbol{X})-g(\boldsymbol{X}))^2\big]$ (37)
$\text{subject to}\quad \big|\mathbb{E}_{(\boldsymbol{X},Y)\sim\mathcal{D}}\big[c^\star_{k,\mathcal{D}_0,f}(\boldsymbol{X})(g(\boldsymbol{X})-Y)\big]\big|\le\alpha,$

where 𝒟0 and 𝒟 are sets of i.i.d. samples that are sampled independently of each other.

A closer look at (37) shows that it is a quadratic program (QP); details of the QP formulation are given in Appendix A. Thus, we can solve this QP through its dual problem to obtain a closed-form formula for the solution $g^\star$. The following result follows by applying standard results on QPs [5, Chapter 4.4].

Theorem 18.

Fix two independently sampled sets of i.i.d. samples $\mathcal{D}_0$ and $\mathcal{D}$ from $P_{\boldsymbol{X},Y}$ with $|\mathcal{D}|=n$, and let $\boldsymbol{f}=\boldsymbol{f}_{\mathcal{D}}$, $\boldsymbol{y}=\boldsymbol{y}_{\mathcal{D}}$, $\boldsymbol{c}=\boldsymbol{c}_{\mathcal{D}_0,\mathcal{D}}$, $\boldsymbol{A}=\boldsymbol{A}_{\mathcal{D}_0,\mathcal{D}}$, and $\boldsymbol{b}=\boldsymbol{b}_{\mathcal{D}_0,\mathcal{D}}$ be the fixed vectors and matrix as defined in (41)–(44). Denote an optimization variable $\boldsymbol{\ell}=(\lambda_+,\lambda_-,\boldsymbol{\xi}_+^T,\boldsymbol{\xi}_-^T)^T\in\mathbb{R}^{2n+2}$, and let $\boldsymbol{B}=\tfrac{1}{2}\boldsymbol{A}\boldsymbol{A}^T\in\mathbb{R}^{(2n+2)\times(2n+2)}$ and $\boldsymbol{d}=\boldsymbol{b}-\boldsymbol{A}\boldsymbol{f}$. Let $\boldsymbol{\ell}^\star=(\lambda_+^\star,\lambda_-^\star,(\boldsymbol{\xi}_+^\star)^T,(\boldsymbol{\xi}_-^\star)^T)^T$ be the unique solution to the QP

$\underset{\boldsymbol{\ell}\ge\boldsymbol{0}}{\text{minimize}}\quad \boldsymbol{\ell}^T\boldsymbol{B}\,\boldsymbol{\ell}+\boldsymbol{d}^T\boldsymbol{\ell}.$ (38)

Then, the predictors $g^\star$ solving the optimization (37) are determined by their restriction to $\mathcal{D}$ as

$g^\star(\boldsymbol{x}_i)=f(\boldsymbol{x}_i)+\lambda^\star c^\star_{k,\mathcal{D}_0,f}(\boldsymbol{x}_i)+\xi^\star_i,$ (39)

where $\lambda^\star\triangleq\tfrac{1}{n}(\lambda_-^\star-\lambda_+^\star)$ and $\boldsymbol{\xi}^\star\triangleq\boldsymbol{\xi}_-^\star-\boldsymbol{\xi}_+^\star$. Furthermore, the value of $\boldsymbol{\xi}^\star$ may be chosen so that $g^\star$ is projected onto $[0,1]$: thinking of $\lambda_-$ and $\lambda_+$ as constants, the optimization over a single pair $(\xi_{-,i},\xi_{+,i})$ reduces to minimizing $(\xi_i+\lambda c_i)^2/2+f_i\xi_i+\xi_{+,i}$ over the feasible $\xi_{+,i}\ge 0$ and $\xi_{-,i}$, whose optimal value is $\xi^\star_i=-\lambda c_i-f_i$ if $\lambda c_i+f_i<0$, $\xi^\star_i=0$ if $\lambda c_i+f_i\in[0,1]$, and $\xi^\star_i=1-\lambda c_i-f_i$ if $\lambda c_i+f_i>1$; this translates to clipping $g^\star$ to be within $[0,1]$. In particular, applying KMAcc (Algorithm 1) with the value $\lambda=\lambda^\star$ attains a solution to (37).

4 Experiments

Figure 2: Multiaccuracy error (KME, Definition 14) vs. calibration error (MSCE, Definition 19) for KMAcc (our method), competing methods (LSBoost and MCBoost), and KMAcc with isotonic calibration, a standard score quantization technique. Predictor performance is measured as AUC and labeled next to each method. KMAcc achieves improved or comparable KME and AUC relative to the baselines and MCBoost (with the exception of the Naive Bayes baseline classifier). Notably, KMAcc + isotonic calibration significantly improves MSCE while maintaining KME and AUC, with more favorable results than both LSBoost and MCBoost. Results are shown for the FolkTables Income of WA dataset (first row) and the Health Coverage of WI dataset (second row), with error bars on both axes (the MSCE standard deviation is often imperceptible).

We benchmark our proposed algorithm, KMAcc (Algorithm 1), on four synthetic datasets and eight real-world tabular datasets (an implementation of KMAcc can be found at https://github.com/Carol-Long/KMAcc). We demonstrate KMAcc's competitive or improved performance relative to competing interventions, both in multi-group fairness metrics and in AUC. Full experimental results are provided in Appendix B.

4.1 Datasets

We provide experimental results on the US Census dataset FolkTables. We conduct four binary classification tasks (ACSIncome, ACSPublicCoverage, ACSMobility, and ACSEmployment), using two different states for each task. In addition, we generate four synthetic datasets using the sklearn.datasets module in Scikit-Learn [35]: moons, concentric circles, blobs with varied variance, and anisotropically distributed data; a sketch of this data generation is given below.
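The synthetic benchmarks can be regenerated along the following lines (a sketch; the sample sizes, noise levels, and the anisotropic transformation are illustrative choices, not necessarily those used in our experiments):

```python
import numpy as np
from sklearn.datasets import make_moons, make_circles, make_blobs

seed = 0
X_moons, y_moons = make_moons(n_samples=2000, noise=0.2, random_state=seed)
X_circles, y_circles = make_circles(n_samples=2000, noise=0.05, factor=0.5,
                                    random_state=seed)
X_varied, y_varied = make_blobs(n_samples=2000, centers=2,
                                cluster_std=[1.0, 2.5], random_state=seed)  # varied variance
X_blobs, y_blobs = make_blobs(n_samples=2000, centers=2, random_state=seed)
X_aniso = X_blobs @ np.array([[0.6, -0.6], [-0.4, 0.8]])                    # anisotropic blobs
y_aniso = y_blobs
```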

4.2 Competing Methods

We benchmark our method against LSBoost by [15] (we use the official implementation available at https://github.com/Declancharrison/Level-Set-Boosting) and MCBoost by [29] (we use the official implementation available at https://osf.io/kfpr4/?view_only=adf843b070f54bde9f529f910944cd99), which are (to the best of our knowledge) the two existing multi-group fairness algorithms with usable Python implementations.

The mechanism of LSBoost is the following: over a number of level-set partitions, each indexed by $v$, LSBoost finds a function $c_v^{t+1}\in\mathcal{C}$ through a squared-error regression oracle, then updates the predictor by rounding values to each level set, using indicators of which set each point lies in under the previous $f^t$ combined with the learned $c_v^{t+1}$: $\hat{f}=\sum_{v\in[1/m]}\mathbb{1}[f^t(x)=v]\,c_v^{t+1}(x)$, and $f^{t+1}=\mathrm{Round}(\hat{f}^{t+1},m)$. This updating continues so long as an error term, measured by the expected squared error, continues to decrease at a rate above a parameterized value. $\mathcal{C}$ is taken to be linear regression or decision trees.

The MCBoost algorithm performs an iterative multiplicative-weights update applied to successively learned functions. Starting with an initial predictor $p_0$, it learns a series of grouping functions $c(x)\in\mathcal{C}$ that maximize the multiaccuracy error. The algorithm stores a set of calibration points $S$ and validation points $V$; at each step it generates the set $S_t$ by including $(x,y-p_t(x))\in S_t$ for every $(x,y)\in S$. Then, using a weak agnostic learner on $S_t$, it produces a function $c$ whose multiaccuracy is checked on the validation set $V$ with the empirical estimate of the multiaccuracy error, before enacting a multiplicative-weights update $f_{t+1}(\boldsymbol{x})=e^{\eta h_{t,S}(\boldsymbol{x})}f_t(\boldsymbol{x})$ if the multiaccuracy error is large. There are three different classes $\mathcal{C}$ from which it might draw $c$: sub-populations parameterized by some number of intersections of features, ridge regression, or shallow decision trees.

4.3 Performance Metrics

We evaluate the performance of baseline and multi-group fair models across three metrics: Kernel Multiaccuracy Error (KME, Definition 13), Area Under the ROC Curve (AUC), and Mean-Squared Calibration Error (MSCE), where MSCE is defined as follows.

Definition 19 (Mean-Squared Calibration Error (MSCE), [15]).

The Mean-Squared Calibration Error (MSCE) over a dataset $\mathcal{D}$ of a predictor $f:\mathcal{X}\to[0,1]$ with a countable range $R(f)$ is defined by

$\mathrm{MSCE}(f,\mathcal{D})\triangleq\sum_{v\in R(f)}\Pr_{(\boldsymbol{X},Y)\sim\mathcal{D}}[f(\boldsymbol{X})=v]\,\Big(\mathbb{E}_{(\boldsymbol{X},Y)\sim\mathcal{D}}\big[(Y-v)\mid f(\boldsymbol{X})=v\big]\Big)^2.$
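A short sketch of how MSCE can be computed for a predictor with a countable range (e.g., quantized or isotonic-calibrated scores); variable names are illustrative:

```python
import numpy as np

def msce(scores, labels):
    # Sum over score levels v of Pr[f(X)=v] * (E[Y - v | f(X)=v])^2.
    error = 0.0
    for v in np.unique(scores):
        mask = scores == v
        error += mask.mean() * (labels[mask].mean() - v) ** 2
    return error
```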

Our algorithm optimizes for KME and utility, while LSBoost [15] optimizes for MSCE. Hence, both of these metrics are reported. MCBoost [28] optimizes for multiaccuracy error (without considering calibration functions in the kernel space) and classification accuracy. We report AUC since it captures the models’ performance across all classification thresholds.

4.4 Methodology

To implement and benchmark KMAcc, we proceed through the following steps.

Data splits.

We assume access to a set of i.i.d. samples $\mathcal{D}$ drawn from $P_{\boldsymbol{X},Y}$, where $P_{\boldsymbol{X},Y}$ is a distribution over $\mathcal{X}\times\mathcal{Y}=\mathbb{R}^m\times\{0,1\}$. We randomly partition $\mathcal{D}$ into four disjoint subsets: $\mathcal{D}_{\text{train}}$ (for training the baseline predictor $f$), $\mathcal{D}_0$ (for computing the witness function $c^\star_{k,\mathcal{D}_0,f}$), $\mathcal{D}_{\text{val}}$ (for finding the proportionality constant $\lambda$), and finally $\mathcal{D}_{\text{test}}$ for benchmarking the performance of KMAcc against the state-of-the-art methods on a test set.

Baseline predictor 𝒇.

Using the training data $\mathcal{D}_{\text{train}}$, we learn a baseline classifier $f$. Our algorithm treats this function $f$ as a black box. For our experiments, we use four distinct supervised classification models as baselines: Logistic Regression, a 2-layer Neural Network, Random Forests, and Gaussian Naive Bayes, all implemented in Scikit-learn [35]. We train these on $\mathcal{D}_{\text{train}}$, whose samples are not used in learning the witness function or in running KMAcc.

Learning the witness function.

We take as our class of calibration functions the unit ball $\mathcal{F}_k$ in the RKHS $\mathcal{H}_k$ (Equation 22), with the kernel $k$ being the RBF kernel, given explicitly for a parameter $\gamma>0$ by

$k_\gamma(\boldsymbol{x},\boldsymbol{x}')=\exp\big(-\gamma\|\boldsymbol{x}-\boldsymbol{x}'\|_2^2\big).$ (40)

The value of $\gamma$ is a hyperparameter that we fine-tune using $\mathcal{D}_0$. We conduct a grid search over $\gamma$ to find a value such that $c^\star_{k,\mathcal{D}_0,f}$ has maximal correlation with the errors $y-f(\boldsymbol{x})$, thereby obtaining $c^\star_{k,\mathcal{D}_0,f}\in\mathcal{F}_k$ in terms of $f$, $\gamma$, and $\mathcal{D}_0$ (see Proposition 11). To carry out this step, we run the grid search on $\gamma$ using K-fold validation on $\mathcal{D}_0$; a sketch is given below. The normalizing constant $\theta$ in Proposition 11 (ensuring $\|c^\star_{k,\mathcal{D}_0,f}\|_{\mathcal{H}_k}=1$) can be skipped in this step for the sake of finding the optimal multiaccurate predictor $g$ solving (37), because $\theta$ can be subsumed in the value of the optimal parameter $\lambda$.
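A sketch of the bandwidth grid search (reusing the empirical_witness sketch from Section 3.1; the grid and the number of folds are illustrative assumptions):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import KFold

def tune_gamma(X0, y0, f0, gammas=np.logspace(-3, 2, 20), n_splits=5):
    """Select the gamma whose witness values correlate most with the residuals
    y - f(x), estimated via K-fold validation on D0."""
    best_gamma, best_corr = gammas[0], -np.inf
    for gamma in gammas:
        fold_corrs = []
        for fit_idx, val_idx in KFold(n_splits, shuffle=True, random_state=0).split(X0):
            w = empirical_witness(X0[val_idx], X0[fit_idx], y0[fit_idx], f0[fit_idx], gamma)
            fold_corrs.append(pearsonr(w, y0[val_idx] - f0[val_idx])[0])
        if np.mean(fold_corrs) > best_corr:
            best_gamma, best_corr = gamma, np.mean(fold_corrs)
    return best_gamma
```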

Performing KMAcc.

Using $\mathcal{D}_{\text{val}}$, we perform a simple grid search to find the smallest $\lambda$ such that the multiaccuracy condition (Definition 13) is met (alternatively, one could solve the QP detailed in Theorem 18). Running KMAcc (Algorithm 1), we update $f$ using $\lambda$ and $c^\star_{k,\mathcal{D}_0,f}$ to obtain the multi-group fair classifier $g=f+\lambda c^\star_{k,\mathcal{D}_0,f}$.

4.5 Results

Figure 3: Test errors over witness-value contours using the RBF kernel. First Column: Visualization of the moon, concentric-circle, blob, and needle datasets; red and blue represent the true labels. Second Column: Classification via a logistic regression classifier. Witness function values (Definition 10) $c^\star_{k,\mathcal{D}_0,f}$ are plotted as a contour under the errors of the classifiers on test samples, $y-f(\boldsymbol{x})$. The witness function values are highly correlated with the errors of the predictor. Dark green and dark purple dots mark where the classifier is most erroneous for the blue class and the red class, respectively. The logistic regression model is not capable of classifying concentric circles, resulting in almost the entire blue class being misclassified. We show a similar pattern between classifier error and witness function values for a random forest classifier (Third Column) and a multi-layer perceptron classifier (Fourth Column).

Following the process described in Section 4.4, we test KMAcc across various baseline classifiers using implementations in Scikit-Learn [35]. For each US Census dataset, we execute five runs of each model and report the mean value of each metric alongside error bars.

Firstly, on synthetic datasets, we demonstrate that the witness function is a good predictor of classifier error. In Figure 1, we train a logistic regression classifier on the moons dataset to perform binary classification. The classifier has an accuracy of 0.85 and an AUC of 0.94, and most errors occur in the middle, where the red and blue classes are not linearly separable. Indeed, samples with high errors in scores $y-f(\boldsymbol{x})$ also receive high predicted errors in terms of witness function values. The scatter plot (Right Column) illustrates the linear correlation between test error and witness value, with a high Pearson correlation coefficient of 0.828. Complete results using additional baseline models (Random Forest and Multi-layer Perceptron) are shown in Figure 3.

On the US Census datasets, as demonstrated in Figure 2, KMAcc achieves the lowest KME relative to competing models without sacrificing AUC, and KMAcc paired with isotonic calibration achieves the lowest multi-group metrics (KME and MSCE) while maintaining competitive AUC. In Figure 2, baseline models (blue circle) have high MSCE, and most have non-negligible KME, with the exception of neural networks. Post-processing the baseline models using KMAcc (yellow rectangle), we see a significant reduction in KME from the baseline (shifting to the left of the plot), and in a majority of experiments the post-processed models achieve, on the test set, the pre-specified KME constraint $\gamma_k(g,P_{\boldsymbol{X},Y})<0.01$. To target low calibration error (measured by MSCE on the y-axis), we apply off-the-shelf isotonic calibration on top of KMAcc. We observe that applying KMAcc + isotonic calibration (red diamond) to the baselines results in low errors on both axes (KME and MSCE). Across all baselines and experiments, applying either KMAcc or KMAcc + isotonic calibration does not degrade the predictive power of the models – the AUCs (labeled next to each model) of models corrected by the proposed methods either stay relatively unchanged or improve.
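The isotonic-calibration step on top of KMAcc can be reproduced with scikit-learn's IsotonicRegression; a hedged sketch (the split used for fitting and the variable names are illustrative):

```python
from sklearn.isotonic import IsotonicRegression

def isotonic_postprocess(g_fit, y_fit, g_test):
    """Fit a monotone map from KMAcc scores to labels on held-out data and apply it to
    test scores; monotonicity preserves the ranking (and hence essentially the AUC)."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(g_fit, y_fit)
    return iso.predict(g_test)
```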

The competing method MCBoost achieves an effective reduction in KME with minimal improvement in MSCE, without sacrificing AUC. We note that KMAcc + isotonic calibration enjoys comparable or better performance than MCBoost on KME and better performance on MSCE, while eliminating the need for the iterative updates that MCBoost requires to minimize miscalibration. LSBoost (orange polygon) achieves low MSCE while worsening both KME and AUC.

5 Discussion and Conclusion

We connect multi-group fairness notions to integral probability metrics (IPMs), providing a unifying statistical perspective on multiaccuracy, multicalibration, and OI. This perspective leads us to a simple yet powerful algorithm (KMAcc) for achieving multiaccuracy with respect to a class of functions defined by an RKHS. KMAcc boils down to first predicting the error of the classifier using the witness function, and then subtracting the error away. This algorithm enjoys provable performance guarantees and empirically achieves favorable accuracy and multi-group metrics relative to competing methods.

A limitation of our empirical comparison to other methods is that we optimize over the calibration function class given by the unit-ball RKHS with the RBF kernel, which may not be the set of calibration functions for which other benchmarks achieve their lowest multiaccuracy or multicalibration error. Furthermore, while the proposed method achieves favorable multicalibration results, the algorithm does not have provable guarantees for multicalibration. Developing a multicalibration-ensuring algorithm through the IPM perspective is an exciting future direction.

To conclude, this work contributes to the greater effort of reducing embedded human bias in ML fairness. To this end, we adopt an RKHS as the expressive group-denoting function class for ensuring multi-group notions, rather than using predefined groups. It remains an open question to explore the structure of the witness function – the most biased group-denoting function in the RKHS – and its relationship to predefined group attributes, which may inform us of the intersectionality and structure of errors in ML models.

References

  • [1] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A reductions approach to fair classification. In International conference on machine learning, pages 60–69. PMLR, 2018. URL: http://proceedings.mlr.press/v80/agarwal18a.html.
  • [2] Wael Alghamdi, Hsiang Hsu, Haewon Jeong, Hao Wang, Peter Michalak, Shahab Asoodeh, and Flavio Calmon. Beyond adult and compas: Fair multi-class prediction via information projection. Advances in Neural Information Processing Systems, 35:38747–38760, 2022.
  • [3] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and machine learning: Limitations and opportunities. MIT Press, 2023.
  • [4] Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer, 1 edition, 2004. doi:10.1007/9781441990969.
  • [5] Stephen P Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • [6] Peng Cui, Wenbo Hu, and Jun Zhu. Calibrated reliable regression using maximum mean discrepancy. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17164–17175. Curran Associates, Inc., 2020. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/c74c4bf0dad9cbae3d80faa054b7d8ca-Paper.pdf.
  • [7] A Philip Dawid. The well-calibrated bayesian. Journal of the American Statistical Association, 77(379):605–610, 1982.
  • [8] Zhun Deng, Cynthia Dwork, and Linjun Zhang. Happymap: A generalized multi-calibration method. arXiv preprint arXiv:2303.04379, 2023. doi:10.48550/arXiv.2303.04379.
  • [9] Cynthia Dwork, Michael P Kim, Omer Reingold, Guy N Rothblum, and Gal Yona. Outcome indistinguishability. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 1095–1108, 2021. doi:10.1145/3406325.3451064.
  • [10] Cynthia Dwork, Daniel Lee, Huijia Lin, and Pranay Tankala. From pseudorandomness to multi-group fairness and back. In The Thirty Sixth Annual Conference on Learning Theory, pages 3566–3614. PMLR, 2023. URL: https://proceedings.mlr.press/v195/dwork23a.html.
  • [11] Vitaly Feldman. Distribution-specific agnostic boosting. In International Conference on Supercomputing, 2009. URL: https://api.semanticscholar.org/CorpusID:2787595.
  • [12] Sorelle A Friedler, Carlos Scheidegger, Suresh Venkatasubramanian, Sonam Choudhary, Evan P Hamilton, and Derek Roth. A comparative study of fairness-enhancing interventions in machine learning. In Proceedings of the conference on fairness, accountability, and transparency, pages 329–338, 2019. doi:10.1145/3287560.3287589.
  • [13] Sumegha Garg, Christopher Jung, Omer Reingold, and Aaron Roth. Oracle efficient online multicalibration and omniprediction. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2725–2792. SIAM, 2024. doi:10.1137/1.9781611977912.98.
  • [14] Ira Globus-Harris, Varun Gupta, Christopher Jung, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Multicalibrated regression for downstream fairness. arXiv preprint arXiv:2209.07312, 2022. doi:10.48550/arXiv.2209.07312.
  • [15] Ira Globus-Harris, Declan Harrison, Michael Kearns, Aaron Roth, and Jessica Sorrell. Multicalibration as boosting for regression. arXiv preprint arXiv:2301.13767, 2023. doi:10.48550/arXiv.2301.13767.
  • [16] Robert Kent Goodrich. A riesz representation theorem. In Proc. Amer. Math. Soc, volume 24, pages 629–636, 1970.
  • [17] Parikshit Gopalan, Lunjia Hu, Michael P Kim, Omer Reingold, and Udi Wieder. Loss minimization through the lens of outcome indistinguishability. arXiv preprint arXiv:2210.08649, 2022. doi:10.48550/arXiv.2210.08649.
  • [18] Parikshit Gopalan, Adam Tauman Kalai, Omer Reingold, Vatsal Sharan, and Udi Wieder. Omnipredictors. arXiv preprint arXiv:2109.05389, 2021. arXiv:2109.05389.
  • [19] Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex Smola. A kernel method for the two-sample-problem. Advances in neural information processing systems, 19, 2006.
  • [20] Nika Haghtalab, Michael Jordan, and Eric Zhao. A unifying perspective on multi-calibration: Game dynamics for multi-objective learning. Advances in Neural Information Processing Systems, 36, 2024.
  • [21] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. Advances in neural information processing systems, 29, 2016.
  • [22] Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, pages 1939–1948. PMLR, 2018.
  • [23] Emmie Hine and Luciano Floridi. The blueprint for an ai bill of rights: in search of enaction, at risk of inaction. Minds and Machines, pages 1–8, 2023.
  • [24] Max Hort, Zhenpeng Chen, Jie M Zhang, Federica Sarro, and Mark Harman. Bias mitigation for machine learning classifiers: A comprehensive survey. arXiv preprint arXiv:2207.07068, 2022. doi:10.48550/arXiv.2207.07068.
  • [25] Faisal Kamiran, Asim Karim, and Xiangliang Zhang. Decision theory for discrimination-aware classification. In 2012 IEEE 12th international conference on data mining, pages 924–929. IEEE, 2012. doi:10.1109/ICDM.2012.45.
  • [26] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International conference on machine learning, pages 2564–2572. PMLR, 2018.
  • [27] Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. Examples are not enough, learn to criticize! criticism for interpretability. Advances in neural information processing systems, 29, 2016.
  • [28] Michael P Kim, Amirata Ghorbani, and James Zou. Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 247–254, 2019. doi:10.1145/3306618.3314287.
  • [29] Michael P. Kim, Christoph Kern, Shafi Goldwasser, and Frauke Kreuter. Universal adaptability: Target-independent inference that competes with propensity scoring. Proceedings of the National Academy of Sciences, 119(4):e2108097119, 2022.
  • [30] Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In International Conference on Machine Learning, pages 2805–2814. PMLR, 2018.
  • [31] Carol Xuan Long, Hsiang Hsu, Wael Alghamdi, and Flavio Calmon. Individual arbitrariness and group fairness. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [32] Charles Marx, Sofian Zalouk, and Stefano Ermon. Calibration by distribution matching: Trainable kernel calibration metrics. arXiv preprint arXiv:2310.20211, 2023. doi:10.48550/arXiv.2310.20211.
  • [33] Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in applied probability, 29(2):429–443, 1997.
  • [34] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning, pages 625–632, 2005. doi:10.1145/1102351.1102430.
  • [35] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. doi:10.5555/1953048.2078195.
  • [36] Hao Song, Tom Diethe, Meelis Kull, and Peter Flach. Distribution calibration for regression. In International Conference on Machine Learning, pages 5897–5906. PMLR, 2019. URL: http://proceedings.mlr.press/v97/song19a.html.
  • [37] Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG Lanckriet. On the empirical estimation of integral probability metrics, 2012.
  • [38] Bharath K Sriperumbudur, Kenji Fukumizu, and Gert RG Lanckriet. Universality, characteristic kernels and rkhs embedding of measures. Journal of Machine Learning Research, 12(7), 2011.
  • [39] David Widmann, Fredrik Lindsten, and Dave Zachariah. Calibration tests in multi-class classification: A unifying framework. Advances in neural information processing systems, 32, 2019.
  • [40] David Widmann, Fredrik Lindsten, and Dave Zachariah. Calibration tests beyond classification. arXiv preprint arXiv:2210.13355, 2022. doi:10.48550/arXiv.2210.13355.
  • [41] Lujing Zhang, Aaron Roth, and Linjun Zhang. Fair risk control: A generalized framework for calibrating multi-group fairness risks. arXiv preprint arXiv:2405.02225, 2024. doi:10.48550/arXiv.2405.02225.

Appendix

Appendix A Details of Theoretical Framework

We now show that the optimization in Equation (37) is a quadratic program (QP). To begin, writing $\mathcal{D}=\{(\boldsymbol{x}_j,y_j)\}_{j=1}^n$ and denoting

$\boldsymbol{f}_{\mathcal{D}}=(f(\boldsymbol{x}_1),\ldots,f(\boldsymbol{x}_n))^T,\qquad \boldsymbol{g}=(g(\boldsymbol{x}_1),\ldots,g(\boldsymbol{x}_n))^T,$ (41)

the objective function becomes the quadratic function $\tfrac{1}{2}\|\boldsymbol{f}_{\mathcal{D}}-\boldsymbol{g}\|_2^2$. Similarly, the constraint is a linear inequality in $\boldsymbol{g}$, which we write as $\boldsymbol{A}_{\mathcal{D}_0,\mathcal{D}}\,\boldsymbol{g}\le\boldsymbol{b}_{\mathcal{D}_0,\mathcal{D}}$, where $\boldsymbol{A}_{\mathcal{D}_0,\mathcal{D}}\in\mathbb{R}^{(2n+2)\times n}$ and $\boldsymbol{b}_{\mathcal{D}_0,\mathcal{D}}\in\mathbb{R}^{2n+2}$ are fixed and determined by $\mathcal{D}_0$ and $\mathcal{D}$ in view of Equation (33) for the empirical witness function $c^\star_{k,\mathcal{D}_0,f}$. Explicitly, denoting $\mathcal{D}_0=\{(\tilde{\boldsymbol{x}}_i,\tilde{y}_i)\}_{i=1}^m$, let us use the shorthands

$\boldsymbol{y}_{\mathcal{D}}=(y_1,\ldots,y_n)^T,\qquad \boldsymbol{c}_{\mathcal{D}_0,\mathcal{D}}=\big(c^\star_{k,\mathcal{D}_0,f}(\boldsymbol{x}_1),\ldots,c^\star_{k,\mathcal{D}_0,f}(\boldsymbol{x}_n)\big)^T.$ (42)

Then, the multiaccuracy constraint in (37) can be written as $\big|\boldsymbol{c}_{\mathcal{D}_0,\mathcal{D}}^T\boldsymbol{g}/n-\boldsymbol{c}_{\mathcal{D}_0,\mathcal{D}}^T\boldsymbol{y}_{\mathcal{D}}/n\big|\le\alpha$. Taking the search space into consideration (i.e., $g$ takes values in $[0,1]$), we see that (37) may be rewritten as the following QP:

$\underset{\boldsymbol{g}\in\mathbb{R}^n}{\text{minimize}}\quad \tfrac{1}{2}\|\boldsymbol{f}_{\mathcal{D}}-\boldsymbol{g}\|_2^2$ (43)
$\text{subject to}\quad \boldsymbol{A}_{\mathcal{D}_0,\mathcal{D}}\,\boldsymbol{g}\le\boldsymbol{b}_{\mathcal{D}_0,\mathcal{D}},$

where we define the constraint’s matrix and vector by

$\boldsymbol{A}_{\mathcal{D}_0,\mathcal{D}}\triangleq\begin{pmatrix}\boldsymbol{c}_{\mathcal{D}_0,\mathcal{D}}^T/n\\ -\boldsymbol{c}_{\mathcal{D}_0,\mathcal{D}}^T/n\\ \boldsymbol{I}_n\\ -\boldsymbol{I}_n\end{pmatrix},\qquad \boldsymbol{b}_{\mathcal{D}_0,\mathcal{D}}\triangleq\begin{pmatrix}\alpha+\boldsymbol{c}_{\mathcal{D}_0,\mathcal{D}}^T\boldsymbol{y}/n\\ \alpha-\boldsymbol{c}_{\mathcal{D}_0,\mathcal{D}}^T\boldsymbol{y}/n\\ \boldsymbol{1}_n\\ \boldsymbol{0}_n\end{pmatrix}.$ (44)
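For reference, a sketch of solving the primal QP (43)–(44) directly with an off-the-shelf convex solver (assuming cvxpy is available; the function name is illustrative):

```python
import numpy as np
import cvxpy as cp

def solve_kmacc_qp(f_D, y_D, c_D, alpha):
    """Find scores g closest to f_D in squared error subject to the empirical
    multiaccuracy constraint |c^T g / n - c^T y / n| <= alpha and 0 <= g <= 1."""
    n = len(f_D)
    A = np.vstack([c_D[None, :] / n, -c_D[None, :] / n, np.eye(n), -np.eye(n)])
    b = np.concatenate([[alpha + c_D @ y_D / n, alpha - c_D @ y_D / n],
                        np.ones(n), np.zeros(n)])
    g = cp.Variable(n)
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(f_D - g)), [A @ g <= b])
    problem.solve()
    return g.value
```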

Note that the witness function for a kernel $k:\mathcal{X}^2\to\mathbb{R}$, dataset $\mathcal{D}_0=\{(\tilde{\boldsymbol{x}}_j,\tilde{y}_j)\}_{j\in[m]}$, and predictor $g:\mathcal{X}\to[0,1]$ is given by

$c^\star_{k,\mathcal{D}_0,g}(\boldsymbol{x})=\dfrac{\theta_{k,\mathcal{D}_0,g}}{m}\,(\tilde{\boldsymbol{g}}-\tilde{\boldsymbol{y}})^T\tilde{\boldsymbol{k}}(\boldsymbol{x})$ (45)
$\phantom{c^\star_{k,\mathcal{D}_0,g}(\boldsymbol{x})}=\dfrac{(\tilde{\boldsymbol{g}}-\tilde{\boldsymbol{y}})^T\tilde{\boldsymbol{k}}(\boldsymbol{x})}{\sqrt{(\tilde{\boldsymbol{g}}-\tilde{\boldsymbol{y}})^T\tilde{\boldsymbol{K}}(\tilde{\boldsymbol{g}}-\tilde{\boldsymbol{y}})}},$ (46)

where $\tilde{\boldsymbol{k}}:\mathcal{X}\to\mathbb{R}^m$ is the vector-valued function defined by $\tilde{\boldsymbol{k}}(\boldsymbol{x})\triangleq(k(\boldsymbol{x},\tilde{\boldsymbol{x}}_j))_{j\in[m]}$, $\tilde{\boldsymbol{g}}\triangleq(g(\tilde{\boldsymbol{x}}_j))_{j\in[m]}$ and $\tilde{\boldsymbol{y}}\triangleq(\tilde{y}_j)_{j\in[m]}$ are fixed vectors, and $\theta_{k,\mathcal{D}_0,g}$ is a normalizing constant that is unique up to sign. We may compute $\theta_{k,\mathcal{D}_0,g}$ by setting $\|c^\star_{k,\mathcal{D}_0,g}\|_{\mathcal{H}_k}=1$, namely, we have

$\theta_{k,\mathcal{D}_0,g}^2=\dfrac{m^2}{(\tilde{\boldsymbol{g}}-\tilde{\boldsymbol{y}})^T\tilde{\boldsymbol{K}}(\tilde{\boldsymbol{g}}-\tilde{\boldsymbol{y}})},$ (47)

where $\tilde{\boldsymbol{K}}\triangleq(k(\tilde{\boldsymbol{x}}_i,\tilde{\boldsymbol{x}}_j))_{i,j\in[m]}$ is a fixed matrix. Thus, the multiaccuracy constraint becomes

$\big|\bar{\boldsymbol{h}}^T\bar{\boldsymbol{K}}\tilde{\boldsymbol{h}}\big|\le n\alpha\sqrt{\tau_{\tilde{\boldsymbol{h}}}},$ (48)

where $\tau_{\tilde{\boldsymbol{h}}}\triangleq\tilde{\boldsymbol{h}}^T\tilde{\boldsymbol{K}}\tilde{\boldsymbol{h}}$ and $\boldsymbol{h}=(\bar{\boldsymbol{h}}^T,\tilde{\boldsymbol{h}}^T)^T$. With $\boldsymbol{h}=\boldsymbol{g}-\boldsymbol{y}$ and $\boldsymbol{r}=\boldsymbol{f}-\boldsymbol{y}$, the objective becomes $\tfrac{1}{2n}\|\boldsymbol{r}-\boldsymbol{h}\|_2^2$. At each iteration over $\tau$, one checks $\tau$ against $\tau_{\tilde{\boldsymbol{h}}}$.

We may compute the KME of a predictor $g$ with respect to the class $\mathcal{F}_k$ and dataset $\mathcal{D}_1=\{(\boldsymbol{x}_i,y_i)\}_{i\in[n]}$ via the equation

$\mathrm{KME}(k,\mathcal{D}_1,\mathcal{D}_0,g)=\dfrac{1}{n}\cdot\dfrac{\big|(\boldsymbol{g}-\boldsymbol{y})^T\boldsymbol{K}(\tilde{\boldsymbol{g}}-\tilde{\boldsymbol{y}})\big|}{\sqrt{(\tilde{\boldsymbol{g}}-\tilde{\boldsymbol{y}})^T\tilde{\boldsymbol{K}}(\tilde{\boldsymbol{g}}-\tilde{\boldsymbol{y}})}},$ (49)

where $\tilde{\boldsymbol{g}},\tilde{\boldsymbol{y}},\tilde{\boldsymbol{K}}$ are computed on $\mathcal{D}_0$ as above, $\boldsymbol{g}\triangleq(g(\boldsymbol{x}_i))_{i\in[n]}$ and $\boldsymbol{y}\triangleq(y_i)_{i\in[n]}$ are fixed vectors, and $\boldsymbol{K}\triangleq(k(\boldsymbol{x}_i,\tilde{\boldsymbol{x}}_j))_{(i,j)\in[|\mathcal{D}_1|]\times[|\mathcal{D}_0|]}$ is a fixed matrix. Note that if $\mathcal{D}_0$ is used for computing $c^\star_{k,\mathcal{D}_0,f}$ for a given predictor $f$ and $g$ is then obtained using $c^\star_{k,\mathcal{D}_0,f}$ (so that $\mathcal{D}_0$ was used for deriving $g$), then one should report $\mathrm{KME}(k,\mathcal{D}_1,\mathcal{D}_0,g)$ for a freshly sampled $\mathcal{D}_0$ at the testing phase.
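Equation (49) translates directly into a few lines of NumPy (a sketch; K_cross and K0 denote the kernel matrices described above, and the names are illustrative):

```python
import numpy as np

def kme(K_cross, K0, g_test, y_test, g0, y0):
    """Empirical KME via Equation (49). K_cross is the |D1| x |D0| matrix of
    k(x_i, x~_j); K0 is the |D0| x |D0| kernel matrix on D0."""
    a = g0 - y0                                  # residual weights on D0
    numerator = abs((g_test - y_test) @ K_cross @ a)
    denominator = np.sqrt(a @ K0 @ a)
    return numerator / (len(g_test) * denominator)
```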

Appendix B Complete Experimental Results

Ablation.

As we have presented evidence that isotonic calibration plus KMAcc can be an effective post-processing method, for the purpose of ablation we now analyze isotonic calibration applied directly to the baseline classifier. We note that isotonic calibration tends to maintain an equivalent or higher AUC because the monotonic map preserves the ranking of the samples up to tie-breaking (which rarely has an influence) [34]. Our ablation method frequently achieves a similar or better MSCE than LSBoost (as discussed, the baseline plus isotonic calibration achieves an MSCE below 0.02 in all benchmarks, while LSBoost only achieves this in 23 of 40 benchmarks), and a better average MSCE than KMAcc alone in all benchmarks. However, isotonic calibration alone has a significantly lower KME than KMAcc in 20 of 40 benchmarks, confirming the utility of an algorithm that also explicitly optimizes for multiaccuracy error.

Figure 4: The Folktables Employment Task with data from the state of Alabama.
Figure 5: The Folktables Health Task with data from the state of Ohio.
Figure 6: The Folktables Health Task with data from the state of Wisconsin.
Figure 7: The Folktables Income Task with data from the state of Illinois.
Figure 8: The Folktables Income Task with data from the state of Washington.
Figure 9: The Folktables Mobility Task with data from the state of New Jersey.
Figure 10: The Folktables Mobility Task with data from the state of New York.
Figure 11: To understand the influence of LSBoost on our predictions, we view it through histograms of LSBoost's estimates of the labels (blue) and K-means bins constructed for the scores of the functions (orange), for each model type on the Health in Wisconsin and Income in Washington tasks (examples of the empirical results guiding us to test the effect of these interventions).

Appendix C KMAcc: Conditions for One-Step Sufficiency

A one-step update using the witness function in KMAcc leads to a $0$-multiaccurate predictor for all functions $c\in\mathcal{F}_k$ in the RKHS for the linear kernel, and for very specific constructions of nonlinear kernels. Running KMAcc as an iterative procedure is redundant under these restricted settings. This is a property of RKHSs that follows from the Riesz representation theorem [16].

 Remark 20.

To gain intuition on why the multiaccuracy error may be zero after a one-step update, we can observe an analogous result in the Euclidean space $\mathbb{R}^d$. Given a linear subspace $\mathcal{C}$, a prediction vector $\mathbf{f}$, and true labels $\mathbf{y}$, multiaccuracy can be similarly defined as

$\max_{c\in\mathcal{C},\,\|c\|\le 1} c^T(\mathbf{y}-\mathbf{f}).$

The error of a classifier, $\mathbf{e}=(\mathbf{y}-\mathbf{f})$, can be decomposed into two components: the projection of $\mathbf{e}$ onto the subspace $\mathcal{C}$ and the residual. Hence, we have $\mathbf{e}=\mathbf{e}_{\mathcal{C}}+\mathbf{e}_R$. Then, once we subtract away $\mathbf{e}_{\mathcal{C}}$, the multiaccuracy error boils down to the dot product $c^T\mathbf{e}_R$, which equals $0$. ∎
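A small numerical illustration of Remark 20 (assuming NumPy; the subspace and vectors are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
C_basis = rng.normal(size=(10, 3))        # subspace C spanned by 3 directions in R^10
y, f = rng.normal(size=10), rng.normal(size=10)
e = y - f                                 # classifier error
P = C_basis @ np.linalg.pinv(C_basis)     # orthogonal projector onto C
f_corrected = f + P @ e                   # remove the component of the error lying in C
# Every c in C is now uncorrelated with the remaining error (up to float precision):
print(np.abs(C_basis.T @ (y - f_corrected)).max())
```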

Next, we proceed with an RKHS $\mathcal{H}_k\subseteq L^1(\mathcal{X})$. We are given a base predictor $f:\mathcal{X}\to[0,1]$ and the kernel function $k$ associated with the RKHS $\mathcal{H}_k$. Again, $L^1(\mathcal{X})$ denotes the space of real-valued functions that are integrable against $P_{\boldsymbol{X}}$, i.e., $L^1(\mathcal{X})\triangleq\{c:\mathcal{X}\to\mathbb{R}:\mathbb{E}[|c(\boldsymbol{X})|]<\infty\}$. Let the multiaccuracy error of a single function $c\in\mathcal{H}_k$ be $L(c,f)=\mathbb{E}[c(X)(Y-f(X))]$. By the reproducing property of $\mathcal{H}_k$, we have $c(x)=\langle c,k(\cdot,x)\rangle_{\mathcal{H}_k}$ for all $c\in\mathcal{H}_k$, $x\in\mathcal{X}$. We can thus rewrite the multiaccuracy error as follows:

$L(c,f) = \mathbb{E}\big[\langle c,k(\cdot,X)\rangle_{\mathcal{H}_k}\,(Y-f(X))\big]$
$= \big\langle c,\ \mathbb{E}[(Y-f(X))\,k(\cdot,X)]\big\rangle_{\mathcal{H}_k}$
$= \langle c,h\rangle_{\mathcal{H}_k},$

where $h(x)=\mathbb{E}[(Y-f(X))\,k(x,X)]$. For the second step, since $\mathcal{H}_k\subseteq L^1$, we can invoke Fubini's theorem to interchange expectation and inner product. Under the integrability assumption, $h\in\mathcal{H}_k$. By the Riesz representation theorem, the linear functional $L(\cdot,f)$ has a unique representer in $\mathcal{H}_k$, which is the function $h(x)$ defined above. Specifically, there exists a unique $c^\star_{k,f}\in\mathcal{H}_k$ such that for all $c\in\mathcal{H}_k$,

$L(c,f)=\langle c,h\rangle_{\mathcal{H}_k},$

where the function $c^\star_{k,f}$ is defined as the normalized direction of $h$, i.e., $c^\star_{k,f}=\theta h$ with $\theta=1/\|h\|_{\mathcal{H}_k}$. This is identical to $c^\star_{k,f}(\boldsymbol{x})$ as defined in (25). From Proposition 11, $c^\star_{k,f}$ achieves the supremum over $\mathcal{F}_k$: $c^\star_{k,f}=\arg\sup_{c\in\mathcal{H}_k,\,\|c\|\le 1}L(c,f)$. This simplifies to $L(c^\star_{k,f},f)=\theta\langle h,h\rangle_{\mathcal{H}_k}=\theta\|h\|_{\mathcal{H}_k}^2=\|h\|_{\mathcal{H}_k}$, where we substitute $\theta=1/\|h\|_{\mathcal{H}_k}$. Let the updated predictor be $g(x)=f(x)+\lambda c^\star_{k,f}(x)$.

The multiaccuracy error after the one-step update is given by:

$\mathbb{E}[c(X)(Y-g(X))] = \mathbb{E}\big[c(X)\big(Y-(f(X)+\lambda c^\star_{k,f}(X))\big)\big]$
$= \mathbb{E}[c(X)(Y-f(X))]-\lambda\,\mathbb{E}[c(X)\,c^\star_{k,f}(X)]$
$= \mathbb{E}[c(X)(Y-f(X))]-\lambda\,\mathbb{E}[c(X)\,\theta h(X)]$
$= L(c,f)-(\lambda\theta)\,\mathbb{E}[c(X)h(X)].$

For the linear kernel (as observed in the remark), we operate in Euclidean space and $L(c,f)=\mathbb{E}[c(X)h(X)]$. Hence, taking $\lambda=1/\theta$, we have

$\mathbb{E}[c(X)(Y-g(X))]=L(c,f)-(\lambda\theta)\,L(c,f)=0.$

For nonlinear kernels, $L(c,f)\ne\mathbb{E}[c(X)h(X)]$ in general, and equality holds only when

$\mathbb{E}_{X}\big[k(\cdot,X)\,k(X,X')\big]=\kappa\,k(\cdot,X'),$

where $\kappa=\mathbb{E}_{X}[k(X,X')]$ is a scalar constant.

To see this, we simplify $\mathbb{E}[c(X)h(X)]$ in the kernel space:

$\mathbb{E}[c(X)h(X)] = \mathbb{E}_{X}\big[c(X)\,\mathbb{E}_{X'}[(Y'-f(X'))\,k(X,X')]\big]$
$= \mathbb{E}_{X,X'}\big[c(X)(Y'-f(X'))\,k(X,X')\big]$
$= \mathbb{E}_{X,X'}\big[\langle c,k(\cdot,X)\rangle_{\mathcal{H}_k}(Y'-f(X'))\,k(X,X')\big]$
$= \big\langle c,\ \mathbb{E}_{X,X'}\big[(Y'-f(X'))\,k(\cdot,X)\,k(X,X')\big]\big\rangle_{\mathcal{H}_k}$
$= \big\langle c,\ \mathbb{E}_{X'}\big[(Y'-f(X'))\,\mathbb{E}_{X}[k(\cdot,X)\,k(X,X')]\big]\big\rangle_{\mathcal{H}_k}.$

In the first equality, we substitute in the definition of $h(x)$, with $(X',Y')$ an independent copy of $(X,Y)$. In the second equality, we apply Fubini's theorem to combine the iterated expectations into a joint expectation. In the third equality, we apply the reproducing property $c(X)=\langle c,k(\cdot,X)\rangle_{\mathcal{H}_k}$. In the fourth equality, we interchange the expectation and the inner product, again by Fubini's theorem under the integrability conditions. In the last equality, we expand into iterated expectations.