Abstract 1 Introduction 2 Preliminaries 3 Smoothed Predictions for the Batch Setting 4 Smoothed Predictions for the Online Setting 5 Discussion References Appendix A Missing Proof in Section 3 Appendix B Missing Proof in Section 4

Smooth Calibration and Decision Making

Jason Hartline ORCID Department of Computer Science, Northwestern University, Evanston, IL, USA Yifan Wu ORCID Department of Computer Science, Northwestern University, Evanston, IL, USA Yunran Yang ORCID Zhiyuan College, Shanghai Jiao Tong University, China
Abstract

Calibration requires predictor outputs to be consistent with their Bayesian posteriors. For machine learning predictors that do not distinguish between small perturbations, calibration errors are continuous in predictions, e.g. smooth calibration error [9], distance to calibration [2]. On the contrary, decision-makers who use predictions make optimal decisions discontinuously in probabilistic space, experiencing loss from miscalibration discontinuously. Calibration errors for decision-making are thus discontinuous, e.g., Expected Calibration Error [10], and Calibration Decision Loss [19]. Thus, predictors with a low calibration error for machine learning may suffer a high calibration error for decision-making, i.e. they may not be trustworthy for decision-makers optimizing assuming their predictions are correct. It is natural to ask if post-processing a predictor with a low calibration error for machine learning is without loss to achieve a low calibration error for decision-making. In our paper, we show post-processing an online predictor with ϵ distance to calibration achieves O(ϵ) ECE and CDL, which is asymptotically optimal. The post-processing algorithm adds noise to make predictions differentially private. The optimal bound from low distance to calibration predictors from post-processing is non-optimal compared with existing online calibration algorithms that directly optimize for ECE and CDL.

Keywords and phrases:
Calibration, calibration errors, decision making, differential privacy
Copyright and License:
[Uncaptioned image] © Jason Hartline, Yifan Wu, and Yunran Yang; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Machine learning theory
; Theory of computation Algorithmic game theory and mechanism design ; Theory of computation Online learning algorithms
Related Version:
Full Version: https://arxiv.org/abs/2504.15582
Editors:
Mark Bun

1 Introduction

Calibration requires that predictions are empirically conditionally unbiased. Consider a sequence of predictions for the chance of rain, a predictor is calibrated if, for every p, among the days that the prediction is p, the fraction of rainy days is also p. Calibrated predictions can thus be reliably interpreted as probabilities.

Calibration errors quantify the error of a predictor from being perfectly calibrated. Machine learning (ML) predictors make predictions continuously in probabilistic space, so calibration errors for ML are continuous in prediction values and do not distinguish between small perturbations in predictions. Two canonical examples are the smooth calibration error [9] and the distance to calibration (DTC) [2]. As an illustrating example of the calibration errors for ML, consider a predictor in Table 1.

Table 1: A miscalibrated predictor for the chance of rain.
Prediction value # days conditional frequency of rain
50.01% half of the days 0
49.99% half of the days 1

Although the predictions of 50.01% and 49.99% are biased, the total number of rainy days is 50%, indicating the predictor is very close to a calibrated predictor that always outputs 50%. Both DTC and the smooth calibration error are about 0.01%, close to 0. The smooth calibration error combines the bias over all the days by weighing biases continuously, e.g. weighing bias (50.01%0) by 0.01%, (49.99%1) by 0.01%, and summing together (the weights are Lipschitz continuous in prediction values). The smooth calibration error is linearly related to DTC, which calculates the expected 1 distance between the predictor and the nearest calibrated predictor, which in this example predicts 50% every day.

Decision-makers make decisions discontinuously in probabilistic space, thus, a calibration error for decision-making is discontinuous in the prediction space. For example, consider a decision problem with binary action space, bringing an umbrella or not. The decision maker receives a payoff of 1 when the decision matches the state, i.e. bringing an umbrella when rainy, not bringing when not rainy, and a payoff of 0 in other cases. When assisted by a prediction, the action of a decision-maker changes from not bringing an umbrella to bringing an umbrella at the prediction threshold of 50%. Two examples of calibration errors for decision-making are Expected Calibration Error (ECE) [10] and Calibration Decision Loss (CDL) [19]. CDL quantifies the worst-case decision loss of a decision-maker who trusts the prediction as a probability, where the worst-case is taken over all payoff-bounded decision tasks. By definition, CDL upperbounds any decision-maker’s loss. ECE, the most well-studied calibration error metric, is defined by the averaged absolute bias in predictions. For example, ECE averages over |50.01%0| and |49.99%1| for the predictor in Table 1 and has a calibration error of 50.01%. Kleinberg et al. [23] shows that ECE linearly upperbounds the decision loss of every payoff-bounded decision task, implying an upperbound of CDL.

From the decision-making perspective, having a low calibration error for ML, however, does not guarantee a low calibration error for decision-making or being trustworthy for decision-making. Consider the same example of a predictor in Table 1 and the umbrella decision problem above. According to a calibration error for ML, e.g. distance to calibration, the predictor is 0.01% close to a calibrated predictor that always outputs 50%. However, to the decision-maker, the prediction suggests not taking an umbrella when the weather is rainy, and taking an umbrella when not rainy. This non-trustworthiness comes from the discontinuity of decision-making which the decision-maker changes an action at the threshold 50%.

Here is the natural question: can we design a post-processing algorithm that, given any predictor with a low calibration error for machine learning, outputs predictions with a low calibration error for decision-making? Ideally, the post-processing algorithm should achieve near-optimal guarantees that asymptotically match the guarantees from directly optimizing for decision-making.

Our paper designs a post-processing algorithm that, given any predictor with DTC=ϵ, outputs differentially private predictions with ECE and CDL bounded by O(ϵ), in both the batch setting and the online setting. We give lower bounds, described below, for both that online and batch setting, that show that this post-processing algorithm is asymptotically optimal. Additionally the online lower bounds shows that the optimal predictors for decision makers cannot be constructed from optimal predictors from machine learning.

We show that the privacy-based post-processing algorithm is asymptotically optimal in the online setting. This optimality implies there does not exist a post-processing algorithm that achieves the same guarantee as known online algorithms that directly optimize predictions for ECE and CDL. For online calibration, there has been shown an O(T13c) (c>0) upperbound on optimal algorithm for ECE [5], a O~(T12) optimal bound to CDL [19], and an Ω(T23) lowerbound to DTC [27]. Thus, applying the lowerbound of Ω(ϵ), any post-processing algorithm can only achieve the non-optimal Ω(T13) ECE and CDL.

We show that the privacy-based post-processing algorithm is asymptotically optimal in the batched setting in two models. The first model considers post-processing algorithms applied individually to each prediction, and the same guarantee and lowerbound to ECE and CDL applies as in the online setting. The second model allows algorithms that post-process the entire batch of predictions. However, doing so just to attain calibration is too easy: simply ignoring the individual information in each prediction and averaging them all will be close to calibrated. Thus, we impose a stronger benchmark that measures the worst-case decision loss relative to a nearby – in the sense of ϵ Distance to Calibration – calibrated predictor. This worst case is taken over all such nearby calibrated predictors and all bounded decision problems. We show that the privacy-based post-processing algorithm achieves O(ϵ) decision loss and that this result is tight, i.e. no other post-processing algorithm achieves asymptotically better decision loss.

1.1 Related Work

Calibration Error Metrics

The most relevant work to ours, Błasiok and Nakkiran [4], introduces the error metric Smooth ECE, which, given a predictor, calculates the ECE with Gaussian noise added to the predictions. For any predictor with DTC=ϵ, smooth ECE is shown to be bounded by Θ(ϵ). Instead, our paper focuses on the decision-making perspective of calibration. We show that this bound of Θ(ϵ) is tight, suggesting that from a decision-making perspective, optimizing for DTC and post-processing achieves suboptimal guarantees. Our post-processing algorithm also generalizes the result of Błasiok and Nakkiran [4] by considering noise distributions for differential privacy.

As introduced previously, existing calibration error metrics mainly focus on two aspects: calibration errors for machine learning, continuous in predictions, e.g. smooth calibration error [9], distance to calibration111We follow Qiao and Zheng [27] and refer to distance to calibration as the lower distance to calibration in Błasiok et al. [2]. [2], smooth ECE [20]; and calibration errors for decision-making, e.g. the canonical ECE [10] and the Calibration Decision Loss [19]. Recently, as an orthogonal property to continuity and decision-making, Haghtalab et al. [16] propose an approximately truthful calibration error metric for an expected-error-minimizing sequential predictor.

Online Calibration

In online calibration, the predictor repeatedly interacts with an adversary selecting a binary state. In each round, both the predictor and the adversary know the history of predictions and states, but are not allowed to strategize conditioned on the opponent’s action in the current round. Foster and Vohra [11] showed an upperbound of ECE=O(T13), which is recently proven to be polynomial-time achievable by Noarov et al. [25]. Recently, Dagan et al. [7] improves the upperbound to O(T13c) for some constant c>0. On the lowerbound side, Qiao and Valiant [26] showed there exists an O(T0.472) lowerbound, strictly above O~(1T), which is improved to O(T0.456) by Dagan et al. [6].

For linearly related smooth calibration error and DTC, Qiao and Zheng [27] prove an O(1T) upperbound and an O(T23) lowerbound. Arunachaleswaran et al. [1] design a simple polynomial-time algorithm that achieves DTC=O(1T).

The Calibration Decision Loss (CDL) is introduced in Hu and Wu [19] with a bound of O~(1T), tight up to a logarithmic factor.

Omniprediction

Our definition of decision loss for the batch setting can be equivalently formulated as achieving omniprediction with regard to reference predictors and a set of loss functions. Calibration guarantees the trustworthiness of predictions by every decision-maker, allowing decision-making to be separated from predictions. Introduced in Gopalan et al. [13], omnipredictor follows the same idea, requiring an omnipredictor to achieve comparable guarantee with regard to a class of loss functions and a set of competing predictors. Techniques from the algorithmic fairness literature, e.g. Hebert-Johnson et al.; Kim et al. [17, 22], have been applied to achieve omniprediction in both online and batch setting [13, 15, 14, 12, 18]. While the classical guarantee usually learns an omnipredictor that competes with the hypothesis space of predictors, our decision loss evaluates a predictor with regard to the set of calibrated predictors close in DTC.

2 Preliminaries

Mathematical Notations

We write DX,Y as the joint distribution between random variables X and Y, and XD as random variable X drawn from distribution D. Where it is obvious from the context, we write Pr[X=x] for the probability of a discrete random variable as well as the probabilistic density function of a continuous random variable.

We consider a prediction problem of a binary state θΘ={0,1}. A predictor is specified by a joint distribution DP,Θ over the prediction p and the state θ. Slightly abusing the notation, we also write a predictor as a random variable P, omitting the state, where a realized prediction value is p.

Our privacy-based post-processing algorithm adds noise to make predictions differentially private. Definition 1 defines a differentially private mechanism for predictions.

Definition 1 (Differential Privacy).

A mechanism is (γ,δ)-differentially private (DP) if for any two predictions q,q[0,1]:

Pr[(q)]eγ|qq|Pr[(q)]+δ.

We construct our privacy-based algorithm by adding truncated noise, where truncation guarantees predictions fall in the range of [0,1]. The truncation of noise Y works in the following way: given a prediction q, for random variable Y with unbounded support, we draw XDϵ(q) such that

Pr[q+X=p]=Pr[q+Y=p]Pr[q+Y[0,1]].

2.1 Predictions for Decision-Making

A decision maker faces a decision problem (A,Θ,U):

  • the decision maker (DM) selects an action aA;

  • a payoff-relevant state θΘ is realized;

  • DM receives payoff U:A×Θ.

When assisted with a prediction, the best response a maps a prediction to the action:

a(p)=argmaxaA𝐄θp[U(a,θ)]. (1)

When the DM best responds, she assumes the state is drawn from p and takes the action that maximizes the expected payoff. We define the best-responding payoff as a function S of the prediction and the state.

Definition 2 (Scoring Rule from Decision).

Given a decision problem U, the scoring rule induced from U and belief pΔ(Θ) is

SU(p,θ)=U(a(p),θ)

Proper scoring rules characterize scoring rules induced from a decision problem.

Definition 3 (Proper Score).

A scoring rule S:[0,1]×{0,1} is proper if and only if

𝐄θp[S(p,θ)]𝐄θp[S(p,θ)],p[0,1].

Claim 4 shows that the space of best-responding payoff is equivalent to the space of proper scoring rules. Throughout the paper, we will write the best-responding decision payoff as proper scoring rules.

Claim 4 (Kleinberg et al. [23]; Hu and Wu [19]).

There exists a bijective mapping between a bounded proper scoring rules and scoring rule induced from a decision problem with bounded payoff:

  • Given a decision problem U(,)[0,1], the induced scoring rule SU(,)[0,1] is a proper scoring rule.

  • Given a proper scoring rule S(,)[0,1], there exists a decision problem U(,)[0,1] that induces S.

Given a set of reference predictors and a set of proper scoring rules 𝒮, we define the decision loss with regard to the set of reference predictors.

Definition 5 (Decision Loss).

Given a set of reference predictors and a set of proper scoring rules 𝒮, the decision loss of a predictor P is

DL(P;)=maxS𝒮,B𝐄p,b,θDP,B,Θ[S(b,θ)S(p,θ)].

Throughout the paper, we consider decision loss with regard to the set of all bounded proper scoring rules 𝒮={S(,)[0,1]}, i.e. all decision problems with bounded payoff.

Our decision loss is closely related to omniprediction. A predictor with ϵ decision loss is an ϵ omnipredictor with regard to reference predictors in and the set of scoring rules in 𝒮.

Definition 6 (Omniprediction [13]).

Given a set of reference predictors and a set of proper scoring rules 𝒮, a predictor is an ϵ-omnipredictor with regard to and 𝒮 if

𝐄(p,θ)DP,Θ[S(p,θ)]𝐄(b,θ)DB,Θ[S(b,θ)]ϵ, B,S𝒮.

2.2 Measures of Calibration Error

In this section, we define different calibration error metrics that are relevant to the paper. The definitions of error metrics follow the definitions of perfect calibration. We denote the Bayesian posterior of prediction values as p^=Pr[θ=1|P=p].

Definition 7 (Perfect Calibration).

A predictor P is perfectly calibrated if p=p^ for any p[0,1].

We introduce relevant calibration errors to the paper by two categories: calibration error for decision-making and calibration error for machine learning.

2.2.1 Calibration Errors for Decision-Making

The canonical calibration error metric is ECE, the averaged bias in predictions.

Definition 8 (Expected Calibration Error, ECE).

Given predictor P, the expected calibration error is

ECE(P)=𝐄pP[|pp^|].

The swap regret of a decision-maker is closely related to predictions being calibrated. Swap regret minimizers are special cases of omnipredictors where the set of reference predictors is the set of post-processed predictors, i.e. by applying a mapping σ:[0,1][0,1] to the orginal predictions p.

Definition 9 (Swap Regret).

Given a decision problem with proper scoring rule S, a predictor P, the swap regret for the decision-maker is

SwapS(P)=maxσ:[0,1][0,1]𝐄(p,θ)Dp,θ[S(σ(p),θ)S(p,θ)].

Proposition 10 shows that the swap regret equals the payoff improvement from calibrating a predictor.

Proposition 10.

Given a decision problem with proper scoring rule S, a predictor P, the mapping σ(p)=Pr[θ|p] is the swap regret maximizing mapping, i.e.

σargmaxσ:[0,1][0,1]𝐄(p,θ)Dp,θ[S(σ(p),θ)S(p,θ)].

the swap regret equals the payoff improvement from calibrating the predictor: σ(p)=Pr[θ=1|p].

Proposition 11 thus follows by the definition of calibration.

Proposition 11 (Foster and Vohra [11]).

A predictor is calibrated if and only if for any decision problem U, the decision-maker has no swap regret.

Motivated by Proposition 11, Hu and Wu [19] define Calibration Decision Loss (CDL), the worst-case decision loss induced by miscalibration, with worst-case over all bounded proper scoring rules. Instead of decision loss that compares with a set of fixed reference predictor, CDL calculates the decision loss where the reference is the calibrated correspondence of the predictor to be evaluated.

Definition 12 (Calibration Decision Loss, CDL).

For predictor P, the Calibration Decision Loss is defined as

CDL(P)=maxproper S(,)[0,1]SwapS(P),

where the maximum is taken over all bounded proper scoring rules.

Kleinberg et al. [23] shows that CDL is upperbounded by ECE.

Lemma 13 (Kleinberg et al. [23]).

For predictor P, CDL is upperbounded by ECE, i.e. CDL(P)ECE(P).

2.2.2 Calibration Errors for Machine Learning

The calibration errors in this section are continuos in prediction space. The calibration error metrics relevant to the paper are the smooth calibration error and the distance to calibration (DTC).

Definition 14 (Smooth Calibration Error).

Given predictor P, the smooth calibration error takes the supremum over the set Σ of 1-Lipschitz functions:

smCal(P)=supσΣ𝐄(p,θ)DP,Θ[σ(p)(pθ)]

Note that without the constraint that σ is 1-Lipschitz, smCal is the same as ECE. To see this, note that taking σ=1 when pp^0 and σ=1 when pp^<0 gives the same definition as ECE.

Given a predictor p, the distance to calibration finds a calibrated predictor b with a coupling Dp,b,θ with the given predictor p, such that b has the smallest distance from p. The distance is the 1 distance between predictions, as defined in Definition 15.

Definition 15 (Distance between Predictors).

Given predictors B and P, the distance between the predictors is defined as

Dist(P,B)=𝐄DP,B[|PB|].

Definition 16 defines the distance to calibration.

Definition 16 (Distance to Calibration, DTC).

For predictor P, the distance to calibration is

DTC(P)=minB is calibratedDist(P,B).

The smooth calibration error is linearly related to the distance to calibration. While in our paper, we mainly focus on DTC, our results also apply to the smooth calibration error smCal.

Lemma 17 (Błasiok et al. [2]).

Given any predictor P,

12DTC(P)smCal(P)DTC(P).

2.3 Online and Batch Post-Processing Algorithm

We design a post-processing algorithm for both the online setting and the batch setting, given predictions q1qT from a predictor Q. The post-processing algorithm knows the parameter DTC(Q)=ϵ.

The Online Setting

In the online setting, the goal of a post-processing algorithm is to generate trustworthy predictions 𝒑=(p1,,pT) with low ECE or CDL given a sequence of predictions with low DTC. At the end of T rounds, the predictor is evaluated by a calibration error against the sequence of states 𝜽=(θ1,,θT). We define the joint distribution of DP,Θ in definitions in Section 2.2 as the empirical distribution of (pt,θt) over T rounds, which gives equivalent definitions of online calibration errors in the literature. We will write the calibration error of online predictors as a function of 𝒑 and 𝜽.

In round t[T], the adversary selects a prediction qt. The post-processing algorithm f=(ft)t[T] makes a (randomized) prediction according to ft given qt and the history of (qk,pk)k[t1] but not the states222The algorithm in our paper only depends on qt. This dependence on history only reinforces the definition.. The adversary then reveals the state θt. The adversary knows the full history of interactions, i.e. (qk,pk,θk)k[t1]. When selecting the prediction qt, the adversary faces the constraint that DTC(Q)=ϵ at the end of T rounds.

Note that the restriction of the algorithm not knowing the state is slightly different from the classic online calibration [11]. This restriction effectively excludes a post-processing algorithm that ignores the predictions q and directly implements a calibrated predictor.

The Batch Setting

In the batch setting, the predictor Q is specified by the joint distribution DQ,Θ as introduced in the beginning of Section 2. We write QT as the joint distribution of T independent and identical draws of predictions from Q. Given T realizations of predictions 𝒒=(q1,,qT)QT, the post-processing algorithm f:[0,1]TΔ([0,1]T) outputs (randomized) predictions 𝒑=(p1,,pT). Since f is only allowed to depend on predictions 𝒒 not the states, it is without loss to write f𝒒(q):[0,1]Δ([0,1]), assuming the output follows the same distribution fixing samples 𝒒. Then the states 𝜽=(θ1,,θT) is realized. In addition to the calibration errors as defined in Section 2.2, the algorithm is evaluated by the performance for omniprediction as in Definition 6, where the set of reference predictors is the set of predictors with low DTC to Q.

3 Smoothed Predictions for the Batch Setting

In this section, we will focus on post-processing in the batch setting where q is stochastically generated. Given a prediction qQ, our privacy-based post-processing algorithm simply adds noise to q. We write the resulting predictor as P, with randomness from both Q and the privacy-based algorithm . Note that in the batch setting where predictions and states are stochastically drawn, the privacy-based post-processing algorithm optimizes for the expected error, where the expectation is taken with randomness from both the prediction, the state, and the post-processing algorithm.

  • Privacy-Based Post-Processing Algorithm

  • Input: prediction qQ, parameter ϵ such that DTC(Q)ϵ, DP mechanism .

  • Output: Prediction p(q)

Theorem 18 characterizes the decision loss of P with regard to all proper scoring rules and all predictors that are ϵ close to Q.

Theorem 18.

Suppose mechanism is (γ,δ)-differentially private, then the output predictor P has at most C decision loss with regard to all proper scoring rules 𝒮 and the set of calibrated predictors such that any B is ϵ-close to Q, i.e. Dist(Q,B)ϵ. The bound C is the following

C2maxq[0,1]𝐄[|(q)q|]+4(1eγϵ+δ).

Moreover, ECE of P has the same bound.

We prove Theorem 18 following the idea of the Follow-The-Perturbed-Leader Algorithm [21]. We apply the same privacy-based post-processing algorithm to any calibrated predictor B that is ϵ close to Q, which constructs a hypothetical predictor R as an intermediate connecting B and the post-processed predictor P=(Q). Theorem 18 follows from combining Lemma 19 and Lemma 20, where Lemma 19 bounds the decision loss from B to R, and Lemma 20 characterizes the decision loss from R to P via DP mechanism .

Lemma 19.

For any calibrated predictor B, we write R as the resulting predictor with the post-privacy-based processing algorithm applied to B. For any bounded proper scoring rule S(,)[0,1], the loss of R is bounded,

DL(R)2maxq[0,1]𝐄[|(q)q|].

The same bound holds for ECE.

ECE(R)maxq[0,1]𝐄[|(q)q|].
Lemma 20.

Suppose mechanism satisfies (γ,δ)-differentially privacy. We write R as the resulting predictor with the privacy-based post-processing algorithm applied to calibrated predictor B with Dist(Q,B)ϵ. The decision loss from R to P is bounded by

𝐄(p,θ)DP,Θ[S(p,θ)]𝐄(r,θ)DR,Θ[S(r,θ)]4(1eγϵ+δ).

A similar bound holds for ECE:

ECE(P)ECE(R)+4(1eγϵ+δ).

Lemma 21 shows the guarantee obtainable from some choices of the differentially private mechanism by adding noise Dϵ. We construct the noise by truncating the distribution with unbounded support into the feasible range of predictions. The parameters of (γ,δ) are standard for Laplace and Gaussian noise [8].

Lemma 21.

We consider two truncated noises that induce differential privacy.

Truncated Laplace

Noise variable X from a truncated Laplace distribution with parameters (0,1lnτ) is (2lnτ,0)-differentially private. The expectation of the bias induced by noise is bounded: 𝐄[|X|]1lnττ1τ. Combining the bounds and taking τ=exp(12ϵ), we have C=Θ(ϵ), the decision loss of the predictor is bounded by C, and ECEC.

Truncated Gaussian

Consider noise variable X from a truncated Gaussian distribution 𝒩(0,2ϵln(1.25ϵ)). The truncated noise has

𝐄[|X|]σ=2ϵln(1.25ϵ),

and is (γ,δ)-differentially private with δ=ϵ and 1eγϵ2ϵ . Combining the bounds and taking C=Θ(ϵln(1ϵ)), the decision loss of the predictor is bounded by C, and ECEC.333Section A.4 shows an improved O(ϵ) bound for truncated Gaussian noise without the log factor. Note that we obtain Lemma 21 by bounding the TV-distance between the DP-mechanism output of adjacent predictions. Our improved bound directly analyzes this TV-distance rather than using the (γ,δ) parameters of differential privacy.

Theorem 22 shows that, there exists a predictor with DTC=ϵ, such that no post-processing algorithm can achieve a worst-case decision loss better than ϵ2. Our guarantee of decision loss in Theorem 18 is asymptotically optimal.

Theorem 22 (Post-processing Lowerbound for Batch Decision Loss).

There exists a predictor Q, with DTC(Q)=ϵ and a reference calibrated predictor BargminBDist(B,Q), such that for any post-processing algorithm that depends on the sequence 𝐪 of predictions, f𝐪(q):[0,1]Δ([0,1]), f𝐪(Q) suffers a ϵ2 decision loss from B, i.e.

f,S(,)[0,1],𝐄f,(p,θ)DP,Θ[S(f𝒒(q),θ)]𝐄(b,θ)DB,Θ[S(b,θ)]ϵ2.

As the main idea of the proof of lowerbound, any post-processing algorithm that does not depend on the state achieves a score at most by outputting the Bayesian posterior of predictor Q. We construct a predictor Q with a calibrated reference predictor B that is more informative than Q. By definition of DTC that specifies a coupling between B, Q, and the state θ, a reference calibrated predictor B may correlate with the state θ when conditioned on Q. Thus, for this predictor Q and any post-processing algorithm f, f(Q) achieves a lower score than B.

If the post-processing algorithm is a function of only the prediction Q but not the prediction sequence 𝒒, our bounds are asymptotically optimal.

Corollary 23.

For any post-processing algorithm that depends only on the current prediction, f(q):[0,1]Δ([0,1]), there exists a predictor Q with DTC(Q)=ϵ and a reference calibrated predictor BargminBDist(B,Q), such that f(Q) has

ECE(Q)=Θ(ϵ)andCDL(Q)=Θ(ϵ).

Corollary 23 is a corollary from Theorem 27 which we will introduce later.

4 Smoothed Predictions for the Online Setting

To achieve guarantees for the online setting where the predictions 𝒒 and the states 𝜽 are adversarially selected, the algorithm outputs are discretized for the empirical distribution to be meaningful. We prove empirical guarantees of the post-processing algorithm.

  • Input: predictions qt, parameter ϵ such that DTC(𝒒,𝜽)ϵ, DP mechanism .

  • Discretize the space of predictions into T13 prediction values in {iT13i[ϵT]}.

  • Draw p(q).

  • Find i such that p[iT13,i+1T13].

  • Output: p=iT13.

By Theorem 18, the online privacy-based post-processing algorithm achieves the same bound for ECE up to a discretization error.

Theorem 24.

Suppose mechanism is (γ,δ)-differentially private. The output predictor 𝐩 satisfies

𝐄[ECE(𝒑;𝜽)]maxq[0,1]𝐄[|(q)q|]+4(1eγϵ+δ)+2T13.

By Lemma 13, the same bound holds for CDL

Corollary 25.

Suppose mechanism is (γ,δ)-differentially private. The output predictor 𝐩 satisfies

𝐄[CDL(𝒑;𝜽)]2maxq[0,1]𝐄[|(q)q|]+8(1eγϵ+δ)+2T13.

By Lemma 21, we obtain the guarantees for ECE and CDL in the online setting.

Lemma 26.

With truncated Laplace noise, the privacy-based post-processing algorithm for online calibration achieves CDL2ECE=O(ϵ)+2T13. With truncated Gaussian noise, the privacy-based post-processing algorithm achieves CDL2ECE=O(ϵln1ϵ)+2T13.

Arunachaleswaran et al. [1] provides an online DTC minimization algorithm that achieves DTC=O(1T). Plugging into Lemma 26, the post-processing algorithm achieves ECE=O(T14) with truncated Laplace noise and ECE=O(T14lnT) with truncated Gaussian noise.

Theorem 27 shows that there exist two sequences of predictions 𝒒,𝒒 and corresponding state realizations, such that both sequence has DTC=ϵ. However, no post-processing algorithm can guarantee ECE<Θ(ϵ) or CDL<Θ(ϵ) for both sequences. Theorem 27 shows the online post-processing algorithm is asymptotically optimal for ECE as well as for CDL.

Theorem 27 (Post-processing Lowerbound for Online ECE).

For any post-processing algorithm f=(f1,,fT) where ft depends on the prediction history (q1,,qt) and (p1,,pt1) before round t, there exists two sequences of predictions 𝐪 and 𝐪 with states 𝛉 and 𝛉, respectively, both satisfying DTC(𝐪)=DTC(𝐪)=ϵ, such that

max{𝐄[ECE(𝒑;𝜽)],𝐄[ECE(𝒑;𝜽)]}18ϵ+12ϵ=Θ(ϵ),

where we write 𝐩,𝐩 as the output of the post-processing algorithm f on 𝐪,𝐪, respectively.

Moreover, the same lowerbound holds for CDL.

The lowerbound for CDL is perhaps surprising because Hu and Wu [19] shows a O~(1T) optimal bound for CDL, indicating ECE overestimates CDL when there exists a ω(1T) lowerbound for ECE [26]. We expected the same observation for the post-processing bound, which turns out not to be true. Considering the ϵ=Ω(T23) lowerbound for DTC [27], the post-processing bound of O(ϵ)+2T13 is asymptotically optimal.

As an immediate corollary of our proof, even if the decision-makers are allowed to use different post-processing algorithms such as the differentially private exponential mechanism [24], there exists a worst-case decision-maker with a swap regret of Θ(ϵ).

Corollary 28.

There exists one decision-maker with proper scoring rule S such that for any post-processing algorithm f=(f1,,fT) where ft depends on the prediction history (q1,,qt) and (p1,,pt1) before round t, there exists two sequences of predictions 𝐪 and 𝐪 with states 𝛉 and 𝛉, respectively, both satisfying DTC(𝐪)=DTC(𝐪)=ϵ, such that

max{𝐄[SwapS(f(𝒑);𝜽)],𝐄[SwapS(f(𝒑);𝜽)]}18ϵ+12ϵ=Θ(ϵ),

5 Discussion

Our lowerbound presents a gap in post-processing a predictor with a low distance to calibration from directly optimizing for calibration errors related to decision-making. However, in the examples we present, the conditional empirical frequencies are discontinuous in the prediction space, which does not match the discussion of machine learning predictors not distinguishing between small perturbations. One follow-up question is, are there properties of the predictor that, combined with a low distance to calibration, guarantee the predictor trustworthy for decision-making after post-processing?

Błasiok et al. [3] provides an answer to the question above. When the bias q^q is 1-Lipschitz continuous in the prediction q, it follows that

ECE(Q)O(DTC(Q)),

and no post-processing algorithm is needed. This result, however, suggests the same problem as suggested by our lowerbound, that a given predictor with low DTC achieves a non-optimal ECE or CDL compared to optimizing for ECE or CDL directly in the online setting. Thus, it remains a question whether there exists a property of a predictor with low distance to calibration that guarantees an optimal ECE or CDL from post-processing.

References

  • [1] Eshwar Ram Arunachaleswaran, Natalie Collina, Aaron Roth, and Mirah Shi. An elementary predictor obtaining distance to calibration. In Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1366–1370. SIAM, 2025. doi:10.1137/1.9781611978322.41.
  • [2] Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu, and Preetum Nakkiran. A unifying theory of distance from calibration. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 1727–1740, 2023.
  • [3] Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu, and Preetum Nakkiran. When does optimizing a proper loss yield calibration? In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 72071–72095. Curran Associates, Inc., 2023. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/e4165c96702bac5f4962b70f3cf2f136-Paper-Conference.pdf.
  • [4] Jaroslaw Blasiok and Preetum Nakkiran. Smooth ece: Principled reliability diagrams via kernel smoothing. In ICLR, 2024. URL: https://openreview.net/forum?id=XwiA1nDahv.
  • [5] Yuval Dagan, Constantinos Daskalakis, Maxwell Fishelson, and Noah Golowich. From external to swap regret 2.0: An efficient reduction and oblivious adversary for large action spaces. arXiv preprint arXiv:2310.19786, 2023. doi:10.48550/arXiv.2310.19786.
  • [6] Yuval Dagan, Constantinos Daskalakis, Maxwell Fishelson, Noah Golowich, Robert Kleinberg, and Princewill Okoroafor. Improved bounds for calibration via stronger sign preservation games. arXiv preprint arXiv:2406.13668, 2024. doi:10.48550/arXiv.2406.13668.
  • [7] Yuval Dagan, Constantinos Daskalakis, Maxwell Fishelson, Noah Golowich, Robert Kleinberg, and Princewill Okoroafor. Breaking the t2/3 barrier for sequential calibration, 2025.
  • [8] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014. doi:10.1561/0400000042.
  • [9] Dean P Foster and Sergiu Hart. Smooth calibration, leaky forecasts, finite recall, and nash dynamics. Games and Economic Behavior, 109:271–293, 2018. doi:10.1016/J.GEB.2017.12.022.
  • [10] Dean P Foster and Rakesh V Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21(1-2):40–55, 1997.
  • [11] Dean P Foster and Rakesh V Vohra. Asymptotic calibration. Biometrika, 85(2):379–390, 1998.
  • [12] Sumegha Garg, Christopher Jung, Omer Reingold, and Aaron Roth. Oracle efficient online multicalibration and omniprediction. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2725–2792, 2024. doi:10.1137/1.9781611977912.98.
  • [13] Parikshit Gopalan, Adam Tauman Kalai, Omer Reingold, Vatsal Sharan, and Udi Wieder. Omnipredictors. In Mark Braverman, editor, 13th Innovations in Theoretical Computer Science Conference (ITCS 2022), volume 215 of Leibniz International Proceedings in Informatics (LIPIcs), pages 79:1–79:21, Dagstuhl, Germany, 2022. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.ITCS.2022.79.
  • [14] Parikshit Gopalan, Michael Kim, and Omer Reingold. Swap agnostic learning, or characterizing omniprediction via multicalibration. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 39936–39956. Curran Associates, Inc., 2023. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/7d693203215325902ff9dbdd067a50ac-Paper-Conference.pdf.
  • [15] Parikshit Gopalan, Princewill Okoroafor, Prasad Raghavendra, Abhishek Shetty, and Mihir Singhal. Omnipredictors for regression and the approximate rank of convex functions. arXiv preprint arXiv:2401.14645, 2024. doi:10.48550/arXiv.2401.14645.
  • [16] Nika Haghtalab, Mingda Qiao, Kunhe Yang, and Eric Zhao. Truthfulness of calibration measures. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
  • [17] Ursula Hebert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (Computationally-identifiable) masses. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1939–1948. PMLR, 10–15 July 2018. URL: https://proceedings.mlr.press/v80/hebert-johnson18a.html.
  • [18] Lunjia Hu, Inbal Rachel Livni Navon, Omer Reingold, and Chutong Yang. Omnipredictors for constrained optimization. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 13497–13527. PMLR, 23–29 July 2023. URL: https://proceedings.mlr.press/v202/hu23b.html.
  • [19] Lunjia Hu and Yifan Wu. Predict to minimize swap regret for all payoff-bounded tasks (calibration error for decision making). 65th IEEE Symposium on Foundations of Computer Science (FOCS), 2024.
  • [20] Sham M. Kakade and Dean P. Foster. Deterministic calibration and nash equilibrium. Journal of Computer and System Sciences, 74(1):115–130, 2008. Learning Theory 2004. doi:10.1016/j.jcss.2007.04.017.
  • [21] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005. doi:10.1016/J.JCSS.2004.10.016.
  • [22] Michael P. Kim, Amirata Ghorbani, and James Zou. Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’19, pages 247–254, New York, NY, USA, 2019. Association for Computing Machinery. doi:10.1145/3306618.3314287.
  • [23] Bobby Kleinberg, Renato Paes Leme, Jon Schneider, and Yifeng Teng. U-calibration: Forecasting for an unknown agent. In The Thirty Sixth Annual Conference on Learning Theory, pages 5143–5145. PMLR, 2023. URL: https://proceedings.mlr.press/v195/kleinberg23a.html.
  • [24] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pages 94–103. IEEE, 2007. doi:10.1109/FOCS.2007.41.
  • [25] Georgy Noarov, Ramya Ramalingam, Aaron Roth, and Stephan Xie. High-dimensional unbiased prediction for sequential decision making. In OPT 2023: Optimization for Machine Learning, 2023. URL: https://openreview.net/forum?id=P4j4l45NUq.
  • [26] Mingda Qiao and Gregory Valiant. Stronger calibration lower bounds via sidestepping. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2021, pages 456–466, New York, NY, USA, 2021. Association for Computing Machinery. doi:10.1145/3406325.3451050.
  • [27] Mingda Qiao and Letian Zheng. On the distance from calibration in sequential prediction. arXiv preprint arXiv:2402.07458, 2024. doi:10.48550/arXiv.2402.07458.

Appendix A Missing Proof in Section 3

A.1 Proof of Lemma 19

Proof of Lemma 19.

Since S is bounded by [0,1], we know for any fixed b,

𝐄θb[S(b,θ)S(r,θ)]2|br|.

Thus,

𝐄b,r[𝐄θb[S(b,θ)S(r,θ)]]2𝐄b,r[|br|],

which proves the argument for decision loss.

Now we prove Lemma 19 for ECE. We define Y(b0)=(b0)b0. The joint probability density function of state θ and prediction value r can be expressed as

Pr[θ=1,R=r]= 01Pr[B=b]Pr[θ=1,(b)=r|B=b]db
= 01Pr[B=b]Pr[(b)=r]Pr[θ=1|B=b]db
= 01Pr[B=b]Pr[(b)=r]bdb. (2)

Equation 2 is derived given that B is a calibrated predictor.

According to the definition of ECE,

ECE(R) =𝐄R[|rPr[θ=1R=r]|]
=01Pr[R=r]|rPr[θ=1R=r]|dr
=01|rPr[R=r]Pr[θ=1,R=r]|dr
=01|01Pr[(b)=r]Pr[B=b](rb)db|dr
0101Pr[(b)=r]Pr[B=b]|rb|dbdr
=𝐄[|(b)b|].

A.2 Proof of Lemma 20

Proof of Lemma 20.

We prove the lemma by Lemma 29, bounding the TV-distance between (B) and (Q). Combining with Lemma 30, we prove Lemma 20.

Lemma 29.

We write R as the resulting predictor with post-processing algorithm applied to calibrated predictor B with Dist(Q,B)ϵ. The decision loss from R to P is bounded by

𝐄(p,θ)DP,Θ[S(p,θ)]𝐄(r,θ)DR,Θ[S(r,θ)]4𝐄(b,q)DB,Q[dTV((b),(q))].

Note that the TV distance quantifies the distance between (b) and (q).

A similar bound holds for ECE:

ECE(P)ECE(R)+4𝐄(b,q)DB,Q[dTV((b),(q))].

Lemma 29 follows from the fact that the scoring rule S is bounded in [0,1].

Proof of Lemma 29.

Since the scoring rule S is bounded in [0,1], we know for any fixed b and q,

𝐄r(b),θb[S(r,θ)]𝐄p(q),θb[S(p,θ)]
= 01(Pr[(b)=p]Pr[(q)=p])𝐄θb[S(p,θ)]dp
4dTV((b),(q)).

Thus Lemma 29 for decision loss from R to P holds.

Now we prove Lemma 29 for ECE by dividing it into three parts, the first part is

01p|Pr[P=p]Pr[R=p]|dp (3)
= 01p|0101Pr[B=b,Q=q](Pr[(q)=p]Pr[(b)=p])dbdq|dp
0101Pr[B=b,Q=q]01|Pr[(b)=p]Pr[(q)=p]|dpdbdq
= 2𝐄(b,q)DB,Q[dTV((b),(q))].

The distance between joint distribution DR,Θ and DP,Θ is

01|Pr[R=p]Pr[θ=1R=p]Pr[P=p]Pr[θ=1P=p]|dp (4)
= 01|Pr[θ=1,R=p]Pr[θ=1,P=p]|dp
0101Pr[B=b,Q=q]01|Pr[(b)=p]Pr[(q)=p]|dpdbdq
= 2𝐄(b,q)DB,Q[dTV((b),(q))].

Combine (3) and (4),

ECE(P) =01Pr[P=p]|pPr[θ=1P=p]|dp
01|Pr[R=p](pPr[θ=1R=p])|dp
+01p|Pr[P=p]Pr[R=p]|dp
+01|Pr[R=p]Pr[θ=1R=p]Pr[P=p]Pr[θ=1P=p]|dp
ECE(R)+4𝐄(b,q)DB,Q[dTV((b),(q))].

Lemma 30.

Given noise X, Y for (γ,δ)-differential privacy, X and Y are drawn from the same distribution,

𝐄(b,q)DB,Q[dTV((b),(q))]1eγϵ+δ.
Proof of Lemma 30.

For any pair of fixed (q,b), consider the set of prediction values V={pPr[(b)=p]Pr[(q)=p]0}.

By Definition 1 of differential privacy,

Pr[(b)=p]Pr[(q)=p] eγ|bq|(Pr[(q)=p]δ)Pr[(q)=p]
=(eγ|bq|1)Pr[(q)=p]δeγ|bq|.

Calculate dTV((b),q+X) using prediction values in V:

dTV((b),(q)) =V|Pr[(b)=p]Pr[(q)=p]|dp
V[(1eγ|bq|)Pr[(q)=p]+δeγ|bq|]dp
1eγ|bq|+δeγ|bq|.

Take the expectation with respect to (b,p) and get

𝐄(b,q)DB,Q[dTV((b),(q))] 𝐄(b,q)DB,Q[1(1δ)eγ|bq|]
1(1δ)eγϵ
1eγϵ+δ.

The second inequality follows from Jensen’s inequality, give that 1δ0, function eγx is convex and 𝐄(b,q)DB,Q[|bp|]=ϵ.

Similarly combining Lemma 20 and Lemma 19, post-processed predictor P is calibrated in ECE.

A.3 Proof of Lemma 21

A.3.1 Truncated Laplace Noise

Proof of Lemma 21, Truncated Laplace.

For q,p[0,1] and differentially private mechanism as adding noise from the truncated Laplace distribution,

Pr[(q)=p]=lnτ2τqτ1qτ|pq|.
Pr[(q)=p]Pr[(q)=p]=τ|pq||pq|2τqτ1q2τqτ1q.

Since |pq||pq||qq|,

τ|pq||pq|τ|qq|.

The following steps will show

2τqτ1q2τqτ1qτ|qq|.

Case 1: qq.

2τqτ1q2τqτ1qτ|qq|
τq+1(τq)2+2τqτq+τ1q2.

Since τq[1,τq], τq+1(τq)2+2τqτq+τ1q achieves its maximum value at τq, and the maximum value is 2.
Case 2: qq.

2τqτ1q2τqτ1qτ|qq|
τq(τq)2+2τqτq+τq2.

Since τq[τ,τq], τq(τq)2+2τqτq+τq achieves its maximum value at τq, and the maximum value is 2.

Therefore,

Pr[(q)=p]τ2|qq|Pr[(q)=p].

For any subset [0,1] of predictions,

Pr[(q)]τ2|qq|Pr[(q)].

A.3.2 Truncated Gaussian Noise

Proof of Lemma 21, Truncated Gaussian.

The choice of parameters is adopted from Dwork et al. [8]. We write the proof here for reference. The proof has two main steps. First, we show that the Gaussian distribution Y𝒩(0,2ϵln(1.25ϵ)) is (γ0,δ)-differentially private with δ=ϵ and 1eγ0ϵϵ. Then we show the probability that Gaussian is truncated is bounded by 1exp(14ϵ), implying

Pr[X=p]Pr[Y=p]11exp(14ϵ).

By Definition 1, the truncated distribution has δ=ϵ and 1eγϵ1eγ0ϵ(1exp(14ϵ))2ϵ.

Now we show Gaussian distribution Y𝒩(0,2ϵln(1.25ϵ)) is differentially private. Notice that for Definition 1, it suffices to show

Prpq+Y[PrY[q+Y=p]PrY[q+Y=p]eγ0|qq|]δ.

Define L(p)=PrY[q+Y=p]PrY[q+Y=p]. We know

ln[L(p)]=(pq)2+(pq)22σ2=(qq)2+2(pq)(qq)2σ2,

where (pq) is the Gaussian 𝒩(0,σ2). Applying the tail bound for Gaussian distribution with γ1=γ0|qq|

Pr[ln[L(p)]γ1] exp((γ1σ212(qq)2)2(qq)2σ2)

For σ=2ϵln(1.25δ)2ln(1.25δ)|qq|γ1, we have Pr[ln[L(p)]δ.

A.4 Improved Bound for Truncated Gaussian Noise

For any distribution D with probability density function f, define fb(x) as the probability density function of truncated distribution of D on the interval [b,1b] and fb(x)=f(x)b1bf(x)dx.

Lemma 31.

Consider any distribution of noise with probability density function f(x) that is monotone on x0 and x<0 respectively. Then for b,q[0,1],

dTV((b),(q))max{fb(x),fq(x)}|qb|.
Proof of Lemma 31.

Fix b and q, without loss of generality, assume that bq. There exists t[b,q] that fb(tb)=fq(tq).

Represent the probability of (b)[0,t] by S=btbfb(x)dx, so dTV((b),(q))=Sqtqfq(x)dx.

When tqb, represent the probability of (b)[t+bq,t] by S1=tqtbfb(x)dx=Sbtqfb(x)dx. The aim is to show dTV((b),(q))S1.

(i) If b1bf(x)dxq1qf(x)dx, then

dTV((b),(q))S1
btqfb(x)dxqtqfq(x)dx
btq(fb(x)fq(x))dxqbfq(x)dx.

(ii) If b1bf(x)dx<q1qf(x)dx, then tq0fb(x)dx>tq0fq(x)dx. Since

b0fb(x)dx=b0f(x)dxb1bf(x)dx=11+01bf(x)dxb0f(x)dx

is an increasing function of b,

b0fb(x)dxq0fq(x)dx.
dTV((b),(q))S1
btqfb(x)dxqtqfq(x)dx
b0fb(x)dxtq0fb(x)dxq0fq(x)dxtq0fq(x)dx

Therefore,

dTV((b),(q))S1(qb)max{fb(x),fq(x)}.

When t<qb,

dTV((b),(q))=Sqtqfq(x)dx<S<(qb)max{fb(x),fq(x)}.

Lemma 32.

Consider adding the truncated noise from a Gaussian distribution 𝒩(0,ϵ) in the same way as Lemma 21, then for C=Θ(ϵ), the predictor is C-omnipredictor with ECEC.

Proof.

The truncated noise has

𝐄[|X|]σ=ϵ.

The maximum value of the truncated Gaussian distribution’s probability density function is

maxq,p[0,1]Prpq+X[q+X=p]=maxq,p[0,1]12πσexp((pq)22σ2)q1q12πσexp(x22σ2)dx=12πσ0112πσexp(x22σ2)dx.

Since exp(x22σ2) is concave on [0,σ], 0σ12πσexp(x22σ2)dx can be lower bounded by the area of a ladder:

0112πσexp(x22σ2)dx122πσ(1+exp(12))σ122π.

By Lemma 31,

𝐄(b,q)DB,Q[dTV((b),(q))]𝐄(b,q)DB,Q[2σ|qb|]=2ϵσ.

Therefore, the parameter C of the predictor can be upper bounded by σ+8ϵσ=Θ(ϵ).

A.5 Proof of Theorem 22

Proof of Theorem 22.

Fix a predictor Q, define predictor Q~ that predicts the Bayesian posterior of Q: for every prediction value qi, when Q predicts qi, let Q~ predict qi^=Pr[θ=1q=qi]. Post-process predictor Q by f and get predictor P.

Fix a prediction value qi and a proper scoring rule S, consider all predictions pf(qi), according to the definition of proper scoring rules, the score achievable by f is upperbounded by Q~:

𝐄pf(qi)[𝐄θqi^[S(p,θ)]]𝐄pf(qi)[𝐄θqi^[S(qi^,θ)]]=𝐄θqi^[S(qi^,θ)].
𝐄(p,θ)DP,Θ[S(p,θ)] =𝐄qiq[𝐄pf(qi)[𝐄θqi^[S(p,θ)]]]
𝐄qiq[𝐄θqi^[S(qi^,θ)]]=𝐄(p,θ)DQ~,Θ[S(p,θ)].

Consider the following predictor Q with DTC(Q)=ϵ.

Case 1: With probability 1ϵ, the distribution of predictions and states follows

(q,q^)={(12ϵ,12ϵ)w.p.12(12+ϵ,12+ϵ)w.p.12

Case 2: With probability ϵ, the distribution of predictions and states follows

(q,q^)={(12ϵ,1)w.p.12(12+ϵ,0)w.p.12

Therefore the corresponding q~ follows

(q~,q)={(1212ϵ+ϵ,12ϵ)w.p.12(12+12ϵϵ,12+ϵ)w.p.12

Define a calibrated predictor B, when Q follows from Case 1, let B outputs the same prediction of Q. When Q follows from Case 2, let B always predicts 12. Notice that

DTC(Q)Dist(Q,B)=ϵ,

to show DTC(Q)=ϵ, use a linear program with infinite constraints to prove DTC(Q)ϵ. Notice that 𝒬={12ϵ,12+ϵ}. Let ρ denotes joint probability distribution function of (b,q,θ)[0,1]×𝒬×{0,1}. The following linear program is feasible and its optimal value equals DTC(Q).

minimize (b,q,θ)[0,1]×𝒬×{0,1}|qb|ρ(b,q,θ) (5)
s.t. b[0,1]ρ(b,q,θ)=Pr[q,θ], for (q,θ)𝒬×{0,1};(r(q,θ))
(1b)q𝒬ρ(b,q,1)bq𝒬ρ(b,q,0)=0, for b[0,1];(s(b))
ρ(b,q,θ)0, for (b,q,θ)[0,1]×𝒬×{0,1}.

The objective of this linear program corresponds to DTC(Q). The first constraint ensures that the joint distribution of (b,p,θ) is consistent with the joint distribution of (q,θ). The second constraint ensures that predictor B is calibrated. This linear program is feasible, because

ρ(b,q,θ)={Pr[q,θ]if b=θ0else

is a feasible solution of this linear program. The dual of the linear program (5) is

maximize (q,θ)𝒬×{0,1}Pr[q,θ]r(q,θ) (6)
s.t. r(q,θ)|bq|+(θb)s(b), for (b,q,θ)[0,1]×𝒬×{0,1}.

If s(b)>1, change s(b) to 1 still satisfy the constraints and the objective stays the same:

r(q,0)|bq|bs(b)<|bq|b,
r(q,1)|1q||bq|+(1b).

If s(b)<1, change s(b) to 1 still satisfy the constraints and the objective stays the same:

r(q,0)q<|bq|+b,
r(q,1)|bq|+(1b)s(q)|bq|(1b).

Therefore, the optimal solution of linear program (6) stays the same after adding the constraints:

1s(b)1,for b[0,1].

The optimal value of linear program (5) can be lower bounded by the objective of linear program (6):

(q,θ)𝒬×{0,1}Pr[q,θ]r(q,θ)
= (q,θ)𝒬×{0,1}r(q,θ)b[0,1]ρ(b,q,θ)+b[0,1]s(b)(q,θ)𝒬×{0,1}(bθ)ρ(b,q,θ) (7)
= (q,θ)𝒬×{0,1}b[0,1]r(q,θ)ρ(b,q,θ)+b[0,1](q,θ)𝒬×{0,1}s(b)(bθ)ρ(b,q,θ) (8)
= (b,q,θ)[0,1]×𝒬×{0,1}[r(q,θ)+(bθ)s(b)]ρ(b,q,θ) (9)
(b,q,θ)[0,1]×𝒬×{0,1}|qb|ρ(b,q,θ).

(7)=(8) holds because b[0,1]ρ(b,q,θ) is absolutely convergent, the distributive property of multiplication still holds. (8)=(9) holds because Equation 8 is absolutely convergent, the commutative property of addition still holds.

Let

s(b)={2ϵ2ϵ+1if b<120if b=122ϵ2ϵ+1if b>12

Then the constraints for the dual linear program (6) are

r(12ϵ,0)minb[0,1]{|b12+ϵ|bs(b)}=ϵ(12ϵ)2ϵ+1,
r(12ϵ,1)minb[0,1]{|b12+ϵ|+(1b)s(b)}=ϵ,
r(12+ϵ,0)minb[0,1]{|b12ϵ|bs(b)}=ϵ,
r(12+ϵ,1)minb[0,1]{|b12ϵ|+(1b)s(b)}=ϵ(12ϵ)2ϵ+1.

Take maximum values of all r(q,θ) and get the optimal value of linear program (6) is no less than

12(12+ϵ2ϵ)[r(12ϵ,0)+r(12+ϵ,1)]
+ 12(12ϵ2+ϵ)[r(12ϵ,1)+r(12+ϵ,0)]=ϵ.

Therefore, DTC(Q)ϵ and thus DTC(Q)=ϵ.

Consider the proper scoring rule

S(p,θ)={1θif p12θif p>12

and calculate the expected payoff in decision making for predictor Q and B:

𝐄(p,θ)DQ~,Θ[S(p,θ)]=12+12ϵϵ.
𝐄(b,θ)DB,Θ[S(b,θ)]=12+ϵϵ.

Therefore, for any post-processed algorithm f, there exists predictor Q and a reference calibrated predictor b such that DTC(Q)=ϵ and

DL(f(Q);B)𝐄(b,θ)DB,Θ[S(b,θ)]𝐄(p,θ)DQ~,Θ[S(p,θ)]=ϵ2.

Appendix B Missing Proof in Section 4

B.1 Proof of Theorem 24

Proof of Theorem 24.

We write ni as the number of times that iT13 is predicted. Clearly, i[ϵT]ni=T. We also write pt as the output of post-processed predictor before discretization. Conditioning on a set of (ni)i, we know for each ni:

𝐄[|iT13t𝕀[pt=iT13]θtni|]
𝐄[|iT13t𝕀[pt=iT13]pt|]+1ni𝐄[|t𝕀[pt=iT13]θtt𝕀[pt=iT13]pt|]
T13+𝐕𝐚𝐫[t𝕀[p=iT13]θtni|]
+1ni𝐄[t|𝕀[pt=iT13]Pr[θpt]t𝕀[pt=iT13]pt|],

where Pr[θ|pt] is defined over the empirical distribution over T rounds with the noise of the algorithm. Summing over all prediction values, we know

1Ti𝐄[|iT13t𝕀[p=iϵT]θtni|]i1ni+ECE(P)+T13ECE(P)+2T13.

B.2 Proof of Theorem 27

We restate our lemmas for ECE and CDL separately here and prove them.

Theorem 33.

For any post-processing algorithm f, there exists two sequences of predictions 𝐪 and 𝐪 with states 𝛉 and 𝛉, respectively, both satisfying DTC(𝐪)=DTC(𝐪)=ϵ, such that

max{𝐄[ECE(𝒑;𝜽)],𝐄[ECE(𝒑;𝜽)]}18ϵ+12ϵ=Θ(ϵ),

where we write 𝐩,𝐩 as the output of the post-processing algorithm f on 𝐪,𝐪, respectively.

Lemma 34.

Given predictor Q=(q1,,qT), and a post-processing algorithm f=(f1,,fT), suppose the empirical posterior for each prediction is Q^=(q^1,,q^T). There exists a sequence of states 𝛉 such that 𝛉 is compatible with the empirical posterior, i.e.

i[T],q^i=t[T]θt𝕀[qt=qi]t[T]𝕀[qt=qi].

Moreover, the expected ECE of the predictor f with states 𝛉 is lowerbounded

𝐄𝒑f[ECE(𝒑,𝜽)]𝐄𝒑f[1Tpsupp(𝒑)|t[T](pq^t)𝕀[pt=p]|],

where supp is the support of the output of f in each round.

Proof.

Define Sθ={𝜽𝜽 is compatible with the empirical posterior}. Let 𝜽 be chosen uniformly at random from Sθ, fix a sequence of predictions 𝒑.

Given the distribution of 𝜽, 𝐄𝜽Sθ[tpi^𝕀[pt=pi]]=tqt^𝕀[pt=pi] holds for any sequences of predictions 𝒑 and any i[T]. By Jensen’s Inequality,

𝐄𝜽Sθ[|t(pipi^)𝕀[pt=pi]|] |t(pi𝕀[pt=pi]𝐄𝜽Sθ[pi^𝕀[pt=pi]])|
=|t(piqt^)𝕀[pt=pi]|,

apply this inequality to every prediction value pt:

𝐄𝜽Sθ[ECE(𝒑)] =1Tpi𝐄𝜽Sθ[|t(pipi^)𝕀[pt=pi]|]
1Tpi|t(piqt^)𝕀[pt=pi]|

Take expectation on the distribution of predictions,

𝐄𝜽Sθ𝐄𝒑P[ECE(𝒑)]𝐄𝒑f[1Tpsupp(𝒑)|t[T](pq^t)𝕀[pt=p]|].

Therefore, there must exist a sequence of states 𝜽 that

𝐄𝒑P[ECE(𝒑)]𝐄𝜽Sθ𝐄𝒑P[ECE(𝒑)]𝐄𝒑f[1Tpsupp(𝒑)|t[T](pq^t)𝕀[pt=p]|].

Proof of Theorem 33.

Assume there are 2T rounds, define 𝒒 and 𝒒 as following:

qt={12ϵif tT12+ϵelse 
t=1T𝕀[θt=1]=T(1212ϵ+ϵ),t=T+12T𝕀[θt=1]=T(12+12ϵϵ).

For any t[2T], qt=12ϵ and t=12T𝕀[θt=1]=2T(12ϵϵ). Define q^0=12ϵϵ, q^1=1212ϵ+ϵ, q^1=12+12ϵϵ.

Fix a post-processing algorithm f. For any sequence of predictions 𝒑 generated by post-processing 𝒒, denote the distribution of 𝒑 by 𝐟(𝒒).

For any t[2T] and any sequence of predictions 𝒑𝐟(𝒒), define

A(𝒑)t=q^1t=1T𝕀[pt=pt]+q^2t=T+12T𝕀[pt=pt]t=12T𝕀[pt=pt][q^1,q^2].

According to Lemma 34, there exists a sequence of states 𝜽 that

𝐄𝒑𝐟(𝒒)[ECE(𝒑)] 𝐄𝒑𝐟(𝒒)[12Tpsupp(𝐟(𝒒))|t[2T](pq^t)𝕀[pt=p]|]
=𝐄𝒑𝐟(𝒒)[12Tt[2T]|ptA(𝒑)t|]. (10)

According to Lemma 34,

𝐄𝒑𝐟(𝒒)[ECE(𝒑)]𝐄𝒑𝐟(𝒒)[12Tt[2T]|ptp^0|]. (11)

For any 𝒑𝐟(𝒒) and 𝒑𝐟(𝒒), pt=pt always holds for t[T], since qt=qt always holds for t[T]. Therefore, for any t[T],

|ptA(𝒑)t|+|ptp^0||A(𝒑)tp^0|12ϵ+2ϵ.

Add up inequality (10) and inequality (11),

𝐄𝒑𝐟(𝒒)[ECE(𝒑)]+𝐄𝒑𝐟(𝒒)[ECE(𝒑)]
𝐄𝒑𝐟(𝒒),𝒑𝐟(𝒒)[12Tt[2T](|ptA(𝒑)t|+|ptp^0|)]
𝐄𝒑𝐟(𝒒),𝒑𝐟(𝒒)[12Tt[T](|ptA(𝒑)t|+|ptp^0|)]
𝐄𝒑𝐟(𝒒),𝒑𝐟(𝒒)[12Tt[T]|A(𝒑)tq^0|]
= 14ϵ+ϵ.

Therefore,

max{𝐄p𝐟(𝒒)[ECE(p)],𝐄p𝐟(𝒒)[ECE(p)]}
12𝐄p𝐟(𝒒)[ECE(p)]+12𝐄p𝐟(𝒒)[ECE(p)]18ϵ+12ϵ.

Theorem 35.

For any post-processing algorithm f, there exists two sequences of predictions 𝐪 and 𝐪 with states 𝛉 and 𝛉, respectively, both satisfying DTC(𝐪)=DTC(𝐪)=ϵ, such that

max{𝐄[ECE(𝒑;𝜽)],𝐄[ECE(𝒑;𝜽)]}18ϵ+12ϵ=Θ(ϵ),

where we write 𝐩,𝐩 as the output of the post-processing algorithm f on 𝐪,𝐪, respectively.

Moreover, the same argument holds for CDL.

Proof of Theorem 35.

Define two sets of sequences of states corresponding to predictor 𝒒 and 𝒒 that every 𝜽 and 𝜽 in these sets are compatible with empirical posterior: Sθ={𝜽t[T]θt=T(1212ϵ+ϵ),t=T+12Tθt=T(12+12ϵϵ)}, Sθ={𝜽t[2T]θt=2T(12ϵϵ)}. Denote the number of predicting prediction value psupp(𝒑) by ni=t2T𝕀[pt=pi].

Fix a post-processing algorithm f. Define a proper scoring rule

Sμ(p,θ)={1212θμmax{μ,1μ}if pμ12+12θμmax{μ,1μ}else.

According to the definition of CDL,

𝐄𝒑𝐟(𝒒)[CDL(𝒑,𝜽)]12T𝐄𝒑𝐟(𝒒)[supμ[0,1]t[2T](Sμ(pt^,θt)Sμ(pt,θt))]. (12)

For any sequence of predictions 𝒑, define N𝒑=t[T]𝕀[ptμ], M𝒑=t=T+12T𝕀[ptμ].

𝐄𝜽Sθ[t[2T](Sμ(pt^,θt)Sμ(pt,θt))]
𝐄𝜽Sθ[1max{μ,1μ}pisupp(𝒑)ni𝕀[piμ](p^iμ)]
= 1max{μ,1μ}pisupp(𝒑),piμt[2T]𝕀[pt=pi](qt^μ)
1max{μ,1μ}pisupp(𝒑),piμt[2T]𝕀[pt=pi](1212ϵ+ϵμ)
1max{μ,1μ}(2TN𝒑M𝒑)(1212ϵ+ϵμ).
𝐄𝜽Sθ[𝐄𝒑𝐟(𝒒)[supμ[0,1]t[2T](Sμ(pt^,θt)Sμ(pt,θt))]]
= 𝐄𝒑𝐟(𝒒)[𝐄𝜽Sθ[supμ[0,1]t[2T](Sμ(pt^,θt)Sμ(pt,θt))]]
𝐄𝒑𝐟(𝒒)[supμ[0,1]𝐄𝜽Sθ[t[2T](Sμ(pt^,θt)Sμ(pt,θt))]]
𝐄𝒑𝐟(𝒒)[supμ[0,1]1max{μ,1μ}(2TN𝒑M𝒑)(1212ϵ+ϵμ)]. (13)

Combine inequality (12) and (13), there exists 𝜽Sθ, that

𝐄𝒑𝐟(𝒒)[CDL(𝒑,𝜽)] (14)
12T𝐄𝒑𝐟(𝒒)[supμ[0,1]1max{μ,1μ}(2TN𝒑M𝒑)(1212ϵ+ϵμ)].
𝐄𝜽Sθ[t[2T](Sμ(pt^,θt)Sμ(pt,θt))]
𝐄𝜽Sθ[1max{μ,1μ}pisupp(𝒑)ni𝕀[piμ](μp^i)]
= 1max{μ,1μ}pisupp(𝒑),piμt[2T]𝕀[pt=pi](μqt^)
= 1max{μ,1μ}pisupp(𝒑),piμt[2T]𝕀[pt=pi](μ12+ϵ+ϵ)
= 1max{μ,1μ}(N𝒑+M𝒑)(μ12+ϵ+ϵ).

Similarly, there exists 𝜽Sθ, that

𝐄𝒑𝐟(𝒒)[CDL(𝒑,𝜽)] (15)
12T𝐄𝒑𝐟(𝒒)[supμ[0,1]1max{μ,1μ}(N𝒑+M𝒑)(μ12+ϵ+ϵ)].

For any 𝒑𝐟(𝒒) and 𝒑𝐟(𝒒), pt=pt always holds for t[T], since qt=qt always holds for t[T]. So N𝒑=N𝒑 and M𝒑=M𝒑 always hold for t[T]. Combine inequality (14) and (15),

max{𝐄𝒑𝐟(𝒒)[CDL(𝒑,𝜽)],𝐄𝒑𝐟(𝒒)[CDL(𝒑,𝜽)]}
14T𝐄𝒑𝐟(𝒒)[supμ[0,1]1max{μ,1μ}(2TN𝒑M𝒑)(1212ϵ+ϵμ)]
+ 14T𝐄𝒑𝐟(𝒒)[supμ[0,1]1max{μ,1μ}(N𝒑+M𝒑)(μ12+ϵ+ϵ)] (16)
14T𝐄𝒑𝐟(𝒒),𝒑𝐟(𝒒)[112+34ϵ(2TM𝒑+M𝒑)(14ϵ+ϵ)] (17)
18ϵ+12ϵ. (18)

By taking μ=1234ϵ for both cases for 𝒑 and 𝒑 and get (16)(17). Since M𝒑,M𝒑[0,T], M𝒑M𝒑T, so (17)(18).