Decision-Making Under Miscalibration

Authors: Guy N. Rothblum, Gal Yona



Author Details

Guy N. Rothblum
  • Weizmann Institute, Rehovot, Israel
Gal Yona
  • Weizmann Institute, Rehovot, Israel

Cite As

Guy N. Rothblum and Gal Yona. Decision-Making Under Miscalibration. In 14th Innovations in Theoretical Computer Science Conference (ITCS 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 251, pp. 92:1-92:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)


How should we use ML-based predictions (e.g., risk of heart attack) to inform downstream binary classification decisions (e.g., undergoing a medical procedure)? When the risk estimates are perfectly calibrated, the answer is well understood: a classification problem’s cost structure induces an optimal treatment threshold j^⋆. In practice, however, predictors are often miscalibrated, and this can lead to harmful decisions. This raises a fundamental question: how should one use potentially miscalibrated predictions to inform binary decisions? In this work, we study this question from the perspective of algorithmic fairness. Specifically, we focus on the impact of decisions on protected demographic subgroups, when we are only given a bound on the predictor’s anticipated degree of subgroup-miscalibration. We formalize a natural (distribution-free) solution concept for translating predictions into decisions: given anticipated miscalibration of α, we propose using the threshold j that minimizes the worst-case regret over all α-miscalibrated predictors, where the regret is the difference in clinical utility between using the threshold in question and using the optimal threshold in hindsight. We provide closed-form expressions for j when miscalibration is measured using either expected or maximum calibration error, which reveal that j indeed differs from j^⋆ (the optimal threshold under perfect calibration).
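The solution concept above can be made concrete with a small numerical sketch. The code below is purely illustrative and is not the paper's closed-form solution: it assumes a toy cost model (a hypothetical benefit b for treating a truly positive patient and harm c for treating a truly negative one, so the calibrated-optimal threshold is t^⋆ = c/(b+c)), a uniform grid of predicted risks, and a maximum-calibration-error-style adversary that may move each true risk within ±α of the prediction. It evaluates the worst-case regret of any candidate threshold j and grid-searches for the minimax choice.

```python
import numpy as np

def worst_case_regret(j, alpha, b=1.0, c=3.0, n_bins=501):
    """Worst-case regret of "treat iff predicted risk v >= j", over all true
    risk functions p(v) with |p(v) - v| <= alpha (an MCE-style bound).
    Toy utility model: treating a patient with true risk p yields p*b - (1-p)*c,
    not treating yields 0; the hindsight benchmark treats iff p >= t_star."""
    v = np.linspace(0.0, 1.0, n_bins)      # predicted risks (uniform toy grid)
    w = np.full(n_bins, 1.0 / n_bins)      # probability mass per risk bin
    t_star = c / (b + c)                   # optimal threshold if calibrated

    treat = v >= j
    # The adversary moves each true risk anywhere in [v - alpha, v + alpha],
    # clipped to [0, 1], independently per bin (the MCE bound is pointwise).
    p_lo = np.clip(v - alpha, 0.0, 1.0)
    p_hi = np.clip(v + alpha, 0.0, 1.0)
    # Per-bin regret vs. the hindsight-optimal action at true risk p:
    #   treated bin:   (b + c) * max(0, t_star - p), worst at p = p_lo
    #   untreated bin: (b + c) * max(0, p - t_star), worst at p = p_hi
    regret = np.where(treat,
                      (b + c) * np.maximum(0.0, t_star - p_lo),
                      (b + c) * np.maximum(0.0, p_hi - t_star))
    return float(np.sum(w * regret))

def minimax_threshold(alpha, b=1.0, c=3.0, n_grid=501):
    """Grid search for the threshold minimizing worst-case regret."""
    grid = np.linspace(0.0, 1.0, n_grid)
    regrets = [worst_case_regret(j, alpha, b, c) for j in grid]
    return float(grid[int(np.argmin(regrets))])
```

For example, with b = 1 and c = 3 the calibrated-optimal threshold is t^⋆ = 0.75, and `worst_case_regret(0.75, alpha)` grows with the anticipated miscalibration α. Note that this toy deliberately simplifies the paper's setting (in particular its clinical-utility formulation and the expected-calibration-error adversary, whose budget couples the bins); it is meant only to convey the worst-case-regret objective, not to reproduce the paper's closed-form thresholds.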

Subject Classification

ACM Subject Classification
  • Theory of computation → Models of learning

Keywords and Phrases
  • risk prediction
  • calibration
  • algorithmic fairness
  • multi-group fairness



