Decision-Making Under Miscalibration

Authors: Guy N. Rothblum and Gal Yona




File
  • LIPIcs.ITCS.2023.92.pdf (0.9 MB, 20 pages)

Document Identifiers
  • DOI: 10.4230/LIPIcs.ITCS.2023.92

Author Details

Guy N. Rothblum
  • Weizmann Institute, Rehovot, Israel
Gal Yona
  • Weizmann Institute, Rehovot, Israel

Cite As

Guy N. Rothblum and Gal Yona. Decision-Making Under Miscalibration. In 14th Innovations in Theoretical Computer Science Conference (ITCS 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 251, pp. 92:1-92:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ITCS.2023.92

Abstract

How should we use ML-based predictions (e.g., risk of heart attack) to inform downstream binary classification decisions (e.g., whether to undergo a medical procedure)? When the risk estimates are perfectly calibrated, the answer is well understood: a classification problem’s cost structure induces an optimal treatment threshold j^⋆. In practice, however, predictors are often miscalibrated, and this can lead to harmful decisions. This raises a fundamental question: how should one use potentially miscalibrated predictions to inform binary decisions? In this work, we study this question from the perspective of algorithmic fairness. Specifically, we focus on the impact of decisions on protected demographic subgroups when we are only given a bound on the predictor’s anticipated degree of subgroup miscalibration. We formalize a natural (distribution-free) solution concept for translating predictions into decisions: given anticipated miscalibration of α, we propose using the threshold ĵ that minimizes the worst-case regret over all α-miscalibrated predictors, where the regret is the difference in clinical utility between using the threshold in question and using the optimal threshold in hindsight. We provide closed-form expressions for ĵ when miscalibration is measured using either the expected or the maximum calibration error; these expressions reveal that ĵ indeed differs from j^⋆ (the optimal threshold under perfect calibration).
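
To make the solution concept concrete, the sketch below grid-searches for a minimax-regret threshold in a toy cost-sensitive model. Everything in it is an illustrative assumption rather than the paper's construction: the utility parameters B and C, the maximum-calibration-error-style adversary that may shift each true probability by up to α, and the brute-force search. The paper instead derives closed-form expressions under its precise definitions, which this toy model does not attempt to reproduce.

    import numpy as np

    # Toy, illustrative cost structure (not from the paper): treating a truly
    # positive patient yields benefit B; treating a truly negative one costs C.
    B, C = 1.0, 3.0
    j_star = C / (B + C)  # classical optimal threshold under perfect calibration

    def regret(p, q, j):
        """Regret of thresholding the prediction p at j when the true
        conditional probability is q: utility of the best decision in
        hindsight minus utility of the decision actually taken."""
        u_treat = q * B - (1.0 - q) * C      # treat-vs-not utility difference
        best = max(u_treat, 0.0)             # hindsight knows q
        taken = u_treat if p >= j else 0.0   # our decision only sees p
        return best - taken

    def worst_case_regret(j, alpha, grid=201):
        """Worst regret over predictions p paired with true probabilities q
        at most alpha away from p (a maximum-calibration-error-style
        adversary).  Checking only q = p - alpha and q = p + alpha suffices,
        since the regret is monotone in q on each branch of the decision."""
        worst = 0.0
        for p in np.linspace(0.0, 1.0, grid):
            for q in (max(p - alpha, 0.0), min(p + alpha, 1.0)):
                worst = max(worst, regret(p, q, j))
        return worst

    alpha = 0.1  # assumed bound on anticipated miscalibration
    thresholds = np.linspace(0.0, 1.0, 201)
    j_hat = min(thresholds, key=lambda j: worst_case_regret(j, alpha))
    print(f"j* = {j_star:.3f}; minimax-regret threshold for alpha={alpha}: {j_hat:.3f}")

In this deliberately simplified, symmetric model the grid search may well land back on j^⋆ = C/(B+C); the value of the sketch is the objective being minimized, not the specific threshold it returns. It is the paper's closed-form analysis, under its exact definitions of expected and maximum calibration error, that establishes a minimax threshold genuinely different from j^⋆.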

Subject Classification

ACM Subject Classification
  • Theory of computation → Models of learning
Keywords
  • risk prediction
  • calibration
  • algorithmic fairness
  • multi-group fairness

