Balanced Filtering via Disclosure-Controlled Proxies

Deng, Siqi; Diana, Emily; Kearns, Michael; Roth, Aaron

doi:10.4230/LIPIcs.FORC.2024.4

Abstract

We study the problem of collecting a cohort or set that is balanced with respect to sensitive groups when group membership is unavailable or prohibited from use at deployment time. Specifically, our deployment-time collection mechanism does not reveal significantly more about the group membership of any individual sample than can be ascertained from base rates alone. To do this, we study a learner that can use a small set of labeled data to train a proxy function that can later be used for this filtering or selection task. We then associate the range of the proxy function with sampling probabilities; given a new example, we classify it using our proxy function and then select it with probability corresponding to its proxy classification. Importantly, we require that the proxy classification does not reveal significantly more information about the sensitive group membership of any individual example compared to population base rates alone (i.e., the level of disclosure should be controlled) and show that we can find such a proxy in a sample- and oracle-efficient manner. Finally, we experimentally evaluate our algorithm and analyze its generalization properties.
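The selection step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' algorithm: the names `balanced_filter`, `posterior_by_proxy_value`, `proxy`, and `accept_prob` are hypothetical, the proxy is assumed to take finitely many values, and the acceptance probabilities are supplied by hand rather than learned. The second helper is only a rough empirical stand-in for the disclosure requirement: it compares the group rate within each proxy value against the overall base rate on a small labeled set. The paper's formal definition may differ.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple, TypeVar

X = TypeVar("X")


def balanced_filter(
    examples: Iterable[X],
    proxy: Callable[[X], Hashable],
    accept_prob: Dict[Hashable, float],
    rng: random.Random,
) -> List[X]:
    """Keep each example with the probability attached to its proxy value.

    The true group label is never consulted here; only the proxy output is.
    """
    selected = []
    for x in examples:
        p = accept_prob[proxy(x)]  # sampling probability for this proxy value
        if rng.random() < p:       # rng.random() lies in [0, 1)
            selected.append(x)
    return selected


def posterior_by_proxy_value(
    labeled: Iterable[Tuple[X, int]],
    proxy: Callable[[X], Hashable],
) -> Tuple[float, Dict[Hashable, float]]:
    """Return (base rate, empirical P(group = 1 | proxy value)) on labeled data.

    Assumes binary group labels in {0, 1} and a non-empty labeled set.
    A proxy value whose rate sits far from the base rate reveals more
    about membership than the base rate alone.
    """
    counts = defaultdict(lambda: [0, 0])  # proxy value -> [n, n_in_group]
    total, total_in_group = 0, 0
    for x, g in labeled:
        bucket = counts[proxy(x)]
        bucket[0] += 1
        bucket[1] += g
        total += 1
        total_in_group += g
    base_rate = total_in_group / total
    posteriors = {v: k / n for v, (n, k) in counts.items()}
    return base_rate, posteriors


# Toy usage: integers as examples, parity as a stand-in proxy (assumption).
toy_rng = random.Random(0)
toy_cohort = balanced_filter(range(20), lambda x: x % 2, {0: 0.8, 1: 0.3}, toy_rng)
```

In this sketch the size and composition of the collected cohort depend only on the proxy values and their acceptance probabilities, which mirrors the abstract's point that group membership itself is not used at deployment time.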

