Balanced Filtering via Disclosure-Controlled Proxies

Authors: Siqi Deng, Emily Diana, Michael Kearns, Aaron Roth




Author Details

Siqi Deng
  • Amazon AWS AI, Palo Alto, CA, USA
Emily Diana
  • Toyota Technological Institute at Chicago, IL, USA
Michael Kearns
  • University of Pennsylvania, Philadelphia, PA, USA
  • Amazon AWS AI, Palo Alto, CA, USA
Aaron Roth
  • University of Pennsylvania, Philadelphia, PA, USA
  • Amazon AWS AI, Palo Alto, CA, USA

Cite As

Siqi Deng, Emily Diana, Michael Kearns, and Aaron Roth. Balanced Filtering via Disclosure-Controlled Proxies. In 5th Symposium on Foundations of Responsible Computing (FORC 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 295, pp. 4:1-4:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.FORC.2024.4

Abstract

We study the problem of collecting a cohort or set that is balanced with respect to sensitive groups when group membership is unavailable or prohibited from use at deployment time. Specifically, our deployment-time collection mechanism does not reveal significantly more about the group membership of any individual sample than can be ascertained from base rates alone. To do this, we study a learner that can use a small set of labeled data to train a proxy function that can later be used for this filtering or selection task. We then associate the range of the proxy function with sampling probabilities; given a new example, we classify it using our proxy function and then select it with probability corresponding to its proxy classification. Importantly, we require that the proxy classification does not reveal significantly more information about the sensitive group membership of any individual example compared to population base rates alone (i.e., the level of disclosure should be controlled) and show that we can find such a proxy in a sample- and oracle-efficient manner. Finally, we experimentally evaluate our algorithm and analyze its generalization properties.
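The selection step described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: it assumes a binary proxy, two sensitive groups, and a table of labeled counts, and it measures disclosure as the largest gap between a group's posterior rate given the proxy value and its base rate. That measure is an assumption made for illustration, not the paper's definition. The names `balancing_probabilities`, `disclosure_gap`, and `select` are hypothetical.

```python
import random


def balancing_probabilities(counts):
    """Return acceptance probabilities (p0, p1) for a binary proxy.

    counts[k][g] is the number of labeled examples with proxy value k
    and sensitive group g (two groups, an assumption of this sketch).
    The probabilities are chosen so that both groups have the same
    expected number of selected examples.
    """
    d0 = counts[0][0] - counts[0][1]  # surplus of group 0 at proxy value 0
    d1 = counts[1][0] - counts[1][1]  # surplus of group 0 at proxy value 1
    if d0 == 0 and d1 == 0:
        return 1.0, 1.0  # already balanced, accept everything
    if d0 * d1 > 0:
        # The same group dominates both proxy values, so no choice of
        # acceptance probabilities can balance the selection.
        raise ValueError("this proxy cannot balance the two groups")
    # Solve p0 * d0 + p1 * d1 = 0 with the larger probability set to 1.
    if abs(d0) >= abs(d1):
        return abs(d1) / abs(d0), 1.0
    return 1.0, abs(d0) / abs(d1)


def disclosure_gap(counts):
    """Largest |P(group | proxy value) - P(group)| over all cells.

    Illustrative disclosure measure only (an assumption of this sketch).
    """
    total = sum(sum(row) for row in counts)
    base = [sum(row[g] for row in counts) / total for g in range(2)]
    gap = 0.0
    for row in counts:
        n = sum(row)
        if n == 0:
            continue
        for g in range(2):
            gap = max(gap, abs(row[g] / n - base[g]))
    return gap


def select(proxy_value, probs, rng):
    """Keep a new example with the probability tied to its proxy value."""
    return rng.random() < probs[proxy_value]


# Hypothetical labeled counts: rows are proxy values, columns are groups.
counts = [[60, 20], [10, 30]]
probs = balancing_probabilities(counts)
expected = [sum(probs[k] * counts[k][g] for k in range(2)) for g in range(2)]
print("acceptance probabilities:", probs)
print("expected selected per group:", expected)
print("disclosure gap:", disclosure_gap(counts))
```

At deployment time only the proxy value is consulted: `select(h_x, probs, random.Random())` decides whether to keep an example whose proxy classification is `h_x`, without looking at its group membership. A proxy with a smaller disclosure gap reveals less about any individual, but it also leaves less room to rebalance, which is the tension the abstract describes.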

Subject Classification

ACM Subject Classification
  • Theory of computation → Design and analysis of algorithms
  • Security and privacy → Human and societal aspects of security and privacy
Keywords
  • Algorithms
  • Sampling
  • Ethical/Societal Implications

