Distributionally Robust Data Join

Authors Pranjal Awasthi, Christopher Jung, Jamie Morgenstern


Author Details

Pranjal Awasthi
  • Google Research, NY, USA
Christopher Jung
  • Stanford University, CA, USA
Jamie Morgenstern
  • University of Washington, Seattle, WA, USA

Cite As

Pranjal Awasthi, Christopher Jung, and Jamie Morgenstern. Distributionally Robust Data Join. In 4th Symposium on Foundations of Responsible Computing (FORC 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 256, pp. 10:1-10:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023).


Suppose we are given two datasets: a labeled dataset and an unlabeled dataset, where the latter also carries auxiliary features not present in the first. What is the most principled way to use these datasets together to construct a predictor? The answer should depend on whether the datasets are generated by the same or different distributions over their mutual feature sets, and on how similar the test distribution will be to either of those distributions. In many applications, the two datasets will likely follow different distributions, but both may be close to the test distribution. We introduce the problem of building a predictor that minimizes the maximum loss over all probability distributions (over the original features, auxiliary features, and binary labels) lying within Wasserstein distance r₁ of the empirical distribution of the labeled dataset and within Wasserstein distance r₂ of that of the unlabeled dataset. This can be thought of as a generalization of distributionally robust optimization (DRO), which allows for two data sources, one of which is unlabeled and may contain auxiliary features.
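The two-ball data-join formulation above generalizes standard single-source Wasserstein DRO. To make the single-source baseline concrete, the sketch below uses a well-known dual reformulation: for logistic regression with a Wasserstein ball over feature perturbations (Euclidean ground metric, labels unperturbed), the worst-case expected loss equals the empirical loss plus the radius times the norm of the weight vector. This is a minimal illustration of classical Wasserstein DRO, not the paper's two-dataset method; the function and variable names are illustrative.

```python
import numpy as np

def logistic_loss(theta, X, y):
    """Mean logistic loss over (X, y), with labels y in {-1, +1}."""
    margins = y * (X @ theta)
    return float(np.mean(np.log1p(np.exp(-margins))))

def wasserstein_robust_loss(theta, X, y, r):
    """Worst-case logistic loss over a type-1 Wasserstein ball of radius r
    around the empirical distribution, allowing feature perturbations only
    (Euclidean ground metric). The dual reformulation reduces the sup to
    empirical loss + r * ||theta||_2, since the logistic loss is
    ||theta||_2-Lipschitz in the features."""
    return logistic_loss(theta, X, y) + r * float(np.linalg.norm(theta))

# Tiny illustrative dataset (hypothetical values).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
theta = np.array([0.5, -0.5])

empirical = logistic_loss(theta, X, y)
robust = wasserstein_robust_loss(theta, X, y, r=0.1)
# The robust objective exceeds the empirical one by exactly r * ||theta||_2.
```

Minimizing `wasserstein_robust_loss` in `theta` is therefore equivalent to norm-regularized logistic regression; the data-join problem in the abstract replaces the single ball with the intersection of two balls, one around each empirical distribution.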

Subject Classification

ACM Subject Classification
  • Theory of computation → Machine learning theory
  • Distributionally Robust Optimization
  • Semi-Supervised Learning
  • Learning Theory


