Fair Grading Algorithms for Randomized Exams

Authors: Jiale Chen, Jason Hartline, Onno Zoeter

  • 22 pages

Author Details

Jiale Chen
  • Department of Management Science and Engineering, Stanford University, CA, USA
Jason Hartline
  • Department of Computer Science, Northwestern University, Evanston, IL, USA
Onno Zoeter
  • Booking.com, Amsterdam, The Netherlands

Cite As

Jiale Chen, Jason Hartline, and Onno Zoeter. Fair Grading Algorithms for Randomized Exams. In 4th Symposium on Foundations of Responsible Computing (FORC 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 256, pp. 7:1-7:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)


This paper studies grading algorithms for randomized exams. In a randomized exam, each student is asked a small number of random questions from a large question bank. The predominant grading rule is simple averaging, i.e., calculating grades by averaging scores on the questions each student is asked, which is fair ex-ante, over the randomized questions, but not fair ex-post, on the realized questions. The fair grading problem is to estimate the average grade of each student on the full question bank. The maximum-likelihood estimator for the Bradley-Terry-Luce model on the bipartite student-question graph is shown to be consistent with high probability when the number of questions asked to each student is at least the cubed-logarithm of the number of students. In an empirical study on exam data and in simulations, our algorithm based on the maximum-likelihood estimator significantly outperforms simple averaging in prediction accuracy and ex-post fairness even with a small class and exam size.
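The contrast between simple averaging and a model-based estimate can be illustrated with a minimal sketch. It assumes a Rasch-style logistic model, one common way to write a Bradley-Terry-Luce comparison between a student and a question, in which student s answers question q correctly with probability sigmoid(ability[s] - difficulty[q]). The function names, the toy data, the gradient-ascent solver, and the small ridge penalty (added only to keep estimates finite when a student answers everything correctly) are illustrative assumptions and are not taken from the paper, whose actual algorithm may differ.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def simple_average(responses, n_students):
    """Baseline: mean score over the questions each student actually saw."""
    total = [0.0] * n_students
    count = [0] * n_students
    for s, _, y in responses:
        total[s] += y
        count[s] += 1
    return [t / c if c else 0.0 for t, c in zip(total, count)]


def fit_abilities_and_difficulties(responses, n_students, n_questions,
                                   steps=2000, lr=1.0, ridge=0.05):
    """Gradient ascent on the (ridge-penalized) log-likelihood.

    responses: list of (student, question, score) with score in {0, 1}.
    The ridge term is an assumption of this sketch, not part of the paper.
    """
    ability = [0.0] * n_students
    difficulty = [0.0] * n_questions
    deg_s = [0] * n_students   # number of questions each student saw
    deg_q = [0] * n_questions  # number of students who saw each question
    for s, q, _ in responses:
        deg_s[s] += 1
        deg_q[q] += 1
    for _ in range(steps):
        grad_a = [-ridge * a for a in ability]
        grad_d = [-ridge * d for d in difficulty]
        for s, q, y in responses:
            err = y - sigmoid(ability[s] - difficulty[q])
            grad_a[s] += err
            grad_d[q] -= err
        # Scale each step by the node's degree so the update stays stable.
        ability = [a + lr * g / max(n, 1)
                   for a, g, n in zip(ability, grad_a, deg_s)]
        difficulty = [d + lr * g / max(n, 1)
                      for d, g, n in zip(difficulty, grad_d, deg_q)]
    return ability, difficulty


def estimated_bank_grades(ability, difficulty):
    """Predicted average score of each student over the full question bank."""
    return [sum(sigmoid(a - d) for d in difficulty) / len(difficulty)
            for a in ability]


# Toy exam: question 0 is easy, question 1 is hard.
# Students 0-3 saw both questions; student 4 saw only the easy one,
# student 5 saw only the hard one, and both answered correctly.
responses = [
    (0, 0, 1), (0, 1, 1),
    (1, 0, 1), (1, 1, 0),
    (2, 0, 1), (2, 1, 0),
    (3, 0, 0), (3, 1, 0),
    (4, 0, 1),
    (5, 1, 1),
]
avg = simple_average(responses, 6)
ability, difficulty = fit_abilities_and_difficulties(responses, 6, 2)
grades = estimated_bank_grades(ability, difficulty)
```

In this toy exam, simple averaging gives students 4 and 5 the same grade even though student 5 was dealt the harder question. The model-based estimate accounts for question difficulty and ranks student 5 above student 4, which is the kind of ex-post correction the abstract describes.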

Subject Classification

ACM Subject Classification
  • Social and professional topics → Student assessment
Keywords
  • Ex-ante and Ex-post Fairness
  • Item Response Theory
  • Algorithmic Fairness in Education



