Mechanizing Soundness of Off-Policy Evaluation

Authors Jared Yeager , J. Eliot B. Moss , Michael Norrish , Philip S. Thomas

Thumbnail PDF


  • Filesize: 0.75 MB
  • 20 pages

Document Identifiers

Author Details

Jared Yeager
  • Manning College of Information and Computer Sciences, University of Massachusetts Amherst, MA, USA
J. Eliot B. Moss
  • Manning College of Information and Computer Sciences, University of Massachusetts Amherst, MA, USA
Michael Norrish
  • School of Computing, Australian National University, Canberra, Australia
Philip S. Thomas
  • Manning College of Information and Computer Sciences, University of Massachusetts Amherst, MA, USA

Cite AsGet BibTex

Jared Yeager, J. Eliot B. Moss, Michael Norrish, and Philip S. Thomas. Mechanizing Soundness of Off-Policy Evaluation. In 13th International Conference on Interactive Theorem Proving (ITP 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 237, pp. 32:1-32:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)


There are reinforcement learning scenarios - e.g., in medicine - where we are compelled to be as confident as possible that a policy change will result in an improvement before implementing it. In such scenarios, we can employ off-policy evaluation (OPE). The basic idea of OPE is to record histories of behaviors under the current policy, and then develop an estimate of the quality of a proposed new policy, seeing what the behavior would have been under the new policy. As we are evaluating the policy without actually using it, we have the "off-policy" of OPE. Applying a concentration inequality to the estimate, we derive a confidence interval for the expected quality of the new policy. If the confidence interval lies above that of the current policy, we can change policies with high confidence that we will do no harm. We focus here on the mathematics of this method, by mechanizing the soundness of off-policy evaluation. A natural side effect of the mechanization is both to clarify all the result’s mathematical assumptions and preconditions, and to further develop HOL4’s library of verified statistical mathematics, including concentration inequalities. Of more significance, the OPE method relies on importance sampling, whose soundness we prove using a measure-theoretic approach. In fact, we generalize the standard result, showing it for contexts comprising both discrete and continuous probability distributions.

Subject Classification

ACM Subject Classification
  • Theory of computation → Interactive proof systems
  • Theory of computation → Logic and verification
  • Theory of computation → Sequential decision making
  • Mathematics of computing → Hypothesis testing and confidence interval computation
  • Formal Methods
  • HOL4
  • Reinforcement Learning
  • Off-Policy Evaluation
  • Concentration Inequality
  • Hoeffding


  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    PDF Downloads


  1. Carlo Acerbi and Dirk Tasche. On the coherence of expected shortfall. Journal of Banking & Finance, 26(7):1487-1503, 2002. Google Scholar
  2. B. D. Anderson and J. B. Moore. Optimal control: linear quadratic methods. Courier Corporation, 2007. Google Scholar
  3. Alexander Bagnall and Gordon Stewart. Certifying the true error: Machine learning in Coq with verified generalization guarantees. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):2662-2669, July 2019. URL:
  4. M. Bastani. Model-free intelligent diabetes management using machine learning. Master’s thesis, Department of Computing Science, University of Alberta, 2014. Google Scholar
  5. Sylvie Boldo, François Clément, Florian Faissole, Vincent Martin, and Micaela Mayero. A Coq formalization of Lebesgue integration of nonnegative functions. Journal of Automated Reasoning, November 2021. URL:
  6. Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. Universal off-policy evaluation. In Advances in Neural Information Processing Systems, 2021. Google Scholar
  7. Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems, pages 3509-3517, 2014. Google Scholar
  8. Aaron R. Coble. Anonymity, information, and machine-assisted proof. PhD thesis, Computer Lab., University of Cambridge, July 2010. URL:
  9. A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Annals of Mathematical Statistics, 27(3):642-669, 1956. URL:
  10. Ahmad El Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70-76, 2017. Google Scholar
  11. Norbert Gaffke. Nonparametric one-sided testing for the mean and related extremum problems. Mathematical Methods of Statistics, 13(4):369-391, 2004. Google Scholar
  12. Arthur Guez, Robert D. Vincent, Massimo Avoli, and Joelle Pineau. Adaptive treatment of epilepsy via batch-mode reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pages 1671-1678, 2008. Google Scholar
  13. Paul R. Halmos. Measure Theory. Springer-Verlag, New York, NY, 1950. Google Scholar
  14. W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963. Google Scholar
  15. Johannes Hölzl and Armin Heller. Three chapters of measure theory in Isabelle/HOL. In Marko van Eekelen, Herman Geuvers, Julien Schmaltz, and Freek Wiedijk, editors, Proceedings of Interactive Theorem Proving, pages 135-151. Springer, 2011. Google Scholar
  16. Joe Hurd. Formal verification of probabilistic algorithms. Technical Report UCAM-CL-TR-566, University of Cambridge, Computer Laboratory, May 2003. URL:
  17. Joe Hurd. Verification of the Miller–Rabin probabilistic primality test. The Journal of Logic and Algebraic Programming, 56(1):3-21, 2003. URL:
  18. Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652-661. PMLR, 2016. Google Scholar
  19. Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996. Google Scholar
  20. T. Kloek and H. K. van Dijk. Bayesian estimates of equation system parameters: An application of integration by Monte Carlo. Econometrica, 46(1):1-19, 1978. URL:
  21. Matthieu Komorowski, Leo A. Celi, Omar Badawi, Anthony C. Gordon, and A. Aldo Faisal. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24(11):1716-1720, 2018. Google Scholar
  22. Erik Learned-Miller and Philip S. Thomas. A new confidence interval for the mean of a bounded random variable, 2020. URL:
  23. P. Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Annals of Probability, 18(3):1269-1283, July 1990. URL:
  24. A. Maurer and M. Pontil. Empirical Bernstein bounds and sample variance penalization. In Proceedings of the Twenty-Second Annual Conference on Learning Theory, pages 115-124, 2009. Google Scholar
  25. Rozalina G McCoy, Holly K Van Houten, Jeanette Y Ziegenfuss, Nilay D Shah, Robert A Wermers, and Steven A Smith. Increased mortality of patients with diabetes reporting severe hypoglycemia. Diabetes Care, 35(9):1897-1901, 2012. Google Scholar
  26. Tarek Mhamdi, Osman Hasan, and Sofiène Tahar. On the formalization of the Lebesgue integration theory in HOL. In Matt Kaufmann and Lawrence C. Paulson, editors, Proceedings of Interactive Theorem Proving, volume 6172, pages 387-402. Springer, July 2010. URL:
  27. Tarek Mhamdi, Osman Hasan, and Sofiène Tahar. Formalization of entropy measures in HOL. In Marko van Eekelen, Herman Geuvers, Julien Schmaltz, and Freek Wiedijk, editors, Proceedings of Interactive Theorem Proving, volume 6898, pages 233-248. Springer, August 2011. URL:
  28. Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 799-806. Citeseer, 2010. Google Scholar
  29. A. Papoulis. The Fourier Integral and its Implications. McGraw-Hill Book Company, Inc., New York, NY, 1962. Google Scholar
  30. Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. arXiv preprint, 2017. URL:
  31. LA Prashanth and Mohammad Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Advances in Neural Information Processing Systems, pages 252-260, 2013. Google Scholar
  32. D. Precup, R. S. Sutton, and S. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pages 759-766, 2000. Google Scholar
  33. A. Shamilov, A. F. Yuzer, E. Agaoglu, and Y. Mert. A method of obtaining distributions of transformed random variables by using the Heaviside and the Dirac generalized functions. Journal of Statistical Research, 40(1):23-34, 2006. Google Scholar
  34. Student. The probable error of a mean. Biometrika, pages 1-25, 1908. Google Scholar
  35. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. Google Scholar
  36. Aviv Tamar, Yonatan Glassner, and Shie Mannor. Optimizing the CVaR via sampling. In AAAI, pages 2993-2999, 2015. Google Scholar
  37. P. S. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence off-policy evaluation. In Proceedings of the Twenty-Ninth Conference on Artificial Intelligence, 2015. Google Scholar
  38. Philip S. Thomas. Safe Reinforcement Learning. PhD thesis, University of Massachusetts Libraries, 2015. Google Scholar
  39. Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139-2148. PMLR, 2016. Google Scholar
  40. Philip S. Thomas, Bruno Castro da Silva, Andrew G. Barto, Stephen Giguere, Yuriy Brun, and Emma Brunskill. Preventing undesirable behavior of intelligent machines. Science, 366(6468):999-1004, 2019. Google Scholar
  41. Chun Tian. The new HOL-Probability based on [0,+Inf]-measure theory, September 2019. URL: