# Mechanizing Soundness of Off-Policy Evaluation

## File

LIPIcs.ITP.2022.32.pdf
• Filesize: 0.75 MB
• 20 pages

## Cite As

Jared Yeager, J. Eliot B. Moss, Michael Norrish, and Philip S. Thomas. Mechanizing Soundness of Off-Policy Evaluation. In 13th International Conference on Interactive Theorem Proving (ITP 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 237, pp. 32:1-32:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.ITP.2022.32

## Abstract

There are reinforcement learning scenarios - e.g., in medicine - where we are compelled to be as confident as possible that a policy change will result in an improvement before implementing it. In such scenarios, we can employ off-policy evaluation (OPE). The basic idea of OPE is to record histories of behaviors under the current policy, and then use those histories to estimate the quality of a proposed new policy, by weighting them according to what the behavior would have been under the new policy. Because we evaluate the new policy without ever deploying it, the evaluation is "off-policy". Applying a concentration inequality to the estimate, we derive a confidence interval for the expected quality of the new policy. If that confidence interval lies above the quality of the current policy, we can change policies with high confidence that we will do no harm. We focus here on the mathematics of this method, by mechanizing the soundness of off-policy evaluation. A natural side effect of the mechanization is both to clarify all the result's mathematical assumptions and preconditions, and to further develop HOL4's library of verified statistical mathematics, including concentration inequalities. Of more significance, the OPE method relies on importance sampling, whose soundness we prove using a measure-theoretic approach. In fact, we generalize the standard result, showing it for contexts comprising both discrete and continuous probability distributions.
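The estimate-then-bound pipeline the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's mechanized HOL4 development: the function names, the dictionary-based history format, and the restriction to importance-weighted returns in [0, b] with a one-sided Hoeffding bound are assumptions made for this sketch.

```python
import math

def importance_weighted_return(history, pi_new, pi_behavior):
    """Reweight one recorded history's return by the product of the
    per-step probability ratios pi_new(s, a) / pi_behavior(s, a).
    This is the standard (ordinary) importance-sampling estimator."""
    weight = 1.0
    for s, a in history["steps"]:
        weight *= pi_new(s, a) / pi_behavior(s, a)
    return weight * history["return"]

def hoeffding_lower_bound(samples, b, delta):
    """One-sided Hoeffding lower bound on the mean of i.i.d. samples
    in [0, b], holding with probability at least 1 - delta."""
    n = len(samples)
    mean = sum(samples) / n
    return mean - b * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
```

Given recorded histories, one would map `importance_weighted_return` over them, feed the results to `hoeffding_lower_bound` with an upper bound `b` on the weighted returns, and switch policies only if the resulting lower confidence bound exceeds the current policy's estimated quality.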

## Subject Classification

##### ACM Subject Classification
• Theory of computation → Interactive proof systems
• Theory of computation → Logic and verification
• Theory of computation → Sequential decision making
• Mathematics of computing → Hypothesis testing and confidence interval computation
##### Keywords
• Formal Methods
• HOL4
• Reinforcement Learning
• Off-Policy Evaluation
• Concentration Inequality
• Hoeffding

