Safe Reinforcement Learning Using Probabilistic Shields (Invited Paper)

Authors: Nils Jansen, Bettina Könighofer, Sebastian Junges, Alex Serban, Roderick Bloem




File
  • LIPIcs.CONCUR.2020.3.pdf
  • Filesize: 3.28 MB
  • 16 pages

Document Identifiers
  • DOI: 10.4230/LIPIcs.CONCUR.2020.3

Author Details

Nils Jansen
  • Radboud University, Nijmegen, The Netherlands
Bettina Könighofer
  • Graz University of Technology, Austria
Sebastian Junges
  • University of California, Berkeley, CA, USA
Alex Serban
  • Radboud University, Nijmegen, The Netherlands
Roderick Bloem
  • Graz University of Technology, Austria

Cite As

Nils Jansen, Bettina Könighofer, Sebastian Junges, Alex Serban, and Roderick Bloem. Safe Reinforcement Learning Using Probabilistic Shields (Invited Paper). In 31st International Conference on Concurrency Theory (CONCUR 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 171, pp. 3:1-3:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020) https://doi.org/10.4230/LIPIcs.CONCUR.2020.3

Abstract

This paper concerns the efficient construction of a safety shield for reinforcement learning. We specifically target scenarios that incorporate uncertainty and use Markov decision processes (MDPs) as the underlying model to capture such problems. Reinforcement learning (RL) is a machine learning technique that can determine near-optimal policies in MDPs that may be unknown before exploring the model. However, during exploration, RL is prone to inducing behavior that is undesirable or not allowed in safety- or mission-critical contexts. We introduce the concept of a probabilistic shield that enables RL decision-making to adhere to safety constraints with high probability. We employ formal verification to efficiently compute the probabilities of critical decisions within a safety-relevant fragment of the MDP. These results help to realize a shield that, when applied to an RL algorithm, prevents the agent from taking unsafe actions while still optimizing the performance objective. We discuss tradeoffs between making sufficient progress in exploring the environment and ensuring safety. In our experiments on the arcade game PAC-MAN and on a case study involving service robots, we demonstrate that shielding also improves learning efficiency: the shielded agent needs orders of magnitude fewer episodes.
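
To make the mechanism concrete, the sketch below (a simplified illustration, not the paper's implementation) shows how such a shield can filter an agent's actions. It assumes a dictionary safety_prob mapping each state-action pair to a precomputed probability of remaining safe, for instance the probability of avoiding a critical state within a bounded horizon as a probabilistic model checker would report it; the function name allowed_actions and the threshold parameter delta are hypothetical.

    # Hypothetical sketch of a probabilistic (relative) shield for RL.
    # safety_prob[(state, action)] is assumed to hold the precomputed
    # probability of staying safe when taking `action` in `state`,
    # e.g. obtained by model checking a safety-relevant fragment of the MDP.

    def allowed_actions(safety_prob, state, actions, delta=0.9):
        """Keep actions whose safety probability is within a factor
        `delta` of the best achievable safety probability in `state`."""
        best = max(safety_prob[(state, a)] for a in actions)
        return [a for a in actions if safety_prob[(state, a)] >= delta * best]

    # Toy example with two actions in state "s0": "left" is clearly less safe,
    # so only "right" survives the shield.
    safety_prob = {("s0", "left"): 0.35, ("s0", "right"): 0.95}
    print(allowed_actions(safety_prob, "s0", ["left", "right"], delta=0.8))
    # prints: ['right']

A natural integration point is the action-selection step of the learner: instead of choosing among all available actions, the agent explores and exploits only within the shield-approved subset, which keeps exploration safe while leaving the performance objective to the RL algorithm.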

Subject Classification

ACM Subject Classification
  • Computing methodologies → Reinforcement learning
  • Theory of computation → Verification by model checking
  • Theory of computation → Reinforcement learning
  • Computing methodologies → Markov decision processes
Keywords
  • Safe Reinforcement Learning
  • Formal Verification
  • Safe Exploration
  • Model Checking
  • Markov Decision Process

