Safe Reinforcement Learning Using Probabilistic Shields (Invited Paper)

Jansen, Nils; Könighofer, Bettina; Junges, Sebastian; Serban, Alex; Bloem, Roderick

doi:10.4230/LIPIcs.CONCUR.2020.3

Abstract

This paper concerns the efficient construction of a safety shield for reinforcement learning. We specifically target scenarios that incorporate uncertainty and use Markov decision processes (MDPs) as the underlying model to capture such problems. Reinforcement learning (RL) is a machine learning technique that can determine near-optimal policies in MDPs that may be unknown before exploring the model. However, during exploration, RL is prone to induce behavior that is undesirable or not allowed in safety- or mission-critical contexts. We introduce the concept of a probabilistic shield that enables RL decision-making to adhere to safety constraints with high probability. We employ formal verification to efficiently compute the probabilities of critical decisions within a safety-relevant fragment of the MDP. These results help to realize a shield that, when applied to an RL algorithm, restricts the agent from taking unsafe actions, while optimizing the performance objective. We discuss tradeoffs between sufficient progress in the exploration of the environment and ensuring safety. In our experiments, we demonstrate on the arcade game PAC-MAN and on a case study involving service robots that the learning efficiency increases as the learning needs orders of magnitude fewer episodes.

Pieter Abbeel and Andrew Y. Ng. Exploration and apprenticeship learning in reinforcement learning. In ICML, volume 119 of ACM International Conference Proceeding Series, pages 1-8. ACM, 2005.
Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. Reachability-based safe learning with gaussian processes. In CDC, pages 1424-1431. IEEE, 2014.
Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. Safe reinforcement learning via shielding. In AAAI, pages 2669-2678. AAAI Press, 2018.
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. CoRR, abs/1606.06565, 2016. URL: http://arxiv.org/abs/1606.06565.
Pranav Ashok, Mathias Jackermeier, Pushpak Jagtap, Jan Kretínský, Maximilian Weininger, and Majid Zamani. dtcontrol: decision tree learning algorithms for controller representation. In HSCC, pages 30:1-30:2. ACM, 2020.
Pranav Ashok, Jan Kretínský, Kim Guldstrand Larsen, Adrien Le Coënt, Jakob Haahr Taankvist, and Maximilian Weininger. SOS: safe, optimal and small strategies for hybrid markov decision processes. In QEST, volume 11785 of Lecture Notes in Computer Science, pages 147-164. Springer, 2019.
Christel Baier and J.P. Katoen. Principles of Model Checking. MIT Press, 2008.
Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. Verifiable reinforcement learning via policy extraction. In NeurIPS, pages 2499-2509, 2018.
UC Berkeley. Intro to AI - Reinforcement Learning , 2018. URL: http://ai.berkeley.edu/reinforcement.html.
Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In NIPS, pages 908-919, 2017.
Suda Bharadwaj, Roderick Bloem, Rayna Dimitrova, Bettina Könighofer, and Ufuk Topcu. Synthesis of minimum-cost shields for multi-agent systems. In ACC, pages 1048-1055. IEEE, 2019.
Arthur Bit-Monnot, Francesco Leofante, Luca Pulina, Erika Ábrahám, and Armando Tacchella. Smartplan: a task planner for smart factories. CoRR, abs/1806.07135, 2018. URL: http://arxiv.org/abs/1806.07135.
Roderick Bloem, Krishnendu Chatterjee, and Barbara Jobstmann. Graph games and reactive synthesis. In Handbook of Model Checking, pages 921-962. Springer, 2018.
Roderick Bloem, Peter Gjøl Jensen, Bettina Könighofer, Kim Guldstrand Larsen, Florian Lorber, and Alexander Palmisano. It’s time to play safe: Shield synthesis for timed systems. CoRR, abs/2006.16688, 2020. URL: http://arxiv.org/abs/2006.16688.
Roderick Bloem, Bettina Könighofer, Robert Könighofer, and Chao Wang. Shield synthesis: - runtime enforcement for reactive systems. In TACAS, volume 9035 of LNCS, pages 533-548. Springer, 2015.
Maxime Bouton, Jesper Karlsson, Alireza Nakhaei, Kikuo Fujimura, Mykel J. Kochenderfer, and Jana Tumova. Reinforcement learning with probabilistic guarantees for autonomous driving. CoRR, abs/1904.07189, 2019. URL: http://arxiv.org/abs/1904.07189.
Tomáš Brázdil, Krishnendu Chatterjee, Martin Chmelík, Vojtěch Forejt, Jan Křetínský, Marta Z. Kwiatkowska, David Parker, and Mateusz Ujma. Verification of Markov decision processes using learning algorithms. In ATVA, 2014.
Steven Carr, Nils Jansen, and Ufuk Topcu. Verifiable rnn-based policies for pomdps under temporal logic constraints. CoRR, abs/2002.05615, 2020. URL: http://arxiv.org/abs/2002.05615.
Steven Carr, Nils Jansen, Ralf Wimmer, Alexandru Constantin Serban, Bernd Becker, and Ufuk Topcu. Counterexample-guided strategy improvement for pomdps using recurrent neural networks. In IJCAI, pages 5532-5539. ijcai.org, 2019.
Jia Chen, Jiayi Wei, Yu Feng, Osbert Bastani, and Isil Dillig. Relational verification using reinforcement learning. Proc. ACM Program. Lang., 3(OOPSLA):141:1-141:30, 2019.
Richard Cheng, Gábor Orosz, Richard M Murray, and Joel W Burdick. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. AAAI, 2019.
Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning. In NIPS, pages 8103-8112, 2018.
Edmund M. Clarke, Orna Grumberg, and Doron A. Peled. Model checking. MIT Press, 2001.
Jeffery A. Clouse and Paul E. Utgoff. A teaching method for reinforcement learning. In ML, pages 92-110. Morgan Kaufmann, 1992.
Alexandre David, Peter Gjøl Jensen, Kim Guldstrand Larsen, Marius Mikucionis, and Jakob Haahr Taankvist. Uppaal stratego. In TACAS, volume 9035 of LNCS, pages 206-211. Springer, 2015.
Peter Dayan and Yael Niv. Reinforcement learning: the good, the bad and the ugly. Current opinion in neurobiology, 18(2):185-196, 2008.
Christian Dehnert, Sebastian Junges, Joost-Pieter Katoen, and Matthias Volk. A storm is coming: A modern probabilistic model checker. In CAV (2), volume 10427 of LNCS, pages 592-600. Springer, 2017.
Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of Artificial Intelligence Research, 13:227-303, 2000.
Klaus Dräger, Vojtech Forejt, Marta Z. Kwiatkowska, David Parker, and Mateusz Ujma. Permissive controller synthesis for probabilistic systems. Log. Methods Comput. Sci., 11(2), 2015.
Richard G Freedman and Shlomo Zilberstein. Safety in AI-HRI: Challenges complementing user experience quality. In AAAI Fall Symposium Series, 2016.
Jie Fu and Ufuk Topcu. Probably approximately correct mdp learning and control with temporal logic constraints. In RSS, 2014.
Nathan Fulton and André Platzer. Verifiably Safe Off-Model Reinforcement Learning, volume 11427 of Lecture Notes in Computer Science, pages 413-430. Springer, 2019.
Javier Garcıa and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437-1480, 2015.
Javier García and Fernando Fernández. Probabilistic policy reuse for safe reinforcement learning. ACM Transactions on Autonomous and Adaptive Systems (TAAS), 13(3):14, 2019.
OpenAI Gym. Taxi-v2, 2018. URL: https://gym.openai.com/envs/Taxi-v2/.
Ernst Moritz Hahn, Mateo Perez, Sven Schewe, Fabio Somenzi, Ashutosh Trivedi, and Dominik Wojtczak. Omega-regular objectives in model-free reinforcement learning. In TACAS (1), volume 11427 of LNCS, pages 395-412. Springer, 2019.
Mohammadhosein Hasanbeig, Alessandro Abate, and Daniel Kroening. Logically-correct reinforcement learning. CoRR, abs/1801.08099, 2018. URL: http://arxiv.org/abs/1801.08099.
Mohammadhosein Hasanbeig, Yiannis Kantaros, Alessandro Abate, Daniel Kroening, George J. Pappas, and Insup Lee. Reinforcement learning for temporal logic control synthesis with probabilistic satisfaction guarantees. In CDC, pages 5338-5343. IEEE, 2019.
Matthew Hausknecht and Peter Stone. Deep recurrent q-learning for partially observable mdps. CoRR, abs/1507.06527, 2015. URL: http://arxiv.org/abs/1507.06527.
Manfred Jaeger, Peter Gjøl Jensen, Kim Guldstrand Larsen, Axel Legay, Sean Sedwards, and Jakob Haahr Taankvist. Teaching stratego to play ball: Optimal synthesis for continuous space mdps. In ATVA, volume 11781 of Lecture Notes in Computer Science, pages 81-97. Springer, 2019.
Sebastian Junges, Nils Jansen, and Sanjit A. Seshia. Enforcing almost-sure reachability in pomdps. CoRR, abs/2007.00085, 2020. URL: http://arxiv.org/abs/2007.00085.
Sebastian Junges, Nils Jansen, Christian Dehnert, Ufuk Topcu, and Joost-Pieter Katoen. Safety-constrained reinforcement learning for MDPs. In TACAS, volume 9636 of LNCS, pages 130-146. Springer, 2016.
Joost-Pieter Katoen. The probabilistic model checking landscape. In LICS, pages 31-45. ACM, 2016.
Bettina Könighofer, Mohammed Alshiekh, Roderick Bloem, Laura R. Humphrey, Robert Könighofer, Ufuk Topcu, and Chao Wang. Shield synthesis. Formal Methods Syst. Des., 51(2):332-361, 2017.
Jan Kretínský, Guillermo A. Pérez, and Jean-François Raskin. Learning-based mean-payoff optimization in an unknown MDP under omega-regular constraints. In CONCUR, volume 118 of LIPIcs, pages 8:1-8:18. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2018.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097-1105, 2012.
Marta Z. Kwiatkowska. Model checking for probability and time: from theory to practice. In LICS, page 351. IEEE CS, 2003.
Marta Z. Kwiatkowska, Gethin Norman, and David Parker. PRISM 4.0: Verification of probabilistic real-time systems. In CAV, volume 6806 of LNCS, pages 585-591. Springer, 2011.
Gil Lederman, Markus N. Rabe, Sanjit Seshia, and Edward A. Lee. Learning heuristics for quantified boolean formulas through reinforcement learning. In ICLR. OpenReview.net, 2020.
Sergey Levine and Vladlen Koltun. Guided policy search. In ICML (3), volume 28 of JMLR Workshop and Conference Proceedings, pages 1-9. JMLR.org, 2013.
George Mason, Radu Calinescu, Daniel Kudenko, and Alec Banks. Assured reinforcement learning with formally verified abstract policies. In ICAART (2), pages 105-117. SciTePress, 2017.
Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in markov decision processes. In ICML. icml.cc / Omnipress, 2012.
M. Ohnishi, L. Wang, G. Notomista, and M. Egerstedt. Barrier-certified adaptive reinforcement learning with applications to brushbot navigation. IEEE Transactions on Robotics, pages 1-20, 2019.
Martin Pecka and Tomas Svoboda. Safe exploration techniques for reinforcement learning-an overview. In MESAS, pages 357-375. Springer, 2014.
Amir Pnueli. The temporal logic of programs. In Foundations of Computer Science, pages 46-57. IEEE, 1977.
Diederik M. Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making. J. Artif. Intell. Res., 48:67-113, 2013.
Dorsa Sadigh, Eric S Kim, Samuel Coogan, S Shankar Sastry, and Sanjit A Seshia. A learning based approach to control synthesis of markov decision processes for linear temporal logic specifications. In CDC, pages 1091-1096. IEEE, 2014.
Dorsa Sadigh, Nick Landolfi, Shankar S Sastry, Sanjit A Seshia, and Anca D Dragan. Planning for cars that coordinate with people: leveraging effects on human actions for planning and active information gathering over human internal state. Autonomous Robots, 42(7):1405-1426, 2018.
Dorsa Sadigh, Shankar Sastry, Sanjit A Seshia, and Anca D Dragan. Planning for autonomous cars that leverage effects on human actions. In Robotics: Science and Systems, 2016.
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nat., 529(7587):484-489, 2016.
Ion Stoica, Dawn Song, Raluca Ada Popa, David Patterson, Michael W Mahoney, Randy Katz, Anthony D Joseph, Michael Jordan, Joseph M Hellerstein, Joseph E Gonzalez, et al. A berkeley view of systems challenges for AI. CoRR, abs/1712.05855, 2017. URL: http://arxiv.org/abs/1712.05855.
Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
Florent Teichteil-Königsbuch. Stochastic safest and shortest path problems. In AAAI. AAAI Press, 2012.
Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration for interactive machine learning. In NeurIPS, pages 2887-2897, 2019.
Abhinav Verma, Hoang Minh Le, Yisong Yue, and Swarat Chaudhuri. Imitation-projected programmatic reinforcement learning. In NeurIPS, pages 15726-15737, 2019.
Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically interpretable reinforcement learning. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 5052-5061. PMLR, 2018.
Angelina Wang, Thanard Kurutach, Kara Liu, Pieter Abbeel, and Aviv Tamar. Learning robotic manipulation through visual planning and acting. arXiv preprint, 2019. URL: http://arxiv.org/abs/1905.04411.
Min Wen, Rüdiger Ehlers, and Ufuk Topcu. Correct-by-synthesis reinforcement learning with temporal logic constraints. In IROS, 2015.
Douglas J White. Real applications of Markov decision processes. Interfaces, 15(6):73-83, 1985.
Ian H Witten, Eibe Frank, Mark A Hall, and Christopher J Pal. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.
Meng Wu, Jingbo Wang, Jyotirmoy Deshmukh, and Chao Wang. Shield synthesis for real: Enforcing safety in cyber-physical systems. In FMCAD, pages 129-137. IEEE, 2019.
Weichao Zhou and Wenchao Li. Safety-aware apprenticeship learning. In CAV (1), volume 10981 of Lecture Notes in Computer Science, pages 662-680. Springer, 2018.
He Zhu, Zikang Xiong, Stephen Magill, and Suresh Jagannathan. An inductive synthesis framework for verifiable reinforcement learning. In PLDI, pages 686-701. ACM, 2019.

Safe Reinforcement Learning Using Probabilistic Shields (Invited Paper)

Authors Nils Jansen, Bettina Könighofer, Sebastian Junges, Alex Serban, Roderick Bloem

File

Document Identifiers

Author Details

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Thanks for your feedback!

Could not send message

Safe Reinforcement Learning Using Probabilistic Shields (Invited Paper)

Authors Nils Jansen, Bettina Könighofer, Sebastian Junges, Alex Serban, Roderick Bloem

File

Document Identifiers

Author Details

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

Supplementary Materials

References

Thanks for your feedback!

Could not send message