Monte Carlo Tree Search Guided by Symbolic Advice for MDPs

Busatto-Gaston, Damien; Chakraborty, Debraj; Raskin, Jean-Francois

doi:10.4230/LIPIcs.CONCUR.2020.40

Abstract

In this paper, we consider the online computation of a strategy that aims at optimizing the expected average reward in a Markov decision process. The strategy is computed with a receding horizon and using Monte Carlo tree search (MCTS). We augment the MCTS algorithm with the notion of symbolic advice, and show that its classical theoretical guarantees are maintained. Symbolic advice are used to bias the selection and simulation strategies of MCTS. We describe how to use QBF and SAT solvers to implement symbolic advice in an efficient way. We illustrate our new algorithm using the popular game Pac-Man and show that the performances of our algorithm exceed those of plain MCTS as well as the performances of human players.

Cite As Get BibTex

Damien Busatto-Gaston, Debraj Chakraborty, and Jean-Francois Raskin. Monte Carlo Tree Search Guided by Symbolic Advice for MDPs. In 31st International Conference on Concurrency Theory (CONCUR 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 171, pp. 40:1-40:24, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020) https://doi.org/10.4230/LIPIcs.CONCUR.2020.40

Author Details

Damien Busatto-Gaston

Université Libre de Bruxelles, Brussels, Belgium

Debraj Chakraborty

Université Libre de Bruxelles, Brussels, Belgium

Jean-Francois Raskin

Université Libre de Bruxelles, Brussels, Belgium

Funding

This work is partially supported by the ARC project Non-Zero Sum Game Graphs: Applications to Reactive Synthesis and Beyond (Fédération Wallonie-Bruxelles), the EOS project Verifying Learning Artificial Intelligence Systems (F.R.S.-FNRS & FWO), and the COST Action 16228 GAMENET (European Cooperation in Science and Technology).

Acknowledgements

Computational resources have been provided by the Consortium des Équipements de Calcul Intensif (CÉCI), funded by the Fonds de la Recherche Scientifique de Belgique (F.R.S.-FNRS) under Grant No. 2.5020.11.

Supplementary Materials

http://di.ulb.ac.be/verif/debraj/pacman/

References

Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. Safe reinforcement learning via shielding. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, (AAAI 2018), pages 2669-2678. AAAI Press, 2018.
Pranav Ashok, Tomás Brázdil, Jan Kretínský, and Ondrej Slámecka. Monte Carlo tree search for verifying reachability in Markov decision processes. In Proceedings of the 8th International Symposium on Leveraging Applications of Formal Methods, Verification and Validation (ISoLA 2018), volume 11245 of Lecture Notes in Computer Science, pages 322-335. Springer, 2018. URL: https://doi.org/10.1007/978-3-030-03421-4_21.
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002. URL: https://doi.org/10.1023/A:1013689704352.
Hendrik Baier and Mark H. M. Winands. MCTS-minimax hybrids. IEEE Transactions on Computational Intelligence and AI in Games, 7(2):167-179, 2015. URL: https://doi.org/10.1109/TCIAIG.2014.2366555.
Armin Biere, Keijo Heljanko, Tommi Junttila, Timo Latvala, and Viktor Schuppan. Linear encodings of bounded LTL model checking. Logical Methods in Computer Science, Volume 2, Issue 5, November 2006. URL: https://doi.org/10.2168/LMCS-2(5:5)2006.
Tomás Brázdil, Krishnendu Chatterjee, Martin Chmelik, Vojtech Forejt, Jan Kretínský, Marta Z. Kwiatkowska, David Parker, and Mateusz Ujma. Verification of Markov decision processes using learning algorithms. In Proceedings of the 12th International Symposium on Automated Technology for Verification and Analysis (ATVA 2014), volume 8837 of Lecture Notes in Computer Science, pages 98-114. Springer, 2014. URL: https://doi.org/10.1007/978-3-319-11936-6_8.
Cameron Browne, Edward Jack Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez Liebana, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1-43, 2012. URL: https://doi.org/10.1109/TCIAIG.2012.2186810.
Supratik Chakraborty, Daniel J. Fremont, Kuldeep S. Meel, Sanjit A. Seshia, and Moshe Y. Vardi. Distribution-aware sampling and weighted model counting for SAT. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, 2014 (AAAI 2014), pages 1722-1730. AAAI Press, 2014. URL: http://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/view/8364.
Supratik Chakraborty, Daniel J. Fremont, Kuldeep S. Meel, Sanjit A. Seshia, and Moshe Y. Vardi. On parallel scalable uniform SAT witness generation. In Proceedings of the 21st International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2015), Held as Part of the European Joint Conferences on Theory and Practice of Software (ETAPS 2015), volume 9035 of Lecture Notes in Computer Science, pages 304-319. Springer, 2015. URL: https://doi.org/10.1007/978-3-662-46681-0_25.
Guillaume M. J. B. Chaslot, Mark H. M. Winands, and H. Jaap van den Herik. Parallel Monte-Carlo tree search. In H. Jaap van den Herik, Xinhe Xu, Zongmin Ma, and Mark H. M. Winands, editors, Proceedings of the 6th International Conference on Computers and Games (CG 2008), volume 5131 of Lecture Notes in Computer Science, pages 60-71. Springer, 2008. URL: https://doi.org/10.1007/978-3-540-87608-3_6.
Krishnendu Chatterjee, Petr Novotný, Guillermo A. Pérez, Jean-François Raskin, and Dorde Zikelic. Optimizing expectation with guarantees in POMDPs. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI 2017), pages 3725-3732. AAAI Press, 2017.
Przemyslaw Daca, Thomas A. Henzinger, Jan Kretínský, and Tatjana Petrov. Faster statistical model checking for unbounded temporal properties. In Proceedings of the 22nd International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2016), Held as Part of the European Joint Conferences on Theory and Practice of Software (ETAPS 2016), volume 9636 of Lecture Notes in Computer Science, pages 112-129. Springer, 2016. URL: https://doi.org/10.1007/978-3-662-49674-9_7.
John DeNero and Dan Klein. CS 188 : Introduction to artificial intelligence. URL: https://inst.eecs.berkeley.edu/~cs188.
Mohammadhosein Hasanbeig, Alessandro Abate, and Daniel Kroening. Cautious reinforcement learning with logical constraints. In Amal El Fallah Seghrouchni, Gita Sukthankar, Bo An, and Neil Yorke-Smith, editors, Proceedings of the 19th International Conference on Autonomous Agents and Multiagent Systems, (AAMAS 2020), pages 483-491. International Foundation for Autonomous Agents and Multiagent Systems, 2020. URL: https://dl.acm.org/doi/abs/10.5555/3398761.3398821.
Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963. URL: https://doi.org/10.1080/01621459.1963.10500830.
Jing Huang, Zhiqing Liu, Benjie Lu, and Feng Xiao. Pruning in UCT algorithm. In Proceedings of the International Conference on Technologies and Applications of Artificial Intelligence (TAAI 2010), pages 177-181, 2010. URL: https://doi.org/10.1109/TAAI.2010.38.
Michael J. Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49(2-3):193-208, 2002. URL: https://doi.org/10.1023/A:1017932429737.
Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning (ECML 2006), volume 4212 of Lecture Notes in Computer Science, pages 282-293. Springer, 2006. URL: https://doi.org/10.1007/11871842_29.
Jan Kretínský, Guillermo A. Pérez, and Jean-François Raskin. Learning-based mean-payoff optimization in an unknown MDP under omega-regular constraints. In Proceedings of the 29th International Conference on Concurrency Theory (CONCUR 2018), volume 118 of Leibniz International Proceedings in Informatics, pages 8:1-8:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018.
Christof Löding. Infinite games and automata theory. In Krzysztof R. Apt and Erich Grädel, editors, Lectures in Game Theory for Computer Scientists, pages 38-73. Cambridge University Press, 2011.
João Marques-Silva and Sharad Malik. Propositional SAT solving. In Edmund M. Clarke, Thomas A. Henzinger, Helmut Veith, and Roderick Bloem, editors, Handbook of Model Checking, pages 247-275. Springer, 2018.
Nina Narodytska, Alexander Legg, Fahiem Bacchus, Leonid Ryzhyk, and Adam Walker. Solving games without controllable predecessor. In Proceedings of the 26th International Conference on Computer Aided Verification (CAV 2014), volume 8559 of Lecture Notes in Computer Science, pages 533-540. Springer, 2014. URL: https://doi.org/10.1007/978-3-319-08867-9_35.
Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 1994. URL: https://doi.org/10.1002/9780470316887.
Ankit Shukla, Armin Biere, Luca Pulina, and Martina Seidl. A survey on applications of quantified Boolean formulas. In Proceedings of the 31st IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2019), pages 78-84. IEEE, 2019. URL: https://doi.org/10.1109/ICTAI.2019.00020.
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016. URL: https://doi.org/10.1038/nature16961.
David Silver and Gerald Tesauro. Monte-Carlo simulation balancing. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 945-952, 2009.

Monte Carlo Tree Search Guided by Symbolic Advice for MDPs

Authors Damien Busatto-Gaston , Debraj Chakraborty , Jean-Francois Raskin

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Acknowledgements

References

Thanks for your feedback!

Could not send message