Learning-Based Mean-Payoff Optimization in an Unknown MDP under Omega-Regular Constraints

Kretínský, Jan; Pérez, Guillermo A.; Raskin, Jean-François

doi:10.4230/LIPIcs.CONCUR.2018.8

Abstract

We formalize the problem of maximizing the mean-payoff value with high probability while satisfying a parity objective in a Markov decision process (MDP) with unknown probabilistic transition function and unknown reward function. Assuming the support of the unknown transition function and a lower bound on the minimal transition probability are known in advance, we show that in MDPs consisting of a single end component, two combinations of guarantees on the parity and mean-payoff objectives can be achieved depending on how much memory one is willing to use. (i) For all epsilon and gamma we can construct an online-learning finite-memory strategy that almost-surely satisfies the parity objective and which achieves an epsilon-optimal mean payoff with probability at least 1 - gamma. (ii) Alternatively, for all epsilon and gamma there exists an online-learning infinite-memory strategy that satisfies the parity objective surely and which achieves an epsilon-optimal mean payoff with probability at least 1 - gamma. We extend the above results to MDPs consisting of more than one end component in a natural way. Finally, we show that the aforementioned guarantees are tight, i.e. there are MDPs for which stronger combinations of the guarantees cannot be ensured.

Shaull Almagor, Orna Kupferman, and Yaron Velner. Minimizing expected cost under hard boolean constraints, with applications to quantitative synthesis. In Josée Desharnais and Radha Jagadeesan, editors, 27th International Conference on Concurrency Theory, CONCUR 2016, August 23-26, 2016, Québec City, Canada, volume 59 of LIPIcs, pages 9:1-9:15. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2016. URL: http://dx.doi.org/10.4230/LIPIcs.CONCUR.2016.9.
Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. Safe reinforcement learning via shielding. CoRR, abs/1708.08611, 2017. URL: http://arxiv.org/abs/1708.08611.
Benjamin Aminof and Sasha Rubin. First-cycle games. Information and Compution, 254:195-216, 2017. URL: http://dx.doi.org/10.1016/j.ic.2016.10.008.
Krzysztof R. Apt and Erich Grädel. Lectures in game theory for computer scientists. Cambridge University Press, 2011.
Christel Baier and Joost-Pieter Katoen. Principles of model checking. MIT Press, 2008.
Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 449-458. PMLR, 2017. URL: http://proceedings.mlr.press/v70/bellemare17a.html.
Julien Bernet, David Janin, and Igor Walukiewicz. Permissive strategies: from parity games to safety games. ITA, 36(3):261-275, 2002. URL: http://dx.doi.org/10.1051/ita:2002013.
Raphaël Berthon, Mickael Randour, and Jean-François Raskin. Threshold constraints with guarantees for parity objectives in Markov decision processes. In Ioannis Chatzigiannakis, Piotr Indyk, Fabian Kuhn, and Anca Muscholl, editors, 44th International Colloquium on Automata, Languages, and Programming, ICALP 2017, July 10-14, 2017, Warsaw, Poland, volume 80 of LIPIcs, pages 121:1-121:15. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2017. URL: http://dx.doi.org/10.4230/LIPIcs.ICALP.2017.121.
Tomás Brázdil, Krishnendu Chatterjee, Martin Chmelik, Vojtech Forejt, Jan Kretínský, Marta Z. Kwiatkowska, David Parker, and Mateusz Ujma. Verification of markov decision processes using learning algorithms. In Automated Technology for Verification and Analysis - 12th International Symposium, ATVA 2014, Sydney, NSW, Australia, November 3-7, 2014, Proceedings, volume 8837 of Lecture Notes in Computer Science, pages 98-114. Springer, 2014.
Tomás Brázdil, Antonín Kucera, and Petr Novotný. Optimizing the expected mean payoff in energy Markov decision processes. In Automated Technology for Verification and Analysis - 14th International Symposium, ATVA 2016, Chiba, Japan, October 17-20, 2016, Proceedings, volume 9938 of Lecture Notes in Computer Science, pages 32-49, 2016. URL: http://dx.doi.org/10.1007/978-3-319-46520-3.
Véronique Bruyère, Emmanuel Filiot, Mickael Randour, and Jean-François Raskin. Meet your expectations with guarantees: Beyond worst-case synthesis in quantitative games. In Ernst W. Mayr and Natacha Portier, editors, 31st International Symposium on Theoretical Aspects of Computer Science (STACS 2014), STACS 2014, March 5-8, 2014, Lyon, France, volume 25 of LIPIcs, pages 199-213. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2014. URL: http://dx.doi.org/10.4230/LIPIcs.STACS.2014.199.
Cristian S. Calude, Sanjay Jain, Bakhadyr Khoussainov, Wei Li, and Frank Stephan. Deciding parity games in quasipolynomial time. In Hamed Hatami, Pierre McKenzie, and Valerie King, editors, Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017, pages 252-263. ACM, 2017. URL: http://dx.doi.org/10.1145/3055399.3055409.
Krishnendu Chatterjee. Concurrent games with tail objectives. Theoretical Computer Science, 388(1-3):181-198, 2007. URL: http://dx.doi.org/10.1016/j.tcs.2007.07.047.
Krishnendu Chatterjee. Robustness of structurally equivalent concurrent parity games. In Lars Birkedal, editor, Foundations of Software Science and Computational Structures - 15th International Conference, FOSSACS 2012, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2012, Tallinn, Estonia, March 24 - April 1, 2012. Proceedings, volume 7213 of Lecture Notes in Computer Science, pages 270-285. Springer, 2012. URL: http://dx.doi.org/10.1007/978-3-642-28729-9_18.
Krishnendu Chatterjee and Laurent Doyen. Energy and mean-payoff parity markov decision processes. In Mathematical Foundations of Computer Science 2011 - 36th International Symposium, MFCS 2011, Warsaw, Poland, August 22-26, 2011. Proceedings, volume 6907 of Lecture Notes in Computer Science, pages 206-218. Springer, 2011. URL: http://dx.doi.org/10.1007/978-3-642-22993-0.
Krishnendu Chatterjee, Petr Novotný, Guillermo A. Pérez, Jean-François Raskin, and Dorde Zikelic. Optimizing expectation with guarantees in POMDPs. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., pages 3725-3732. AAAI Press, 2017. URL: http://www.aaai.org/Library/AAAI/aaai17contents.php.
Lorenzo Clemente and Jean-François Raskin. Multidimensional beyond worst-case and almost-sure problems for mean-payoff objectives. In 30th Annual ACM/IEEE Symposium on Logic in Computer Science, LICS 2015, Kyoto, Japan, July 6-10, 2015, pages 257-268. IEEE Computer Society, 2015. URL: http://dx.doi.org/10.1109/LICS.2015.33.
Przemyslaw Daca, Thomas A. Henzinger, Jan Kretínský, and Tatjana Petrov. Faster statistical model checking for unbounded temporal properties. In Tools and Algorithms for the Construction and Analysis of Systems - 22nd International Conference, TACAS 2016, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2016, Eindhoven, The Netherlands, April 2-8, 2016, Proceedings, volume 9636 of Lecture Notes in Computer Science, pages 112-129. Springer, 2016. URL: http://dx.doi.org/10.1007/978-3-662-49674-9.
Alexandre David, Peter Gjøl Jensen, Kim Guldstrand Larsen, Axel Legay, Didier Lime, Mathias Grund Sørensen, and Jakob Haahr Taankvist. On time with minimal expected cost! In Automated Technology for Verification and Analysis - 12th International Symposium, ATVA 2014, Sydney, NSW, Australia, November 3-7, 2014, Proceedings, volume 8837 of Lecture Notes in Computer Science, pages 129-145. Springer, 2014.
Hugo Gimbert. Pure stationary optimal strategies in Markov decision processes. In Wolfgang Thomas and Pascal Weil, editors, STACS 2007, 24th Annual Symposium on Theoretical Aspects of Computer Science, Aachen, Germany, February 22-24, 2007, Proceedings, volume 4393 of Lecture Notes in Computer Science, pages 200-211. Springer, 2007. URL: http://dx.doi.org/10.1007/978-3-540-70918-3_18.
Sebastian Junges, Nils Jansen, Christian Dehnert, Ufuk Topcu, and Joost-Pieter Katoen. Safety-constrained reinforcement learning for MDPs. In Marsha Chechik and Jean-François Raskin, editors, Tools and Algorithms for the Construction and Analysis of Systems - 22nd International Conference, TACAS 2016, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2016, Eindhoven, The Netherlands, April 2-8, 2016, Proceedings, volume 9636 of Lecture Notes in Computer Science, pages 130-146. Springer, 2016. URL: http://dx.doi.org/10.1007/978-3-662-49674-9_8.
Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. J. Artif. Intell. Res., 4:237-285, 1996. URL: http://dx.doi.org/10.1613/jair.301.
James R. Norris. Markov chains. Cambridge series in statistical and probabilistic mathematics. Cambridge University Press, 1998.
Christos H. Papadimitriou and Mihalis Yannakakis. Shortest paths without a map. Theoretical Computer Science, 84(1):127-150, 1991.
Martin L. Puterman. Markov Decision Processes. Wiley-Interscience, 2005.
Stuart J. Russell and Peter Norvig. Artificial Intelligence - A Modern Approach (3. internat. ed.). Pearson Education, 2010. URL: http://vig.pearsoned.com/store/product/1,1207,store-12521_isbn-0136042597,00.html.
Eilon Solan. Continuity of the value of competitive Markov decision processes. Journal of Theoretical Probability, 16(4):831-845, 2003.
Richard S. Sutton and Andrew G. Barto. Reinforcement learning: an introduction. Adaptive computation and machine learning. MIT Press, 2018. URL: http://www.incompleteideas.net/book/the-book-2nd.html.
Wolfgang Thomas. On the synthesis of strategies in infinite games. In STACS, pages 1-13, 1995. URL: http://dx.doi.org/10.1007/3-540-59042-0_57.
Leslie G. Valiant. A theory of the learnable. In Richard A. DeMillo, editor, Proceedings of the 16th Annual ACM Symposium on Theory of Computing, April 30 - May 2, 1984, Washington, DC, USA, pages 436-445. ACM, 1984. URL: http://dx.doi.org/10.1145/800057.808710.
Moshe Y. Vardi. Automatic verification of probabilistic concurrent finite-state programs. In 26th Annual Symposium on Foundations of Computer Science, Portland, Oregon, USA, 21-23 October 1985, pages 327-338. IEEE Computer Society, 1985. URL: http://dx.doi.org/10.1109/SFCS.1985.12.
Christopher J. C. H. Watkins and Peter Dayan. Technical note q-learning. Machine Learning, 8:279-292, 1992. URL: http://dx.doi.org/10.1007/BF00992698.
Min Wen, Rüdiger Ehlers, and Ufuk Topcu. Correct-by-synthesis reinforcement learning with temporal logic constraints. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2015, Hamburg, Germany, September 28 - October 2, 2015, pages 4983-4990. IEEE, 2015. URL: http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=7347169.
Min Wen and Ufuk Topcu. Probably approximately correct learning in stochastic games with temporal logic specifications. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 3630-3636. IJCAI/AAAI Press, 2016. URL: http://www.ijcai.org/Proceedings/2016.

Learning-Based Mean-Payoff Optimization in an Unknown MDP under Omega-Regular Constraints

Authors Jan Kretínský , Guillermo A. Pérez , Jean-François Raskin

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message

Learning-Based Mean-Payoff Optimization in an Unknown MDP under Omega-Regular Constraints

Authors Jan Kretínský , Guillermo A. Pérez , Jean-François Raskin

File

Document Identifiers

Related Versions

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

References

Thanks for your feedback!

Could not send message