Reliable State Machines: A Framework for Programming Reliable Cloud Services

Authors: Suvam Mukherjee, Nitin John Raj, Krishnan Govindraj, Pantazis Deligiannis, Chandramouleswaran Ravichandran, Akash Lal, Aseem Rastogi, Raja Krishnaswamy

Author Details

Suvam Mukherjee
  • Microsoft Research, Bangalore, India
Nitin John Raj
  • International Institute of Information Technology, Hyderabad, India
Krishnan Govindraj
  • Microsoft Research, Bangalore, India
Pantazis Deligiannis
  • Microsoft Research, Redmond, USA
Chandramouleswaran Ravichandran
  • Microsoft Azure, Redmond, USA
Akash Lal
  • Microsoft Research, Bangalore, India
Aseem Rastogi
  • Microsoft Research, Bangalore, India
Raja Krishnaswamy
  • Microsoft Azure, Redmond, USA


Acknowledgements

We thank the anonymous reviewers for suggesting several ways to improve our work. Nitin John Raj’s work was done, in part, during an internship at Microsoft Research, India.

Cite As

Suvam Mukherjee, Nitin John Raj, Krishnan Govindraj, Pantazis Deligiannis, Chandramouleswaran Ravichandran, Akash Lal, Aseem Rastogi, and Raja Krishnaswamy. Reliable State Machines: A Framework for Programming Reliable Cloud Services. In 33rd European Conference on Object-Oriented Programming (ECOOP 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 134, pp. 18:1-18:29, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


Abstract

Building reliable applications for the cloud is challenging because of unpredictable failures during a program’s execution. This paper presents a programming framework, called Reliable State Machines (RSMs), that offers fault-tolerance by construction. In our framework, an application comprises several (possibly distributed) RSMs that communicate with each other via messages, much in the style of actor-based programming. Each RSM is fault-tolerant by design, thereby offering the illusion of being "always-alive". An RSM is guaranteed to process each input request exactly once, as one would expect in a failure-free environment. The RSM runtime automatically takes care of persisting state and rehydrating it on a failover. We present the core syntax and semantics of RSMs, along with a formal proof of failure-transparency. We provide a .NET implementation of the RSM framework for deploying services to Microsoft Azure. We carry out an extensive performance evaluation on micro-benchmarks to show that one can build high-throughput applications with RSMs. We also present a case study where we rewrite a significant part of a production cloud service using RSMs. The resulting service has simpler code and exhibits production-grade performance.
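The exactly-once guarantee and the persist-then-rehydrate behaviour described above can be pictured with a small toy model. The sketch below is illustrative only and is not the RSM API or the paper's mechanism: `DurableStore`, `CounterMachine`, `run_demo`, the request ids, and the use of request-id deduplication with a single atomic commit are all assumptions introduced here to show one common way such a guarantee can be obtained.

```python
import copy


class DurableStore:
    """Hypothetical stand-in for storage that survives a process crash."""

    def __init__(self):
        self._committed = {"counter": 0, "processed_ids": set()}

    def load(self):
        # Hand out a private copy, so uncommitted changes vanish on a crash.
        return copy.deepcopy(self._committed)

    def commit(self, snapshot):
        # Assumed atomic: the state and the set of processed request ids
        # become durable together, or not at all.
        self._committed = copy.deepcopy(snapshot)


class CounterMachine:
    """Toy state machine whose state is a single counter (not the RSM API)."""

    def __init__(self, store):
        self.store = store
        # Rehydrate the last committed state on start-up or after a failover.
        self.state = store.load()

    def handle(self, request_id, amount, crash_before_commit=False):
        # A redelivered request that was already committed has no effect.
        if request_id in self.state["processed_ids"]:
            return self.state["counter"]
        self.state["counter"] += amount
        self.state["processed_ids"].add(request_id)
        if crash_before_commit:
            raise RuntimeError("simulated crash before the commit")
        self.store.commit(self.state)
        return self.state["counter"]


def run_demo():
    store = DurableStore()
    machine = CounterMachine(store)
    machine.handle("req-1", 5)
    try:
        machine.handle("req-2", 7, crash_before_commit=True)
    except RuntimeError:
        # Failover: a fresh instance rehydrates from the durable store.
        machine = CounterMachine(store)
    # The sender retries req-2 and also redelivers req-1.
    machine.handle("req-2", 7)
    machine.handle("req-1", 5)
    return machine.state["counter"]
```

In this sketch the crash discards the uncommitted update for `req-2`, the retry applies it once, and the redelivered `req-1` is ignored, so each request affects the counter exactly once. In the framework itself this bookkeeping is, per the abstract, handled by the RSM runtime rather than written by the application programmer.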

Subject Classification

ACM Subject Classification
  • Software and its engineering → Software reliability
  • Software and its engineering → Cloud computing
  • Software and its engineering → Software fault tolerance

Keywords
  • Fault tolerance
  • Cloud computing
  • Actor framework


