The Need for Language Support for Fault-Tolerant Distributed Systems

Authors Cezara Dragoi, Thomas A. Henzinger, Damien Zufferey

Thumbnail PDF


  • Filesize: 477 kB
  • 13 pages

Document Identifiers

Author Details

Cezara Dragoi
Thomas A. Henzinger
Damien Zufferey

Cite AsGet BibTex

Cezara Dragoi, Thomas A. Henzinger, and Damien Zufferey. The Need for Language Support for Fault-Tolerant Distributed Systems. In 1st Summit on Advances in Programming Languages (SNAPL 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 32, pp. 90-102, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015)


Fault-tolerant distributed algorithms play an important role in many critical/high-availability applications. These algorithms are notoriously difficult to implement correctly, due to asynchronous communication and the occurrence of faults, such as the network dropping messages or computers crashing. Nonetheless there is surprisingly little language and verification support to build distributed systems based on fault-tolerant algorithms. In this paper, we present some of the challenges that a designer has to overcome to implement a fault-tolerant distributed system. Then we review different models that have been proposed to reason about distributed algorithms and sketch how such a model can form the basis for a domain-specific programming language. Adopting a high-level programming model can simplify the programmer's life and make the code amenable to automated verification, while still compiling to efficiently executable code. We conclude by summarizing the current status of an ongoing language design and implementation project that is based on this idea.
  • Programming language
  • Fault-tolerant distributed algorithms
  • Automated verification


  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    PDF Downloads


  1. Parosh Aziz Abdulla, Karlis Cerans, Bengt Jonsson, and Yih-Kuen Tsay. General Decidability Theorems for Infinite-State Systems. In LICS, pages 313-321. IEEE, 1996. Google Scholar
  2. Parosh Aziz Abdulla, Aurore Collomb-Annichini, Ahmed Bouajjani, and Bengt Jonsson. Using Forward Reachability Analysis for Verification of Lossy Channel Systems. FMSD, 25(1):39-65, 2004. Google Scholar
  3. Yehuda Afek and Eli Gafni. Asynchrony from synchrony. In Davide Frey, Michel Raynal, Saswati Sarkar, Rudrapatna K. Shyamasundar, and Prasun Sinha, editors, Distributed Computing and Networking, 14th International Conference, ICDCN 2013, Mumbai, India, January 3-6, 2013. Proceedings, pages 225-239, 2013. Google Scholar
  4. Krzysztof R. Apt and Dexter Kozen. Limits for automatic verification of finite-state concurrent systems. Inf. Process. Lett., 22(6):307-309, 1986. Google Scholar
  5. Joe L. Armstrong. The development of erlang. In Simon L. Peyton Jones, Mads Tofte, and A. Michael Berman, editors, Proceedings of the 1997 ACM SIGPLAN International Conference on Functional Programming (ICFP '97), Amsterdam, The Netherlands, June 9-11, 1997., pages 196-203. ACM, 1997. Google Scholar
  6. Martin Biely, Bernadette Charron-Bost, Antoine Gaillard, Martin Hutle, André Schiper, and Josef Widder. Tolerating corrupted communication. In PODC, pages 244-253, 2007. Google Scholar
  7. Fatemeh Borran, Martin Hutle, and André Schiper. Timing analysis of leader-based and decentralized Byzantine consensus algorithms. J. Braz. Comp. Soc., 18(1):29-42, 2012. Google Scholar
  8. Daniel Brand and Pitro Zafiropulo. On Communicating Finite-State Machines. J. ACM, 30(2):323-342, 1983. Google Scholar
  9. Francisco Brasileiro, Fabíola Greve, Achour Mostefaoui, and Michel Raynal. Consensus in one communication step. In Victor Malyshkin, editor, Parallel Computing Technologies, volume 2127 of Lecture Notes in Computer Science. Springer, 2001. Google Scholar
  10. Eugene Burmako. Scala macros: Let our powers combine!: On how rich syntax and static types work with metaprogramming. In Proceedings of the 4th Workshop on Scala, SCALA '13, pages 3:1-3:10, New York, NY, USA, 2013. ACM. Google Scholar
  11. Mike Burrows. The chubby lock service for loosely-coupled distributed systems. In OSDI, Berkeley, CA, USA, 2006. USENIX Association. Google Scholar
  12. Tushar D. Chandra, Robert Griesemer, and Joshua Redstone. Paxos made live: An engineering perspective. In Proceedings of the Twenty-sixth Annual ACM Symposium on Principles of Distributed Computing, PODC '07, pages 398-407, New York, NY, USA, 2007. ACM. Google Scholar
  13. Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. J. ACM, 43(2):225-267, 1996. Google Scholar
  14. Mouna Chaouch-Saad, Bernadette Charron-Bost, and Stephan Merz. A reduction theorem for the verification of round-based distributed algorithms. In RP, volume 5797 of LNCS, pages 93-106, 2009. Google Scholar
  15. Bernadette Charron-Bost and Stephan Merz. Formal verification of a consensus algorithm in the heard-of model. Int. J. Software and Informatics, 3(2-3):273-303, 2009. Google Scholar
  16. Bernadette Charron-Bost and André Schiper. The heard-of model: computing in distributed systems with benign faults. Distributed Computing, 22(1):49-71, 2009. Google Scholar
  17. Kaustuv Chaudhuri, Damien Doligez, Leslie Lamport, and Stephan Merz. A TLA+ proof system. In LPAR Workshops, 2008. Google Scholar
  18. S. Chaudhuri. More choices allow more faults: Set consensus problems in totally asynchronous systems. Information and Computation, 105(1):132 - 158, 1993. Google Scholar
  19. Henri Debrat and Stephan Merz. Verifying fault-tolerant distributed algorithms in the heard-of model. Archive of Formal Proofs, 2012, 2012. Google Scholar
  20. Ankush Desai, Vivek Gupta, Ethan K. Jackson, Shaz Qadeer, Sriram K. Rajamani, and Damien Zufferey. P: safe asynchronous event-driven programming. In Hans-Juergen Boehm and Cormac Flanagan, editors, ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '13, Seattle, WA, USA, June 16-19, 2013, pages 321-332. ACM, 2013. Google Scholar
  21. Danny Dolev, Cynthia Dwork, and Larry Stockmeyer. On the minimal synchronism needed for distributed consensus. J. ACM, 34:77-97, 1987. URL:
  22. Danny Dolev and Ezra N. Hoch. On self-stabilizing synchronous actions despite byzantine attacks. In Andrzej Pelc, editor, Distributed Computing, 21st International Symposium, DISC 2007, Lemesos, Cyprus, September 24-26, 2007, Proceedings, volume 4731 of Lecture Notes in Computer Science, pages 193-207. Springer, 2007. Google Scholar
  23. Cezara Dragoi, Thomas A. Henzinger, Helmut Veith, Josef Widder, and Damien Zufferey. A logic-based framework for verifying consensus algorithms. In Kenneth L. McMillan and Xavier Rival, editors, VMCAI, pages 161-181. Springer, 2014. Google Scholar
  24. Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Consensus in the presence of partial synchrony. JACM, 35(2):288-323, April 1988. Google Scholar
  25. Tzilla Elrad and Nissim Francez. Decomposition of distributed programs into communication-closed layers. Sci. Comput. Program., 2(3):155-173, 1982. Google Scholar
  26. Jose M. Falerio, Sriram K. Rajamani, Kaushik Rajan, G. Ramalingam, and Kapil Vaswani. Generalized lattice agreement. In ACM Symposium on Principles of Distributed Computing, PODC '12, Funchal, Madeira, Portugal, July 16-18, 2012, pages 125-134, 2012. Google Scholar
  27. Alain Finkel. A generalization of the procedure of karp and miller to well structured transition systems. In Thomas Ottmann, editor, Automata, Languages and Programming, 14th International Colloquium, ICALP87, Karlsruhe, Germany, July 13-17, 1987, Proceedings, volume 267 of Lecture Notes in Computer Science, pages 499-508. Springer, 1987. Google Scholar
  28. Alain Finkel and Ph. Schnoebelen. Well-structured transition systems everywhere! Theor. Comput. Sci., 256(1-2):63-92, 2001. Google Scholar
  29. Michael J. Fischer, Nancy A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty process. J. ACM, 32(2):374-382, April 1985. Google Scholar
  30. Eli Gafni. Round-by-round fault detectors: Unifying synchrony and asynchrony (extended abstract). In Brian A. Coan and Yehuda Afek, editors, Proceedings of the Seventeenth Annual ACM Symposium on Principles of Distributed Computing, PODC '98, Puerto Vallarta, Mexico, June 28 - July 2, 1998, pages 143-152, 1998. Google Scholar
  31. Carl Hewitt, Peter Bishop, and Richard Steiger. A Universal Modular ACTOR Formalism for Artificial Intelligence. In IJCAI, pages 235-245, 1973. Google Scholar
  32. C. A. R. Hoare. Communicating sequential processes. Commun. ACM, 21(8):666-677, 1978. Google Scholar
  33. Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. Zookeeper: wait-free coordination for internet-scale systems. In USENIXATC. USENIX Association, 2010. Google Scholar
  34. Michael Isard. Autopilot: Automatic data center management. Operating Systems Review, 41(2):60-67, April 2007. Google Scholar
  35. Flavio Paiva Junqueira, Benjamin C. Reed, and Marco Serafini. Zab: High-performance broadcast for primary-backup systems. In Proceedings of the 2011 IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2011, Hong Kong, China, June 27-30 2011, pages 245-256. IEEE, 2011. Google Scholar
  36. Idit Keidar and Alexander Shraer. Timeliness, failure-detectors, and consensus performance. In PODC, pages 169-178, 2006. Google Scholar
  37. Leslie Lamport. The part-time parliament. ACM Trans. Comput. Syst., 1998. Google Scholar
  38. Leslie Lamport. Distributed algorithms in TLA (abstract). In PODC, 2000. Google Scholar
  39. Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine generals problem. ACM Trans. Program. Lang. Syst., 4(3):382-401, 1982. Google Scholar
  40. Nancy Lynch. Distributed Algorithms. Morgan Kaufman, 1996. Google Scholar
  41. Nancy A. Lynch and Mark R. Tuttle. Hierarchical correctness proofs for distributed algorithms. In Fred B. Schneider, editor, Proceedings of the Sixth Annual ACM Symposium on Principles of Distributed Computing, Vancouver, British Columbia, Canada, August 10-12, 1987, pages 137-151. ACM, 1987. Google Scholar
  42. Yanhua Mao, Flavio Paiva Junqueira, and Keith Marzullo. Mencius: Building efficient replicated state machine for wans. In Richard Draves and Robbert van Renesse, editors, 8th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2008, December 8-10, 2008, San Diego, California, USA, Proceedings, pages 369-384. USENIX Association, 2008. Google Scholar
  43. Heather Miller, Philipp Haller, Eugene Burmako, and Martin Odersky. Instant pickles: generating object-oriented pickler combinators for fast and extensible serialization. In Antony L. Hosking, Patrick Th. Eugster, and Cristina V. Lopes, editors, OOPSLA, pages 183-202, 2013. Google Scholar
  44. Robin Milner, Joachim Parrow, and David Walker. A calculus of mobile processes, I. Inf. Comput., 100(1):1-40, 1992. Google Scholar
  45. Robin Milner, Joachim Parrow, and David Walker. A calculus of mobile processes, II. Inf. Comput., 100(1):41-77, 1992. Google Scholar
  46. Iulian Moraru, David G. Andersen, and Michael Kaminsky. There is more consensus in egalitarian parliaments. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 358-372, New York, NY, USA, 2013. ACM. Google Scholar
  47. Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel. Isabelle/HOL - A Proof Assistant for Higher-Order Logic, volume 2283 of Lecture Notes in Computer Science. Springer, 2002. Google Scholar
  48. Diego Ongaro and John K. Ousterhout. In search of an understandable consensus algorithm. In Garth Gibson and Nickolai Zeldovich, editors, 2014 USENIX Annual Technical Conference, USENIX ATC '14, Philadelphia, PA, USA, June 19-20, 2014., pages 305-319, 2014. Google Scholar
  49. Amir Pnueli. The temporal logic of programs. 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, 0:46-57, 1977. Google Scholar
  50. Sergio Rajsbaum. Acm sigact news distributed computing column 5. SIGACT News, 32(4):34-58, December 2001. Google Scholar
  51. Nicolas Schiper, Vincent Rahli, Robbert van Renesse, Mark Bickford, and Robert L. Constable. Shadowdb: A replicated database on a synthesized consensus core. In Michael J. Freedman and Neeraj Suri, editors, Proceedings of the Eighth Workshop on Hot Topics in System Dependability, HotDep 2012, Hollywood, CA, USA, October 7, 2012. USENIX Association, 2012. Google Scholar
  52. Philippe Schnoebelen. Revisiting Ackermann-Hardness for Lossy Counter Machines and Reset Petri Nets. In MFCS, pages 616-628, 2010. Google Scholar
  53. Seved Torstendahl. Open telecom platform. Ericsson Review, 1, 1997. Google Scholar
  54. Tatsuhiro Tsuchiya and André Schiper. Model checking of consensus algorithms. In 26th IEEE Symposium on Reliable Distributed Systems (SRDS 2007), Beijing, China, October 10-12, 2007, pages 137-148, 2007. Google Scholar
  55. Tatsuhiro Tsuchiya and André Schiper. Using bounded model checking to verify consensus algorithms. In DISC, pages 466-480, 2008. Google Scholar
  56. Tatsuhiro Tsuchiya and André Schiper. Verification of consensus algorithms using satisfiability solving. Distributed Computing, 23(5-6):341-358, 2011. Google Scholar
  57. Josef Widder, Martin Biely, Günther Gridling, Bettina Weiss, and Jean-Paul Blanquart. Consensus in the presence of mortal Byzantine faulty processes. Distributed Computing, 2012. Google Scholar
  58. Lantian Zheng and Andrew C. Myers. A language-based approach to secure quorum replication. In Proceedings of the Ninth Workshop on Programming Languages and Analysis for Security, PLAS@ECOOP 2014, Uppsala, Sweden, July 29, 2014, page 27, 2014. Google Scholar