You Only Live Multiple Times: A Blackbox Solution for Reusing Crash-Stop Algorithms In Realistic Crash-Recovery Settings

Authors David Kozhaya, Ognjen Maric, Yvonne-Anne Pignolet




File

LIPIcs.OPODIS.2018.19.pdf
  • Filesize: 0.62 MB
  • 17 pages

Author Details

David Kozhaya
  • ABB Corporate Research, Switzerland
Ognjen Maric
  • Digital Asset, Switzerland
Yvonne-Anne Pignolet
  • ABB Corporate Research, Switzerland

Cite As

David Kozhaya, Ognjen Maric, and Yvonne-Anne Pignolet. You Only Live Multiple Times: A Blackbox Solution for Reusing Crash-Stop Algorithms In Realistic Crash-Recovery Settings. In 22nd International Conference on Principles of Distributed Systems (OPODIS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 125, pp. 19:1-19:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019) https://doi.org/10.4230/LIPIcs.OPODIS.2018.19

Abstract

Distributed agreement-based algorithms are often specified in a crash-stop asynchronous model augmented by Chandra and Toueg's unreliable failure detectors. In such models, correct nodes stay up forever, incorrect nodes eventually crash and remain down forever, and failure detectors eventually behave correctly forever. However, in reality, nodes as well as communication links both crash and recover, without deterministic guarantees of remaining in any state forever.
In this paper, we capture this realistic temporary and probabilistic behaviour in a simple new system model. Moreover, we identify a large algorithm class for which we devise a property-preserving transformation. Using this transformation, many algorithms written for the asynchronous crash-stop model run correctly and unchanged in real systems.

Subject Classification

ACM Subject Classification
  • Theory of computation → Distributed algorithms
Keywords
  • Crash recovery
  • consensus
  • asynchrony


References

  1. Marcos Kawazoe Aguilera, Wei Chen, and Sam Toueg. Failure detection and consensus in the crash-recovery model. Distributed Computing, 13(2):99-125, 2000.
  2. Dan Alistarh, James Aspnes, Valerie King, and Jared Saia. Communication-Efficient Randomized Consensus. In Distributed Computing, pages 61-75, 2014.
  3. Dan Alistarh, Seth Gilbert, Rachid Guerraoui, and Corentin Travers. How to solve consensus in the smallest window of synchrony. In DISC. Springer, 2008.
  4. Bowen Alpern and Fred Schneider. Defining Liveness. Information Processing Letters, 21:181-185, June 1985.
  5. James Aspnes, Hagit Attiya, and Keren Censor. Combining Shared-coin Algorithms. J. Parallel Distrib. Comput., 70(3):317-322, 2010.
  6. Christel Baier and Joost-Pieter Katoen. Principles of model checking. MIT Press, 2008.
  7. Gabriel Bracha and Sam Toueg. Asynchronous Consensus and Broadcast Protocols. J. ACM, 32(4), 1985.
  8. Tushar Deepak Chandra and Sam Toueg. Unreliable Failure Detectors for Reliable Distributed Systems. J. ACM, 43(2):225-267, 1996.
  9. Bernadette Charron-Bost, Martin Hutle, and Josef Widder. In search of lost time. Information Processing Letters, 110(21), 2010.
  10. Flaviu Cristian. Understanding Fault-tolerant Distributed Systems. Commun. ACM, 34(2):56-78, 1991.
  11. Danny Dolev, Roy Friedman, Idit Keidar, and Dahlia Malkhi. Failure Detectors in Omission Failure Environments. In PODC, pages 286-, 1997.
  12. Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Consensus in the Presence of Partial Synchrony. J. ACM, 35(2):288-323, 1988.
  13. Dacfey Dzung, Rachid Guerraoui, David Kozhaya, and Yvonne Anne Pignolet. Never Say Never - Probabilistic and Temporal Failure Detectors. In IEEE International Parallel and Distributed Processing Symposium, IPDPS, pages 679-688, 2016.
  14. Tzilla Elrad and Nissim Francez. Decomposition of distributed programs into communication-closed layers. Science of Computer Programming, 2:155-173, 1982.
  15. Christof Fetzer, Ulrich Schmid, and Martin Susskraut. On the Possibility of Consensus in Asynchronous Systems with Finite Average Response Times. In 25th IEEE International Conference on Distributed Computing Systems, pages 271-280, 2005.
  16. Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM (JACM), 32(2):374-382, 1985.
  17. Pierre Fraigniaud, Mika Göös, Amos Korman, Merav Parter, and David Peleg. Randomized distributed decision. LNCS, 27, 2014.
  18. Felix C. Freiling, Christian Lambertz, and Mila Majster-Cederbaum. Modular Consensus Algorithms for the Crash-Recovery Model. In 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies, pages 287-292, December 2009.
  19. Eli Gafni. Round-by-round fault detectors: Unifying synchrony and asynchrony. In PODC, pages 143-152, 1998.
  20. Rachid Guerraoui and Michel Raynal. A Generic Framework for Indulgent Consensus. In ICDCS, pages 88-, 2003.
  21. Michel Hurfin, Achour Mostéfaoui, and Michel Raynal. Consensus in Asynchronous Systems Where Processes Can Crash and Recover. In Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems, pages 280-, 1998.
  22. Michel Hurfin, Achour Mostéfaoui, and Michel Raynal. A versatile family of consensus protocols based on Chandra-Toueg’s unreliable failure detectors. IEEE Transactions on Computers, 51(4):395-408, 2002.
  23. Leslie Lamport. The Part-time Parliament. ACM Trans. Comput. Syst., 16(2):133-169, 1998.
  24. Mikel Larrea, Cristian Martín, and Iratxe Soraluze. Communication-efficient leader election in crash–recovery systems. Journal of Systems and Software, 84(12):2186-2195, 2011.
  25. Neeraj Mittal, Kuppahalli L. Phaneesh, and Felix C. Freiling. Safe Termination Detection in an Asynchronous Distributed System when Processes May Crash and Recover. Theor. Comput. Sci., 410(6-7):614-628, February 2009.
  26. H. Moniz, N.F. Neves, and M. Correia. Turquois: Byzantine consensus in wireless ad hoc networks. In DSN, 2010.
  27. Henrique Moniz, Nuno Ferreira Neves, Miguel Correia, and Paulo Veríssimo. Randomization Can Be a Healer: Consensus with Dynamic Omission Failures. In DISC, volume 5805 of LNCS, 2009.
  28. Rui Oliveira, Rachid Guerraoui, and André Schiper. Consensus in the Crash-Recover Model, 1997.
  29. Marshall Pease, Robert Shostak, and Leslie Lamport. Reaching agreement in the presence of faults. Journal of the ACM (JACM), 27(2):228-234, 1980.
  30. André Schiper. Early Consensus in an Asynchronous System with a Weak Failure Detector. Distrib. Comput., 10(3):149-157, 1997.
  31. Ulrich Schmid, Bettina Weiss, and Idit Keidar. Impossibility results and lower bounds for consensus under link failures. SIAM Journal on Computing, 38(5):1912-1951, 2009.