Recovering Shared Objects Without Stable Storage

Authors Ellis Michael, Dan R. K. Ports, Naveen Kr. Sharma, Adriana Szekeres



PDF
Thumbnail PDF

File

LIPIcs.DISC.2017.36.pdf
  • Filesize: 0.51 MB
  • 16 pages

Document Identifiers

Author Details

Ellis Michael
Dan R. K. Ports
Naveen Kr. Sharma
Adriana Szekeres

Cite As Get BibTex

Ellis Michael, Dan R. K. Ports, Naveen Kr. Sharma, and Adriana Szekeres. Recovering Shared Objects Without Stable Storage. In 31st International Symposium on Distributed Computing (DISC 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 91, pp. 36:1-36:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017) https://doi.org/10.4230/LIPIcs.DISC.2017.36

Abstract

This paper considers the problem of building fault-tolerant shared objects when processes can crash and recover but lose their persistent state on recovery. This Diskless Crash-Recovery (DCR) model matches the way many long-lived systems are built. We show that it presents new challenges, as operations that are recorded at a quorum may not persist after some of the processes in that quorum crash and then recover.

To address this problem, we introduce the notion of crash-consistent quorums, where no recoveries happen during the quorum responses. We show that relying on crash-consistent quorums enables a recovery procedure that can recover all operations that successfully finished. Crash-consistent quorums can be easily identified using a mechanism we term the crash vector, which tracks the causal relationship between crashes, recoveries, and other operations.

We apply crash-consistent quorums and crash vectors to build two storage primitives. We give a new algorithm for multi-writer, multi-reader atomic registers in the DCR model that guarantees safety under all conditions and termination under a natural condition. It improves on the best prior protocol for this problem by requiring fewer rounds, fewer nodes to participate in the quorum, and a less restrictive liveness condition. We also present a more efficient single-writer, single-reader atomic set - a virtual stable storage abstraction. It can be used to lift any existing algorithm from the traditional Crash-Recovery model to the DCR model. We examine a specific application, state machine replication, and show that existing diskless protocols can violate their correctness guarantees, while ours offers a general and correct solution.

Subject Classification

Keywords
  • asynchronous system
  • fault-tolerance
  • crash-recovery
  • R/W register
  • state machine replication

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Marcos K. Aguilera, Idit Keidar, Dahlia Malkhi, and Alexander Shraer. Dynamic atomic storage without consensus. J. ACM, 58(2):7:1-7:32, April 2011. Google Scholar
  2. Marcos Kawazoe Aguilera, Wei Chen, and Sam Toueg. Failure detection and consensus in the crash-recovery model. In Proc. of DISC, 1998. Google Scholar
  3. Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. Sharing memory robustly in message-passing systems. J. of the ACM, 42(1):124-142, January 1995. Google Scholar
  4. Hagit Attiya, Hyun Chul Chung, Faith Ellen, Saptaparni Kumar, and Jennifer L. Welch. Simulating a shared register in an asynchronous system that never stops changing. In Proc. of DISC, 2015. Google Scholar
  5. Roberto Baldoni, Silvia Bonomi, Anne-Marie Kermarrec, and Michel Raynal. Implementing a register in a dynamic distributed system. In Proc. of ICDCS, 2009. Google Scholar
  6. Roberto Baldoni, Silvia Bonomi, and Michel Raynal. Implementing a regular register in an eventually synchronous distributed system prone to continuous churn. IEEE Trans. Parallel Distrib. Syst., 23(1):102-109, January 2012. Google Scholar
  7. Tushar D Chandra, Robert Griesemer, and Joshua Redstone. Paxos made live: An engineering perspective. In Proc. of PODC, 2007. Google Scholar
  8. Danny Dolev, Roy Friedman, Idit Keidar, and Dahlia Malkhi. Failure detectors in omission failure environments. In Proc. of PODC, 1997. Google Scholar
  9. Colin J. Fidge. Timestamps in message-passing systems that preserve the partial ordering. In Proc. of ACSC, 1988. Google Scholar
  10. Eli Gafni and Dahlia Malkhi. Elastic configuration maintenance via a parsimonious speculating snapshot solution. In Proc. of DISC, 2015. Google Scholar
  11. Rachid Guerraoui, Ron R. Levy, Bastian Pochon, and Jim Pugh. The collective memory of amnesic processes. ACM Trans. Algorithms, 4(1):12:1-12:31, March 2008. Google Scholar
  12. Michel Hurfin, Achour Mostéfaoui, and Michel Raynal. Consensus in asynchronous systems where processes can crash and recover. In Proc. of SRDS, 1998. Google Scholar
  13. Leander Jehl, Tormod Erevik Lea, and Hein Meling. Replacement: Decentralized failure handling for replicated state machines. In Proc. of SRDS, 2015. Google Scholar
  14. Leander Jehl, Roman Vitenberg, and Hein Meling. SmartMerge: A new approach to reconfiguration for atomic storage. In Proc. of DISC, 2015. Google Scholar
  15. Andreas Klappenecker, Hyunyoung Lee, and Jennifer L. Welch. Dynamic regular registers in systems with churn. Theor. Comput. Sci., 512:84-97, November 2013. Google Scholar
  16. Jan Kończak, Nuno Santos, Tomasz Żurkowski, Paweł T. Wojciechowski, and André Schiper. JPaxos: State machine replication based on the Paxos protocol. Technical Report EPFL-REPORT-167765, 2011. Google Scholar
  17. Kishori M. Konwar, N. Prakash, Nancy A. Lynch, and Muriel Médard. RADON: Repairable atomic data object in networks. In Proc. of OPODIS, 2016. Google Scholar
  18. Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Comm. of the ACM, 1978. Google Scholar
  19. Leslie Lamport. On interprocess communication. Distributed Computing. Parts I and II, 1986. Google Scholar
  20. Leslie Lamport. Paxos made simple. ACM SIGACT News 32, 2001. Google Scholar
  21. Leslie Lamport, Dahlia Malkhi, and Lidong Zhou. Reconfiguring a state machine. SIGACT News, 41(1):63-73, March 2010. Google Scholar
  22. Barbara Liskov and James Cowling. Viewstamped Replication revisited. Technical report, MIT, July 2012. Google Scholar
  23. Jacob R. Lorch, Atul Adya, William J. Bolosky, Ronnie Chaiken, John R. Douceur, and Jon Howell. The SMART way to migrate replicated stateful services. In Proc. of EuroSys, 2006. Google Scholar
  24. Nancy A. Lynch and Alexander A. Shvartsman. RAMBO: A reconfigurable atomic memory service for dynamic networks. In Proc. of DISC, 2002. Google Scholar
  25. Nancy A. Lynch and Mark R. Tuttle. An introduction to input/output automata. CWI Quarterly, 2:219-246, 1989. Google Scholar
  26. Michael Merritt and Gadi Taubenfeld. Computing with infinitely many processes under assumptions on concurrency and participation. In Proc. of DISC, 2000. Google Scholar
  27. Ellis Michael, Dan R. K. Ports, Naveen Kr. Sharma, and Adriana Szekeres. Recovering shared objects without stable storage [extended version]. Technical Report UW-CSE-17-08-01, University of Washington CSE, August 2017. Google Scholar
  28. Peter Musial, Nicolas Nicolaou, and Alexander A. Shvartsman. Implementing distributed shared memory for dynamic networks. Comm. of the ACM, 57(6), June 2014. Google Scholar
  29. Brian M. Oki and Barbara H. Liskov. Viewstamped Replication: A new primary copy method to support highly-available distributed systems. In Proc. of PODC, 1988. Google Scholar
  30. R.C. Oliveira, R. Guerraoui, and A. Schiper. Consensus in the crash recover model. Technical Report TR-97/239, EPFL, Lausanne, Switzerland, 1997. Google Scholar
  31. D.S. Parker, G.J. Popek, G. Rudisin, A. Stoughton, B.J. Walker, E. Walton, J.M. Chow, D. Edwards, S. Kiser, and C. Kline. Detection of mutual inconsistency in distributed systems. In IEEE Trans. on Software Engineering, 1983. Google Scholar
  32. Fred B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys, 1990. Google Scholar
  33. Alexander Shraer, Benjamin Reed, Dahlia Malkhi, and Flavio P. Junqueira. Dynamic reconfiguration of primary/backup clusters. In Proc. of USENIX ATC, 2012. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail