Fault-Tolerant Computing with Unreliable Channels

Naser-Pastoriza, Alejandro; Chockler, Gregory; Gotsman, Alexey

doi:10.4230/LIPIcs.OPODIS.2023.21

Abstract

We study implementations of basic fault-tolerant primitives, such as consensus and registers, in message-passing systems subject to process crashes and a broad range of communication failures. Our results characterize the necessary and sufficient conditions for implementing these primitives as a function of the connectivity constraints and synchrony assumptions. Our main contribution is a new algorithm for partially synchronous consensus that is resilient to process crashes and channel failures and is optimal in its connectivity requirements. In contrast to prior work, our algorithm assumes the most general model of message loss where faulty channels are flaky, i.e., can lose messages without any guarantee of fairness. This failure model is particularly challenging for consensus algorithms, as it rules out standard solutions based on leader oracles and failure detectors. To circumvent this limitation, we construct our solution using a new variant of the recently proposed view synchronizer abstraction, which we adapt to the crash-prone setting with flaky channels.
