Brief Announcement: Communication Patterns for Optimal Resilience
Abstract
In response to the impossibility of solving the consensus problem in asynchronous systems subject to failures, various relaxations of the consensus problem have been proposed, including approximate agreement, crusader agreement, gather, and reliable broadcast. Some are interesting in their own right while others are useful building blocks for solving other problems. We focus on message-passing systems of processes, up to of which can experience malicious (Byzantine) failures. These problems all require that and they frequently have fairly simple and efficient algorithms when . Challenges arise when considering resilience between and . For instance, nearly twenty years elapsed between the discovery of an approximate agreement algorithm for [10] and one for [1]. A stumbling block could be too much focus on looking for algorithms in a certain natural and intuitive form, which we call canonical (asynchronous) rounds. In such an algorithm, each process repeatedly sends a message containing its entire state tagged with a round number, then waits to receive messages with that same round number, does some local computation and proceeds to the next round number. The approximate agreement algorithm is in canonical round form but the one is not.
For algorithms in canonical round form, an obvious way of measuring time is the number of canonical rounds until the algorithm completes. However, this approach does not apply to other algorithms, such as those in which processes wait to receive a certain number of messages that have other properties besides simply having a certain round number. Attempts to rewrite these latter algorithms in canonical round form can result in drastically increased round complexity. This blow-up in the round complexity is inherent, as we show in this paper that for a wide set of problems, there is no algorithm in canonical round form that has a finite upper bound on the number of rounds if . In contrast, the standard way of measuring time results in constant time complexity.
We first show the impossibility of a bounded number of canonical rounds for a generic problem that captures the key properties needed in the proof. The result then follows immediately for, most notably, crusader agreement and flavors of approximate agreement. We then show via reductions that the same result holds for reliable broadcast and gather, since there are crusader agreement algorithms that use reliable broadcast and gather with no round overhead.
Keywords and phrases:
canonical rounds, reliable broadcast, gather, crusader agreement, approximate agreement, time complexityFunding:
Hagit Attiya: Supported by the Israel Science Foundation (22/1425 and 25/1849).Copyright and License:
2012 ACM Subject Classification:
Theory of computation Distributed algorithmsRelated Version:
These results appear as part of https://arxiv.org/abs/2510.04310.Editor:
Dariusz R. KowalskiSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Modern geo-distributed systems are often modeled as a set of asynchronous processes that communicate by sending and receiving messages; the possibility of arbitrary failures is captured by assuming that of the processes might be malicious (Byzantine) and deviate from their algorithms or even collude. While traditional consensus cannot be solved in this model, even when [14], relaxations of it are solvable, e.g., crusader agreement [9], approximate agreement [10], reliable broadcast [5], and gather [7, 1]. Solutions for these problems play a key role in Byzantine fault tolerance in asynchronous systems.
The resilience needed to solve a task in such a system is the relation between the total number of processes, , and the maximal number of Byzantine processes, . It is known that a resilience of is necessary and sufficient for solving many consensus-related problems. However, finding algorithms that have the optimal resilience can be challenging. For example, the seminal paper defining the approximate agreement problem [10] proved that is a lower bound on the resilience but the algorithms in that paper only have resilience . Finding an algorithm with the optimal resilience, that is, an algorithm that only assumes , has taken about twenty years [1].
We believe that a stumbling block to obtaining these algorithms is the inclination to look for algorithms that work in canonical rounds, a notion first mentioned by Fekete [11]. These are full-information algorithms, in which a process always sends all the information it has received so far, that proceed in asynchronous rounds. In each round , a process sends an -message with the information it has received so far. Then, process waits to receive -messages from any distinct processes, changes state and proceeds to round . After some number of rounds , process decides on a value that is some function of the information it has received, and stops sending or receiving messages.
Assuming this communication pattern seems innocuous; indeed, what else can a process do? Fekete even claims that “this form of protocol is completely general” [12]. However, several widely-used algorithms employ variations on this communication pattern, by waiting to receive a certain number of messages that satisfy a certain property, not just any set of messages. For instance, in Bracha’s reliable broadcast algorithm [5], the sender sends an initial message containing its input to all processes. Each process waits to receive an initial message, or echo messages for a common value, or ready messages for a common value, and then sends an echo message for the value. Each process then waits to receive echo messages for a common value or ready messages for a common value, and then sends a ready message for the value. Once a process has received ready messages for the same value, it accepts that value.
Bracha’s algorithm can be restructured to use canonical rounds. Since processes must move to the next round after receiving messages for the current round, null messages are sent as placeholders while waiting to receive enough messages of a required type. The resulting algorithm reveals an interesting property: The number of canonical asynchronous rounds required is unbounded. The reason is that, even with a correct sender, the sender’s initial message can be delayed arbitrarily long, while canonical rounds consisting largely of null messages continue to advance. In contrast, the canonical round version of Bracha’s algorithm, like the original, has bounded (in fact, constant) time complexity, measured as the maximum real time elapsed until termination assuming every message sent between correct processes is delivered within one time unit [3, 4]. The reason is that while Byzantine processes are “rushing” the correct processes through many canonical rounds, messages between correct processes are pending, and until they arrive less than one time unit can elapse.
This undesirable behavior is not caused simply by a poor conversion into canonical rounds. On the contrary, we prove in this paper that this behavior is inherent for a wide collection of fundamental problems. First, we show that any algorithm for a generic problem among processes has an execution with an unbounded number of canonical rounds. We then show that crusader agreement, approximate agreement on the real numbers, and approximate agreement on graphs are all special cases of the generic problem, which implies the same impossibility result for these problems. In particular, this explains why the code for the approximate agreement in [1] cannot be structured in canonical rounds. We observe that the approximate agreement algorithm for in [10] and the crusader agreement algorithm for in [4] both employ a bounded number of asynchronous canonical rounds, showing that requiring to be at most is necessary for the impossibility result.
To derive the same impossibility result for reliable broadcast and gather, we rely on the fact that they can be used to solve crusader agreement with little to no overhead.
To summarize, our main contribution is proving that canonical round algorithms for certain problems require an unbounded number of rounds, for Byzantine fault-tolerance with resilience . The impossibility result holds for several important building blocks, including crusader agreement, reliable broadcast and gather, as well as for approximate agreement on the real line and on graphs. This points to the limitations of round-based asynchronous algorithms, both as a design principle and as a way to measure time, at least for resilience in the range .
2 Canonical Round Algorithms have Unbounded Round Complexity
We assume the standard asynchronous model for processes, up to of which can be faulty, in which processes communicate via reliable point-to-point messages (cf., e.g., [4]). We consider malicious (or Byzantine) failures, where a faulty process can change state arbitrarily and send messages with arbitrary content.
Informally, an execution is a sequence of alternating configurations (states of the processes and in-transit messages) and events (WakeUp’s and message deliveries). If and are executions and is a set of processes, we say the executions are indistinguishable to , denoted , if, for each process in , has the same initial state and experiences the same sequence of events in as in .
An algorithm for processes, of which may be faulty, is in canonical round format and decides in rounds if it operates as follows. Every message sent is labeled with the current round number of the sender, starting with 1. When a process wakes up, it sends a message containing its initial state to all the processes. Once a process receives messages for the current round, it increments its round counter and sends a message containing its initial state and the sequence of messages it has received to all the processes. Once a process reaches round , it decides based on all the messages it has received.
Our result is proved for a nontrivial convergence problem in which there are at least two possible input values and and at least two decision values and , such that if all correct processes have input , then all correct processes must decide , for (validity), and if a correct process decides in an execution, then no correct process can decide in the same execution (agreement).
Theorem 1.
For any canonical round algorithm that solves a nontrivial convergence problem with and for any integer , there exists an execution and a correct process that does not decide by round .
Proof.
Assume towards contradiction that there exists a nontrivial convergence canonical round algorithm with and some such that in every execution, all correct processes decide by the end of round .
For simplicity, we assume , and divide the processes into five disjoint sets of processes each: , , , , . We consider the following initial configurations:
-
Denote by the initial configuration such that processes in groups are correct and processes in group are Byzantine. All correct processes begin with input 0.
-
Denote by the initial configuration such that processes in groups are correct and processes in group are Byzantine. All correct processes begin with input 1.
-
Denote by the initial configuration such that processes in groups are correct and processes in group are Byzantine. Processes in groups begin with input 0, and processes in groups begin with input 1.
We construct three executions starting at the initial configurations respectively, such that . Each execution is constructed as follows:
-
: The execution begins with WakeUp events for all processes in , , , ; call this part of the execution . Next appear receive events in which each of the processes in , , , receives the round 1 messages sent by the processes in , , , . Since , the processes complete round 1 and send their round 2 messages. Call this part of the execution . Similarly, define through , so that processes receive round messages and send round messages in with the caveat that in , processes decide instead of sending round messages. The processes in , , , which are correct, send messages whose content is determined by the algorithm; the contents of the messages sent by the processes in , which are Byzantine, are specified below. Note that processes in take no steps in even though they are correct; consider them as starting late, after the other processes have completed rounds.
-
: This execution and its partitioning into through is defined analogously to , but with processes in , , , exchanging messages, those in being Byzantine, and those in starting late.
-
: This execution and its partitioning into through are similar to the previous executions but with some key differences. consists of WakeUp events for all the processes. consists of receive events in which each of the correct processes receives a carefully selected set of round 1 messages and each faulty process takes a step in order to send a round 2 message. In particular, (correct) processes in , receive round 1 messages from processes in , , , , while (correct) processes in , receive round 1 messages from processes in , , , . Similarly, define through . The contents of the messages sent by the (Byzantine) processes in are defined below; unlike in and , the round messages sent to processes in , by a faulty process are not the same as those sent to processes in , by that process.
We now specify the messages sent by the faulty processes in each round , , of the three executions. In , a faulty process sends the round message sent by the corresponding correct process in . In , a faulty process sends the round message sent by the corresponding correct process in . Finally, in , a faulty process sends to the correct processes in , the round message sent by the corresponding correct process in , and to the correct processes in , the round message sent by the corresponding correct process in .
Recall that for . Denote by for and . Claim 2, implying that and are indistinguishable to processes in through round , and Claim 3, implying that and are indistinguishable to processes in through round , can be proved by induction.
Claim 2.
For each , (a) and (b) the same set of messages are in transit from to in the last configurations of and .
Claim 3.
For each , (a) and (b) the same set of messages are in transit from to in the last configurations of and .
From the validity property of the nontrivial convergence problem, by the end of , correct processes in groups must decide 1. Since , the corresponding correct processes in these groups must decide 1 by the end of . Similarly from validity, by the end of the correct processes in groups must decide 0. Since , processes in groups must decide 0 by the end of as well. This is in contradiction to the agreement property of the nontrivial convergence problem for execution .
3 Applications of Theorem 1
Crusader agreement [9] with input set ensures that if all correct processes start with the same value , they must decide on this value, and otherwise, they may pick an undecided value, denoted . If all correct processes start with they must decide , and if a correct process decides , the other correct processes decide either or . Therefore, crusader agreement is a nontrivial convergence problem with 0 and 1 being the two distinguished inputs and the two distinguished decisions, which implies:
Corollary 4.
Any canonical round algorithm that solves crusader agreement with has an execution in which a correct process does not decide by round , for any integer .
Crusader agreement is a special case of connected consensus [4], with parameter . It is easy to see that connected consensus for any is a nontrivial convergence problem, and as a special case, so is gradecast [13].
Approximate agreement on the real numbers with parameter [10] is defined as follows. Processes start with arbitrary real numbers and correct processes must decide on real numbers that are at most apart from each other. Decisions must also be valid, i.e., contained in the interval of the inputs of correct processes. To show approximate agreement is a nontrivial convergence problem, choose any two real numbers whose difference is greater than as the two distinguished inputs and two distinguished decisions. This implies:
Corollary 5.
Consider a canonical round algorithm that solves -approximate agreement with . If the input range includes and such that , then there is an execution in which some correct process does not decide by round , for any integer .
A similar result can be shown for approximate agreement on graphs [8], provided that the graph has two vertices that are at distance 2 apart.
Reliable broadcast [5] and its weaker variant consistent broadcast [6] are defined with one designated sender process . The sender has an input value that it wants to broadcast. Processes may terminate by accepting a message from the sender. Consistent broadcast ensures that if the sender is correct then eventually all correct processes accept ’s input (validity), and all correct processes that accept a value from sender , accept the same value (agreement). Reliable broadcast further ensures that if some correct process accepts a value from then eventually all correct processes accept a value from (relay).
Consider the following algorithm for crusader agreement, assuming , which uses concurrent instantiations of consistent broadcast. Each process consistently broadcasts its input and then waits to accept values from consistent broadcast instances. If it accepts copies of some value , then it decides , otherwise it decides . Validity for crusader agreement is immediate since when all correct processes start with , they all accept at least copies of and thus decide . To argue agreement, assume for contradiction that one correct process accepts copies of and another correct process accepts copies of . By the agreement property of consistent broadcast, these messages are broadcast by different processes, and so , which implies , a contradiction.
Note that the algorithm simply waits for the termination of out of concurrent invocations of consistent broadcast. Thus, a consistent broadcast algorithm that terminates within canonical rounds would yield a crusader agreement algorithm that terminates within canonical rounds, contradicting Corollary 4. This implies:
Corollary 6.
Any canonical round algorithm for consistent or reliable broadcast, with , has an execution in which a correct process does not terminate by round , for any integer .
Gather [7, 13, 1, 2] is an extension of consistent broadcast in which all processes broadcast their value, and accept values from a large set of processes. Beyond properties inherited from consistent broadcast, most notably, that if two correct processes accept a value from another process, it is the same value, gather also ensures that there is a common core of values that are accepted by all correct processes.
A very simple transformation shows that a gather algorithm can be used to solve crusader agreement, with no extra cost: Process gathers the input values in a set , and if some value appears at least times in , then it decides on ; otherwise, it decides on . A counting argument shows that a value appearing times in a correct process ’s variable must appear at least times in the common core . Since at most one value can appear times in , all correct processes that decide on a non- value decide on the same value. Note that the transformation does not add any communication on top of the gather submodule. Thus, a gather algorithm that terminates within canonical rounds would yield a crusader agreement algorithm that terminates within canonical rounds, contradicting Corollary 4. This implies:
Corollary 7.
Any canonical round algorithm for gather with has an execution in which some correct process does not terminate by round , for any integer .
4 Discussion
We have shown that for many important building blocks for Byzantine fault-tolerance with maximal resilience, canonical round algorithms require an unbounded number of rounds. Given that every round requires all-to-all communication, this also means an unbounded number of messages, although many of them are expected to be empty. For crash failures, there is an approximate agreement algorithm [10] that works in a logarithmic number of canonical rounds when (which is the optimal resilience). Thus the anomaly of requiring an unbounded number of canonical rounds when resilience is optimal does not (always) occur with crash failures.
References
- [1] Ittai Abraham, Yonatan Amit, and Danny Dolev. Optimal resilience asynchronous approximate agreement. In OPODIS, pages 229–239. Springer, 2004. doi:10.1007/11516798_17.
- [2] Ittai Abraham, Philipp Jovanovic, Mary Maller, Sarah Meiklejohn, Gilad Stern, and Alin Tomescu. Reaching consensus for asynchronous distributed key generation. In Proceedings of the 2021 ACM Symposium on Principles of Distributed Computing, pages 363–373. Association for Computing Machinery, 2021. doi:10.1145/3465084.3467914.
- [3] Hagit Attiya and Jennifer L. Welch. Distributed computing: fundamentals, simulations, and advanced topics. John Wiley & Sons, 2004.
- [4] Hagit Attiya and Jennifer L. Welch. Multi-valued connected consensus: a new perspective on crusader agreement and adopt-commit. In 27th International Conference on Principles of Distributed Systems (OPODIS), pages 6:1–6:23, 2023. doi:10.4230/LIPICS.OPODIS.2023.6.
- [5] Gabriel Bracha. Asynchronous byzantine agreement protocols. Information and Computation, 75(2):130–143, 1987. doi:10.1016/0890-5401(87)90054-X.
- [6] Christian Cachin, Klaus Kursawe, Frank Petzold, and Victor Shoup. Secure and efficient asynchronous broadcast protocols. In Joe Kilian, editor, Advances in Cryptology — CRYPTO 2001, pages 524–541. Springer, 2001. doi:10.1007/3-540-44647-8_31.
- [7] Ran Canetti and Tal Rabin. Fast asynchronous Byzantine agreement with optimal resilience. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing (STOC), pages 42–51, 1993. doi:10.1145/167088.167105.
- [8] Armando Castañeda, Sergio Rajsbaum, and Matthieu Roy. Convergence and covering on graphs for wait-free robots. Journal of the Brazilian Computer Society, 24:1–15, 2018. doi:10.1186/S13173-017-0065-8.
- [9] Danny Dolev. The Byzantine generals strike again. Journal of Algorithms, 3(1):14–30, 1982. doi:10.1016/0196-6774(82)90004-9.
- [10] Danny Dolev, Nancy A. Lynch, Shlomit S. Pinter, Eugene W. Stark, and William E. Weihl. Reaching approximate agreement in the presence of faults. J. ACM, 33(3):499–516, 1986. doi:10.1145/5925.5931.
- [11] Alan D. Fekete. Asynchronous approximate agreement. In Proceedings of the Sixth Annual ACM Symposium on Principles of Distributed Computing, pages 64–76, 1987.
- [12] Alan D. Fekete. Asynchronous approximate agreement. Information and Computation, 115(1):95–124, 1994. doi:10.1006/INCO.1994.1094.
- [13] Pesech Feldman and Silvio Micali. An optimal probabilistic protocol for synchronous Byzantine agreement. SIAM J. Comput., 26(4):873–933, 1997. doi:10.1137/S0097539790187084.
- [14] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. J. ACM, 32(2):374–382, 1985. doi:10.1145/3149.214121.
