Abstract 1 Introduction 2 Preliminaries 3 Layered Circuits as ๐’„-๐—™๐—ฎ๐˜‚๐—น๐˜๐˜†โข๐—–๐—น๐—ถ๐—พ๐˜‚๐—ฒ Algorithms 4 Application: Semi-Ring Matrix Multiplication References

Computing in a Faulty Congested Clique

Keren Censor-Hillel ORCID Technion, Haifa, Israel Pedro Soto ORCID Virginia Tech, Blacksburg, VA, USA
Abstract

We study a Faulty Congested Clique model, in which an adversary may fail nodes in the network throughout the computation. We show that any task of Oโข(nโขlogโกn)-bit input per node can be solved in roughly n rounds, where n is the size of the network. This nearly matches the linear upper bound on the complexity of the non-faulty Congested Clique model for such problems, by learning the entire input, and it holds in the faulty model even with a linear number of faults.

Our main contribution is that we establish that one can do much better by looking more closely at the computation. Given a deterministic algorithm ๐’œ for the non-faulty Congested Clique model, we show how to transform it into an algorithm ๐’œโ€ฒ for the faulty model, with an overhead that could be as small as some logarithmic-in-n factor, by considering refined complexity measures of ๐’œ.

As an exemplifying application of our approach, we show that the Oโข(n1/3)-round complexity of semi-ring matrix multiplication [Censor-Hillel, Kaski, Korhonen, Lenzen, Paz, Suomela, PODC 2015] remains the same up to polylog factors in the faulty model, even if the adversary can fail 99% of the nodes (or any other constant fraction).

Keywords and phrases:
distributed computing, graph algorithms, computing with faults
Funding:
Keren Censor-Hillel: is supported in part by the Israel Science Foundation, grant 529/23.
Copyright and License:
[Uncaptioned image]โ€‚ยฉ Keren Censor-Hillel and Pedro Soto; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation โ†’ Distributed algorithms
Related Version:
Full Version: https://arxiv.org/abs/2505.11430 [13]
Acknowledgements:
We thank Orr Fischer, Ran Gelles, and Merav Parter for useful discussions.
Editors:
Andrei Arusoaie, Emanuel Onica, Michael Spear, and Sara Tucci-Piergiovanni

1 Introduction

Distributed systems are prone to failures by their nature, and thus coping with faults in distributed computing has been extensively studied since the dawn of this research area [1, 39, 47]. In this work, we address the ๐–ข๐—ˆ๐—‡๐—€๐–พ๐—Œ๐—๐–พ๐–ฝโข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model (or ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ for short) [38], in which n nodes of a network communicate in synchronous rounds by exchanging Oโข(logโกn)-bit messages between every pair of nodes in each round. This model is heavily studied, through the lens of various computing tasks, such as algebraic computations [9, 21, 10], graph problems such as computing an MST [38, 27, 26, 35, 34, 41], computing distances and spanners [30, 40, 17, 3, 18, 2, 45], computing local tasks [29, 22, 23, 12, 15, 44, 43, 14], optimization and approximation algorithms [29, 25, 24], subgraph finding [19, 16, 42, 4, 32, 33, 20, 6, 7] and many more [46, 36, 28, 11] (see the full version [13] for additional related work).

We consider deterministic algorithms, and study a faulty version of the ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model, which we refer to as the ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model. In this model an adversary may fail nodes in the network throughout the computation, such that a failed node cannot continue to participate in the computation. We capture the budget of failures that the adversary has as a parameter c, such that at least n/c nodes must remain non-faulty. Thus, the adversary may fail (cโˆ’1c)โขn nodes throughout the execution of an algorithm (see Section 2 for a formal definition of the model and the additional concepts used throughout the paper). We ask:

How efficiently can one compute in the ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model?

Our contribution is a scheme that converts any ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm into a ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm. The goal of such a scheme is to incur the smallest possible overhead to the round complexity of the algorithm. We exemplify the strength of our scheme by showing that it allows multiplying two matrices over a semi-ring in O~โข(n1/3) rounds, thus retaining its non-faulty complexity from [9] up to polylog factors. This is more than quadratically faster compared to the complexity of n2/3+oโข(1) that can be obtained for this problem from the follow-up work of [5].

Background.

Without further restrictions on the adversary, it may fail a node at the very first round of the algorithm, preventing the others from ever completing the computation since they can never access the failed nodeโ€™s input. Note that loss of information is not allowed since the output may depend on the original inputs, as opposed to some other fault models that allow the output to depend only on the inputs of non-faulty nodes. Thus, the ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model must allow at least one initial quiet round, in which no faults can occur. We therefore consider the number of quiet rounds that are needed for a ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm as an additional complexity measure which we aim to minimize. We emphasize that our method will require a very small constant number of quiet rounds. This does not render the model trivial for any task which can be solved in Oโข(1) rounds in the non-faulty ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model, because if its exact complexity is larger than our constant then it cannot be entirely executed during the quiet rounds (also note that we do not modify the bandwidth by any constant, but rather we stick to a certain given bandwidth of bโขlogโกn bits, as we cannot change the communication model based on the task that we wish to solve).111[5] work with already-coded inputs and thus avoid the notion of quiet rounds. Here, we care about how the input is coded and hence we need these constant number of preprocessing rounds.

Similarly, the adversary can fail a node at the end of the computation, preventing its output from being accessible. Thus, in the ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model, the output requirement is modified. Yet, some care should be taken when doing so. We denote by ๐’œw the output that node w should hold in a non-faulty execution of ๐’œ, and we would like to require that for every node w there is a node u which holds ๐’œw. However, this is still insufficient because the adversary may now fail u. Instead, we demand that for every w, the output ๐’œw is encoded in the network such that any node u can obtain it within some number of R rounds of communication, even if additional nodes fail, as long as u itself does not fail. We call this the decodability complexity of the algorithm. Thus, a ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm has three complexity measures: the number of quiet rounds, the number of additional (not necessarily quiet) rounds, and the number of rounds required for obtaining an output.

Our contribution is a method for computing in the ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model that yields the following. First, a simplified usage of it gives that any task that has Oโข(nโขlogโกn) bits of input per node can be solved in O~โข(n) rounds in the ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model, for constant c (our results hold for larger values of c as well, with the actual dependence on c being polynomial in c). This nearly matches the upper bound of linear round complexity of computing such tasks in the non-faulty ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model, and is similarly obtained by learning all the inputs.

Second, our main result improves upon this greatly for computing certain functions f: To this end, we identify two particular parameters which, informally, capture the locality of communication and the locality of computation that a certain computation requires, and show that our result obtains a super-fast computation of functions for which these parameters are not large (see Section 2 for formal definitions). For such functions, the complexity overhead of our approach can be as low as O~โข(1) for a constant c. This insight is what allows us to compute the semi-ring product of two matrices, whose complexity in the ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model is Oโข(n1/3) due to [9], in O~โข(n1/3)-rounds in the ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model.

1.1 Our Contribution

Our treatment of ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ and ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithms goes through circuits, as was also done in the follow-up work of [5], and is aligned with prior work of [19] which show how to compute circuits with ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithms. For completeness and for setting the ground with the terminology that we need, we define in Section 2 the concept of layered circuits. Roughly speaking, these are circuits whose gates can be nicely split into layers, with wires only between two consecutive layers. When we compute a circuit by a ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ or ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm, we assign each node with the task of computing some of the gates in a layer, which we refer to as this nodeโ€™s part. Note that a node always refers to a computing component in the ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ or ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model, while we use gates and wires to refer to circuits.

Layered circuits can be parametrized by what we call their parallel partition parameter. Informally, this parameter relates to the number of gates that have wires into and from the gates of any nodeโ€™s part. This captures the amount of โ€œcommunicationโ€ between layers when we later interpret them as ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithms. Intuitively, we say that a layered circuit has a parallel partition of nฯ† when its gates have wires into/from Oโข(n1+ฯ†) gates in other parts. Thus, when we compute a layer of the circuit by a ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm, the multiplicative overhead beyond the n elements that each node can send/receive in a round is Oโข(nฯ†). This means that the ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm will incur this number of rounds per layer that it computes. It is not hard to show that any ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm that runs in T rounds has a circuit of depth Oโข(T) with a 1-parallel partition (ฯ†=0). Notably, this does not depend on any non-adaptivity assumption on the ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm (that is, this applies also to algorithms in which the communication pattern may depend on the inputs). We show this in the full version [13].

Our key insight, is that for handling faults in the ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model, we can exploit additional parameters of a circuit that we wish to compute. To this end, we define two additional parameters, which we call communication locality and computation locality. These help us capture the amount of โ€œlocalityโ€ between layers when we later interpret them as ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithms. Informally, the computation locality of a layered circuit is nฮถ when in each layer, each of the n parts can be split to Oโข(nฮถ) pieces of n gates (for a total of at most Oโข(n1+ฮถ) gates). The communication locality is nฮพ, when for each of the n parts, the number of pieces of other parts that have incoming wires into it is at most Oโข(nฮพ).

Intuitively, the computation locality measures the amount of information that a node needs to store in order to proceed with computing a layer of a circuit. The reason that this plays a role is because in the ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model we will need a non-faulty node u to simulate the parts of the computation of a faulty node w. To do that, the non-faulty node u will need to recover information that corresponds to the state of the faulty node w. However, it may be that w does not need its entire state, and that it can instead be compressed or partially โ€œforgottenโ€. This will allow a better complexity for the ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm. The communication locality captures the number of ฮ˜โข(n)-sized pieces of other nodes, which a non-faulty node u has to get in order to simulate the computation of a faulty node w in a layer. That is, our algorithm would benefit from a circuit in which if a node u needs information from a node w then it needs entire pieces of n bits rather than smaller chunks of information. One can think of the information of w as checkpointed โ€œin the cloudโ€, and a smaller communication locality will also reduce the complexity of the ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm.

The reason that communication locality only addresses incoming wires is that the parallel partition parameter nฯ† already takes care of communication regardless of faults. However, in the ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model we cannot trust the ability of a node to send its outgoing messages because it or other nodes may be faulty. Thus, it still remains to capture the number of coded pieces that a node has to collect in order to compute its gates in the upcoming layer.

We are now ready to state our main result (proven in Section 3). For simplicity, we assume throughout the paper that c is a constant (our results apply to all values of c with an additional dependence on c). We prove that every layered circuit can be computed by a ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm, with Oโข(1) quiet rounds and decodability, and with a round complexity that depends on the above parameters, as follows.

Theorem 1 (Computing a layered circuit by a ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm).

Let ๐’ž be a circuit of depth nฮด with alphabet size |ฮฃ|=bโขlogโก(n) that has an nฯ†-parallel partition with computation locality nฮถ and communication locality nฮพ. Let ฮผ be such that Oโข(n1+ฮผ) bounds the max size of the output per part. For every constant c, there is an algorithm ๐’œ in the c-๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ that computes the function f๐’ž with Oโข(1) quiet rounds, a (non-quiet) round complexity of Oโข(c2โข(nฮด+ฯ†+ฮท+nฮผ)โขlogโกn), where ฮท=maxโก{ฮถโˆ’ฯ†,ฮพโˆ’ฯ†}, and decodability complexity of R=Oโข(nฮผ).

The layered circuit we obtain in the full version [13] for a general ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm with T rounds has a depth of Oโข(T), which is nฮด for ฮด=lognโกT, has an nฯ†-parallel partition with ฯ†=0, communication locality of nฮถ for ฮถ=lognโกT, and computation locality of nฮพ for ฮพ=1. Thus, Theorem 1 gives us a c-๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm with a round complexity of Oโข(Tโขnโขlogโกn), which is a roughly linear-in-n overhead for a constant c. This is expensive for a general function f, as one can use a trivial circuit that has depth 1 to get a O~โข(n) round complexity (in which one can verify that ฮด=0, ฮผ=0, ฯ†=1, ฮพ=0, ฮถ=1, and hence ฮท=0), by essentially having each node learn all inputs. Yet, if there is a function f for which one can construct a better circuit than the general construction we give in the full version [13], then one can get a smaller overhead compared with the ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm. Indeed, this is exactly what we exemplify for semi-ring matrix multiplication in Section 4.

Theorem 2 (A ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm for semi-ring matrix multiplication).

Suppose that the inputs on the nodes are matrices A,Bโˆˆฮฃnร—n, where each node begins with n coefficients of A and B and ฮฃ is a semi-ring, then the value C=Aโ‹…B can be computed in the c-๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ with Oโข(1) quiet rounds, a round complexity of Oโข(c2โขn1/3โขlogโกn), and Oโข(1)-decodability.

In particular, Theorem 2 means that for any constant c, we incur only a logarithmic overhead in the round complexity over the non-faulty ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model, in which this task can be solved in Oโข(n1/3) rounds due to [9]. Following up on our work, [5] present a different scheme, which yields a complexity of T2โ‹…noโข(1) rounds in the ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model for any T-round ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm. While for some ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithms their approach improves upon ours, for others our approach is significantly faster: their scheme would give a complexity of n2/3+oโข(1) for semi-ring matrix multiplication, which is quadratically slower compared to ours. We emphasize that matrix multiplication is just an example, and more cornerstone tasks can enjoy our approach by constructing circuits that are efficient for the use of Theorem 1.

The way that the above is obtained is by compressing the communication rounds into a constant number of layers, while still having small computation and communication locality parameters, that are equal to the parallel partition parameter nฯ† (that is, ฮถ=ฮพ=ฯ†). Thus our algorithm incurs no additional overhead. We say that such a circuit has an nฯ†-local parallel partition. That is, since our Theorem 1 shows that we can trade off ฮด and ฯ† when constructing the circuit, we leverage this while exploiting the concurrency depending on the refined locality parameters.

This exemplifies an immediate corollary of Theorem 1, which is that circuits with small local parallel partitions are the optimal choice for invoking Theorem 1, in the sense that our transformation incurs the smallest overhead for them when implementing them in the ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model. This should be viewed through the lens of what we can say about the complexity of their ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ implementation, which is Oโข(nฮด+ฯ†) round, as we prove in the full version [13]. Note that this implies that layered circuits are equivalent to ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithms.

A remark on quiet rounds.

Matrix multiplication is an excellent example for why we cannot simply replace quiet rounds by an assumption of coded inputs. If we do not carefully shuffle the input matrices before encoding them, then the nodes will be forced to download (decode) too much information compared to what they need, resulting in communication and computation localities that are extremely higher compared to those of the circuit that we construct for proving Theorem 2. More explicitly, if we want a matrix multiplication algorithm to re-distribute the entries of the second input matrix among the nodes such that each node holds a column of the matrix rather than a row, or if we want each node to hold any other pattern of n entries as we do in Section 4, then we need the information to already be encoded in this manner, as otherwise a node has to decode too much information in order to obtain its relevant entries. This could result in a huge number of rounds for a task that takes Oโข(1) rounds in the non-faulty ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model.

Regimes of lower number of faults.

Finally, we show that if the bound on the number of failures is sublinear, then our method allows circuits with relaxed properties to be implemented by fast ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithms. The high level intuition is that if we are promised no more than (cโˆ’1)โขnฯ‡/c faults, we can replace the usage of codewords of length n with n1โˆ’ฯ‡ disjoint codewords of length nฯ‡. Since these are only technical modifications, we defer them to the full version [13]. This allows us to further extend our results for fast (ring) matrix multiplication, for which we show the following in the full version [13].

Theorem 3 (A ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm for Fast (Ring) Matrix Multiplication ).

Suppose that the inputs on the nodes are matrices A,Bโˆˆฮฃnร—n, where each node begins with n coefficients of A and B and ฮฃ is a ring, then the value C=Aโ‹…B can be computed in the (ฯ‡,c)-๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ with Oโข(1) quiet rounds, a round complexity of Oโข(maxโก{c2โขnฯ‡โขlogโกn,c2โขn1โˆ’2/ฯ‰โขlogโกn}), and Oโข(1)-decodability.

Here ฯ‰ is the exponent of matrix multiplication and Oโข(n1โˆ’2/ฯ‰) is the complexity of ring matrix multiplication in the non-faulty setting [8].

1.2 Technical Overview

To prove Theorem 1, our approach is as follows. First, we partition the gates of each layer of the circuit into its respective parts in the parallel partition, and assign one node to each part. In the first Oโข(1) quiet rounds, we shuffle the inputs of the nodes such that each node holds the input wires to its gates, using Lenzenโ€™s routing scheme [36]. Then, each node w encodes its state to prepare for the possibility of a failure, and sends one piece of the codeword to every other node. The code that is used has to be able to tolerate (cโˆ’1c)โขn erasures, as this is the number of nodes that may fail (a node that has already failed simply does not receive its piece from w). We use [n,n/c,(cโˆ’1c)โขn+1]q codes, where q is a field whose elements are of ฮ˜โข(cโขbโขlogโกn) bits, and thus we need c=Oโข(1) additional quiet rounds for this step.

We then split the layers of the circuit into epochs, where each epoch consists of computation layers and ends with a communication layer. On a high level, we say that a layer is a computation layer if for each part in the layer (a part is a set of gates assigned to a node for computing them), all of its output wires go in the next layer to the part that is assigned to the same node. This means that a node w computing its gates in the next layer already has all the information it needs for the computation without needing to communicate with other nodes. A communication layer is any other layer. By the way we split into epochs, we get that once a node w receives the information it needs to compute the gates in its part in the first layer of the epoch, it can compute the gates in its parts in the rest of the layers of the epoch. Eventually, we would want to have a circuit with as few epochs as possible. We emphasize that an algorithm may be represented by many different circuits and, indeed, given an algorithm, one needs to design a circuit that gives the parameters for which our result yields the best complexity measures.

In a non-faulty ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm, we would simply precede the above with communication between the nodes in order for w to obtain the information it needs for computing the gates in its part in the first layer of the epoch (which is indeed what we do in the full version when we compute a circuit in the non-faulty setting). However, in a ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm, it is possible for the node w to fail, in which case some node u takes its role in computing the gates of w in each layer in the epoch. Thus, each node w encodes its state at the end of an epoch and sends one piece of the codeword to every other node. In the same spirit, at the beginning of the epoch, a node u that simulates w needs to first collect the pieces of the codeword. Communicating in order to obtain the required information for the epoch takes Oโข(cโขnฮพ) rounds, since there are nฮพ codes that need to be recovered, due to nฮพ being the communication locality of the circuit. Encoding the information at the end of the epoch takes Oโข(cโขnฮถ) rounds, since there are nฮถ codes that need to be used, due to nฮถ being the computation locality of the circuit. Thus, this approach takes Oโข(cโข(nฮถ+nฮพ)) rounds per epoch. If every part of the last layer gets successfully encoded, this also implies decodability within a number of rounds that equals its output divided by n, as n data elements can be routed to/from each node in a single round (again due to [36]).

However, the above is insufficient, because after a node w fails, it may be that another node u which simulates w also fails. If we proceed with simulating w by yet another node uโ€ฒ, we may incur a round complexity of the number of failures โ€“ an unacceptable (cโˆ’1c)โขn rounds.

Thus, we simulate nodes in a more careful manner. For each epoch, we progress by attempts for simulating the failed nodes of the epoch. In a certain attempt, if the number of simulation tasks that still need to occur for an epoch is greater or equal to the number of currently non-faulty nodes, then we let each non-faulty node simulate one of them. This promises that the attempt successfully simulates at least n/c failed nodes, as this number of nodes is guaranteed to remain non-faulty. This type of attempt can only occur Oโข(c) times.

The other case is when the number of remaining failed nodes that need to be simulated drops below the number of currently non-faulty nodes. In this case we simulate every such node by a multiplicity of non-faulty nodes, which implies that although we cannot argue that n/c such tasks succeed, we have that it is still hard for the adversary to prevent such a task from succeeding because it would have to fail all the nodes that are assigned to it. We show that we make progress by succeeding in at least a constant fraction of the remaining tasks, resulting in at most a logarithmic number of such attempts. To make our analysis go through, we actually need to batch these remaining tasks into batches of size 3โขc, each of which is handled by some carefully chosen multiplicity of non-faulty nodes.

Thus, one of the c factors in our complexity is due to coding, and the other is due to either batching into Oโข(c)-sized batches or requiring c attempts per epoch (these latter two are disjoint events). Finally, we incur an additive Oโข(c2โขnฮผโขlogโกn) for checkpointing the outputs. Note that if ฮผ is larger than the computation locality of the last epoch, then we can replace the computation of the last epoch and instead consider it as part of the decodability complexity: simply decode the checkpointed values from the penultimate epoch and compute the last epoch (that is, simulate the last computation epoch as part of the decodability).

We mention that, informally speaking, coding is essential for ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithms. We show in the full version [13] that if the messages of the quiet rounds depend only on the input of each node and not on messages it receives, then the messages of c quiet rounds must form codewords of an [n,n/c,(cโˆ’1c)โขn+1]q error-correcting code.

2 Preliminaries

2.1 The Model

Definition 4 (The ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model).

The ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model is a synchronous communication model on n nodes, where in each round every pair of nodes can exchange bโขlogโกn-bit messages, for some constant b. In an algorithm ๐’œ, each node has an input in ฮฃn for an alphabet ฮฃ of size 2bโขlogโกn, and after executing ๐’œ each node holds an output in ฮฃM for some value of M.

Definition 5 (The ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model).

The c-๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model is similar to the ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model, except that an adversary may fail up to a fraction of (cโˆ’1)/c of the nodes throughout the execution of an algorithm (that is, n/c nodes must be non-faulty). A node that is failed in a certain round, does not send or receive any message from that round on. The ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model allows an initial number of quiet rounds, in which no faults can occur.

Definition 6 (The complexity of a ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm).

An algorithm ๐’œ for the ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model has three complexity parameters.

The first complexity measure is the number of quiet rounds it requires. The other two complexity measures are the round complexity and decodability complexity, defined as follows.

Let ๐’œ be an algorithm for the ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model, and consider a non-faulty execution โ„ฐnon-faulty of ๐’œ. Denote by ฯƒw,Tnon-faulty the state that node w has at the end of round Tnon-faulty in โ„ฐnon-faulty.

For a round Tnon-faulty of โ„ฐnon-faulty, we say that a possibly faulty execution โ„ฐ of ๐’œ satisfies at round T the R-decodability condition if the following holds. Suppose every node u is associated with some other node wu, then it is possible in R rounds for each node u that is non-faulty at the end of these R rounds to obtain ฯƒwu,Tnon-faulty. (Informally, the only way the adversary can prevent u from receiving knowledge of any other nodeโ€™s state after the next R rounds is by failing u itself.)

We stress that during these additional rounds the adversary may continue to fail nodes up to its given budget. We say that the execution โ„ฐ is finished executing ๐’œ in T rounds with R-decodability, if at the end of T rounds, it satisfies the R-decodability condition for the last round Tnon-faulty of โ„ฐnon-faulty (note that it is possible that for other rounds of โ„ฐnon-faulty, the decodability condition in โ„ฐ holds with larger values than R).

The maximum values of T and R over all executions are the round complexity and the decodability complexity of ๐’œ, respectively.

2.2 Layered Circuits

We define layer circuits and pinpoint which of their parameters captures the complexities of ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ and ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithms that compute them.

Definition 7 (fan-in, fan-out).

For a vertex v in a directed graph G, let inโข(v):={wโˆˆVโˆฃ(w,v)โˆˆE} and outโข(v):={wโˆˆVโˆฃ(v,w)โˆˆE}. The values |inโข(v)| and |outโข(v)| are called the fan-in and fan-out of v, respectively.

Definition 8 (layered circuit).

A layered circuit ๐’žn=๐’ž=(V,E) of depth nฮด over an alphabet ฮฃ is a connected directed graph with vertex set V=V0โˆชV1โˆชโ€ฆโˆชVnฮด and edge set EโŠ‚โˆชiโˆˆ[nฮด]Viร—Vi+1, where each vertex (gate) w with fan-in F is labeled by a function fw:ฮฃFโ†’ฮฃ.

We say that a layered circuit ๐’ž with V0={vi(0)}iโˆˆ[kiโขn] and Vnฮด={vi(nฮด)}iโˆˆ[koโขuโขt] computes a function f:ฮฃkiโขnโ†’ฮฃkoโขuโขt if the following recursively defined function f๐’ž is equal to f. The function f๐’ž is defined as the function whose output is (fv1(nฮด)๐’ž,โ€ฆ,fvkoโขuโขt(nฮด)๐’ž) where

  1. 1.

    For every 1โ‰คiโ‰คkiโขn, fvi(0):ฮฃโ†’ฮฃ is the identity function and fvi(0)๐’ž=fvi(0)โข(xi)=xi, where xi is the i-th input of f.

  2. 2.

    For every 1โ‰คโ„“โ‰คnฮด and every i such that vi(โ„“)โˆˆVโ„“ with fan-in Fโ„“,i, there are gates vi,jโ„“โˆ’1 for jโˆˆ[Fโ„“,i], such that fvi(โ„“)๐’ž=fvi(โ„“)โข(fvi,1(โ„“โˆ’1)๐’ž,โ€ฆ,fvi,Fโ„“,i(โ„“โˆ’1)๐’ž).

We make use of the following parameter of layered circuits, which we call the parallel partition parameter, which captures the amount of โ€œcommunicationโ€ between nodes for computing a layer of the circuit in a ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm.

Definition 9 (parallel partition).

Let Gn=G=(V,E) be a bipartite directed graph with V=LโˆชR and EโŠ‚Lร—R. For a pair of partitions L=L0โˆชL1โˆชโ€ฆโˆชLnโˆ’1 and R=R0โˆชR1โˆชโ€ฆโˆชRnโˆ’1, let

EiL:={(โ„“,r)โˆฃโ„“โˆˆLiโขย andย โขโˆƒjโ‰ i,rโˆˆRj},EjR:={(โ„“,r)โˆฃrโˆˆRjโขย andย โขโˆƒiโ‰ j,โ„“โˆˆLi}.

We say that G has a parallel partition of size n with block fan size n1+ฯ† if there exists such a pair of partitions in which for every i,jโˆˆ[n], |EiL|,|EjR|=Oโข(n1+ฯ†). We say that a layered circuit ๐’ž=(V,E) of depth nฮด has an nฯ†-parallel partition, if there is a refinement ๐’ซ of the partition of V=V0โˆชV1โˆชโ€ฆโˆชVnฮด such that for every i, the restriction of ๐’ซ to each of the subgraphs ViโˆชVi+1 is a parallel partition of size n with block fan size n1+ฯ† with L=Vi, R=Vi+1.

The following parameters of layered circuits, which we call communication locality and computation locality, help us capture the amount of โ€œlocalityโ€ between layers when we later interpret them as ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithms.

Definition 10 (Computation locality and Communication locality).

Let ๐’ž=(V,E)=(โˆชiVi,E) be a layered circuit of depth nฮด and an nฯ†-parallel partition with respect to a refinement ๐’ซ of V. Let Pi,j denote the j-th part of Vi in ๐’ซ.

We say that ๐’ž has computation locality nฮถ and communication locality nฮพ if there is a constant h such that for all Pi,w there exists a further (not necessarily disjoint) subdivision Pi,w=:โ‹ƒjโˆˆ[hโขnฮถ]Pi,w(j) such that |Pi,w(j)|=n. For each u, the number of pairs j,w such that wโ‰ u for which Pi,w(j) has a wire into Pi+1,u is at most hโ€ฒโขnฮพ for some constant hโ€ฒ. We denote by bโขiโขnโข(Pi+1,u) the parts corresponding to those pairs, that is, the parts Pi,w(j) (where wโ‰ u) in Vi that consist of at least one gate that has a wire into a gate in the part Pi+1,u.

Definition 11 (Local parallel partition).

We say that a layered circuit ๐’ž=(V,E)=(โˆชiVi,E) of depth nฮด has an nฯ†-local parallel partition, if it has an nฯ†-parallel partition for a refinement V=โˆชPโˆˆ๐’ซP, with a computation locality of Oโข(nฯ†) and a communication locality of Oโข(nฯ†).

3 Layered Circuits as ๐’„-๐—™๐—ฎ๐˜‚๐—น๐˜๐˜†โข๐—–๐—น๐—ถ๐—พ๐˜‚๐—ฒ Algorithms

We now prove our main result about how to compute a layered circuit in the c-๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model.

Theorem 1 (Computing a layered circuit by a ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm). [Restated, see original statement.]

Let ๐’ž be a circuit of depth nฮด with alphabet size |ฮฃ|=bโขlogโก(n) that has an nฯ†-parallel partition with computation locality nฮถ and communication locality nฮพ. Let ฮผ be such that Oโข(n1+ฮผ) bounds the max size of the output per part. For every constant c, there is an algorithm ๐’œ in the c-๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ that computes the function f๐’ž with Oโข(1) quiet rounds, a (non-quiet) round complexity of Oโข(c2โข(nฮด+ฯ†+ฮท+nฮผ)โขlogโกn), where ฮท=maxโก{ฮถโˆ’ฯ†,ฮพโˆ’ฯ†}, and decodability complexity of R=Oโข(nฮผ).

An immediate corollary of Theorem 1 is that circuits with small local parallel partitions are optimal if the output size is not huge, in the sense that our transformation incurs the smallest overhead for them when implementing them in the ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model, compared to what we can say about their ๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ implementation, which is Oโข(nฮด+ฯ†) rounds (see the full version for details).

Corollary 12.

Let ๐’ž be a circuit of depth nฮด with alphabet size |ฮฃ|=bโขlogโก(n) that has an nฯ†-local parallel partition (i.e., with computation locality nฯ† and communication locality nฯ†). Let ฮผ be such that Oโข(n1+ฮผ) bounds the max size of the output per part. For every constant c, there is an algorithm ๐’œ in the c-๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ that computes the function f๐’ž with Oโข(1) quiet rounds, a (non-quiet) round complexity of Oโข(c2โข(nฮด+ฯ†+nฮผ)โขlogโกn), and decodability complexity of Oโข(nฮผ).

To prove the theorem, we make use of the following codes.

Definition 13 (Error-correcting codes).

An [N,K,d]q code is a mapping eโขnโขc:(๐”ฝq)Kโ†’(๐”ฝq)N, such that the Hamming distance between any two codewords s1,s2โˆˆIโขmโขaโขgโขeโข(eโขnโขc)โІ(๐”ฝq)N is at least d. By its definition, an [N,K,d]q code can correct up to dโˆ’1 erasures, that is, given a string sโ€ฒโˆˆ(๐”ฝq)N, there is at most one codeword sโˆˆIโขmโขaโขgโขeโข(eโขnโขc)โІ(๐”ฝq)N which equals sโ€ฒ up to at most dโˆ’1 erasures of symbols in (๐”ฝq)N.

Lemma 14.

Let n and c be some integers. Let p be a prime number and let k be an integer such that the prime power q=pk satisfies logโกq=logโก(pk)=ฮ˜โข(cโขbโขlogโก(n))=ฮ˜โข(cโขlogโก|ฮฃ|), then there exists an [N,K,d]q:=[n,n/c,cโˆ’1cโขn+1]q code.

Proof.

The existence of a code with these parameters (d=Nโˆ’K+1) is a classical result in coding theory; in particular, since qโ‰ฅn (i.e., since the field size is bigger than the length of the code) we have that the classical construction of Reed-Solomon codes suffices. See Section 6.8 of [37] or Section 5.2 of [31]. โ—€

โ–ถย Remark 15.

Note that if we encode nโขbโขlogโกn bits of information then the length of a codeword is cโขnโขbโขlogโกn. However, we consider the codeword as having n symbols, each one held by a different node in the ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model. Since faults are in terms of nodes, we lose exactly cโขbโขlogโกn bits of information per fault, and so when we consider the codeword as having n symbols, the size of each symbol in the alphabet becomes cโขbโขlogโกn.

Proof of Theorem 1.

Construction of the algorithm.

We construct ๐’œ as follows:

  1. 1.

    We associate node w with the gates in part P0,w. In the first Oโข(1) quiet rounds, the nodes shuffle their data so that each node w holds the inputs to its gates. This is possible by Lenzenโ€™s routing scheme [36].

  2. 2.

    In the subsequent c=Oโข(1) quiet rounds, each node w encodes all of its input and sends a coded piece to each of the other nodes, as follows: denote by Encโข(_) multiplication on the right by the Kร—n generator matrix corresponding to the code given by Lemma 14, and define the codeword P~0,w=Encโข(P0,w)โˆˆ๐”ฝqn. Note that the total number of bits in a codeword is at most cโขnโข|ฮฃ|, since the size of the input of each node is nโข|ฮฃ| and the first layer is the identity function by Definition 8. We refer to the action of encoding the output wires of a part and splitting the pieces of the codeword to all nodes as checkpointing this information. Thus, at this point, all nodes have checkpointed the 0-th layer of the circuit.

  3. 3.

    We now recursively describe the method by which the nodes use the checkpointed values P~โ„“,w(k) of a layer โ„“ to compute the values in Pโ„“+1,w and then checkpoint them (but notice that from this point on, we are not guaranteed to have quiet rounds). The method of collecting coded checkpoints can be thought of as filling out โ€œbingo cardsโ€, which represent the tasks of possibly faulty virtual nodes; in particular, we now have that the data of node w lives โ€œup in the cloudโ€ and obtaining data from w is replaced with obtaining coded pieces from the entire network.

    1. i.

      Communication. Assume that the nodes have checkpointed all of the data from the previous round so that every node has a coefficient of the codeword P~โ„“,v(k):=Encโข(Pโ„“,v(k))โˆˆ๐”ฝqn, where kโˆˆ[hโขnฮถ] (so there are hโขnฮถ codewords of length n). The reason that there are hโขnฮถ such codewords is because nฮถ is the computation locality (see Definition 10). In particular, node u has the value (P~โ„“,v(k))uโˆˆ๐”ฝqn.

      Each node w collects the pieces

      bin~โข(Pโ„“+1,w):={Encโข(Pโ„“,v(kโ€ฒ))โˆฃPโ„“,v(kโ€ฒ)โˆˆbinโข(Pโ„“+1,w)} (1)

      that they need for computing the gates contained in their part Pโ„“+1,w, where kโ€ฒโˆˆ[hโ€ฒโขnฮพ] because nฮพ is the communication locality (see Definition 10). In particular, w collects Pโ„“,v1(k1) in a first set of c rounds, Pโ„“,v2(k2) in a second set of c rounds, โ€ฆ, and Pโ„“,vhโ€ฒโขnฮพ(khโ€ฒโขnฮพ) in a hโ€ฒโขnฮพ-th set of c rounds, where hโ€ฒ is the constant in the definition of communication locality. Therefore, this takes Oโข(cโขnฮพ) rounds.

    2. ii.

      Computation. If for every part Pโ„“,w in this layer, all output wires go to the part Pโ„“+1,w in the next layer (and not to any Pโ„“+1,v for some vโ‰ w), we call this layer a computation layer. Otherwise, we call the layer a communication layer.

      If โ„“ is a computation layer, the node w performs the computation corresponding to the gates in Pโ„“+1,w locally. The node w continues to do these computations locally until there is a layer ฯ„:=โ„“+t for some t, in which either there is an output wire that goes from a part Pฯ„,w for some w into a part Pฯ„+1,v for some vโ‰ w (that is, ฯ„ is a communication layer), or ฯ„ is the output layer. The round complexity of this step is 0, because these are all local computations.

    3. iii.

      Checkpointing. Upon arriving at some communication layer ฯ„, each node w checkpoints and sends each node u the value (P~ฯ„,w(k))u:=(Encโข(Pฯ„,w(k)))uโˆˆ๐”ฝqn for kโˆˆ[hโขnฮถ] (since nฮถ is the computation locality). Checkpointing a code with these parameters requires Oโข(c) rounds, since we have n symbols of size logโกq=Oโข(cโขbโขlogโกn) bits. Thus, this completes in Oโข(cโขnฮถ) rounds.

      We refer to a non-faulty execution of steps (i)-(iii) as an epoch. We have that the complexity of these steps is Oโข(cโข(nฮถ+nฮพ)).

    4. iv.

      Repeat/Fill out Missing Bingo Cards. If all of the parts in a layer have been checkpointed, then the nodes move on to the next epoch; otherwise, they divide the work of the failed nodes and simulate the computations of the gates in their parts until all are checkpointed. In particular, if F nodes failed so far, and of those F, there are Fโˆ— nodes whose parts are yet to have been checkpointed, the nโˆ’F non-faulty nodes evenly divide themselves and simulate those Fโˆ— parts by rerunning steps (i)-(iii). This is done as follows.

      1. a.

        If Fโˆ—>nโˆ’F, then each of the nโˆ’F non-faulty nodes is assigned to one failed node whose part has not been checkpointed yet, and repeats steps (i)-(iii) for that part.

      2. b.

        Otherwise, there are Fโˆ—โ‰คnโˆ’F such parts Pฯ„,v1,Pฯ„,v2,โ€ฆ,PvFโˆ—. We batch these into sets of size at most 6โขc, and let each batch be simulated by multiple nodes, as follows. We denote Fbโขaโขtโขcโขhโˆ—=maxโก{โŒŠFโˆ—/3โขcโŒ‹,1}, and mโขuโขlโขt=โŒŠnโˆ’FFbโขaโขtโขcโขhโˆ—โŒ‹. We denote by x the remainder when dividing nโˆ’F by Fbโขaโขtโขcโขhโˆ— (thus, mโขuโขlโขt=nโˆ’Fโˆ’xFbโขaโขtโขcโขhโˆ—). The non-faulty nodes, w0,โ€ฆ,wnโˆ’F, are split into Fbโขaโขtโขcโขhโˆ— batches:

        R{v0,v1,โ€ฆ,v3โขcโˆ’1}:={w0,โ€ฆ,wmโขuโขlโขtโˆ’1},
        R{v3โขc,v3โขc+1,โ€ฆ,v6โขcโˆ’1}:={wmโขuโขlโขt,โ€ฆ,w2โขmโขuโขlโขtโˆ’1},โ€ฆ,
        R{v(Fbโขaโขtโขcโขhโˆ—โˆ’1)โข3โขc,v(Fbโขaโขtโขcโขhโˆ—โˆ’1)โข3โขc+1,โ€ฆ,vFโˆ—โˆ’1}:={w(Fโˆ—โˆ’1)โขmโขuโขlโขt,โ€ฆ,wnโˆ’Fโˆ’x}

        Notice that since Fbโขaโขtโขcโขhโˆ— is defined using the floor of the respective ratio, the size of the last batch may be larger than 3โขc, but it is at most 6โขc. Further, in case Fโˆ— is less than 3โขc then there is only one batch, which may also be small, but it is assigned to all non-faulty nodes. For each batch, every node in that batch is assigned to each vj in the batch, and repeats steps (i)-(iii) for the corresponding part Pฯ„,vj.

      We call each iteration of these steps an attempt. If all of the parts have been checkpointed and ฯ„ is the output layer, then the algorithm halts. Note that additional failures can occur throughout repeating. Also, we stress that although the algorithm describes exchanging messages between any two nodes, whenever it attempts to send (receive) a message to (from) a failed node, such a message is not delivered.

      The complexity of this step is equal to the complexity of Oโข(cโข(nฮถ+nฮพ)) rounds for steps (i)-(iii), plus an additive overhead of Oโข(cโขnฮถ) rounds which corresponds to the rounds needed to decode the information corresponding to the node that is being simulated. In case (b), since the size of a batch is Oโข(c), this introduces an overhead of Oโข(c) over that round complexity, for a total of at most Oโข(c2โข(nฮถ+nฮพ)) rounds per attempt. The number of attempts is bounded in what follows.

Correctness and complexity of the algorithm despite faults.

It is immediate from the construction that the algorithm produces the output of the circuit. Further, it is straightforward that the number of quiet rounds is Oโข(1).

We prove that R-decodability holds by induction, with R=O(cnฮถ)=O(nฮถ for all epochs except the last, and R=Oโข(cโขnฮผ)=Oโข(nฮผ) for the last. The base case is straightforward because the computation takes place through quiet rounds: at the end of step (1), each node holds the inputs to its gates. In step (2), each node encodes its input using the code in Lemma 14, and thus R-decodability holds with R=Oโข(c)=Oโข(1). For the the inductive hypothesis, assume that epoch โ„“ has been correctly checkpointed as at most Oโข(nฮถ) codewords of an [n,nc,cโˆ’1cโขn+1]q-code for each node. Then, by the inductive hypothesis, we have that a node can, in Oโข(nฮถ) rounds, obtain all of the checkpointed data it needs and then perform its local computations, and simulate any failed nodes. Thus, at the end of step 3(iv), which we show below that it indeed checkpoints the computation of all nodes (also faulty nodes), we have that epoch โ„“+1 is correctly checkpointed as at most Oโข(nฮถ) codewords of an [n,nc,cโˆ’1cโขn+1]q-code for each node. Thus, R-decodability holds with R=Oโข(cโขnฮถ)=Oโข(nฮถ). For the last epoch, we only need to encode the output and therefore we get R-decodability with R=Oโข(cโขnฮผ)=Oโข(nฮผ).

It remains to bound the number of attempts needed for step 3(iv), which will also prove that indeed it checkpoints the computation of all nodes (also faulty nodes). For an attempt, recall that we denote by F the number of nodes that failed so far (and thus the number of non-faulty nodes is nโˆ’F), and by Fโˆ— the number of parts that are yet to be checkpointed. Let Fโ€ฒ be the additional number of nodes that fail throughout this attempt. We consider the two possible cases, depending on how Fโˆ— relates to nโˆ’F.

In the first case, recall that if Fโˆ—>nโˆ’F then each of the non-faulty nodes is assigned to one failed node whose part has not been checkpointed and repeats its computation. Since there are at least n/c non-faulty nodes even despite the additional Fโ€ฒ newly failed ones, we have that at least n/c additional tasks get checkpointed in this attempt. Thus, such an attempt can happen at most c times before all n parts are checkpointed for this epoch.

In the second case, we have that Fโˆ—โ‰คnโˆ’F. Recall that each part that is not yet checkpointed is now attempted to be checkpointed by a multiplicity of mโขuโขlโขt or mโขuโขlโขtโˆ’1 nodes, where mโขuโขlโขt=โŒˆnโˆ’FFโˆ—โŒ‰โ‰ฅ1. There are two possible sub-cases based on how the value of Fโ€ฒ relates to the remaining number of allowed failures Frโขeโขmโขaโขiโขn=(cโˆ’1c)โ‹…nโˆ’F. It holds that either Fโ€ฒโ‰ฅFrโขeโขmโขaโขiโขnโ‹…12 or Fโ€ฒ<Frโขeโขmโขaโขiโขnโ‹…12. The former case can happen at most logโก((cโˆ’1c)โ‹…n)=Oโข(logโกn) times before Frโขeโขmโขaโขiโขn drops to 0.

Thus, suppose that the latter case happens in an attempt. If there is a single batch because Fโˆ—<3โขc, then all of the batch is performed by all of these nodes, and hence must succeed because at least one of them (at least n/c of them) is non-faulty.

Otherwise, since x is the remainder of dividing nโˆ’F by Fbโขaโขtโขcโขhโˆ—, we have xโ‰คFbโขaโขtโขcโขhโˆ—โˆ’1. Thus, the actual number of nodes that are successful in checkpointing the tasks they are now in charge of is at least nโˆ’Fโˆ’Fโ€ฒโˆ’x. Possibly, every mโขuโขlโขt=โŒŠ(nโˆ’F)/Fbโขaโขtโขcโขhโˆ—โŒ‹ of them are performing the same batch, but even in this worst case, the number of distinct newly checkpointed batches is at least (nโˆ’Fโˆ’Fโ€ฒโˆ’x)/mโขuโขlโขt. Note that with the notation of x, we have that mโขuโขlโขt=(nโˆ’Fโˆ’x)/Fbโขaโขtโขcโขhโˆ—. Further, it holds that Fbโขaโขtโขcโขhโˆ—=maxโก{โŒŠFโˆ—/(3โขc)โŒ‹,1}โ‰คn3โขcโ‰ค(nโˆ’F)/3, where the first inequality is since Fโˆ—<n, and the second is since n/cโ‰คnโˆ’F because the number of non-faulty nodes can never go below n/c.

We bound this number of newly checkpointed batches (nโˆ’Fโˆ’Fโ€ฒโˆ’x)/mโขuโขlโขt in terms of the total number of remaining batches Fbโขaโขtโขcโขhโˆ—, as follows. We have that:

nโˆ’Fโˆ’Fโ€ฒโˆ’xmโขuโขlโขt =(nโˆ’Fโˆ’Fโ€ฒโˆ’x)/nโˆ’Fโˆ’xFbโขaโขtโขcโขhโˆ—
=Fbโขaโขtโขcโขhโˆ—โ‹…nโˆ’Fโˆ’Fโ€ฒโˆ’xnโˆ’Fโˆ’x=Fbโขaโขtโขcโขhโˆ—โ‹…(1โˆ’Fโ€ฒnโˆ’Fโˆ’x).

Since Fโ€ฒ<Frโขeโขmโขaโขiโขn/2, and by plugging Frโขeโขmโขaโขiโขn=(cโˆ’1c)โขnโˆ’F, we get:

nโˆ’Fโˆ’Fโ€ฒโˆ’xmโขuโขlโขt โ‰ฅFbโขaโขtโขcโขhโˆ—โ‹…(1โˆ’Frโขeโขmโขaโขiโขnโ‹…12nโˆ’Fโˆ’x)=Fbโขaโขtโขcโขhโˆ—โ‹…(1โˆ’((cโˆ’1c)โ‹…nโˆ’F)โ‹…12nโˆ’Fโˆ’x).

Further simple algebraic manipulations give:

nโˆ’Fโˆ’Fโ€ฒโˆ’xmโขuโขlโขt โ‰ฅFbโขaโขtโขcโขhโˆ—โ‹…(1โˆ’nโˆ’Fโˆ’n/c2โข(nโˆ’Fโˆ’x))
=Fbโขaโขtโขcโขhโˆ—โ‹…(1โˆ’(nโˆ’F)โˆ’n/c2โข(nโˆ’F)โˆ’2โขx)=Fbโขaโขtโขcโขhโˆ—โ‹…(1โˆ’1โˆ’n/cnโˆ’F2โˆ’2โขxnโˆ’F).

Removing the n/cnโˆ’F term from the nominator only decreases the value of the expression, as well as using xโ‰คFbโขaโขtโขcโขhโˆ—โˆ’1. We thus have:

nโˆ’Fโˆ’Fโ€ฒโˆ’xmโขuโขlโขt โ‰ฅFbโขaโขtโขcโขhโˆ—โ‹…(1โˆ’12โˆ’2โข(Fbโขaโขtโขcโขhโˆ—โˆ’1)nโˆ’F).

Since nโˆ’FFbโขaโขtโขcโขhโˆ—โ‰ฅ3, we finally have:

nโˆ’Fโˆ’Fโ€ฒโˆ’xmโขuโขlโขt โ‰ฅFbโขaโขtโขcโขhโˆ—โ‹…(1โˆ’12โˆ’2โข(Fbโขaโขtโขcโขhโˆ—โˆ’1)3โขFbโขaโขtโขcโขhโˆ—)
โ‰ฅFbโขaโขtโขcโขhโˆ—โ‹…(1โˆ’12โˆ’23)=Fbโขaโขtโขcโขhโˆ—โ‹…(1/4).

This yields that after this subcase, the number of batches to be checkpointed drops from Fbโขaโขtโขcโขhโˆ— to at most (1/4)โ‹…Fbโขaโขtโขcโขhโˆ—. This implies that this can happen at most Oโข(logโกn) times before all n parts are checkpointed.

Thus the algorithm has all the parts checkpointed in Oโข(c+logโกn) attempts. Each attempt runs in either Oโข(cโ‹…(nฮถ+nฮพ)) or Oโข(c2โ‹…(nฮถ+nฮพ)) rounds, where the latter number of rounds occurs only in the case of batching, which can happen at most Oโข(logโกn) attempts. Hence, we obtain a total of Oโข(c2โ‹…(nฮถ+nฮพ)โขlogโกn) rounds per epoch. In the final epoch, the nodes checkpoint the outputs in Oโข(c2โขnฮผโขlogโกn) rounds. This is done only once, and thus incurs only an additive overhead. In total, since the depth of the circuit is nฮด, we get a running time of Oโข(c2โ‹…(nฮด+ฯ†+ฮท+nฮผ)โขlogโกn), where ฮท=maxโก{ฮถโˆ’ฯ†,ฮพโˆ’ฯ†}, as claimed. โ—€

4 Application: Semi-Ring Matrix Multiplication

Theorem 2 (A ๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ algorithm for semi-ring matrix multiplication). [Restated, see original statement.]

Suppose that the inputs on the nodes are matrices A,Bโˆˆฮฃnร—n, where each node begins with n coefficients of A and B and ฮฃ is a semi-ring, then the value C=Aโ‹…B can be computed in the c-๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ with Oโข(1) quiet rounds, a round complexity of Oโข(c2โขn1/3โขlogโกn), and Oโข(1)-decodability.

Proof.

We prove the theorem by showing that there is circuit computing this function that satisfies the hypothesis of Theorem 1. To describe our circuit, we define some notation.

Outer Partition.

The matrices are split as follows: A is split into n1/3ร—n1/3 block matrices of equal dimension n2/3ร—n2/3, where the Aji are n2/3ร—n2/3 matrices, and similarly B is split into n1/3ร—n1/3 block matrices of equal dimension n2/3ร—n2/3 where the Bji are n2/3ร—n2/3 matrices; i.e.,

A=:[A11โ€ฆAn1/31โ‹ฎโ€ฆโ‹ฎA1n1/3โ€ฆAn1/3n1/3]=:[A1โ€ฆAn1/3],B=:[B11โ€ฆBn1/31โ‹ฎโ€ฆโ‹ฎB1n1/3โ€ฆBn1/3n1/3]=:[B1โ‹ฎBn1/3]

Elementary linear algebra gives us that the product C=Aโ‹…B satisfies the equation Cji=โˆ‘kโˆˆ[n1/3](AkโขBk)ji, which can be computed by a total of (# ofย kย inย โขAk,Bk)โ‹…(# ofย iย inย โขAki)โ‹…(# ofย jย inย โขBjk)=n1/3โ‹…n1/3โ‹…n1/3=n block (outer) matrix multiplications; indeed, by the definition of an outerproduct we have that (AkโขBk)ji=AkiโขBjk, which we use to more efficiently distribute the tasks, and so we have:

Cji=โˆ‘kโˆˆ[n1/3](AkโขBk)ji=โˆ‘kโˆˆ[n1/3]AkiโขBjk (2)

Inner Partition.

We define sub-blocks Ajiโข[m] of the blocks Aji, each of size n2/3ร—n1/3, as follows: (Ajiโข[m])โ„“k:=(Aji)โ„“+mโขn1/3k, i.e., Ajiโข[m] is implicitly defined by the equation Aji=[Ajiโข[1]โ€ฆAjiโข[n1/3]]. Similarly, we define subblocks Bjiโข[m] of the blocks Bji, each of size n1/3ร—n2/3, analogously to the previous construction: (Bjiโข[m])โ„“k:=(Bji)โ„“k+mโขn1/3, i.e., Bjiโข[m] is implicitly defined by the following equation

Bji=[Bjiโข[1]โ‹ฎBjiโข[n1/3]].

In particular, we have that

AkiโขBjk=โˆ‘mโˆˆ[n1/3]Akiโข[m]โขBjkโข[m], (3)

which by Equation 2 further implies that Cji=โˆ‘kโˆˆ[n1/3]AkiโขBjk=โˆ‘k,mโˆˆ[n1/3]Akiโข[m]โขBjkโข[m]. For the last layer of our circuit, we will also use the fact that

(Cji)mโ„“=โˆ‘kโˆˆ[n1/3](AkiโขBjk)mโ„“. (4)
Figure 1: Illustration for the inner and outer partitions, demonstrated on matrix A.

Construction of the circuit.

We construct every layer of the circuit by considering it as having n parts (sets of gates). We associate each part w with a 3-tuple of indices (w1,w2,w3)โˆˆ[n1/3]3.

First Layer (Input/Shuffle).

We make a part responsible for the entries of its 3-tuple over all of the blocks in the very first shuffling layer; in particular, we assume that in the initial shuffle step, the data has been rearranged so that part w now has incoming wires corresponding to (Aw2w1โข[w3])ji and (Bw2w1โข[w3])kj for all i,j,k. We can formalize this as the following: the wires of the gates in layer 0 shuffle data to the parts corresponding to w0,w1,w2 to part w in layer 1 as given above; i.e., there are 2โขn gates in part w, which are:

P0,w:={(Aw2w1โข[w3])ji,(Bw2w1โข[w3])kjโˆฃi,kโˆˆ[n2/3],jโˆˆ[n1/3]}.
Second Layer (Communication).

We define the next communication round by gates

P1,w:={(Aw2w1โข[v])ji,(Bw3w2โข[v])kjโˆฃvโˆˆ[n1/3],i,kโˆˆ[n2/3],jโˆˆ[n1/3]}; (5)

Each gate has a single input wire from the prior layer. There are at most n1/3 parts whose indices differ from w=(w1,w2,w3) only on v (i.e., for parts wโ€ฒ=(w1,w2,v)) and they are each connected by exactly n wires for a total of n4/3 wires; therefore, the partition satisfies Definition 11, with:

  • โ– 

    (Communication Locality) Fix a part w=(w1,w2,w3) and consider its gates (Aw2w1โข[v])ji for all vโˆˆ[n1/3],iโˆˆ[n2/3],jโˆˆ[n1/3]. There are exactly n1/3 parts wโ€ฒ=(w1,w2,v), one for each such v, which have wires into part w in this layer (and each has precisely n wires to part w in this layer, to its gates (Aw2w1โข[v])ji for the respective v). For (Bw3w2โข[v])kj the count is similar. This gives a communication locality of Oโข(n1/3).

  • โ– 

    (Computation Locality) We have that every part P0,(w1,w2,v) can be partitioned into exactly 2 sets P0,(w1,w2,v)(1):={(Aw2w1โข[v])jiโˆฃiโˆˆ[n2/3],jโˆˆ[n1/3]} and P0,(w1,w2,v)(2):={(Bw3w2โข[v])kjโˆฃkโˆˆ[n2/3],jโˆˆ[n1/3]}. It is straightforward to see that this satisfies computation locality with nฮถ=1 (and h=2).

Third Layer (Computation).

In the third layer, part w is constructed so that its gates locally compute Aw2w1โขBw3w2. By saying that a gate in part w locally computes a function, we mean that all of its incoming wires are from part w of the previous layer. That is, we define the n4/3 gates:

P2,w:=Aw2w1โขBw3w2=โˆ‘vโˆˆ[n1/3]Aw2w1โข[v]โขBw3w2โข[v],

one for each element of the n2/3ร—n2/3 matrix. By Equation 3 and the definition of the gates in Equation 5, we get that indeed these are the values computed by the gates in this layer.

Fourth Layer (Communication).

We define the next communication round by

P3,w:={(Avw1โขBw3v)jw2โขn1/3+iโˆฃv,iโˆˆ[n1/3],jโˆˆ[n2/3]}; (6)
  • โ– 

    (Communication Locality) For (Avw1โขBw3v)jw2โขn1/3+i there are at most n1/3 parts wโ€ฒ=(w1,v,w3) that have the information corresponding to Avw1โขBw3v and furthermore there are precisely n input wires for each one, since i,j have in total a range of n.

  • โ– 

    (Computation Locality) The parts in the third layer that have outputs wires are all of size n4/3 gates, which can each be partitioned perfectly into n1/3 sets

    P2,(w1,v,w3)(w1,w2,w3):={(Avw1โขBw3v)jw2โขn1/3+iโˆฃv,iโˆˆ[n1/3],jโˆˆ[n2/3]}

    (one for each of the n1/3 choices for w2). Each set is of size n and these sets, P2,(w1,v,w3)(w1,w2,w3), are all of the subparts corresponding to (w1,v,w3). Thus, we have that nฮถ=n1/3.

Fifth Layer (Computation/Output).

Now we can have gates of part w in the fifth layer that can locally (with incoming wires from the fourth layer only from part w) compute

P4,w :={(Cw2w1)jw3โขn1/3+i,โˆฃiโˆˆ[n1/3],jโˆˆ[n2/3]}
={โˆ‘vโˆˆ[n1/3](Avw1โขBw3v)jw3โขn1/3+iโˆฃiโˆˆ[n1/3],jโˆˆ[n2/3]}.

By Equation 4 and the definition of the gates in Equation 6, we get that indeed these are the values that are computed by the gates in this layer.

This completes the proof that the circuit correctly computes the product of the matrices. The circuit has a constant depth, and an Oโข(n1/3)-parallel partition. By Theorem 1, it can be computed in the c-๐–ฅ๐–บ๐—Ž๐—…๐—๐—’โข๐–ข๐—…๐—‚๐—Š๐—Ž๐–พ model within Oโข(c2โขn1/3โขlogโกn) rounds, with Oโข(1) quiet rounds and Oโข(1)-decodability.

โ—€

References

  • [1] Hagit Attiya and Jennifer L. Welch. Distributed computing - fundamentals, simulations, and advanced topics (2. ed.). Wiley series on parallel and distributed computing. Wiley, 2004.
  • [2] Hong Duc Bui, Shashwat Chandra, Yi-Jun Chang, Michal Dory, and Dean Leitersdorf. Improved all-pairs approximate shortest paths in congested clique. In Ran Gelles, Dennis Olivetti, and Petr Kuznetsov, editors, Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing, PODC 2024, Nantes, France, June 17-21, 2024, pages 391โ€“400. ACM, 2024. doi:10.1145/3662158.3662804.
  • [3] Keren Censor-Hillel, Michal Dory, Janne H. Korhonen, and Dean Leitersdorf. Fast approximate shortest paths in the congested clique. Distributed Comput., 34(6):463โ€“487, 2021. doi:10.1007/S00446-020-00380-5.
  • [4] Keren Censor-Hillel, Orr Fischer, Franรงois Le Gall, Dean Leitersdorf, and Rotem Oshman. Quantum distributed algorithms for detection of cliques. In Mark Braverman, editor, 13th Innovations in Theoretical Computer Science Conference, ITCS 2022, January 31 - February 3, 2022, Berkeley, CA, USA, volume 215 of LIPIcs, pages 35:1โ€“35:25. Schloss Dagstuhl โ€“ Leibniz-Zentrum fรผr Informatik, 2022. doi:10.4230/LIPICS.ITCS.2022.35.
  • [5] Keren Censor-Hillel, Orr Fischer, Ran Gelles, and Pedro Soto. Two for one, one for all: Deterministic LDC-based robust computation in congested clique. CoRR, abs/2508.08740, 2025. doi:10.48550/arXiv.2508.08740.
  • [6] Keren Censor-Hillel, Orr Fischer, Tzlil Gonen, Franรงois Le Gall, Dean Leitersdorf, and Rotem Oshman. Fast distributed algorithms for girth, cycles and small subgraphs. In Hagit Attiya, editor, 34th International Symposium on Distributed Computing, DISC 2020, October 12-16, 2020, Virtual Conference, volume 179 of LIPIcs, pages 33:1โ€“33:17. Schloss Dagstuhl โ€“ Leibniz-Zentrum fรผr Informatik, 2020. doi:10.4230/LIPICS.DISC.2020.33.
  • [7] Keren Censor-Hillel, Franรงois Le Gall, and Dean Leitersdorf. On distributed listing of cliques. In Yuval Emek and Christian Cachin, editors, PODC โ€™20: ACM Symposium on Principles of Distributed Computing, Virtual Event, Italy, August 3-7, 2020, pages 474โ€“482. ACM, 2020. doi:10.1145/3382734.3405742.
  • [8] Keren Censor-Hillel, Ran Gelles, and Bernhard Haeupler. Making asynchronous distributed computations robust to noise. Distributed Comput., 32(5):405โ€“421, 2019. doi:10.1007/S00446-018-0343-5.
  • [9] Keren Censor-Hillel, Petteri Kaski, Janne H. Korhonen, Christoph Lenzen, Ami Paz, and Jukka Suomela. Algebraic methods in the congested clique. Distributed Comput., 32(6):461โ€“478, 2019. doi:10.1007/S00446-016-0270-2.
  • [10] Keren Censor-Hillel, Dean Leitersdorf, and Elia Turner. Sparse matrix multiplication and triangle listing in the congested clique model. Theor. Comput. Sci., 809:45โ€“60, 2020. doi:10.1016/J.TCS.2019.11.006.
  • [11] Keren Censor-Hillel, Yannic Maus, and Volodymyr Polosukhin. Near-optimal scheduling in the congested clique. In Tomasz Jurdzinski and Stefan Schmid, editors, Structural Information and Communication Complexity - 28th International Colloquium, SIROCCO 2021, Wrocล‚aw, Poland, June 28 - July 1, 2021, Proceedings, volume 12810 of Lecture Notes in Computer Science, pages 50โ€“67. Springer, 2021. doi:10.1007/978-3-030-79527-6_4.
  • [12] Keren Censor-Hillel, Merav Parter, and Gregory Schwartzman. Derandomizing local distributed algorithms under bandwidth restrictions. Distributed Comput., 33(3-4):349โ€“366, 2020. doi:10.1007/S00446-020-00376-1.
  • [13] Keren Censor-Hillel and Pedro Soto. Computing in a faulty congested clique. CoRR, abs/2505.11430, 2025. doi:10.48550/arXiv.2505.11430.
  • [14] Sam Coy, Artur Czumaj, Peter Davies, and Gopinath Mishra. Optimal (degree+1)-coloring in congested clique. In Kousha Etessami, Uriel Feige, and Gabriele Puppis, editors, 50th International Colloquium on Automata, Languages, and Programming, ICALP 2023, July 10-14, 2023, Paderborn, Germany, volume 261 of LIPIcs, pages 46:1โ€“46:20. Schloss Dagstuhl โ€“ Leibniz-Zentrum fรผr Informatik, 2023. doi:10.4230/LIPICS.ICALP.2023.46.
  • [15] Artur Czumaj, Peter Davies, and Merav Parter. Simple, deterministic, constant-round coloring in congested clique and MPC. SIAM J. Comput., 50(5):1603โ€“1626, 2021. doi:10.1137/20M1366502.
  • [16] Danny Dolev, Christoph Lenzen, and Shir Peled. "tri, tri again": Finding triangles and small subgraphs in a distributed setting - (extended abstract). In Marcos K. Aguilera, editor, Distributed Computing - 26th International Symposium, DISC 2012, Salvador, Brazil, October 16-18, 2012. Proceedings, volume 7611 of Lecture Notes in Computer Science, pages 195โ€“209. Springer, 2012. doi:10.1007/978-3-642-33651-5_14.
  • [17] Michal Dory, Orr Fischer, Seri Khoury, and Dean Leitersdorf. Constant-round spanners and shortest paths in congested clique and MPC. In Avery Miller, Keren Censor-Hillel, and Janne H. Korhonen, editors, PODC โ€™21: ACM Symposium on Principles of Distributed Computing, Virtual Event, Italy, July 26-30, 2021, pages 223โ€“233. ACM, 2021. doi:10.1145/3465084.3467928.
  • [18] Michal Dory and Merav Parter. Exponentially faster shortest paths in the congested clique. J. ACM, 69(4):29:1โ€“29:42, 2022. doi:10.1145/3527213.
  • [19] Andrew Drucker, Fabian Kuhn, and Rotem Oshman. On the power of the congested clique model. In Magnรบs M. Halldรณrsson and Shlomi Dolev, editors, ACM Symposium on Principles of Distributed Computing, PODC โ€™14, Paris, France, July 15-18, 2014, pages 367โ€“376. ACM, 2014. doi:10.1145/2611462.2611493.
  • [20] Orr Fischer, Tzlil Gonen, Fabian Kuhn, and Rotem Oshman. Possibilities and impossibilities for distributed subgraph detection. In Christian Scheideler and Jeremy T. Fineman, editors, Proceedings of the 30th on Symposium on Parallelism in Algorithms and Architectures, SPAA 2018, Vienna, Austria, July 16-18, 2018, pages 153โ€“162. ACM, 2018. doi:10.1145/3210377.3210401.
  • [21] Franรงois Le Gall. Further algebraic algorithms in the congested clique model and applications to graph-theoretic problems. In Cyril Gavoille and David Ilcinkas, editors, Distributed Computing - 30th International Symposium, DISC 2016, Paris, France, September 27-29, 2016. Proceedings, volume 9888 of Lecture Notes in Computer Science, pages 57โ€“70. Springer, 2016. doi:10.1007/978-3-662-53426-7_5.
  • [22] Mohsen Ghaffari. Distributed MIS via all-to-all communication. In Elad Michael Schiller and Alexander A. Schwarzmann, editors, Proceedings of the ACM Symposium on Principles of Distributed Computing, PODC 2017, Washington, DC, USA, July 25-27, 2017, pages 141โ€“149. ACM, 2017. doi:10.1145/3087801.3087830.
  • [23] Mohsen Ghaffari, Themis Gouleakis, Christian Konrad, Slobodan Mitrovic, and Ronitt Rubinfeld. Improved massively parallel computation algorithms for mis, matching, and vertex cover. In Calvin Newport and Idit Keidar, editors, Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing, PODC 2018, Egham, United Kingdom, July 23-27, 2018, pages 129โ€“138. ACM, 2018. doi:10.1145/3212734.3212743.
  • [24] Mohsen Ghaffari, Ce Jin, and Daan Nilis. A massively parallel algorithm for minimum weight vertex cover. In Christian Scheideler and Michael Spear, editors, SPAA โ€™20: 32nd ACM Symposium on Parallelism in Algorithms and Architectures, Virtual Event, USA, July 15-17, 2020, pages 259โ€“268. ACM, 2020. doi:10.1145/3350755.3400260.
  • [25] Mohsen Ghaffari and Krzysztof Nowicki. Congested clique algorithms for the minimum cut problem. In Calvin Newport and Idit Keidar, editors, Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing, PODC 2018, Egham, United Kingdom, July 23-27, 2018, pages 357โ€“366. ACM, 2018. URL: https://dl.acm.org/citation.cfm?id=3212750.
  • [26] Mohsen Ghaffari and Merav Parter. MST in log-star rounds of congested clique. In George Giakkoupis, editor, Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing, PODC 2016, Chicago, IL, USA, July 25-28, 2016, pages 19โ€“28. ACM, 2016. doi:10.1145/2933057.2933103.
  • [27] James W. Hegeman, Gopal Pandurangan, Sriram V. Pemmaraju, Vivek B. Sardeshmukh, and Michele Scquizzato. Toward optimal bounds in the congested clique: Graph connectivity and MST. In Chryssis Georgiou and Paul G. Spirakis, editors, Proceedings of the 2015 ACM Symposium on Principles of Distributed Computing, PODC 2015, Donostia-San Sebastiรกn, Spain, July 21 - 23, 2015, pages 91โ€“100. ACM, 2015. doi:10.1145/2767386.2767434.
  • [28] James W. Hegeman and Sriram V. Pemmaraju. Lessons from the congested clique applied to mapreduce. Theor. Comput. Sci., 608:268โ€“281, 2015. doi:10.1016/J.TCS.2015.09.029.
  • [29] James W. Hegeman, Sriram V. Pemmaraju, and Vivek Sardeshmukh. Near-constant-time distributed algorithms on a congested clique. In Fabian Kuhn, editor, Distributed Computing - 28th International Symposium, DISC 2014, Austin, TX, USA, October 12-15, 2014. Proceedings, volume 8784 of Lecture Notes in Computer Science, pages 514โ€“530. Springer, 2014. doi:10.1007/978-3-662-45174-8_35.
  • [30] Monika Henzinger, Sebastian Krinninger, and Danupon Nanongkai. A deterministic almost-tight distributed algorithm for approximating single-source shortest paths. In Daniel Wichs and Yishay Mansour, editors, Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 489โ€“498. ACM, 2016. doi:10.1145/2897518.2897638.
  • [31] W. Cary Huffman and Vera Pless. Fundamentals of Error-Correcting Codes. Cambridge University Press, 2003. doi:10.1017/CBO9780511807077.
  • [32] Taisuke Izumi and Franรงois Le Gall. Triangle finding and listing in CONGEST networks. In Elad Michael Schiller and Alexander A. Schwarzmann, editors, Proceedings of the ACM Symposium on Principles of Distributed Computing, PODC 2017, Washington, DC, USA, July 25-27, 2017, pages 381โ€“389. ACM, 2017. doi:10.1145/3087801.3087811.
  • [33] Taisuke Izumi and Franรงois Le Gall. Quantum distributed algorithm for the all-pairs shortest path problem in the CONGEST-CLIQUE model. In Peter Robinson and Faith Ellen, editors, Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing, PODC 2019, Toronto, ON, Canada, July 29 - August 2, 2019, pages 84โ€“93. ACM, 2019. doi:10.1145/3293611.3331628.
  • [34] Tomasz Jurdzinski and Krzysztof Nowicki. MST in O(1) rounds of congested clique. In Artur Czumaj, editor, Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 2620โ€“2632. SIAM, 2018. doi:10.1137/1.9781611975031.167.
  • [35] Janne H. Korhonen. Deterministic MST sparsification in the congested clique. CoRR, abs/1605.02022, 2016. arXiv:1605.02022.
  • [36] Christoph Lenzen. Optimal deterministic routing and sorting on the congested clique. In Panagiota Fatourou and Gadi Taubenfeld, editors, ACM Symposium on Principles of Distributed Computing, PODC โ€™13, Montreal, QC, Canada, July 22-24, 2013, pages 42โ€“50. ACM, 2013. doi:10.1145/2484239.2501983.
  • [37] J. H. Van Lint. Introduction to Coding Theory. Springer-Verlag, Berlin, Heidelberg, 3rd edition, 1998.
  • [38] Zvi Lotker, Boaz Patt-Shamir, Elan Pavlov, and David Peleg. Minimum-weight spanning tree construction in O(log log n) communication rounds. SIAM J. Comput., 35(1):120โ€“131, 2005. doi:10.1137/S0097539704441848.
  • [39] Nancy A. Lynch. Distributed Algorithms. Morgan Kaufmann, 1996.
  • [40] Danupon Nanongkai. Distributed approximation algorithms for weighted shortest paths. In David B. Shmoys, editor, Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31 - June 03, 2014, pages 565โ€“573. ACM, 2014. doi:10.1145/2591796.2591850.
  • [41] Krzysztof Nowicki. A deterministic algorithm for the MST problem in constant rounds of congested clique. In Samir Khuller and Virginia Vassilevska Williams, editors, STOC โ€™21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, Virtual Event, Italy, June 21-25, 2021, pages 1154โ€“1165. ACM, 2021. doi:10.1145/3406325.3451136.
  • [42] Gopal Pandurangan, Peter Robinson, and Michele Scquizzato. On the distributed complexity of large-scale graph computations. In Christian Scheideler and Jeremy T. Fineman, editors, Proceedings of the 30th on Symposium on Parallelism in Algorithms and Architectures, SPAA 2018, Vienna, Austria, July 16-18, 2018, pages 405โ€“414. ACM, 2018. doi:10.1145/3210377.3210409.
  • [43] Merav Parter. (delta+1) coloring in the congested clique model. In Ioannis Chatzigiannakis, Christos Kaklamanis, Dรกniel Marx, and Donald Sannella, editors, 45th International Colloquium on Automata, Languages, and Programming, ICALP 2018, July 9-13, 2018, Prague, Czech Republic, volume 107 of LIPIcs, pages 160:1โ€“160:14. Schloss Dagstuhl โ€“ Leibniz-Zentrum fรผr Informatik, 2018. doi:10.4230/LIPICS.ICALP.2018.160.
  • [44] Merav Parter and Hsin-Hao Su. Randomized (delta+1)-coloring in o(log* delta) congested clique rounds. In Ulrich Schmid and Josef Widder, editors, 32nd International Symposium on Distributed Computing, DISC 2018, New Orleans, LA, USA, October 15-19, 2018, volume 121 of LIPIcs, pages 39:1โ€“39:18. Schloss Dagstuhl โ€“ Leibniz-Zentrum fรผr Informatik, 2018. doi:10.4230/LIPICS.DISC.2018.39.
  • [45] Merav Parter and Eylon Yogev. Congested clique algorithms for graph spanners. In Ulrich Schmid and Josef Widder, editors, 32nd International Symposium on Distributed Computing, DISC 2018, New Orleans, LA, USA, October 15-19, 2018, volume 121 of LIPIcs, pages 40:1โ€“40:18. Schloss Dagstuhl โ€“ Leibniz-Zentrum fรผr Informatik, 2018. doi:10.4230/LIPICS.DISC.2018.40.
  • [46] Boaz Patt-Shamir and Marat Teplitsky. The round complexity of distributed sorting: extended abstract. In Cyril Gavoille and Pierre Fraigniaud, editors, Proceedings of the 30th Annual ACM Symposium on Principles of Distributed Computing, PODC 2011, San Jose, CA, USA, June 6-8, 2011, pages 249โ€“256. ACM, 2011. doi:10.1145/1993806.1993851.
  • [47] David Peleg. Distributed computing: a locality-sensitive approach. SIAM, 2000.