Sparse Matrix Multiplication and Triangle Listing in the Congested Clique Model

We multiply two $n \times n$ matrices $S,T$ over semirings in the Congested Clique model, where $n$ fully connected nodes communicate synchronously using $O(\log n)$-bit messages, within $O(nz(S)^{1/3} nz(T)^{1/3}/n + 1)$ rounds of communication, where $nz(A)$ denotes the number of non-zero elements in a matrix $A$. By leveraging the sparsity of the input matrices, our algorithm greatly reduces communication compared with general algorithms [Censor-Hillel et al., PODC 2015], improving upon the state-of-the-art for matrices with $o(n^2)$ non-zero elements. Our algorithm exhibits the additional strength of surpassing previous solutions also when only one matrix is sparse. This allows efficiently raising a sparse matrix to a power greater than 2. As applications, we speed up 4-cycle counting and APSP in sparse graphs. Our algorithmic contribution is a new \emph{deterministic} method of restructuring the input matrices in a sparsity-aware manner, which assigns to each node element-wise multiplication tasks that are not necessarily consecutive but are balanced, yielding communication-efficient multiplication. Moreover, this new deterministic method for restructuring matrices may be used to restructure the adjacency matrix of input graphs, enabling faster solutions for graph-related problems. As an example, we present a new deterministic algorithm which solves the triangle listing problem in $O(m/n^{5/3} + 1)$ rounds, a complexity that was previously obtained by a \emph{randomized} algorithm [Pandurangan et al., SPAA 2018] and matches the lower bound of $\tilde{\Omega}(n^{1/3})$ when $m=n^2$ of [Izumi and Le Gall, PODC 2017, Pandurangan et al., SPAA 2018]. Our triangle listing algorithm implies triangle counting with the same complexity of $O(m/n^{5/3} + 1)$ rounds, which is a \emph{cubic} improvement over the previous $O(m^2/n^3)$-round algorithm [Dolev et al., DISC 2012].


Introduction
Matrix multiplication is a fundamental algebraic task, with abundant applications to various computations. The value of the exponent ω of matrix multiplication, that is, the value ω for which Θ(n^ω) is the complexity of matrix multiplication, is a central question in algebraic algorithms [10,29,31], and is currently known to be bounded by 2.3728639 [15].
The work of Censor-Hillel et al. [9] recently showed that known matrix multiplication algorithms for the parallel setting can be adapted to the distributed Congested Clique model, which consists of n nodes in a fully connected synchronous network, limited by a bandwidth of O(log n) bits per message. This significantly improved the state-of-the-art for a variety of tasks, including triangle and 4-cycle counting, girth computations, and (un)weighted/(un)directed all-pairs shortest paths (APSP). This was followed by the beautiful work of Le Gall [16], who showed how to efficiently multiply rectangular matrices, as well as multiple independent multiplication instances. These led to even faster algorithms for some of the tasks, such as weighted or directed APSP, as well as fast algorithms for new tasks, such as computing the size of the maximum matching.
In this paper we focus our attention on the task of multiplying sparse matrices in the Congested Clique model, providing a novel deterministic algorithm with a round complexity which depends on the sparsity of the input matrices.
An immediate application of our algorithm is faster counting of 4-cycles. Moreover, a prime feature of our algorithm is that it speeds up matrix multiplication even if only one of the input matrices is sparse. The significance of this ability stems from the fact that the product of sparse matrices may be non-sparse, which in general may stand in the way of fast multiplication of more than two sparse matrices, such as raising a sparse matrix to a power that is larger than 2. Therefore, this property of our algorithm enables, for instance, a fast algorithm for computing APSP in the Congested Clique model. We emphasize that, unlike the matrix multiplication algorithms of [9], we are not aware of a similar sparse matrix multiplication algorithm existing in the literature of parallel settings.
Furthermore, we leverage our techniques to obtain a deterministic algorithm for sparsity-aware triangle listing in the Congested Clique model, in which each triangle needs to be known to some node. This problem was first (implicitly) tackled in the Congested Clique model by Dolev et al. [12], who provided two deterministic algorithms. Later, [19,26] showed an Ω̃(n^{1/3}) lower bound for general graphs. Pandurangan et al. [26] showed a randomized triangle listing algorithm, with the same round complexity as we obtain.

Our contribution
For a matrix A, let nz(A) be its number of non-zero elements. Our main contribution is an algorithm called SMM (Sparse Matrix Multiplication), for which we prove the following.

Theorem 1. Given two n × n matrices S and T, Algorithm SMM deterministically computes the product P = S · T over a semiring in the Congested Clique model, completing in O(nz(S)^{1/3} nz(T)^{1/3}/n + 1) rounds.

An important case of Theorem 1, especially when squaring the adjacency matrix of a graph in order to solve graph problems, is when the sparsities of the input matrices are roughly the same. In such a case, Theorem 1 gives the following.

Corollary 1. Given two n × n matrices S and T, where nz(S) = O(m) and nz(T) = O(m), Algorithm SMM deterministically computes the product P = S · T over a semiring in the Congested Clique model, within O(m^{2/3}/n + 1) rounds.
Notice that for m = O(n^2), Corollary 1 gives the same complexity of O(n^{1/3}) rounds as given by the semiring multiplication of [9].
We apply Algorithm SMM to 4-cycle counting, obtaining the following. Notice that for m = O(n^{3/2}) this establishes 4-cycle counting in a constant number of rounds. As described earlier, our algorithm is fast also in the case where only one of the input matrices is sparse, as stated in the following corollary of Theorem 1.
This allows us to compute powers larger than 2 of a sparse input matrix. Although we cannot enjoy the guarantees of our algorithm when repeatedly squaring a matrix, because this may require multiplying dense matrices, we can still repeatedly increase the power by 1. This gives the following for computing APSP, whose comparison to the state-of-the-art depends on the trade-off between the number of edges in the graph and its diameter.
There is a deterministic algorithm that computes unweighted undirected APSP in an n-node graph G in O(D((m/n)^{1/3} + 1)) rounds in the Congested Clique model, where m is the number of edges of G and D is its diameter.
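To make the iterative approach concrete, here is a minimal single-machine Python sketch (illustration only; the function name and the convergence check are ours, not part of Algorithm SMM) of computing APSP by repeatedly (min,+)-multiplying by the weight matrix, which is the operation our algorithm performs distributively with one sparse factor:

```python
import math

def apsp_by_iterated_multiplication(W):
    """Sketch of the APSP approach: repeatedly (min,+)-multiply the current
    distance matrix by the weight/adjacency matrix W (with 0 on the diagonal).
    After D iterations (D = diameter) all distances are found; each step
    multiplies by the sparse W, which is what Algorithm SMM exploits."""
    n = len(W)
    dist = [row[:] for row in W]
    for _ in range(n - 1):              # at most D < n iterations are needed
        new = [[min(dist[i][k] + W[k][j] for k in range(n))
                for j in range(n)] for i in range(n)]
        if new == dist:                 # distances have converged
            break
        dist = new
    return dist

INF = math.inf
W = [[0, 1, INF, INF],                  # a path graph 0 - 1 - 2 - 3
     [1, 0, 1, INF],
     [INF, 1, 0, 1],
     [INF, INF, 1, 0]]
dist = apsp_by_iterated_multiplication(W)
```

Each iteration extends shortest paths by one hop, so the diameter D bounds the number of multiplications, matching the factor D in the round complexity above.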
For comparison, the previously known best complexity of unweighted undirected APSP is O(n^{1−2/ω}), given by [9,16], which is currently known to be bounded by O(n^{0.158}). For a graph with a number of edges that is m = o(n^{4−6/ω}/D^3), which is currently o(n^{1.474}/D^3), our algorithm improves upon the latter.
Lastly, we leverage the routing techniques developed in our sparse matrix multiplication algorithm in order to introduce an algorithm for the triangle listing problem in the Congested Clique model.

Theorem 4.
There is a deterministic algorithm for triangle listing in an n-node, m-edge graph G in O(m/n^{5/3} + 1) rounds in the Congested Clique model.
For comparison, two deterministic algorithms by Dolev et al. [12] take Õ(n^{1/3}) and O(∆^2/n) rounds, while the sparsity-aware randomized algorithm of Pandurangan et al. [26] completes in Õ(m/n^{5/3}) rounds, w.h.p. Notice that for general graphs, our algorithm matches the lower bound of Ω̃(n^{1/3}) by [19,26]. Additionally, our algorithm for triangle listing implies a triangle counting algorithm. A triangle counting algorithm whose complexity depends on the arboricity A of the graph is given in [12]. Their algorithm completes in O(A^2/n + log_{2+n/A^2} n) rounds. Since A ≥ m/n, this gives a complexity of Ω(m^2/n^3), upon which our algorithm provides more than a cubic improvement. The previously known best complexity of triangle and 4-cycle counting in general graphs is O(n^{1−2/ω}), given by [9], which is currently known to be bounded by O(n^{0.158}). For a graph with a number of edges that is m = o(n^{8/3−2/ω}), which is currently o(n^{1.824}), our algorithm improves upon the latter.
Roadmap: The remainder of this section contains an intuitive discussion of the challenges and how we overcome them, followed by a survey of related work and the required preliminaries. Section 2 gives our sparse matrix multiplication algorithm, and Section 3 shows its immediate applications in APSP and 4-cycle counting. Section 4 provides our triangle listing algorithm. We conclude with a discussion in Section 5.

Challenges and Our Techniques
Given two n × n matrices S and T, denote their product by P = S · T, for which P[i][j] = Σ_{k∈[n]} S[i][k] · T[k][j]. A common way of illustrating the multiplication is by a 3-dimensional cube of size n × n × n, in which the entry (i, j, k) corresponds to the element-wise product S[i][k] · T[k][j]. In other words, two dimensions of the cube correspond to the matrices S and T, and the third dimension corresponds to element-wise products. Each index of the third dimension is a page, and P corresponds to the element-wise summation of all n pages. In essence, the task of distributed matrix multiplication is to assign each of the n^3 element-wise multiplications to the nodes of the network, in a way which minimizes the amount of communication that is required. This motivates the goal of assigning the element-wise products to the nodes in a way that balances the number of non-zero elements of S and T that need to be communicated among the nodes, as this is the key ingredient towards minimizing the number of communication rounds. The main obstacle is that a sparse input matrix may be unbalanced, leading to the existence of nodes whose assigned element-wise multiplications require them to obtain many non-zero elements of the input matrices that originally reside at other nodes, thus necessitating much communication.
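As a single-machine illustration of the cube view (the helper below is hypothetical and not part of our algorithm), the following Python sketch enumerates the n^3 element-wise products page by page, here instantiated with the (min, +) semiring:

```python
import math

def semiring_matmul(S, T, add=min, mul=lambda x, y: x + y, zero=math.inf):
    """Multiply S and T over a semiring by enumerating the n*n*n cube.

    Entry (i, j, k) of the cube is the element-wise product S[i][k] * T[k][j];
    page k is the n x n slice of all products sharing third index k, and P is
    the element-wise (semiring) sum of the n pages."""
    n = len(S)
    P = [[zero] * n for _ in range(n)]
    for k in range(n):          # one page per index of the third dimension
        for i in range(n):
            for j in range(n):
                P[i][j] = add(P[i][j], mul(S[i][k], T[k][j]))
    return P

# (min, +) example: P[i][j] is the lightest two-hop path weight from i to j.
INF = math.inf
S = [[0, 1, INF],
     [INF, 0, 2],
     [5, INF, 0]]
P = semiring_matmul(S, S)
```

The distributed question is then exactly how to split the triple loop, i.e., the cube entries, across n nodes while moving as few non-zero inputs as possible.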
As we elaborate upon in Section 1.3, algorithms for the parallel settings, which encounter the same hurdle, typically first permute the rows and columns of the input matrices in an attempt to balance the structure of the non-zero entries. Ballard et al. [5] write: "While a priori knowledge of sparsity structure can certainly reduce communication for many important classes of inputs, we are not aware of any algorithms that dynamically determine and efficiently exploit the structure of general input matrices. In fact, a common technique of current library implementations is to randomly permute rows and columns of the input matrices in an attempt to destroy their structure and improve computational load balance."

Our high-level approach, which is deterministic, is threefold. The first ingredient is splitting the n × n × n cube into n equally sized sub-cubes whose dimensions are determined dynamically, based on the sparsity of the input matrices. The second is indeed permuting the input matrices S and T into two matrices S' and T', respectively. We do so in a subtle manner, for which the resulting matrices exhibit a useful balancing property. The third ingredient is the innovative part of our algorithm, which assigns the computation of pages of different sub-matrices across the nodes in a non-consecutive manner. We elaborate below on these key ingredients, with the aid of Figure 1.

Figure 1: An illustration of the multiplication cube for P' = S'T'. Each sub-matrix is assigned to n/ab nodes, with a not necessarily consecutive page assignment that is computed on-the-fly to minimize communication.
Permuting the input matrices: We employ standard parallelization of the task of computing the product matrix P, by partitioning P into ab equal-sized n/a × n/b sub-matrices denoted by P_{i,j} for i ∈ [a], j ∈ [b], and assigning n/ab nodes to computing each sub-matrix.
To this end, we leverage the simple observation that multiplying a row permutation of S by a column permutation of T results in a permutation of the product of S and T, which can be easily inverted. This observation underlies the first part of our algorithm, in which the nodes permute the input matrices such that the number of non-zero entries from S and T required for computing each n/a × n/b sub-matrix is roughly the same across the a · b sub-matrices. We call the two matrices, S' and T', that result from the permutations sparsity-balanced matrices with respect to (a, b). The rest of our algorithm deals with computing the product of two such matrices. This part inherently includes a computation of the best choice of a and b for minimizing the communication.
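The observation can be checked with a small single-machine Python sketch (the function below is a hypothetical illustration, not our distributed algorithm): multiplying a row-permuted S by a column-permuted T and then applying the inverse permutations recovers the original product P:

```python
def permute_multiply_recover(S, T, sigma, tau):
    """Multiply a row-permuted S by a column-permuted T, then invert.

    S2[i] = S[sigma[i]] (row permutation) and T2[.][j] = T[.][tau[j]]
    (column permutation), so P2[i][j] = P[sigma[i]][tau[j]], and the true
    product P is recovered by applying the inverse permutations to P2."""
    n = len(S)
    S2 = [S[sigma[i]] for i in range(n)]
    T2 = [[T[k][tau[j]] for j in range(n)] for k in range(n)]
    P2 = [[sum(S2[i][k] * T2[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    inv_sigma = [0] * n
    inv_tau = [0] * n
    for i in range(n):
        inv_sigma[sigma[i]] = i
        inv_tau[tau[i]] = i
    # Recover P from P2: P[i][j] = P2[inv_sigma[i]][inv_tau[j]].
    return [[P2[inv_sigma[i]][inv_tau[j]] for j in range(n)] for i in range(n)]

S = [[1, 2], [3, 4]]
T = [[5, 6], [7, 8]]
P = permute_multiply_recover(S, T, [1, 0], [1, 0])
```

In the actual algorithm the permutations are chosen to balance the non-zero entries; the sketch only demonstrates that the product is easily inverted afterwards.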
Assigning pages to nodes: To obtain each sub-matrix P_{i,j}, there are n sub-pages P_{i,j,ℓ} which need to be computed and summed. For each P_{i,j}, this task is assigned to n/ab distinct nodes, each of which computes some of the n sub-pages P_{i,j,ℓ} and sums them locally. The local sums are then aggregated and summed, for obtaining P_{i,j}. We utilize the commutativity and associativity of summation over the semiring in order to assign sub-pages to nodes in a non-consecutive manner, such that the nodes need to receive a roughly equal number of non-zero entries in order to compute their assigned sub-pages.
Assigning non-zero matrix entries to nodes: For fast communication in the Congested Clique model using Lenzen's routing scheme (see Section 1.4), it is moreover paramount that the nodes also send a roughly equal amount of non-zero matrix entries. However, it may be the case that a certain row, held by a node v, contains a significantly larger number of non-zero entries compared with other rows. Therefore, we rearrange the entries held by each node such that every node holds a roughly equal amount of non-zero entries that need to be sent to other nodes for computing the n^3 products. Notice that in this step we do not rearrange the rows or columns of S or T; rather, we redistribute the entries of S and T. Thus, a node may hold values which originate from different rows.
Routing non-zero elements: Crucially, the assignments made above, for addressing the need to balance sending and receiving, are not global knowledge. That is, for every P i,j , the corresponding n/ab nodes decide which matrix entries are received by which node, but this is unknown to the other nodes, who need to send this information. Likewise, the redistribution of entries of S and T across the nodes is not known to all nodes. Nonetheless, clearly, a node must know the destination of each message it needs to send. As a consequence, we ultimately face the challenge of communicating some of this local knowledge. In our solution, a node that needs to receive information from a certain column of S (or row of T ) sends a request to the nodes holding subsequences of that column (or row) without knowing the exact partition into subsequences. The nodes then deliver the non-zero entries of this column (or row), which allow computing the required element-wise multiplications. Our solutions to the three challenges described above, for sending and receiving as small as possible amounts of information and for resolving a corresponding routing, and their combination, are the main innovation of our algorithm.

Related work
Matrix multiplication in the Congested Clique model: A randomized Boolean matrix multiplication algorithm was given by Drucker et al. [13], completing in O(n^{ω−2}) rounds, where ω is the exponent of sequential matrix multiplication. The best currently known upper bound is ω < 2.3728639 [15], implying O(n^{0.372}) rounds for the above. Later, Censor-Hillel et al. [9] gave a deterministic algorithm for (general) matrix multiplication over semirings, completing in O(n^{1/3}) rounds, and a deterministic algorithm for (general) matrix multiplication over rings, completing in O(n^{1−2/ω}) rounds, which by the current known upper bound on ω is O(n^{0.158}). The latter is a Strassen-like algorithm, exploiting known schemes for computing the product of two matrices over a ring without directly computing all n^3 element-wise multiplications. Then, Le Gall [16] provided fast algorithms for multiplying rectangular matrices and algorithms for computing multiple instances of products of independent matrices.
Related graph computations in the Congested Clique model: Triangle counting in the Congested Clique model was addressed by Dolev et al. [12], who provided a deterministic O(n^{(d−2)/d})-round algorithm for counting the number of appearances of any d-node subgraph, giving triangle counting in Õ(n^{1/3}) rounds. To speed up the computation for sparse instances, [12] show that every node in a graph with a maximum degree of ∆ can learn its 2-hop neighborhood within O(∆^2/n) rounds, implying the same round complexity for triangle counting. They also showed a deterministic triangle counting algorithm completing in Õ(A^2/n + log_{2+n/A^2} n) rounds, where A is the arboricity of the input graph, i.e., the minimal number of forests into which the set of edges can be decomposed. Note that a graph with arboricity A has at most An edges, but there are graphs with arboricity A and a significantly smaller number of edges. Since it holds that A ≥ m/n, this implies a complexity of Ω(m^2/n^3) for their triangle counting algorithm, upon which our O(m^{2/3}/n + 1)-round algorithm provides a cubic improvement. The deterministic matrix multiplication algorithm over rings of [9] directly gives a triangle counting algorithm with O(n^{1−2/ω}) rounds.
For APSP, the matrix multiplication algorithms of [9] give O(n^{1−2/ω}) rounds for the unweighted undirected case. For weighted directed APSP, Õ(n^{1/3}) rounds are given in [9], and improved algorithms for weighted (directed and undirected) APSP are given in [16]. We mention that our technique could allow for computing weighted APSP, but the cost would be too large due to our iterative multiplication (as opposed to the previous algorithms, which can afford iterative squaring). Algorithms for approximations of APSP are given in [9,16,25].
Note that for all graph problems, Lenzen's routing scheme [24] (see Section 1.4) implies that every node can learn the entire structure of G within O(m/n) rounds, where m is the number of edges (this can also be obtained by a simpler scheme).
Sequential matrix multiplication: The works of Gustavson [18] and of Yuster and Zwick [32] give matrix multiplication algorithms that are faster than O(n^ω) for sparse matrices. In the latter, the exact complexity depends also on the exponents of certain rectangular matrix multiplications. In a nutshell, the latter algorithm cleverly splits the input matrices into two sets of rows and columns, one dense and one very sparse, by balancing the complexities of multiplying each part. This algorithm is designed to reduce the number of multiplication operations, which is not a direct concern for distributed algorithms, in which the main cost is due to communication. Le Gall [14] improved this result for some range of sparsity by improving general rectangular matrix multiplication, for which a further improvement was recently given by Le Gall and Urrutia [17]. Kaplan et al. [20] give an algorithm for multiplying sparse rectangular matrices, and Amossen and Pagh [3] give a fast algorithm for the case of sparse square matrices for which the product is also sparse.
Parallel matrix multiplication: There are many known matrix multiplication algorithms in parallel models of computing, of which we give a non-exhaustive overview here. These algorithms are typically categorized as 1D, 2D or 3D algorithms, according to the manner in which the elementwise products are split across the nodes (by pages, rectangular prisms, or sub-cubes, respectively). Algorithms are also distinguished according to whether they are sparsity-dependent, that is, whether they leverage the structure of the non-zero elements rather than their number only.
For example, the work of Ballard et al. [6] looks for a good assignment to nodes by modeling the problem as a hypergraph. Randomly permuting the rows or columns of the input matrices can be expected to result in a balanced structure of nonzero elements. Examples for algorithms relying on random permutations of the input matrices can be found in [7,8].
Additional study appears in Solomonik et al. [27] and in Azad et al. [4]. The latter proposes algorithms of various dimensionalities and also employs permutations on the input matrices.
In a similar spirit to random permutations, fast algorithms can be devised for random matrices, as shown by the work of Ballard et al. [5], which provides fast matrix multiplication algorithms for matrices representing sparse random Erdős-Rényi graphs. The random positioning of the non-zero elements underlies the analysis of the complexity of their algorithm.
Solomonik and Demmel [28] give a 2.5-dimensional matrix multiplication algorithm, in the sense that the cube is split into (n/c)^{1/2} × (n/c)^{1/2} × c sub-cubes. A recent work by Lazzaro et al. [23] provides a 2.5D algorithm that is also suitable for sparse matrices, and which also employs random permutations of rows and columns.
We note that the sparse-dense parallel multiplication algorithm of Tiskin [30] shuffles the input in a sparsity-aware manner, but it does so to only one of the two multiplied matrices. This work also contains a path-doubling technique for computing APSP, but it is not clear what improvement this would constitute in the complexity measures addressed in our paper. An additional algorithm for sparse-dense multiplication is given in Koanantakool et al. [22].
Finally, we note that experimental studies appear in some of the above papers, and in additional works, such as by Ahmed et al. [1] and by Deveci et al. [11].
In comparison to all of the above, our algorithm is a sparsity-dependent 3D algorithm. Yet, our algorithm presents the additional complication of determining the dimensions of the sub-cube assigned to each node dynamically, depending on the number of non-zero elements in each input matrix. Moreover, the sub-cubes may consist of non-consecutive pages. Further, our transformations on the rows and columns are deterministic and are applied to both matrices. Being able to permute the two matrices and assign pages to nodes in ways which leverage the sparsity of both matrices is the crux of the novelty of our algorithm.

The model: In an instance of multiplication of two matrices S and T, the input to each node v is row v of each matrix, and its output should be row v of P = S · T. For a graph problem over a graph G of n nodes, we identify the nodes of the Congested Clique model with the nodes of G, and the input to node v in the Congested Clique model is its input in G.

Preliminaries
As defined earlier, for a matrix A we denote by nz(A) the number of non-zero elements of A. Throughout the paper, we also need to refer to the number of non-zero elements in certain sub-matrices or sequences. We will therefore overload this notation, and use nz(X) to denote the number of non-zero elements in any object X.
A pair of integers (a, b) is n-split if a, b ∈ [n], both a and b divide n, and n/ab ≥ 1. The requirement that a and b divide n is for simplification only and could be omitted. Eventually, the n-split pair that will be chosen is a = n · nz(S)^{1/3}/nz(T)^{2/3} and b = n · nz(T)^{1/3}/nz(S)^{2/3}. For a given n-split pair (a, b), it will be helpful to associate each node v ∈ [n] with three indices, two indicating the P_{i,j} sub-matrix to which the node is assigned, and one distinguishing it from the other nodes assigned to P_{i,j}. Hence, we denote each node v also as v_{i,j,k}, where i ∈ [a], j ∈ [b], and k ∈ [n/ab]. The assignment of indices to the nodes can be any arbitrary one-to-one function. Throughout our algorithms, we implicitly comply with the following: (I) no information is sent for matrix entries whose value is zero, and (II) when the value of a non-zero entry is sent, it is sent alongside its location in the matrix. Since sending a location within an n × n matrix requires O(log n) bits, the overhead in the complexity is constant.
Lenzen's routing scheme: A useful tool in designing algorithms for the Congested Clique model is Lenzen's routing scheme [24]. In this scheme, each of the n nodes can send and receive n − 1 messages (of O(log n) bits each) in a constant number of rounds. While this is easy to see in the simplest case, where each node sends a single message to every other node, the power of Lenzen's scheme is that it applies to any (multi)set of source-destination pairs, as long as each node is the source of at most n − 1 messages and the destination of at most n − 1 messages. Moreover, the multiset of pairs does not need to be known to all nodes in advance; rather, each sender only needs to know the recipients of its messages. Employing this scheme is what underlies our incentive for balancing the number of messages that need to be sent and received by all the nodes.
Useful combinatorial claims: The following are simple combinatorial claims that we use for routing messages in a load-balanced manner.

Claim 1. Let A = (a_1, . . . , a_t) be a finite set and let 1 ≤ c ≤ t be an integer. There exists a partition of A into ⌈t/c⌉ subsets of size at most c + 1 each.

Proof of Claim 1: Let A_j = {a_{j+ℓ⌈t/c⌉} | ℓ ≥ 0} for every j ∈ [⌈t/c⌉], where if j + ℓ⌈t/c⌉ > t then we ignore the notation a_{j+ℓ⌈t/c⌉}. It is easy to verify that each set is of size at most c + 1, and that the sets partition A.

Claim 2. Let A_1, . . . , A_n be finite sets of sizes t_1, . . . , t_n, respectively, and let avg ≥ (Σ_{1≤i≤n} t_i)/n be an integer. There exists a partition of each A_i into ⌈t_i/avg⌉ subsets of size at most avg + 1 each, such that the total number of subsets is at most 2n.
Proof of Claim 2: By Claim 1 with c = avg, for every 1 ≤ i ≤ n, there is a partition of A_i into ⌈t_i/avg⌉ subsets of size at most avg + 1. The total number of subsets is Σ_{1≤i≤n} ⌈t_i/avg⌉ ≤ Σ_{1≤i≤n} (t_i/avg + 1) ≤ n + (Σ_{1≤i≤n} t_i)/avg ≤ 2n, as required.
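A quick numeric sanity check of the bound in Claim 2 (the helper name and the example sizes below are ours) can be sketched in Python:

```python
import math

def partition_counts(sizes):
    """Check Claim 2 numerically: with avg = ceil(sum(t_i)/n), splitting each
    A_i into ceil(t_i/avg) chunks (each of size at most avg + 1, by Claim 1)
    yields at most 2n chunks in total."""
    n = len(sizes)
    avg = math.ceil(sum(sizes) / n)
    chunks = sum(math.ceil(t / avg) for t in sizes)
    return avg, chunks

sizes = [0, 1, 7, 2, 30, 4, 0, 12]      # n = 8 sets of these sizes
avg, chunks = partition_counts(sizes)
```

Even though one set (of size 30) is much larger than the rest, it only contributes ⌈30/avg⌉ chunks, so the total stays below 2n.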
Claim 3. Given a sorted finite multiset A = (a_1, . . . , a_n) of natural numbers, an integer x ∈ ℕ such that for all i ∈ [n] it holds that a_i ≤ x, and an integer k that divides n, there exists a partition of A into k subsets A_1, . . . , A_k, each of size n/k, such that sum(A_j) ≤ sum(A)/k + x for every j ∈ [k].

Proof of Claim 3: We show that A_j = {a_{j+ℓk} | 0 ≤ ℓ < n/k} gives the claimed partition. Since A is sorted, we have that sum(A_1) ≤ sum(A_2) ≤ · · · ≤ sum(A_k). In addition, removing the last element from A_k gives that sum(A_k) − a_n = Σ_{ℓ=0}^{n/k−2} a_{(ℓ+1)k}, and since each a_{(ℓ+1)k} is the smallest among a_{(ℓ+1)k}, . . . , a_{(ℓ+1)k+k−1}, it follows that k(sum(A_k) − a_n) ≤ sum(A). Since a_n ≤ x, we conclude that sum(A_j) ≤ sum(A_k) = k(sum(A_k) − a_n)/k + a_n ≤ sum(A)/k + x, for every j ∈ [k], which completes the proof.
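The strided construction in the proof of Claim 3 can be sketched in Python (illustration only; `strided_partition` is a hypothetical helper name):

```python
def strided_partition(A, k):
    """Partition a sorted list A of n numbers (k divides n) into k subsets
    A_1..A_k, where A_j takes every k-th element starting at offset j - 1
    (0-indexed: A[j::k]).  Claim 3 guarantees that every subset satisfies
    sum(A_j) <= sum(A)/k + max(A)."""
    n = len(A)
    assert n % k == 0 and A == sorted(A)
    return [A[j::k] for j in range(k)]

A = sorted([3, 1, 4, 1, 5, 9, 2, 6])    # n = 8
parts = strided_partition(A, 4)
bound = sum(A) / 4 + max(A)
```

Each subset has exactly n/k elements, and interleaving the sorted values is what keeps the subset sums within an additive x of the average.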

Fast Sparse Matrix Multiplication
Our main result is Theorem 1, stating the guarantees of our principal algorithm SMM (Sparse Matrix Multiplication) for fast multiplication of sparse matrices. Algorithm SMM first manipulates the structure of its input matrices and then calls algorithm SBMM (Sparse Balanced Matrix Multiplication), which solves the problem of fast sparse matrix multiplication under additional assumptions on the distributions of non-zero elements in the input matrices, which are defined next. In Section 2.1, we show how SMM computes general matrix multiplication P = ST , given Algorithm SBMM and Theorem 5. Algorithm SBMM, and Theorem 5 which states its guarantees, are deferred to Section 2.2.
Theorem 1 (repeated). Given two n × n matrices S and T, Algorithm SMM deterministically computes the product P = S · T over a semiring in the Congested Clique model, completing in O(nz(S)^{1/3} nz(T)^{1/3}/n + 1) rounds.
We proceed to presenting Theorem 5, which discusses SBMM. SBMM multiplies matrices S and T in which the non-zero elements are roughly balanced between portions of the rows of S and columns of T. In what follows, for a matrix A, the notation A[x : y][*] refers to rows x through y of A, and the notation A[*][x : y] refers to columns x through y of A. In the following definition we capture the needed properties of well-balanced matrices.

Definition 1. Let (a, b) be an n-split pair, and for i ∈ [a], j ∈ [b], denote S_i = S[(i − 1)(n/a) + 1 : i(n/a)][*] and T_j = T[*][(j − 1)(n/b) + 1 : j(n/b)]. The matrices S and T are a sparsity-balanced pair with respect to (a, b) if they satisfy:
• S-condition: For every i ∈ [a], nz(S_i) ≤ nz(S)/a + n.
• T-condition: For every j ∈ [b], nz(T_j) ≤ nz(T)/b + n.
These conditions ensure that bands of adjacent rows of S and columns of T contain roughly the same number of non-zero elements. We can now state our theorem for multiplying sparsity-balanced matrices, which summarizes our algorithm SBMM.
Theorem 5. Given two n × n matrices S and T and an n-split pair (a, b), if S and T are a sparsity-balanced pair with respect to (a, b), then Algorithm SBMM deterministically computes the product P = S · T over a semiring in the Congested Clique model, completing in O(nz(S) · b/n^2 + nz(T) · a/n^2 + n/ab + 1) rounds.
We show that O(1) rounds are sufficient in the Congested Clique model for transforming any two general matrices S and T into sparsity-balanced matrices S' and T' by invoking standard matrix permutation operations. Therefore, in essence, Algorithm SMM performs permutation operations on S and T, generating the matrices S' and T', respectively, invokes SBMM on S' and T' to compute P' = S'T', and finally recovers P from P'.

Fast General Sparse Matrix Multiplication -Algorithm SMM
Algorithm Description: First, each node distributes the entries in its row of T to other nodes in order for each node to obtain its column of T. Then, the nodes broadcast the number of non-zero elements in their respective row of S and column of T, in order for all nodes to compute nz(S) and nz(T). Having this information, the nodes locally compute the n-split pair (a, b) that minimizes the expression nz(S) · b/n^2 + nz(T) · a/n^2 + n/ab, which describes the round complexities of the three parts of Algorithm SBMM. It can be shown that the pair (n · nz(S)^{1/3}/nz(T)^{2/3}, n · nz(T)^{1/3}/nz(S)^{2/3}) minimizes this expression. Then, the nodes permute the rows of S and the columns of T so as to produce matrices S' and T' which have the required balance. Subsequently, Algorithm SBMM is executed on the permuted matrices S' and T', followed by invoking the inverse permutations on the product P' = S'T' in order to obtain the product P = S · T of the original matrices. A pseudocode of SMM is given in Algorithm 1.
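The local computation of the minimizing n-split pair can be sketched by brute force over the divisors of n (a hypothetical single-machine helper, mirroring this step of the description):

```python
def best_n_split(n, nz_S, nz_T):
    """Brute-force the n-split pair (a, b) minimizing the SBMM cost
    nz(S)*b/n^2 + nz(T)*a/n^2 + n/(a*b), over divisors a, b of n with
    n/(a*b) >= 1.  Every node can do this locally once nz(S) and nz(T)
    are known."""
    divisors = [d for d in range(1, n + 1) if n % d == 0]
    best, best_cost = None, float("inf")
    for a in divisors:
        for b in divisors:
            if a * b > n:
                continue                    # violates n/ab >= 1
            cost = nz_S * b / n**2 + nz_T * a / n**2 + n / (a * b)
            if cost < best_cost:
                best, best_cost = (a, b), cost
    return best, best_cost

# With nz(S) = nz(T) = n^2 and n = 64, the closed form gives
# a = b = n / nz(S)^{1/3} = 4, matching the brute force.
best, cost = best_n_split(64, 4096, 4096)
```

For nz(S) = nz(T) = n^2, the resulting cost s^{1/3}t^{1/3}/n is Θ(n^{1/3}), consistent with Corollary 1.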
Proof of Theorem 1: To prove correctness, we need to show that the matrices S' and T' computed in Line 10 are a sparsity-balanced pair of matrices with respect to the n-split pair (a, b) determined in Line 8. (Line 10 states: Let σ be a permutation whose n × n permutation matrix A_σ is such that the rows of the matrix S' = A_σ S that correspond to any single A^S_u are adjacent, and let τ be a permutation whose n × n permutation matrix A_τ is such that the columns of the matrix T' = T A_τ that correspond to any single A^T_u are adjacent.) Once this is proven, the correctness of the algorithm is as follows. In Lines 1-12 the matrices S' and T' are computed and distributed among the nodes such that each node v ∈ [n] holds row v of S' and column v of T'. The loop of Line 13 is only for consistency, having the input to SBMM be the respective rows of both S' and T'. Assuming the correctness of Algorithm SBMM given in Theorem 5, the matrix P' computed in Line 15 is the product P' = S'T'. Finally, in the last loop, node v receives row v of P = A_σ^{-1} P' A_τ^{-1}, completing the correctness of Algorithm SMM.
We now show that S' and T' are indeed a sparsity-balanced pair of matrices with respect to (a, b). To this end, we first need to show that for all i ∈ [a], the number of non-zero elements in S'_i is at most nz(S')/a + n. By construction, the number of non-zero elements in S'_i = S'[(i−1)(n/a)+1 : i(n/a)][*] is exactly sum(A^S_i) of the partition computed in Line 9. By Claim 3 this is bounded by sum(A)/k + x, which in our case is nz(S)/a + n = nz(S')/a + n. Thus, S' satisfies the S-condition of Definition 1. A similar argument shows that T' satisfies the T-condition of Definition 1.
For the complexity, we sum the number of rounds as follows. The first loop allows every node v to obtain column v of T, while in the second loop the nodes exchange the numbers of non-zero elements in rows and columns of S and T, respectively. Even without resorting to Lenzen's routing scheme, both of these loops can be completed within O(1) rounds. A similar argument shows that O(1) rounds suffice for permuting S and T into S' and T', and for permuting P' back into P. Thus, all lines of the pseudocode excluding Line 15 complete in O(1) rounds. This implies that the complexity of Algorithm SMM equals that of Algorithm SBMM when given S', T', a, and b as input. By Theorem 5 and due to the choice of a and b in Line 8, this complexity is O(min_{n-split pairs (a,b)} {nz(S) · b/n^2 + nz(T) · a/n^2 + n/ab + 1}). Choosing a = n · nz(S)^{1/3}/nz(T)^{2/3} and b = n · nz(T)^{1/3}/nz(S)^{2/3} gives a complexity of O(nz(S)^{1/3} nz(T)^{1/3}/n + 1) rounds, which can be shown to be optimal.

Fast Sparse Balanced Matrix Multiplication -Algorithm SBMM
Here we present SBMM and prove Theorem 5. We begin with a short overview of the algebraic computations and node allocation in SBMM. We then present a communication scheme detailing how to perform the computations of SBMM in the Congested Clique model in O(M S · b/n 2 + M T · a/n 2 + n/ab + 1) rounds of communication.
Algorithm Description: Consider the partition of P into ab rectangles, such that ∀(i, j) ∈ [a] × [b], sub-matrix P i,j = P [(i − 1)(n/a) + 1 : i(n/a)][(j − 1)(n/b) + 1 : j(n/b)]. Each sub-matrix P i,j is an n/a × n/b matrix, i.e., has n 2 /ab entries. Notice that P i,j = S i · T j . We assign the computation of P i,j to a unique set of n/ab nodes N i,j = {v i,j,k |k ∈ [n/ab]}.
In the initial phase of algorithm SBMM, for every (i, j) ∈ [a] × [b], each non-zero element of S i and T j is sent to some node in N i,j . Due to the sparsity-balanced property of S and T , all S i 's have roughly the same amount of non-zero elements, and likewise all T j 's. Therefore, each set of nodes N i,j receives roughly the same amount of non-zero elements from S and T .
Within each N i,j , the computation of P i,j is carried out according to the following framework.
For every ℓ ∈ [n], let P i,j,ℓ = S i [ * ][ℓ] · T j [ℓ][ * ] be the outer product of column ℓ of S i and row ℓ of T j . The computation of the n different P i,j,ℓ sub-matrices is split among the nodes in N i,j as follows: the set [n] is partitioned into A i,j,1 , . . . , A i,j,n/ab such that for each k ∈ [n/ab], node v i,j,k ∈ N i,j is required to compute the entries of the matrices in the set {P i,j,ℓ | ℓ ∈ A i,j,k }. Then, node v i,j,k ∈ N i,j locally sums its computed sub-matrices to produce P k i,j = Σ ℓ∈A i,j,k P i,j,ℓ . Clearly, due to the associativity and commutativity of the addition operation in the semiring, it holds that P i,j = Σ ℓ∈[n] P i,j,ℓ = Σ k∈[n/ab] P k i,j . Therefore, once every node v i,j,k has P k i,j , the nodes can collectively compute P , and redistribute its entries in a straightforward manner such that each node obtains a distinct row of P .
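The decomposition above can be illustrated with a short sequential sketch (ours, not the distributed protocol): each ℓ contributes a rank-one term, and any partition of [n] among the simulated nodes of N i,j sums to the same block S i · T j . Plain integer (+, ·) stands in for the semiring operations, which is all the argument uses.

```python
# Sequential sketch of the algebraic core of SBMM (illustrative names).

def rank_one(Si, Tj, l):
    # P_{i,j,l}[r][c] = S_i[r][l] * T_j[l][c]: outer product of column l and row l
    return [[Si[r][l] * Tj[l][c] for c in range(len(Tj[0]))] for r in range(len(Si))]

def add(P, Q):
    return [[p + q for p, q in zip(pr, qr)] for pr, qr in zip(P, Q)]

def block_product(Si, Tj, partition):
    # partition: list of index sets A_{i,j,k}, one per simulated node of N_{i,j}
    total = [[0] * len(Tj[0]) for _ in range(len(Si))]
    for A in partition:                      # node v_{i,j,k} handles A_{i,j,k}
        Pk = [[0] * len(Tj[0]) for _ in range(len(Si))]
        for l in A:                          # local sum P^k_{i,j}
            Pk = add(Pk, rank_one(Si, Tj, l))
        total = add(total, Pk)               # combining the P^k_{i,j} gives S_i * T_j
    return total
```

Any two partitions of [n] yield the same block, by associativity and commutativity of addition.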
Implementing SBMM: A pseudocode for Algorithm SBMM is given in Algorithm 2, which consists of three components: exchanging information between the nodes such that every node v i,j,k has the required information for computing P i,j,ℓ for every ℓ ∈ A i,j,k ; local computation of P k i,j for each (i, j, k) ∈ [a] × [b] × [n/ab]; and, finally, the communication of the P k i,j matrices and the assembling of the rows of P .
The technical challenge is in Line 1, upon which we elaborate below. In Lines 2-3, only local computations are performed, resulting in each node v i,j,k holding P k i,j . In Line 4, each node sends each row of its sub-matrix P k i,j to the appropriate node, so that in Line 6 each node can sum this information to produce its row in P . Formally, we prove the following.

Algorithm 2: SBMM (S, T, a, b). Code for node v ∈ [n], which is also denoted v i,j,k .
1 ExchangeInfo (S, T, a, b)
2 Locally compute P i,j,ℓ for every ℓ ∈ A i,j,k
3 Locally compute P k i,j = Σ ℓ∈A i,j,k P i,j,ℓ
4 foreach row of P k i,j do
5 Send the row P k i,j [ * ] to the node of the respective row
6 foreach ℓ ∈ [n] do
7 Compute entry ℓ of row v of P as the sum of the received values for (v, ℓ)

In the loop of Line 4, each node sends each of the entries of its sub-matrix P k i,j to a single receiving node, implying that each node sends n 2 /ab messages. To verify that this is also the number of messages received by each node, recall that each entry of the matrix P is computed in Line 6 as a summation of n/ab entries, each an entry of P k i,j for the two appropriate values of i, j and for all k ∈ [n/ab]. Since such n/ab messages need to be received for every entry of the row, this results in receiving n 2 /ab messages.
For the above, we use Lenzen's routing scheme, which completes in n/ab rounds, completing the proof.
The remainder of this section is dedicated to presenting and analyzing Line 1. During this part of the algorithm, for every (i, j) ∈ [a] × [b], each entry in S i and T j needs to be sent to a node in N i,j . As per our motivation throughout the entire algorithm, we strive to achieve this goal in a way which ensures that all nodes send and receive roughly the same number of messages. This leads to the following three challenges which we need to overcome.
Sending Challenge: Initially, node v holds row v of S and row v of T . Every column v of S needs to be sent to b nodes - one node in each N i,j , for every j ∈ [b] and the appropriate i determined by the row indices of its entries. Since the number of non-zero elements may vary greatly across the columns of S and the rows of T , the challenge is to ensure that every node sends roughly the same number of messages.

Receiving Challenge: Since S and T are sparsity-balanced w.r.t. (a, b), for every (i, j) ∈ [a] × [b] it holds that the number of messages to be received by each set of nodes N i,j is at most nz(S)/a + nz(T )/b + 2n. This ensures that each node set N i,j receives roughly the same amount of messages as every other node set. The challenge remains to ensure that within any given node set N i,j , every node receives roughly the same number of messages.
Routing Challenge: When overcoming the above mentioned challenges in a non-trivial manner, all nodes locally determine that they are senders and recipients of certain messages with the guarantee that each node sends and receives roughly the same number of messages. However, these partitions of sending and receiving messages are obtained independently and thus are not global knowledge; a sender of a message does not necessarily know who the recipient is. The routing challenge is thus to ensure that each node associates the correct recipient with every message that it sends.

ExchangeInfo (S, T, a, b)
We next present our implementation of ExchangeInfo (S, T, a, b) which solves the above challenges in an on-the-fly manner. To simplify the presentation, we split ExchangeInfo (S, T, a, b) into its three components, as given in the pseudocode of Algorithm 3.
Algorithm 3: ExchangeInfo (S, T, a, b): Sending each entry of S i , T j to a node in N i,j , for every (i, j) ∈ [a] × [b].

Compute-Sending: In Compute-Sending, whose pseudocode is given in Algorithm 4, we overcome the sending challenge. The nodes communicate the distribution of non-zero elements across the columns of S and the rows of T and reorganize the entries held by each node such that all nodes hold roughly the same amount of non-zero elements of S and T .
Notably, in order to enable fast communication in Resolve-Routing, Algorithm 4 must guarantee no node holds entries of more than two columns of S and two rows of T .
Algorithm 4: Compute-Sending: Code for node v ∈ [n], which is also denoted v i,j,k .

Lemma 2. Algorithm Compute-Sending completes in O(1) rounds, after which the entries of S and T are evenly redistributed across the nodes such that every node holds elements from at most 2 columns of S and 2 rows of T , and such that every node v knows, for every node u, the indices of the two columns of S and two rows of T from which the elements which u holds are taken.
Proof. In Lines 2, 4, 5 the nodes exchange entries of S such that each node holds a distinct column of S, and knows the number of non-zero entries in each column of S and in each row of T . This allows local computation of the average number of non-zeros in the following two lines, as well as locally computing the (same) partition into subsequences.
By Claim 2, in total across all n columns there are at most 2n subsequences of entries from S, and similarly there are at most 2n subsequences from T . Since for every u ∈ [n], all nodes know nz(S[ * ][u]) and nz(T [u][ * ]), all nodes know how many subsequences are created for each u. Thus, all nodes can agree in Line 9 on the assignment of the subsequences, with each node assigned at most 2 subsequences of entries of S and 2 of entries of T . Crucially for what follows, all the nodes know the column in S or the row in T to which each subsequence B belongs; we denote this index ℓ(B). The entries of each subsequence B are then sent to its node v(B) in the following loop.
For the round complexity, note that a node v sends a single message to every other node in each of Lines 2, 4, and 5. The rest of the computation until Line 9 is done locally. Therefore, these lines complete within 3 rounds.
In the last loop of Algorithm 4, node v potentially sends all subsequences with entries from column v of S and row v of T . Since each subsequence is sent only once, no subsequences overlap, and all the subsequences which v sends are parts of a single column of S and a single row of T , node v sends at most 2n messages during this loop. Additionally, since every node receives at most 4 subsequences and each subsequence consists of at most n entries, each node receives at most 4n messages. Thus, by using Lenzen's routing scheme, this completes in O(1) rounds as well.
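The splitting step underlying Claim 2 can be sketched sequentially as follows (names are illustrative): chop each column's non-zeros into chunks of at most ⌈nz(S)/n⌉ entries. A column with c non-zeros then yields at most c/(nz(S)/n) + 1 chunks, so at most 2n subsequences are created in total, each of size at most nz(S)/n + 1.

```python
import math

# Sketch of the column-splitting idea behind Compute-Sending (illustrative only).

def split_columns(col_sizes):
    """col_sizes[u] = nz(S[*][u]); returns a list of (column, length) subsequences."""
    total = sum(col_sizes)
    cap = max(1, math.ceil(total / len(col_sizes)))  # target chunk size ~ nz(S)/n
    subseqs = []
    for u, c in enumerate(col_sizes):
        while c > cap:                               # full chunks of the column
            subseqs.append((u, cap))
            c -= cap
        if c > 0:                                    # the leftover chunk
            subseqs.append((u, c))
    return subseqs
```

Each node can then be assigned at most 2 of the resulting subsequences, as in Line 9 of Algorithm 4.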
Compute-Receiving: The pseudocode for Compute-Receiving is given in Algorithm 5. This algorithm assigns the P i,j,ℓ matrices to different nodes in N i,j . Specifically, each node v i,j,k in N i,j is assigned ab such matrices, while verifying that all nodes in N i,j require roughly the same amount of non-zero entries from S and T in order to compute all their assigned P i,j,ℓ matrices. Since each sub-matrix P i,j,ℓ is the product of column ℓ of S i and row ℓ of T j , the number of entries required for computing it is w(P i,j,ℓ ) = nz(S i [ * ][ℓ]) + nz(T j [ℓ][ * ]). By this definition, in order to obtain that each node in N i,j requires roughly the same amount of messages in order to compute all of its assigned P i,j,ℓ matrices, we assign the P i,j,ℓ matrices to the nodes of N i,j such that the total communication cost, as measured by w, of all matrices assigned to a given node is roughly the same for all nodes.
Algorithm 5: Compute-Receiving: Code for node v ∈ [n], which is also denoted v i,j,k .
9 Let A i,j,1 , . . . , A i,j,n/ab be a partition of the sorted multiset {w(P i,j,ℓ ) | ℓ ∈ [n]} into n/ab multisets with a bound x = 2n on its elements, proven to exist in Claim 3.
10 Let A i,j,1 , . . . , A i,j,n/ab be a partition of [n] such that for every k ∈ [n/ab], {w(P i,j,ℓ ) | ℓ ∈ A i,j,k } is the k-th multiset of the partition of Line 9.

Proof. The loop of Line 1 provides each node of N i,j with the number of non-zero elements in each column of S i and each row of T j . This allows the nodes to compute the required communication costs in Line 7. Claim 3 implies that after executing Line 10, each node v i,j,k is assigned a subset A i,j,k ⊆ [n], such that for every k ∈ [n/ab] it holds that Σ ℓ∈A i,j,k w(P i,j,ℓ ) ≤ (ab/n) Σ ℓ∈[n] w(P i,j,ℓ ) + 2n. Every node sends every other node exactly 4 messages throughout the loop in Line 1, while the remaining lines are executed locally, without communication. As such, this completes in O(1) rounds in total.
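The kind of partition Claim 3 guarantees can be sketched with a standard greedy heuristic (the paper's construction works directly on the sorted multiset and may differ in detail): hand each element, in decreasing weight order, to the currently lightest of k bins. The lightest bin carries at most the average load before receiving a new element, so every bin ends with at most sum/k + x, where x bounds the individual weights.

```python
import heapq

# Greedy sketch of a balanced partition with the Claim 3-style bound sum/k + x.

def balanced_partition(weights, k):
    heap = [(0, b) for b in range(k)]     # (current load, bin id)
    heapq.heapify(heap)
    groups = [[] for _ in range(k)]
    for idx in sorted(range(len(weights)), key=lambda i: -weights[i]):
        load, b = heapq.heappop(heap)     # lightest bin so far
        groups[b].append(idx)
        heapq.heappush(heap, (load + weights[idx], b))
    return groups
```

In Compute-Receiving the elements are the costs w(P i,j,ℓ ), k = n/ab, and x = 2n.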
Resolve-Routing: Roughly speaking, we solve this challenge by having the recipient of each possibly non-zero entry deduce which node is the sender of this entry, and inform the sender that it is its recipient, as follows. At the end of the execution of Compute-Sending in Algorithm 4, every node v has at most two subsequences in B S (v) and at most two subsequences in B T (v). Moreover, the subsequence assignment is known to all nodes, due to all nodes performing the same local computation in Line 9. On the other hand, upon completion of Algorithm 5, node v i,j,k is assigned the task of computing P i,j,ℓ for every ℓ ∈ A i,j,k . For this, it suffices for v i,j,k to know the non-zero entries of column ℓ of S i and of row ℓ of T j .
Hence, in Resolve-Routing, given in Algorithm 6, node v i,j,k sends every index ℓ ∈ A i,j,k to the nodes that hold subsequences of column ℓ in S and row ℓ in T . Notice that v i,j,k does not know which indices inside these columns and rows are non-zero. However, the nodes which hold these subsequences have this information, and respond with the non-zero entries of the respective columns and rows that are part of S i or T j . For the round complexity, notice that by Lemma 2, for each node u, B S (u) contains entries from at most two distinct columns of S and B T (u) contains entries from at most two distinct rows of T . Therefore, every node sends at most 4 messages to every other node throughout Lines 1-5. Thus, this part completes in 4 rounds.
We now show that Lines 6-11 complete in O(nz(S) · b/n 2 + nz(T ) · a/n 2 + 1) rounds. Each node v sends the entries of every B ∈ B S (v) to a single node in each of b sets N i,j . Since nz(B) ≤ nz(S)/n + 1 by Claim 2, this implies that each node sends at most O(nz(S) · b/n) entries in Lines 6-8. A similar argument shows that each node sends at most O(nz(T ) · a/n) entries for each of the two B ∈ B T (v). In total, this sums up to sending at most O(nz(S) · b/n + nz(T ) · a/n) entries by each node. Likewise, we show that this is also the number of entries that need to be received by each node. This is because the number of non-zero entries from S and T required for v i,j,k to compute the entries of the matrices P i,j,ℓ for ℓ ∈ A i,j,k is at most (nz(S)/a + n)/(n/ab) + 2n + (nz(T )/b + n)/(n/ab) + 2n, by Lemma 3. Due to the fact that n/ab ≥ 1, i.e., ab ≤ n, the previous expression is bounded above by (nz(S)/a)/(n/ab) + (nz(T )/b)/(n/ab) + 6n = nz(S) · b/n + nz(T ) · a/n + 6n.
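The request-and-reply handshake of Resolve-Routing can be mimicked sequentially as follows; the data layout and names here are illustrative only, not the paper's pseudocode. A requester asks the globally known holders of subsequences of the columns it needs, and each holder replies with its non-zero entries of that column.

```python
# Toy sequential rendition of the Resolve-Routing handshake (illustrative only).

def resolve_routing(holders, requests):
    """holders: dict column -> dict holder_node -> {row: value} (its subsequence).
    requests: dict requester_node -> set of needed column indices.
    Returns, per requester, the collected (row, column) -> value entries."""
    received = {v: {} for v in requests}
    for v, cols in requests.items():
        for l in cols:                              # v knows who holds column l
            for u, entries in holders.get(l, {}).items():
                for r, val in entries.items():      # u replies with its non-zeros
                    received[v][(r, l)] = val
    return received
```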
We can now wrap up the proof of Theorem 5.
Proof of Theorem 5: Lemma 4 implies that each node v i,j,k has the required entries of S and T in order to compute P i,j, for every ∈ A i,j,k . Lemma 1 then gives that Algorithm SBMM correctly produces a row of P = ST for each node. Lemmas 2 and 3 show that Compute-Sending and Compute-Receiving complete in O(1) rounds. Lemma 4 gives the claimed round complexity of O(nz(S) · b/n 2 + nz(T ) · a/n 2 + 1) for Resolve-Routing, giving the same total number of rounds for ExchangeInfo. By Lemma 1, the remainder of Algorithm SBMM completes in O(n/ab + 1) rounds, completing the proof.

APSP and counting 4-cycles
As applications of Algorithm SMM, we improve upon the state-of-the-art in the Congested Clique model, for several fundamental graph problems, when considering sparse graphs. In Section 3.1, we utilize SMM alongside an additional algorithm for calculating the trace of the product of two matrices in order to count the number of 4-cycles of a given graph G. In Section 3.2, we utilize SMM for computing APSP in a way which is faster for some range of parameters that depends on the sparsity and the diameter of G.
In what follows, given a graph G, we denote by m the number of its edges and by A G its adjacency matrix.

Counting 4-cycles
In order to compute the number of 4-cycles of G, it is sufficient to compute the trace of A 4 G and the degrees of the nodes, as described in [9]. This allows us to utilize Algorithm SMM for squaring the adjacency matrix A G and deducing the trace of A 4 G . It is noteworthy that we avoid explicitly raising A G to the power of 4, and compute the trace of this power by only squaring A G .
Theorem 2 (repeated) There is a deterministic algorithm that computes the number of 4-cycles in an n-node graph G in O(m 2/3 /n + 1) rounds in the Congested Clique model, where m is the number of edges of G.
Proof. First, observe that given two n × n matrices A, B, it is possible to compute the trace of the matrix A · B in O(1) communication rounds in the Congested Clique model. This is done by redistributing B across the nodes such that node v holds column v instead of row v of B, and having each node v ∈ [n] locally compute the diagonal entry (A · B)[v][v] as the inner product of row v of A and column v of B. Then, each node v broadcasts (A · B)[v][v] to all other nodes in a single round, and thus all nodes are able to sum these values and deduce trace(A · B).
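A sequential rendition of this observation, with list index v playing the role of node v, is the following sketch (ours, for illustration):

```python
# Each "node" v forms its diagonal entry from row v of A and column v of B;
# summing the broadcast diagonal entries yields trace(A*B) without computing A*B.

def trace_of_product(A, B):
    n = len(A)
    diag = [sum(A[v][l] * B[l][v] for l in range(n)) for v in range(n)]
    return sum(diag)  # corresponds to one broadcast round in the Congested Clique
```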
For counting 4-cycles, the nodes first execute Algorithm SMM for obtaining A 2 G , and then compute trace(A 4 G ) = trace(A 2 G · A 2 G ), as explained above. By a result of Alon et al. [2], the number of 4-cycles in G equals (1/8)(trace(A 4 G ) − 2m − 2 Σ v∈V d v (d v − 1)), where d v is the degree of node v. Since all nodes can obtain d v for all nodes v in a single round of broadcasting the degrees, by Corollary 1, this procedure completes in O(m 2/3 /n + 1) rounds.
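Taking the counting formula as stated above, the whole computation can be checked sequentially (a sketch of ours, with trace(A 4 G ) obtained, as in the algorithm, by squaring A G and taking trace(A 2 G · A 2 G )):

```python
# Sequential sanity check of the 4-cycle count via trace(A^2 * A^2).

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(n)] for i in range(n)]

def count_4cycles(adj):
    n = len(adj)
    A2 = matmul(adj, adj)
    tr4 = sum(matmul(A2, A2)[v][v] for v in range(n))      # trace(A^4)
    deg = [sum(row) for row in adj]
    m = sum(deg) // 2
    # (trace(A^4) - 2m - 2*sum_v d_v(d_v - 1)) / 8
    return (tr4 - 2 * m - 2 * sum(d * (d - 1) for d in deg)) // 8
```

On the 4-cycle graph this yields 1, and on K4 it yields 3, as expected.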
We remark that Alon et al. [2] give similar trace-based formulas for counting cycles of larger lengths k.

APSP
In order to compute APSP for a given graph G, it is sufficient to compute A D G over the min-plus semiring, where D is the diameter of G. Notice that we cannot use the approach of [9], which repeatedly squares the adjacency matrix and thus pays only a logarithmic overhead beyond a single multiplication, because powers of a sparse matrix may be dense. However, if we know D, then by repeatedly applying Corollary 2 D − 1 times, we can compute APSP in O(D((m/n) 1/3 + 1)) rounds. In fact, any constant approximation of D suffices, and hence we first run a simple BFS computation in order to obtain a 2-approximation D̃ of D, which we then follow with raising A G to the power of 2D̃. This gives the following.
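The min-plus powering underlying this approach can be sketched sequentially (our illustration): iterated products with the sparse matrix A G , as in the text, rather than repeated squaring, with 0 on the diagonal, 1 on edges, and infinity elsewhere.

```python
import math

# Sequential sketch of APSP by D-1 min-plus products with the adjacency matrix.

INF = math.inf

def minplus(A, B):
    # (min, +) semiring product
    n = len(A)
    return [[min(A[i][l] + B[l][j] for l in range(n)) for j in range(n)] for i in range(n)]

def apsp(edge_list, n, D):
    A = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v in edge_list:                  # unweighted undirected edges
        A[u][v] = A[v][u] = 1
    P = A
    for _ in range(D - 1):                  # A^D over the min-plus semiring
        P = minplus(P, A)
    return P
```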

Theorem 3 (repeated)
There is a deterministic algorithm that computes unweighted undirected APSP in an n-node graph G in O(D((m/n) 1/3 + 1)) rounds in the Congested Clique model, where m is the number of edges of G and D is its diameter.

Triangle Listing
In [26], Pandurangan et al. show a randomized algorithm for listing all triangles, completing in Õ(m/n 5/3 ) rounds, w.h.p. Our contribution is a deterministic algorithm which always completes in O(m/n 5/3 + 1) rounds. For dense graphs, with m = Θ(n 2 ), this matches the tight bound of Θ̃(n 1/3 ) given by the lower bounds of [19,26].

Theorem 4 (repeated)
There is a deterministic algorithm for triangle listing in an n-node, m-edge graph G in O(m/n 5/3 + 1) rounds in the Congested Clique model.
In fact, our algorithm applies also to directed graphs, and is able to distinguish directed triangles. Each edge in the graph is oriented, and for every v ∈ V , we let d in (v) and d out (v) denote the in-degree and out-degree of node v, respectively. The undirected case follows easily, for example by viewing every undirected edge as two directed edges in opposite directions.
Before giving the algorithm, we begin with some definitions and notations. For two sets S 1 , S 2 ⊆ V we denote the set of edges from S 1 to S 2 by E(S 1 , S 2 ) = E ∩ (S 1 × S 2 ). When one of the sets is a singleton we abuse notation and write, e.g., E(v, S 1 ). Our algorithm uses several partitions of sets of elements that are computed by the nodes after learning the amounts of elements, similarly to what we do in our matrix multiplication algorithm.

Definition 2 (Equally Sized Partition). A partition {A 1 , . . . , A t } of V is called a t-equally sized partition if |A i | = n/t for every i ∈ [t].

For a partition {A 1 , . . . , A t } of V , we sometimes need an internal numbering of the nodes within each set A i which is uniquely determined by the node identifiers. Thus, every node in V has a unique pair of indexes (i, j) such that a i,j is the j-th node in the internal numbering of A i .
Our algorithm uses, among other computations, three subroutines that we describe separately. The first, broadcast(m), consists of node v sending a message m to every node in the graph. The second and third, LearnEdges(X) and LearnPaths(X), result in node v gaining knowledge of all the elements in the set X; they are implemented in two different ways, for their two different uses. Notice that LearnPaths executes the Resolve-Routing algorithm; in order to match the notation used in Algorithm 7, the indexing in Lines 8 and 11 of Resolve-Routing should be replaced with the edges exiting and entering the node sets V j δ and V i δ , respectively.
Algorithm Overview. Let E 3 be the set of all ordered triplets of edges in G, and let T ri ⊆ E 3 be the set of all directed triangles in G. Our goal is to distribute (perhaps multiple copies of) parts of E across the nodes of the graph such that each (e 1 , e 2 , e 3 ) ∈ T ri is known to at least one node. For intuition, observe the following simple algorithm for triangle listing, which also appears in [12]: (1) arbitrarily partition the nodes of the graph into an n 1/3 -equally sized partition {V 1 , . . . , V n 1/3 }, (2) assign every ordered triplet of sets (V i , V j , V k ) from the partition to a different node and have each node learn E(V i , V j ) ∪ E(V j , V k ) ∪ E(V k , V i ). Clearly, any triangle will be found by some node. This simple approach is costly due to the load imbalance from which it suffers: for different triplets (V i , V j , V k ), the number of edges in E(V i , V j ) ∪ E(V j , V k ) ∪ E(V k , V i ) may vary drastically, with some triplets being very dense while others are sparse.
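The simple scheme of [12] described above can be rendered sequentially as follows (our sketch; one loop iteration per ordered triplet plays the role of the responsible node). It illustrates why every directed triangle is caught, though not the load imbalance that motivates our refinement.

```python
from itertools import product

# Sequential sketch of the simple (imbalanced) triangle-listing scheme.

def list_triangles(n, edges, t):
    # t-equally sized partition of [n] into consecutive blocks (t must divide n)
    part = [set(range(i * (n // t), (i + 1) * (n // t))) for i in range(t)]
    E = set(edges)
    found = set()
    for Vi, Vj, Vk in product(part, repeat=3):    # one "node" per ordered triplet
        known = {(u, v) for (u, v) in E
                 if (u in Vi and v in Vj) or (u in Vj and v in Vk) or (u in Vk and v in Vi)}
        for (a, b) in known:                      # look for a->b->d->a inside known
            for (c, d) in known:
                if b == c and (d, a) in known:
                    found.add((a, b, d))
    return found
```

A triangle v 1 → v 2 → v 3 → v 1 with v 1 ∈ V i , v 2 ∈ V j , v 3 ∈ V k is always found by the iteration handling (V i , V j , V k ).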
Our approach is to utilize load balancing routing strategies from Algorithm SMM, in order to create a layered partition of V that ensures that every node learns roughly the same amount of edges. In more detail, instead of having a single node in charge of every triplet (V i , V j , V k ), we take another (fixed) partition {D 1 , . . . , D n 2/3 } of V which is an n 2/3 -equally sized partition, and assign each pair (V i , V j ) to a set of n 1/3 nodes in some D k . Then, within each D k , the nodes again partition V into {P 1 , . . . , P n 1/3 } in order to take into account the distribution of edges between the graph and the pair (V i , V j ) that is assigned to D k . Here too, we get that all triangles in the graph are guaranteed to be listed. For the cost of communication to be very efficient, we choose the two non-fixed partitions carefully, according to information that the nodes exchange regarding amounts of edges between different sets in the graph.
Thus, Algorithm 7 works as follows. We begin by partitioning V into an n 1/3 -equally sized partition {V 1 , . . . , V n 1/3 } of V that has the property that for every i ∈ [n 1/3 ], Σ u∈V i (d in (u) + d out (u)) ≤ 2α. Then, we take every pair of sets in the partition, (V i , V j ), and split the set E(V i , V j ) into smaller sets by creating sets N i,j,ℓ such that |E(N i,j,ℓ , V j )| ≤ β, and each such N i,j,ℓ is assigned to a predefined fixed set D k of n 1/3 unique nodes. Next, the nodes within D k compute a partition {P 1 , . . . , P n 1/3 } of V such that for every d ∈ [n 1/3 ], |E(V j , P d )| + |E(P d , V i )| ≤ 4β. Finally, every node within D k is assigned one P d and learns all the edges in E(N i,j,ℓ , V j ) ∪ E(V j , P d ) ∪ E(P d , V i ).
The pseudo-code of the algorithm is described in Algorithm 7.
Proof of Theorem 4: We need to show that every (e 1 , e 2 , e 3 ) ∈ T ri is known to at least one node. Let (v 1 , v 2 , v 3 , v 1 ) be a directed triangle, with the edges e 1 = (v 1 , v 2 ), e 2 = (v 2 , v 3 ), e 3 = (v 3 , v 1 ). Observing the partition defined in Step 2, let i, j be the unique integers such that v 1 ∈ V i , v 2 ∈ V j . Notice that due to the loop in Steps 6-8, there also exists an integer ℓ such that v 1 ∈ N i,j,ℓ , where N i,j,ℓ ⊆ V i . Further notice that in Step 13, there is some unique integer k such that the set N i,j,ℓ is assigned to D k . Next, in Step 18, the nodes in D k agree on some partition of V into {P 1 , . . . , P n 1/3 }. Let d be the unique integer such that v 3 ∈ P d for the specific partition {P 1 , . . . , P n 1/3 } agreed upon by D k . Observe the node d k,d : in Step 14, all the nodes in D k , including d k,d , learn all the edges in E(N i,j,ℓ , V j ); in Step 19, d k,d learns all the edges in E(V j , P d ) ∪ E(P d , V i ). Therefore, node d k,d is aware of the edges e 1 , e 2 , e 3 , and so in Step 20 it outputs this triangle.
For the number of rounds, it follows by Lenzen's routing scheme that all the steps of Algorithm 7, with the exception of the LearnEdges and LearnPaths instructions in Steps 14 and 19, can be executed in O(1) rounds in the Congested Clique model. For LearnEdges and LearnPaths, a code inspection gives that no node sends or receives more than O(β) messages, guaranteeing a total round complexity of O(m/n 5/3 + 1).

Discussion
This work significantly improves upon the round complexity of multiplying two matrices in the distributed Congested Clique model, for input matrices which are sparse. As mentioned, we are unaware of a similar algorithmic technique being utilized in the literature of parallel computing, which suggests that our approach may be of interest in a more general setting. The central ensuing open question left for future research is whether the round complexity of sparse matrix multiplication in the Congested Clique can be further improved.
Finally, an intriguing question is the complexity of various problems in the more general k-machine model [21,26], where the size of the computation clique is k ≪ n. The way the data is partitioned among the machines is of importance. One may assume that the input to each node consists of n/k unique consecutive rows of S and T , and that its output should be the corresponding n/k rows of the product P = S · T . Applying our algorithm in this setting gives a round complexity of O(min over n-split pairs (a, b) of {n 2 /k 2 + nz(S) · b/k 2 + nz(T ) · a/k 2 + n 2 /(kab)} + 1) rounds, which is O(n 2/3 · nz(S) 1/3 nz(T ) 1/3 /k 5/3 + 1) rounds with the assignment a = n 2/3 k 1/3 · nz(S) 1/3 /nz(T ) 2/3 and b = n 2/3 k 1/3 · nz(T ) 1/3 /nz(S) 2/3 . To see why, consider each node as simulating the behavior of n/k virtual nodes of the Congested Clique model that belong to the same N i,j set. The round complexity of all steps of the algorithm grows by a multiplicative factor of n 2 /k 2 , apart from the steps in Algorithm 2 which grow only by a multiplicative factor of n/k, since part of the simulated communication consists of messages sent between virtual nodes that are simulated by the same actual node, and as such does not require actual communication. We ask whether this complexity can be improved for k ≪ n.