
String Problems in the Congested Clique Model

Shay Golan, Reichman University, Herzliya, Israel, and University of Haifa, Israel
Matan Kraus, Bar-Ilan University, Ramat Gan, Israel
Abstract

In this paper we present algorithms for several string problems in the Congested Clique model. In the Congested Clique model, n nodes (computers) are used to solve some problem. The input to the problem is distributed among the nodes, and the communication between the nodes is conducted in rounds. In each round, every node is allowed to send an O(log n)-bit message to every other node in the network.

We consider three fundamental string problems in the Congested Clique model. First, we present an O(1)-round algorithm for string sorting that supports strings of arbitrary length. Second, we present an O(1)-round combinatorial pattern matching algorithm. Finally, we present an O(log log n)-round algorithm for the computation of the suffix array and the corresponding Longest Common Prefix array of a given string.

Keywords and phrases:
String Sorting, Pattern Matching, Suffix Array, Congested Clique, Sorting
Funding:
Shay Golan: supported by Israel Science Foundation grant 810/21.
Matan Kraus: supported by the BSF grant 2018364, and by the ERC grant MPM under the EU’s Horizon 2020 Research and Innovation Programme (grant no. 683064).
Copyright and License:
© Shay Golan and Matan Kraus; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Distributed algorithms
Related Version:
Full Version: http://arxiv.org/abs/2504.08376 [19]
Editors:
Paola Bonizzoni and Veli Mäkinen

1 Introduction

In the Congested Clique model [32, 31, 36], n nodes (computers) are used to solve some problem. The input to the problem is spread among the nodes, and the communication among the nodes is done in rounds. In each round, every node is allowed to send one message to every other node in the network. Typically, the size of every message is O(log n) bits, and messages to different nodes can be different. Usually, the input of every node is assumed to be of size O(n) words, and so it can be sent to other nodes in one round, and one can hope for O(1)-round algorithms. In this model, the local computation time is ignored, and the efficiency of an algorithm is measured by the number of communication rounds it performs.

One of the fundamental tasks in the Congested Clique model is sorting of elements. In one of the seminal results for this model, Lenzen [31] presents a sorting algorithm that runs in O(1) rounds. Lenzen's algorithm supports keys of size O(log n) bits. We show how to generalize Lenzen's sorting algorithm to support keys of size O(n^{1−ε}) words (for any constant ε > 0). Using this sorting algorithm, we introduce efficient Congested Clique algorithms for several string problems.

String sorting (Section 3).

The first algorithm is for the string sorting problem [7, 25, 8]. This is a special case of the large objects sorting problem, where the order defined on the objects is the lexicographic order. We introduce an O(1)-round algorithm for this specific order, even if there are strings of length ω(n).

Pattern matching (Section 4).

The second algorithm we present is an O(1)-round algorithm for pattern matching, which uses the string sorting algorithm. In the pattern matching problem, the input is two strings, a pattern P and a text T, and the goal is to find all the occurrences of P in T. Algorithms for this problem have been designed since the 1970s [22, 29, 27, 40, 9]. In the closely related model of Massively Parallel Computing (MPC) (see the discussion below), Hajiaghayi et al. [21] introduce a pattern matching algorithm that is based on FFT convolutions. Their algorithm can be adjusted to an O(1)-round algorithm in the Congested Clique model. However, our algorithm has the advantage of using only combinatorial operations.

Suffix Array construction and the corresponding LCP array (Section 5).

The last algorithm we present constructs the suffix array [33, 37] (𝖲𝖠) of a given string, together with the corresponding longest common prefix (𝖫𝖢𝖯) array [28, 33]. The suffix array of a string S, denoted by 𝖲𝖠_S, is a sorted array of all the suffixes of S. The 𝖫𝖢𝖯_S array stores, for every two lexicographically consecutive suffixes, the length of their longest common prefix. It was proved [33, 28] that the combination of 𝖲𝖠_S with 𝖫𝖢𝖯_S is a powerful tool that can simulate a bottom-up suffix tree traversal, and is useful for solving problems like finding the longest common substring of two strings. Our algorithm takes O(log log n) rounds.

The input model.

Most of the problems considered so far in the Congested Clique model are graph problems. For such problems, where the input is a graph, it is very natural to consider partitioning the input among the n nodes: each node in the network receives all the information on the neighborhood of one vertex of the input graph. However, when the input is a set of objects or strings, as in our problems, it is less clear how the input is spread among the n nodes of the Congested Clique network. We minimize our assumptions on the input, to obtain algorithms that are as general as possible. Inspired by the standard RAM model, we view the input of any problem as one long array (just like the internal memory of a computer). In the Congested Clique model with n nodes, we assume that this array is distributed among the local memories of the nodes (see Figure 1 for the case where the input objects are strings). So, to receive as input a sequence of objects with a total size of O(n^2) words, we consider their representation in a long array, one after the other, and then partition this array into n pieces, each of length O(n). The assumption that the input of every node is of length O(n) is consistent with previous problems considered in the Congested Clique model [3, 31]. For the sake of simplicity, we also assume that when the size of every object is bounded by O(n) words, each object is stored in only one node. This assumption can be guaranteed within an overhead of O(1) rounds.

Figure 1: An example of a sequence of strings; the input is partitioned among nodes v_1, …, v_n.

Relation between Congested Clique and MPC.

The Massively Parallel Computing (MPC) model [5, 2] is a popular model for analyzing the more practical MapReduce model [15] from a theoretical point of view. In this model, a problem with input of size O(N) words is solved by M machines, each with memory of size S words, such that M·S = Θ(N). The MPC model is synchronous, and in every round each machine sends and receives information, such that the data received by each machine fits its memory size. We point out that, as described by Behnezhad et al. [6, Theorem 3.1], every Congested Clique algorithm with a Θ(n^2)-size input that uses O(n) space in every node can be simulated in the MPC model with N = n^2 and S = Θ(M) = Θ(√N). Moreover, it is straightforward that every MPC algorithm that works with S = Θ(M) in r rounds can be simulated in an MPC instance with S = ω(M) (but still M·S = Θ(N)) in r rounds, since every machine can simulate several machines of the original algorithm. As a result, most of the algorithms we introduce in the Congested Clique model imply algorithms with the same round complexity in the MPC model for S = Ω(M). The only exception is the sorting algorithm for the case of ε = 0, which uses ω(n) memory in each machine (see Appendix B). We note that the regime S = Ω(M) is the most common regime for algorithms in the MPC model; see for example [21, 18, 34].

1.1 Related Work

String Sorting.

The string sorting problem was studied in the PRAM model by Hagerup [20], who introduces an optimal O(log n / log log n) time algorithm on a CRCW PRAM. The problem was also studied in the external memory (I/O) model by Arge et al. [4]. Recently, more practically oriented research was done by Bingmann [8].

Pattern Matching.

In parallel models, in the 1980s, Galil [17] and Vishkin [39] introduced pattern matching algorithms in the CRCW and CREW PRAM models. Later, Breslauer and Galil [10] improved the complexity for the CRCW PRAM from O(log n) to O(log log n) rounds, and showed that this round complexity is optimal.

Suffix Array.

In the world of parallel and distributed computing, the problem of 𝖲𝖠 construction was studied in several models, both in theory [26] and in practice [30, 24]. The most related line of work is the result of Kärkkäinen et al. [26] and the improvement for the Bulk Synchronous Parallel (BSP) model by Pace and Tiskin [35]. Kärkkäinen et al. [26] introduce a linear-time algorithm that works in several models of parallel computing and requires O(log^2 n) synchronization steps in the BSP model (for the case of a polynomial number of processors). Their result uses a recursive process with a parameter that is kept at a fixed value in all levels of the recursion. Pace and Tiskin [35] show that one can let the value of the parameter grow with the levels, which they call accelerated sampling, such that the total work does not change asymptotically, but the depth of the recursion, and hence the number of synchronization steps, becomes O(log log n). Our 𝖲𝖠 construction algorithm follows the same idea, but uses different implementation details that fit the Congested Clique model and exploit our large-objects sorting algorithm.

1.2 Our Contribution

Our results are summarized in the following theorems:

Theorem 1 (String Sorting).

There is an algorithm that, given a sequence of strings 𝒮 = (S_1, S_2, …, S_k), computes 𝗋𝖺𝗇𝗄_𝒮(S_j) for every string S_j ∈ 𝒮, and stores this rank in the node that receives S_j[1]. The running time of the algorithm is O(1) rounds of the Congested Clique model.

Theorem 2 (Pattern Matching).

There is an algorithm that, given two strings P and T, computes for every i ∈ [0..|T|−|P|] whether T[i+1..i+|P|] = P, in O(1) rounds of the Congested Clique model.

Theorem 3 (Suffix Array and 𝖫𝖢𝖯).

There is an algorithm that, given a string S, computes 𝖲𝖠_S and 𝖫𝖢𝖯_S in O(log log n) rounds of the Congested Clique model.

As described above, all our results are based on the algorithm for sorting large objects. The problem is defined as follows:

Problem 3 (Large Object Sorting).

Assume that a Congested Clique network of n nodes gets a sequence ℬ = (B_1, B_2, …, B_k) of objects, each of size O(n^{1−ε}) words for some ε ≥ 0, where the total size of ℬ's objects is O(n^2) words. For every object B_j ∈ ℬ, the node that gets B_j needs to learn 𝗋𝖺𝗇𝗄_ℬ(B_j).

The algorithms for Problem 3 are presented in Theorems 4 and 5, which we prove in Appendices A and B, respectively.

Theorem 4.

There is an algorithm that solves Problem 3 for any constant ε > 0 in O(1) rounds of the Congested Clique model.

Theorem 5.

There is an algorithm that solves Problem 3 for ε = 0 in O(log n) rounds. Moreover, any comparison-based Congested Clique algorithm that solves Problem 3 for ε = 0 requires Ω(log n / log log n) rounds.

2 Preliminaries

For i, j ∈ ℤ we denote [i..j] = {i, i+1, i+2, …, j}. For a set S and a scalar α we denote S + α = {a + α ∣ a ∈ S}. For a set 𝒦 of elements with a total order, and an element b ∈ 𝒦, we denote 𝗋𝖺𝗇𝗄_𝒦(b) = |{a ∈ 𝒦 ∣ a < b}| (or simply 𝗋𝖺𝗇𝗄(b) when 𝒦 is clear from the context). We clarify that for a multiset M, we consider the rank of an element b ∈ M to be the number of distinct elements smaller than b in M. For a set of objects 𝒦, we denote by ‖𝒦‖ = Σ_{B ∈ 𝒦} |B| the total size (in words of space) of the objects in 𝒦.

Strings.

A string S = S[1]S[2]⋯S[n] over an alphabet Σ is a sequence of characters from Σ. In this paper we assume Σ = [1..poly(n)], and therefore each character takes O(log n) bits. The length of S is denoted by |S| = n. For 1 ≤ i ≤ j ≤ n, the string S[i..j] = S[i]S[i+1]⋯S[j] is called a substring of S. If i = 1 then S[i..j] is called a prefix of S, and if j = n then S[i..j] is a suffix of S, also denoted by S[i..]. The following lemma from [11] is useful for our pattern matching algorithm in Section 4.

Lemma 6 ([11, Lemma 3.1]).

Let u and v be two strings such that v contains at least three occurrences of u. Let t_1 < t_2 < ⋯ < t_h be the locations of all occurrences of u in v, and assume that t_{i+2} − t_i ≤ |u| for all i ∈ [1..h−2], where h ≥ 3. Then this sequence forms an arithmetic progression with difference d = t_{i+1} − t_i for all i ∈ [1..h−1] (which is equal to the period length of u).

We next define the longest common prefix of two strings; it is used both in the definition of the lexicographic order and in the 𝖫𝖢𝖯 array.

Definition 7.

For two strings S_1, S_2 ∈ Σ*, we denote by 𝖫𝖢𝖯(S_1, S_2) = max({ℓ ∣ S_1[1..ℓ] = S_2[1..ℓ]} ∪ {0}) the length of the longest common prefix of S_1 and S_2.

We provide here the formal definition of lexicographic order between strings.

Definition 8 (Lexicographic order).

For two strings S_1, S_2 ∈ Σ*, we have S_1 ⪯ S_2 if one of the following holds:

  1. ℓ = 𝖫𝖢𝖯(S_1, S_2) < min{|S_1|, |S_2|} and S_1[ℓ+1] < S_2[ℓ+1], or

  2. S_1 is a prefix of S_2, i.e., |S_1| ≤ |S_2| and S_1 = S_2[1..|S_1|].

We denote the case where S_1 ⪯ S_2 and S_1 ≠ S_2 by S_1 ≺ S_2.

Routing.

In the Congested Clique model, a routing problem involves delivering messages from a set of source nodes to a set of destination nodes, where each node may need to send and receive multiple messages. A well-known result by Lenzen [31] shows that if each node is the source and destination of at most O(n) messages, then all messages can be delivered in O(1) rounds. The following lemma is useful for routing in the Congested Clique model.

Lemma 9 ([13, Lemma 9]).

Any routing instance in which every node v is the target of up to O(n) messages, and v locally computes the messages it desires to send from at most O(n log n) bits, can be performed in O(1) rounds.

 Remark.

A particularly useful case in which v locally computes the messages it wishes to send from O(n log n) bits is when v stores O(n) messages, each intended for all nodes in some consecutive range of nodes in the network.

We also provide a routing lemma that considers a case symmetric to that of Lemma 9, which we prove in the full version of the paper [19, Appendix B].

Lemma 10.

Assume that each node of the Congested Clique stores O(n) words, each of size O(log n) bits. Each node has O(n) queries, where each query is a pair consisting of a resolving node and the content of the query, encoded in O(log n) bits. Moreover, each query can be resolved by its resolving node, and the size of the result is O(log n) bits. Then, in O(1) rounds of the Congested Clique, each node can get the results of all its queries.

3 Sorting Strings

Although in Appendix B we show that it is impossible to sort general keys of size Θ(n) in O(1) rounds, in this section we show that under some structural assumptions on the keys and the order, one can do much better. In particular, we show that if the keys are strings and we consider the lexicographic order, we can always sort them in O(1) rounds, even if there are strings of length ω(n).

We introduce a string sorting algorithm that uses the algorithm of Theorem 4 as a black box (specifically, our algorithm uses the algorithm for the special case of ε = 2/3, which is proved in Lemma 25), and applies a technique called renaming to reduce long strings to shorter strings. The reduction preserves the original lexicographic order of the strings. After applying the reduction several times, all the strings are reduced to strings of length O(n^{1/3}), which are sorted by an additional call to the algorithm of Theorem 4.

Renaming.

The idea behind the renaming technique is that to compare two long strings, one can partition the strings into blocks of the same length, and then compare pairs of blocks at corresponding positions in the two strings. The lexicographic order of the two original strings is determined by the lexicographic order of the first pair of blocks that differ. Such a comparison can be done by replacing each block with the block's rank among all blocks, transforming the original strings into new, shorter strings.

As a first step, the algorithm splits each string into blocks of length n^{1/3} characters (for the sake of simplicity, we assume from now on that n^{1/3} is an integer and simply write n^{1/3} instead of ⌈n^{1/3}⌉) as follows. A string S of length |S| = ℓ is partitioned into ⌈ℓ/n^{1/3}⌉ blocks, each of length n^{1/3} characters, except for the last block, which is of length ℓ mod n^{1/3} characters (unless ℓ is a multiple of n^{1/3}, in which case the length of the last block is also n^{1/3} characters).

In the case where every string is stored in one node, such a partitioning can be done locally. In the more general case, each node v broadcasts the number of strings starting in v and the number of characters v stores from v's last string. This broadcast is done in O(1) rounds by Lenzen's routing scheme, and using this information each node v computes the partitioning positions of the strings stored in v. Finally, a block that is spread among two (or more) nodes is transferred so that it is stored only in the node where the block begins. This transfer of block parts is executed in O(1) rounds, since each node is the source and destination of at most n^{1/3} characters.

The next step of the algorithm is to sort all the blocks, using the algorithm of Theorem 4. The result of the sorting is the rank of every block. In particular, if the same block appears more than once, all the block's occurrences get the same rank, and for every two different blocks, the order of their ranks matches their lexicographic order. Thus, the algorithm defines new strings by replacing each block with its rank in the sorting. As a result, every string of length ℓ is reduced to length ⌈ℓ/n^{1/3}⌉. Notice that the alphabet of the new strings is the set of ranks, which is a subset of [1..n^2], and therefore each new character uses O(log n) bits.
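To make the renaming round concrete, here is a minimal sequential sketch in Python (a reference simulation, not the distributed algorithm itself; the function name rename_round and the block-length parameter b, which plays the role of n^{1/3}, are our own):

```python
def rename_round(strings, b):
    # Partition each string into blocks of length b; the last block may be shorter.
    blocked = [[s[i:i + b] for i in range(0, len(s), b)] for s in strings]
    # Rank the distinct blocks; identical blocks receive identical ranks, and the
    # rank order matches the lexicographic order of the blocks (the role of Theorem 4).
    ranks = {blk: r for r, blk in enumerate(sorted({blk for bs in blocked for blk in bs}))}
    # Replace each block by its rank; tuples of ranks compare lexicographically,
    # so the order of the new strings equals the order of the old ones (Lemma 11).
    return [tuple(ranks[blk] for blk in bs) for bs in blocked]

# Example: one renaming round shrinks the strings while preserving their order.
strings = ["banana", "bandana", "ban"]
reduced = rename_round(strings, 2)
assert sorted(range(3), key=lambda i: strings[i]) == sorted(range(3), key=lambda i: reduced[i])
```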

In the following lemma we prove that the new strings preserve the lexicographic order of the original strings.

Lemma 11.

Let A and B be two strings, and let A′ and B′ be the resulting strings defined by replacing each block of A and B with the block's rank among all blocks, respectively. Then A ⪯ B if and only if A′ ⪯ B′.

Proof.

We first prove that A ⪯ B ⟹ A′ ⪯ B′. Recall that by Definition 8, there are two options for A ⪯ B to hold:

  1. ℓ = 𝖫𝖢𝖯(A, B) < min{|A|, |B|} and A[ℓ+1] < B[ℓ+1], or

  2. A is a prefix of B, i.e., |A| ≤ |B| and A = B[1..|A|].

For any j, let A_j and B_j be the j-th blocks of A and B, respectively.

For the first case, let α = ⌈(ℓ+1)/n^{1/3}⌉. Notice that A[ℓ+1] and B[ℓ+1] are contained in A_α and B_α, respectively. Thus, for all j < α we have A_j = B_j, and for j = α we have A_α ≠ B_α. Moreover, by definition A_α ≺ B_α. Hence, A′[1..α−1] = B′[1..α−1] and A′[α] < B′[α], which means that A′ ≺ B′.

For the second case, where A is a prefix of B, there are three subcases: (1) If A = B, then all the blocks of A get exactly the same ranks as the corresponding blocks of B, and therefore A′ = B′. (2) Otherwise, |A| < |B|. (2a) If |A| is a multiple of n^{1/3}, then all the blocks of A are exactly the same as the corresponding blocks of B, but B has at least one additional block. Therefore, A′ is a prefix of B′, and A′ ≺ B′. (2b) If |A| is not a multiple of n^{1/3}, then the last block of A is shorter than the corresponding block of B. Let α = ⌈|A|/n^{1/3}⌉ be the index of the last block of A; in particular, the block A_α is a proper prefix of the block B_α, and therefore by definition the rank of A_α is smaller than the rank of B_α. Thus, we have A′[1..α−1] = B′[1..α−1] and A′[α] < B′[α], which means that A′ ≺ B′.

By very similar arguments, one can prove that B ≺ A ⟹ B′ ≺ A′. Thus, the claim A ⪯ B ⟺ A′ ⪯ B′ follows.

Algorithm 1 String sorting.

We are now ready to prove Theorem 1 (see also Algorithm 1).

Proof of Theorem 1.

The algorithm repeats the renaming process seven times. Since the longest original string is of length O(n^2), after seven iterations the length of every string is O(n^2/(n^{1/3})^7) = O(n^{−1/3}) ≤ 1. Thus, at this point, each string is one block of length 1 = O(n^{1/3}) characters; in particular, each string is stored in one node. The algorithm uses the algorithm of Theorem 4 one more time (one could also use the sorting algorithm of [31, Corollary 4.6]) to solve the problem. Notice that during the execution of the renaming process, each block is moved completely to the first node that holds characters from this block. In particular, at the final sorting, each string consists of one block that starts at the original node where the complete string starts. Therefore, the ranks found by the sorting are stored in the required nodes.

4 Pattern Matching

In this section we prove Theorem 2 by introducing an O(1)-round Congested Clique algorithm for the pattern matching problem. Recall that the input for this problem is two strings, a pattern P and a text T. Moreover, we assume that |P| + |T| = O(n^2), since for larger inputs one cannot hope for an O(1)-round algorithm, due to communication bottlenecks. The goal is to find all the offsets i such that P occurs in T at offset i. Formally, for every two strings X, Y, let 𝖯𝖬(X,Y) = {i ∣ Y[i+1..i+|X|] = X} be the set of occurrences of X in Y. Our goal is to compute 𝖯𝖬(P,T) in a distributed manner. We introduce an algorithm that solves this problem in O(1) rounds. Our algorithm distinguishes between the cases |P| ≤ n and |P| > n. Therefore, as a first step, each node v broadcasts the number of characters from P stored in v, and then computes |P|. In Section 4.1 we take care of the simple case where |P| ≤ n, and in Section 4.2 we describe an algorithm for the case n < |P| = O(n^2).

4.1 Short Pattern

If |P| ≤ n, then the algorithm first broadcasts P to all the nodes using Lemma 9. In addition, we want every substring of T of length |P| to be stored completely in (at least) one node. To this end, each node announces the number of characters from T it holds. Then, each node whose last input character is the i-th character of T needs to get the values of T[i+1], T[i+2], …, T[i+|P|−1] from the following nodes. So, each node that gets T[j] sends it to all preceding nodes, starting from the node that gets T[j−|P|+1] (or the node that gets T[1] if j−|P|+1 < 1). Notice that all these nodes form a consecutive range, and that each node gets at most |P| ≤ n messages. Thus, this routing can be done in O(1) rounds using Lemma 9. Now, every substring of T of length |P| is stored completely in one of the nodes, and all the nodes have P. Thus, every node locally finds all the occurrences of P in its portion of T. This proves Theorem 2 for the special case where |P| ≤ n (see also Algorithm 2).
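The following sequential Python sketch mirrors this phase (the function name and the representation of the text as a list of per-node fragments are our own; the distributed algorithm routes the overlapping characters with Lemma 9 instead of slicing a shared string):

```python
def short_pattern_occurrences(fragments, P):
    """Each node holds one fragment of T; conceptually it also receives the
    next |P|-1 characters from the succeeding nodes, so every length-|P|
    window of T lies entirely inside some node."""
    text = "".join(fragments)
    occurrences, start = [], 0
    for frag in fragments:
        # The fragment extended by the |P|-1 characters owned by later nodes.
        extended = text[start:start + len(frag) + len(P) - 1]
        i = extended.find(P)
        while i != -1:
            if i < len(frag):                 # windows starting here belong to this node
                occurrences.append(start + i) # 0-based offset in T
            i = extended.find(P, i + 1)
        start += len(frag)
    return occurrences
```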

Algorithm 2 Pattern matching with short pattern.
Algorithm 3 Pattern matching with |P|>n.

4.2 Long Pattern

From now on we consider the case where |P| > n. To conclude that an offset i of the text is an occurrence of P, i.e., that T[i+1..i+|P|] = P, we have to get, for every 1 ≤ j ≤ |P|, evidence that T[i+j] = P[j]. We use two types of such evidence. The first type comes from finding all the occurrences of the n-length prefix and suffix of P in P and in T. The second type of evidence is based on the string sorting algorithm of Theorem 1, which is used to sort all the blocks between the occurrences found in the first step, both in P and in T. At every occurrence of P in T, all the occurrences of the prefix and suffix match, and the blocks between these occurrences are also the same in the pattern and the text (see also Algorithm 3).

First step - searching for the prefix and suffix.

Let B = P[1..n] and E = P[|P|−n+1..|P|] be the prefix and the suffix of P of length n, respectively. We use Algorithm 2 four times, to find all occurrences of B and E in P and in T (in every execution, the algorithm ignores the parts of the original input that are not relevant for this execution). By the following lemma, all the locations of B and E found by one node can be stored in O(1) words of space (which are O(log n) bits). The lemma is derived from Lemma 6 using the so-called standard trick.

Lemma 12.

Let X and Y be two strings such that |Y| ≥ |X| and |Y| = O(|X|). Then 𝖯𝖬(X,Y) can be stored in O(1) words of space.

Proof.

We divide Y into parts of length 2|X|−1 characters, with overlaps, such that each substring of length |X| is contained in one part. For any i = 0, 1, …, ⌈|Y|/|X|⌉−1, let Y_i = Y[i·|X|+1..min{(i+2)·|X|−1, |Y|}] be the i-th part of Y, and let L_i = 𝖯𝖬(X, Y_i) + i·|X| be the set of all the occurrences of X in Y_i (as offsets in Y). Notice that every occurrence t of X in Y is an occurrence of X in one part of Y, specifically in Y_{⌊t/|X|⌋}. Thus, to represent 𝖯𝖬(X,Y) it is enough to store all the occurrences of X in every part Y_i. Since we consider just ⌈|Y|/|X|⌉ = O(1) parts of Y, it is enough to show that each L_i can be stored in O(1) words of space. If |L_i| ≤ 2 then of course it can be stored in O(1) words of space. Otherwise, |L_i| ≥ 3, and notice that for every two occurrences in L_i their distance is at most |Y_i| − |X| + 1 = |X|. Thus, by Lemma 6, L_i forms an arithmetic progression, and therefore can be represented in O(1) words of space by storing only the first and last elements and the common difference.
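A small Python sketch of this encoding (the function names are ours): a part's occurrence list is stored either explicitly or as the arithmetic progression guaranteed by Lemma 6.

```python
def compress_occurrences(L):
    """Encode the sorted occurrence list L of one part Y_i in O(1) words."""
    if len(L) <= 2:
        return ("explicit", tuple(L))
    d = L[1] - L[0]
    # Lemma 6 guarantees that three or more nearby occurrences are equally spaced.
    assert all(L[j + 1] - L[j] == d for j in range(len(L) - 1))
    return ("arithmetic", (L[0], d, len(L)))

def decompress_occurrences(rep):
    kind, data = rep
    if kind == "explicit":
        return list(data)
    first, d, count = data
    return [first + j * d for j in range(count)]
```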

Hence, every node broadcasts to the whole network all the locations of B and E in the node's fragments of P and T. Combining all the broadcast information, each node obtains 𝖯𝖬(B,P), 𝖯𝖬(B,T), 𝖯𝖬(E,P) and 𝖯𝖬(E,T).

Second step - completing the gaps.

In the second step, we want to find evidence of equality for all the locations in P and T that do not belong to any occurrence of B or E. Notice that |B| = |E| = n, and therefore every occurrence of B or E that starts at offset i covers the entire range [i+1..i+n]. We use all the occurrences of B and E found in the first step to focus on the remaining regions, which are not covered yet. Formally, we define the sets of uncovered locations in P and T as follows. For a string X, let

R_X = [1..|X|] ∖ ⋃_{i ∈ 𝖯𝖬(B,X) ∪ 𝖯𝖬(E,X)} [i+1..i+n].

The algorithm uses R_P and R_T to define a (multi)set of strings that contains all the maximal regions of P and T that were not covered in the first step (see Figure 2):

𝒮_P = {P[i..j] ∣ [i..j] ⊆ R_P and i−1 ∉ R_P and j+1 ∉ R_P} (1)
𝒮_T = {T[i..j] ∣ [i..j] ⊆ R_T and i−1 ∉ R_T and j+1 ∉ R_T} (2)
𝒮 = 𝒮_P ∪ 𝒮_T (3)
Figure 2: The yellow and red regions represent occurrences of B and E, respectively. The blue regions of the pattern and the text represent R_P and R_T, respectively. Notice that 𝒮_P = {S_1, S_2, S_3} and 𝒮_T = {S_4, S_5, S_6, S_7}. Given that the pattern is aligned to the i-th offset, i ∈ 𝖯𝖬(P,T) if and only if S_1 = S_5, S_2 = S_6 and S_3 = S_7.
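The computation of the uncovered set R_X and of its maximal regions can be sketched sequentially as follows (our own illustration; the parameter m plays the role of n, the length of B and E, and positions are 1-based as in the text):

```python
def uncovered_regions(length, occ_B, occ_E, m):
    """Return the maximal ranges [a..b] of R_X for a string X of the given
    length, given the occurrence offsets of B and E in X."""
    covered = [False] * (length + 2)            # 1-based; index 0 unused
    for i in set(occ_B) | set(occ_E):           # an occurrence at offset i covers [i+1..i+m]
        for j in range(i + 1, min(i + m, length) + 1):
            covered[j] = True
    regions, start = [], None
    for j in range(1, length + 2):              # sentinel position length+1 closes a region
        if j <= length and not covered[j]:
            if start is None:
                start = j
        elif start is not None:
            regions.append((start, j - 1))      # a maximal range [start..j-1] of R_X
            start = None
    return regions
```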

Notice that the total size of all the strings in 𝒮 is O(n^2), since |P| + |T| = O(n^2). The algorithm uses the string sorting algorithm of Theorem 1 to sort all the strings in 𝒮. As a result, every string is stored with its rank, which is the same for two identical strings. A useful property of 𝒮 is that it contains O(n) strings, as we prove in the next lemma.

Lemma 13.

|𝒮|=O(n).

Proof.

We start by bounding |𝒮_T|. By the definition of 𝒮_T, for T[i..j] ∈ 𝒮_T we have i−1 ∉ R_T. Moreover, by the definition of R_T, it must be the case that [i−n..i−1] ∩ R_T = ∅. Thus, one can associate with every string in 𝒮_T (except for the first string in 𝒮_T, if it is a prefix of T) a set of n unique locations from [1..|T|] that are not in R_T. Therefore, n·(|𝒮_T|−1) ≤ |[1..|T|]| = O(n^2), and so |𝒮_T| = O(n). Applying the same argument to 𝒮_P gives |𝒮_P| = O(n), and |𝒮| = O(n) + O(n) = O(n).

After sorting the strings of 𝒮, each node v broadcasts the ranks of the strings starting at v. Since there are just O(n) strings in 𝒮, this can be done in O(1) rounds using Lemma 9. Therefore, at the end of the second phase, every node has the ranks of all the strings in 𝒮. Recall that at the end of the first phase each node stores 𝖯𝖬(B,P), 𝖯𝖬(B,T), 𝖯𝖬(E,P) and 𝖯𝖬(E,T). Hence, for every offset i, to check whether T[i+1..i+|P|] = P, each node first verifies that all the occurrences of B and E in P match the corresponding occurrences of B and E in T[i+1..i+|P|]. If this is the case, then all the maximal regions of R_P must be at positions corresponding to the maximal regions of R_T within [i+1..i+|P|]. Thus, the node compares the rank of every string in 𝒮_P with the rank of the corresponding (shifted by i) string of 𝒮_T and checks whether they are equal (see also Figure 2 and Algorithm 3).
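Concretely, the test performed for a single offset i can be sketched as follows (a sequential illustration under our own naming: m is the length of B and E, and rank_P, rank_T map each maximal region [a..b] to the rank of its string in the sorted multiset 𝒮); the conditions checked are exactly those of Lemma 14 below:

```python
def is_occurrence(i, p_len, m, occB_P, occB_T, occE_P, occE_T, rank_P, rank_T):
    valid = set(range(0, p_len - m + 1))                  # the range [0..|P|-m]
    if set(occB_P) != {t - i for t in occB_T} & valid:    # condition 1 of Lemma 14
        return False
    if set(occE_P) != {t - i for t in occE_T} & valid:    # condition 2 of Lemma 14
        return False
    # Condition 3, tested via ranks (justified by Lemma 15): every maximal
    # uncovered region of P must carry the same rank as its shifted counterpart.
    return all(rank_T.get((i + a, i + b)) == r for (a, b), r in rank_P.items())
```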

The following two lemmas give the mathematical justification for the last part of the algorithm. Lemma 14 states that the test made by the algorithm suffices to decide whether i ∈ 𝖯𝖬(P,T). Lemma 15 shows that for any i where the first two conditions of Lemma 14 hold, the test of the third condition can be performed by comparing the ranks of strings from 𝒮, exactly as the algorithm does.

Lemma 14.

T[i+1..i+|P|] = P if and only if all of the following hold:

  1. 𝖯𝖬(B,P) = (𝖯𝖬(B,T) − i) ∩ [0..|P|−n]

  2. 𝖯𝖬(E,P) = (𝖯𝖬(E,T) − i) ∩ [0..|P|−n]

  3. For every maximal range [a..b] ⊆ R_P, we have P[a..b] = T[i+a..i+b].

Proof.

The first direction is simple. Assume T[i+1..i+|P|] = P. Then for every 0 ≤ j ≤ |P|−n, we have j ∈ 𝖯𝖬(B,P) if and only if B = P[j+1..j+n] = T[i+j+1..i+j+n], if and only if i+j ∈ 𝖯𝖬(B,T), i.e., j ∈ 𝖯𝖬(B,T) − i. A similar argument works for the second property. Lastly, P[a..b] = T[i+a..i+b] is a direct consequence of the fact that T[i+1..i+|P|] = P.

For the other direction, let j ∈ [1..|P|]; our goal is to prove that T[i+j] = P[j]. If [j−n..j−1] ∩ 𝖯𝖬(B,P) ≠ ∅, then there exists some t ∈ [j−n..j−1] such that t ∈ 𝖯𝖬(B,P). By the first property we have t+i ∈ 𝖯𝖬(B,T). Thus, P[t+1..t+n] = T[i+t+1..i+t+n], and in particular P[j] = P[t+(j−t)] = T[t+i+(j−t)] = T[i+j]. The case where [j−n..j−1] ∩ 𝖯𝖬(E,P) ≠ ∅ is symmetric. The last case we have to consider is when [j−n..j−1] ∩ 𝖯𝖬(B,P) = ∅ and [j−n..j−1] ∩ 𝖯𝖬(E,P) = ∅. In this case, let P[a..b] be the maximal region of R_P that contains j (such a region must exist). By the third property we have P[a..b] = T[i+a..i+b], and in particular P[j] = P[a+(j−a)] = T[i+a+(j−a)] = T[i+j], as required.

Lemma 15.

Let i ∈ [0..|T|−|P|] be such that 𝖯𝖬(B,P) = (𝖯𝖬(B,T) − i) ∩ [0..|P|−n] and 𝖯𝖬(E,P) = (𝖯𝖬(E,T) − i) ∩ [0..|P|−n]. Then, for every maximal range [a..b] ⊆ R_P, we have P[a..b] ∈ 𝒮_P and T[i+a..i+b] ∈ 𝒮_T.

Proof.

First, by the definition of 𝒮_P we have P[a..b] ∈ 𝒮_P. Similarly, to prove that T[i+a..i+b] ∈ 𝒮_T it is sufficient to prove that [i+a..i+b] is a maximal range in R_T, i.e., that (1) [i+a..i+b] ⊆ R_T, (2) i+a−1 ∉ R_T, and (3) i+b+1 ∉ R_T.

For (1), let j ∈ [i+a..i+b], and assume by way of contradiction that j ∉ R_T. Then, by definition, there exists some t ∈ [j−n..j−1] such that t ∈ 𝖯𝖬(B,T) ∪ 𝖯𝖬(E,T). But then, it must be the case that for t′ = t−i we have t′ ∈ 𝖯𝖬(B,P) ∪ 𝖯𝖬(E,P). Hence, j−i = t′+(j−t) ∉ R_P. But j−i ∈ [a..b], and therefore [a..b] ⊈ R_P, a contradiction. Therefore, [i+a..i+b] ⊆ R_T.

For (2), since [a..b] is a maximal range in R_P, we have a−1 ∉ R_P. Moreover, since 0 ∈ 𝖯𝖬(B,P) it must be that a > n, and therefore by the definition of R_P we have a−1−n ∈ 𝖯𝖬(B,P) ∪ 𝖯𝖬(E,P). Hence, a−1−n+i ∈ 𝖯𝖬(B,T) ∪ 𝖯𝖬(E,T), and therefore (a−1−n+i)+n = i+a−1 ∉ R_T.

Similarly, for (3), since [a..b] is a maximal range in R_P, we have b+1 ∉ R_P. Moreover, since |P|−n ∈ 𝖯𝖬(E,P) it must be that b ≤ |P|−n, and therefore by the definition of R_P we have b ∈ 𝖯𝖬(B,P) ∪ 𝖯𝖬(E,P). Hence, i+b ∈ 𝖯𝖬(B,T) ∪ 𝖯𝖬(E,T), and therefore i+b+1 ∉ R_T. Thus, we proved that [i+a..i+b] is a maximal range in R_T, and therefore T[i+a..i+b] ∈ 𝒮_T.

5 Suffix Array Construction and the Corresponding LCP Array

In this section, we prove Theorem 3 by introducing an algorithm that computes 𝖲𝖠_S, the suffix array of a given string S of length O(n^2), in O(log log n) rounds of the Congested Clique model. Moreover, we show how to compute the complementary 𝖫𝖢𝖯_S array within the same asymptotic running time. We first give the formal definitions of 𝖲𝖠_S and 𝖫𝖢𝖯_S:

Definition 16.

For a string S, the suffix array, denoted by 𝖲𝖠_S, is the sorted array of the suffixes of S, i.e., S[𝖲𝖠_S[i]..] ≺ S[𝖲𝖠_S[i+1]..] for all 1 ≤ i < |S|. The corresponding 𝖫𝖢𝖯 array, 𝖫𝖢𝖯_S, stores the 𝖫𝖢𝖯 of every two suffixes that are consecutive in the lexicographic order. Formally, 𝖫𝖢𝖯_S[i] = 𝖫𝖢𝖯(S[𝖲𝖠_S[i]..], S[𝖲𝖠_S[i+1]..]).

Our algorithm follows the recursive process described by Pace and Tiskin [35], which is a speedup of the algorithm of Kärkkäinen et al. [26] for parallel models. The main idea of the recursion is that at every level, the algorithm creates a smaller string that represents a subset of the original string positions, solves the problem recursively, and uses the results for the subset to compute the complete result. As the depth of the recursion increases, the ratio between the length of the current string and the length of the new string increases as well: at the beginning the ratio is constant, and within O(log log n) levels it becomes polynomial in n. The main difference between our algorithm and [35, 26] is that our algorithm is simpler, due to the powerful sorting algorithm provided by Theorem 4. Moreover, we also show how to compute 𝖫𝖢𝖯_S in addition to 𝖲𝖠_S. We first ignore the 𝖫𝖢𝖯_S computation, and then in Section 5.1 we give the details needed for computing 𝖫𝖢𝖯_S.

Our algorithm uses the notion of a difference cover [14], and of a difference cover sample as defined by Kärkkäinen et al. [26]. For a parameter t, a difference cover DC_t ⊆ [0..t−1] is a set of integers such that for any pair i, j, the set DC_t contains i′, j′ ∈ DC_t with j′ − i′ ≡ j − i (mod t). For every t there exists a difference cover DC_t of size Θ(√t), which can be computed in O(√t) time (see Colbourn and Ling [14]). For a string S, [26] defined the difference cover sample DC_t(S) = {i ∣ i ∈ [1..|S|] and (i mod t) ∈ DC_t} as the set of all indices of S which are in DC_t modulo t. The following lemma was proved in [38].

Lemma 17 ([38, Lemma 2]).

For a string S and an integer t ≤ |S|, there exists a difference cover DC_t such that |DC_t(S)| = O(|S|/√t), and for any pair of positions i, j ∈ [1..|S|] there is an integer k ∈ [0..t−1] such that (i+k) and (j+k) are both in DC_t(S).
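For intuition, here is a brute-force Python sketch of these definitions (our own illustration; Colbourn and Ling [14] construct a Θ(√t)-size cover in O(√t) time, which we do not reproduce, and the search below is exponential and only meant for tiny t):

```python
from itertools import combinations

def is_difference_cover(D, t):
    # Every residue modulo t must be a difference of two elements of D.
    return {(i - j) % t for i in D for j in D} == set(range(t))

def smallest_difference_cover(t):
    # Exponential brute force; only meant for tiny t.
    for size in range(1, t + 1):
        for D in combinations(range(t), size):
            if is_difference_cover(D, t):
                return set(D)

def dc_sample(length, D, t):
    # DC_t(S): the positions of S (1-based) whose residue modulo t lies in D.
    return [i for i in range(1, length + 1) if i % t in D]

# Example: {1, 2, 4} is a difference cover modulo 5.
assert is_difference_cover({1, 2, 4}, 5)
```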

At every level of the recursion, let ε > 0 be a number such that the length of the string S satisfies |S| ≤ O(n^{2−ε}). Later we describe how to choose ε exactly (see the time complexity analysis). At the first level, ε = O(1/log n) satisfies this requirement. Let DC_t ⊆ [0..t−1] be some fixed difference cover with t = min{n^ε, n^{1/3}}, of size |DC_t| = O(√t).

For every i ∈ [1..|S|], let S_i = S[i..i+t−1] be the substring of S of length t starting at position i (we assume that S[j] is some dummy character for every j > |S|). As a first step, the algorithm sorts all the strings in 𝒮 = {S_i ∣ i ∈ DC_t(S)}. Notice that the total length of all these strings is |DC_t(S)|·t = O(|S|/√t)·t = O(|S|·√t) ≤ O(n^{2−ε}·n^{ε/2}) = O(n^{2−ε/2}) = O(n^2), and therefore the algorithm can sort all the strings in O(1) rounds using Theorem 4. Recall that as a result of running the sorting algorithm, the node that stores index i of S now has the rank of S_i, 𝗋𝖺𝗇𝗄_𝒮(S_i), among all the strings of 𝒮 (copies of the same string have the same rank). The algorithm uses the ranks of the strings to create a new string of length |DC_t(S)|. For every i ∈ DC_t, let S′(i) = 𝗋𝖺𝗇𝗄_𝒮(S_i) 𝗋𝖺𝗇𝗄_𝒮(S_{i+t}) 𝗋𝖺𝗇𝗄_𝒮(S_{i+2t}) 𝗋𝖺𝗇𝗄_𝒮(S_{i+3t}) ⋯ (if 0 ∈ DC_t then S′(0) starts from 𝗋𝖺𝗇𝗄_𝒮(S_t), since S_0 does not exist). Moreover, let S′ be the concatenation of all the strings S′(i) for i ∈ DC_t (in some arbitrary order), with a special character $ as a delimiter between the strings S′(i). The algorithm runs recursively on S′. The result of the recursive call is the suffix array of S′, 𝖲𝖠_{S′} (which stores the order of the suffixes of S′). Note that every index i ∈ DC_t(S) has a corresponding index in S′, namely the position where 𝗋𝖺𝗇𝗄_𝒮(S_i) appears; we denote this position by f(i). For every index i, the algorithm sends the rank of f(i) (which is the index of 𝖲𝖠_{S′} where f(i) appears) to the node that stores index i of S.
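A sequential sketch of one level of this reduction (a simulation under our own assumptions: 1-based positions, a SEP value playing the role of the delimiter $, and a padding sentinel '\0' assumed smaller than every real character):

```python
SEP = -1  # plays the role of the '$' delimiter between the strings S'(i)

def reduce_level(S, D, t):
    """Rank the length-t substrings starting at sampled positions, then
    concatenate the rank sequences S'(i), i in D. Returns S' and the map f
    from sampled positions of S to their (1-based) positions in S'."""
    n = len(S)
    sample = [i for i in range(1, n + 1) if i % t in D]
    # Substring S_i = S[i..i+t-1], padded past |S| with the sentinel.
    sub = {i: tuple(S[i - 1:i - 1 + t]) + ('\0',) * max(0, i - 1 + t - n)
           for i in sample}
    rank = {s: r for r, s in enumerate(sorted(set(sub.values())))}
    S_prime, f = [], {}
    for res in sorted(D):
        start = res if res != 0 else t      # S'(0) starts from rank(S_t)
        for i in range(start, n + 1, t):
            f[i] = len(S_prime) + 1         # position of rank(S_i) in S'
            S_prime.append(rank[sub[i]])
        S_prime.append(SEP)
    return S_prime, f
```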

Due to the following claim, which we prove formally later as part of Lemma 20, the order of the suffixes of S starting in DC_t(S) is the same as the order of the corresponding suffixes of S′.

Claim 18.

For a, b ∈ DC_t(S), we have S[a..] ≺ S[b..] if and only if S′[f(a)..] ≺ S′[f(b)..].

Due to Claim 18, 𝖲𝖠_{S′} represents the order of the subset of suffixes of S starting in DC_t(S). To extend the result to the complete order of all the suffixes of S (hence computing 𝖲𝖠_S), the algorithm creates, for every index of S, a representative object of size O(t) words of space. These objects have the property that by comparing two objects one can determine the order of the corresponding suffixes.

The representative objects.

For every index i, the object representing the suffix S[i..] is composed of two parts. The first part is S_i, the substring of S of length t starting at position i. The second part consists of the ranks (in the lexicographic order) of all the suffixes starting at positions of DC_t(S) ∩ [i..i+t−1], among all the suffixes starting in DC_t(S), obtained using 𝖲𝖠_{S′}^{−1}. This information is stored as an array A_i of length t, as follows. For every j ∈ [0..t−1], if i+j ∈ DC_t(S) we set A_i[j] = 𝖲𝖠_{S′}^{−1}[f(i+j)], which is the rank of position i+j (as computed by the recursive call), and A_i[j] = −1 otherwise. The first part is used to determine the order of two suffixes whose 𝖫𝖢𝖯 is less than t, and the second part is used to determine the order of two suffixes whose 𝖫𝖢𝖯 is at least t.

The comparison of the objects representing positions a and b is done as follows. The algorithm first compares S_a and S_b. If S_a ≠ S_b, then the order of S[a..] and S[b..] is determined by the order of S_a and S_b. Otherwise, S_a = S_b, and the algorithm uses the second part of the objects. By Lemma 17, there exists some k ∈ [0..t−1] such that a+k and b+k are both in DC_t(S); in particular, A_a[k] and A_b[k] both hold actual ranks of suffixes of S (and not −1s). The algorithm uses A_a[k] and A_b[k] to determine the order of S[a..] and S[b..]. By the following lemma, the order of A_a[k] and A_b[k] is exactly the order of the corresponding suffixes starting at positions a and b.
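The comparison just described can be sketched as follows (our own naming; −1 marks unsampled positions, as in the definition of A_i):

```python
def compare_objects(Sa, Aa, Sb, Ab, t):
    # First part: if the length-t substrings differ, they decide the order.
    if Sa != Sb:
        return -1 if Sa < Sb else 1
    # Second part: the LCP of the two suffixes is at least t, so defer to the
    # first position sampled in both objects (it exists by Lemma 17).
    for k in range(t):
        if Aa[k] != -1 and Ab[k] != -1:
            return Aa[k] - Ab[k]
    raise AssertionError("Lemma 17 guarantees a common sampled position")
```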

Lemma 19.

Let a, b ∈ [1..|S|] be such that S[a..] ≺ S[b..] and S_a = S_b. Then, for every k, if A_a[k] ≠ −1 and A_b[k] ≠ −1, then A_a[k] < A_b[k]. Moreover, there exists some k ∈ [0..t−1] such that A_a[k] ≠ −1 and A_b[k] ≠ −1.

Proof.

Let ℓ = 𝖫𝖢𝖯(S[a..], S[b..]). By definition, S[a..a+ℓ−1] = S[b..b+ℓ−1] and S[a+ℓ] < S[b+ℓ]. Notice that ℓ ≥ t. Thus, for every 0 ≤ k ≤ t−1, we have S[a+k..a+k+(ℓ−k−1)] = S[b+k..b+k+(ℓ−k−1)] and S[a+k+(ℓ−k)] < S[b+k+(ℓ−k)]. Therefore, S[a+k..] ≺ S[b+k..]. For k ∈ [0..t−1] such that A_a[k] ≠ −1 and A_b[k] ≠ −1, it must be the case that a+k, b+k ∈ DC_t(S). Thus, by Claim 18 we have S′[f(a+k)..] ≺ S′[f(b+k)..], and therefore A_a[k] < A_b[k].

By Lemma 17, there exists some k ∈ [0..t−1] such that a+k and b+k are both in DC_t(S). For this value of k, it is guaranteed that A_a[k] ≠ −1 and A_b[k] ≠ −1.

For every index i ∈ [1..|S|], the algorithm creates an object of size O(t) words of space. Thus, the total size of all the objects is O(t·|S|) ≤ O(n^ε·n^{2−ε}) = O(n^2). Moreover, by definition t ≤ n^{1/3}. Therefore, the algorithm sorts all the objects in O(1) rounds using the algorithm of Theorem 4. By Lemma 19, the result of the object sorting algorithm is a sorting of all the suffixes of S.

Time complexity.

Recall that for |S| = O(n^{2−ε}) we defined t = min{n^ε, n^{1/3}}, and that the length of S′ is |S′| = |DC_t(S)| = O(|S|/√t). Let c be a constant such that |S′| ≤ c·|S|/√t (for any n > n_0, for some n_0). Denote by S_k, S′_k, ε_k and t_k the values of S, S′, ε and t at the k-th level of the recursion, respectively. Our goal is to guarantee an exponential growth of the value of ε from level to level. For the first level, we have |S′_1| ≤ c·|S_1|/√t_1 = c·|S_1|/n^{0.5ε_1} = c·n^{−0.1ε_1}·|S_1|·n^{−0.4ε_1} ≤ c·n^{−0.1ε_1}·O(n^{2−1.4ε_1}). We set ε_1 = 10 log c / log n, so that c·n^{−0.1ε_1} = 1, and thus |S′_1| ≤ O(n^{2−1.4ε_1}). Then, as long as ε_i < 1/3, we define ε_{i+1} = 1.4·ε_i, and a similar analysis gives |S′_i| ≤ O(n^{2−1.4ε_i}) = O(n^{2−ε_{i+1}}). By a straightforward induction we get |S′_i| ≤ O(n^{2−1.4^i·ε_1}). From the moment that ε_i ≥ 1/3 (i.e., |S_i| = O(n^{5/3})), we have t_i = n^{1/3}. Thus, |S′_i| = O(|S_i|/√t_i) = O(|S_i|/n^{1/6}), so |S_{i+1}| = O(n^{3/2}), and within four more levels we reach |S_i| = O(n). At this point the algorithm solves the problem in O(1) rounds in one node. Therefore, the algorithm performs O(log log n) + 4 = O(log log n) levels of recursion. Each level of recursion takes O(1) rounds, and therefore the total running time of the algorithm is O(log log n).
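Spelled out, the bound on the recursion depth is the following short calculation (with ε_1 = 10 log c / log n as above):

ε_i = 1.4^{i−1}·ε_1 ≥ 1/3  ⟺  i ≥ 1 + log_{1.4}(1/(3ε_1)) = 1 + log_{1.4}(log n / (30 log c)) = O(log log n),

so after O(log log n) levels the exponent ε_i reaches 1/3, and only four more levels are needed afterwards.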

 Remark.

As described in Section 1, our algorithm can be translated into an O(log log n)-round algorithm in the MPC model. However, if one considers a variant of the MPC model where the product of the number of machines and the memory size is polynomially larger than n, i.e., if M·S = n^{1+α} for some constant α > 0, then there is a faster algorithm. In particular, one can use t = n^α in all levels and get an O(1/α) = O(1)-round algorithm.

5.1 Computing the Corresponding LCP Array

The computation of 𝖫𝖢𝖯_S is done during the computation of 𝖲𝖠_S, by several additional operations at some steps of the computation. The recursive process has exactly the same structure, but now every level of the recursion can use both 𝖲𝖠_{S′} and 𝖫𝖢𝖯_{S′}, and has to compute 𝖫𝖢𝖯_S in addition to 𝖲𝖠_S.

Recall that in the 𝖲𝖠 construction algorithm of the previous section, the order of two suffixes of S that start in DC_t(S) is exactly the same as the order of the corresponding suffixes of S′, which is represented in 𝖲𝖠_{S′}. The situation for 𝖫𝖢𝖯_S is slightly more complicated. In the following lemma we prove that for two positions a, b ∈ DC_t(S), we have 𝖫𝖢𝖯(S′[f(a)..], S′[f(b)..]) = ⌊𝖫𝖢𝖯(S[a..], S[b..])/t⌋, which means that 𝖫𝖢𝖯(S[a..], S[b..]) ∈ t·𝖫𝖢𝖯(S′[f(a)..], S′[f(b)..]) + [0..t−1].

Lemma 20.

For a, b ∈ DC_t(S), we have 𝖫𝖢𝖯(S′[f(a)..], S′[f(b)..]) = ⌊𝖫𝖢𝖯(S[a..], S[b..])/t⌋, and if S[a..] ≺ S[b..] then S′[f(a)..] ≺ S′[f(b)..].

Proof.

Let ℓ = 𝖫𝖢𝖯(S[a..], S[b..]). By definition, S[a..a+ℓ−1] = S[b..b+ℓ−1] and S[a+ℓ] ≠ S[b+ℓ]. Our goal is to prove that for any 0 ≤ i < ⌊ℓ/t⌋ we have S′[f(a)+i] = S′[f(b)+i], and that S′[f(a)+⌊ℓ/t⌋] ≠ S′[f(b)+⌊ℓ/t⌋].

Recall that by the definition of S′, S′[f(a)+i] is exactly 𝗋𝖺𝗇𝗄_𝒮(S_{a+t·i}) (as long as i is small enough). Similarly, S′[f(b)+i] = 𝗋𝖺𝗇𝗄_𝒮(S_{b+t·i}). Thus, for any 0 ≤ i < ⌊ℓ/t⌋ we have

S′[f(a)+i] = 𝗋𝖺𝗇𝗄_𝒮(S_{a+t·i}) = 𝗋𝖺𝗇𝗄_𝒮(S[a+t·i..a+t·i+(t−1)]) (4)
           = 𝗋𝖺𝗇𝗄_𝒮(S[b+t·i..b+t·i+(t−1)]) = 𝗋𝖺𝗇𝗄_𝒮(S_{b+t·i}) = S′[f(b)+i].

By a similar analysis, one gets S′[f(a)+⌊ℓ/t⌋] ≠ S′[f(b)+⌊ℓ/t⌋].

The assumption S[a..] ≺ S[b..] means that S[a+ℓ] < S[b+ℓ]. Let k = t·⌊ℓ/t⌋; we focus on S_{a+k} and S_{b+k}. Let r = ℓ − k, and notice that r < t (indeed, r = ℓ mod t). We show that S_{a+k} ≺ S_{b+k}. For any i < r, since k+i < k+r = ℓ, we have S_{a+k}[i] = S[a+k+i] = S[b+k+i] = S_{b+k}[i]. Since k+r = ℓ, we have S_{a+k}[r] = S[a+k+r] = S[a+ℓ] < S[b+ℓ] = S[b+k+r] = S_{b+k}[r]. Therefore, S_{a+k} ≺ S_{b+k}, and S′[f(a)+⌊ℓ/t⌋] = 𝗋𝖺𝗇𝗄_𝒮(S_{a+k}) < 𝗋𝖺𝗇𝗄_𝒮(S_{b+k}) = S′[f(b)+⌊ℓ/t⌋]. Combining this with Equation (4), we have S′[f(a)..] ≺ S′[f(b)..].

First step - compute exact 𝖫𝖢𝖯 values for DC_t(S).

Let i_1, i_2, …, i_{|DC_t(S)|} be the indices of DC_t(S) such that for any j we have S[i_j..] ≺ S[i_{j+1}..]. Notice that i_j = f^{−1}(𝖲𝖠_{S′}[j]). The first step of the algorithm is to compute the exact value of 𝖫𝖢𝖯(S[i_j..], S[i_{j+1}..]) for every j. For this step, the algorithm uses the inverse suffix array 𝖲𝖠_{S′}^{−1}. Recall that for any i, 𝖲𝖠_{S′}^{−1}[i] is the index j such that 𝖲𝖠_{S′}[j] = i. By a simple routing, in O(1) rounds, the algorithm distributes the 𝖲𝖠_{S′}^{−1} information to the Congested Clique nodes, such that the node v that stores the a-th character of S, for a ∈ DC_t(S), gets 𝖲𝖠_{S′}^{−1}[f(a)]. Moreover, v also gets the value b such that f(b) = 𝖲𝖠_{S′}[𝖲𝖠_{S′}^{−1}[f(a)]+1], which is the index of the lexicographically successive suffix among the DC_t(S) suffixes. In addition, v gets ℓ′ = 𝖫𝖢𝖯_{S′}[𝖲𝖠_{S′}^{−1}[f(a)]], which is exactly ℓ′ = 𝖫𝖢𝖯(S′[f(a)..], S′[f(b)..]). Now, v creates 2t = O(n^ε) queries, to get all the characters of S[a+t·ℓ′..a+t·ℓ′+t−1] and S[b+t·ℓ′..b+t·ℓ′+t−1] (each query is for one character). Notice that every node holds at most O(n^{1−ε}) indices of DC_t(S), and therefore every node has at most O(n^{1−ε}·t) = O(n) queries. Thus, using Lemma 10, all the queries are answered in O(1) rounds. Using the answers, for every index i_j, the exact value of 𝖫𝖢𝖯(S[i_j..], S[i_{j+1}..]) is computed.

Let 𝖫𝖢𝖯̄_S be the array of all the revised 𝖫𝖢𝖯 values of 𝖫𝖢𝖯_{S′}, i.e., the value of 𝖫𝖢𝖯̄_S[i] is the exact 𝖫𝖢𝖯 of the suffixes f^{−1}(𝖲𝖠_{S′}[i]) and f^{−1}(𝖲𝖠_{S′}[i+1]), which are the lexicographically i-th and (i+1)-th suffixes among those starting in DC_t(S). We store 𝖫𝖢𝖯̄_S in a distributed manner in the Congested Clique network.

Second step - compute 𝖫𝖢𝖯_S.

In the second step, the algorithm computes the 𝖫𝖢𝖯 of every suffix of S with its lexicographic successor. This computation is done similarly to the second step of the 𝖲𝖠_S construction algorithm. In every level, the computation of 𝖫𝖢𝖯_S is done after the computation of 𝖲𝖠_S; thus, we can use the order of the suffixes of S as an input for this step. For every index i, let î be the starting index of the suffix succeeding S[i..] in the lexicographic order. The node that stores S[i] gets î in O(1) rounds. The algorithm uses the same representative objects used to compute 𝖲𝖠_S, in order to compute 𝖫𝖢𝖯(S[i..], S[î..]) = 𝖫𝖢𝖯_S[𝖲𝖠_S^{−1}[i]]. Recall that the representative objects of i and î are composed of two parts. The first part contains S_i and S_î, the substrings of S of length t starting at positions i and î. The algorithm first compares S_i and S_î, and if S_i ≠ S_î then 𝖫𝖢𝖯(S[i..], S[î..]) = 𝖫𝖢𝖯(S_i, S_î). If S_i = S_î, the algorithm uses the second part of the representative objects. Recall that in this part, the object of an index a stores an array of length t with the ranks (among the suffixes starting in DC_t(S)) of the suffixes starting in DC_t(S) ∩ [a..a+t−1], and −1s elsewhere. Moreover, by Lemma 17, it is guaranteed that there is some k ∈ [0..t−1] such that A_i[k] ≠ −1 and A_î[k] ≠ −1. Since S_i = S_î and k < |S_i|, we have 𝖫𝖢𝖯(S[i..], S[î..]) = k + 𝖫𝖢𝖯(S[i+k..], S[î+k..]). Thus, we have the ranks of the suffixes i+k and î+k among the suffixes of S starting at positions in DC_t(S) (these ranks are 𝖲𝖠_{S′}^{−1}[f(i+k)] and 𝖲𝖠_{S′}^{−1}[f(î+k)]). Due to the following fact, 𝖫𝖢𝖯(S[i+k..], S[î+k..]) = min{𝖫𝖢𝖯̄_S[j] ∣ 𝖲𝖠_{S′}^{−1}[f(i+k)] ≤ j ≤ 𝖲𝖠_{S′}^{−1}[f(î+k)] − 1}. (This is because 𝖫𝖢𝖯̄_S is the array of the 𝖫𝖢𝖯 values of a monotone sequence of strings, and S[i+k..], S[î+k..] are both elements in the sequence.)

Fact 20.

Let T_1, T_2, …, T_k be a sequence of strings such that T_i ⪯ T_{i+1} and T_i is not a prefix of T_{i+1}, for all i ∈ [1..k−1]. Then, for any a < b, we have 𝖫𝖢𝖯(T_a, T_b) = min{𝖫𝖢𝖯(T_i, T_{i+1}) ∣ i ∈ [a..b−1]}.

Proof.

Let ℓ = min{𝖫𝖢𝖯(T_i, T_{i+1}) ∣ i ∈ [a..b−1]}, and let i* ∈ [a..b−1] be an index with 𝖫𝖢𝖯(T_{i*}, T_{i*+1}) = ℓ. For any 1 ≤ j ≤ ℓ we have T_a[j] = T_{a+1}[j] = T_{a+2}[j] = ⋯ = T_b[j], since 𝖫𝖢𝖯(T_i, T_{i+1}) ≥ ℓ for every i ∈ [a..b−1], by a straightforward induction. On the other hand, T_a[ℓ+1] ≤ T_{a+1}[ℓ+1] ≤ T_{a+2}[ℓ+1] ≤ ⋯ ≤ T_b[ℓ+1], because T_i ⪯ T_{i+1} for all i. Moreover, since T_{i*}[ℓ+1] ≠ T_{i*+1}[ℓ+1], it must be that T_{i*}[ℓ+1] < T_{i*+1}[ℓ+1], and therefore T_a[ℓ+1] < T_b[ℓ+1]. Thus, 𝖫𝖢𝖯(T_a, T_b) = ℓ.

Thus, for every i, if 𝖫𝖢𝖯(S[i..], S[î..]) is not determined by S_i and S_î, the algorithm has to perform one range minimum query (RMQ) on 𝖫𝖢𝖯̄_S. We now describe how to compute all these range minimum queries in O(1) rounds; the following lemma might be of independent interest.

Definition 21.

Given an array A and two indices i, j such that 1 ≤ i ≤ j ≤ |A|, a Range Minimum Query 𝖱𝖬𝖰_A(i,j) returns the minimum value x in the range A[i..j].

Lemma 22.

Let A be an array of O(n^2) numbers (each number of size O(log n) bits), distributed among the n nodes of the Congested Clique model such that each node holds a subarray of length O(n). In addition, every node has O(n) 𝖱𝖬𝖰 queries. Then, there is an algorithm that computes for each node the results of all its 𝖱𝖬𝖰 queries in O(1) rounds.

Proof.

First, each node broadcasts its subarray length, i.e., how many numbers it contains. Second, each node broadcasts the minimum number it holds.

There are two types of 𝖱𝖬𝖰 queries. The first type is where the range of the 𝖱𝖬𝖰 is contained in one specific node, i.e., both i and j of the 𝖱𝖬𝖰 are in the same node. The second type is where i and j are not in the same node. For this case, we separate the 𝖱𝖬𝖰 into three ranges. The first range is from i to i′, where i′ is the index of the last number in the node that contains the i-th number. The third range is from j′ to j, where j′ is the index of the first number in the node that contains the j-th number. The second range is from i′+1 to j′−1, i.e., all the indices in the intermediate nodes (this range might be empty). To calculate 𝖱𝖬𝖰(i,j), it is enough to calculate the 𝖱𝖬𝖰 of the three ranges, since 𝖱𝖬𝖰(i,j) = min(𝖱𝖬𝖰(i,i′), 𝖱𝖬𝖰(i′+1,j′−1), 𝖱𝖬𝖰(j′,j)). It is easy to calculate 𝖱𝖬𝖰(i′+1,j′−1), since in the second step of the algorithm each node broadcasts its minimum number. We are left with two 𝖱𝖬𝖰 queries to two specific resolving nodes.

To conclude, after the second step, the O(n) 𝖱𝖬𝖰 queries of each node can be answered using O(n) 𝖱𝖬𝖰 queries to specific resolving nodes. Both a query to a specific resolving node and its result can be encoded in O(log n) bits. Hence, using Lemma 10, in O(1) rounds all the O(n) 𝖱𝖬𝖰 queries to specific resolving nodes are resolved.
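A sequential sketch of this decomposition (our own naming: bounds[v] is node v's global index range, node_min[v] is its broadcast minimum, and local_rmq(v, i, j) stands for a query resolved by node v):

```python
def rmq(i, j, bounds, node_min, local_rmq):
    vi = next(v for v, (lo, hi) in enumerate(bounds) if lo <= i <= hi)
    vj = next(v for v, (lo, hi) in enumerate(bounds) if lo <= j <= hi)
    if vi == vj:                          # first type: one query to one resolving node
        return local_rmq(vi, i, j)
    parts = [local_rmq(vi, i, bounds[vi][1]),          # RMQ(i, i') at node vi
             local_rmq(vj, bounds[vj][0], j)]          # RMQ(j', j) at node vj
    parts += [node_min[v] for v in range(vi + 1, vj)]  # RMQ(i'+1, j'-1): broadcast minima
    return min(parts)
```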

Complexity.

The overhead of computing 𝖫𝖢𝖯_S from 𝖫𝖢𝖯_{S′} is just a constant number of rounds per level of the recursion. So, in total, the computation of 𝖲𝖠_S and 𝖫𝖢𝖯_S takes O(log log n) rounds.

References

  • [1] Miklós Ajtai, János Komlós, and Endre Szemerédi. An O(n log n) sorting network. In Proceedings of the 15th Annual ACM Symposium on Theory of Computing, pages 1–9. ACM, 1983. doi:10.1145/800061.808726.
  • [2] Alexandr Andoni, Aleksandar Nikolov, Krzysztof Onak, and Grigory Yaroslavtsev. Parallel algorithms for geometric graph problems. In David B. Shmoys, editor, Symposium on Theory of Computing, STOC 2014, pages 574–583. ACM, 2014. doi:10.1145/2591796.2591805.
  • [3] Benny Applebaum, Dariusz R. Kowalski, Boaz Patt-Shamir, and Adi Rosén. Clique here: On the distributed complexity in fully-connected networks. Parallel Process. Lett., 26(1):1650004:1–1650004:12, 2016. doi:10.1142/S0129626416500043.
  • [4] Lars Arge, Paolo Ferragina, Roberto Grossi, and Jeffrey Scott Vitter. On sorting strings in external memory (extended abstract). In Frank Thomson Leighton and Peter W. Shor, editors, Proceedings of the Twenty-Ninth Annual ACM Symposium on the Theory of Computing, pages 540–548. ACM, 1997. doi:10.1145/258533.258647.
  • [5] Paul Beame, Paraschos Koutris, and Dan Suciu. Communication steps for parallel query processing. In Richard Hull and Wenfei Fan, editors, Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2013, pages 273–284. ACM, 2013. doi:10.1145/2463664.2465224.
  • [6] Soheil Behnezhad, Mahsa Derakhshan, and MohammadTaghi Hajiaghayi. Brief announcement: Semi-mapreduce meets congested clique. CoRR, abs/1802.10297, 2018. doi:10.48550/arXiv.1802.10297.
  • [7] Jon Louis Bentley and Robert Sedgewick. Fast algorithms for sorting and searching strings. In Michael E. Saks, editor, Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 360–369. ACM/SIAM, 1997. URL: http://dl.acm.org/citation.cfm?id=314161.314321.
  • [8] Timo Bingmann. Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools. PhD thesis, Karlsruhe Institute of Technology, Germany, 2018. URL: https://publikationen.bibliothek.kit.edu/1000085031.
  • [9] Robert S. Boyer and J. Strother Moore. A fast string searching algorithm. Commun. ACM, 20(10):762–772, 1977. doi:10.1145/359842.359859.
  • [10] Dany Breslauer and Zvi Galil. An optimal O(log log n) time parallel string matching algorithm. SIAM Journal on Computing, 19(6):1051–1058, 1990. doi:10.1137/0219072.
  • [11] Dany Breslauer and Zvi Galil. Real-time streaming string-matching. ACM Transactions on Algorithms, 10(4):22:1–22:12, 2014. doi:10.1145/2635814.
  • [12] Keren Censor-Hillel, Michal Dory, Janne H. Korhonen, and Dean Leitersdorf. Fast approximate shortest paths in the congested clique. Distributed Comput., 34(6):463–487, 2021. doi:10.1007/s00446-020-00380-5.
  • [13] Keren Censor-Hillel, Orr Fischer, Tzlil Gonen, François Le Gall, Dean Leitersdorf, and Rotem Oshman. Fast distributed algorithms for girth, cycles and small subgraphs. In Hagit Attiya, editor, 34th International Symposium on Distributed Computing, DISC 2020, October 12-16, 2020, Virtual Conference, volume 179 of LIPIcs, pages 33:1–33:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.DISC.2020.33.
  • [14] Charles J Colbourn and Alan CH Ling. Quorums from difference covers. Information Processing Letters, 75(1-2):9–12, 2000. doi:10.1016/S0020-0190(00)00080-6.
  • [15] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. In Eric A. Brewer and Peter Chen, editors, 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137–150. USENIX Association, 2004. URL: http://www.usenix.org/events/osdi04/tech/dean.html.
  • [16] Lester R Ford Jr and Selmer M Johnson. A tournament problem. The American Mathematical Monthly, 66(5):387–389, 1959.
  • [17] Zvi Galil. Optimal parallel algorithms for string matching. Information and Control, 67(1-3):144–157, 1985. doi:10.1016/S0019-9958(85)80031-0.
  • [18] Mohsen Ghaffari, Christoph Grunau, and Slobodan Mitrovic. Massively parallel algorithms for b-matching. In Kunal Agrawal and I-Ting Angelina Lee, editors, SPAA ’22: 34th ACM Symposium on Parallelism in Algorithms and Architectures, pages 35–44. ACM, 2022. doi:10.1145/3490148.3538589.
  • [19] Shay Golan and Matan Kraus. String problems in the congested clique model, 2025. doi:10.48550/arXiv.2504.08376.
  • [20] Torben Hagerup. Optimal parallel string algorithms: sorting, merging and computing the minimum. In Frank Thomson Leighton and Michael T. Goodrich, editors, Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, 23-25 May 1994, Montréal, Québec, Canada, pages 382–391. ACM, 1994. doi:10.1145/195058.195202.
  • [21] MohammadTaghi Hajiaghayi, Hamed Saleh, Saeed Seddighin, and Xiaorui Sun. String matching with wildcards in the massively parallel computation model. In Kunal Agrawal and Yossi Azar, editors, SPAA ’21: 33rd ACM Symposium on Parallelism in Algorithms and Architectures, pages 275–284. ACM, 2021. doi:10.1145/3409964.3461793.
  • [22] James H. Morris Jr. and Vaughan R. Pratt. A linear pattern-matching algorithm. Technical Report 40, University of California, Berkeley, 1970.
  • [23] Tomasz Jurdzinski and Krzysztof Nowicki. MST in O(1) rounds of congested clique. In Artur Czumaj, editor, Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 2620–2632. SIAM, 2018. doi:10.1137/1.9781611975031.167.
  • [24] Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Parallel external memory suffix sorting. In Ferdinando Cicalese, Ely Porat, and Ugo Vaccaro, editors, Proceedings of Combinatorial Pattern Matching - 26th Annual Symposium, CPM 2015, volume 9133 of Lecture Notes in Computer Science, pages 329–342. Springer, 2015. doi:10.1007/978-3-319-19929-0_28.
  • [25] Juha Kärkkäinen and Tommi Rantala. Engineering radix sort for strings. In Amihood Amir, Andrew Turpin, and Alistair Moffat, editors, Proceedings of String Processing and Information Retrieval, 15th International Symposium, SPIRE 2008, volume 5280 of Lecture Notes in Computer Science, pages 3–14. Springer, 2008. doi:10.1007/978-3-540-89097-3_3.
  • [26] Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. Journal of the ACM, 53(6):918–936, 2006. doi:10.1145/1217856.1217858.
  • [27] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987. doi:10.1147/rd.312.0249.
  • [28] Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Amihood Amir and Gad M. Landau, editors, Combinatorial Pattern Matching, 12th Annual Symposium, CPM 2001, volume 2089 of Lecture Notes in Computer Science, pages 181–192. Springer, 2001. doi:10.1007/3-540-48194-X_17.
  • [29] Donald E. Knuth, James H. Morris, Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323–350, 1977. doi:10.1137/0206024.
  • [30] Fabian Kulla and Peter Sanders. Scalable parallel suffix array construction. Parallel Comput., 33(9):605–612, 2007. doi:10.1016/j.parco.2007.06.004.
  • [31] Christoph Lenzen. Optimal deterministic routing and sorting on the congested clique. In Panagiota Fatourou and Gadi Taubenfeld, editors, ACM Symposium on Principles of Distributed Computing, PODC ’13, Montreal, QC, Canada, July 22-24, 2013, pages 42–50. ACM, 2013. doi:10.1145/2484239.2501983.
  • [32] Zvi Lotker, Elan Pavlov, Boaz Patt-Shamir, and David Peleg. MST construction in o(log log n) communication rounds. In Arnold L. Rosenberg and Friedhelm Meyer auf der Heide, editors, SPAA 2003: Proceedings of the Fifteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 94–100. ACM, 2003. doi:10.1145/777412.777428.
  • [33] Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935–948, 1993. doi:10.1137/0222058.
  • [34] Krzysztof Nowicki. A deterministic algorithm for the MST problem in constant rounds of congested clique. In Samir Khuller and Virginia Vassilevska Williams, editors, STOC ’21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 1154–1165. ACM, 2021. doi:10.1145/3406325.3451136.
  • [35] Matthew Felice Pace and Alexander Tiskin. Parallel suffix array construction by accelerated sampling. In Jan Holub and Jan Zdárek, editors, Proceedings of the Prague Stringology Conference 2013, Prague, Czech Republic, September 2-4, 2013, pages 142–156. Department of Theoretical Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2013. URL: http://www.stringology.org/event/2013/p13.html.
  • [36] Boaz Patt-Shamir and Marat Teplitsky. The round complexity of distributed sorting: extended abstract. In Cyril Gavoille and Pierre Fraigniaud, editors, Proceedings of the 30th Annual ACM Symposium on Principles of Distributed Computing, PODC, pages 249–256. ACM, 2011. doi:10.1145/1993806.1993851.
  • [37] Simon J. Puglisi, William F. Smyth, and Andrew Turpin. A taxonomy of suffix array construction algorithms. ACM Computing Surveys, 39(2):4, 2007. doi:10.1145/1242471.1242472.
  • [38] Tatiana Starikovskaya and Hjalte Wedel Vildhøj. Time-space trade-offs for the longest common substring problem. In Johannes Fischer and Peter Sanders, editors, Combinatorial Pattern Matching, 24th Annual Symposium, CPM, volume 7922 of Lecture Notes in Computer Science, pages 223–234. Springer, 2013. doi:10.1007/978-3-642-38905-4_22.
  • [39] Uzi Vishkin. Optimal parallel pattern matching in strings. Information and Control, 67(1-3):91–113, 1985. doi:10.1016/S0019-9958(85)80028-0.
  • [40] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, SWAT 1973, pages 1–11. IEEE Computer Society, 1973. doi:10.1109/SWAT.1973.13.

Appendix A Sorting Large Objects

In this section, we solve the large objects sorting problem of Section 1.2 for the special case of $\varepsilon = 2/3$ in the Congested Clique model, by presenting a deterministic sorting algorithm for objects of size $O(n^{1/3})$ words that takes $O(1)$ rounds. In the full version of the paper [19], we generalize the algorithm to any $\varepsilon > 0$, which proves Theorem 4.

Our algorithm makes use of the following two lemmas.

Lemma 23 ([12, Lemma 3]).

Let $x_1, x_2, \ldots, x_n$ be natural numbers, and let $X$, $x$, and $k$ be natural numbers such that $\sum_{i=1}^{n} x_i = X$ and $x_i \le x$ for all $i$. Then there is a partition of $[n]$ into $k$ sets $I_1, I_2, \ldots, I_k$ such that for each $j$, the set $I_j$ consists of consecutive elements and $\sum_{i \in I_j} x_i \le X/k + x$.
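The following is a minimal sequential sketch of the greedy argument behind Lemma 23; the function name, the use of Python, and the greedy strategy as written are our own rendering, not code from [12].

```python
def partition_consecutive(xs, k):
    """Greedily partition indices 0..n-1 into at most k groups of
    consecutive elements, each of total weight at most X/k + x,
    where X = sum(xs) and x = max(xs) (the guarantee of Lemma 23)."""
    X, x = sum(xs), max(xs)
    groups, current, total = [], [], 0
    for i, xi in enumerate(xs):
        current.append(i)
        total += xi
        if total >= X / k:          # close a group once it reaches X/k;
            groups.append(current)  # its weight is then below X/k + x
            current, total = [], 0
    if current:
        groups.append(current)      # the leftover group weighs below X/k
    return groups + [[] for _ in range(k - len(groups))]  # pad to k sets
```

Each closed group weighs less than $X/k + x$ because its weight was below $X/k$ before its last element (of weight at most $x$) was added, and since every closed group weighs at least $X/k$, at most $k$ groups are ever created.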

Lemma 24 ([23, Lemma 1.2]).

Let $A$ be a Congested Clique algorithm which, besides the nodes $u_1, \ldots, u_n$ corresponding to the input strings, uses $O(n)$ auxiliary nodes $v_1, v_2, \ldots$ such that the auxiliary nodes initially have no knowledge of the input strings on the nodes $u_1, \ldots, u_n$. Then, each round of $A$ can be simulated in $O(1)$ rounds in the standard Congested Clique model, without auxiliary nodes.

Our algorithm is a generalization of Lenzen's sorting algorithm [31]. In [31], each node is given $n$ keys, each of size $O(1)$ words (i.e., $O(\log n)$ bits), and the nodes need to learn the ranks of their keys in the total order of the union of all keys. Our algorithm uses similar methods but adds another level of recursion in order to handle objects of size $\omega(1)$ (yet $O(n^{1/3})$) words. Formally, we prove here the following lemma.

Lemma 25.

Consider the variant of the problem of Section 1.2 where each object is of size $O(n^{1/3})$ words and every object is stored in one node. Then, there exists an algorithm that solves this variant in $O(1)$ rounds.

The main part of the algorithm is sorting the objects of the network by redistributing the objects among the nodes such that for any two objects $B < B'$ that the algorithm sends to nodes $v_i$ and $v_j$, respectively, we have $i \le j$.

The algorithm stores with each object $B$ the original node of $B$ and the index of $B$ in the original node. The algorithm uses the order of original nodes and indices to break ties.

To sort large objects of size $O(n^{1/3})$ words, the algorithm uses two building blocks. First, we show how to sort the objects of a set of $n^{1/3}$ nodes (we assume that $n^{1/3}$ is an integer; otherwise, we add $O(n)$ auxiliary nodes so that the total number of nodes $n'$ satisfies that $n'^{1/3}$ is an integer, and by Lemma 24 the round complexity increases only by a constant factor). This algorithm is the base of the second building block, which is a recursive algorithm that sorts the objects of a set of $\omega(n^{1/3})$ nodes.

A.1 Sorting at most $n^{1/3}$ Nodes with Objects of Size $O(n^{1/3})$

In this section we present Algorithm 4, which sorts all the objects in a set $W \subseteq V$ of at most $n^{1/3}$ nodes, where each object is of size $O(n^{1/3})$ words and each node stores $O(n)$ words. As in [31], each node marks some objects as candidates. Then, $n^{1/3}$ of the candidates are chosen to be delimiters, and the objects are redistributed according to these delimiters. The main part of the analysis is to prove that the redistribution works well, i.e., the delimiters divide the objects into sets of almost even total size, and therefore each set can be sent to one node in $O(1)$ rounds.

Algorithm 4 Sorting objects of at most $n^{1/3}$ nodes.

Correctness.

The correctness of Algorithm 4 derives from steps 4 to 6. As in [31, Lemma 4.2], due to the partitioning by delimiters, all the objects in $K_{i,j}$ are larger than the objects in $K_{i',j'}$ for all $v_i, v_{i'} \in W$ and $j' < j$.
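As an illustration of the delimiter mechanism, the following sequential sketch (with names of our own choosing, not taken from Algorithm 4) selects evenly spaced delimiters from the sorted candidates and assigns every object to a bucket; bucket $j$ plays the role of the union of the sets $K_{i,j}$, so concatenating the sorted buckets yields the sorted order.

```python
import bisect

def redistribute_by_delimiters(objects, candidates, num_buckets):
    """Sketch of the redistribution in Algorithm 4: pick evenly spaced
    delimiters from the sorted candidates, then route each object to
    the bucket of the first delimiter that is >= the object."""
    candidates = sorted(candidates)
    step = max(1, len(candidates) // num_buckets)
    delimiters = candidates[step - 1::step][:num_buckets - 1]
    buckets = [[] for _ in range(len(delimiters) + 1)]
    for obj in objects:
        # all delimiters before this position are strictly smaller than obj
        buckets[bisect.bisect_left(delimiters, obj)].append(obj)
    return buckets
```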

Complexity.

We now show that Algorithm 4 runs in $O(1)$ rounds. Notice that communication happens only in steps 3 and 6. In both steps, each node sends $O(n)$ words. We will show that each node also receives $O(n)$ words, and therefore we can use Lenzen's routing scheme.

For step 3, notice that there are $|W| \le n^{1/3}$ nodes, $O(n^{1/3})$ candidates per node, and the size of any candidate is $O(n^{1/3})$ words. Therefore, each node receives $O(n^{1/3}) \cdot O(n^{1/3}) \cdot O(n^{1/3}) = O(n)$ words.

It is left to prove that in step 6 each node receives $O(n)$ words. A similar claim was proved in [31, Lemma 4.3], and we show here (with the proof given in the full version of the paper [19]) that the partitioning of the objects in step 2 is an efficient adaptation of Lenzen's sorting algorithm to the case of large objects.

Lemma 26.

When executing Algorithm 4, for each $j \in [1..|W|]$, it holds that $\sum_{i=1}^{|W|} |K_{i,j}| = O(n)$.

A.2 Sorting more than $n^{1/3}$ Nodes with Objects of Size $O(n^{1/3})$

The following algorithm sorts all the objects in a set $U \subseteq V$ of $|U| > n^{1/3}$ nodes, where each object is of size $O(n^{1/3})$ words and each node stores $O(n)$ words. In particular, this algorithm sorts all the objects when $U$ consists of all $n$ nodes.

In this recursive algorithm, each node marks some objects as candidates, such that all the candidates fit into $O(|U|/n^{1/3})$ nodes. The candidates are sorted recursively, and $n^{1/3}$ of the candidates are chosen to be delimiters. Then, the objects are redistributed according to these delimiters, and each resulting set of $O(|U|/n^{1/3})$ nodes is sorted recursively; a sequential sketch of this recursion appears below.
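The following sketch (our own simplification; the actual algorithm distributes every step among the nodes) captures the shape of the recursion:

```python
import bisect

def recursive_sort(objs, base=8):
    """Sketch of the recursion of Algorithm 5: small inputs are sorted
    directly (the role of Algorithm 4); otherwise a sample of candidates
    is sorted recursively, delimiters are drawn from it, the objects are
    bucketed by delimiter, and each bucket is sorted recursively."""
    if len(objs) <= base:
        return sorted(objs)                          # Algorithm 4's role
    candidates = recursive_sort(objs[::base], base)  # sorted sample
    step = max(1, len(candidates) // base)
    delims = candidates[step - 1::step][:base - 1]
    buckets = [[] for _ in range(len(delims) + 1)]
    for o in objs:
        buckets[bisect.bisect_left(delims, o)].append(o)
    if max(len(b) for b in buckets) == len(objs):
        return sorted(objs)     # guard for degenerate samples (e.g., ties)
    return [o for b in buckets for o in recursive_sort(b, base)]
```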

Algorithm 5 Sorting objects of more than $n^{1/3}$ nodes.

Correctness.

The correctness of Algorithm 5 stems from steps 5 to 8 and follows analogously to the correctness of Algorithm 4.

Complexity.

We now focus on the steps of Algorithm 5 in which communication takes place, and show that each of them takes $O(1)$ rounds.

In step 3, we need to explain some further algorithmic details. Each node sends $O(n^{1/3})$ objects of size $O(n^{1/3})$ words each, i.e., at most $O(n^{2/3})$ words per node. The candidates of node $v_i$ are sent to node $v_j$ for $j = \lceil i/n^{1/3} \rceil$; therefore, each node receives at most $n^{1/3} \cdot O(n^{2/3}) = O(n)$ words. By Lenzen's routing scheme this is done in $O(1)$ rounds. In step 4, notice that since the number of nodes is at most $n$, the depth of the recursion is $O(1)$, and therefore the recursion does not increase the round complexity asymptotically.

In step 5, the delimiters need to be identified. Each node $v$ among the first $|U|/n^{1/3}$ nodes broadcasts the number of candidates that $v$ received in step 4. Thereby, each such node $v$ computes for every candidate $B$ it holds whether the rank of $B$ among the candidates is a multiple of $|\mathcal{C}|/n^{1/3}$, and if so, selects $B$ to be a delimiter. There are $O(n^{1/3})$ delimiters of size $O(n^{1/3})$ words each, which is $O(n^{2/3})$ words in total. Therefore, the delimiters are announced to all the nodes in $O(1)$ rounds using Lemma 9.
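In a sequential form, the rank test of step 5 might look as follows; the variable names are ours, and counts[i] stands for the value broadcast by the $i$-th of the first $|U|/n^{1/3}$ nodes.

```python
def select_my_delimiters(counts, my_id, my_candidates, num_delimiters):
    """Sketch of step 5: from the broadcast counts, node my_id derives
    the global rank of its first candidate, and selects as delimiters
    the candidates whose rank is a multiple of |C| // num_delimiters.
    Assumes my_candidates is sorted and the nodes are globally ordered."""
    total = sum(counts)                     # |C|, the number of candidates
    step = max(1, total // num_delimiters)
    offset = sum(counts[:my_id])            # candidates held by earlier nodes
    return [c for rank, c in enumerate(my_candidates, start=offset + 1)
            if rank % step == 0]
```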

For step 7, we first show that the total number of words that each set $W_j$ receives is $O(|U| \cdot n^{2/3})$.

Lemma 27 (Proof is given in the full version [19]).

When executing Algorithm 5, for each $j \in [1..n^{1/3}]$, it holds that $\sum_{i=1}^{|U|} |K_{i,j}| = O(|U| \cdot n^{2/3})$.

Now, the algorithm selects for every set $W_j$ a leader $v_{W_j}$. Each node $v_i$ sends $|K_{i,j}|$ to $v_{W_j}$. The leader $v_{W_j}$ computes and sends to each node $v_i \in U$ a node $w \in W_j$ such that $v_i$ should send $K_{i,j}$ to $w$. By Lemma 23, there is such an assignment in which each node $w \in W_j$ receives at most $O(n)$ words, by setting in the lemma $x_i \leftarrow |K_{i,j}|$, $X \leftarrow |U| \cdot O(n^{2/3})$, $x \leftarrow O(n)$, and $k \leftarrow |W_j| = |U|/n^{1/3}$. On the other hand, each node sends at most $O(n)$ words. By Lenzen's routing scheme this is done in $O(1)$ rounds.
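For concreteness, the leader's computation can reuse the partition_consecutive sketch from the beginning of this appendix; the numbers below are hypothetical toy values, not values from the paper.

```python
# The leader of W_j receives the sizes |K_{i,j}| (the x_i's of Lemma 23)
# and partitions the senders into k = |W_j| groups of consecutive
# indices, one group per receiving node w in W_j.
sizes = [3, 1, 4, 1, 5, 9, 2, 6]            # toy values of |K_{i,j}|, i = 1..|U|
groups = partition_consecutive(sizes, k=4)  # k = |W_j|
# each group's total weight is at most sum(sizes)/4 + max(sizes),
# which corresponds to the O(n) bound in the text
```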

In step 8, similarly to step 4, the depth of the recursion is $O(1)$.

We are ready to prove Lemma 25.

Proof of Lemma 25.

First, we apply Algorithm 5 with $U = V$. Hence, all the objects are ordered in non-decreasing lexicographical order among all the nodes of the network.

Next, we show how to compute $\mathsf{rank}(B)$ for each object $B$. Each node $v_i$ for $1 \le i < n$ sends to node $v_{i+1}$ the largest object of $v_i$ (by the lexicographical order), denoted $B_i$. Then, each node $v_i$ computes and broadcasts the number of distinct objects that $v_i$ holds which are different from $B_{i-1}$ (for $i = 1$, node $v_i$ simply broadcasts the number of distinct objects it holds), ignoring the tiebreakers of the original node and original index (notice that this number of distinct objects might be $0$). Now, each node $v_i$ computes the rank of all the objects it holds.
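Sequentially, this distinct-count bookkeeping amounts to the following sketch (our own rendering of the step, not code from the paper):

```python
def ranks_from_sorted_runs(runs):
    """Sketch of the rank computation: runs[i] is the sorted list of
    objects held by node v_{i+1}, and the runs are globally sorted.
    Each object is compared to the previously scanned object (which,
    at a run boundary, is the previous node's largest object B_{i-1}),
    and the count of distinct objects seen so far is its rank."""
    ranks, prev, next_rank = {}, None, 0
    for run in runs:
        for obj in run:
            if obj != prev:         # a new distinct object
                next_rank += 1
                prev = obj
            ranks[obj] = next_rank
    return ranks
```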

Lastly, for every object $B$, $\mathsf{rank}(B)$ is sent to the original node of $B$, using the information about the original node that $B$ stores. By Lenzen's routing scheme this is done in $O(1)$ rounds.

Appendix B Sorting Objects of Size $O(n)$

In this section we prove Theorem 5. Notice that the problem of Section 1.2 with $\varepsilon = 0$ means that every key is of size $O(n)$ words.

B.1 Upper Bound

In this section we show an algorithm that sorts objects of size $O(n)$ words in $O(\log n)$ rounds. Our algorithm simulates an execution of a sorting network. A sorting network with $N$ wires (analogous to the cells of an input array) sorts comparable objects as follows. The network has a fixed number of parallel levels, where each level is composed of at most $N/2$ comparators. A comparator compares two input objects and swaps their positions if they are out of order. The number of parallel levels of a sorting network is called the depth of the sorting network.
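To fix the semantics, here is a minimal sketch of one level of comparators, together with the odd-even transposition network as a simple stand-in (its depth is $N$ rather than the $O(\log N)$ of the AKS network used below):

```python
def apply_level(wires, comparators):
    """One parallel level: each comparator (a, b) with a < b swaps the
    objects on wires a and b if they are out of order. The comparators
    of a single level touch disjoint wires."""
    for a, b in comparators:
        if wires[a] > wires[b]:
            wires[a], wires[b] = wires[b], wires[a]

def transposition_levels(N):
    """Levels of the odd-even transposition network, used here only as
    an easy-to-state stand-in for the AKS network."""
    return [[(i, i + 1) for i in range(r % 2, N - 1, 2)] for r in range(N)]

wires = [5, 2, 4, 1, 3]
for level in transposition_levels(len(wires)):
    apply_level(wires, level)
assert wires == [1, 2, 3, 4, 5]
```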

Ajtai, Komlós, and Szemerédi (AKS) [1] described a sorting network of depth $O(\log N)$ for every $N$. Notice that in our setting $N \le O(n^2)$. We prove the following lemma.

Lemma 28.

There is an algorithm that solves the problem of Section 1.2 with $\varepsilon = 0$ in $O(\log n)$ rounds.

Proof.

We show how to simulate the execution of each level of the AKS sorting network for $N$ input objects in the Congested Clique model in $O(1)$ rounds.

First, each node broadcasts the number of objects it stores, and then each node calculates $N$, the total number of objects in the network, and produces the AKS sorting network with $N$ wires. In addition, each node attaches to every object within the node the global index of the object as metadata.

At each level, there are $O(n^2)$ comparators. Each node $v_i$ is responsible for the comparators $[(i-1)\lceil N/n \rceil + 1 \,..\, i\lceil N/n \rceil]$ (if they exist), which are $O(n)$ comparators. To this end, each node with inputs to a comparator under $v_i$'s responsibility sends, for every such input, the size and the index of the input to $v_i$. This routing takes $O(1)$ rounds. Denote by $S_{v_i}$ the sum of the sizes of the objects (inputs) under $v_i$'s responsibility. Then, $v_i$ creates $a_i = \lceil S_{v_i}/n \rceil$ auxiliary nodes. Notice that the total number of auxiliary nodes is at most

$$\sum_{i=1}^{n} a_i = \sum_{i=1}^{n} \left\lceil \frac{S_{v_i}}{n} \right\rceil \le \sum_{i=1}^{n} \left(1 + \frac{S_{v_i}}{n}\right) \le n + \frac{O(n^2)}{n} = O(n).$$

By Lemma 24, the algorithm can simulate each round with the auxiliary nodes on the original network in $O(1)$ rounds. Then, $v_i$ calculates a partition of the comparators among $v_i$'s auxiliary nodes such that for each auxiliary node of $v_i$, the total size of the objects destined for $v_i$'s comparators on this auxiliary node is $O(n)$ (by Lemma 23, such a partition exists). Then, let $u$ be a node that holds an object $B$ that should be sent to one of $v_i$'s comparators. $v_i$ informs $u$ which auxiliary node is the target of $B$, and then $u$ sends $B$ to this auxiliary node. By Lenzen's routing scheme, this is done in $O(1)$ rounds.

Finally, all the comparators execute their comparisons, and whenever a swap is needed, the two objects swap their metadata indices. Since a swap of wires in a comparator is equivalent to a swap of indices in our simulation, after the last level the metadata index of each object is its rank among all the objects.

To conclude, it takes $O(1)$ rounds to simulate each level of the sorting network. The AKS sorting network has at most $O(\log(n^2)) = O(\log n)$ levels, and therefore sorting objects of size $O(n)$ words takes $O(\log n)$ rounds.

B.2 Lower Bound

In this section we prove the lower bound of Theorem 5.

Lemma 29.

Every comparison-based algorithm that solves the problem of Section 1.2 with $\varepsilon = 0$ must take $\Omega(\log n / \log\log n)$ rounds.

Proof.

Let $A$ be an algorithm that solves the problem of Section 1.2 with $n$ nodes and $n$ keys, each of size $\Theta(n)$, in $r$ rounds. We describe another algorithm $A'$ that also runs in $r$ rounds, solves the problem, and performs only $O(nr\log r)$ comparisons. Thus, for $r = o(\log n / \log\log n)$ we would get an algorithm that sorts $n$ keys with $O(nr\log r) = o(n\log n)$ comparisons, which contradicts the celebrated comparison-based sorting lower bound by Ford and Johnson [16].

The algorithm $A'$ works as follows. Let us focus on one specific node $v$. The node $v$ simulates $A$, but ignores all the comparisons performed in $A$. Let $x_1, x_2, \ldots$ be the keys $v$ receives during the run of the algorithm, in their arrival order (breaking ties arbitrarily). In $A'$, the node $v$ maintains at all times the sorted order of all the keys that $v$ has received so far. Whenever $v$ receives a new key $x_i$, $v$ performs a binary search on the keys $\{x_1, \ldots, x_{i-1}\}$ (which are maintained in sorted order). This takes $O(\log i)$ comparisons, after which $v$ updates the sorted order of all the keys $x_1, \ldots, x_i$ with no additional comparisons.
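A minimal sketch of $v$'s bookkeeping in $A'$ (the class name and the comparison counter are ours; the counter is an upper-bound proxy for the binary-search comparisons):

```python
import bisect

class SortedReceiver:
    """Node v in A': keep the received keys sorted and insert each new
    key with a binary search, so the i-th key costs O(log i) comparisons
    and r keys cost O(r log r) comparisons overall."""
    def __init__(self):
        self.keys = []
        self.comparisons = 0

    def receive(self, key):
        # a binary search over i keys performs O(log i) comparisons
        self.comparisons += len(self.keys).bit_length()
        bisect.insort(self.keys, key)  # maintains the sorted order
```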

Since $v$ maintains at all times the sorted order of all the keys it has received so far, $v$ can simulate any operation that $v$ has to perform in $A$, even if the operation requires a comparison between keys.

We focus on the case where every key is of size $\Theta(n)$ words. In this case, a node $v$ must receive $\Omega(n)$ words in order to obtain a key, while in each of the $r$ rounds of $A$, $v$ receives only $O(n)$ words; hence $v$ receives at most $O(r)$ keys. The total number of comparisons $v$ performs while running $A'$ is $\sum_{i=1}^{O(r)} \log i = O(r\log r)$. There are $n$ nodes in the Congested Clique, and therefore there are $O(nr\log r)$ comparisons in total across all the nodes.