
String Problems in the Congested Clique Model

Shay Golan, Reichman University, Herzliya, Israel, and University of Haifa, Israel
Matan Kraus, Bar-Ilan University, Ramat Gan, Israel
Abstract

In this paper we present algorithms for several string problems in the Congested Clique model. In the Congested Clique model, n nodes (computers) are used to solve some problem. The input to the problem is distributed among the nodes, and the communication between the nodes is conducted in rounds. In each round, every node is allowed to send an O(log n)-bit message to every other node in the network.

We consider three fundamental string problems in the Congested Clique model. First, we present an O(1)-round algorithm for string sorting that supports strings of arbitrary length. Second, we present an O(1)-round combinatorial pattern matching algorithm. Finally, we present an O(log log n)-round algorithm for the computation of the suffix array and the corresponding Longest Common Prefix array of a given string.

Keywords and phrases:
String Sorting, Pattern Matching, Suffix Array, Congested Clique, Sorting
Funding:
Shay Golan: supported by Israel Science Foundation grant 810/21.
Matan Kraus: supported by the BSF grant 2018364, and by the ERC grant MPM under the EU’s Horizon 2020 Research and Innovation Programme (grant no. 683064).
Copyright and License:
© Shay Golan and Matan Kraus; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Distributed algorithms
Related Version:
Full Version: http://arxiv.org/abs/2504.08376 [19]
Editors:
Paola Bonizzoni and Veli Mäkinen

1 Introduction

In the Congested Clique model [32, 31, 36], n nodes (computers) are used to solve some problem. The input to the problem is spread among the nodes, and the communication among the nodes is done in rounds. In each round, every node is allowed to send one message to every other node in the network. Typically, the size of every message is O(log n) bits, and messages to different nodes can be different. Usually, the input of every node is assumed to be of size O(n) words, and so it can be sent to other nodes in one round, and one can hope for O(1)-round algorithms. In this model, the local computation time is ignored, and the efficiency of an algorithm is measured by the number of communication rounds it performs.

One of the fundamental tasks in the Congested Clique model is sorting of elements. In one of the seminal results for this model, Lenzen [31] presents a sorting algorithm that runs in O(1) rounds. Lenzen's algorithm supports keys of size O(log n) bits. We show how to generalize Lenzen's sorting algorithm to support keys of size O(n^{1−ε}) words (for any constant ε > 0). Using this sorting algorithm, we introduce efficient Congested Clique algorithms for several string problems.

String sorting (Section 3).

The first algorithm is for the string sorting problem [7, 25, 8]. This is a special case of the large objects sorting problem, where the order defined on the objects is the lexicographic order. We introduce an O(1)-round algorithm for this specific order, even if there are strings of length ω(n).

Pattern matching (Section 4).

The second algorithm we present is an O(1)-round algorithm for pattern matching, which uses the string sorting algorithm. In the pattern matching problem, the input is two strings, a pattern P and a text T, and the goal is to find all the occurrences of P in T. Algorithms for this problem have been designed since the 1970s [22, 29, 27, 40, 9]. In the closely related model of Massively Parallel Computing (MPC) (see the discussion below), Hajiaghayi et al. [21] introduce a pattern matching algorithm that is based on FFT convolutions. Their algorithm can be adjusted to an O(1)-round algorithm in the Congested Clique model. However, our algorithm has the advantage of using only combinatorial operations.

Suffix Array construction and the corresponding LCP array (Section 5).

The last algorithm we present constructs the suffix array [33, 37] (𝖲𝖠) of a given string, together with the corresponding longest common prefix (𝖫𝖢𝖯) array [28, 33]. The suffix array of a string S, denoted by 𝖲𝖠_S, is a sorted array of all the suffixes of S. The 𝖫𝖢𝖯_S array stores, for every two lexicographically consecutive suffixes, the length of their longest common prefix. It was proved [33, 28] that the combination of 𝖲𝖠_S with 𝖫𝖢𝖯_S is a powerful tool that can simulate a bottom-up suffix tree traversal, and is useful for solving problems like finding the longest common substring of two strings. Our algorithm takes O(log log n) rounds.

The input model.

Most of the problems considered so far in the Congested Clique model are graph problems. For such problems, where the input is a graph, it is very natural to consider partitioning the input among the n nodes: each node in the network receives all the information on the neighborhood of one vertex of the input graph. However, when the input is a set of objects or strings, as in our problems, it is less clear how the input is spread among the n nodes of the Congested Clique network. We minimize our assumptions on the input, to obtain algorithms that are as general as possible. Inspired by the standard RAM model, we view the input of any problem as one long array (just like the internal memory of a computer). In the Congested Clique model with n nodes, we assume that this array is distributed among the local memories of the nodes (see Figure 1 for the case where the input objects are strings). So, to receive as input a sequence of objects with a total size of O(n^2) words, we consider their representation in a long array, one after the other, and then partition this array into n pieces, each of length O(n). The assumption that the input of every node is of length O(n) is consistent with previous problems considered in the Congested Clique model [3, 31]. For the sake of simplicity, we also assume that when the size of every object is bounded by O(n) words, each object is stored in only one node. This assumption can be guaranteed within an overhead of O(1) rounds.

Figure 1: An example of a sequence of strings; the input is partitioned among nodes v_1, …, v_n.

Relation between Congested Clique and MPC.

The Massively Parallel Computing (MPC) model [5, 2] is a popular model for analyzing the more practical MapReduce model [15] from a theoretical point of view. In this model, a problem with input of size O(N) words is solved by M machines, each with memory of size S words, such that M·S = Θ(N). The MPC model is synchronous, and in every round each machine sends and receives information, such that the data received by each machine fits its memory size. We point out that, as described by Behnezhad et al. [6, Theorem 3.1], every Congested Clique algorithm with a Θ(n^2)-size input that uses O(n) space in every node can be simulated in the MPC model with N = n^2 and S = Θ(M) = Θ(√N). Moreover, it is straightforward that every MPC algorithm that works with S = Θ(M) in r rounds can be simulated in an MPC instance with S = ω(M) (but still M·S = Θ(N)) in r rounds, since every machine can simulate several machines of the original algorithm. As a result, most of the algorithms we introduce in the Congested Clique model imply algorithms with the same round complexity in the MPC model for S = Ω(M). The only exception is the sorting algorithm for the case of ε = 0, which uses ω(n) memory in each machine (see Appendix B). We note that the regime S = Ω(M) is the most common regime for algorithms in the MPC model; see for example [21, 18, 34].

1.1 Related Work

String Sorting.

The string sorting problem was studied in the PRAM model by Hagerup [20], who introduces an optimal O(log n / log log n) time algorithm on a CRCW PRAM. The problem was also studied in the external memory (I/O) model by Arge et al. [4]. Recently, more practically oriented research was done by Bingmann [8].

Pattern Matching.

In parallel models, in the 1980s, Galil [17] and Vishkin [39] introduced pattern matching algorithms in the CRCW and CREW PRAM models. Later, Breslauer and Galil [10] improved the complexity for the CRCW PRAM from O(log n) to O(log log n) rounds, and showed that this round complexity is optimal.

Suffix Array.

In the world of parallel and distributed computing, the problem of 𝖲𝖠 construction was studied in several models, both in theory [26] and in practice [30, 24]. The most related line of work is the result of Kärkkäinen et al. [26] and the improvement for the Bulk Synchronous Parallel (BSP) model by Pace and Tiskin [35]. Kärkkäinen et al. [26] introduce a linear-time algorithm that works in several models of parallel computing and requires O(log^2 n) synchronization steps in the BSP model (for the case of a polynomial number of processors). Their result uses a recursive process with a parameter that is kept at a fixed value in all levels of the recursion. Pace and Tiskin [35] show that one can let the value of the parameter grow with the levels, which they call accelerated sampling, such that the total work does not change asymptotically, but the depth of the recursion, and hence the number of synchronization steps, becomes O(log log n). Our 𝖲𝖠 construction algorithm follows the same idea, but uses different implementation details that fit the Congested Clique model and exploit our large-objects sorting algorithm.

1.2 Our Contribution

Our results are summarized in the following theorems:

Theorem 1 (String Sorting).

There is an algorithm that, given a sequence of strings 𝒮 = (S_1, S_2, …, S_k), computes 𝗋𝖺𝗇𝗄_𝒮(S_j) for every string S_j ∈ 𝒮, and stores this rank in the node that receives S_j[1]. The running time of the algorithm is O(1) rounds of the Congested Clique model.

Theorem 2 (Pattern Matching).

There is an algorithm that, given two strings P and T, computes for every i ∈ [0..|T|−|P|] whether T[i+1..i+|P|] = P, in O(1) rounds of the Congested Clique model.

Theorem 3 (Suffix Array and 𝖫𝖢𝖯).

There is an algorithm that, given a string S, computes 𝖲𝖠_S and 𝖫𝖢𝖯_S in O(log log n) rounds of the Congested Clique model.

As described above, all our results are based on the algorithm for sorting large objects. The problem is defined as follows:

Problem 3 (Large Object Sorting).

Assume that a Congested Clique network of n nodes gets a sequence ℬ = (B_1, B_2, …, B_k) of objects, each of size O(n^{1−ε}) words for some ε ≥ 0, where the total size of ℬ's objects is O(n^2) words. For every object B_j ∈ ℬ, the node that gets B_j needs to learn 𝗋𝖺𝗇𝗄_ℬ(B_j).

The algorithms for Problem 3 are presented in Theorems 4 and 5, which we prove in Appendices A and B, respectively.

Theorem 4.

There is an algorithm that solves Problem 3 for any constant ε > 0 in O(1) rounds of the Congested Clique model.

Theorem 5.

There is an algorithm that solves Problem 3 for ε = 0 in O(log n) rounds. Moreover, any comparison-based Congested Clique algorithm that solves Problem 3 for ε = 0 requires Ω(log n / log log n) rounds.

2 Preliminaries

For i, j ∈ ℤ we denote [i..j] = {i, i+1, i+2, …, j}. For a set S and a scalar α we denote S + α = {a + α ∣ a ∈ S}. For a set 𝒦 of elements with a total order, and an element b ∈ 𝒦, we denote 𝗋𝖺𝗇𝗄_𝒦(b) = |{a ∈ 𝒦 ∣ a < b}| (or simply 𝗋𝖺𝗇𝗄(b) when 𝒦 is clear from the context). We clarify that for a multiset M, we consider the rank of an element b ∈ M to be the number of distinct elements smaller than b in M. For a set of objects 𝒦, we denote by ‖𝒦‖ = Σ_{B ∈ 𝒦} |B| the total size (in words of space) of the objects in 𝒦.

Strings.

A string S = S[1]S[2]⋯S[n] over an alphabet Σ is a sequence of characters from Σ. In this paper we assume Σ = [1..poly(n)], and therefore each character takes O(log n) bits. The length of S is denoted by |S| = n. For 1 ≤ i ≤ j ≤ n, the string S[i..j] = S[i]S[i+1]⋯S[j] is called a substring of S. If i = 1 then S[i..j] is called a prefix of S, and if j = n then S[i..j] is a suffix of S, also denoted by S[i..]. The following lemma from [11] is useful for our pattern matching algorithm in Section 4.

Lemma 6 ([11, Lemma 3.1]).

Let u and v be two strings such that v contains at least three occurrences of u. Let t_1 < t_2 < ⋯ < t_h be the locations of all occurrences of u in v, and assume that t_{i+2} − t_i ≤ |u| for all i ∈ [1..h−2], where h ≥ 3. Then this sequence forms an arithmetic progression with difference d = t_{i+1} − t_i for all i ∈ [1..h−1] (which is equal to the period length of u).

We next define the longest common prefix of two strings; it is used both in the definition of the lexicographic order and in the 𝖫𝖢𝖯 array.

Definition 7.

For two strings S_1, S_2 ∈ Σ*, we denote by 𝖫𝖢𝖯(S_1, S_2) = max({ℓ ∣ S_1[1..ℓ] = S_2[1..ℓ]} ∪ {0}) the length of the longest common prefix of S_1 and S_2.

We provide here the formal definition of lexicographic order between strings.

Definition 8 (Lexicographic order).

For two strings S_1, S_2 ∈ Σ*, we have S_1 ⪯ S_2 if one of the following holds:

  1. ℓ = 𝖫𝖢𝖯(S_1, S_2) < min{|S_1|, |S_2|} and S_1[ℓ+1] < S_2[ℓ+1], or

  2. S_1 is a prefix of S_2, i.e., |S_1| ≤ |S_2| and S_1 = S_2[1..|S_1|].

We denote the case where S_1 ⪯ S_2 and S_1 ≠ S_2 by S_1 ≺ S_2.

Routing.

In the Congested Clique model, a routing problem involves delivering messages from a set of source nodes to a set of destination nodes, where each node may need to send and receive multiple messages. A well-known result by Lenzen [31] shows that if each node is the source and destination of at most O(n) messages, then all messages can be delivered in O(1) rounds. The following lemma is useful for routing in the Congested Clique model.

Lemma 9 ([13, Lemma 9]).

Any routing instance in which every node v is the target of up to O(n) messages, and v locally computes the messages it desires to send from at most O(n log n) bits, can be performed in O(1) rounds.

 Remark.

A particularly useful case in which v locally computes the messages it wishes to send from O(n log n) bits is when v stores O(n) messages, each intended for all nodes in some consecutive range of nodes in the network.

We also provide a routing lemma that considers a case symmetric to that of Lemma 9, which we prove in the full version of the paper [19, Appendix B].

Lemma 10.

Assume that each node of the Congested Clique stores O(n) words, each of size O(log n) bits. Each node has O(n) queries, where each query is a pair consisting of a resolving node and the content of the query, encoded in O(log n) bits. Moreover, each query can be resolved by its resolving node, and the size of the result is O(log n) bits. Then, in O(1) rounds of the Congested Clique, each node can get the results of all its queries.

3 Sorting Strings

Although in Appendix B we show that it is impossible to sort general keys of size Θ(n) in O(1) rounds, in this section we show that under some structural assumptions on the keys and the order, one can do much better. In particular, we show that if the keys are strings and we consider the lexicographic order, we can always sort them in O(1) rounds, even if there are strings of length ω(n).

We introduce a string sorting algorithm that uses the algorithm of Theorem 4 as a black box (specifically, our algorithm uses the algorithm for the special case of ε = 2/3, which is proved in Lemma 25), and applies a technique called renaming to reduce long strings to shorter strings. The reduction preserves the original lexicographic order of the strings. After applying the reduction several times, all the strings are reduced to strings of length O(n^{1/3}), which are sorted by an additional call to the algorithm of Theorem 4.

Renaming.

The idea behind the renaming technique is that to compare two long strings, one can partition the strings into blocks of the same length, and then compare pairs of blocks at corresponding positions in the two strings. The lexicographic order of the two original strings is determined by the lexicographic order of the first pair of blocks that differ. Such a comparison can be done by replacing each block with the block's rank among all blocks, transforming the original strings into new, shorter strings.

As a first step, the algorithm splits each string into blocks of length n^{1/3} characters (for the sake of simplicity, we assume from now on that n^{1/3} is an integer and simply write n^{1/3} instead of ⌈n^{1/3}⌉) as follows. A string S of length |S| = ℓ is partitioned into ⌈ℓ/n^{1/3}⌉ blocks, each of length n^{1/3} characters, except for the last block, which is of length ℓ mod n^{1/3} characters (unless ℓ is a multiple of n^{1/3}, in which case the length of the last block is also n^{1/3} characters).

In the case where every string is stored in one node, such a partitioning can be done locally. In the more general case, each node v broadcasts the number of strings starting in v and the number of characters v stores from v's last string. This broadcast is done in O(1) rounds by Lenzen's routing scheme, and using this information each node v computes the partitioning positions of the strings stored in v. Finally, a block that is spread among two (or more) nodes is transferred so that it is stored only in the node where the block begins. This transfer of block parts is executed in O(1) rounds, since each node is the source and destination of at most n^{1/3} characters.

The next step of the algorithm is to sort all the blocks, using the algorithm of Theorem 4. The result of the sorting is the rank of every block. In particular, if the same block appears more than once, all the block's occurrences get the same rank, and for every two different blocks, the order of their ranks matches their lexicographic order. Thus, the algorithm defines new strings by replacing each block with its rank in the sorting. As a result, every string of length ℓ is reduced to length ⌈ℓ/n^{1/3}⌉. Notice that the alphabet of the new strings is the set of ranks, which is a subset of [1..n^2], and therefore each new character uses O(log n) bits.
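To make the renaming round concrete, here is a minimal sequential sketch in Python (a reference simulation, not the distributed algorithm itself; the function name rename_round and the block-length parameter b, which plays the role of n^{1/3}, are our own):

```python
def rename_round(strings, b):
    # Partition each string into blocks of length b; the last block may be shorter.
    blocked = [[s[i:i + b] for i in range(0, len(s), b)] for s in strings]
    # Rank the distinct blocks; identical blocks receive identical ranks, and the
    # rank order matches the lexicographic order of the blocks (the role of Theorem 4).
    ranks = {blk: r for r, blk in enumerate(sorted({blk for bs in blocked for blk in bs}))}
    # Replace each block by its rank; tuples of ranks compare lexicographically,
    # so the order of the new strings equals the order of the old ones (Lemma 11).
    return [tuple(ranks[blk] for blk in bs) for bs in blocked]

# Example: one renaming round shrinks the strings while preserving their order.
strings = ["banana", "bandana", "ban"]
reduced = rename_round(strings, 2)
assert sorted(range(3), key=lambda i: strings[i]) == sorted(range(3), key=lambda i: reduced[i])
```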

In the following lemma we prove that the new strings preserve the lexicographic order of the original strings.

Lemma 11.

Let A and B be two strings, and let A′ and B′ be the resulting strings defined by replacing each block of A and B with the block's rank among all blocks, respectively. Then A ⪯ B if and only if A′ ⪯ B′.

Proof.

We first prove that A ⪯ B ⟹ A′ ⪯ B′. Recall that by Definition 8, there are two options for A ⪯ B to hold:

  1. ℓ = 𝖫𝖢𝖯(A, B) < min{|A|, |B|} and A[ℓ+1] < B[ℓ+1], or

  2. A is a prefix of B, i.e., |A| ≤ |B| and A = B[1..|A|].

For any j, let A_j and B_j be the j-th blocks of A and B, respectively.

For the first case, let α = ⌈(ℓ+1)/n^{1/3}⌉. Notice that A[ℓ+1] and B[ℓ+1] are contained in A_α and B_α, respectively. Thus, for all j < α we have A_j = B_j, and for j = α we have A_α ≠ B_α. Moreover, by definition A_α ≺ B_α. Hence, A′[1..α−1] = B′[1..α−1] and A′[α] < B′[α], which means that A′ ≺ B′.

For the second case, where A is a prefix of B, there are three subcases: (1) If A = B, then all the blocks of A get exactly the same ranks as the corresponding blocks of B, and therefore A′ = B′. (2) Otherwise, |A| < |B|. (2a) If |A| is a multiple of n^{1/3}, then all the blocks of A are exactly the same as the corresponding blocks of B, but B has at least one additional block. Therefore, A′ is a prefix of B′, and A′ ≺ B′. (2b) If |A| is not a multiple of n^{1/3}, then the last block of A is shorter than the corresponding block of B. Let α = ⌈|A|/n^{1/3}⌉ be the index of the last block of A; in particular, the block A_α is a proper prefix of the block B_α, and therefore by definition the rank of A_α is smaller than the rank of B_α. Thus, we have A′[1..α−1] = B′[1..α−1] and A′[α] < B′[α], which means that A′ ≺ B′.

By very similar arguments, one can prove that B ≺ A ⟹ B′ ≺ A′. Thus, the claim A ⪯ B ⟺ A′ ⪯ B′ follows.

Algorithm 1 String sorting.

We are now ready to prove Theorem 1 (see also Algorithm 1).

Proof of Theorem 1.

The algorithm repeats the renaming process seven times. Since the longest original string is of length O(n^2), after seven iterations the length of every string is O(n^2/(n^{1/3})^7) = O(n^{−1/3}) ≤ 1. Thus, at this point, each string is one block of length 1 = O(n^{1/3}) characters; in particular, each string is stored in one node. The algorithm uses the algorithm of Theorem 4 one more time (one could also use the sorting algorithm of [31, Corollary 4.6]) to solve the problem. Notice that during the execution of the renaming process, each block is moved completely to the first node that holds characters from this block. In particular, at the final sorting, each string consists of one block that starts at the original node where the complete string starts. Therefore, the ranks found by the sorting are stored in the required nodes.

4 Pattern Matching

In this section we prove Theorem 2 by introducing an O(1)-round Congested Clique algorithm for the pattern matching problem. Recall that the input for this problem is two strings, a pattern P and a text T. Moreover, we assume that |P| + |T| = O(n^2), since for larger inputs one cannot hope for an O(1)-round algorithm, due to communication bottlenecks. The goal is to find all the offsets i such that P occurs in T at offset i. Formally, for every two strings X, Y, let 𝖯𝖬(X,Y) = {i ∣ Y[i+1..i+|X|] = X} be the set of occurrences of X in Y. Our goal is to compute 𝖯𝖬(P,T) in a distributed manner. We introduce an algorithm that solves this problem in O(1) rounds. Our algorithm distinguishes between the cases |P| ≤ n and |P| > n. Therefore, as a first step, each node v broadcasts the number of characters from P stored in v, and then computes |P|. In Section 4.1 we take care of the simple case where |P| ≤ n, and in Section 4.2 we describe an algorithm for the case n < |P| = O(n^2).

4.1 Short Pattern

If |P| ≤ n, then the algorithm first broadcasts P to all the nodes using Lemma 9. In addition, we want every substring of T of length |P| to be stored completely in (at least) one node. To this end, each node announces the number of characters from T it holds. Then, each node whose last input character is the i-th character of T needs to get the values of T[i+1], T[i+2], …, T[i+|P|−1] from the following nodes. So, each node that gets T[j] sends it to all preceding nodes, starting from the node that gets T[j−|P|+1] (or the node that gets T[1] if j−|P|+1 < 1). Notice that all these nodes form a consecutive range, and that each node gets at most |P| ≤ n messages. Thus, this routing can be done in O(1) rounds using Lemma 9. Now, every substring of T of length |P| is stored completely in one of the nodes, and all the nodes have P. Thus, every node locally finds all the occurrences of P in its portion of T. This proves Theorem 2 for the special case where |P| ≤ n (see also Algorithm 2).
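The following sequential Python sketch mirrors this phase (the function name and the representation of the text as a list of per-node fragments are our own; the distributed algorithm routes the overlapping characters with Lemma 9 instead of slicing a shared string):

```python
def short_pattern_occurrences(fragments, P):
    """Each node holds one fragment of T; conceptually it also receives the
    next |P|-1 characters from the succeeding nodes, so every length-|P|
    window of T lies entirely inside some node."""
    text = "".join(fragments)
    occurrences, start = [], 0
    for frag in fragments:
        # The fragment extended by the |P|-1 characters owned by later nodes.
        extended = text[start:start + len(frag) + len(P) - 1]
        i = extended.find(P)
        while i != -1:
            if i < len(frag):                 # windows starting here belong to this node
                occurrences.append(start + i) # 0-based offset in T
            i = extended.find(P, i + 1)
        start += len(frag)
    return occurrences
```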

Algorithm 2 Pattern matching with short pattern.
Algorithm 3 Pattern matching with |P|>n.

4.2 Long Pattern

From now on we consider the case where |P| > n. To conclude that an offset i of the text is an occurrence of P, i.e., that T[i+1..i+|P|] = P, we have to get, for every 1 ≤ j ≤ |P|, evidence that T[i+j] = P[j]. We use two types of such evidence. The first type comes from finding all the occurrences of the n-length prefix and suffix of P in P and in T. The second type of evidence is based on the string sorting algorithm of Theorem 1, which is used to sort all the blocks between the occurrences found in the first step, both in P and in T. At every occurrence of P in T, all the occurrences of the prefix and suffix match, and the blocks between these occurrences are also the same in the pattern and the text (see also Algorithm 3).

First step - searching for the prefix and suffix.

Let B = P[1..n] and E = P[|P|−n+1..|P|] be the prefix and the suffix of P of length n, respectively. We use Algorithm 2 four times, to find all occurrences of B and E in P and in T (in every execution, the algorithm ignores the parts of the original input that are not relevant for this execution). By the following lemma, all the locations of B and E found by one node can be stored in O(1) words of space (which are O(log n) bits). The lemma is derived from Lemma 6 using the so-called standard trick.

Lemma 12.

Let X and Y be two strings such that |Y| ≥ |X| and |Y| = O(|X|). Then 𝖯𝖬(X,Y) can be stored in O(1) words of space.

Proof.

We divide Y into parts of length 2|X|−1 characters, with overlaps, such that each substring of length |X| is contained in one part. For any i = 0, 1, …, ⌈|Y|/|X|⌉−1, let Y_i = Y[i·|X|+1..min{(i+2)·|X|−1, |Y|}] be the i-th part of Y, and let L_i = 𝖯𝖬(X, Y_i) + i·|X| be the set of all the occurrences of X in Y_i (as offsets in Y). Notice that every occurrence t of X in Y is an occurrence of X in one part of Y, specifically in Y_{⌊t/|X|⌋}. Thus, to represent 𝖯𝖬(X,Y) it is enough to store all the occurrences of X in every part Y_i. Since we consider just ⌈|Y|/|X|⌉ = O(1) parts of Y, it is enough to show that each L_i can be stored in O(1) words of space. If |L_i| ≤ 2 then of course it can be stored in O(1) words of space. Otherwise, |L_i| ≥ 3, and notice that for every two occurrences in L_i their distance is at most |Y_i| − |X| + 1 = |X|. Thus, by Lemma 6, L_i forms an arithmetic progression, and therefore can be represented in O(1) words of space by storing only the first and last elements and the common difference.
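A small Python sketch of this encoding (the function names are ours): a part's occurrence list is stored either explicitly or as the arithmetic progression guaranteed by Lemma 6.

```python
def compress_occurrences(L):
    """Encode the sorted occurrence list L of one part Y_i in O(1) words."""
    if len(L) <= 2:
        return ("explicit", tuple(L))
    d = L[1] - L[0]
    # Lemma 6 guarantees that three or more nearby occurrences are equally spaced.
    assert all(L[j + 1] - L[j] == d for j in range(len(L) - 1))
    return ("arithmetic", (L[0], d, len(L)))

def decompress_occurrences(rep):
    kind, data = rep
    if kind == "explicit":
        return list(data)
    first, d, count = data
    return [first + j * d for j in range(count)]
```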

Hence, every node broadcasts to the whole network all the locations of B and E in the node's fragments of P and T. Combining all the broadcast information, each node obtains 𝖯𝖬(B,P), 𝖯𝖬(B,T), 𝖯𝖬(E,P) and 𝖯𝖬(E,T).

Second step - completing the gaps.

In the second step, we want to find evidence of equality for all the locations in P and T that do not belong to any occurrence of B or E. Notice that |B| = |E| = n, and therefore every occurrence of B or E that starts at offset i covers the entire range [i+1..i+n]. We use all the occurrences of B and E found in the first step to focus on the remaining regions, which are not covered yet. Formally, we define the sets of uncovered locations in P and T as follows. For a string X, let

R_X = [1..|X|] ∖ ⋃_{i ∈ 𝖯𝖬(B,X) ∪ 𝖯𝖬(E,X)} [i+1..i+n].

The algorithm uses R_P and R_T to define a (multi)set of strings that contains all the maximal regions of P and T that were not covered in the first step (see Figure 2):

𝒮_P = {P[i..j] ∣ [i..j] ⊆ R_P and i−1 ∉ R_P and j+1 ∉ R_P} (1)
𝒮_T = {T[i..j] ∣ [i..j] ⊆ R_T and i−1 ∉ R_T and j+1 ∉ R_T} (2)
𝒮 = 𝒮_P ∪ 𝒮_T (3)
Figure 2: The yellow and red regions represent occurrences of B and E, respectively. The blue regions of the pattern and the text represent R_P and R_T, respectively. Notice that 𝒮_P = {S_1, S_2, S_3} and 𝒮_T = {S_4, S_5, S_6, S_7}. Given that the pattern is aligned to the i-th offset, i ∈ 𝖯𝖬(P,T) if and only if S_1 = S_5, S_2 = S_6 and S_3 = S_7.
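The computation of the uncovered set R_X and of its maximal regions can be sketched sequentially as follows (our own illustration; the parameter m plays the role of n, the length of B and E, and positions are 1-based as in the text):

```python
def uncovered_regions(length, occ_B, occ_E, m):
    """Return the maximal ranges [a..b] of R_X for a string X of the given
    length, given the occurrence offsets of B and E in X."""
    covered = [False] * (length + 2)            # 1-based; index 0 unused
    for i in set(occ_B) | set(occ_E):           # an occurrence at offset i covers [i+1..i+m]
        for j in range(i + 1, min(i + m, length) + 1):
            covered[j] = True
    regions, start = [], None
    for j in range(1, length + 2):              # sentinel position length+1 closes a region
        if j <= length and not covered[j]:
            if start is None:
                start = j
        elif start is not None:
            regions.append((start, j - 1))      # a maximal range [start..j-1] of R_X
            start = None
    return regions
```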

Notice that the total size of all the strings in 𝒮 is O(n^2), since |P| + |T| = O(n^2). The algorithm uses the string sorting algorithm of Theorem 1 to sort all the strings in 𝒮. As a result, every string is stored with its rank, which is the same for two identical strings. A useful property of 𝒮 is that it contains O(n) strings, as we prove in the next lemma.

Lemma 13.

|𝒮|=O(n).

Proof.

We start by bounding |𝒮_T|. By the definition of 𝒮_T, for T[i..j] ∈ 𝒮_T we have i−1 ∉ R_T. Moreover, by the definition of R_T, it must be the case that [i−n..i−1] ∩ R_T = ∅. Thus, one can associate with every string in 𝒮_T (except for the first string in 𝒮_T, if it is a prefix of T) a set of n unique locations from [1..|T|] that are not in R_T. Therefore, n·(|𝒮_T|−1) ≤ |[1..|T|]| = O(n^2), and so |𝒮_T| = O(n). Applying the same argument to 𝒮_P gives |𝒮_P| = O(n), and |𝒮| = O(n) + O(n) = O(n).

After sorting the strings of 𝒮, each node v broadcasts the ranks of the strings starting at v. Since there are just O(n) strings in 𝒮, this can be done in O(1) rounds using Lemma 9. Therefore, at the end of the second phase, every node has the ranks of all the strings in 𝒮. Recall that at the end of the first phase each node stores 𝖯𝖬(B,P), 𝖯𝖬(B,T), 𝖯𝖬(E,P) and 𝖯𝖬(E,T). Hence, for every offset i, to check whether T[i+1..i+|P|] = P, each node first verifies that all the occurrences of B and E in P match the corresponding occurrences of B and E in T[i+1..i+|P|]. If this is the case, then all the maximal regions of R_P must be at positions corresponding to the maximal regions of R_T within [i+1..i+|P|]. Thus, the node compares the rank of every string in 𝒮_P with the rank of the corresponding (shifted by i) string of 𝒮_T and checks whether they are equal (see also Figure 2 and Algorithm 3).
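Concretely, the test performed for a single offset i can be sketched as follows (a sequential illustration under our own naming: m is the length of B and E, and rank_P, rank_T map each maximal region [a..b] to the rank of its string in the sorted multiset 𝒮); the conditions checked are exactly those of Lemma 14 below:

```python
def is_occurrence(i, p_len, m, occB_P, occB_T, occE_P, occE_T, rank_P, rank_T):
    valid = set(range(0, p_len - m + 1))                  # the range [0..|P|-m]
    if set(occB_P) != {t - i for t in occB_T} & valid:    # condition 1 of Lemma 14
        return False
    if set(occE_P) != {t - i for t in occE_T} & valid:    # condition 2 of Lemma 14
        return False
    # Condition 3, tested via ranks (justified by Lemma 15): every maximal
    # uncovered region of P must carry the same rank as its shifted counterpart.
    return all(rank_T.get((i + a, i + b)) == r for (a, b), r in rank_P.items())
```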

The following two lemmas give the mathematical justification for the last part of the algorithm. Lemma 14 states that the test made by the algorithm suffices to decide whether i ∈ 𝖯𝖬(P,T). Lemma 15 shows that for any i where the first two conditions of Lemma 14 hold, the test of the third condition can be performed by comparing the ranks of strings from 𝒮, exactly as the algorithm does.

Lemma 14.

T[i+1..i+|P|] = P if and only if all of the following hold:

  1. 𝖯𝖬(B,P) = (𝖯𝖬(B,T) − i) ∩ [0..|P|−n]

  2. 𝖯𝖬(E,P) = (𝖯𝖬(E,T) − i) ∩ [0..|P|−n]

  3. For every maximal range [a..b] ⊆ R_P, we have P[a..b] = T[i+a..i+b].

Proof.

The first direction is simple. Assume T[i+1..i+|P|] = P. Then for every 0 ≤ j ≤ |P|−n, we have j ∈ 𝖯𝖬(B,P) if and only if B = P[j+1..j+n] = T[i+j+1..i+j+n], if and only if i+j ∈ 𝖯𝖬(B,T), i.e., j ∈ 𝖯𝖬(B,T) − i. A similar argument works for the second property. Lastly, P[a..b] = T[i+a..i+b] is a direct consequence of the fact that T[i+1..i+|P|] = P.

For the other direction, let j ∈ [1..|P|]; our goal is to prove that T[i+j] = P[j]. If [j−n..j−1] ∩ 𝖯𝖬(B,P) ≠ ∅, then there exists some t ∈ [j−n..j−1] such that t ∈ 𝖯𝖬(B,P). By the first property we have t+i ∈ 𝖯𝖬(B,T). Thus, P[t+1..t+n] = T[i+t+1..i+t+n], and in particular P[j] = P[t+(j−t)] = T[t+i+(j−t)] = T[i+j]. The case where [j−n..j−1] ∩ 𝖯𝖬(E,P) ≠ ∅ is symmetric. The last case we have to consider is when [j−n..j−1] ∩ 𝖯𝖬(B,P) = ∅ and [j−n..j−1] ∩ 𝖯𝖬(E,P) = ∅. In this case, let P[a..b] be the maximal region of R_P that contains j (such a region must exist). By the third property we have P[a..b] = T[i+a..i+b], and in particular P[j] = P[a+(j−a)] = T[i+a+(j−a)] = T[i+j], as required.

Lemma 15.

Let i ∈ [0..|T|−|P|] be such that 𝖯𝖬(B,P) = (𝖯𝖬(B,T) − i) ∩ [0..|P|−n] and 𝖯𝖬(E,P) = (𝖯𝖬(E,T) − i) ∩ [0..|P|−n]. Then, for every maximal range [a..b] ⊆ R_P, we have P[a..b] ∈ 𝒮_P and T[i+a..i+b] ∈ 𝒮_T.

Proof.

First, by the definition of 𝒮_P we have P[a..b] ∈ 𝒮_P. Similarly, to prove that T[i+a..i+b] ∈ 𝒮_T it is sufficient to prove that [i+a..i+b] is a maximal range in R_T, i.e., that (1) [i+a..i+b] ⊆ R_T, (2) i+a−1 ∉ R_T, and (3) i+b+1 ∉ R_T.

For (1), let j ∈ [i+a..i+b], and assume by way of contradiction that j ∉ R_T. Then, by definition, there exists some t ∈ [j−n..j−1] such that t ∈ 𝖯𝖬(B,T) ∪ 𝖯𝖬(E,T). But then, it must be the case that for t′ = t−i we have t′ ∈ 𝖯𝖬(B,P) ∪ 𝖯𝖬(E,P). Hence, j−i = t′+(j−t) ∉ R_P. But j−i ∈ [a..b], and therefore [a..b] ⊈ R_P, a contradiction. Therefore, [i+a..i+b] ⊆ R_T.

For (2), since [a..b] is a maximal range in R_P, we have a−1 ∉ R_P. Moreover, since 0 ∈ 𝖯𝖬(B,P) it must be that a > n, and therefore by the definition of R_P we have a−1−n ∈ 𝖯𝖬(B,P) ∪ 𝖯𝖬(E,P). Hence, a−1−n+i ∈ 𝖯𝖬(B,T) ∪ 𝖯𝖬(E,T), and therefore (a−1−n+i)+n = i+a−1 ∉ R_T.

Similarly, for (3), since [a..b] is a maximal range in R_P, we have b+1 ∉ R_P. Moreover, since |P|−n ∈ 𝖯𝖬(E,P) it must be that b ≤ |P|−n, and therefore by the definition of R_P we have b ∈ 𝖯𝖬(B,P) ∪ 𝖯𝖬(E,P). Hence, i+b ∈ 𝖯𝖬(B,T) ∪ 𝖯𝖬(E,T), and therefore i+b+1 ∉ R_T. Thus, we proved that [i+a..i+b] is a maximal range in R_T, and therefore T[i+a..i+b] ∈ 𝒮_T.

5 Suffix Array Construction and the Corresponding LCP Array

In this section, we prove Theorem 3 by introducing an algorithm that computes 𝖲𝖠_S, the suffix array of a given string S of length O(n^2), in O(log log n) rounds of the Congested Clique model. Moreover, we show how to compute the complementary 𝖫𝖢𝖯_S array within the same asymptotic running time. We first give the formal definitions of 𝖲𝖠_S and 𝖫𝖢𝖯_S:

Definition 16.

For a string S, the suffix array, denoted by 𝖲𝖠_S, is the sorted array of the suffixes of S, i.e., S[𝖲𝖠_S[i]..] ≺ S[𝖲𝖠_S[i+1]..] for all 1 ≤ i < |S|. The corresponding 𝖫𝖢𝖯 array, 𝖫𝖢𝖯_S, stores the 𝖫𝖢𝖯 of every two suffixes that are consecutive in the lexicographic order. Formally, 𝖫𝖢𝖯_S[i] = 𝖫𝖢𝖯(S[𝖲𝖠_S[i]..], S[𝖲𝖠_S[i+1]..]).

Our algorithm follows the recursive process described by Pace and Tiskin [35], which is a speedup of the algorithm of Kärkkäinen et al. [26] for parallel models. The main idea of the recursion is that at every level, the algorithm creates a smaller string that represents a subset of the original string positions, solves the problem recursively, and uses the results for the subset to compute the complete result. As the depth of the recursion increases, the ratio between the length of the current string and the length of the new string increases as well: at the beginning the ratio is constant, and within O(log log n) levels it becomes polynomial in n. The main difference between our algorithm and [35, 26] is that our algorithm is simpler, due to the powerful sorting algorithm provided by Theorem 4. Moreover, we also show how to compute 𝖫𝖢𝖯_S in addition to 𝖲𝖠_S. We first ignore the 𝖫𝖢𝖯_S computation, and then in Section 5.1 we give the details needed for computing 𝖫𝖢𝖯_S.

Our algorithm uses the notion of a difference cover [14], and of a difference cover sample as defined by Kärkkäinen et al. [26]. For a parameter t, a difference cover DC_t ⊆ [0..t−1] is a set of integers such that for any pair i, j, the set DC_t contains i′, j′ ∈ DC_t with j′ − i′ ≡ j − i (mod t). For every t there exists a difference cover DC_t of size Θ(√t), which can be computed in O(√t) time (see Colbourn and Ling [14]). For a string S, [26] defined the difference cover sample DC_t(S) = {i ∣ i ∈ [1..|S|] and (i mod t) ∈ DC_t} as the set of all indices of S which are in DC_t modulo t. The following lemma was proved in [38].

Lemma 17 ([38, Lemma 2]).

For a string S and an integer t ≤ |S|, there exists a difference cover DC_t such that |DC_t(S)| = O(|S|/√t), and for any pair of positions i, j ∈ [1..|S|] there is an integer k ∈ [0..t−1] such that (i+k) and (j+k) are both in DC_t(S).
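For intuition, here is a brute-force Python sketch of these definitions (our own illustration; Colbourn and Ling [14] construct a Θ(√t)-size cover in O(√t) time, which we do not reproduce, and the search below is exponential and only meant for tiny t):

```python
from itertools import combinations

def is_difference_cover(D, t):
    # Every residue modulo t must be a difference of two elements of D.
    return {(i - j) % t for i in D for j in D} == set(range(t))

def smallest_difference_cover(t):
    # Exponential brute force; only meant for tiny t.
    for size in range(1, t + 1):
        for D in combinations(range(t), size):
            if is_difference_cover(D, t):
                return set(D)

def dc_sample(length, D, t):
    # DC_t(S): the positions of S (1-based) whose residue modulo t lies in D.
    return [i for i in range(1, length + 1) if i % t in D]

# Example: {1, 2, 4} is a difference cover modulo 5.
assert is_difference_cover({1, 2, 4}, 5)
```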

At every level of the recursion, let ε > 0 be a number such that the length of the string S satisfies |S| ≤ O(n^{2−ε}). Later we describe how to choose ε exactly (see the time complexity analysis). At the first level, ε = O(1/log n) satisfies this requirement. Let DC_t ⊆ [0..t−1] be some fixed difference cover with t = min{n^ε, n^{1/3}}, of size |DC_t| = O(√t).

For every i ∈ [1..|S|], let S_i = S[i..i+t−1] be the substring of S of length t starting at position i (we assume that S[j] is some dummy character for every j > |S|). As a first step, the algorithm sorts all the strings in 𝒮 = {S_i ∣ i ∈ DC_t(S)}. Notice that the total length of all these strings is |DC_t(S)|·t = O(|S|/√t)·t = O(|S|·√t) ≤ O(n^{2−ε}·n^{ε/2}) = O(n^{2−ε/2}) = O(n^2), and therefore the algorithm can sort all the strings in O(1) rounds using Theorem 4. Recall that as a result of running the sorting algorithm, the node that stores index i of S now has the rank of S_i, 𝗋𝖺𝗇𝗄_𝒮(S_i), among all the strings of 𝒮 (copies of the same string have the same rank). The algorithm uses the ranks of the strings to create a new string of length |DC_t(S)|. For every i ∈ DC_t, let S′(i) = 𝗋𝖺𝗇𝗄_𝒮(S_i) 𝗋𝖺𝗇𝗄_𝒮(S_{i+t}) 𝗋𝖺𝗇𝗄_𝒮(S_{i+2t}) 𝗋𝖺𝗇𝗄_𝒮(S_{i+3t}) ⋯ (if 0 ∈ DC_t then S′(0) starts from 𝗋𝖺𝗇𝗄_𝒮(S_t), since S_0 does not exist). Moreover, let S′ be the concatenation of all the strings S′(i) for i ∈ DC_t (in some arbitrary order), with a special character $ as a delimiter between the strings S′(i). The algorithm runs recursively on S′. The result of the recursive call is the suffix array of S′, 𝖲𝖠_{S′} (which stores the order of the suffixes of S′). Note that every index i ∈ DC_t(S) has a corresponding index in S′, namely the position where 𝗋𝖺𝗇𝗄_𝒮(S_i) appears; we denote this position by f(i). For every index i, the algorithm sends the rank of f(i) (which is the index of 𝖲𝖠_{S′} where f(i) appears) to the node that stores index i of S.
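A sequential sketch of one level of this reduction (a simulation under our own assumptions: 1-based positions, a SEP value playing the role of the delimiter $, and a padding sentinel '\0' assumed smaller than every real character):

```python
SEP = -1  # plays the role of the '$' delimiter between the strings S'(i)

def reduce_level(S, D, t):
    """Rank the length-t substrings starting at sampled positions, then
    concatenate the rank sequences S'(i), i in D. Returns S' and the map f
    from sampled positions of S to their (1-based) positions in S'."""
    n = len(S)
    sample = [i for i in range(1, n + 1) if i % t in D]
    # Substring S_i = S[i..i+t-1], padded past |S| with the sentinel.
    sub = {i: tuple(S[i - 1:i - 1 + t]) + ('\0',) * max(0, i - 1 + t - n)
           for i in sample}
    rank = {s: r for r, s in enumerate(sorted(set(sub.values())))}
    S_prime, f = [], {}
    for res in sorted(D):
        start = res if res != 0 else t      # S'(0) starts from rank(S_t)
        for i in range(start, n + 1, t):
            f[i] = len(S_prime) + 1         # position of rank(S_i) in S'
            S_prime.append(rank[sub[i]])
        S_prime.append(SEP)
    return S_prime, f
```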

Due to the following claim, which we prove formally later as part of Lemma 20, the order of the suffixes of S starting in DC_t(S) is the same as the order of the corresponding suffixes of S′.

Claim 18.

For a, b ∈ DC_t(S), we have S[a..] ≺ S[b..] if and only if S′[f(a)..] ≺ S′[f(b)..].

Due to Claim 18, 𝖲𝖠_{S′} represents the order of the subset of suffixes of S starting in DC_t(S). To extend the result to the complete order of all the suffixes of S (hence computing 𝖲𝖠_S), the algorithm creates, for every index of S, a representative object of size O(t) words of space. These objects have the property that by comparing two objects one can determine the order of the corresponding suffixes.

The representative objects.

For every index i, the object representing the suffix S[i..] is composed of two parts. The first part is S_i, the substring of S of length t starting at position i. The second part consists of the ranks (in the lexicographic order) of all the suffixes starting at positions of DC_t(S) ∩ [i..i+t−1], among all the suffixes starting in DC_t(S), obtained using 𝖲𝖠_{S′}^{−1}. This information is stored as an array A_i of length t, as follows. For every j ∈ [0..t−1], if i+j ∈ DC_t(S) we set A_i[j] = 𝖲𝖠_{S′}^{−1}[f(i+j)], which is the rank of position i+j (as computed by the recursive call), and A_i[j] = −1 otherwise. The first part is used to determine the order of two suffixes whose 𝖫𝖢𝖯 is less than t, and the second part is used to determine the order of two suffixes whose 𝖫𝖢𝖯 is at least t.

The comparison of the objects representing positions a and b is done as follows. The algorithm first compares S_a and S_b. If S_a ≠ S_b, then the order of S[a..] and S[b..] is determined by the order of S_a and S_b. Otherwise, S_a = S_b, and the algorithm uses the second part of the objects. By Lemma 17, there exists some k ∈ [0..t−1] such that a+k and b+k are both in DC_t(S); in particular, A_a[k] and A_b[k] both hold actual ranks of suffixes of S (and not −1s). The algorithm uses A_a[k] and A_b[k] to determine the order of S[a..] and S[b..]. By the following lemma, the order of A_a[k] and A_b[k] is exactly the order of the corresponding suffixes starting at positions a and b.
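The comparison just described can be sketched as follows (our own naming; −1 marks unsampled positions, as in the definition of A_i):

```python
def compare_objects(Sa, Aa, Sb, Ab, t):
    # First part: if the length-t substrings differ, they decide the order.
    if Sa != Sb:
        return -1 if Sa < Sb else 1
    # Second part: the LCP of the two suffixes is at least t, so defer to the
    # first position sampled in both objects (it exists by Lemma 17).
    for k in range(t):
        if Aa[k] != -1 and Ab[k] != -1:
            return Aa[k] - Ab[k]
    raise AssertionError("Lemma 17 guarantees a common sampled position")
```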

Lemma 19.

Let a, b ∈ [1..|S|] be such that S[a..] ≺ S[b..] and S_a = S_b. Then, for every k, if A_a[k] ≠ −1 and A_b[k] ≠ −1, then A_a[k] < A_b[k]. Moreover, there exists some k ∈ [0..t−1] such that A_a[k] ≠ −1 and A_b[k] ≠ −1.

Proof.

Let ℓ = 𝖫𝖢𝖯(S[a..], S[b..]). By definition, S[a..a+ℓ−1] = S[b..b+ℓ−1] and S[a+ℓ] < S[b+ℓ]. Notice that ℓ ≥ t. Thus, for every 0 ≤ k ≤ t−1, we have S[a+k..a+k+(ℓ−k−1)] = S[b+k..b+k+(ℓ−k−1)] and S[a+k+(ℓ−k)] < S[b+k+(ℓ−k)]. Therefore, S[a+k..] ≺ S[b+k..]. For k ∈ [0..t−1] such that A_a[k] ≠ −1 and A_b[k] ≠ −1, it must be the case that a+k, b+k ∈ DC_t(S). Thus, by Claim 18 we have S′[f(a+k)..] ≺ S′[f(b+k)..], and therefore A_a[k] < A_b[k].

By Lemma 17, there exists some k ∈ [0..t−1] such that a+k and b+k are both in DC_t(S). For this value of k, it is guaranteed that A_a[k] ≠ −1 and A_b[k] ≠ −1.

For every index i ∈ [1..|S|], the algorithm creates an object of size O(t) words of space. Thus, the total size of all the objects is O(t·|S|) ≤ O(n^ε·n^{2−ε}) = O(n^2). Moreover, by definition t ≤ n^{1/3}. Therefore, the algorithm sorts all the objects in O(1) rounds using the algorithm of Theorem 4. By Lemma 19, the result of the object sorting algorithm is a sorting of all the suffixes of S.

Time complexity.

Recall that for |S| = O(n^{2−ε}) we defined t = min{n^ε, n^{1/3}}, and that the length of S′ is |S′| = |DC_t(S)| = O(|S|/√t). Let c be a constant such that |S′| ≤ c·|S|/√t (for any n > n_0, for some n_0). Denote by S_k, S′_k, ε_k and t_k the values of S, S′, ε and t at the k-th level of the recursion, respectively. Our goal is to guarantee an exponential growth of the value of ε from level to level. For the first level, we have |S′_1| ≤ c·|S_1|/√t_1 = c·|S_1|/n^{0.5ε_1} = c·n^{−0.1ε_1}·|S_1|·n^{−0.4ε_1} ≤ c·n^{−0.1ε_1}·O(n^{2−1.4ε_1}). We set ε_1 = 10 log c / log n, so that c·n^{−0.1ε_1} = 1, and thus |S′_1| ≤ O(n^{2−1.4ε_1}). Then, as long as ε_i < 1/3, we define ε_{i+1} = 1.4·ε_i, and a similar analysis gives |S′_i| ≤ O(n^{2−1.4ε_i}) = O(n^{2−ε_{i+1}}). By a straightforward induction we get |S′_i| ≤ O(n^{2−1.4^i·ε_1}). From the moment that ε_i ≥ 1/3 (i.e., |S_i| = O(n^{5/3})), we have t_i = n^{1/3}. Thus, |S′_i| = O(|S_i|/√t_i) = O(|S_i|/n^{1/6}), so |S_{i+1}| = O(n^{3/2}), and within four more levels we reach |S_i| = O(n). At this point the algorithm solves the problem in O(1) rounds in one node. Therefore, the algorithm performs O(log log n) + 4 = O(log log n) levels of recursion. Each level of recursion takes O(1) rounds, and therefore the total running time of the algorithm is O(log log n).
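Spelled out, the bound on the recursion depth is the following short calculation (with ε_1 = 10 log c / log n as above):

ε_i = 1.4^{i−1}·ε_1 ≥ 1/3  ⟺  i ≥ 1 + log_{1.4}(1/(3ε_1)) = 1 + log_{1.4}(log n / (30 log c)) = O(log log n),

so after O(log log n) levels the exponent ε_i reaches 1/3, and only four more levels are needed afterwards.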

 Remark.

As described in Section 1, our algorithm can be translated into an O(log log n)-round algorithm in the MPC model. However, if one considers a variant of the MPC model where the product of the number of machines and the memory size is polynomially larger than n, i.e., if M·S = n^{1+α} for some constant α > 0, then there is a faster algorithm. In particular, one can use t = n^α in all levels and get an O(1/α) = O(1)-round algorithm.

5.1 Computing the Corresponding LCP Array

The computation of 𝖫𝖢𝖯_S is done during the computation of 𝖲𝖠_S, by several additional operations at some steps of the computation. The recursive process has exactly the same structure, but now every level of the recursion can use both 𝖲𝖠_{S′} and 𝖫𝖢𝖯_{S′}, and has to compute 𝖫𝖢𝖯_S in addition to 𝖲𝖠_S.

Recall that in the 𝖲𝖠 construction algorithm of the previous section, the order of two suffixes of S that start in DC_t(S) is exactly the same as the order of the corresponding suffixes of S′, which is represented in 𝖲𝖠_{S′}. The situation for 𝖫𝖢𝖯_S is slightly more complicated. In the following lemma we prove that for two positions a, b ∈ DC_t(S), we have 𝖫𝖢𝖯(S′[f(a)..], S′[f(b)..]) = ⌊𝖫𝖢𝖯(S[a..], S[b..])/t⌋, which means that 𝖫𝖢𝖯(S[a..], S[b..]) ∈ t·𝖫𝖢𝖯(S′[f(a)..], S′[f(b)..]) + [0..t−1].

Lemma 20.

For a, b ∈ DC_t(S), we have 𝖫𝖢𝖯(S′[f(a)..], S′[f(b)..]) = ⌊𝖫𝖢𝖯(S[a..], S[b..])/t⌋, and if S[a..] ≺ S[b..] then S′[f(a)..] ≺ S′[f(b)..].

Proof.

Let ℓ = 𝖫𝖢𝖯(S[a..], S[b..]). By definition, S[a..a+ℓ−1] = S[b..b+ℓ−1] and S[a+ℓ] ≠ S[b+ℓ]. Our goal is to prove that for any 0 ≤ i < ⌊ℓ/t⌋ we have S′[f(a)+i] = S′[f(b)+i], and that S′[f(a)+⌊ℓ/t⌋] ≠ S′[f(b)+⌊ℓ/t⌋].

Recall that by the definition of S′, S′[f(a)+i] is exactly 𝗋𝖺𝗇𝗄_𝒮(S_{a+t·i}) (as long as i is small enough). Similarly, S′[f(b)+i] = 𝗋𝖺𝗇𝗄_𝒮(S_{b+t·i}). Thus, for any 0 ≤ i < ⌊ℓ/t⌋ we have

S′[f(a)+i] = 𝗋𝖺𝗇𝗄_𝒮(S_{a+t·i}) = 𝗋𝖺𝗇𝗄_𝒮(S[a+t·i..a+t·i+(t−1)]) (4)
           = 𝗋𝖺𝗇𝗄_𝒮(S[b+t·i..b+t·i+(t−1)]) = 𝗋𝖺𝗇𝗄_𝒮(S_{b+t·i}) = S′[f(b)+i].

By a similar analysis, one gets S′[f(a)+⌊ℓ/t⌋] ≠ S′[f(b)+⌊ℓ/t⌋].

The assumption S[a..] ≺ S[b..] means that S[a+ℓ] < S[b+ℓ]. Let k = t·⌊ℓ/t⌋; we focus on S_{a+k} and S_{b+k}. Let r = ℓ − k, and notice that r < t (indeed, r = ℓ mod t). We show that S_{a+k} ≺ S_{b+k}. For any i < r, since k+i < k+r = ℓ, we have S_{a+k}[i] = S[a+k+i] = S[b+k+i] = S_{b+k}[i]. Since k+r = ℓ, we have S_{a+k}[r] = S[a+k+r] = S[a+ℓ] < S[b+ℓ] = S[b+k+r] = S_{b+k}[r]. Therefore, S_{a+k} ≺ S_{b+k}, and S′[f(a)+⌊ℓ/t⌋] = 𝗋𝖺𝗇𝗄_𝒮(S_{a+k}) < 𝗋𝖺𝗇𝗄_𝒮(S_{b+k}) = S′[f(b)+⌊ℓ/t⌋]. Combining this with Equation (4), we have S′[f(a)..] ≺ S′[f(b)..].

First step - compute exact 𝖫𝖢𝖯 values for DC_t(S).

Let i_1, i_2, …, i_{|DC_t(S)|} be the indices of DC_t(S) such that for any j we have S[i_j..] ≺ S[i_{j+1}..]. Notice that i_j = f^{−1}(𝖲𝖠_{S′}[j]). The first step of the algorithm is to compute the exact value of 𝖫𝖢𝖯(S[i_j..], S[i_{j+1}..]) for every j. For this step, the algorithm uses the inverse suffix array 𝖲𝖠_{S′}^{−1}. Recall that for any i, 𝖲𝖠_{S′}^{−1}[i] is the index j such that 𝖲𝖠_{S′}[j] = i. By a simple routing, in O(1) rounds, the algorithm distributes the 𝖲𝖠_{S′}^{−1} information to the Congested Clique nodes, such that the node v that stores the a-th character of S, for a ∈ DC_t(S), gets 𝖲𝖠_{S′}^{−1}[f(a)]. Moreover, v also gets the value b such that f(b) = 𝖲𝖠_{S′}[𝖲𝖠_{S′}^{−1}[f(a)]+1], which is the index of the lexicographically successive suffix among the DC_t(S) suffixes. In addition, v gets ℓ′ = 𝖫𝖢𝖯_{S′}[𝖲𝖠_{S′}^{−1}[f(a)]], which is exactly ℓ′ = 𝖫𝖢𝖯(S′[f(a)..], S′[f(b)..]). Now, v creates 2t = O(n^ε) queries, to get all the characters of S[a+t·ℓ′..a+t·ℓ′+t−1] and S[b+t·ℓ′..b+t·ℓ′+t−1] (each query is for one character). Notice that every node holds at most O(n^{1−ε}) indices of DC_t(S), and therefore every node has at most O(n^{1−ε}·t) = O(n) queries. Thus, using Lemma 10, all the queries are answered in O(1) rounds. Using the answers, for every index i_j, the exact value of 𝖫𝖢𝖯(S[i_j..], S[i_{j+1}..]) is computed.

Let 𝖫𝖢𝖯̄_S be the array of all the revised 𝖫𝖢𝖯 values of 𝖫𝖢𝖯_{S′}, i.e., the value of 𝖫𝖢𝖯̄_S[i] is the exact 𝖫𝖢𝖯 of the suffixes f^{−1}(𝖲𝖠_{S′}[i]) and f^{−1}(𝖲𝖠_{S′}[i+1]), which are the lexicographically i-th and (i+1)-th suffixes among those starting in DC_t(S). We store 𝖫𝖢𝖯̄_S in a distributed manner in the Congested Clique network.

Second step - compute 𝖫𝖢𝖯_S.

In the second step, the algorithm computes the 𝖫𝖢𝖯 of every suffix of S with its lexicographic successor. This computation is done similarly to the second step of the 𝖲𝖠_S construction algorithm. In every level, the computation of 𝖫𝖢𝖯_S is done after the computation of 𝖲𝖠_S; thus, we can use the order of the suffixes of S as an input for this step. For every index i, let î be the starting index of the suffix succeeding S[i..] in the lexicographic order. The node that stores S[i] gets î in O(1) rounds. The algorithm uses the same representative objects used to compute 𝖲𝖠_S, in order to compute 𝖫𝖢𝖯(S[i..], S[î..]) = 𝖫𝖢𝖯_S[𝖲𝖠_S^{−1}[i]]. Recall that the representative objects of i and î are composed of two parts. The first part contains S_i and S_î, the substrings of S of length t starting at positions i and î. The algorithm first compares S_i and S_î, and if S_i ≠ S_î then 𝖫𝖢𝖯(S[i..], S[î..]) = 𝖫𝖢𝖯(S_i, S_î). If S_i = S_î, the algorithm uses the second part of the representative objects. Recall that in this part, the object of an index a stores an array of length t with the ranks (among the suffixes starting in DC_t(S)) of the suffixes starting in DC_t(S) ∩ [a..a+t−1], and −1s elsewhere. Moreover, by Lemma 17, it is guaranteed that there is some k ∈ [0..t−1] such that A_i[k] ≠ −1 and A_î[k] ≠ −1. Since S_i = S_î and k < |S_i|, we have 𝖫𝖢𝖯(S[i..], S[î..]) = k + 𝖫𝖢𝖯(S[i+k..], S[î+k..]). Thus, we have the ranks of the suffixes i+k and î+k among the suffixes of S starting at positions in DC_t(S) (these ranks are 𝖲𝖠_{S′}^{−1}[f(i+k)] and 𝖲𝖠_{S′}^{−1}[f(î+k)]). Due to the following fact, 𝖫𝖢𝖯(S[i+k..], S[î+k..]) = min{𝖫𝖢𝖯̄_S[j] ∣ 𝖲𝖠_{S′}^{−1}[f(i+k)] ≤ j ≤ 𝖲𝖠_{S′}^{−1}[f(î+k)] − 1}. (This is because 𝖫𝖢𝖯̄_S is the array of the 𝖫𝖢𝖯 values of a monotone sequence of strings, and S[i+k..], S[î+k..] are both elements in the sequence.)

Fact 20.

Let T_1, T_2, …, T_k be a sequence of strings such that T_i ⪯ T_{i+1} and T_i is not a prefix of T_{i+1}, for all i ∈ [1..k−1]. Then, for any a < b, we have 𝖫𝖢𝖯(T_a, T_b) = min{𝖫𝖢𝖯(T_i, T_{i+1}) ∣ i ∈ [a..b−1]}.

Proof.

Let ℓ = min{𝖫𝖢𝖯(T_i, T_{i+1}) ∣ i ∈ [a..b−1]}, and let i* ∈ [a..b−1] be an index with 𝖫𝖢𝖯(T_{i*}, T_{i*+1}) = ℓ. For any 1 ≤ j ≤ ℓ we have T_a[j] = T_{a+1}[j] = T_{a+2}[j] = ⋯ = T_b[j], since 𝖫𝖢𝖯(T_i, T_{i+1}) ≥ ℓ for every i ∈ [a..b−1], by a straightforward induction. On the other hand, T_a[ℓ+1] ≤ T_{a+1}[ℓ+1] ≤ T_{a+2}[ℓ+1] ≤ ⋯ ≤ T_b[ℓ+1], because T_i ⪯ T_{i+1} for all i. Moreover, since T_{i*}[ℓ+1] ≠ T_{i*+1}[ℓ+1], it must be that T_{i*}[ℓ+1] < T_{i*+1}[ℓ+1], and therefore T_a[ℓ+1] < T_b[ℓ+1]. Thus, 𝖫𝖢𝖯(T_a, T_b) = ℓ.

Thus, for every i, if 𝖫𝖢𝖯(S[i..], S[î..]) is not determined by S_i and S_î, the algorithm has to perform one range minimum query (RMQ) on 𝖫𝖢𝖯̄_S. We now describe how to compute all these range minimum queries in O(1) rounds; the following lemma might be of independent interest.

Definition 21.

Given an array A and two indices i, j such that 1 ≤ i ≤ j ≤ |A|, a Range Minimum Query 𝖱𝖬𝖰_A(i,j) returns the minimum value x in the range A[i..j].

Lemma 22.

Let A be an array of O(n^2) numbers (each number of size O(log n) bits), distributed among the n nodes of the Congested Clique model such that each node holds a subarray of length O(n). In addition, every node has O(n) 𝖱𝖬𝖰 queries. Then, there is an algorithm that computes for each node the results of all its 𝖱𝖬𝖰 queries in O(1) rounds.

Proof.

First, each node broadcasts its subarray length, i.e., how many numbers it contains. Second, each node broadcasts the minimum number it holds.

There are two types of 𝖱𝖬𝖰 queries. The first type is where the range of the 𝖱𝖬𝖰 is contained in one specific node, i.e., both i and j of the 𝖱𝖬𝖰 are in the same node. The second type is where i and j are not in the same node. For this case, we separate the 𝖱𝖬𝖰 into three ranges. The first range is from i to i′, where i′ is the index of the last number in the node that contains the i-th number. The third range is from j′ to j, where j′ is the index of the first number in the node that contains the j-th number. The second range is from i′+1 to j′−1, i.e., all the indices in the intermediate nodes (this range might be empty). To calculate 𝖱𝖬𝖰(i,j), it is enough to calculate the 𝖱𝖬𝖰 of the three ranges, since 𝖱𝖬𝖰(i,j) = min(𝖱𝖬𝖰(i,i′), 𝖱𝖬𝖰(i′+1,j′−1), 𝖱𝖬𝖰(j′,j)). It is easy to calculate 𝖱𝖬𝖰(i′+1,j′−1), since in the second step of the algorithm each node broadcasts its minimum number. We are left with two 𝖱𝖬𝖰 queries to two specific resolving nodes.

To conclude, after the second step, the O(n) 𝖱𝖬𝖰 queries of each node can be answered using O(n) 𝖱𝖬𝖰 queries to specific resolving nodes. Both a query to a specific resolving node and its result can be encoded in O(log n) bits. Hence, using Lemma 10, in O(1) rounds all the O(n) 𝖱𝖬𝖰 queries to specific resolving nodes are resolved.
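A sequential sketch of this decomposition (our own naming: bounds[v] is node v's global index range, node_min[v] is its broadcast minimum, and local_rmq(v, i, j) stands for a query resolved by node v):

```python
def rmq(i, j, bounds, node_min, local_rmq):
    vi = next(v for v, (lo, hi) in enumerate(bounds) if lo <= i <= hi)
    vj = next(v for v, (lo, hi) in enumerate(bounds) if lo <= j <= hi)
    if vi == vj:                          # first type: one query to one resolving node
        return local_rmq(vi, i, j)
    parts = [local_rmq(vi, i, bounds[vi][1]),          # RMQ(i, i') at node vi
             local_rmq(vj, bounds[vj][0], j)]          # RMQ(j', j) at node vj
    parts += [node_min[v] for v in range(vi + 1, vj)]  # RMQ(i'+1, j'-1): broadcast minima
    return min(parts)
```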

Complexity.

The overhead of computing 𝖫𝖢𝖯_S from 𝖫𝖢𝖯_{S′} is just a constant number of rounds per level of the recursion. So, in total, the computation of 𝖲𝖠_S and 𝖫𝖢𝖯_S takes O(log log n) rounds.

References

  • [1] Miklós Ajtai, János Komlós, and Endre Szemerédi. An O(n log n) sorting network. In Proceedings of the 15th Annual ACM Symposium on Theory of Computing, pages 1–9. ACM, 1983. doi:10.1145/800061.808726.
  • [2] Alexandr Andoni, Aleksandar Nikolov, Krzysztof Onak, and Grigory Yaroslavtsev. Parallel algorithms for geometric graph problems. In David B. Shmoys, editor, Symposium on Theory of Computing, STOC 2014, pages 574–583. ACM, 2014. doi:10.1145/2591796.2591805.
  • [3] Benny Applebaum, Dariusz R. Kowalski, Boaz Patt-Shamir, and Adi Rosén. Clique here: On the distributed complexity in fully-connected networks. Parallel Process. Lett., 26(1):1650004:1–1650004:12, 2016. doi:10.1142/S0129626416500043.
  • [4] Lars Arge, Paolo Ferragina, Roberto Grossi, and Jeffrey Scott Vitter. On sorting strings in external memory (extended abstract). In Frank Thomson Leighton and Peter W. Shor, editors, Proceedings of the Twenty-Ninth Annual ACM Symposium on the Theory of Computing, pages 540–548. ACM, 1997. doi:10.1145/258533.258647.
  • [5] Paul Beame, Paraschos Koutris, and Dan Suciu. Communication steps for parallel query processing. In Richard Hull and Wenfei Fan, editors, Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2013, pages 273–284. ACM, 2013. doi:10.1145/2463664.2465224.
  • [6] Soheil Behnezhad, Mahsa Derakhshan, and MohammadTaghi Hajiaghayi. Brief announcement: Semi-mapreduce meets congested clique. CoRR, abs/1802.10297, 2018. doi:10.48550/arXiv.1802.10297.
  • [7] Jon Louis Bentley and Robert Sedgewick. Fast algorithms for sorting and searching strings. In Michael E. Saks, editor, Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 360–369. ACM/SIAM, 1997. URL: http://dl.acm.org/citation.cfm?id=314161.314321.
  • [8] Timo Bingmann. Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools. PhD thesis, Karlsruhe Institute of Technology, Germany, 2018. URL: https://publikationen.bibliothek.kit.edu/1000085031.
  • [9] Robert S. Boyer and J. Strother Moore. A fast string searching algorithm. Commun. ACM, 20(10):762–772, 1977. doi:10.1145/359842.359859.
  • [10] Dany Breslauer and Zvi Galil. An optimal O(log log n) time parallel string matching algorithm. SIAM Journal on Computing, 19(6):1051–1058, 1990. doi:10.1137/0219072.
  • [11] Dany Breslauer and Zvi Galil. Real-time streaming string-matching. ACM Transactions on Algorithms, 10(4):22:1–22:12, 2014. doi:10.1145/2635814.
  • [12] Keren Censor-Hillel, Michal Dory, Janne H. Korhonen, and Dean Leitersdorf. Fast approximate shortest paths in the congested clique. Distributed Comput., 34(6):463–487, 2021. doi:10.1007/s00446-020-00380-5.
  • [13] Keren Censor-Hillel, Orr Fischer, Tzlil Gonen, François Le Gall, Dean Leitersdorf, and Rotem Oshman. Fast distributed algorithms for girth, cycles and small subgraphs. In Hagit Attiya, editor, 34th International Symposium on Distributed Computing, DISC 2020, October 12-16, 2020, Virtual Conference, volume 179 of LIPIcs, pages 33:1–33:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.DISC.2020.33.
  • [14] Charles J Colbourn and Alan CH Ling. Quorums from difference covers. Information Processing Letters, 75(1-2):9–12, 2000. doi:10.1016/S0020-0190(00)00080-6.
  • [15] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. In Eric A. Brewer and Peter Chen, editors, 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137–150. USENIX Association, 2004. URL: http://www.usenix.org/events/osdi04/tech/dean.html.
  • [16] Lester R Ford Jr and Selmer M Johnson. A tournament problem. The American Mathematical Monthly, 66(5):387–389, 1959.
  • [17] Zvi Galil. Optimal parallel algorithms for string matching. Information and Control, 67(1-3):144–157, 1985. doi:10.1016/S0019-9958(85)80031-0.
  • [18] Mohsen Ghaffari, Christoph Grunau, and Slobodan Mitrovic. Massively parallel algorithms for b-matching. In Kunal Agrawal and I-Ting Angelina Lee, editors, SPAA ’22: 34th ACM Symposium on Parallelism in Algorithms and Architectures, pages 35–44. ACM, 2022. doi:10.1145/3490148.3538589.
  • [19] Shay Golan and Matan Kraus. String problems in the congested clique model, 2025. doi:10.48550/arXiv.2504.08376.
  • [20] Torben Hagerup. Optimal parallel string algorithms: sorting, merging and computing the minimum. In Frank Thomson Leighton and Michael T. Goodrich, editors, Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, 23-25 May 1994, Montréal, Québec, Canada, pages 382–391. ACM, 1994. doi:10.1145/195058.195202.
  • [21] MohammadTaghi Hajiaghayi, Hamed Saleh, Saeed Seddighin, and Xiaorui Sun. String matching with wildcards in the massively parallel computation model. In Kunal Agrawal and Yossi Azar, editors, SPAA ’21: 33rd ACM Symposium on Parallelism in Algorithms and Architectures, pages 275–284. ACM, 2021. doi:10.1145/3409964.3461793.
  • [22] James H. Morris Jr. and Vaughan R. Pratt. A linear pattern-matching algorithm. Technical Report 40, University of California, Berkeley, 1970.
  • [23] Tomasz Jurdzinski and Krzysztof Nowicki. MST in O(1) rounds of congested clique. In Artur Czumaj, editor, Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 2620–2632. SIAM, 2018. doi:10.1137/1.9781611975031.167.
  • [24] Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Parallel external memory suffix sorting. In Ferdinando Cicalese, Ely Porat, and Ugo Vaccaro, editors, Proceedings of Combinatorial Pattern Matching - 26th Annual Symposium, CPM 2015, volume 9133 of Lecture Notes in Computer Science, pages 329–342. Springer, 2015. doi:10.1007/978-3-319-19929-0_28.
  • [25] Juha Kärkkäinen and Tommi Rantala. Engineering radix sort for strings. In Amihood Amir, Andrew Turpin, and Alistair Moffat, editors, Proceedings of String Processing and Information Retrieval, 15th International Symposium, SPIRE 2008, volume 5280 of Lecture Notes in Computer Science, pages 3–14. Springer, 2008. doi:10.1007/978-3-540-89097-3_3.
  • [26] Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. Journal of the ACM, 53(6):918–936, 2006. doi:10.1145/1217856.1217858.
  • [27] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987. doi:10.1147/rd.312.0249.
  • [28] Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Amihood Amir and Gad M. Landau, editors, Combinatorial Pattern Matching, 12th Annual Symposium, CPM 2001, volume 2089 of Lecture Notes in Computer Science, pages 181–192. Springer, 2001. doi:10.1007/3-540-48194-X_17.
  • [29] Donald E. Knuth, James H. Morris, Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323–350, 1977. doi:10.1137/0206024.
  • [30] Fabian Kulla and Peter Sanders. Scalable parallel suffix array construction. Parallel Comput., 33(9):605–612, 2007. doi:10.1016/j.parco.2007.06.004.
  • [31] Christoph Lenzen. Optimal deterministic routing and sorting on the congested clique. In Panagiota Fatourou and Gadi Taubenfeld, editors, ACM Symposium on Principles of Distributed Computing, PODC ’13, Montreal, QC, Canada, July 22-24, 2013, pages 42–50. ACM, 2013. doi:10.1145/2484239.2501983.
  • [32] Zvi Lotker, Elan Pavlov, Boaz Patt-Shamir, and David Peleg. MST construction in o(log log n) communication rounds. In Arnold L. Rosenberg and Friedhelm Meyer auf der Heide, editors, SPAA 2003: Proceedings of the Fifteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 94–100. ACM, 2003. doi:10.1145/777412.777428.
  • [33] Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935–948, 1993. doi:10.1137/0222058.
  • [34] Krzysztof Nowicki. A deterministic algorithm for the MST problem in constant rounds of congested clique. In Samir Khuller and Virginia Vassilevska Williams, editors, STOC ’21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 1154–1165. ACM, 2021. doi:10.1145/3406325.3451136.
  • [35] Matthew Felice Pace and Alexander Tiskin. Parallel suffix array construction by accelerated sampling. In Jan Holub and Jan Zdárek, editors, Proceedings of the Prague Stringology Conference 2013, Prague, Czech Republic, September 2-4, 2013, pages 142–156. Department of Theoretical Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2013. URL: http://www.stringology.org/event/2013/p13.html.
  • [36] Boaz Patt-Shamir and Marat Teplitsky. The round complexity of distributed sorting: extended abstract. In Cyril Gavoille and Pierre Fraigniaud, editors, Proceedings of the 30th Annual ACM Symposium on Principles of Distributed Computing, PODC, pages 249–256. ACM, 2011. doi:10.1145/1993806.1993851.
  • [37] Simon J. Puglisi, William F. Smyth, and Andrew Turpin. A taxonomy of suffix array construction algorithms. ACM Computing Surveys, 39(2):4, 2007. doi:10.1145/1242471.1242472.
  • [38] Tatiana Starikovskaya and Hjalte Wedel Vildhøj. Time-space trade-offs for the longest common substring problem. In Johannes Fischer and Peter Sanders, editors, Combinatorial Pattern Matching, 24th Annual Symposium, CPM, volume 7922 of Lecture Notes in Computer Science, pages 223–234. Springer, 2013. doi:10.1007/978-3-642-38905-4_22.
  • [39] Uzi Vishkin. Optimal parallel pattern matching in strings. Information and Control, 67(1-3):91–113, 1985. doi:10.1016/S0019-9958(85)80028-0.
  • [40] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, SWAT 1973, pages 1–11. IEEE Computer Society, 1973. doi:10.1109/SWAT.1973.13.

Appendix A Sorting Large Objects

In this section, we solve the large objects sorting problem of Section 1.2 for the special case of $\varepsilon = 2/3$ in the Congested Clique model, by presenting a deterministic sorting algorithm for objects of size $O(n^{1/3})$ words that takes $O(1)$ rounds. In the full version of the paper [19], we generalize the algorithm to any $\varepsilon > 0$, which proves Theorem 4.

Our algorithm makes use of the following two lemmas.

Lemma 23 ([12, Lemma 3]).

Let $x_1, x_2, \ldots, x_n$ be natural numbers, and let $X$, $x$, and $k$ be natural numbers such that $\sum_{i=1}^{n} x_i = X$ and $x_i \le x$ for all $i$. Then there is a partition of $[n]$ into $k$ sets $I_1, I_2, \ldots, I_k$ such that for each $j$, the set $I_j$ consists of consecutive elements and $\sum_{i \in I_j} x_i \le X/k + x$.
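The following is a minimal sequential sketch of the greedy argument behind Lemma 23; the function name, the use of Python, and the greedy strategy as written are our own rendering, not code from [12].

```python
def partition_consecutive(xs, k):
    """Greedily partition indices 0..n-1 into at most k groups of
    consecutive elements, each of total weight at most X/k + x,
    where X = sum(xs) and x = max(xs) (the guarantee of Lemma 23)."""
    X, x = sum(xs), max(xs)
    groups, current, total = [], [], 0
    for i, xi in enumerate(xs):
        current.append(i)
        total += xi
        if total >= X / k:          # close a group once it reaches X/k;
            groups.append(current)  # its weight is then below X/k + x
            current, total = [], 0
    if current:
        groups.append(current)      # the leftover group weighs below X/k
    return groups + [[] for _ in range(k - len(groups))]  # pad to k sets
```

Each closed group weighs less than $X/k + x$ because its weight was below $X/k$ before its last element (of weight at most $x$) was added, and since every closed group weighs at least $X/k$, at most $k$ groups are ever created.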

Lemma 24 ([23, Lemma 1.2]).

Let $A$ be a Congested Clique algorithm which, besides the nodes $u_1, \ldots, u_n$ corresponding to the input strings, uses $O(n)$ auxiliary nodes $v_1, v_2, \ldots$ such that the auxiliary nodes initially have no knowledge of the input strings on the nodes $u_1, \ldots, u_n$. Then, each round of $A$ can be simulated in $O(1)$ rounds in the standard Congested Clique model, without auxiliary nodes.

Our algorithm is a generalization of Lenzen's sorting algorithm [31]. In [31], each node is given $n$ keys, each of size $O(1)$ words (i.e., $O(\log n)$ bits), and the nodes need to learn the ranks of their keys in the total order of the union of all keys. Our algorithm uses similar methods but adds another level of recursion in order to handle objects of size $\omega(1)$ (yet $O(n^{1/3})$) words. Formally, we prove here the following lemma.

Lemma 25.

Consider the variant of the problem of Section 1.2 where each object is of size $O(n^{1/3})$ words and every object is stored in one node. Then, there exists an algorithm that solves this variant in $O(1)$ rounds.

The main part of the algorithm is sorting the objects of the network by redistributing the objects among the nodes such that for any two objects $B < B'$ that the algorithm sends to nodes $v_i$ and $v_j$, respectively, we have $i \le j$.

The algorithm stores with each object $B$ the original node of $B$ and the index of $B$ in the original node. The algorithm uses the order of original nodes and indices to break ties.

To sort large objects of size $O(n^{1/3})$ words, the algorithm uses two building blocks. First, we show how to sort the objects of a set of $n^{1/3}$ nodes (we assume that $n^{1/3}$ is an integer; otherwise, we add $O(n)$ auxiliary nodes so that the total number of nodes $n'$ satisfies that $n'^{1/3}$ is an integer, and by Lemma 24 the round complexity increases only by a constant factor). This algorithm is the base of the second building block, which is a recursive algorithm that sorts the objects of a set of $\omega(n^{1/3})$ nodes.

A.1 Sorting at most $n^{1/3}$ Nodes with Objects of Size $O(n^{1/3})$

In this section we present Algorithm 4, which sorts all the objects in a set $W \subseteq V$ of at most $n^{1/3}$ nodes, where each object is of size $O(n^{1/3})$ words and each node stores $O(n)$ words. As in [31], each node marks some objects as candidates. Then, $n^{1/3}$ of the candidates are chosen to be delimiters, and the objects are redistributed according to these delimiters. The main part of the analysis is to prove that the redistribution works well, i.e., the delimiters divide the objects into sets of almost even total size, and therefore each set can be sent to one node in $O(1)$ rounds.

Algorithm 4 Sorting objects of at most $n^{1/3}$ nodes.

Correctness.

The correctness of Algorithm 4 derives from steps 4 to 6. As in [31, Lemma 4.2], due to the partitioning by delimiters, all the objects in $K_{i,j}$ are larger than the objects in $K_{i',j'}$ for all $v_i, v_{i'} \in W$ and $j' < j$.
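As an illustration of the delimiter mechanism, the following sequential sketch (with names of our own choosing, not taken from Algorithm 4) selects evenly spaced delimiters from the sorted candidates and assigns every object to a bucket; bucket $j$ plays the role of the union of the sets $K_{i,j}$, so concatenating the sorted buckets yields the sorted order.

```python
import bisect

def redistribute_by_delimiters(objects, candidates, num_buckets):
    """Sketch of the redistribution in Algorithm 4: pick evenly spaced
    delimiters from the sorted candidates, then route each object to
    the bucket of the first delimiter that is >= the object."""
    candidates = sorted(candidates)
    step = max(1, len(candidates) // num_buckets)
    delimiters = candidates[step - 1::step][:num_buckets - 1]
    buckets = [[] for _ in range(len(delimiters) + 1)]
    for obj in objects:
        # all delimiters before this position are strictly smaller than obj
        buckets[bisect.bisect_left(delimiters, obj)].append(obj)
    return buckets
```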

Complexity.

We now show that Algorithm 4 runs in $O(1)$ rounds. Notice that communication happens only in steps 3 and 6. In both steps, each node sends $O(n)$ words. We will show that each node also receives $O(n)$ words, and therefore we can use Lenzen's routing scheme.

For step 3, notice that there are $|W| \le n^{1/3}$ nodes, $O(n^{1/3})$ candidates per node, and the size of any candidate is $O(n^{1/3})$ words. Therefore, each node receives $O(n^{1/3}) \cdot O(n^{1/3}) \cdot O(n^{1/3}) = O(n)$ words.

It is left to prove that in step 6 each node receives $O(n)$ words. A similar claim was proved in [31, Lemma 4.3], and we show here (with the proof given in the full version of the paper [19]) that the partitioning of the objects in step 2 is an efficient adaptation of Lenzen's sorting algorithm to the case of large objects.

Lemma 26.

When executing Algorithm 4, for each $j \in [1..|W|]$, it holds that $\sum_{i=1}^{|W|} |K_{i,j}| = O(n)$.

A.2 Sorting more than $n^{1/3}$ Nodes with Objects of Size $O(n^{1/3})$

The following algorithm sorts all the objects in a set $U \subseteq V$ of $|U| > n^{1/3}$ nodes, where each object is of size $O(n^{1/3})$ words and each node stores $O(n)$ words. In particular, this algorithm sorts all the objects when $U$ consists of all $n$ nodes.

In this recursive algorithm, each node marks some objects as candidates, such that all the candidates fit into $O(|U|/n^{1/3})$ nodes. The candidates are sorted recursively, and $n^{1/3}$ of the candidates are chosen to be delimiters. Then, the objects are redistributed according to these delimiters, and each resulting set of $O(|U|/n^{1/3})$ nodes is sorted recursively; a sequential sketch of this recursion appears below.
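The following sketch (our own simplification; the actual algorithm distributes every step among the nodes) captures the shape of the recursion:

```python
import bisect

def recursive_sort(objs, base=8):
    """Sketch of the recursion of Algorithm 5: small inputs are sorted
    directly (the role of Algorithm 4); otherwise a sample of candidates
    is sorted recursively, delimiters are drawn from it, the objects are
    bucketed by delimiter, and each bucket is sorted recursively."""
    if len(objs) <= base:
        return sorted(objs)                          # Algorithm 4's role
    candidates = recursive_sort(objs[::base], base)  # sorted sample
    step = max(1, len(candidates) // base)
    delims = candidates[step - 1::step][:base - 1]
    buckets = [[] for _ in range(len(delims) + 1)]
    for o in objs:
        buckets[bisect.bisect_left(delims, o)].append(o)
    if max(len(b) for b in buckets) == len(objs):
        return sorted(objs)     # guard for degenerate samples (e.g., ties)
    return [o for b in buckets for o in recursive_sort(b, base)]
```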

Algorithm 5 Sorting objects of more than $n^{1/3}$ nodes.

Correctness.

The correctness of Algorithm 5 stems from steps 5 to 8 and follows analogously to the correctness of Algorithm 4.

Complexity.

We now focus on the steps of Algorithm 5 in which communication takes place, and show that each of them takes $O(1)$ rounds.

In step 3, we need to explain some further algorithmic details. Each node sends $O(n^{1/3})$ objects of size $O(n^{1/3})$ words each, i.e., at most $O(n^{2/3})$ words per node. The candidates of node $v_i$ are sent to node $v_j$ for $j = \lceil i/n^{1/3} \rceil$; therefore, each node receives at most $n^{1/3} \cdot O(n^{2/3}) = O(n)$ words. By Lenzen's routing scheme this is done in $O(1)$ rounds. In step 4, notice that since the number of nodes is at most $n$, the depth of the recursion is $O(1)$, and therefore the recursion does not increase the round complexity asymptotically.

In step 5, the delimiters need to be identified. Each node $v$ among the first $|U|/n^{1/3}$ nodes broadcasts the number of candidates that $v$ received in step 4. Thereby, each such node $v$ computes for every candidate $B$ it holds whether the rank of $B$ among the candidates is a multiple of $|\mathcal{C}|/n^{1/3}$, and if so, selects $B$ to be a delimiter. There are $O(n^{1/3})$ delimiters of size $O(n^{1/3})$ words each, which is $O(n^{2/3})$ words in total. Therefore, the delimiters are announced to all the nodes in $O(1)$ rounds using Lemma 9.
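In a sequential form, the rank test of step 5 might look as follows; the variable names are ours, and counts[i] stands for the value broadcast by the $i$-th of the first $|U|/n^{1/3}$ nodes.

```python
def select_my_delimiters(counts, my_id, my_candidates, num_delimiters):
    """Sketch of step 5: from the broadcast counts, node my_id derives
    the global rank of its first candidate, and selects as delimiters
    the candidates whose rank is a multiple of |C| // num_delimiters.
    Assumes my_candidates is sorted and the nodes are globally ordered."""
    total = sum(counts)                     # |C|, the number of candidates
    step = max(1, total // num_delimiters)
    offset = sum(counts[:my_id])            # candidates held by earlier nodes
    return [c for rank, c in enumerate(my_candidates, start=offset + 1)
            if rank % step == 0]
```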

For step 7, we first show that the total number of words that each set $W_j$ receives is $O(|U| \cdot n^{2/3})$.

Lemma 27 (Proof is given in the full version [19]).

When executing Algorithm 5, for each $j \in [1..n^{1/3}]$, it holds that $\sum_{i=1}^{|U|} |K_{i,j}| = O(|U| \cdot n^{2/3})$.

Now, the algorithm selects for every set $W_j$ a leader $v_{W_j}$. Each node $v_i$ sends $|K_{i,j}|$ to $v_{W_j}$. The leader $v_{W_j}$ computes and sends to each node $v_i \in U$ a node $w \in W_j$ such that $v_i$ should send $K_{i,j}$ to $w$. By Lemma 23, there is such an assignment in which each node $w \in W_j$ receives at most $O(n)$ words, by setting in the lemma $x_i \leftarrow |K_{i,j}|$, $X \leftarrow |U| \cdot O(n^{2/3})$, $x \leftarrow O(n)$, and $k \leftarrow |W_j| = |U|/n^{1/3}$. On the other hand, each node sends at most $O(n)$ words. By Lenzen's routing scheme this is done in $O(1)$ rounds.
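For concreteness, the leader's computation can reuse the partition_consecutive sketch from the beginning of this appendix; the numbers below are hypothetical toy values, not values from the paper.

```python
# The leader of W_j receives the sizes |K_{i,j}| (the x_i's of Lemma 23)
# and partitions the senders into k = |W_j| groups of consecutive
# indices, one group per receiving node w in W_j.
sizes = [3, 1, 4, 1, 5, 9, 2, 6]            # toy values of |K_{i,j}|, i = 1..|U|
groups = partition_consecutive(sizes, k=4)  # k = |W_j|
# each group's total weight is at most sum(sizes)/4 + max(sizes),
# which corresponds to the O(n) bound in the text
```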

In step 8, similarly to step 4, the depth of the recursion is $O(1)$.

We are ready to prove Lemma 25.

Proof of Lemma 25.

First, we apply Algorithm 5 with $U = V$. Hence, all the objects are ordered in non-decreasing lexicographical order among all the nodes of the network.

Next, we show how to compute $\mathsf{rank}(B)$ for each object $B$. Each node $v_i$ for $1 \le i < n$ sends to node $v_{i+1}$ the largest object of $v_i$ (by the lexicographical order), denoted $B_i$. Then, each node $v_i$ computes and broadcasts the number of distinct objects that $v_i$ holds which are different from $B_{i-1}$ (for $i = 1$, node $v_i$ simply broadcasts the number of distinct objects it holds), ignoring the tiebreakers of the original node and original index (notice that this number of distinct objects might be $0$). Now, each node $v_i$ computes the rank of all the objects it holds.
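Sequentially, this distinct-count bookkeeping amounts to the following sketch (our own rendering of the step, not code from the paper):

```python
def ranks_from_sorted_runs(runs):
    """Sketch of the rank computation: runs[i] is the sorted list of
    objects held by node v_{i+1}, and the runs are globally sorted.
    Each object is compared to the previously scanned object (which,
    at a run boundary, is the previous node's largest object B_{i-1}),
    and the count of distinct objects seen so far is its rank."""
    ranks, prev, next_rank = {}, None, 0
    for run in runs:
        for obj in run:
            if obj != prev:         # a new distinct object
                next_rank += 1
                prev = obj
            ranks[obj] = next_rank
    return ranks
```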

Lastly, for every object $B$, $\mathsf{rank}(B)$ is sent to the original node of $B$, using the information about the original node that $B$ stores. By Lenzen's routing scheme this is done in $O(1)$ rounds.

Appendix B Sorting Objects of Size $O(n)$

In this section we prove Theorem 5. Notice that the problem of Section 1.2 with $\varepsilon = 0$ means that every key is of size $O(n)$ words.

B.1 Upper Bound

In this section we show an algorithm that sorts objects of size $O(n)$ words in $O(\log n)$ rounds. Our algorithm simulates an execution of a sorting network. A sorting network with $N$ wires (analogous to the cells of an input array) sorts comparable objects as follows. The network has a fixed number of parallel levels, where each level is composed of at most $N/2$ comparators. A comparator compares two input objects and swaps their positions if they are out of order. The number of parallel levels of a sorting network is called the depth of the sorting network.
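To fix the semantics, here is a minimal sketch of one level of comparators, together with the odd-even transposition network as a simple stand-in (its depth is $N$ rather than the $O(\log N)$ of the AKS network used below):

```python
def apply_level(wires, comparators):
    """One parallel level: each comparator (a, b) with a < b swaps the
    objects on wires a and b if they are out of order. The comparators
    of a single level touch disjoint wires."""
    for a, b in comparators:
        if wires[a] > wires[b]:
            wires[a], wires[b] = wires[b], wires[a]

def transposition_levels(N):
    """Levels of the odd-even transposition network, used here only as
    an easy-to-state stand-in for the AKS network."""
    return [[(i, i + 1) for i in range(r % 2, N - 1, 2)] for r in range(N)]

wires = [5, 2, 4, 1, 3]
for level in transposition_levels(len(wires)):
    apply_level(wires, level)
assert wires == [1, 2, 3, 4, 5]
```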

Ajtai, Komlós, and Szemerédi (AKS) [1] described a sorting network of depth $O(\log N)$ for every $N$. Notice that in our setting $N \le O(n^2)$. We prove the following lemma.

Lemma 28.

There is an algorithm that solves the problem of Section 1.2 with $\varepsilon = 0$ in $O(\log n)$ rounds.

Proof.

We show how to simulate the execution of each level of the AKS sorting network for $N$ input objects in the Congested Clique model in $O(1)$ rounds.

First, each node broadcasts the number of objects it stores, and then each node calculates $N$, the total number of objects in the network, and produces the AKS sorting network with $N$ wires. In addition, each node attaches to every object within the node the global index of the object as metadata.

At each level, there are $O(n^2)$ comparators. Each node $v_i$ is responsible for the comparators $[(i-1)\lceil N/n \rceil + 1 \,..\, i\lceil N/n \rceil]$ (if they exist), which are $O(n)$ comparators. To this end, each node with inputs to a comparator under $v_i$'s responsibility sends, for every such input, the size and the index of the input to $v_i$. This routing takes $O(1)$ rounds. Denote by $S_{v_i}$ the sum of the sizes of the objects (inputs) under $v_i$'s responsibility. Then, $v_i$ creates $a_i = \lceil S_{v_i}/n \rceil$ auxiliary nodes. Notice that the total number of auxiliary nodes is at most

$$\sum_{i=1}^{n} a_i = \sum_{i=1}^{n} \left\lceil \frac{S_{v_i}}{n} \right\rceil \le \sum_{i=1}^{n} \left(1 + \frac{S_{v_i}}{n}\right) \le n + \frac{O(n^2)}{n} = O(n).$$

By Lemma 24, the algorithm can simulate each round with the auxiliary nodes on the original network in $O(1)$ rounds. Then, $v_i$ calculates a partition of the comparators among $v_i$'s auxiliary nodes such that for each auxiliary node of $v_i$, the total size of the objects destined for $v_i$'s comparators on this auxiliary node is $O(n)$ (by Lemma 23, such a partition exists). Then, let $u$ be a node that holds an object $B$ that should be sent to one of $v_i$'s comparators. $v_i$ informs $u$ which auxiliary node is the target of $B$, and then $u$ sends $B$ to this auxiliary node. By Lenzen's routing scheme, this is done in $O(1)$ rounds.

Finally, all the comparators execute their comparisons, and whenever a swap is needed, the two objects swap their metadata indices. Since a swap of wires in a comparator is equivalent to a swap of indices in our simulation, after the last level the metadata index of each object is its rank among all the objects.

To conclude, it takes $O(1)$ rounds to simulate each level of the sorting network. The AKS sorting network has at most $O(\log(n^2)) = O(\log n)$ levels, and therefore sorting objects of size $O(n)$ words takes $O(\log n)$ rounds.

B.2 Lower Bound

In this section we prove the lower bound of Theorem 5.

Lemma 29.

Every comparison-based algorithm that solves the problem of Section 1.2 with $\varepsilon = 0$ must take $\Omega(\log n / \log\log n)$ rounds.

Proof.

Let $A$ be an algorithm that solves the problem of Section 1.2 with $n$ nodes and $n$ keys, each of size $\Theta(n)$, in $r$ rounds. We describe another algorithm $A'$ that also runs in $r$ rounds, solves the problem, and performs only $O(nr\log r)$ comparisons. Thus, for $r = o(\log n / \log\log n)$ we would get an algorithm that sorts $n$ keys with $O(nr\log r) = o(n\log n)$ comparisons, which contradicts the celebrated comparison-based sorting lower bound by Ford and Johnson [16].

The algorithm $A'$ works as follows. Let us focus on one specific node $v$. The node $v$ simulates $A$, but ignores all the comparisons performed in $A$. Let $x_1, x_2, \ldots$ be the keys $v$ receives during the run of the algorithm, in their arrival order (breaking ties arbitrarily). In $A'$, the node $v$ maintains at all times the sorted order of all the keys that $v$ has received so far. Whenever $v$ receives a new key $x_i$, $v$ performs a binary search on the keys $\{x_1, \ldots, x_{i-1}\}$ (which are maintained in sorted order). This takes $O(\log i)$ comparisons, after which $v$ updates the sorted order of all the keys $x_1, \ldots, x_i$ with no additional comparisons.
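A minimal sketch of $v$'s bookkeeping in $A'$ (the class name and the comparison counter are ours; the counter is an upper-bound proxy for the binary-search comparisons):

```python
import bisect

class SortedReceiver:
    """Node v in A': keep the received keys sorted and insert each new
    key with a binary search, so the i-th key costs O(log i) comparisons
    and r keys cost O(r log r) comparisons overall."""
    def __init__(self):
        self.keys = []
        self.comparisons = 0

    def receive(self, key):
        # a binary search over i keys performs O(log i) comparisons
        self.comparisons += len(self.keys).bit_length()
        bisect.insort(self.keys, key)  # maintains the sorted order
```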

Since $v$ maintains at all times the sorted order of all the keys it has received so far, $v$ can simulate any operation that $v$ has to perform in $A$, even if the operation requires a comparison between keys.

We focus on the case where every key is of size $\Theta(n)$ words. In this case, a node $v$ must receive $\Omega(n)$ words in order to obtain a key, while in each of the $r$ rounds of $A$, $v$ receives only $O(n)$ words; hence $v$ receives at most $O(r)$ keys. The total number of comparisons $v$ performs while running $A'$ is $\sum_{i=1}^{O(r)} \log i = O(r\log r)$. There are $n$ nodes in the Congested Clique, and therefore there are $O(nr\log r)$ comparisons in total across all the nodes.