
Parallel Query Processing with Heterogeneous Machines

Simon Frisk, University of Wisconsin-Madison, WI, USA
Paraschos Koutris, University of Wisconsin-Madison, WI, USA
Abstract

We study the problem of computing a full Conjunctive Query in parallel using p heterogeneous machines. Our computational model is similar to the MPC model, but each machine has its own cost function mapping the number of bits it receives to a cost. An optimal algorithm should minimize the maximum cost across all machines. We consider algorithms over a single communication round and give a lower bound and a matching upper bound for databases where every relation has the same cardinality. We do this both for linear cost functions, as in previous work, and for more general cost functions. For databases with relations of different cardinalities, we also give a lower bound, together with matching upper bounds for specific queries: the cartesian product, the binary join, the star query, and the triangle query. Our approach is inspired by the HyperCube algorithm, but heterogeneous cost functions introduce additional challenges.

Keywords and phrases:
Joins, Massively Parallel Computation, Heterogeneous
Copyright and License:
© Simon Frisk and Paraschos Koutris; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Database theory
Related Version:
Full Version: https://arxiv.org/abs/2501.08896 [5]
Editors:
Sudeepa Roy and Ahmet Kara

1 Introduction

Large datasets are commonly processed using massively parallel systems. To analyze query processing in such a setting, Beame et al. [2] introduced the massively parallel computation (MPC) model. The MPC model considers a cluster with a shared-nothing architecture, where computation proceeds in rounds: each round consists of communication between machines, followed by computation on the locally stored data. The main measure of complexity in the MPC model is the load, which captures the maximum number of bits received by a machine. An efficient MPC algorithm is designed to make the load as small as possible.

However, the MPC model operates on an assumption of homogeneity: the cost of a machine is independent of which machine the received data came from and of how powerful the receiving machine is. This assumption is unrealistic, since the large-scale clusters on which massively parallel computation is performed are heterogeneous. Heterogeneity can occur both in compute resources (processing speed, memory) and in the network that connects the machines.

In this work, we consider massively parallel data processing in clusters with heterogeneity in compute resources. We use a computational model that, like the MPC model, has a homogeneous network topology (every machine is connected directly to every other machine). However, each machine c is equipped with its own cost function $g_c$, which maps the number of bits the machine receives to a cost. The load L of a round is then defined as the maximum cost across all machines, i.e., $L = \max_{c\in[p]} g_c(N_c)$, where $N_c$ is the number of bits received by machine c. The computational model in this paper captures the MPC model as the special case where every machine has the identity cost function $g_c(N) = N$. Our model is also a special instance of the topology-aware model of [3], albeit one that has not been studied in prior work.

Based on the above heterogeneous model, we study the problem of computing join queries with minimum load. We focus on one-round algorithms, i.e., we allow only local computation after one round of communication. One-round algorithms are particularly relevant to data processing systems with a disaggregated storage architecture (e.g., Amazon Aurora [14], Snowflake [4]): they can be viewed as algorithms that send the data from the storage layer to the compute layer in such a way that no further communication is needed within the compute layer. This paper therefore addresses the problem of optimally sending data from the storage layer to the compute layer when there is compute heterogeneity.

Our Contributions.

The main contribution of this work is upper and lower bounds for the load L of computing a join query (corresponding to a full Conjunctive Query) in one round with heterogeneous machines. In particular:

  • We present an algorithm (Section 4) that evaluates a join query in one round when the cost function is linear with different weights, i.e., gc(N)=N/wc for machine c. Our algorithm works for two different types of inputs where all relations have the same size: matching databases that are sparse, and dense databases that contain a constant fraction of all possible input tuples.

  • We give (Section 5) lower bounds that (almost) match the upper bounds for both the sparse and dense cases. Our lower bounds are unconditional, that is, they make no assumptions on how the algorithm behaves and how it encodes the input tuples.

  • We next consider the case with non-linear cost functions (Section 6). Previous work, even in the topology-aware MPC model, assumes linear cost functions. We generalize this to a wider class of cost functions.

  • Finally, we consider queries where the cardinalities of the input relations differ (Section 7). We give a lower bound on the load of computing such queries in a single round, for the same two data distributions as in the equal-cardinality case. We also give algorithms that match the lower bound for the cartesian product, the binary join, the star query, and the triangle query.

Technical Ideas.

In the MPC model, the HyperCube algorithm has proved to be the key technique behind optimal join algorithms. The HyperCube algorithm maps tuples to machines via hash functions that hash each tuple to a vector. A tuple is sent to the machines whose coordinates, projected on the attributes of the tuple, equal its hash vector. Each machine obtains the same number of tuples (with high probability) and thus has the same load. In the heterogeneous setting, however, each machine may need to be allocated a different number of tuples, since slower machines can handle less data than faster machines. Thus, instead of considering how to organize the machines in a hypercube, we consider how to partition the space of all possible tuples $\Lambda = [n]^k$ into subspaces (which are hyperrectangles) $\Lambda_c \subseteq \Lambda$, one for each machine c. Each machine is then responsible for computing all the output tuples in its subspace, and to do this correctly it needs to receive all input tuples that may contribute to them. The technical challenge is twofold: (i) how to optimally set the dimensions of each $\Lambda_c$ to minimize the load across all machines, and (ii) how to geometrically position the subspaces such that the space Λ is fully covered. We will show that query parameters such as fractional edge packings and vertex covers remain critical in characterizing the optimal load, but the algorithmic techniques we use differ from the HyperCube algorithm.

2 Related Work

MPC Algorithms.

The MPC model is a computational model introduced by Beame et al. [2]. It has been used to analyze parallel algorithms for joins and other fundamental data processing tasks. The seminal paper [2] shows matching upper and lower bounds on the load for computing Conjunctive Queries in one round over matching databases. A lower bound for queries with skew was also given, and matched by an upper bound for some classes of queries. Later work [12] studied the worst-case optimal load of one-round algorithms for any input and proposed an algorithm matching the lower bound. Further research explored the computation of join queries using multiple rounds [12, 11, 6, 9, 13], and the design of parallel output-sensitive algorithms in the MPC model [10].

Topology-aware Algorithms.

A recent line of work considers a topology-aware parallel model that accounts for heterogeneity in the cluster topology and compute resources [3, 8, 7]. In this model, the topology is a graph $G = (V, E)$, where a subset $V_C \subseteq V$ of the nodes are compute nodes. Computation proceeds in rounds as in the MPC model, but the cost model differs: instead of the maximum number of bits sent to a processor, each edge of the network has a cost that is a function of the number of bits it transmits, and the cost of a round is the maximum cost across all edges. A common cost function for edge e is $f_e(N) = N/w_e$, which is similar to the cost function used in this paper. Under this topology-aware model, recent work has studied lower and upper bounds for set intersection, cartesian product, and sorting [8], as well as binary joins [7]. Both of these papers assume that the underlying network is a symmetric tree topology.

The computational model in this paper is a special case of the topology-aware MPC model, where the network topology is a star. This is a tree with depth 1, where all leaves are compute nodes, and the root node is a router. The cost function from a compute node to the router is 0, and the cost function from the router to machine c is precisely the cost function of the machine, gc(N). Prior work in the topology-aware MPC model does not capture the work in this paper, for two reasons. First, it considers symmetric trees, meaning the cost function across a link is the same in each direction, which is not true in this paper. Second, we consider arbitrary full conjunctive queries, which have not been studied previously.

3 Background

Computation Model.

Initially, the p machines in the cluster hold an arbitrary piece of the input data. The computation then proceeds in r rounds. A round consists of the communication phase, where machines can exchange data, followed by the computation phase, where computation is performed on locally stored data. In this paper, we focus on algorithms where r=1, meaning there is a single round of communication followed by computation on local data. The output of a computation is the (set) union of the output across all machines.

In the standard MPC model, the cost of a round is the maximum amount of data (in bits) received by any machine. That is, if $N_c$ is the number of bits received by machine c, the cost of the round is $L = \max_{c\in[p]} N_c$.

In this paper, we extend this model to heterogeneous machines. Each machine $c \in [p]$ has a cost function $g_c : \mathbb{R}_+ \to \mathbb{R}_+$ that maps the number of bits received ($N_c$) to a positive real number denoting cost. The cost of a round is, analogously, $\max_{c\in[p]} g_c(N_c)$. We will mostly work with linear cost functions $g_c(x) = x/w_c$ for some $w_c \in \mathbb{R}_+$. Here, the weight $w_c$ of each machine captures its cost rate, which may reflect both data transmission and processing. Later in the paper, we study more general cost functions.
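To make the cost model concrete, the following minimal sketch (our own illustration; the function name and numbers are not from the paper) computes the load of a round under linear cost functions:

```python
def round_load(bits_received, weights):
    """Load of one round under linear cost functions: the maximum
    cost g_c(N_c) = N_c / w_c over all machines c."""
    return max(n_c / w_c for n_c, w_c in zip(bits_received, weights))

# The MPC model is the special case where every weight equals 1.
print(round_load([1000, 2000, 1500], [1.0, 4.0, 2.0]))  # -> 1000.0
```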

Conjunctive Queries.

In this paper, we work with Conjunctive Queries without projection or selection. These can be thought of as natural joins between l relations:

$q(x_1,\dots,x_k)$ :- $S_1(\mathbf{y}_1),\dots,S_l(\mathbf{y}_l)$

There are k variables, denoted $x_1,\dots,x_k$, and l atoms, denoted $S_1,\dots,S_l$. For each j, the vector $\mathbf{y}_j$ consists of variables, and $r_j$ is the arity of atom $S_j$. We restrict the queries in this paper to have no self-joins, meaning that no two atoms refer to the same underlying relation. We often write $x \in S_j$ to mean that variable x occurs in atom $S_j$. We work with relations whose values come from a domain $[n] = \{1, 2, \dots, n\}$. We denote the cardinality of atom $S_j$ by $m_j$ and the number of bits needed to encode $S_j$ by $M_j$.

A fractional vertex cover $\mathbf{v}$ for q assigns a weight $v_i \ge 0$ to each variable $x_i$ such that for every atom $S_j$, we have $\sum_{x_i \in S_j} v_i \ge 1$.

A fractional edge packing $\mathbf{u}$ for q assigns a weight $u_j \ge 0$ to each atom $S_j$ such that for every variable $x_i$, we have $\sum_{j : x_i \in S_j} u_j \le 1$.

HyperCube Algorithm.

HyperCube is an elegant algorithm for distributed multiway joins, originally introduced by Afrati and Ullman for the MapReduce model [1]. It computes multiway joins in a single round of communication, as opposed to traditional methods where relations are joined pairwise. We will illustrate how HyperCube computes a full CQ q with k variables using p machines.

The p machines are organized in a hyperrectangle with k dimensions, one per variable. The sides of the hyperrectangle contain $p_1,\dots,p_k$ machines, where $p_i \in [1, p]$ and $\prod_{i\in[k]} p_i = p$. Each machine c has a coordinate $\mathbf{C}_c \in [p_1] \times \dots \times [p_k]$; denote by $\pi_{S_j}\mathbf{C}_c$ the projection of $\mathbf{C}_c$ on the variables of $S_j$. We use k hash functions $h_i : [n] \to [p_i]$, one per variable. Denote by $\mathbf{h} = (h_1,\dots,h_k)$ the vector of all hash functions, and by $\pi_{S_j}\mathbf{h}$ its projection on $S_j$. A tuple $a_j \in S_j$ is sent to all machines c such that $(\pi_{S_j}\mathbf{h})(a_j) = \pi_{S_j}\mathbf{C}_c$. The query can then be computed locally at each machine over the tuples it received. Correctness follows from the fact that each tuple $\mathbf{a} \in [n]^k$ that should be in the output is produced by the machine with coordinate $\mathbf{h}(\mathbf{a})$.
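The following Python sketch illustrates the HyperCube routing rule (a toy implementation under our own naming conventions; dictionaries stand in for the random hash functions):

```python
import random
from itertools import product

def hypercube_route(relations, atom_vars, shares, n):
    """One-round HyperCube routing (a toy sketch). relations[j] holds the
    tuples of S_j, atom_vars[j] the indices of its variables, shares[i]
    the number p_i of coordinates along variable x_i (their product is p)."""
    # one independent hash function h_i : [n] -> [p_i] per variable
    h = [{a: random.randrange(shares[i]) for a in range(n)}
         for i in range(len(shares))]
    machines = {coord: [] for coord in product(*(range(s) for s in shares))}
    for j, rel in enumerate(relations):
        for t in rel:
            hashed = {i: h[i][a] for i, a in zip(atom_vars[j], t)}
            # send t to every machine whose coordinates agree with the
            # hash vector on the variables of S_j
            for coord, local in machines.items():
                if all(coord[i] == v for i, v in hashed.items()):
                    local.append((j, t))
    return machines  # each machine then joins its local tuples

# Example: q(x1,x2) :- S1(x1), S2(x2) on a 2 x 2 grid of 4 machines.
print(hypercube_route([{(0,), (1,)}, {(2,), (3,)}], [[0], [1]], [2, 2], n=4))
```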

Input Distributions.

In this paper, we focus on two classes of inputs, sparse and dense. The first type of input is a matching database, where relation $S_j$ has cardinality $m_j$ and, for every domain value $v \in [n]$, every relation $S_j$, and every attribute A of that relation, there exists at most one tuple $a_j \in S_j$ whose value in attribute A is v. If the arity of a relation $S_j$ is 1, we require that $m_j/n \le \theta$ for some constant $\theta \in (0,1)$. We start by considering the case where each relation has the same cardinality; in Section 7, we generalize to the case where each relation can have a different cardinality $m_j \le n$.

The second class of inputs are θ-dense databases, where $\theta \in (0,1)$. In such an input, a relation $S_j$ of arity $r_j$ contains a θ fraction of all $n^{r_j}$ possible tuples. We consider θ a constant in terms of data complexity. We first study instances where the cardinality of every relation is the same (which means that the arity $r_j$ is the same for each relation), and generalize to unequal cardinalities in Section 7.

4 The Upper Bound

In this section, we give algorithms for computing a full Conjunctive Query q with k variables. We consider the linear cost model: we have p machines, and machine $c \in [p]$ has a linear cost function $g_c(N) = N/w_c$ for some weight $w_c > 0$. We denote $\mathbf{w} := (w_1,\dots,w_p)$.

Let I be an instance with uniform cardinalities m over a domain [n]. Let $\mathbf{v}$ be a fractional vertex cover of q and $v = \sum_{i\in[k]} v_i$. Then, define:

$L^{\mathrm{upper}}_{\mathbf{v}} := \frac{m \log n}{\lVert\mathbf{w}\rVert_v} = \frac{m \log n}{\left(\sum_{c\in[p]} w_c^{v}\right)^{1/v}}$
Theorem 1 (Dense Inputs).

Let q be a full CQ with uniform arity r and a θ-dense input I with domain [n] (every relation has size $m = \theta n^r$). Then, for every fractional vertex cover $\mathbf{v}$, we can evaluate q in one round in the linear cost model with load $O(L^{\mathrm{upper}}_{\mathbf{v}})$.

Theorem 2 (Sparse Inputs).

Let q be a full CQ and I be a matching database with domain [n] and uniform relation sizes m. Then, for every fractional vertex cover $\mathbf{v}$, we can evaluate q in one round in the linear cost model with load (with high probability) $O(L^{\mathrm{upper}}_{\mathbf{v}})$.

In the rest of the section, we prove the above two theorems. We start with an overview of our approach, which is similar to the HyperCube algorithm, albeit with some important modifications. We do not pick share exponents that decide how many machines to place along each dimension of the hypercube: this concept is no longer meaningful, since the machines are different.

Instead, we consider the hyperrectangle $\Lambda = [n]^k$, which can be thought of as the space containing all possible output tuples. Our algorithm partitions Λ into hyperrectangles $\{\Lambda_c\}_{c\in[p]}$ and uses this partitioning to decide how the machines compute the output. To do this, we need a vector of k functions $\mathbf{h} = (h_1,\dots,h_k)$, where $h_i : [n] \to [n]$. For the sparse data distribution, each $h_i$ is a random hash function (essentially perturbing the input tuples). For the dense data distribution, $\mathbf{h}$ is the identity function $\mathbf{h}(\mathbf{a}) = \mathbf{a}$.

Then, machine c is responsible for computing every tuple $\mathbf{a} \in [n]^k$ such that $\mathbf{h}(\mathbf{a}) \in \Lambda_c$. To achieve this, our algorithm sends information about a tuple $a_j \in S_j$ to every machine c with $(\pi_{S_j}\mathbf{h})(a_j) \in \pi_{S_j}\Lambda_c$, where $\pi_{S_j}\Lambda_c$ is the projection of the subspace on the attributes of $S_j$. As in the HyperCube algorithm, this guarantees that every potential output tuple $\mathbf{a}$, if it exists in the output, is produced at one machine, namely the machine c with $\mathbf{h}(\mathbf{a}) \in \Lambda_c$, as the sketch below illustrates.
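In code, the routing rule is a simple interval-membership test (a sketch with our own representation: each $\Lambda_c$ is a list of half-open intervals, one per variable):

```python
def receiving_machines(hashed, atom_vars_j, boxes):
    """Machines that must receive tuple a_j of S_j: those whose box,
    projected on the variables of S_j, contains the hashed tuple.
    boxes[c][i] = (lo, hi) is the interval of Lambda_c along x_i;
    hashed[i] is the value (pi_{S_j} h)(a_j) on variable x_i."""
    return [c for c, box in enumerate(boxes)
            if all(box[i][0] <= hashed[i] < box[i][1] for i in atom_vars_j)]
```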

We denote by $\lambda_{c,i}$ the side length of $\Lambda_c$ along variable $x_i$. Moreover, we use $|\Lambda|$ to denote the volume of Λ, i.e., the number of points in the space. Note that $|\pi_S \Lambda_c| = \prod_{x_i \in S} \lambda_{c,i}$.

There are two main aspects to describe of our algorithm. The first is how to pick the side lengths λc,i for each machine and dimension to minimize the load – this corresponds to minimizing the projections πSjΛc of the hyperrectangles. The second is how to geometrically position the hyperrectangles Λc in Λ to cover the whole space. We describe these two components in the next two sections.

4.1 Partitioning the Space

Theorem 3.

Let $\mathbf{v} = (v_1,\dots,v_k)$ be any fractional vertex cover of a CQ q, and let $v = \sum_{i\in[k]} v_i$. For every machine c, let the side length of the hyperrectangle $\Lambda_c$ along variable $x_i$ be

$\lambda_{c,i} := \left(\frac{w_c}{\lVert\mathbf{w}\rVert_v}\right)^{v_i} \cdot n$

Then, the following two properties hold:

  1. $\sum_{c\in[p]} |\Lambda_c| = n^k$;

  2. for every machine c and every atom S with arity r: $|\pi_S \Lambda_c| \le \frac{w_c}{\lVert\mathbf{w}\rVert_v}\, n^r$.

Proof.

We start by showing that the assignment above covers all of Λ, by summing the covered volume for each machine.

$\sum_{c\in[p]} |\Lambda_c| = \sum_{c\in[p]} \prod_{i\in[k]} \lambda_{c,i} = \sum_{c\in[p]} \prod_{i\in[k]} \left(\frac{w_c}{\lVert\mathbf{w}\rVert_v}\right)^{v_i} n = \sum_{c\in[p]} \left(\frac{w_c}{\lVert\mathbf{w}\rVert_v}\right)^{v} n^k = n^k\, \frac{\sum_{c\in[p]} w_c^v}{\sum_{c\in[p]} w_c^v} = n^k$

Next, we show the bound on the volume of the projected hyperrectangle on each atom. We focus on some atom S with arity r. Then, we have:

$|\pi_S \Lambda_c| = \prod_{x_i \in S} \lambda_{c,i} = \left(\frac{w_c}{\lVert\mathbf{w}\rVert_v}\right)^{\sum_{x_i \in S} v_i} n^r$

Note that $w_c / \lVert\mathbf{w}\rVert_v \le 1$. Furthermore, since $\mathbf{v}$ is a vertex cover, $\sum_{x_i \in S} v_i \ge 1$. Hence, we get the desired inequality.

The above theorem provides the appropriate dimensions of each hyperrectangle $\Lambda_c$, but it does not tell us how these hyperrectangles must be positioned geometrically within Λ so that they cover the whole space.

Example 4.

Consider the Cartesian product $q(x,y)$ :- $S_1(x), S_2(y)$. We have p = 17 machines: 2 machines with w = 4, 1 machine with w = 3, 3 machines with w = 2, and 11 machines with w = 1. Consider the vertex cover with $v_x = v_y = 1$. Then, $\lVert\mathbf{w}\rVert_v = 8$. This gives that machines with w = 4 should have side length n/2, the machine with w = 3 side length 3n/8, machines with w = 2 side length n/4, and machines with w = 1 side length n/8. Figure 1 shows one way to position the rectangles to cover Λ; each rectangle is labeled with the weight of the machine that occupies that space.

Figure 1: One way to pack the machines in the example.

In the example above we can perfectly fit the rectangles together to cover Λ. In the case when all hyperrectangles have the same dimensions, such as when machines have the same weight wc, packing is a trivial problem. In general, there might not be a perfect way to fit the hyperrectangles together to cover the full space. This will require us to increase the size of some of the hyperrectangles Λc, but the volumes will be increased only by a constant factor.
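The side lengths of Theorem 3 are easy to compute; the sketch below (our own code) reproduces Example 4 and verifies that the rectangle volumes sum to exactly |Λ|:

```python
def side_lengths(weights, vcover, n):
    """Theorem 3: lambda_{c,i} = (w_c / ||w||_v)^{v_i} * n, where
    v = sum_i v_i and ||w||_v = (sum_c w_c^v)^(1/v)."""
    v = sum(vcover)
    norm = sum(w ** v for w in weights) ** (1.0 / v)
    return [[(w / norm) ** vi * n for vi in vcover] for w in weights]

# Example 4: 17 machines and the vertex cover v_x = v_y = 1, so ||w||_v = 8.
weights = [4, 4, 3, 2, 2, 2] + [1] * 11
dims = side_lengths(weights, [1, 1], n=8)
print(sorted({tuple(map(round, d)) for d in dims}))  # sides 4, 3, 2, 1 for n = 8
print(sum(lx * ly for lx, ly in dims))               # -> 64.0 = n^2 = |Λ|
```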

4.2 Packing Hyperrectangles

In this subsection, we will show how to geometrically position the hyperrectangles {Λ1,,Λp} to cover Λ. During this process, we will have to adjust the dimensions of each Λc so that the hyperrectangles can fit together. This will result in adjusted hyperrectangles {Λ¯1,,Λ¯p}, however, we only have to pay a constant factor increase in their dimensions. In particular:

Theorem 5 (Packing Theorem).

The hyperrectangles $\{\Lambda_1,\dots,\Lambda_p\}$ can be packed to cover Λ by adjusting them to hyperrectangles $\{\bar\Lambda_1,\dots,\bar\Lambda_p\}$ such that for every relation $S_j$ with arity $r_j$ and every machine c, $|\pi_{S_j}\bar\Lambda_c| \le 2^{k+1+r_j}\, |\pi_{S_j}\Lambda_c|$.

Outside this subsection, we always denote the hyperrectangle of machine c by $\Lambda_c$, even after the packing algorithm has run.

A condensed description of the packing algorithm is given in Algorithm 1. The algorithm sets the dimensions of each $\Lambda_c$ according to Theorem 3. Each side of each hyperrectangle is then independently rounded up to the nearest power of two, giving adjusted hyperrectangles $\{\hat\Lambda_1,\dots,\hat\Lambda_p\}$. The hyperrectangles are then put into buckets, where each bucket contains all hyperrectangles of the same dimensions. Denote the number of buckets by b.

Because the sides of hyperrectangles have been rounded to powers of two, we can always, if we have enough hyperrectangles in some small bucket, merge them into one hyperrectangle that fits in a larger bucket. We will order buckets in increasing order of hyperrectangle size. Starting with the first bucket, we will merge as many hyperrectangles as possible into hyperrectangles that fit in the second bucket. We do this for each consecutive pair of buckets until the last bucket is reached.

In the next step, we take the largest bucket and pairwise merge its hyperrectangles into hyperrectangles of twice the volume, by stacking them along a minimum dimension. This gives a new bucket of hyperrectangles. We repeat this procedure until the obtained bucket contains just one hyperrectangle R.

We will now take this hyperrectangle R and use it to fill Λ. Some dimensions of R may be smaller than n. In such a case, we just scale up R in those dimensions to be exactly n.

Algorithm 1 Packing Algorithm.
1: Λ_1, …, Λ_p ← dimensions according to Theorem 3.
2: Λ̂_1, …, Λ̂_p ← round each side up to the nearest power of two.
3: B_1, …, B_b ← buckets of the Λ̂_c with identical dimensions.
4: for B_t ∈ {B_1, …, B_{b−1}} do
5:   Merge as many hyperrectangles from B_t into B_{t+1} as possible.
6: end for
7: t ← b
8: while |B_t| > 1 do
9:   B_{t+1} ← pairwise merge the hyperrectangles of B_t along their smallest dimension.
10:   t ← t + 1.
11: end while
12: R ← the single hyperrectangle in B_t.
13: Scale R up to cover Λ.
14: return R.
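The bucket bookkeeping of lines 3–11 can be simulated with counts alone, since Lemma 6 (proved below) guarantees that the rounded shapes are totally ordered dimension-wise. The following sketch (our own code, tracking only how many boxes of each shape remain, not their geometric placement) illustrates the two merging phases:

```python
import math
from collections import Counter

def bucket_merge(rounded_shapes):
    """Counting simulation of Algorithm 1's merging phases.
    rounded_shapes: one tuple of power-of-two side lengths per machine."""
    vol = lambda s: math.prod(s)
    buckets = Counter(rounded_shapes)        # shape -> number of boxes
    order = sorted(buckets, key=vol)         # increasing volume
    # Phase 1: merge shapes of each bucket into the next bucket (Lemma 7).
    for small, big in zip(order, order[1:]):
        merged, buckets[small] = divmod(buckets[small], vol(big) // vol(small))
        buckets[big] += merged
    # Phase 2: pairwise merging inside the largest bucket; each step
    # doubles the volume and strands at most one box (Lemma 9).
    count, volume = buckets[order[-1]], vol(order[-1])
    while count > 1:
        count, volume = count // 2, volume * 2
    return volume  # volume of the single remaining box R: > |Λ|/2 by Lemma 9

# Boxes of shapes 1x1, 2x2 and 4x4 (total volume 55): R gets volume 32.
print(bucket_merge([(1, 1)] * 11 + [(2, 2)] * 3 + [(4, 4)] * 2))
```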

We now analyze the details of the algorithm. Recall that the packing algorithm starts by rounding each side $\lambda_{c,i}$ up to the nearest power of two $\hat\lambda_{c,i}$, obtaining rounded hyperrectangles $\hat\Lambda_c$. That is, for each machine c and dimension i, we find $\alpha_{c,i}$ such that $2^{\alpha_{c,i}-1} < \lambda_{c,i} \le 2^{\alpha_{c,i}} = \hat\lambda_{c,i}$. Each side of $\Lambda_c$ is rounded independently. This means that for two hyperrectangles $\hat\Lambda_c, \hat\Lambda_{c'}$ with $c \ne c'$, it is possible that $\hat\lambda_{c,i} = \hat\lambda_{c',i}$ for some but not all variables $x_i$.

Lemma 6.

For any two machines with weights $w_c$ and $w_{c'}$ such that $w_c \le w_{c'}$, and for any variable $x_i$, we have $\hat\lambda_{c,i} \le \hat\lambda_{c',i}$.

Proof.

Since $w_c \le w_{c'}$, we know that $(w_c/\lVert\mathbf{w}\rVert_v)^{v_i}\, n \le (w_{c'}/\lVert\mathbf{w}\rVert_v)^{v_i}\, n$. This means that $\lambda_{c,i} \le \lambda_{c',i}$, and then also $\hat\lambda_{c,i} \le \hat\lambda_{c',i}$.

We now create buckets $B_1,\dots,B_b$ from the hyperrectangles $\{\hat\Lambda_c\}_{c\in[p]}$, one bucket for each set of hyperrectangles with the same dimensions. That is, for any two hyperrectangles $\hat\Lambda_c, \hat\Lambda_{c'}$ in the same bucket, $\hat\lambda_{c,i} = \hat\lambda_{c',i}$ for all $x_i$. We order the buckets in increasing order of the volume of the hyperrectangles they contain, denoted $V[B_t]$.

Lemma 7.

Let $B_t, B_{t'}$ be buckets with $t < t'$. Then, $V[B_{t'}]/V[B_t]$ hyperrectangles from $B_t$ can be packed to form one hyperrectangle with the same shape as the hyperrectangles in $B_{t'}$.

Proof.

Let $\hat\Lambda_c \in B_t$ and $\hat\Lambda_{c'} \in B_{t'}$. By Lemma 6, every dimension of $\hat\Lambda_{c'}$ is at least as big as the corresponding dimension of $\hat\Lambda_c$. More specifically, since the side lengths are of the form $2^{\alpha_{c,i}}$, we know that $\hat\lambda_{c',i} = 2^{a_{i,t}}\, \hat\lambda_{c,i}$ for some $a_{i,t} \in \mathbb{N}$. For some dimension i, take $2^{a_{i,t}}$ hyperrectangles from bucket $B_t$ and stack them together along dimension i. This creates one hyperrectangle whose side along dimension i matches that of $\hat\Lambda_{c'}$. We continue this process across all other dimensions. Let $a_t = \sum_{i\in[k]} a_{i,t}$. Then, this process uses $2^{a_t} = V[B_{t'}]/V[B_t]$ hyperrectangles of shape $\hat\Lambda_c$. Note that $a_{i,t} > 0$ for at least one i, since otherwise $\hat\Lambda_c = \hat\Lambda_{c'}$ and the two hyperrectangles would be in the same bucket.

The above lemma means that for each adjacent pair of buckets $B_t, B_{t+1}$, if $B_t$ contains at least $V[B_{t+1}]/V[B_t]$ hyperrectangles, we can merge some of them into one hyperrectangle in $B_{t+1}$. The packing algorithm merges as many hyperrectangles as possible, starting with the smallest bucket. When no merges are possible anymore, each bucket $B_t$ contains at most $V[B_{t+1}]/V[B_t] - 1$ hyperrectangles, since otherwise another merge would be possible. We can now show that, using only the hyperrectangles in the largest bucket $B_b$, we can almost cover the whole output space.

Lemma 8.

Let $p_t$ be the number of hyperrectangles in bucket $B_t$, for $t \in \{1,\dots,b\}$. Then, $|\Lambda| < (1 + p_b)\, V[B_b]$.

Proof.

We use the observation that across all buckets the total volume is at least |Λ|. Then:

$|\Lambda| \le \sum_{t=1}^{b} p_t V[B_t] = p_b V[B_b] + \sum_{t=1}^{b-1} p_t V[B_t] \le p_b V[B_b] + \sum_{t=1}^{b-1}\left(\frac{V[B_{t+1}]}{V[B_t]} - 1\right) V[B_t] = p_b V[B_b] + \sum_{t=1}^{b-1}\left(V[B_{t+1}] - V[B_t]\right) < (1 + p_b)\, V[B_b]$

where the second inequality holds because $p_t \le V[B_{t+1}]/V[B_t] - 1$, and the last step holds because the sum telescopes to $V[B_b] - V[B_1] < V[B_b]$.

We now pack Λ using only the $p_b$ hyperrectangles in the last bucket $B_b$. Let $\hat n$ be the domain size n rounded up to the nearest power of two. Note that no dimension of a hyperrectangle $\hat\Lambda_c \in B_b$ is greater than $\hat n$: since $\lambda_{c,i} \le n$, we have $\hat\lambda_{c,i} \le \hat n$. We merge the hyperrectangles in $B_b$ in the following way. Find a minimum dimension i of the hyperrectangles in $B_b$, and pairwise merge hyperrectangles of $B_b$ into hyperrectangles of volume $2V[B_b]$ by placing them adjacent along dimension i. This creates a new bucket $B_{b+1}$ with $\lfloor p_b/2 \rfloor$ hyperrectangles, leaving at most one hyperrectangle of $B_b$ unmerged. This process is repeated on the hyperrectangles of $B_{b+1}$, and so on, until a bucket $B_{b+d}$ is obtained that contains exactly one hyperrectangle, so that no further merges are possible. We now show that we can cover Λ using just this one hyperrectangle of $B_{b+d}$, scaled up by at most a constant factor.

Lemma 9.

$V[B_{b+d}] > |\Lambda|/2$.

Proof.

Let β be the number of hyperrectangles in $B_b$ after the first merging phase. Denote by $p_t$ the number of hyperrectangles left in bucket $B_t$ after the last merge step, for each $t \in \{b,\dots,b+d\}$; note that $p_t \in \{0,1\}$. Since $V[B_{t+1}]/V[B_t] = 2$,

$\sum_{t=b}^{b+d-1} p_t V[B_t] \le \sum_{t=b}^{b+d-1} V[B_t] \le V[B_{b+d}] - V[B_b]$

Moreover, we have:

$\beta\, V[B_b] = \sum_{t=b}^{b+d} p_t V[B_t] = V[B_{b+d}] + \sum_{t=b}^{b+d-1} p_t V[B_t] \le 2V[B_{b+d}] - V[B_b]$

Finally, by reorganizing the above inequality and applying Lemma 8, we obtain $V[B_{b+d}] \ge (\beta + 1)\, V[B_b]/2 > |\Lambda|/2$.

The above lemma shows that the hyperrectangle $R \in B_{b+d}$ almost covers Λ. There might, however, exist variables $x_i$ such that $|R_i| < n$, where $|R_i|$ is the side length of R along $x_i$. We scale R in each dimension i by the factor $f_i = \max\{n/|R_i|, 1\}$. This guarantees that $|R_i| \ge n$ for each $i \in [k]$, and hence that R covers Λ. To scale R by a factor $f_i$ in dimension i, we scale each $\hat\Lambda_c$ that is packed into R by the same factor $f_i$ in dimension i, which gives the final hyperrectangles, denoted $\bar\Lambda_c$. If hyperrectangle c is packed into R, then $\bar\lambda_{c,i} = f_i\, \hat\lambda_{c,i}$; if it is not packed into R, then $\bar\lambda_{c,i} = 0$, since the hyperrectangle is not used.

Lemma 10.

Let $R \in B_{b+d}$ be the remaining hyperrectangle. Scale R in each dimension i by the factor $f_i = \max\{n/|R_i|, 1\}$. Then R covers Λ, and the scaling is such that for each subset $S \subseteq [k]$: $\prod_{i \in S} f_i \le 2^{k+1}$.

Proof.

The choice of $f_i = \max\{n/|R_i|, 1\}$ means that $|R_i| f_i \ge n$. Hence Λ is covered. By the previous lemma, $\prod_{i\in[k]} |R_i| \ge n^k/2$. Note that the hyperrectangles in $B_b,\dots,B_{b+d}$, and hence also R, have side lengths at most $\hat n$ (n rounded up to the nearest power of two), since we always merged along a smallest dimension first. Furthermore, $\hat n < 2n$. Therefore, for each $i\in[k]$, $|R_i| \le 2n$. Now,

$\prod_{i\in[k]} f_i = \prod_{i\in[k]} \max\left\{\frac{n}{|R_i|},\, 1\right\} = \prod_{i\in[k]} \frac{\max\{n, |R_i|\}}{|R_i|} = \frac{\prod_{i\in[k]} \max\{n, |R_i|\}}{V[R]} \le \frac{(2n)^k}{n^k/2} = 2^{k+1}$

For any $S \subseteq [k]$, the product $\prod_{i\in S} f_i$ is at most the product above, since $f_i \ge 1$ for all $i \in [k]$.

We can now prove the main theorem about packing.

Proof of Theorem 5.

Let $\Lambda_c$ be the hyperrectangle of machine c as given by Theorem 3, and let $\bar\Lambda_c$ be the hyperrectangle after the packing algorithm has run. The packing algorithm first increases each side $\lambda_{c,i}$ by rounding it up to $\hat\lambda_{c,i}$, which is at most a factor of 2 larger. If hyperrectangle c is included in the final hyperrectangle R, the sides of $\hat\Lambda_c$ may then be scaled up again, by a factor $f_i = \max\{n/|R_i|, 1\}$, to $\bar\lambda_{c,i}$. For an atom $S_j$ with arity $r_j$, we now have:

$\frac{|\pi_{S_j}\bar\Lambda_c|}{|\pi_{S_j}\Lambda_c|} = \prod_{x_i \in S_j} \frac{\hat\lambda_{c,i}}{\lambda_{c,i}} \cdot \frac{\bar\lambda_{c,i}}{\hat\lambda_{c,i}} \le \prod_{x_i \in S_j} 2 f_i \le 2^{r_j + k + 1}$

The second inequality comes from Lemma 10.

4.3 Putting Everything Together

We can now prove the main theorems in this section.

Proof of Theorem 1.

In the worst case, every possible tuple in $\Lambda_c$ exists. We calculate the load $L_c^j$ that machine c incurs from relation $S_j$. Denote by $n_c^j$ the number of tuples machine c receives from $S_j$. We get

$L_c^j = \frac{n_c^j \log n}{w_c} = \frac{\log n}{w_c}\, |\pi_{S_j}\Lambda_c| \le \frac{1}{w_c} \cdot \frac{w_c}{\lVert\mathbf{w}\rVert_v}\, n^r \log n = O\!\left(\frac{n^r \log n}{\lVert\mathbf{w}\rVert_v}\right)$

Here the inequality comes from Theorem 3 (the packing of Theorem 5 increases it by at most a constant factor). The result follows since the query has a constant number of atoms and $m = \theta n^r$.

Proof of Theorem 2.

Denote by $N_c^j$ the number of bits received by machine c from relation $S_j$. The probability that a tuple $a_j \in S_j$ maps to machine c is the following:

$\Pr\left[(\pi_{S_j}\mathbf{h})(a_j) \in \pi_{S_j}\Lambda_c\right] = \frac{|\pi_{S_j}\Lambda_c|}{n^{r_j}} \le \frac{w_c}{\lVert\mathbf{w}\rVert_v}$

The inequality comes from Theorem 3. Note that since we use hashing and the instance is a matching database, the probability of being mapped to machine c is the same and independent across all tuples. Therefore, $n_c^j \sim \mathrm{Bin}(m, w_c/\lVert\mathbf{w}\rVert_v)$. We get the following expected value:

$E[L_c^j] = \frac{1}{w_c}\, E[n_c^j] \log n = \frac{1}{w_c} \cdot \frac{w_c\, m}{\lVert\mathbf{w}\rVert_v} \log n = O\!\left(\frac{m \log n}{\lVert\mathbf{w}\rVert_v}\right)$

We also show that the probability that the load is more than this is exponentially small. Indeed, applying the Chernoff bound, which we describe in the full paper, we have:

$\Pr\!\left[L_c^j \ge (1+\delta)\frac{m \log n}{\lVert\mathbf{w}\rVert_v}\right] = \Pr\!\left[N_c^j \ge (1+\delta)\frac{w_c\, m \log n}{\lVert\mathbf{w}\rVert_v}\right] = \Pr\!\left[n_c^j \ge (1+\delta)\frac{m\, w_c}{\lVert\mathbf{w}\rVert_v}\right] \le \exp\!\left(-\frac{\delta^2 m\, w_c}{3\lVert\mathbf{w}\rVert_v}\right)$

We obtain the high-probability bound by taking a union bound across all atoms and machines.

5 Lower Bounds

We present a lower bound on the load when machines have linear cost functions and all atoms have the same cardinality. This lower bound applies to both the sparse and the dense case, and considers the behavior of the algorithm over a probability distribution of inputs.

We consider again, for each machine c, a linear cost function $g_c(N) = N/w_c$ with weights $\mathbf{w} = (w_1,\dots,w_p)$. Let $\mathbf{u} = (u_1,\dots,u_l)$ be a fractional edge packing for q, with $u = \sum_{j\in[l]} u_j$. Moreover, let m be the cardinality of every relation. Then, define

$L^{\mathrm{lower}}_{\mathbf{u}} := \frac{m}{\left(\sum_{c\in[p]} w_c^u\right)^{1/u}} = \frac{m}{\lVert\mathbf{w}\rVert_u}$
Theorem 11.

Let q be a CQ and let $\mathbf{u} = (u_1,\dots,u_l)$ be a fractional edge packing for q. Consider the uniform probability distribution of matching databases with m tuples per relation over domain [n]. Denote by $E_I[|q(I)|]$ the expected number of output tuples over instances I drawn from this distribution. Then, any one-round algorithm that in expectation outputs at least $E_I[|q(I)|]$ tuples has load $\Omega(L^{\mathrm{lower}}_{\mathbf{u}})$ in the linear cost model. The same lower bound holds for the uniform distribution of θ-dense instances over domain [n].

Each fractional edge packing $\mathbf{u}$ gives a different lower bound; the highest is obtained by minimizing $\lVert\mathbf{w}\rVert_u$. Since the u-norm is a decreasing function of u, the highest lower bound is given by a maximum fractional edge packing. The value of the maximum fractional edge packing equals the value of the minimum fractional vertex cover by LP duality; hence the lower bound matches our upper bounds within a logarithmic factor.
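For a concrete query, a maximum fractional edge packing can be found with a small linear program. A sketch using scipy (our own code, not part of the paper):

```python
from scipy.optimize import linprog

def max_edge_packing(atom_vars, k):
    """Maximize sum_j u_j subject to u_j >= 0 and, for every variable x_i,
    the sum of u_j over the atoms containing x_i being at most 1.
    atom_vars[j] holds the variable indices of atom S_j; k is the number
    of variables."""
    l = len(atom_vars)
    c = [-1.0] * l  # linprog minimizes, so negate the objective
    A = [[1.0 if i in atom_vars[j] else 0.0 for j in range(l)]
         for i in range(k)]
    res = linprog(c, A_ub=A, b_ub=[1.0] * k, bounds=[(0.0, None)] * l)
    return list(res.x), -res.fun

# Triangle query q(x,y,z) :- S1(x,y), S2(y,z), S3(z,x):
# maximum packing u = (1/2, 1/2, 1/2), with value u = 3/2.
print(max_edge_packing([{0, 1}, {1, 2}, {2, 0}], k=3))
```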

Theorem 12.

$\min_{\mathbf{v}} \left(L^{\mathrm{upper}}_{\mathbf{v}}\right) = \log n \cdot \max_{\mathbf{u}} \left(L^{\mathrm{lower}}_{\mathbf{u}}\right)$

We next give an overview of the proof of Theorem 11, with some details left to the full paper. We assume that initially each relation is stored at a separate location. Let $msg_j$ be the bit string that a fixed machine receives from $S_j$, and let $msg$ be the concatenation of the $msg_j$ over all j. Note that $|msg|$ is the number of bits that the machine receives. We let $Msg(I)$ be the random variable mapping a database instance to the value of $msg$; $Msg_j(S_j)$ is defined in the same way but maps to $msg_j$.

Definition 13.

Let R be a relation, and let $a \in R$ be a tuple. We say that a is known by the machine, given message $msg$, if $a \in R$ for all database instances I with $Msg(I) = msg$. We denote the set of tuples of R known by machine c given message $msg$ by $K^c_{msg}(R)$. Furthermore, we define $K_{msg}(R) = \bigcup_c K^c_{msg}(R)$.

For each $S_j$, let $f_{c,j} \in [0,1]$ be the maximum length of the message $msg_j$ that machine c receives (across all instances in the distribution), divided by $M_j$, the number of bits in the encoding of $S_j$. Note that since we use an optimal encoding, $M_j$ is the entropy of the input distribution.

Lemma 14.

In a θ-dense database, Mj=Ω(mj).

Lemma 15.

In a matching database, Mj=Ω(mj).

To show Theorem 11, we use the previous lemmas together with the following lemma, all proven in the full paper [5]. The lemma below was proven in [2] for matching databases; however, some assumptions about the data distribution differ here, and we also show the lemma for the θ-dense database distribution.

Lemma 16.

Let $\mathbf{u} = (u_1,\dots,u_l)$ be a fractional edge packing of q. Then the expected number of known output tuples is

$E\left[|K^c_{msg}(q(I))|\right] \le \prod_{j\in[l]} f_{c,j}^{\,u_j}\; E\left[|q(I)|\right]$

We can now prove the main theorem of this section. We will use the notation $u = \sum_{j\in[l]} u_j$.

Proof of Theorem 11.

From the definition of the load, $f_{c,j} \le L w_c / M$, where M is the number of bits encoding a relation. Applying Lemma 16,

$E\left[|K^c_{msg}(q(I))|\right] \le E\left[|q(I)|\right] \prod_{j\in[l]} f_{c,j}^{\,u_j} \le E\left[|q(I)|\right] \prod_{j\in[l]} \left(\frac{L w_c}{M}\right)^{u_j} = E\left[|q(I)|\right] \left(\frac{L w_c}{M}\right)^{u}$

We now use that $|K_{msg}(q(I))| = \left|\bigcup_{c\in[p]} K^c_{msg}(q(I))\right| \le \sum_{c\in[p]} |K^c_{msg}(q(I))|$:

$E\left[|K_{msg}(q(I))|\right] \le \sum_{c\in[p]} \left(\frac{L w_c}{M}\right)^{u} E\left[|q(I)|\right] = \frac{L^u}{M^u}\, E\left[|q(I)|\right] \sum_{c\in[p]} w_c^u$

If the algorithm is to produce the whole output of the query, the expected number of known output tuples has to be at least the expected output size of the query, so

$\frac{L^u}{M^u}\, E\left[|q(I)|\right] \sum_{c\in[p]} w_c^u \ge E\left[|q(I)|\right]$

Solving for L and applying Lemma 14 or 15 ($M = \Omega(m)$) gives $L = \Omega(m/\lVert\mathbf{w}\rVert_u)$. This concludes the proof.

6 General Cost Functions

In previous sections, we considered machines with linear cost functions. In this section, we extend the result to a broader class of cost functions, where each machine c is equipped with a general cost function gc.

Definition 17.

A cost function $g : \mathbb{R}_+ \to \mathbb{R}_+$ is well-behaved if it satisfies the following:

  1. $g(0) = 0$;

  2. g is increasing;

  3. there exists a constant $a > 1$ such that for all $x \ge 1$ and $\delta > 0$, $g((1+\delta)x) \le (1+\delta)^a\, g(x)$.

These restrictions on a cost function are natural: the cost of receiving zero bits should be zero, and the cost of receiving additional bits should be positive. The last condition states that a cost function cannot grow faster than some polynomial at any point. This condition is not needed for the lower bound; without it, however, it is difficult to obtain a matching upper bound, since receiving even one bit more than expected could otherwise increase the cost of a machine arbitrarily.

Definition 18.

For a well-behaved cost function g, define the function $g^\ast : \mathbb{R}_+ \to \mathbb{R}_+$ by:

$g^\ast(L) := \max\{x \in \mathbb{R}_+ : g(x) \le L\}$

Under the above definition, $g_c^\ast(L)$ can be interpreted as the maximum number of bits that the cost function permits machine c to receive with load at most L. The restriction $g_c(0) = 0$ implies that if $g_c^\ast(L)$ is defined for some L, it is also defined for all $L' \in \mathbb{R}_+$ with $L' < L$.
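Since g is increasing, $g^\ast(L)$ can be computed numerically by binary search. A sketch (our own code; it assumes some upper bound hi on the answer is known, e.g. the total number of input bits):

```python
def g_star(g, L, hi, tol=1e-9):
    """g*(L) = max { x : g(x) <= L } for an increasing g with g(0) = 0,
    found by binary search on [0, hi]."""
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if g(mid) <= L:
            lo = mid  # mid is feasible, so the maximum lies to its right
        else:
            hi = mid
    return lo

# Example: g(x) = x^2 / w with w = 4 gives g*(L) = sqrt(L * w).
print(g_star(lambda x: x * x / 4.0, L=9.0, hi=100.0))  # -> ~6.0
```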

6.1 Lower Bound

Given a query q, consider any fractional edge packing $\mathbf{u} = (u_1,\dots,u_l)$ with $u = \sum_{j\in[l]} u_j$. Suppose each relation has uniform cardinality m. Then, define $\bar L^{\mathrm{lower}}_{\mathbf{u}}$ to be the minimum $L \ge 0$ that satisfies the following inequality:

$\sum_{c\in[p]} \left(g_c^\ast(L)\right)^u \ge m^u$
Theorem 19.

Let q be a CQ and let $\mathbf{u} = (u_1,\dots,u_l)$ be a fractional edge packing for q. Consider the uniform probability distribution of matching databases with m tuples per relation over domain [n]. Denote by $E_I[|q(I)|]$ the expected number of output tuples over instances I drawn from this distribution. Then any one-round algorithm with well-behaved cost functions $\{g_c\}_c$ that in expectation outputs at least $E_I[|q(I)|]$ tuples has load $\Omega(\bar L^{\mathrm{lower}}_{\mathbf{u}})$. The same lower bound holds for the uniform distribution of θ-dense instances over domain [n].

Proof.

This proof is similar to the proof of Theorem 11. We apply Lemma 16 again and sum over all machines:

$E\left[|K_{msg}(q(I))|\right] \le E\left[|q(I)|\right] \sum_{c\in[p]} \prod_{j\in[l]} f_{c,j}^{\,u_j} \le E\left[|q(I)|\right] \sum_{c\in[p]} \prod_{j\in[l]} \left(\frac{g_c^\ast(L)}{M}\right)^{u_j} = E\left[|q(I)|\right] \sum_{c\in[p]} \left(\frac{g_c^\ast(L)}{M}\right)^{u}$

Here the second inequality comes from $f_{c,j}\, M \le g_c^\ast(L)$, since a machine cannot receive more bits than the load permits. Using $M = \Omega(m)$ by Lemma 14 or 15, and requiring that $E[|K_{msg}(q(I))|] \ge E[|q(I)|]$, proves the theorem.

The highest lower bound is given by the $\mathbf{u}$ that maximizes $\bar L^{\mathrm{lower}}_{\mathbf{u}}$. We now prove that the maximum fractional edge packing always gives the best lower bound. We assume that $m \ge 1$, i.e., the database is not empty.

Lemma 20.

Let $\mathbf{u}^\ast$ be the maximum fractional edge packing. Then, $\bar L^{\mathrm{lower}}_{\mathbf{u}^\ast} = \max_{\mathbf{u}} \bar L^{\mathrm{lower}}_{\mathbf{u}}$.

Proof.

Let $L^\ast = \bar L^{\mathrm{lower}}_{\mathbf{u}^\ast}$, and suppose L is the lower bound given by another edge packing $\mathbf{u}$ with value $u \le u^\ast$. It suffices to show that $L^\ast$ satisfies $\sum_{c\in[p]} (g_c^\ast(L^\ast))^{u} \ge m^{u}$, since then the smallest load satisfying this inequality, namely L, is at most $L^\ast$.

Note that for all c, $(g_c^\ast(L^\ast))^{u - u^\ast} \ge m^{u - u^\ast}$, since $g_c^\ast(L^\ast) \le m$ and $u - u^\ast \le 0$. Then,

$\sum_{c\in[p]} \left(g_c^\ast(L^\ast)\right)^{u} = \sum_{c\in[p]} \left(g_c^\ast(L^\ast)\right)^{u^\ast} \left(g_c^\ast(L^\ast)\right)^{u - u^\ast} \ge m^{u - u^\ast} \sum_{c\in[p]} \left(g_c^\ast(L^\ast)\right)^{u^\ast} \ge m^{u - u^\ast}\, m^{u^\ast} = m^{u}$

where the last inequality follows from the fact that $L^\ast$ satisfies $\sum_{c\in[p]} (g_c^\ast(L^\ast))^{u^\ast} \ge m^{u^\ast}$.

Example 21.

As an example, consider cost functions of the form $g_c(x) = x^a / w_c$, where $a > 0$. Then $g_c^\ast(L) = (L w_c)^{1/a}$. The lower bound then becomes:

$L \ge \max_{\mathbf{u}} \frac{m^a}{\left(\sum_{c\in[p]} w_c^{u/a}\right)^{a/u}}$

6.2 Upper Bound

We give an algorithm for evaluating full CQs with equal-cardinality atoms, where each cost function $g_c$ is well-behaved. The approach is similar to that for linear cost functions, but we need another method to pick the dimensions of each hyperrectangle $\Lambda_c$.

The algorithm requires the numerical value of $\bar L^{\mathrm{lower}} = \max_{\mathbf{u}} \bar L^{\mathrm{lower}}_{\mathbf{u}}$, the lower bound on the load. We therefore need a method to find $\bar L^{\mathrm{lower}}_{\mathbf{u}}$ for the maximum fractional edge packing $\mathbf{u}$. For this, we need to find the minimal L for which $f(L) = \sum_{c\in[p]} (g_c^\ast(L))^u - m^u$ is nonnegative. We know that L is more than 0 and at most $L_{\max} = \min_{c\in[p]} g_c(m)$, since the query can be computed with load $L_{\max}$ using just one machine. $\bar L^{\mathrm{lower}}_{\mathbf{u}}$ can thus be found by binary search on this interval, since f is nondecreasing.
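A sketch of this search (our own code; g_stars holds the functions $g_c^\ast$ of Definition 18, which can themselves be computed as above):

```python
def lower_bound_load(g_stars, u, m, L_max, tol=1e-9):
    """Smallest L in (0, L_max] with sum_c g_c*(L)^u >= m^u, found by
    binary search. Feasibility is monotone in L because every g_c* is
    nondecreasing; L_max = min_c g_c(m) is always a valid upper end."""
    def feasible(L):
        return sum(gs(L) ** u for gs in g_stars) >= m ** u
    lo, hi = 0.0, L_max
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if feasible(mid):
            hi = mid  # mid suffices, so the minimum lies to its left
        else:
            lo = mid
    return hi
```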

Given the value $L = \bar L^{\mathrm{lower}}$, our algorithm computes the hyperrectangles using the same general technique as in Section 4, with the difference that the dimensions are chosen differently. In particular, for the minimum vertex cover $\mathbf{v}$, we calculate dimension i for machine c as follows:

$\lambda_{c,i} := \left(\frac{g_c^\ast(L)}{m}\right)^{v_i} \cdot n$

As we show in the full paper [5], the dimensions we choose are such that we can still apply the same packing technique as in Section 4. Thus:

Theorem 22 (Dense Inputs).

Let q be a full CQ with uniform arity r and a θ-dense input I with domain [n] (every relation has size $m = \theta n^r$). Then, we can evaluate q in one round with well-behaved cost functions $\{g_c\}_c$ with load $O(\bar L^{\mathrm{lower}} \log n)$.

Theorem 23 (Sparse Inputs).

Let q be a full CQ over a matching instance I with uniform cardinalities m and domain [n]. Then, we can evaluate q in one round with well-behaved cost functions $\{g_c\}_c$ with load $O(\bar L^{\mathrm{lower}} \log n)$, with high probability.

The proofs of the above theorems are provided in the full paper.

7 Different Cardinality Relations

We now move on to the general case where we do not require the cardinality of every atom to be the same. We will assume linear cost functions, that is, cost functions have the form gc(N)=N/wc. We will give a general lower bound and matching upper bounds for the cartesian product, the binary join, the star query and the triangle query.

7.1 Lower Bound

An important difference between the lower bound we present next and the lower bound for queries of equal cardinalities is that we need to consider a different edge packing for each machine. We denote the edge packing for query q and machine c by $\mathbf{u}_c = (u_{c,1},\dots,u_{c,l})$.

Theorem 24.

Let q be a CQ and let $\mathbf{u}_c = (u_{c,1},\dots,u_{c,l})$ be any fractional edge packing for q and machine c. Consider the uniform probability distribution of matching databases with $m_j$ tuples in relation $S_j$ over domain [n]. Denote by $E_I[|q(I)|]$ the expected number of output tuples over instances I drawn from this distribution. Then, any one-round algorithm with linear cost functions that in expectation outputs at least $E_I[|q(I)|]$ tuples has load $\Omega(L)$, where L is the smallest load that satisfies the following inequality:

$\sum_{c\in[p]} \prod_{j\in[l]} \left(\frac{L w_c}{m_j}\right)^{u_{c,j}} \ge 1 \qquad (1)$

The same lower bound holds for the uniform distribution of θ-dense instances over domain [n].

Proof.

The proof is similar to the proofs of the previous lower bounds. We use Lemma 16 to bound the number of output tuples produced by one machine, and require that all machines together produce at least $E[|q(I)|]$ output tuples:

$E\left[|K^c_{msg}(q(I))|\right] \le \prod_{j\in[l]} f_{c,j}^{\,u_{c,j}}\, E\left[|q(I)|\right] = E\left[|q(I)|\right] \prod_{j\in[l]} \left(\frac{f_{c,j} M_j}{M_j}\right)^{u_{c,j}} \le E\left[|q(I)|\right] \prod_{j\in[l]} \left(\frac{L w_c}{M_j}\right)^{u_{c,j}}$

Here we used that $f_{c,j} M_j \le L w_c$: this is because $L w_c$ is the maximum number of bits machine c can receive from each relation while keeping the load at most L. The theorem follows by summing across all machines and applying Lemma 14 or 15.

The highest lower bound L is given by the set of edge packings $\{\mathbf{u}_c\}_{c\in[p]}$, one for each machine, that maximizes the load needed to satisfy inequality (1). Next, we show how to compute the numerical value of L, which will be used by the upper bound.

Lemma 25.

Let L denote the maximum lower bound on the load from Theorem 24. Then:

$\frac{\max_j m_j}{\sum_{c\in[p]} w_c} \;\le\; L \;\le\; \frac{\max_j m_j}{\max_{c\in[p]} w_c}$
Proof.

We start with the first inequality. Note that the edge packing that assigns $u_j = 1$ to the atom $S_j$ of maximum cardinality and weight 0 to all other atoms is a valid edge packing. For this packing, inequality (1) becomes $\sum_{c\in[p]} \frac{L w_c}{m_j} \ge 1$, which gives the lower bound $L \ge m_j / \sum_{c\in[p]} w_c = \max_j m_j / \sum_{c\in[p]} w_c$.

For the second inequality, if only the machine with the largest weight is used in the computation, the load is $\max_j m_j / \max_{c\in[p]} w_c$ (up to a constant factor, since the query has a constant number of atoms). This load can therefore always be achieved by computing the query on the largest machine alone. Since L is the optimal load, it is never larger than this.

Note that $p \max_{c\in[p]} w_c \ge \sum_{c\in[p]} w_c$. Together with the lemma above, this shows that the range of possible values for L spans at most a factor

$\frac{\max_j m_j / \max_{c\in[p]} w_c}{\max_j m_j / \sum_{c\in[p]} w_c} = \frac{\sum_{c\in[p]} w_c}{\max_{c\in[p]} w_c} \le \frac{p \max_{c\in[p]} w_c}{\max_{c\in[p]} w_c} = p$

Since the range of possible values of L spans at most a factor p, we can find L by starting with the guess $\hat L = \max_j m_j / \sum_{c\in[p]} w_c$. We check whether the guess is correct by finding the edge packings $\{\mathbf{u}_c\}_{c\in[p]}$ for each machine and checking whether $\sum_{c\in[p]} \prod_{j\in[l]} (\hat L w_c / m_j)^{u_{c,j}}$ is at least 1. If this is not the case, we double our guess $\hat L$. Since the range of possible values of L spans only a factor p, we iterate this procedure at most $\log p$ times, as sketched below.
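A sketch of this doubling search (our own code; the per-machine edge packing is computed with the log-space linear program described in the next subsection, here via scipy):

```python
import math
from scipy.optimize import linprog

def machine_term(L, w_c, cards, atom_vars, k):
    """Minimum over edge packings u_c of prod_j (L*w_c/m_j)^{u_{c,j}}:
    an LP in log space (minimize sum_j u_j * log(L*w_c/m_j) subject to
    the packing constraints)."""
    l = len(cards)
    c = [math.log(L * w_c / m_j) for m_j in cards]
    A = [[1.0 if i in atom_vars[j] else 0.0 for j in range(l)]
         for i in range(k)]
    res = linprog(c, A_ub=A, b_ub=[1.0] * k, bounds=[(0.0, None)] * l)
    return math.exp(res.fun)

def find_load(weights, cards, atom_vars, k):
    """Doubling search for the lower bound L of Theorem 24: start at the
    floor of Lemma 25 and double until inequality (1) holds; at most
    log p doublings, since the range spans a factor p."""
    L = max(cards) / sum(weights)
    while sum(machine_term(L, w, cards, atom_vars, k) for w in weights) < 1.0:
        L *= 2.0
    return L
```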

7.2 Upper Bound

We now show how to match the lower bound for the cartesian product, the binary join, the star query, and the triangle query. For general full CQs, finding a matching algorithm remains an open problem. The difficulty is that the lower bound of Theorem 24 may assign a different edge packing $\mathbf{u}_c$ to each machine. A linear program for the edge packing of a machine can be obtained by minimizing the logarithm of $\prod_{j\in[l]} (L w_c / m_j)^{u_{c,j}}$. By considering the dual program, it is possible to find hyperrectangles $\Lambda_c$ such that the expected load of each machine matches the lower bound and the total volume of all $\Lambda_c$ covers Λ, similar to what was done in [2] for homogeneous machines. However, for the packing to work, we need all sides of the hyperrectangle $\Lambda_c$ to increase as the weight of the machine increases; it is not clear how to guarantee this, or whether it is possible. We have, however, been able to match the lower bound for specific queries, using the same algorithm as in previous sections but modifying how the shapes of the subspaces are picked. Here, we describe how to do this for the cartesian product and the binary join. Proofs of correctness are provided in the full paper [5]; there, we also show that a property similar to Lemma 6 holds, which means we can reuse the packing method presented earlier, and we also handle the star query and the triangle query.

Cartesian Product

We consider the cartesian product q(x,y) :- S1(x),S2(y).

Let L be the lower bound on the load. For machine c, let the side length of $\Lambda_c$ along dimension $j \in \{1, 2\}$ be

$\lambda_{c,j} := \min\left(\frac{L w_c}{M_j},\, 1\right) \cdot n = \left(\frac{L w_c}{M_j}\right)^{u_{c,j}} n$

where the edge packing of machine c sets $u_{c,j} = 1$ if $L w_c < M_j$ and $u_{c,j} = 0$ otherwise.

Binary Join

Next, consider the binary join.

q(x,y,z) :- S1(x,z),S2(y,z)

Let L be the lower bound on the load. Let the side lengths of Λc be the following:

$\lambda_{c,x} := n, \qquad \lambda_{c,y} := n, \qquad \lambda_{c,z} := \frac{L w_c}{\max(M_1, M_2)} \cdot n$
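Both choices are straightforward to compute; a small sketch (our own code, with the cap at n in the join's z-dimension added as a safeguard, since a side never needs to exceed the domain):

```python
def cartesian_sides(L, weights, M, n):
    """q(x,y) :- S1(x), S2(y): machine c covers a min(L*w_c/M_j, 1)
    fraction of dimension j, where M[j] is the encoding size of S_{j+1}."""
    return [[min(L * w / M_j, 1.0) * n for M_j in M] for w in weights]

def binary_join_sides(L, weights, M1, M2, n):
    """q(x,y,z) :- S1(x,z), S2(y,z): the full range on x and y, and a
    slice of z proportional to the machine's weight."""
    return [(n, n, min(L * w / max(M1, M2), 1.0) * n) for w in weights]
```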

Matching the Lower Bound

The theorems below show that the algorithms above match the lower bound. Proofs are provided in the full paper [5].

Theorem 26 (Dense).

Let q be one of the cartesian product, the binary join, the star query, and the triangle query over a θ-dense input I, where the arities $r_j$ of the relations need not be uniform. Let L be the lower bound on the load. Then we can evaluate q on heterogeneous machines with weights $w_1,\dots,w_p$ with load $O(L \log n)$.

Theorem 27 (Sparse).

Let q be one of the cartesian product, the binary join, the star query, and the triangle query over a matching database I where atom $S_j$ has cardinality $m_j$. Let L be the lower bound on the load. Then we can evaluate q on heterogeneous machines with weights $w_1,\dots,w_p$ with load $O(L \log n)$, with high probability.

8 Conclusion

In this paper, we studied the problem of computing full Conjunctive Queries in parallel on heterogeneous machines. Our algorithms are inspired by the HyperCube algorithm but take the new approach of optimally partitioning the space of possible output tuples among the machines. This yields an optimal algorithm for queries where all relations have the same cardinality, for both linear and more general cost functions, and optimal algorithms for specific queries when the relations have arbitrary cardinalities.

References

  • [1] Foto N. Afrati and Jeffrey D. Ullman. Optimizing joins in a map-reduce environment. In Ioana Manolescu, Stefano Spaccapietra, Jens Teubner, Masaru Kitsuregawa, Alain Léger, Felix Naumann, Anastasia Ailamaki, and Fatma Özcan, editors, EDBT 2010, 13th International Conference on Extending Database Technology, Lausanne, Switzerland, March 22-26, 2010, Proceedings, volume 426 of ACM International Conference Proceeding Series, pages 99–110. ACM, 2010. doi:10.1145/1739041.1739056.
  • [2] Paul Beame, Paraschos Koutris, and Dan Suciu. Communication steps for parallel query processing. J. ACM, 64(6):40:1–40:58, 2017. doi:10.1145/3125644.
  • [3] Spyros Blanas, Paraschos Koutris, and Anastasios Sidiropoulos. Topology-aware parallel data processing: Models, algorithms and systems at scale. In 10th Conference on Innovative Data Systems Research, CIDR 2020, Amsterdam, The Netherlands, January 12-15, 2020, Online Proceedings. www.cidrdb.org, 2020. URL: http://cidrdb.org/cidr2020/papers/p10-blanas-cidr20.pdf.
  • [4] Benoît Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. The snowflake elastic data warehouse. In Fatma Özcan, Georgia Koutrika, and Sam Madden, editors, Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 215–226. ACM, 2016. doi:10.1145/2882903.2903741.
  • [5] Simon Frisk and Paraschos Koutris. Parallel query processing with heterogeneous machines, 2025. arXiv:2501.08896.
  • [6] Xiao Hu. Cover or pack: New upper and lower bounds for massively parallel joins. In Leonid Libkin, Reinhard Pichler, and Paolo Guagliardo, editors, PODS’21: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Virtual Event, China, June 20-25, 2021, pages 181–198. ACM, 2021. doi:10.1145/3452021.3458319.
  • [7] Xiao Hu and Paraschos Koutris. Topology-aware parallel joins. Proc. ACM Manag. Data, 2(2):97, 2024. doi:10.1145/3651598.
  • [8] Xiao Hu, Paraschos Koutris, and Spyros Blanas. Algorithms for a topology-aware massively parallel computation model. In Leonid Libkin, Reinhard Pichler, and Paolo Guagliardo, editors, PODS’21: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Virtual Event, China, June 20-25, 2021, pages 199–214. ACM, 2021. doi:10.1145/3452021.3458318.
  • [9] Xiao Hu and Yufei Tao. Parallel acyclic joins: Optimal algorithms and cyclicity separation. J. ACM, 71(1):6:1–6:44, 2024. doi:10.1145/3633512.
  • [10] Xiao Hu and Ke Yi. Instance and output optimal parallel algorithms for acyclic joins. In Dan Suciu, Sebastian Skritek, and Christoph Koch, editors, Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 450–463. ACM, 2019. doi:10.1145/3294052.3319698.
  • [11] Bas Ketsman, Dan Suciu, and Yufei Tao. A near-optimal parallel algorithm for joining binary relations. Log. Methods Comput. Sci., 18(2), 2022. doi:10.46298/LMCS-18(2:6)2022.
  • [12] Paraschos Koutris, Paul Beame, and Dan Suciu. Worst-case optimal algorithms for parallel query processing. In Wim Martens and Thomas Zeume, editors, 19th International Conference on Database Theory, ICDT 2016, Bordeaux, France, March 15-18, 2016, volume 48 of LIPIcs, pages 8:1–8:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2016. doi:10.4230/LIPICS.ICDT.2016.8.
  • [13] Yufei Tao. A simple parallel algorithm for natural joins on binary relations. In Carsten Lutz and Jean Christoph Jung, editors, 23rd International Conference on Database Theory, ICDT 2020, March 30-April 2, 2020, Copenhagen, Denmark, volume 155 of LIPIcs, pages 25:1–25:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPICS.ICDT.2020.25.
  • [14] Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. Amazon aurora: Design considerations for high throughput cloud-native relational databases. In Semih Salihoglu, Wenchao Zhou, Rada Chirkova, Jun Yang, and Dan Suciu, editors, Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 1041–1052. ACM, 2017. doi:10.1145/3035918.3056101.