Parallel Query Processing with Heterogeneous Machines
Abstract
We study the problem of computing a full Conjunctive Query in parallel using heterogeneous machines. Our computational model is similar to the MPC model, but each machine has its own cost function, mapping the number of bits it receives to a cost. An optimal algorithm should minimize the maximum cost across all machines. We consider algorithms over a single communication round and give a lower bound and a matching upper bound for databases where each relation has the same cardinality. We do this both for linear cost functions, as in previous work, and for more general cost functions. For databases with relations of different cardinalities, we also give a lower bound, together with matching upper bounds for specific queries: the cartesian product, the binary join, the star query, and the triangle query. Our approach is inspired by the HyperCube algorithm, but additional challenges arise when machines have heterogeneous cost functions.
Keywords and phrases: Joins, Massively Parallel Computation, Heterogeneous
2012 ACM Subject Classification: Theory of computation → Database theory
Editors: Sudeepa Roy and Ahmet Kara
1 Introduction
Large datasets are commonly processed using massively parallel systems. To analyze query processing in such a setting, Beame et al. [2] introduced the massively parallel computation (MPC) model. The MPC model considers a cluster with a shared-nothing architecture, where computation proceeds in rounds: each round consists of communication between machines, followed by computation on the locally stored data. The main measure of complexity in the MPC model is the load, which captures the maximum number of bits received by a machine. An efficient MPC algorithm is designed to make the load as small as possible.
However, the MPC model operates on an assumption of homogeneity: the cost of a machine does not depend on where the received data came from or on how powerful the machine is. This assumption is unrealistic, as the large-scale clusters on which massively parallel computation is performed are heterogeneous in practice. Heterogeneity can occur both in compute resources (processing speed, memory) and in the network that connects the machines.
In this work, we consider massively parallel data processing in clusters with heterogeneity in compute resources. We use a computational model that, similar to the MPC model, has a homogeneous network topology (every machine is connected directly to every other machine). However, each machine $i$ is equipped with its own cost function $f_i$: this function maps the number of bits $x_i$ the machine receives to a cost $f_i(x_i)$. The load of a round is then defined as the maximum cost across all machines, i.e., $L = \max_i f_i(x_i)$. The computational model in this paper captures the MPC model as the special case where each machine has the identity cost function $f_i(x) = x$. Our model is also a special instance of the topology-aware model in [3], however, one that has not been studied in prior work.
Based on the above heterogeneous model, we study the problem of computing join queries with minimum load. We focus on one-round algorithms, i.e., algorithms that perform only local computation after one round of communication. One-round algorithms are particularly relevant to data processing systems with a disaggregated storage architecture (e.g., Amazon Aurora [14], Snowflake [4]). These algorithms can be viewed as algorithms that send the data from the storage layer to the compute layer in such a way that no further communication has to be done in the compute layer. This paper therefore addresses the problem of optimally sending data from the storage layer to the compute layer when there is compute heterogeneity.
Our Contributions.
The main contribution of this work is upper and lower bounds for the load of computing a join query (corresponding to a full Conjunctive Query) in one round with heterogeneous machines. In particular:
- We present an algorithm (Section 4) that evaluates a join query in one round when the cost function of machine $i$ is linear with weight $w_i$, i.e., $f_i(x) = w_i \cdot x$. Our algorithm works for two different types of inputs where all relations have the same size: matching databases, which are sparse, and dense databases, which contain a constant fraction of all possible input tuples.
- We give (Section 5) lower bounds that (almost) match the upper bounds for both the sparse and the dense case. Our lower bounds are unconditional, that is, they make no assumptions on how the algorithm behaves or how it encodes the input tuples.
- We next consider the case of non-linear cost functions (Section 6). Previous work, even in the topology-aware MPC model, assumes linear cost functions. We generalize this to a wider class of cost functions.
- Finally, we consider queries where the cardinalities of the input relations differ (Section 7). We give a lower bound on the load needed to compute such queries in a single round, for the same two data distributions as in the equal-cardinality case. We also give algorithms that match the lower bound for specific queries: the cartesian product, the binary join, the star query, and the triangle query.
Technical Ideas.
In the MPC model, the HyperCube algorithm has proved to be the key technique behind optimal join algorithms. The HyperCube algorithm maps tuples to machines via hash functions that hash each tuple to a vector. A tuple is sent to the machines whose coordinates, projected on the variables of the tuple's relation, equal the hash vector of the tuple. Each machine obtains the same number of tuples (with high probability) and hence has the same load. However, in the heterogeneous setting, each machine may be allocated a different number of tuples, since slower machines can handle less data than faster machines. Thus, instead of considering how to organize the machines in a hypercube, we consider how to partition the space of all possible output tuples into subspaces (which are hyperrectangles) $P_i$, one for each machine $i$. Each machine is then responsible for computing all the output tuples in its subspace, and to do this correctly it needs to receive all input tuples that may contribute to them. The technical challenge is twofold: how to optimally set the dimensions of each $P_i$ to minimize the load across all machines, and how to geometrically position the subspaces such that the space is fully covered. We will show that query parameters such as fractional edge packings and vertex covers are still critical in characterizing the optimal load, but the algorithmic techniques we use are different from the HyperCube algorithm.
2 Related Work
MPC Algorithms.
The MPC model is a computational model introduced by Beame et al. [2]. It has been used to analyze parallel algorithms for joins and other fundamental data processing tasks. The seminal paper [2] shows matching upper and lower bounds on the load for Conjunctive Queries in one round over matching databases. A lower bound for queries with skew was also given, which was matched by an upper bound for some classes of queries. Later work [12] studied the worst-case optimal load of one-round algorithms over any input and proposed an algorithm matching the lower bound. Further research explored the computation of join queries using multiple rounds [12, 11, 6, 9, 13], and the design of parallel output-sensitive algorithms in the MPC model [10].
Topology-aware Algorithms.
A recent line of work considers a topology-aware parallel model that accounts for heterogeneity in the cluster topology and compute resources [3, 8, 7]. In this model, the topology is modeled as a graph, where a subset of the nodes are compute nodes. Computation proceeds in rounds similar to the MPC model, but the cost model is different. Instead of modeling the cost as the maximum number of bits sent to a processor, each edge in the network has a cost which is a function of the number of bits it transmits. The cost of a round is then the maximum cost across all edges. A common choice is a linear cost function with a per-edge weight, which is similar to the cost functions used in this paper. Under this topology-aware model, recent work has studied lower and upper bounds for set intersection, cartesian product, and sorting [8], as well as binary joins [7]. Both of these papers assume that the underlying network has a symmetric tree topology.
The computational model in this paper is a special case of the topology-aware MPC model, where the network topology is a star: a tree of depth 1, where all leaves are compute nodes and the root node is a router. The cost of sending data from a compute node to the router is zero, and the cost function from the router to machine $i$ is precisely the cost function of the machine, $f_i$. Prior work in the topology-aware MPC model does not capture the setting of this paper, for two reasons. First, it considers symmetric trees, meaning the cost function across a link is the same in each direction, which is not true in this paper. Second, we consider arbitrary full Conjunctive Queries, which have not been studied previously in that model.
3 Background
Computation Model.
Initially, the machines in the cluster hold an arbitrary piece of the input data. The computation then proceeds in rounds. A round consists of a communication phase, where machines can exchange data, followed by a computation phase, where computation is performed on locally stored data. In this paper, we focus on algorithms with a single round, i.e., one round of communication followed by computation on local data. The output of the computation is the (set) union of the outputs across all machines.
In the standard MPC model, the cost of a round is modeled as the maximum amount of data (in bits) received by any machine. That is, if $x_i$ is the number of bits received by machine $i$, the cost of the computation is $\max_i x_i$.
In this paper, we extend this model to heterogeneous machines. This means that each machine $i$ has a cost function $f_i$ that maps the number of bits received, $x_i$, to a positive real number denoting cost. The cost of a round is defined as before, i.e., $L = \max_i f_i(x_i)$. We will mostly work with linear cost functions $f_i(x) = w_i \cdot x$ for some weight $w_i > 0$. Here, the weight constant for each machine captures the cost at the machine, which may include both data transmission and processing. Later in the paper, we will study more general cost functions.
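To make the cost model concrete, here is a minimal sketch of how the load of a single round would be computed under linear cost functions; the weights and bit counts are hypothetical illustration values, not taken from the paper.

```python
# Minimal sketch: the load of one round under linear cost functions f_i(x) = w_i * x.
# The weights and received bit counts below are hypothetical illustration values.
weights = [1.0, 2.0, 4.0]                 # w_i: a larger weight means a costlier machine
bits_received = [8_000, 3_000, 1_500]     # x_i: bits received by machine i in this round

costs = [w * x for w, x in zip(weights, bits_received)]
load = max(costs)                          # the load is the maximum cost across machines
print(load)                                # 8000.0
```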
Conjunctive Queries.
In this paper, we work with Conjunctive Queries without projection or selection. These can be thought of as natural joins between relations:
$$Q(x_1, \dots, x_k) = a_1(\mathbf{x}_1), a_2(\mathbf{x}_2), \dots, a_\ell(\mathbf{x}_\ell).$$
There are $k$ variables, denoted $x_1, \dots, x_k$, and $\ell$ atoms, denoted $a_1, \dots, a_\ell$. For each $j$, the vector $\mathbf{x}_j$ consists of the variables of atom $a_j$, and its length is the arity of $a_j$. We restrict the queries in this paper to have no self-joins, meaning no two atoms can refer to the same underlying relation. We will often write $x \in a_j$ to mean that variable $x$ occurs in the atom $a_j$. We will work with relations where the values come from a domain $[n] = \{1, 2, \dots, n\}$. We denote the cardinality of atom $a_j$ by $N_j$; we will also refer to the number of bits needed to encode the corresponding relation.
A fractional vertex cover for $Q$ assigns a weight $v_x \geq 0$ to each variable $x$ such that for every atom $a_j$, we have $\sum_{x \in a_j} v_x \geq 1$.
A fractional edge packing for $Q$ assigns a weight $u_j \geq 0$ to each atom $a_j$ such that for every variable $x$, we have $\sum_{j : x \in a_j} u_j \leq 1$.
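For example, for the triangle query $Q(x,y,z) = R(x,y), S(y,z), T(z,x)$ (one of the queries considered in Section 7), a standard instantiation of these two notions is
$$v_x = v_y = v_z = \tfrac{1}{2}, \qquad u_R = u_S = u_T = \tfrac{1}{2};$$
the first is a fractional vertex cover (every atom contains two variables, whose weights sum to $1$), the second a fractional edge packing (every variable occurs in two atoms, whose weights sum to $1$), and both have total weight $\tfrac{3}{2}$, an instance of the LP duality used in Section 5.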
HyperCube Algorithm.
HyperCube is an elegant algorithm for distributed multiway joins, originally introduced by Afrati and Ullman for the MapReduce model [1]. It computes multiway joins in a single round of communication, as opposed to traditional methods where relations are joined pairwise. We will illustrate how HyperCube computes a full CQ with $k$ variables using $p$ machines.
The machines are organized in a hyperrectangle with $k$ dimensions, one for each variable. The side of the hyperrectangle for variable $x$ has $p_x$ machines (the share of $x$), where $\prod_x p_x = p$. Each machine is identified by a coordinate $c \in [p_{x_1}] \times \dots \times [p_{x_k}]$; for a set of variables $S$, we denote by $\pi_S(c)$ the projection of $c$ on $S$. We use hash functions $h_x : [n] \to [p_x]$, one for each variable $x$; we denote by $h$ the vector of all hash functions and by $\pi_S(h(t))$ the projection of $h(t)$ on $S$. A tuple $t$ of an atom with variable set $S$ is sent to all machines $c$ such that $\pi_S(c) = \pi_S(h(t))$. Then the query can be computed locally on each machine with all tuples that were sent to that machine. The correctness of the algorithm follows from the fact that each tuple that should be in the output is produced by the machine whose coordinate equals the hash vector of that tuple.
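To make the tuple-to-machine mapping concrete, the following is a minimal sketch of HyperCube routing for the triangle query; the shares, hash functions, and variable names are illustrative assumptions rather than the paper's notation.

```python
import itertools

# Minimal sketch of HyperCube routing for the triangle query
# Q(x,y,z) = R(x,y), S(y,z), T(z,x).
# The shares p_x, p_y, p_z and the hash functions are illustrative choices.
shares = {"x": 2, "y": 2, "z": 2}                 # machines form a 2 x 2 x 2 grid
machines = list(itertools.product(*(range(shares[v]) for v in ("x", "y", "z"))))
var_index = {"x": 0, "y": 1, "z": 2}

def h(var, value):
    """Hypothetical per-variable hash function into [0, p_var)."""
    return hash(("seed", var, value)) % shares[var]

def destinations(relation_vars, tuple_values):
    """All machines whose coordinates agree with the tuple's hashes on relation_vars."""
    hashed = {v: h(v, val) for v, val in zip(relation_vars, tuple_values)}
    return [c for c in machines
            if all(c[var_index[v]] == hashed[v] for v in relation_vars)]

# An R(x,y)-tuple is replicated along the z dimension of the grid:
print(destinations(("x", "y"), (17, 42)))   # two machines, one per z-coordinate
```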
Input Distributions.
In this paper, we will focus on two classes of inputs, sparse and dense. The first class of inputs consists of matching databases. The cardinality of relation $R_j$ is denoted $N_j$. For every value in the domain $[n]$, every relation $R_j$, and every attribute of that relation, there exists at most one tuple whose value in that attribute equals the given value. Regardless of the arity of a relation, we require that its cardinality and the domain size $n$ are within a constant factor of each other (note that a matching relation of any arity has at most $n$ tuples). We will start by considering the case when each relation has the same cardinality. In Section 7, we will generalize this to the case when each relation can have a different cardinality $N_j$.
The second class of inputs are $\alpha$-dense databases, where $\alpha \in (0, 1]$. For this input, a relation of arity $r$ contains an $\alpha$-fraction of all $n^{r}$ possible tuples. We consider $\alpha$ to be a constant in data complexity terms. We will first study instances where the cardinality of each relation is the same (which means that the arity is the same for each relation) and generalize in Section 7 to unequal cardinalities.
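As an illustration of the sparse input class, the sketch below generates a binary matching relation from two random samples without replacement, so each domain value appears at most once per attribute; the function name and the specific cardinality are assumptions made for illustration.

```python
import random

def random_matching_relation(n, m):
    """Sketch: a binary matching relation over domain [n] with m <= n tuples.
    Each domain value appears in at most one tuple per attribute."""
    assert m <= n
    first = random.sample(range(1, n + 1), m)    # distinct values for the first attribute
    second = random.sample(range(1, n + 1), m)   # distinct values for the second attribute
    return list(zip(first, second))

print(random_matching_relation(n=10, m=6))
```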
4 The Upper Bound
In this section, we give algorithms for computing a full Conjunctive Query with $k$ variables. We consider the linear cost model, where we have $p$ machines and machine $i$ has a linear cost function $f_i(x) = w_i \cdot x$ for some weight $w_i > 0$.
Let $I$ be an instance with uniform cardinalities over a domain $[n]$. Let $\mathbf{v}$ be a fractional vertex cover of $Q$ and $\varphi = \sum_x v_x$. Then, define the target load $L_{\mathrm{lin}}$ as follows:
Theorem 1 (Dense Inputs).
Let $Q$ be a full CQ with uniform arity $r$ and let $I$ be an $\alpha$-dense input with domain $[n]$ (every relation has size $\alpha n^{r}$). Then, for every fractional vertex cover $\mathbf{v}$, we can evaluate $Q$ in one round in the linear cost model with load $O(L_{\mathrm{lin}})$.
Theorem 2 (Sparse Inputs).
Let $Q$ be a full CQ and let $I$ be a matching database with domain $[n]$ and uniform relation sizes. Then, for every fractional vertex cover $\mathbf{v}$, we can evaluate $Q$ in one round in the linear cost model with load $O(L_{\mathrm{lin}})$ with high probability.
In the rest of the section, we prove the above two theorems. We start with an overview of our approach, which is similar to the HyperCube algorithm, albeit with some important modifications. We do not pick share exponents that decide how many machines to place along each dimension of the hypercube; this concept is no longer meaningful, since the machines are different.
Instead, we consider the hyperrectangle $[n]^k$, which can be thought of as the space containing all possible output tuples. Our algorithm partitions $[n]^k$ into hyperrectangles $P_1, \dots, P_p$, one per machine, and uses this partitioning to guide how machines compute the output. To do this, we need a vector of functions $h = (h_{x_1}, \dots, h_{x_k})$, one per variable, each mapping $[n]$ to $[n]$. For the sparse data distribution, each $h_x$ will be a random hash function (essentially perturbing the input tuples). For the dense data distribution, each $h_x$ will be the identity function.
Then, machine $i$ will be responsible for computing every output tuple $t$ such that $h(t) \in P_i$. To achieve this, our algorithm sends information about a tuple $t$ of an atom with variable set $S_j$ to all machines $i$ where $\pi_{S_j}(h(t)) \in \pi_{S_j}(P_i)$, where $\pi_{S_j}(P_i)$ is the projection of the subspace $P_i$ on the attributes of the atom. Similar to the HyperCube algorithm, this guarantees that every potential output tuple $t$, if it exists in the output, is produced at some machine, namely a machine $i$ with $h(t) \in P_i$.
We denote by $p_i^{(x)}$ the side length of $P_i$ on variable $x$. Moreover, we use $\mathrm{Vol}(P_i)$ to denote the volume of $P_i$, i.e., the number of points in the subspace. Note that $\mathrm{Vol}(P_i) = \prod_x p_i^{(x)}$.
There are two main aspects of our algorithm to describe. The first is how to pick the side lengths for each machine and dimension to minimize the load; this corresponds to minimizing the projections of the hyperrectangles. The second is how to geometrically position the hyperrectangles in $[n]^k$ so that they cover the whole space. We describe these two components in the next two sections.
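The routing rule just described can be sketched as follows for a toy partition of $[n]^2$ into two hyperrectangles; the partition, the relation schemas, and the choice of $h$ are hypothetical placeholders, not the ones produced by our algorithm.

```python
# Sketch of the routing rule: machine i owns a hyperrectangle P_i, given here as
# one half-open interval per variable. The partition and h are placeholders.
def identity(value):
    return value                        # dense case: h is the identity function

# Two machines splitting [1,8]^2 along x (a made-up partition):
P = {
    0: {"x": (1, 5), "y": (1, 9)},      # machine 0: x in [1,4], y in [1,8]
    1: {"x": (5, 9), "y": (1, 9)},      # machine 1: x in [5,8], y in [1,8]
}

def recipients(relation_vars, tuple_values, h=identity):
    """Machines whose hyperrectangle, projected on relation_vars, contains h(t)."""
    out = []
    for i, rect in P.items():
        if all(rect[v][0] <= h(val) < rect[v][1]
               for v, val in zip(relation_vars, tuple_values)):
            out.append(i)
    return out

print(recipients(("x",), (6,)))        # a unary tuple R(6) on x goes only to machine 1
print(recipients(("y",), (3,)))        # a unary tuple S(3) on y goes to both machines
```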
4.1 Partitioning the Space
Theorem 3.
Let $\mathbf{v}$ be any fractional vertex cover of a CQ $Q$, and let $\varphi = \sum_x v_x$. For every machine $i$, let the side length of the hyperrectangle $P_i$ along variable $x$ be
Then, the following two properties hold:
1. $\sum_{i=1}^{p} \mathrm{Vol}(P_i) \geq n^{k}$;
2. for every machine $i$ and every atom $a_j$ with arity $r_j$, the volume of the projection $\pi_{S_j}(P_i)$ satisfies:
Proof.
We start by showing that the assignment above covers all of $[n]^k$, by summing the covered volume over all machines.
Next, we show the bound on the volume of the projected hyperrectangle for each atom. We focus on some atom $a_j$ with arity $r_j$. Then, we have:
Since $\mathbf{v}$ is a vertex cover, $\sum_{x \in a_j} v_x \geq 1$. Hence, we get the desired inequality.
The above theorem provides the appropriate dimensions of each hyperrectangle $P_i$, but it does not tell us how these hyperrectangles must be positioned geometrically within $[n]^k$ so that they cover the whole space.
Example 4.
Consider the Cartesian product $Q(x, y) = R_1(x), R_2(y)$. We have 17 machines with four distinct weights: 2 machines of one weight, 1 machine of a second, 3 machines of a third, and 11 machines of a fourth. Consider the vertex cover with $v_x = v_y = 1$; then $\varphi = 2$. Theorem 3 then prescribes the side lengths of the rectangle for each machine weight. The figure below shows one way to position the rectangles to cover $[n]^2$. Each rectangle is labeled with the weight of the machine that occupies that space.
In the example above, we can perfectly fit the rectangles together to cover $[n]^2$. When all hyperrectangles have the same dimensions, such as when all machines have the same weight, packing is a trivial problem. In general, there might not be a perfect way to fit the hyperrectangles together to cover the full space. This will require us to increase the size of some of the hyperrectangles $P_i$, but the volumes will increase only by a constant factor.
4.2 Packing Hyperrectangles
In this subsection, we show how to geometrically position the hyperrectangles to cover $[n]^k$. During this process, we will have to adjust the dimensions of each $P_i$ so that the hyperrectangles fit together. This results in adjusted hyperrectangles $\hat{P}_i$; however, we only have to pay a constant-factor increase in their dimensions. In particular:
Theorem 5 (Packing Theorem).
The hyperrectangles $P_1, \dots, P_p$ can be packed to cover $[n]^k$ by adjusting each hyperrectangle $P_i$ to $\hat{P}_i$ such that, for every relation with arity $r_j$ and every machine $i$, the volume of the projection of $\hat{P}_i$ on that relation is at most a constant factor larger than the volume of the projection of $P_i$.
Outside of this subsection, we will always denote the hyperrectangle of machine $i$ as $P_i$, even after the packing algorithm has run.
A condensed description of the packing algorithm can be seen in Algorithm 1. The algorithm sets the dimensions of each $P_i$ according to Theorem 3. Each side of each hyperrectangle is then rounded independently to the nearest higher power of two, giving adjusted hyperrectangles. The hyperrectangles are then put into buckets, where each bucket contains all hyperrectangles of the same dimensions. Denote the number of buckets as $B$.
Because the sides of hyperrectangles have been rounded to powers of two, we can always, if we have enough hyperrectangles in some small bucket, merge them into one hyperrectangle that fits in a larger bucket. We will order buckets in increasing order of hyperrectangle size. Starting with the first bucket, we will merge as many hyperrectangles as possible into hyperrectangles that fit in the second bucket. We do this for each consecutive pair of buckets until the last bucket is reached.
In the next step, we take the largest bucket, and pairwise merge hyperrectangles into hyperrectangles of twice the volume, by stacking them in a minimum dimension. This gives a new bucket of hyperrectangles. We repeat this procedure until there is just one hyperrectangle in the obtained bucket.
We will now take this hyperrectangle and use it to fill $[n]^k$. Some of its dimensions may be smaller than $n$. In such a case, we simply scale it up in those dimensions to be exactly $n$.
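The first steps of the packing algorithm, rounding every side up to a power of two and grouping the rounded hyperrectangles into buckets by shape, can be sketched as follows; the helper names and the example side lengths are illustrative, and `merge_count` only computes how many hyperrectangles of a smaller shape tile one of a larger shape, as in Lemma 7.

```python
from collections import defaultdict
from math import ceil, log2

def round_up_pow2(x):
    """Round a side length up to the nearest higher power of two."""
    return 1 if x <= 1 else 2 ** ceil(log2(x))

def bucketize(hyperrects):
    """Group hyperrectangles (tuples of side lengths) by shape after rounding,
    and return the buckets sorted by volume, smallest first."""
    buckets = defaultdict(list)
    for rect in hyperrects:
        rounded = tuple(round_up_pow2(s) for s in rect)
        buckets[rounded].append(rounded)

    def volume(shape):
        v = 1
        for s in shape:
            v *= s
        return v

    return sorted(buckets.items(), key=lambda kv: volume(kv[0]))

def merge_count(small, large):
    """How many hyperrectangles of shape `small` tile one of shape `large`
    (both have power-of-two sides and `large` dominates `small` in every dimension)."""
    count = 1
    for s, l in zip(small, large):
        count *= l // s
    return count

rects = [(3, 2), (3, 2), (3, 2), (3, 2), (6, 7)]   # hypothetical side lengths
buckets = bucketize(rects)
small_shape, large_shape = buckets[0][0], buckets[-1][0]
print(buckets)                               # [((4, 2), [... x4]), ((8, 8), [...])]
print(merge_count(small_shape, large_shape)) # 8 small hyperrectangles form one large one
```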
We now analyze the details of the algorithm. Recall that the packing algorithm starts by rounding every side up to the nearest higher power of two, obtaining rounded hyperrectangles. That is, for each machine $i$ and dimension $x$, we round the side length $p_i^{(x)}$ up to the smallest power of two that is at least $p_i^{(x)}$. Each side is rounded independently. This means that for two machines with different weights, the rounded side lengths may coincide for some but not all variables $x$.
Lemma 6.
For any two machines $i$ and $i'$ with weights $w_i$ and $w_{i'}$ such that $w_i \leq w_{i'}$, and for any variable $x$, the rounded side length of machine $i$ along $x$ is at least that of machine $i'$.
Proof.
Since $w_i \leq w_{i'}$, we know that the unrounded side lengths satisfy $p_i^{(x)} \geq p_{i'}^{(x)}$. Rounding up to the nearest power of two preserves this order, so the same inequality holds for the rounded side lengths.
We now create buckets of all rounded hyperrectangles, one bucket for each distinct shape. This means that any two hyperrectangles in the same bucket have the same side length in every dimension $x$. We order the buckets in increasing order of the volume of the hyperrectangles they contain.
Lemma 7.
Let $G_b$ and $G_{b+1}$ be two buckets that are adjacent in the volume order. Then, a suitable number of hyperrectangles from $G_b$ can be packed to form one hyperrectangle with the same shape as the hyperrectangles in $G_{b+1}$.
Proof.
By Lemma 6, every dimension of the hyperrectangles in $G_{b+1}$ is at least as big as the corresponding dimension of the hyperrectangles in $G_b$. More specifically, since the side lengths are powers of two, along each dimension $x$ the side length in $G_{b+1}$ is $2^{d_x}$ times the side length in $G_b$ for some integer $d_x \geq 0$. For some dimension $x$, take $2^{d_x}$ hyperrectangles from bucket $G_b$ and stack them together along dimension $x$. This creates one hyperrectangle whose dimension $x$ equals that of $G_{b+1}$. We now continue this process across all the other dimensions. In total, this process uses $\prod_x 2^{d_x}$ hyperrectangles from $G_b$. Note that $d_x \geq 1$ for at least one $x$, since otherwise the two shapes would be identical and the hyperrectangles would belong to the same bucket.
The above lemma means that, for each adjacent pair of buckets, if the smaller bucket contains at least $\prod_x 2^{d_x}$ hyperrectangles, we can merge that many of them into one hyperrectangle of the larger bucket. The packing algorithm merges as many hyperrectangles as possible, starting with the smallest bucket. When no merges are left possible, each bucket contains too few hyperrectangles for a further merge. We can now show that by only using the rectangles in the largest bucket, we can almost cover the whole output space.
Lemma 8.
Let $g_b$ be the number of hyperrectangles in bucket $b$, for $b = 1, \dots, B$. Then, the total volume of the hyperrectangles in the largest bucket is at least a constant fraction of $n^k$.
Proof.
We use the observation that, across all buckets, the total volume is at least $n^k$. Then:
where the second inequality holds because every bucket other than the largest contains too few hyperrectangles for a further merge, and the last inequality holds because the sum telescopes.
We will now pack $[n]^k$ using only the hyperrectangles in the last (largest) bucket. Let $\hat{n}$ be the domain size $n$ rounded up to the nearest power of two. Note that no dimension of a hyperrectangle is greater than $\hat{n}$: no unrounded side length exceeds $n$, so no rounded side length exceeds $\hat{n}$. We merge the hyperrectangles in the following way. Find the minimum dimension of the hyperrectangles in the bucket, and pairwise merge them into hyperrectangles of twice the volume by putting them adjacent along that dimension. This creates a new bucket with half as many hyperrectangles, leaving at most one hyperrectangle unmerged. This process is repeated on the new bucket, until a bucket is obtained that contains a single hyperrectangle, so no further merges are possible. We will now show that we can cover $[n]^k$ using just this one hyperrectangle, by scaling it up by at most a constant factor.
Lemma 9.
The volume of the single remaining hyperrectangle is at least a constant fraction of $n^{k}$.
Proof.
Let $g'$ be the number of hyperrectangles in the bucket obtained after the first merge, and denote by $g'_t$ the number of hyperrectangles in the bucket obtained after merge step $t$. Since each merge step leaves at most one hyperrectangle unmerged, we have:
Moreover, we have:
Finally, by reorganizing the above inequality and applying Lemma 8, we obtain the claim.
The above lemma shows that the remaining hyperrectangle almost covers $[n]^k$. There might, however, exist variables $x$ for which its side length is smaller than $n$. We scale the remaining hyperrectangle in each such dimension by a constant factor, which guarantees that every side length is at least $n$, and hence $[n]^k$ is covered. To scale the remaining hyperrectangle by some factor in dimension $x$, we have to scale each hyperrectangle that is packed into it by that same factor in dimension $x$; this gives the final sizes of the hyperrectangles, which we denote $\hat{P}_i$. If hyperrectangle $P_i$ is not packed into the remaining hyperrectangle, it is not used.
Lemma 10.
Let $P^{*}$ be the remaining hyperrectangle, and scale it in each dimension $x$ by the factor described above. Then $P^{*}$ covers $[n]^k$, and the scaling is such that for each subset $S$ of the variables, the volume of the projection $\pi_S(P^{*})$ increases by at most a constant factor.
Proof.
The choice of the scaling factor means that every side length of the scaled $P^{*}$ is at least $n$; hence $[n]^k$ is covered. By the previous lemma, the volume of $P^{*}$ is at least a constant fraction of $n^k$. Note that the hyperrectangles in the last bucket, and hence also $P^{*}$, have side lengths at most $\hat{n}$ ($n$ rounded up to the nearest higher power of two), since we merged along the smallest dimensions first. Therefore, the scaling factor in each dimension is at most a constant. Now,
for any subset $S$ of the variables, the product of the scaling factors over $S$ is at most the product over all variables, since each factor is at least 1.
We can now prove the main theorem about packing.
Proof of Theorem 5.
Let $P_i$ be the hyperrectangle of machine $i$ as given by Theorem 3, and let $\hat{P}_i$ be the hyperrectangle after the packing algorithm has run. The packing algorithm first increases sides by rounding them up to powers of two, which is at most a factor of 2 per side. If hyperrectangle $P_i$ is included in the final hyperrectangle, its sides might then be scaled up again by a constant factor. For an atom $a_j$ with arity $r_j$, we now have:
The second inequality comes from Lemma 10.
4.3 Putting Everything Together
We can now prove the main theorems in this section.
Proof of Theorem 1.
The worst case for the load of the algorithm is when every possible tuple of the input space exists. We calculate the load of machine $i$ from relation $R_j$. Denote by $x_{ij}$ the number of tuples received by machine $i$ from $R_j$. We get
Here the inequality comes from Theorem 3. The result follows since the query has a constant number of atoms.
Proof of Theorem 2.
Denote by $x_{ij}$ the number of bits received by machine $i$ from relation $R_j$. The probability that a tuple maps to machine $i$ is the following:
The inequality comes from Theorem 3. Note that, since we use hashing and have a matching database instance, the probability that a tuple is mapped to machine $i$ is the same and independent across all tuples in the hyperrectangle. Therefore, the number of tuples that machine $i$ receives from $R_j$ is a sum of independent indicator random variables. We get the following expected value:
We also show that the probability that the load is more than this is exponentially small. Indeed, applying the Chernoff bound, which we describe in the full paper, we have:
We obtain the probability bound by taking the union bound across all atoms and machines.
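For reference, a standard multiplicative Chernoff bound of the kind invoked here (the exact variant we use is stated in the full paper [5]) says that for a sum $X$ of independent indicator random variables with expectation $\mu$,
$$\Pr[X \geq (1+\delta)\mu] \;\leq\; \exp\!\left(-\frac{\delta^{2}\mu}{2+\delta}\right) \qquad \text{for every } \delta > 0,$$
so the probability that a machine receives substantially more tuples than expected decays exponentially.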
5 Lower Bounds
We present a lower bound on the load when machines have linear cost functions and all atoms have the same cardinality. This lower bound applies to both the sparse and the dense case, and considers the behavior of the algorithm over a probability distribution of inputs.
We consider again for each machine $i$ a linear cost function with weight $w_i$. Let $\mathbf{u}$ be a fractional edge packing for $Q$, with total weight $\|\mathbf{u}\|_1 = \sum_j u_j$. Moreover, let $m$ be the cardinality of every relation. Then, define
Theorem 11.
Let $Q$ be a CQ and let $\mathbf{u}$ be a fractional edge packing for $Q$. Consider the uniform probability distribution of matching databases with $m$ tuples per relation over domain $[n]$. Denote by $\mathbb{E}[|Q(I)|]$ the expected number of output tuples over instances $I$ drawn from this probability distribution. Then, any one-round algorithm that in expectation outputs at least $\mathbb{E}[|Q(I)|]$ tuples has load at least (up to a constant factor) the quantity defined above, in the linear cost model. The same lower bound holds for the probability distribution of $\alpha$-dense instances over domain $[n]$.
Each fractional edge packing $\mathbf{u}$ gives a different lower bound, the highest of which is obtained by minimizing the norm that appears in the bound. Since this norm is a decreasing function of its order, the highest lower bound is given by the maximum fractional edge packing. The value of the maximum fractional edge packing equals the value of the minimum fractional vertex cover by LP duality, hence the lower bound matches our upper bounds within a logarithmic factor.
Theorem 12.
We will next give an overview of the proof of Theorem 11, with some details left to the full paper. We assume that initially each relation is stored at a separate location. Let $msg_j$ be the bit string that a fixed machine receives about relation $R_j$, and let $msg$ be the concatenation of $msg_j$ over all $j$. Note that $|msg|$ is the number of bits that the machine receives. We let $M$ be the random variable mapping the set of possible database instances to the value of $msg$; $M_j$ is defined in the same way but maps to $msg_j$.
Definition 13.
Let $R_j$ be a relation, and let $t$ be a tuple. We say that $t$ is known by the machine, given message $msg$, if $t \in R_j$ for all database instances that are consistent with $msg$. We denote the set of tuples of $R_j$ known by the machine given message $msg$ as $K_j(msg)$; the set of all known tuples is defined analogously.
For each $j$, let $b_j$ be the maximum length of the message $msg_j$ that the machine receives (across all instances in the distribution), divided by the number of bits in the encoding of $R_j$. Note that, since we use the optimal encoding, the number of bits in the encoding is the entropy of the input distribution.
Lemma 14.
In an $\alpha$-dense database:
Lemma 15.
In a matching database:
To show Theorem 11, we use the following lemma together with the previous ones; all are proven in the full paper [5]. The lemma below was proven in [2] for matching databases. However, some assumptions about the data distributions are changed, and we also show the lemma for the $\alpha$-dense database distribution.
Lemma 16.
Let $\mathbf{u}$ be a fractional edge packing of $Q$. Then the expected number of known output tuples is at most:
We can now prove the main theorem of this section.
Proof of Theorem 11.
6 General Cost Functions
In previous sections, we considered machines with linear cost functions. In this section, we extend the results to a broader class of cost functions, where each machine $i$ is equipped with a general cost function $f_i$.
Definition 17.
A cost function $f$ is well-behaved if it satisfies the following:
1. $f(0) = 0$;
2. $f$ is increasing;
3. there exists a constant $c$ such that for all $x$, $f(2x) \leq c \cdot f(x)$.
These restrictions on a cost function are natural, since the cost of receiving zero bits should be zero, and the cost of receiving additional bits should be positive. The last condition states that a cost function cannot grow faster than some polynomial at any point. This condition is not needed for the lower bound, but without it, it is difficult to obtain a matching upper bound, since receiving just one bit more than expected could otherwise increase the cost of a machine arbitrarily.
Definition 18.
For a well-behaved cost function $f$, define the function $f^{-1}$ by:
$$f^{-1}(L) = \max\{x \mid f(x) \leq L\}.$$
Under the above definition, $f^{-1}(L)$ can be interpreted as the maximum number of bits the cost function permits the machine to receive with load at most $L$. The conditions above imply that if $f^{-1}(L)$ is defined for some $L$, it is also defined for all $L'$ with $L' \geq L$.
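As a small illustration, the sketch below evaluates $f^{-1}$ numerically by binary search; the particular cost function $f(x) = w \cdot x^{a}$ is an assumed example of a well-behaved function (cf. Example 21), not the general case.

```python
def cost(x, w=2.0, a=1.5):
    """Assumed well-behaved cost function f(x) = w * x**a (illustrative only)."""
    return w * (x ** a)

def inverse_cost(load, f=cost, max_bits=1 << 40):
    """Largest number of bits x with f(x) <= load, found by binary search on x."""
    lo, hi = 0, max_bits
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if f(mid) <= load:
            lo = mid
        else:
            hi = mid - 1
    return lo

print(inverse_cost(1000.0))   # maximum number of bits receivable with cost at most 1000
```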
6.1 Lower Bound
Given a query $Q$, consider any fractional edge packing $\mathbf{u}$. Suppose each relation has uniform cardinality $m$. Then, define $L_{\text{gen}}$ to be the minimum load $L$ that satisfies the following inequality.
Theorem 19.
Let $Q$ be a CQ and let $\mathbf{u}$ be a fractional edge packing for $Q$. Consider the uniform probability distribution of matching databases with $m$ tuples per relation over domain $[n]$. Denote by $\mathbb{E}[|Q(I)|]$ the expected number of output tuples over instances $I$ drawn from this probability distribution. Then any one-round algorithm with well-behaved cost functions that in expectation outputs at least $\mathbb{E}[|Q(I)|]$ tuples has load $\Omega(L_{\text{gen}})$. The same lower bound holds for the probability distribution of $\alpha$-dense instances over domain $[n]$.
Proof.
This proof is similar to the proof of Theorem 11. We again apply Lemma 16 and sum over all machines.
Here the second inequality comes from the fact that a machine cannot receive more bits than $f_i^{-1}(L)$, i.e., more than what is permitted by the load. We then use the bound from Lemma 14 or Lemma 15. Requiring that the machines together produce the expected number of output tuples proves the theorem.
The highest lower bound is given by the edge packing $\mathbf{u}$ that maximizes $L_{\text{gen}}$. We will now prove that the maximum fractional edge packing always gives the best lower bound. We will assume that the database is not empty.
Lemma 20.
Let $\mathbf{u}^{*}$ be the maximum fractional edge packing. Then, the lower bound given by $\mathbf{u}^{*}$ is at least the lower bound given by any other fractional edge packing.
Proof.
Let $L^{*}$ be the lower bound given by $\mathbf{u}^{*}$, and suppose $L'$ is another lower bound given by another edge packing $\mathbf{u}'$. It suffices to show that $L^{*}$ satisfies the defining inequality for $\mathbf{u}'$, since then the lowest load satisfying that inequality can be at most $L^{*}$.
Note that the relevant factor for every atom is bounded, since $\|\mathbf{u}'\|_1 \leq \|\mathbf{u}^{*}\|_1$. Then,
where the last inequality follows from the fact that $L^{*}$ satisfies the defining inequality for $\mathbf{u}^{*}$.
Example 21.
As an example, consider cost functions of the form $f_i(x) = w_i \cdot x^{a_i}$, where $a_i \geq 1$. Then $f_i^{-1}(L) = (L / w_i)^{1/a_i}$. The lower bound then becomes:
6.2 Upper Bound
We give an algorithm for evaluating full CQs with equal-cardinality atoms, where each cost function is well-behaved. The approach is similar to the one for linear cost functions, but we need a different method to pick the dimensions of each hyperrectangle $P_i$.
The algorithm requires the numerical value of $L_{\text{gen}}$, the lower bound on the load, so we need a method to compute it for the maximal fractional edge packing. For this, we need to find the minimal positive load $L$ that satisfies the defining inequality. We know that $L_{\text{gen}}$ is more than zero and at most the cost of sending the entire input to a single machine, since the query can be computed with that load using just one machine. $L_{\text{gen}}$ can therefore be found using binary search on this interval.
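Since the defining inequality is monotone in the load (a larger load only permits more bits on every machine), this binary search can be sketched as follows; the feasibility predicate, the interval, and the tolerance below are illustrative placeholders for the actual inequality.

```python
def find_min_load(feasible, lo, hi, eps=1e-6):
    """Binary search for the minimum load L in [lo, hi] with feasible(L) True.
    Assumes feasible is monotone: once it holds, it holds for every larger L."""
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical feasibility check standing in for the defining inequality of L_gen:
# "can the machines together cover the expected output when each machine i
#  receives at most f_i^{-1}(L) bits?"
example = find_min_load(lambda L: L * L >= 400.0, lo=0.0, hi=1000.0)
print(example)   # ~20.0
```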
Given the value $L_{\text{gen}}$, our algorithm computes the hyperrectangles $P_i$ using the same general technique as in Section 4, with the difference that the sizes of the dimensions are chosen differently. In particular, for the minimum fractional vertex cover $\mathbf{v}$, we calculate the $x$-dimension for machine $i$ as follows:
As we show in the full paper [5], the dimensions we choose are such that we can still apply the same packing technique as in Section 4. Thus:
Theorem 22 (Dense Inputs).
Let $Q$ be a full CQ with uniform arity $r$ and let $I$ be an $\alpha$-dense input with domain $[n]$ (every relation has size $\alpha n^{r}$). Then, we can evaluate $Q$ in one round with well-behaved cost functions with load $O(L_{\text{gen}})$.
Theorem 23 (Sparse Inputs).
Let $Q$ be a full CQ over a matching instance with uniform cardinalities and domain $[n]$. Then, we can evaluate $Q$ in one round with well-behaved cost functions with load $O(L_{\text{gen}})$ with high probability.
The proofs of the above theorems are provided in the full paper.
7 Different Cardinality Relations
We now move on to the general case where we do not require the cardinality of every atom to be the same. We assume linear cost functions, that is, cost functions of the form $f_i(x) = w_i \cdot x$. We give a general lower bound and matching upper bounds for the cartesian product, the binary join, the star query, and the triangle query.
7.1 Lower Bound
An important difference between the lower bound we present next and the lower bound for queries of equal cardinalities is that we need to consider a different edge packing for each machine. We denote the edge packing for query $Q$ and machine $i$ as $\mathbf{u}^{(i)}$.
Theorem 24.
Let $Q$ be a CQ and let $\mathbf{u}^{(i)}$ be any fractional edge packing for $Q$ and machine $i$. Consider the uniform probability distribution of matching databases with $N_j$ tuples for relation $R_j$ over domain $[n]$. Denote by $\mathbb{E}[|Q(I)|]$ the expected number of output tuples over instances $I$ drawn from this probability distribution. Then, any one-round algorithm with linear cost functions that in expectation outputs at least $\mathbb{E}[|Q(I)|]$ tuples has load at least $L$, where $L$ is the smallest load that satisfies the following equation:
The same lower bound holds for the probability distribution of -dense instances over domain .
Proof.
The proof is similar to the proofs of the previous lower bounds. We use Lemma 16 to bound the number of output tuples produced by one machine, and require that all machines together produce the expected number of output tuples.
Here we used that the number of bits machine $i$ can receive about each relation is at most $L / w_i$, since this is the maximum number of bits that keeps the load at most $L$. The theorem follows by taking a sum across all machines.
The highest lower bound is given by the set of edge packings, one for each machine, that maximizes the load needed to satisfy the equation. Next, we show how to compute the numerical value of this lower bound, which will be used by the upper bound.
Lemma 25.
Let $L$ denote the maximum lower bound on the load from Theorem 24. Then, $L$ is at least the bound obtained from the relation of maximum cardinality, and at most the load of computing the query on the single most powerful machine.
Proof.
We start with the first inequality. Note that the edge packing that assigns weight 1 to the edge of maximum cardinality, and weight 0 to all other edges, is a valid edge packing. Hence the corresponding quantity is a lower bound.
For the second inequality, if only the biggest machine is used in the computation, the load is the cost of that machine receiving the entire input. Therefore this load can always be achieved by computing the query on only the biggest machine. Since $L$ is the optimal load, it can never be more than this.
This, together with the lemma above, shows that the range of possible values for $L$ spans at most a bounded factor.
Since the range of possible values of $L$ is bounded, we can find $L$ by starting with a guess at the lower end of the range. We can check whether our guess is correct by finding the edge packings for each machine and checking whether the defining equation is satisfied. If it is not, we double our guess. Since the range of possible values of $L$ spans only a bounded factor, we have to iterate this procedure at most logarithmically many times.
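The doubling procedure just described can be sketched as follows; `is_satisfied` is a hypothetical placeholder for the per-machine edge-packing check.

```python
def find_load_by_doubling(is_satisfied, initial_guess, upper_limit):
    """Start from a lower-bound guess and double it until the check passes.
    The number of iterations is logarithmic in upper_limit / initial_guess."""
    guess = initial_guess
    while not is_satisfied(guess) and guess < upper_limit:
        guess *= 2
    return guess

# Hypothetical check standing in for "the machines can produce enough output
# tuples under load guess":
load = find_load_by_doubling(lambda L: L >= 37.0, initial_guess=1.0, upper_limit=1e9)
print(load)   # 64.0 -- within a factor of 2 of the true threshold 37
```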
7.2 Upper Bound
We will now show how to match the lower bound for the cartesian product, the binary join, the star query, and the triangle query. For the general full CQ, creating a matching algorithm is challenging and remains an open problem. The difficulty is that the lower bound in Theorem 24 might give a different edge packing to each machine. A linear program to find the edge packing for a machine can be obtained by minimizing the logarithm of the corresponding bound. By considering the dual program, it is possible to find hyperrectangles such that the expected load of each machine matches the lower bound and the total volume of all hyperrectangles covers the output space, similar to what was done in [2] for homogeneous machines. However, for the packing to work, we need all sides of a hyperrectangle to change monotonically with the weight of the machine. It is not clear how to guarantee this, or whether it is possible. We have, however, been able to match the lower bound for specific queries, using the same algorithm as in previous sections, by modifying how the shapes of the subspaces are picked. Here, we describe how to do this for the cartesian product and the binary join. The proof of correctness of the following algorithm is provided in the full paper [5]. There, we also show that a property similar to Lemma 6 holds, which means we can use the same packing method as previously presented. In the full paper [5], we also handle the star query and the triangle query.
Cartesian Product
We consider the cartesian product query.
Let $L$ be the lower bound on the load. Let the side length of $P_i$ along each dimension $x$ be
Binary Join
Next, consider the binary join.
Let $L$ be the lower bound on the load. Let the side lengths of $P_i$ be the following:
Matching the Lower Bound
The theorems below show that the algorithms above match the lower bound. Proofs are provided in the full paper [5].
Theorem 26 (Dense).
Let $Q$ be one of the cartesian product, the binary join, the star query, or the triangle query over an $\alpha$-dense input, where the arities of the relations do not have to be uniform. Let $L$ be the lower bound on the load. Then we can evaluate $Q$ with heterogeneous machines with weights $w_i$ with load $O(L)$.
Theorem 27 (Sparse).
Let $Q$ be one of the cartesian product, the binary join, the star query, or the triangle query over a matching database where atom $a_j$ has cardinality $N_j$. Let $L$ be the lower bound on the load. Then we can evaluate $Q$ with heterogeneous machines with weights $w_i$ with load $O(L)$ with high probability.
8 Conclusion
In this paper, we studied the problem of computing full Conjunctive Queries in parallel on heterogeneous machines. Our algorithms are inspired by the HyperCube algorithm, but take a new approach that considers how to optimally partition the space of possible output tuples among the machines. This yields an optimal algorithm for queries where all relations have the same cardinality, for both linear and more general cost functions, and an optimal algorithm for specific queries (the cartesian product, the binary join, the star query, and the triangle query) when the cardinalities differ.
References
- [1] Foto N. Afrati and Jeffrey D. Ullman. Optimizing joins in a map-reduce environment. In Ioana Manolescu, Stefano Spaccapietra, Jens Teubner, Masaru Kitsuregawa, Alain Léger, Felix Naumann, Anastasia Ailamaki, and Fatma Özcan, editors, EDBT 2010, 13th International Conference on Extending Database Technology, Lausanne, Switzerland, March 22-26, 2010, Proceedings, volume 426 of ACM International Conference Proceeding Series, pages 99–110. ACM, 2010. doi:10.1145/1739041.1739056.
- [2] Paul Beame, Paraschos Koutris, and Dan Suciu. Communication steps for parallel query processing. J. ACM, 64(6):40:1–40:58, 2017. doi:10.1145/3125644.
- [3] Spyros Blanas, Paraschos Koutris, and Anastasios Sidiropoulos. Topology-aware parallel data processing: Models, algorithms and systems at scale. In 10th Conference on Innovative Data Systems Research, CIDR 2020, Amsterdam, The Netherlands, January 12-15, 2020, Online Proceedings. www.cidrdb.org, 2020. URL: http://cidrdb.org/cidr2020/papers/p10-blanas-cidr20.pdf.
- [4] Benoît Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. The snowflake elastic data warehouse. In Fatma Özcan, Georgia Koutrika, and Sam Madden, editors, Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 215–226. ACM, 2016. doi:10.1145/2882903.2903741.
- [5] Simon Frisk and Paraschos Koutris. Parallel query processing with heterogeneous machines, 2025. arXiv:2501.08896.
- [6] Xiao Hu. Cover or pack: New upper and lower bounds for massively parallel joins. In Leonid Libkin, Reinhard Pichler, and Paolo Guagliardo, editors, PODS’21: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Virtual Event, China, June 20-25, 2021, pages 181–198. ACM, 2021. doi:10.1145/3452021.3458319.
- [7] Xiao Hu and Paraschos Koutris. Topology-aware parallel joins. Proc. ACM Manag. Data, 2(2):97, 2024. doi:10.1145/3651598.
- [8] Xiao Hu, Paraschos Koutris, and Spyros Blanas. Algorithms for a topology-aware massively parallel computation model. In Leonid Libkin, Reinhard Pichler, and Paolo Guagliardo, editors, PODS’21: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Virtual Event, China, June 20-25, 2021, pages 199–214. ACM, 2021. doi:10.1145/3452021.3458318.
- [9] Xiao Hu and Yufei Tao. Parallel acyclic joins: Optimal algorithms and cyclicity separation. J. ACM, 71(1):6:1–6:44, 2024. doi:10.1145/3633512.
- [10] Xiao Hu and Ke Yi. Instance and output optimal parallel algorithms for acyclic joins. In Dan Suciu, Sebastian Skritek, and Christoph Koch, editors, Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 450–463. ACM, 2019. doi:10.1145/3294052.3319698.
- [11] Bas Ketsman, Dan Suciu, and Yufei Tao. A near-optimal parallel algorithm for joining binary relations. Log. Methods Comput. Sci., 18(2), 2022. doi:10.46298/LMCS-18(2:6)2022.
- [12] Paraschos Koutris, Paul Beame, and Dan Suciu. Worst-case optimal algorithms for parallel query processing. In Wim Martens and Thomas Zeume, editors, 19th International Conference on Database Theory, ICDT 2016, Bordeaux, France, March 15-18, 2016, volume 48 of LIPIcs, pages 8:1–8:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2016. doi:10.4230/LIPICS.ICDT.2016.8.
- [13] Yufei Tao. A simple parallel algorithm for natural joins on binary relations. In Carsten Lutz and Jean Christoph Jung, editors, 23rd International Conference on Database Theory, ICDT 2020, March 30-April 2, 2020, Copenhagen, Denmark, volume 155 of LIPIcs, pages 25:1–25:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPICS.ICDT.2020.25.
- [14] Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. Amazon aurora: Design considerations for high throughput cloud-native relational databases. In Semih Salihoglu, Wenchao Zhou, Rada Chirkova, Jun Yang, and Dan Suciu, editors, Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 1041–1052. ACM, 2017. doi:10.1145/3035918.3056101.