
Parallel Query Processing with Heterogeneous Machines

Simon Frisk, University of Wisconsin-Madison, WI, USA
Paraschos Koutris, University of Wisconsin-Madison, WI, USA
Abstract

We study the problem of computing a full Conjunctive Query in parallel using p heterogeneous machines. Our computational model is similar to the MPC model, but each machine has its own cost function mapping the number of bits it receives to a cost. An optimal algorithm should minimize the maximum cost across all machines. We consider algorithms over a single communication round and give a lower bound and a matching upper bound for databases where every relation has the same cardinality. We do this both for linear cost functions, as in previous work, and for more general cost functions. For databases with relations of different cardinalities, we also give a lower bound, together with matching upper bounds for specific queries: the cartesian product, the binary join, the star query, and the triangle query. Our approach is inspired by the HyperCube algorithm, but heterogeneous cost functions introduce additional challenges.

Keywords and phrases:
Joins, Massively Parallel Computation, Heterogeneous
Copyright and License:
© Simon Frisk and Paraschos Koutris; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Database theory
Related Version:
Full Version: https://arxiv.org/abs/2501.08896 [5]
Editors:
Sudeepa Roy and Ahmet Kara

1 Introduction

Large datasets are commonly processed using massively parallel systems. To analyze query processing in such a setting, Beame et al. [2] introduced the massively parallel computation (MPC) model. The MPC model considers a cluster with a shared-nothing architecture, where computation proceeds in rounds: each round consists of communication between machines, followed by computation on the locally stored data. The main measure of complexity in the MPC model is the load, which captures the maximum number of bits received by a machine. An efficient MPC algorithm is designed to make the load as small as possible.

However, the MPC model operates on an assumption of homogeneity: the cost of a machine is independent of which machine the received data came from and of how powerful the receiving machine is. This assumption is unrealistic, since the large-scale clusters on which massively parallel computation is performed are heterogeneous. Heterogeneity can occur both in compute resources (processing speed, memory) and in the network that connects the machines.

In this work, we consider massively parallel data processing in clusters with heterogeneity in compute resources. We use a computational model that, like the MPC model, has a homogeneous network topology (every machine is connected directly to every other machine). However, each machine c is equipped with its own cost function $g_c$, which maps the number of bits the machine receives to a cost. The load L of a round is then defined as the maximum cost across all machines, i.e., $L = \max_{c\in[p]} g_c(N_c)$, where $N_c$ is the number of bits received by machine c. The computational model in this paper captures the MPC model as the special case where every machine has the identity cost function $g_c(N) = N$. Our model is also a special instance of the topology-aware model of [3], albeit one that has not been studied in prior work.

Based on the above heterogeneous model, we study the problem of computing join queries with minimum load. We focus on one-round algorithms, i.e., we allow only local computation after one round of communication. One-round algorithms are particularly relevant to data processing systems with a disaggregated storage architecture (e.g., Amazon Aurora [14], Snowflake [4]): they can be viewed as algorithms that send the data from the storage layer to the compute layer in such a way that no further communication is needed within the compute layer. This paper therefore addresses the problem of optimally sending data from the storage layer to the compute layer when there is compute heterogeneity.

Our Contributions.

The main contribution of this work is upper and lower bounds for the load L of computing a join query (corresponding to a full Conjunctive Query) in one round with heterogeneous machines. In particular:

  • We present an algorithm (Section 4) that evaluates a join query in one round when the cost function is linear with different weights, i.e., gc(N)=N/wc for machine c. Our algorithm works for two different types of inputs where all relations have the same size: matching databases that are sparse, and dense databases that contain a constant fraction of all possible input tuples.

  • We give (Section 5) lower bounds that (almost) match the upper bounds for both the sparse and dense cases. Our lower bounds are unconditional, that is, they make no assumptions on how the algorithm behaves and how it encodes the input tuples.

  • We next consider the case with non-linear cost functions (Section 6). Previous work, even in the topology-aware MPC model, assumes linear cost functions. We generalize this to a wider class of cost functions.

  • Finally, we consider queries where the cardinalities of the input relations differ (Section 7). We give a lower bound on the load of computing such queries in a single round, for the same two data distributions as in the equal-cardinality case. We also give algorithms that match the lower bound for the cartesian product, the binary join, the star query, and the triangle query.

Technical Ideas.

In the MPC model, the HyperCube algorithm has proved to be the key technique behind optimal join algorithms. The HyperCube algorithm maps tuples to machines via hash functions that hash each tuple to a vector. A tuple is sent to the machines whose coordinates, projected on the attributes of the tuple, equal its hash vector. Each machine obtains the same number of tuples (with high probability) and thus has the same load. In the heterogeneous setting, however, each machine may need to be allocated a different number of tuples, since slower machines can handle less data than faster machines. Thus, instead of considering how to organize the machines in a hypercube, we consider how to partition the space of all possible tuples $\Lambda = [n]^k$ into subspaces (which are hyperrectangles) $\Lambda_c \subseteq \Lambda$, one for each machine c. Each machine is then responsible for computing all the output tuples in its subspace, and to do this correctly it needs to receive all input tuples that may contribute to them. The technical challenge is twofold: (i) how to optimally set the dimensions of each $\Lambda_c$ to minimize the load across all machines, and (ii) how to geometrically position the subspaces such that the space Λ is fully covered. We will show that query parameters such as fractional edge packings and vertex covers remain critical in characterizing the optimal load, but the algorithmic techniques we use differ from the HyperCube algorithm.

2 Related Work

MPC Algorithms.

The MPC model is a computational model introduced by Beame et al. [2]. It has been used to analyze parallel algorithms for joins and other fundamental data processing tasks. The seminal paper [2] shows matching upper and lower bounds on the load for computing Conjunctive Queries in one round over matching databases. A lower bound for queries with skew was also given, and matched by an upper bound for some classes of queries. Later work [12] studied the worst-case optimal load of one-round algorithms for any input and proposed an algorithm matching the lower bound. Further research explored the computation of join queries using multiple rounds [12, 11, 6, 9, 13], and the design of parallel output-sensitive algorithms in the MPC model [10].

Topology-aware Algorithms.

A recent line of work considers a topology-aware parallel model that accounts for heterogeneity in the cluster topology and compute resources [3, 8, 7]. In this model, the topology is a graph $G = (V, E)$, where a subset $V_C \subseteq V$ of the nodes are compute nodes. Computation proceeds in rounds as in the MPC model, but the cost model differs: instead of the maximum number of bits sent to a processor, each edge of the network has a cost that is a function of the number of bits it transmits, and the cost of a round is the maximum cost across all edges. A common cost function for edge e is $f_e(N) = N/w_e$, which is similar to the cost function used in this paper. Under this topology-aware model, recent work has studied lower and upper bounds for set intersection, cartesian product, and sorting [8], as well as binary joins [7]. Both of these papers assume that the underlying network is a symmetric tree topology.

The computational model in this paper is a special case of the topology-aware MPC model, where the network topology is a star. This is a tree with depth 1, where all leaves are compute nodes, and the root node is a router. The cost function from a compute node to the router is 0, and the cost function from the router to machine c is precisely the cost function of the machine, gc(N). Prior work in the topology-aware MPC model does not capture the work in this paper, for two reasons. First, it considers symmetric trees, meaning the cost function across a link is the same in each direction, which is not true in this paper. Second, we consider arbitrary full conjunctive queries, which have not been studied previously.

3 Background

Computation Model.

Initially, the p machines in the cluster hold an arbitrary piece of the input data. The computation then proceeds in r rounds. A round consists of the communication phase, where machines can exchange data, followed by the computation phase, where computation is performed on locally stored data. In this paper, we focus on algorithms where r=1, meaning there is a single round of communication followed by computation on local data. The output of a computation is the (set) union of the output across all machines.

In the standard MPC model, the cost of a round is the maximum amount of data (in bits) received by any machine. That is, if $N_c$ is the number of bits received by machine c, the cost of the round is $L = \max_{c\in[p]} N_c$.

In this paper, we extend this model to heterogeneous machines. Each machine $c \in [p]$ has a cost function $g_c : \mathbb{R}_+ \to \mathbb{R}_+$ that maps the number of bits received ($N_c$) to a positive real number denoting cost. The cost of a round is, analogously, $\max_{c\in[p]} g_c(N_c)$. We will mostly work with linear cost functions $g_c(x) = x/w_c$ for some $w_c \in \mathbb{R}_+$. Here, the weight $w_c$ of each machine captures its cost rate, which may reflect both data transmission and processing. Later in the paper, we study more general cost functions.
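To make the cost model concrete, the following minimal sketch (our own illustration; the function name and numbers are not from the paper) computes the load of a round under linear cost functions:

```python
def round_load(bits_received, weights):
    """Load of one round under linear cost functions: the maximum
    cost g_c(N_c) = N_c / w_c over all machines c."""
    return max(n_c / w_c for n_c, w_c in zip(bits_received, weights))

# The MPC model is the special case where every weight equals 1.
print(round_load([1000, 2000, 1500], [1.0, 4.0, 2.0]))  # -> 1000.0
```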

Conjunctive Queries.

In this paper, we work with Conjunctive Queries without projection or selection. These can be thought of as natural joins between l relations:

$q(x_1,\dots,x_k)$ :- $S_1(\mathbf{y}_1),\dots,S_l(\mathbf{y}_l)$

There are k variables, denoted $x_1,\dots,x_k$, and l atoms, denoted $S_1,\dots,S_l$. For each j, the vector $\mathbf{y}_j$ consists of variables, and $r_j$ is the arity of atom $S_j$. We restrict the queries in this paper to have no self-joins, meaning that no two atoms refer to the same underlying relation. We often write $x \in S_j$ to mean that variable x occurs in atom $S_j$. We work with relations whose values come from a domain $[n] = \{1, 2, \dots, n\}$. We denote the cardinality of atom $S_j$ by $m_j$ and the number of bits needed to encode $S_j$ by $M_j$.

A fractional vertex cover $\mathbf{v}$ for q assigns a weight $v_i \ge 0$ to each variable $x_i$ such that for every atom $S_j$, we have $\sum_{x_i \in S_j} v_i \ge 1$.

A fractional edge packing $\mathbf{u}$ for q assigns a weight $u_j \ge 0$ to each atom $S_j$ such that for every variable $x_i$, we have $\sum_{j : x_i \in S_j} u_j \le 1$.

HyperCube Algorithm.

HyperCube is an elegant algorithm for distributed multiway joins, originally introduced by Afrati and Ullman for the MapReduce model [1]. It computes multiway joins in a single round of communication, as opposed to traditional methods where relations are joined pairwise. We will illustrate how HyperCube computes a full CQ q with k variables using p machines.

The p machines are organized in a hyperrectangle with k dimensions, one per variable. The sides of the hyperrectangle contain $p_1,\dots,p_k$ machines, where $p_i \in [1, p]$ and $\prod_{i\in[k]} p_i = p$. Each machine c has a coordinate $\mathbf{C}_c \in [p_1] \times \dots \times [p_k]$; denote by $\pi_{S_j}\mathbf{C}_c$ the projection of $\mathbf{C}_c$ on the variables of $S_j$. We use k hash functions $h_i : [n] \to [p_i]$, one per variable. Denote by $\mathbf{h} = (h_1,\dots,h_k)$ the vector of all hash functions, and by $\pi_{S_j}\mathbf{h}$ its projection on $S_j$. A tuple $a_j \in S_j$ is sent to all machines c such that $(\pi_{S_j}\mathbf{h})(a_j) = \pi_{S_j}\mathbf{C}_c$. The query can then be computed locally at each machine over the tuples it received. Correctness follows from the fact that each tuple $\mathbf{a} \in [n]^k$ that should be in the output is produced by the machine with coordinate $\mathbf{h}(\mathbf{a})$.
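The following Python sketch illustrates the HyperCube routing rule (a toy implementation under our own naming conventions; dictionaries stand in for the random hash functions):

```python
import random
from itertools import product

def hypercube_route(relations, atom_vars, shares, n):
    """One-round HyperCube routing (a toy sketch). relations[j] holds the
    tuples of S_j, atom_vars[j] the indices of its variables, shares[i]
    the number p_i of coordinates along variable x_i (their product is p)."""
    # one independent hash function h_i : [n] -> [p_i] per variable
    h = [{a: random.randrange(shares[i]) for a in range(n)}
         for i in range(len(shares))]
    machines = {coord: [] for coord in product(*(range(s) for s in shares))}
    for j, rel in enumerate(relations):
        for t in rel:
            hashed = {i: h[i][a] for i, a in zip(atom_vars[j], t)}
            # send t to every machine whose coordinates agree with the
            # hash vector on the variables of S_j
            for coord, local in machines.items():
                if all(coord[i] == v for i, v in hashed.items()):
                    local.append((j, t))
    return machines  # each machine then joins its local tuples

# Example: q(x1,x2) :- S1(x1), S2(x2) on a 2 x 2 grid of 4 machines.
print(hypercube_route([{(0,), (1,)}, {(2,), (3,)}], [[0], [1]], [2, 2], n=4))
```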

Input Distributions.

In this paper, we focus on two classes of inputs, sparse and dense. The first type of input is a matching database, where relation $S_j$ has cardinality $m_j$ and, for every domain value $v \in [n]$, every relation $S_j$, and every attribute A of that relation, there exists at most one tuple $a_j \in S_j$ whose value in attribute A is v. If the arity of a relation $S_j$ is 1, we require that $m_j/n \le \theta$ for some constant $\theta \in (0,1)$. We start by considering the case where each relation has the same cardinality; in Section 7, we generalize to the case where each relation can have a different cardinality $m_j \le n$.

The second class of inputs are θ-dense databases, where $\theta \in (0,1)$. In such an input, a relation $S_j$ of arity $r_j$ contains a θ fraction of all $n^{r_j}$ possible tuples. We consider θ a constant in terms of data complexity. We first study instances where the cardinality of every relation is the same (which means that the arity $r_j$ is the same for each relation), and generalize to unequal cardinalities in Section 7.

4 The Upper Bound

In this section, we give algorithms for computing a full Conjunctive Query q with k variables. We consider the linear cost model: we have p machines, and machine $c \in [p]$ has a linear cost function $g_c(N) = N/w_c$ for some weight $w_c > 0$. We denote $\mathbf{w} := (w_1,\dots,w_p)$.

Let I be an instance with uniform cardinalities m over a domain [n]. Let $\mathbf{v}$ be a fractional vertex cover of q and $v = \sum_{i\in[k]} v_i$. Then, define:

$L^{\mathrm{upper}}_{\mathbf{v}} := \frac{m \log n}{\lVert\mathbf{w}\rVert_v} = \frac{m \log n}{\left(\sum_{c\in[p]} w_c^{v}\right)^{1/v}}$
Theorem 1 (Dense Inputs).

Let q be a full CQ with uniform arity r and a θ-dense input I with domain [n] (every relation has size $m = \theta n^r$). Then, for every fractional vertex cover $\mathbf{v}$, we can evaluate q in one round in the linear cost model with load $O(L^{\mathrm{upper}}_{\mathbf{v}})$.

Theorem 2 (Sparse Inputs).

Let q be a full CQ and I be a matching database with domain [n] and uniform relation sizes m. Then, for every fractional vertex cover $\mathbf{v}$, we can evaluate q in one round in the linear cost model with load (with high probability) $O(L^{\mathrm{upper}}_{\mathbf{v}})$.

In the rest of the section, we prove the above two theorems. We start with an overview of our approach, which is similar to the HyperCube algorithm, albeit with some important modifications. We do not pick share exponents that decide how many machines to place along each dimension of the hypercube: this concept is no longer meaningful, since the machines are different.

Instead, we consider the hyperrectangle $\Lambda = [n]^k$, which can be thought of as the space containing all possible output tuples. Our algorithm partitions Λ into hyperrectangles $\{\Lambda_c\}_{c\in[p]}$ and uses this partitioning to decide how the machines compute the output. To do this, we need a vector of k functions $\mathbf{h} = (h_1,\dots,h_k)$, where $h_i : [n] \to [n]$. For the sparse data distribution, each $h_i$ is a random hash function (essentially perturbing the input tuples). For the dense data distribution, $\mathbf{h}$ is the identity function $\mathbf{h}(\mathbf{a}) = \mathbf{a}$.

Then, machine c is responsible for computing every tuple $\mathbf{a} \in [n]^k$ such that $\mathbf{h}(\mathbf{a}) \in \Lambda_c$. To achieve this, our algorithm sends information about a tuple $a_j \in S_j$ to every machine c with $(\pi_{S_j}\mathbf{h})(a_j) \in \pi_{S_j}\Lambda_c$, where $\pi_{S_j}\Lambda_c$ is the projection of the subspace on the attributes of $S_j$. As in the HyperCube algorithm, this guarantees that every potential output tuple $\mathbf{a}$, if it exists in the output, is produced at one machine, namely the machine c with $\mathbf{h}(\mathbf{a}) \in \Lambda_c$, as the sketch below illustrates.
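In code, the routing rule is a simple interval-membership test (a sketch with our own representation: each $\Lambda_c$ is a list of half-open intervals, one per variable):

```python
def receiving_machines(hashed, atom_vars_j, boxes):
    """Machines that must receive tuple a_j of S_j: those whose box,
    projected on the variables of S_j, contains the hashed tuple.
    boxes[c][i] = (lo, hi) is the interval of Lambda_c along x_i;
    hashed[i] is the value (pi_{S_j} h)(a_j) on variable x_i."""
    return [c for c, box in enumerate(boxes)
            if all(box[i][0] <= hashed[i] < box[i][1] for i in atom_vars_j)]
```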

We denote by $\lambda_{c,i}$ the side length of $\Lambda_c$ along variable $x_i$. Moreover, we use $|\Lambda|$ to denote the volume of Λ, i.e., the number of points in the space. Note that $|\pi_S \Lambda_c| = \prod_{x_i \in S} \lambda_{c,i}$.

There are two main aspects to describe of our algorithm. The first is how to pick the side lengths λc,i for each machine and dimension to minimize the load – this corresponds to minimizing the projections πSjΛc of the hyperrectangles. The second is how to geometrically position the hyperrectangles Λc in Λ to cover the whole space. We describe these two components in the next two sections.

4.1 Partitioning the Space

Theorem 3.

Let $\mathbf{v} = (v_1,\dots,v_k)$ be any fractional vertex cover of a CQ q, and let $v = \sum_{i\in[k]} v_i$. For every machine c, let the side length of the hyperrectangle $\Lambda_c$ along variable $x_i$ be

$\lambda_{c,i} := \left(\frac{w_c}{\lVert\mathbf{w}\rVert_v}\right)^{v_i} \cdot n$

Then, the following two properties hold:

  1. $\sum_{c\in[p]} |\Lambda_c| = n^k$;

  2. for every machine c and every atom S with arity r: $|\pi_S \Lambda_c| \le \frac{w_c}{\lVert\mathbf{w}\rVert_v}\, n^r$.

Proof.

We start by showing that the assignment above covers all of Λ, by summing the covered volume for each machine.

$\sum_{c\in[p]} |\Lambda_c| = \sum_{c\in[p]} \prod_{i\in[k]} \lambda_{c,i} = \sum_{c\in[p]} \prod_{i\in[k]} \left(\frac{w_c}{\lVert\mathbf{w}\rVert_v}\right)^{v_i} n = \sum_{c\in[p]} \left(\frac{w_c}{\lVert\mathbf{w}\rVert_v}\right)^{v} n^k = n^k\, \frac{\sum_{c\in[p]} w_c^v}{\sum_{c\in[p]} w_c^v} = n^k$

Next, we show the bound on the volume of the projected hyperrectangle on each atom. We focus on some atom S with arity r. Then, we have:

$|\pi_S \Lambda_c| = \prod_{x_i \in S} \lambda_{c,i} = \left(\frac{w_c}{\lVert\mathbf{w}\rVert_v}\right)^{\sum_{x_i \in S} v_i} n^r$

Note that $w_c / \lVert\mathbf{w}\rVert_v \le 1$. Furthermore, since $\mathbf{v}$ is a vertex cover, $\sum_{x_i \in S} v_i \ge 1$. Hence, we get the desired inequality.

The above theorem provides the appropriate dimensions of each hyperrectangle $\Lambda_c$, but it does not tell us how these hyperrectangles must be positioned geometrically within Λ so that they cover the whole space.

Example 4.

Consider the Cartesian product $q(x,y)$ :- $S_1(x), S_2(y)$. We have p = 17 machines: 2 machines with w = 4, 1 machine with w = 3, 3 machines with w = 2, and 11 machines with w = 1. Consider the vertex cover with $v_x = v_y = 1$. Then, $\lVert\mathbf{w}\rVert_v = 8$. This gives that machines with w = 4 should have side length n/2, the machine with w = 3 side length 3n/8, machines with w = 2 side length n/4, and machines with w = 1 side length n/8. Figure 1 shows one way to position the rectangles to cover Λ; each rectangle is labeled with the weight of the machine that occupies that space.

Figure 1: One way to pack the machines in the example.

In the example above we can perfectly fit the rectangles together to cover Λ. In the case when all hyperrectangles have the same dimensions, such as when machines have the same weight wc, packing is a trivial problem. In general, there might not be a perfect way to fit the hyperrectangles together to cover the full space. This will require us to increase the size of some of the hyperrectangles Λc, but the volumes will be increased only by a constant factor.
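The side lengths of Theorem 3 are easy to compute; the sketch below (our own code) reproduces Example 4 and verifies that the rectangle volumes sum to exactly |Λ|:

```python
def side_lengths(weights, vcover, n):
    """Theorem 3: lambda_{c,i} = (w_c / ||w||_v)^{v_i} * n, where
    v = sum_i v_i and ||w||_v = (sum_c w_c^v)^(1/v)."""
    v = sum(vcover)
    norm = sum(w ** v for w in weights) ** (1.0 / v)
    return [[(w / norm) ** vi * n for vi in vcover] for w in weights]

# Example 4: 17 machines and the vertex cover v_x = v_y = 1, so ||w||_v = 8.
weights = [4, 4, 3, 2, 2, 2] + [1] * 11
dims = side_lengths(weights, [1, 1], n=8)
print(sorted({tuple(map(round, d)) for d in dims}))  # sides 4, 3, 2, 1 for n = 8
print(sum(lx * ly for lx, ly in dims))               # -> 64.0 = n^2 = |Λ|
```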

4.2 Packing Hyperrectangles

In this subsection, we will show how to geometrically position the hyperrectangles {Λ1,,Λp} to cover Λ. During this process, we will have to adjust the dimensions of each Λc so that the hyperrectangles can fit together. This will result in adjusted hyperrectangles {Λ¯1,,Λ¯p}, however, we only have to pay a constant factor increase in their dimensions. In particular:

Theorem 5 (Packing Theorem).

The hyperrectangles $\{\Lambda_1,\dots,\Lambda_p\}$ can be packed to cover Λ by adjusting them to hyperrectangles $\{\bar\Lambda_1,\dots,\bar\Lambda_p\}$ such that for every relation $S_j$ with arity $r_j$ and every machine c, $|\pi_{S_j}\bar\Lambda_c| \le 2^{k+1+r_j}\, |\pi_{S_j}\Lambda_c|$.

Outside this subsection, we always denote the hyperrectangle of machine c by $\Lambda_c$, even after the packing algorithm has run.

A condensed description of the packing algorithm is given in Algorithm 1. The algorithm sets the dimensions of each $\Lambda_c$ according to Theorem 3. Each side of each hyperrectangle is then independently rounded up to the nearest power of two, giving adjusted hyperrectangles $\{\hat\Lambda_1,\dots,\hat\Lambda_p\}$. The hyperrectangles are then put into buckets, where each bucket contains all hyperrectangles of the same dimensions. Denote the number of buckets by b.

Because the sides of hyperrectangles have been rounded to powers of two, we can always, if we have enough hyperrectangles in some small bucket, merge them into one hyperrectangle that fits in a larger bucket. We will order buckets in increasing order of hyperrectangle size. Starting with the first bucket, we will merge as many hyperrectangles as possible into hyperrectangles that fit in the second bucket. We do this for each consecutive pair of buckets until the last bucket is reached.

In the next step, we take the largest bucket and pairwise merge its hyperrectangles into hyperrectangles of twice the volume, by stacking them along a minimum dimension. This gives a new bucket of hyperrectangles. We repeat this procedure until the obtained bucket contains just one hyperrectangle R.

We will now take this hyperrectangle R and use it to fill Λ. Some dimensions of R may be smaller than n. In such a case, we just scale up R in those dimensions to be exactly n.

Algorithm 1 Packing Algorithm.
1: Λ_1, …, Λ_p ← dimensions according to Theorem 3.
2: Λ̂_1, …, Λ̂_p ← round each side up to the nearest power of two.
3: B_1, …, B_b ← buckets of the Λ̂_c with identical dimensions.
4: for B_t ∈ {B_1, …, B_{b−1}} do
5:   Merge as many hyperrectangles from B_t into B_{t+1} as possible.
6: end for
7: t ← b
8: while |B_t| > 1 do
9:   B_{t+1} ← pairwise merge the hyperrectangles of B_t along their smallest dimension.
10:   t ← t + 1.
11: end while
12: R ← the single hyperrectangle in B_t.
13: Scale R up to cover Λ.
14: return R.
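The bucket bookkeeping of lines 3–11 can be simulated with counts alone, since Lemma 6 (proved below) guarantees that the rounded shapes are totally ordered dimension-wise. The following sketch (our own code, tracking only how many boxes of each shape remain, not their geometric placement) illustrates the two merging phases:

```python
import math
from collections import Counter

def bucket_merge(rounded_shapes):
    """Counting simulation of Algorithm 1's merging phases.
    rounded_shapes: one tuple of power-of-two side lengths per machine."""
    vol = lambda s: math.prod(s)
    buckets = Counter(rounded_shapes)        # shape -> number of boxes
    order = sorted(buckets, key=vol)         # increasing volume
    # Phase 1: merge shapes of each bucket into the next bucket (Lemma 7).
    for small, big in zip(order, order[1:]):
        merged, buckets[small] = divmod(buckets[small], vol(big) // vol(small))
        buckets[big] += merged
    # Phase 2: pairwise merging inside the largest bucket; each step
    # doubles the volume and strands at most one box (Lemma 9).
    count, volume = buckets[order[-1]], vol(order[-1])
    while count > 1:
        count, volume = count // 2, volume * 2
    return volume  # volume of the single remaining box R: > |Λ|/2 by Lemma 9

# Boxes of shapes 1x1, 2x2 and 4x4 (total volume 55): R gets volume 32.
print(bucket_merge([(1, 1)] * 11 + [(2, 2)] * 3 + [(4, 4)] * 2))
```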

We now analyze the details of the algorithm. Recall that the packing algorithm starts by rounding each side $\lambda_{c,i}$ up to the nearest power of two $\hat\lambda_{c,i}$, obtaining rounded hyperrectangles $\hat\Lambda_c$. That is, for each machine c and dimension i, we find $\alpha_{c,i}$ such that $2^{\alpha_{c,i}-1} < \lambda_{c,i} \le 2^{\alpha_{c,i}} = \hat\lambda_{c,i}$. Each side of $\Lambda_c$ is rounded independently. This means that for two hyperrectangles $\hat\Lambda_c, \hat\Lambda_{c'}$ with $c \ne c'$, it is possible that $\hat\lambda_{c,i} = \hat\lambda_{c',i}$ for some but not all variables $x_i$.

Lemma 6.

For any two machines with weights $w_c$ and $w_{c'}$ such that $w_c \le w_{c'}$, and for any variable $x_i$, we have $\hat\lambda_{c,i} \le \hat\lambda_{c',i}$.

Proof.

Since $w_c \le w_{c'}$, we know that $(w_c/\lVert\mathbf{w}\rVert_v)^{v_i}\, n \le (w_{c'}/\lVert\mathbf{w}\rVert_v)^{v_i}\, n$. This means that $\lambda_{c,i} \le \lambda_{c',i}$, and then also $\hat\lambda_{c,i} \le \hat\lambda_{c',i}$.

We now create buckets $B_1,\dots,B_b$ from the hyperrectangles $\{\hat\Lambda_c\}_{c\in[p]}$, one bucket for each set of hyperrectangles with the same dimensions. That is, for any two hyperrectangles $\hat\Lambda_c, \hat\Lambda_{c'}$ in the same bucket, $\hat\lambda_{c,i} = \hat\lambda_{c',i}$ for all $x_i$. We order the buckets in increasing order of the volume of the hyperrectangles they contain, denoted $V[B_t]$.

Lemma 7.

Let $B_t, B_{t'}$ be buckets with $t < t'$. Then, $V[B_{t'}]/V[B_t]$ hyperrectangles from $B_t$ can be packed to form one hyperrectangle with the same shape as the hyperrectangles in $B_{t'}$.

Proof.

Let $\hat\Lambda_c \in B_t$ and $\hat\Lambda_{c'} \in B_{t'}$. By Lemma 6, every dimension of $\hat\Lambda_{c'}$ is at least as big as the corresponding dimension of $\hat\Lambda_c$. More specifically, since the side lengths are of the form $2^{\alpha_{c,i}}$, we know that $\hat\lambda_{c',i} = 2^{a_{i,t}}\, \hat\lambda_{c,i}$ for some $a_{i,t} \in \mathbb{N}$. For some dimension i, take $2^{a_{i,t}}$ hyperrectangles from bucket $B_t$ and stack them together along dimension i. This creates one hyperrectangle whose side along dimension i matches that of $\hat\Lambda_{c'}$. We continue this process across all other dimensions. Let $a_t = \sum_{i\in[k]} a_{i,t}$. Then, this process uses $2^{a_t} = V[B_{t'}]/V[B_t]$ hyperrectangles of shape $\hat\Lambda_c$. Note that $a_{i,t} > 0$ for at least one i, since otherwise $\hat\Lambda_c = \hat\Lambda_{c'}$ and the two hyperrectangles would be in the same bucket.

The above lemma means that for each adjacent pair of buckets $B_t, B_{t+1}$, if $B_t$ contains at least $V[B_{t+1}]/V[B_t]$ hyperrectangles, we can merge some of them into one hyperrectangle in $B_{t+1}$. The packing algorithm merges as many hyperrectangles as possible, starting with the smallest bucket. When no merges are possible anymore, each bucket $B_t$ contains at most $V[B_{t+1}]/V[B_t] - 1$ hyperrectangles, since otherwise another merge would be possible. We can now show that, using only the hyperrectangles in the largest bucket $B_b$, we can almost cover the whole output space.

Lemma 8.

Let $p_t$ be the number of hyperrectangles in bucket $B_t$, for $t \in \{1,\dots,b\}$. Then, $|\Lambda| < (1 + p_b)\, V[B_b]$.

Proof.

We use the observation that across all buckets the total volume is at least |Λ|. Then:

$|\Lambda| \le \sum_{t=1}^{b} p_t V[B_t] = p_b V[B_b] + \sum_{t=1}^{b-1} p_t V[B_t] \le p_b V[B_b] + \sum_{t=1}^{b-1}\left(\frac{V[B_{t+1}]}{V[B_t]} - 1\right) V[B_t] = p_b V[B_b] + \sum_{t=1}^{b-1}\left(V[B_{t+1}] - V[B_t]\right) < (1 + p_b)\, V[B_b]$

where the second inequality holds because $p_t \le V[B_{t+1}]/V[B_t] - 1$, and the last step holds because the sum telescopes to $V[B_b] - V[B_1] < V[B_b]$.

We now pack Λ using only the $p_b$ hyperrectangles in the last bucket $B_b$. Let $\hat n$ be the domain size n rounded up to the nearest power of two. Note that no dimension of a hyperrectangle $\hat\Lambda_c \in B_b$ is greater than $\hat n$: since $\lambda_{c,i} \le n$, we have $\hat\lambda_{c,i} \le \hat n$. We merge the hyperrectangles in $B_b$ in the following way. Find a minimum dimension i of the hyperrectangles in $B_b$, and pairwise merge hyperrectangles of $B_b$ into hyperrectangles of volume $2V[B_b]$ by placing them adjacent along dimension i. This creates a new bucket $B_{b+1}$ with $\lfloor p_b/2 \rfloor$ hyperrectangles, leaving at most one hyperrectangle of $B_b$ unmerged. This process is repeated on the hyperrectangles of $B_{b+1}$, and so on, until a bucket $B_{b+d}$ is obtained that contains exactly one hyperrectangle, so that no further merges are possible. We now show that we can cover Λ using just this one hyperrectangle of $B_{b+d}$, scaled up by at most a constant factor.

Lemma 9.

$V[B_{b+d}] > |\Lambda|/2$.

Proof.

Let β be the number of hyperrectangles in $B_b$ after the first merging phase. Denote by $p_t$ the number of hyperrectangles left in bucket $B_t$ after the last merge step, for each $t \in \{b,\dots,b+d\}$; note that $p_t \in \{0,1\}$. Since $V[B_{t+1}]/V[B_t] = 2$,

$\sum_{t=b}^{b+d-1} p_t V[B_t] \le \sum_{t=b}^{b+d-1} V[B_t] \le V[B_{b+d}] - V[B_b]$

Moreover, we have:

$\beta\, V[B_b] = \sum_{t=b}^{b+d} p_t V[B_t] = V[B_{b+d}] + \sum_{t=b}^{b+d-1} p_t V[B_t] \le 2V[B_{b+d}] - V[B_b]$

Finally, by reorganizing the above inequality and applying Lemma 8, we obtain $V[B_{b+d}] \ge (\beta + 1)\, V[B_b]/2 > |\Lambda|/2$.

The above lemma shows that the hyperrectangle $R \in B_{b+d}$ almost covers Λ. There might, however, exist variables $x_i$ such that $|R_i| < n$, where $|R_i|$ is the side length of R along $x_i$. We scale R in each dimension i by the factor $f_i = \max\{n/|R_i|, 1\}$. This guarantees that $|R_i| \ge n$ for each $i \in [k]$, and hence that R covers Λ. To scale R by a factor $f_i$ in dimension i, we scale each $\hat\Lambda_c$ that is packed into R by the same factor $f_i$ in dimension i, which gives the final hyperrectangles, denoted $\bar\Lambda_c$. If hyperrectangle c is packed into R, then $\bar\lambda_{c,i} = f_i\, \hat\lambda_{c,i}$; if it is not packed into R, then $\bar\lambda_{c,i} = 0$, since the hyperrectangle is not used.

Lemma 10.

Let $R \in B_{b+d}$ be the remaining hyperrectangle. Scale R in each dimension i by the factor $f_i = \max\{n/|R_i|, 1\}$. Then R covers Λ, and the scaling is such that for each subset $S \subseteq [k]$: $\prod_{i \in S} f_i \le 2^{k+1}$.

Proof.

The choice of $f_i = \max\{n/|R_i|, 1\}$ means that $|R_i| f_i \ge n$. Hence Λ is covered. By the previous lemma, $\prod_{i\in[k]} |R_i| \ge n^k/2$. Note that the hyperrectangles in $B_b,\dots,B_{b+d}$, and hence also R, have side lengths at most $\hat n$ (n rounded up to the nearest power of two), since we always merged along a smallest dimension first. Furthermore, $\hat n < 2n$. Therefore, for each $i\in[k]$, $|R_i| \le 2n$. Now,

$\prod_{i\in[k]} f_i = \prod_{i\in[k]} \max\left\{\frac{n}{|R_i|},\, 1\right\} = \prod_{i\in[k]} \frac{\max\{n, |R_i|\}}{|R_i|} = \frac{\prod_{i\in[k]} \max\{n, |R_i|\}}{V[R]} \le \frac{(2n)^k}{n^k/2} = 2^{k+1}$

For any $S \subseteq [k]$, the product $\prod_{i\in S} f_i$ is at most the product above, since $f_i \ge 1$ for all $i \in [k]$.

We can now prove the main theorem about packing.

Proof of Theorem 5.

Let $\Lambda_c$ be the hyperrectangle of machine c as given by Theorem 3, and let $\bar\Lambda_c$ be the hyperrectangle after the packing algorithm has run. The packing algorithm first increases each side $\lambda_{c,i}$ by rounding it up to $\hat\lambda_{c,i}$, which is at most a factor of 2 larger. If hyperrectangle c is included in the final hyperrectangle R, the sides of $\hat\Lambda_c$ may then be scaled up again, by a factor $f_i = \max\{n/|R_i|, 1\}$, to $\bar\lambda_{c,i}$. For an atom $S_j$ with arity $r_j$, we now have:

$\frac{|\pi_{S_j}\bar\Lambda_c|}{|\pi_{S_j}\Lambda_c|} = \prod_{x_i \in S_j} \frac{\hat\lambda_{c,i}}{\lambda_{c,i}} \cdot \frac{\bar\lambda_{c,i}}{\hat\lambda_{c,i}} \le \prod_{x_i \in S_j} 2 f_i \le 2^{r_j + k + 1}$

The second inequality comes from Lemma 10.

4.3 Putting Everything Together

We can now prove the main theorems in this section.

Proof of Theorem 1.

In the worst case, every possible tuple in $\Lambda_c$ exists. We calculate the load $L_c^j$ that machine c incurs from relation $S_j$. Denote by $n_c^j$ the number of tuples machine c receives from $S_j$. We get

$L_c^j = \frac{n_c^j \log n}{w_c} = \frac{\log n}{w_c}\, |\pi_{S_j}\Lambda_c| \le \frac{1}{w_c} \cdot \frac{w_c}{\lVert\mathbf{w}\rVert_v}\, n^r \log n = O\!\left(\frac{n^r \log n}{\lVert\mathbf{w}\rVert_v}\right)$

Here the inequality comes from Theorem 3 (the packing of Theorem 5 increases it by at most a constant factor). The result follows since the query has a constant number of atoms and $m = \theta n^r$.

Proof of Theorem 2.

Denote by $N_c^j$ the number of bits received by machine c from relation $S_j$. The probability that a tuple $a_j \in S_j$ maps to machine c is the following:

$\Pr\left[(\pi_{S_j}\mathbf{h})(a_j) \in \pi_{S_j}\Lambda_c\right] = \frac{|\pi_{S_j}\Lambda_c|}{n^{r_j}} \le \frac{w_c}{\lVert\mathbf{w}\rVert_v}$

The inequality comes from Theorem 3. Note that since we use hashing and the instance is a matching database, the probability of being mapped to machine c is the same and independent across all tuples. Therefore, $n_c^j \sim \mathrm{Bin}(m, w_c/\lVert\mathbf{w}\rVert_v)$. We get the following expected value:

$E[L_c^j] = \frac{1}{w_c}\, E[n_c^j] \log n = \frac{1}{w_c} \cdot \frac{w_c\, m}{\lVert\mathbf{w}\rVert_v} \log n = O\!\left(\frac{m \log n}{\lVert\mathbf{w}\rVert_v}\right)$

We also show that the probability that the load is more than this is exponentially small. Indeed, applying the Chernoff bound, which we describe in the full paper, we have:

$\Pr\!\left[L_c^j \ge (1+\delta)\frac{m \log n}{\lVert\mathbf{w}\rVert_v}\right] = \Pr\!\left[N_c^j \ge (1+\delta)\frac{w_c\, m \log n}{\lVert\mathbf{w}\rVert_v}\right] = \Pr\!\left[n_c^j \ge (1+\delta)\frac{m\, w_c}{\lVert\mathbf{w}\rVert_v}\right] \le \exp\!\left(-\frac{\delta^2 m\, w_c}{3\lVert\mathbf{w}\rVert_v}\right)$

We obtain the high-probability bound by taking a union bound across all atoms and machines.

5 Lower Bounds

We present a lower bound on the load when machines have linear cost functions and all atoms have the same cardinality. This lower bound applies to both the sparse and the dense case, and considers the behavior of the algorithm over a probability distribution of inputs.

We consider again, for each machine c, a linear cost function $g_c(N) = N/w_c$ with weights $\mathbf{w} = (w_1,\dots,w_p)$. Let $\mathbf{u} = (u_1,\dots,u_l)$ be a fractional edge packing for q, with $u = \sum_{j\in[l]} u_j$. Moreover, let m be the cardinality of every relation. Then, define

$L^{\mathrm{lower}}_{\mathbf{u}} := \frac{m}{\left(\sum_{c\in[p]} w_c^u\right)^{1/u}} = \frac{m}{\lVert\mathbf{w}\rVert_u}$
Theorem 11.

Let q be a CQ and let $\mathbf{u} = (u_1,\dots,u_l)$ be a fractional edge packing for q. Consider the uniform probability distribution of matching databases with m tuples per relation over domain [n]. Denote by $E_I[|q(I)|]$ the expected number of output tuples over instances I drawn from this distribution. Then, any one-round algorithm that in expectation outputs at least $E_I[|q(I)|]$ tuples has load $\Omega(L^{\mathrm{lower}}_{\mathbf{u}})$ in the linear cost model. The same lower bound holds for the uniform distribution of θ-dense instances over domain [n].

Each fractional edge packing $\mathbf{u}$ gives a different lower bound; the highest is obtained by minimizing $\lVert\mathbf{w}\rVert_u$. Since the u-norm is a decreasing function of u, the highest lower bound is given by a maximum fractional edge packing. The value of the maximum fractional edge packing equals the value of the minimum fractional vertex cover by LP duality; hence the lower bound matches our upper bounds within a logarithmic factor.
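For a concrete query, a maximum fractional edge packing can be found with a small linear program. A sketch using scipy (our own code, not part of the paper):

```python
from scipy.optimize import linprog

def max_edge_packing(atom_vars, k):
    """Maximize sum_j u_j subject to u_j >= 0 and, for every variable x_i,
    the sum of u_j over the atoms containing x_i being at most 1.
    atom_vars[j] holds the variable indices of atom S_j; k is the number
    of variables."""
    l = len(atom_vars)
    c = [-1.0] * l  # linprog minimizes, so negate the objective
    A = [[1.0 if i in atom_vars[j] else 0.0 for j in range(l)]
         for i in range(k)]
    res = linprog(c, A_ub=A, b_ub=[1.0] * k, bounds=[(0.0, None)] * l)
    return list(res.x), -res.fun

# Triangle query q(x,y,z) :- S1(x,y), S2(y,z), S3(z,x):
# maximum packing u = (1/2, 1/2, 1/2), with value u = 3/2.
print(max_edge_packing([{0, 1}, {1, 2}, {2, 0}], k=3))
```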

Theorem 12.

$\min_{\mathbf{v}} \left(L^{\mathrm{upper}}_{\mathbf{v}}\right) = \log n \cdot \max_{\mathbf{u}} \left(L^{\mathrm{lower}}_{\mathbf{u}}\right)$

We next give an overview of the proof of Theorem 11, with some details left to the full paper. We assume that initially each relation is stored at a separate location. Let $msg_j$ be the bit string that a fixed machine receives from $S_j$, and let $msg$ be the concatenation of the $msg_j$ over all j. Note that $|msg|$ is the number of bits that the machine receives. We let $Msg(I)$ be the random variable mapping a database instance to the value of $msg$; $Msg_j(S_j)$ is defined in the same way but maps to $msg_j$.

Definition 13.

Let R be a relation, and let $a \in R$ be a tuple. We say that a is known by the machine, given message $msg$, if $a \in R$ for all database instances I with $Msg(I) = msg$. We denote the set of tuples of R known by machine c given message $msg$ by $K^c_{msg}(R)$. Furthermore, we define $K_{msg}(R) = \bigcup_c K^c_{msg}(R)$.

For each $S_j$, let $f_{c,j} \in [0,1]$ be the maximum length of the message $msg_j$ that machine c receives (across all instances in the distribution), divided by $M_j$, the number of bits in the encoding of $S_j$. Note that since we use an optimal encoding, $M_j$ is the entropy of the input distribution.

Lemma 14.

In a θ-dense database, Mj=Ω(mj).

Lemma 15.

In a matching database, Mj=Ω(mj).

To show Theorem 11, we use the previous lemmas together with the following lemma, all proven in the full paper [5]. The lemma below was proven in [2] for matching databases; however, some assumptions about the data distribution differ here, and we also show the lemma for the θ-dense database distribution.

Lemma 16.

Let $\mathbf{u} = (u_1,\dots,u_l)$ be a fractional edge packing of q. Then the expected number of known output tuples is

$E\left[|K^c_{msg}(q(I))|\right] \le \prod_{j\in[l]} f_{c,j}^{\,u_j}\; E\left[|q(I)|\right]$

We can now prove the main theorem of this section. We will use the notation $u = \sum_{j\in[l]} u_j$.

Proof of Theorem 11.

From the definition of the load, $f_{c,j} \le L w_c / M$, where M is the number of bits encoding a relation. Applying Lemma 16,

$E\left[|K^c_{msg}(q(I))|\right] \le E\left[|q(I)|\right] \prod_{j\in[l]} f_{c,j}^{\,u_j} \le E\left[|q(I)|\right] \prod_{j\in[l]} \left(\frac{L w_c}{M}\right)^{u_j} = E\left[|q(I)|\right] \left(\frac{L w_c}{M}\right)^{u}$

We now use that $|K_{msg}(q(I))| = \left|\bigcup_{c\in[p]} K^c_{msg}(q(I))\right| \le \sum_{c\in[p]} |K^c_{msg}(q(I))|$:

$E\left[|K_{msg}(q(I))|\right] \le \sum_{c\in[p]} \left(\frac{L w_c}{M}\right)^{u} E\left[|q(I)|\right] = \frac{L^u}{M^u}\, E\left[|q(I)|\right] \sum_{c\in[p]} w_c^u$

If the algorithm is to produce the whole output of the query, the expected number of known output tuples has to be at least the expected output size of the query, so

$\frac{L^u}{M^u}\, E\left[|q(I)|\right] \sum_{c\in[p]} w_c^u \ge E\left[|q(I)|\right]$

Solving for L and applying Lemma 14 or 15 ($M = \Omega(m)$) gives $L = \Omega(m/\lVert\mathbf{w}\rVert_u)$. This concludes the proof.

6 General Cost Functions

In previous sections, we considered machines with linear cost functions. In this section, we extend the result to a broader class of cost functions, where each machine c is equipped with a general cost function gc.

Definition 17.

A cost function $g : \mathbb{R}_+ \to \mathbb{R}_+$ is well-behaved if it satisfies the following:

  1. $g(0) = 0$;

  2. g is increasing;

  3. there exists a constant $a > 1$ such that for all $x \ge 1$ and $\delta > 0$, $g((1+\delta)x) \le (1+\delta)^a\, g(x)$.

These restrictions on a cost function are natural: the cost of receiving zero bits should be zero, and the cost of receiving additional bits should be positive. The last condition states that a cost function cannot grow faster than some polynomial at any point. This condition is not needed for the lower bound; without it, however, it is difficult to obtain a matching upper bound, since receiving even one bit more than expected could otherwise increase the cost of a machine arbitrarily.

Definition 18.

For a well-behaved cost function g, define the function $g^\ast : \mathbb{R}_+ \to \mathbb{R}_+$ by:

$g^\ast(L) := \max\{x \in \mathbb{R}_+ : g(x) \le L\}$

Under the above definition, $g_c^\ast(L)$ can be interpreted as the maximum number of bits that the cost function permits machine c to receive with load at most L. The restriction $g_c(0) = 0$ implies that if $g_c^\ast(L)$ is defined for some L, it is also defined for all $L' \in \mathbb{R}_+$ with $L' < L$.
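Since g is increasing, $g^\ast(L)$ can be computed numerically by binary search. A sketch (our own code; it assumes some upper bound hi on the answer is known, e.g. the total number of input bits):

```python
def g_star(g, L, hi, tol=1e-9):
    """g*(L) = max { x : g(x) <= L } for an increasing g with g(0) = 0,
    found by binary search on [0, hi]."""
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if g(mid) <= L:
            lo = mid  # mid is feasible, so the maximum lies to its right
        else:
            hi = mid
    return lo

# Example: g(x) = x^2 / w with w = 4 gives g*(L) = sqrt(L * w).
print(g_star(lambda x: x * x / 4.0, L=9.0, hi=100.0))  # -> ~6.0
```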

6.1 Lower Bound

Given a query q, consider any fractional edge packing $\mathbf{u} = (u_1,\dots,u_l)$ with $u = \sum_{j\in[l]} u_j$. Suppose each relation has uniform cardinality m. Then, define $\bar L^{\mathrm{lower}}_{\mathbf{u}}$ to be the minimum $L \ge 0$ that satisfies the following inequality:

$\sum_{c\in[p]} \left(g_c^\ast(L)\right)^u \ge m^u$
Theorem 19.

Let q be a CQ and let $\mathbf{u} = (u_1,\dots,u_l)$ be a fractional edge packing for q. Consider the uniform probability distribution of matching databases with m tuples per relation over domain [n]. Denote by $E_I[|q(I)|]$ the expected number of output tuples over instances I drawn from this distribution. Then any one-round algorithm with well-behaved cost functions $\{g_c\}_c$ that in expectation outputs at least $E_I[|q(I)|]$ tuples has load $\Omega(\bar L^{\mathrm{lower}}_{\mathbf{u}})$. The same lower bound holds for the uniform distribution of θ-dense instances over domain [n].

Proof.

This proof is similar to the proof of Theorem 11. We apply Lemma 16 again and sum over all machines:

$E\left[|K_{msg}(q(I))|\right] \le E\left[|q(I)|\right] \sum_{c\in[p]} \prod_{j\in[l]} f_{c,j}^{\,u_j} \le E\left[|q(I)|\right] \sum_{c\in[p]} \prod_{j\in[l]} \left(\frac{g_c^\ast(L)}{M}\right)^{u_j} = E\left[|q(I)|\right] \sum_{c\in[p]} \left(\frac{g_c^\ast(L)}{M}\right)^{u}$

Here the second inequality comes from $f_{c,j}\, M \le g_c^\ast(L)$, since a machine cannot receive more bits than the load permits. Using $M = \Omega(m)$ by Lemma 14 or 15, and requiring that $E[|K_{msg}(q(I))|] \ge E[|q(I)|]$, proves the theorem.

The highest lower bound is given by the $\mathbf{u}$ that maximizes $\bar L^{\mathrm{lower}}_{\mathbf{u}}$. We now prove that the maximum fractional edge packing always gives the best lower bound. We assume that $m \ge 1$, i.e., the database is not empty.

Lemma 20.

Let $\mathbf{u}^\ast$ be the maximum fractional edge packing. Then, $\bar L^{\mathrm{lower}}_{\mathbf{u}^\ast} = \max_{\mathbf{u}} \bar L^{\mathrm{lower}}_{\mathbf{u}}$.

Proof.

Let $L^\ast = \bar L^{\mathrm{lower}}_{\mathbf{u}^\ast}$, and suppose L is the lower bound given by another edge packing $\mathbf{u}$ with value $u \le u^\ast$. It suffices to show that $L^\ast$ satisfies $\sum_{c\in[p]} (g_c^\ast(L^\ast))^{u} \ge m^{u}$, since then the smallest load satisfying this inequality, namely L, is at most $L^\ast$.

Note that for all c, $(g_c^\ast(L^\ast))^{u - u^\ast} \ge m^{u - u^\ast}$, since $g_c^\ast(L^\ast) \le m$ and $u - u^\ast \le 0$. Then,

$\sum_{c\in[p]} \left(g_c^\ast(L^\ast)\right)^{u} = \sum_{c\in[p]} \left(g_c^\ast(L^\ast)\right)^{u^\ast} \left(g_c^\ast(L^\ast)\right)^{u - u^\ast} \ge m^{u - u^\ast} \sum_{c\in[p]} \left(g_c^\ast(L^\ast)\right)^{u^\ast} \ge m^{u - u^\ast}\, m^{u^\ast} = m^{u}$

where the last inequality follows from the fact that $L^\ast$ satisfies $\sum_{c\in[p]} (g_c^\ast(L^\ast))^{u^\ast} \ge m^{u^\ast}$.

Example 21.

As an example, consider cost functions of the form $g_c(x) = x^a / w_c$, where $a > 0$. Then $g_c^\ast(L) = (L w_c)^{1/a}$. The lower bound then becomes:

$L \ge \max_{\mathbf{u}} \frac{m^a}{\left(\sum_{c\in[p]} w_c^{u/a}\right)^{a/u}}$

6.2 Upper Bound

We give an algorithm for evaluating full CQs with equal-cardinality atoms, where each cost function $g_c$ is well-behaved. The approach is similar to that for linear cost functions, but we need another method to pick the dimensions of each hyperrectangle $\Lambda_c$.

The algorithm requires the numerical value of $\bar L^{\mathrm{lower}} = \max_{\mathbf{u}} \bar L^{\mathrm{lower}}_{\mathbf{u}}$, the lower bound on the load. We therefore need a method to find $\bar L^{\mathrm{lower}}_{\mathbf{u}}$ for the maximum fractional edge packing $\mathbf{u}$. For this, we need to find the minimal L for which $f(L) = \sum_{c\in[p]} (g_c^\ast(L))^u - m^u$ is nonnegative. We know that L is more than 0 and at most $L_{\max} = \min_{c\in[p]} g_c(m)$, since the query can be computed with load $L_{\max}$ using just one machine. $\bar L^{\mathrm{lower}}_{\mathbf{u}}$ can thus be found by binary search on this interval, since f is nondecreasing.
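A sketch of this search (our own code; g_stars holds the functions $g_c^\ast$ of Definition 18, which can themselves be computed as above):

```python
def lower_bound_load(g_stars, u, m, L_max, tol=1e-9):
    """Smallest L in (0, L_max] with sum_c g_c*(L)^u >= m^u, found by
    binary search. Feasibility is monotone in L because every g_c* is
    nondecreasing; L_max = min_c g_c(m) is always a valid upper end."""
    def feasible(L):
        return sum(gs(L) ** u for gs in g_stars) >= m ** u
    lo, hi = 0.0, L_max
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if feasible(mid):
            hi = mid  # mid suffices, so the minimum lies to its left
        else:
            lo = mid
    return hi
```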

Given the value $L = \bar L^{\mathrm{lower}}$, our algorithm computes the hyperrectangles using the same general technique as in Section 4, with the difference that the dimensions are chosen differently. In particular, for the minimum vertex cover $\mathbf{v}$, we calculate dimension i for machine c as follows:

$\lambda_{c,i} := \left(\frac{g_c^\ast(L)}{m}\right)^{v_i} \cdot n$

As we show in the full paper [5], the dimensions we choose are such that we can still apply the same packing technique as in Section 4. Thus:

Theorem 22 (Dense Inputs).

Let q be a full CQ with uniform arity r and a θ-dense input I with domain [n] (every relation has size $m = \theta n^r$). Then, we can evaluate q in one round with well-behaved cost functions $\{g_c\}_c$ with load $O(\bar L^{\mathrm{lower}} \log n)$.

Theorem 23 (Sparse Inputs).

Let q be a full CQ over a matching instance I with uniform cardinalities m and domain [n]. Then, we can evaluate q in one round with well-behaved cost functions $\{g_c\}_c$ with load $O(\bar L^{\mathrm{lower}} \log n)$, with high probability.

The proofs of the above theorems are provided in the full paper.

7 Different Cardinality Relations

We now move on to the general case where we do not require the cardinality of every atom to be the same. We will assume linear cost functions, that is, cost functions have the form gc(N)=N/wc. We will give a general lower bound and matching upper bounds for the cartesian product, the binary join, the star query and the triangle query.

7.1 Lower Bound

An important difference between the lower bound we present next and the lower bound for queries of equal cardinalities is that we need to consider a different edge packing for each machine. We denote the edge packing for query q and machine c by $\mathbf{u}_c = (u_{c,1},\dots,u_{c,l})$.

Theorem 24.

Let q be a CQ and let $\mathbf{u}_c = (u_{c,1},\dots,u_{c,l})$ be any fractional edge packing for q and machine c. Consider the uniform probability distribution of matching databases with $m_j$ tuples in relation $S_j$ over domain [n]. Denote by $E_I[|q(I)|]$ the expected number of output tuples over instances I drawn from this distribution. Then, any one-round algorithm with linear cost functions that in expectation outputs at least $E_I[|q(I)|]$ tuples has load $\Omega(L)$, where L is the smallest load that satisfies the following inequality:

$\sum_{c\in[p]} \prod_{j\in[l]} \left(\frac{L w_c}{m_j}\right)^{u_{c,j}} \ge 1 \qquad (1)$

The same lower bound holds for the uniform distribution of θ-dense instances over domain [n].

Proof.

The proof is similar to the proofs of the previous lower bounds. We use Lemma 16 to bound the number of output tuples produced by one machine, and require that all machines together produce at least $E[|q(I)|]$ output tuples:

$E\left[|K^c_{msg}(q(I))|\right] \le \prod_{j\in[l]} f_{c,j}^{\,u_{c,j}}\, E\left[|q(I)|\right] = E\left[|q(I)|\right] \prod_{j\in[l]} \left(\frac{f_{c,j} M_j}{M_j}\right)^{u_{c,j}} \le E\left[|q(I)|\right] \prod_{j\in[l]} \left(\frac{L w_c}{M_j}\right)^{u_{c,j}}$

Here we used that $f_{c,j} M_j \le L w_c$: this is because $L w_c$ is the maximum number of bits machine c can receive from each relation while keeping the load at most L. The theorem follows by summing across all machines and applying Lemma 14 or 15.

The highest lower bound L is given by the set of edge packings $\{\mathbf{u}_c\}_{c\in[p]}$, one for each machine, that maximizes the load needed to satisfy inequality (1). Next, we show how to compute the numerical value of L, which will be used by the upper bound.

Lemma 25.

Let L denote the maximum lower bound on the load from Theorem 24. Then:

$\frac{\max_j m_j}{\sum_{c\in[p]} w_c} \;\le\; L \;\le\; \frac{\max_j m_j}{\max_{c\in[p]} w_c}$
Proof.

We start with the first inequality. Note that the edge packing that assigns $u_j = 1$ to the atom $S_j$ of maximum cardinality and weight 0 to all other atoms is a valid edge packing. For this packing, inequality (1) becomes $\sum_{c\in[p]} \frac{L w_c}{m_j} \ge 1$, which gives the lower bound $L \ge m_j / \sum_{c\in[p]} w_c = \max_j m_j / \sum_{c\in[p]} w_c$.

For the second inequality, if only the machine with the largest weight is used in the computation, the load is $\max_j m_j / \max_{c\in[p]} w_c$ (up to a constant factor, since the query has a constant number of atoms). This load can therefore always be achieved by computing the query on the largest machine alone. Since L is the optimal load, it is never larger than this.

Note that $p \max_{c\in[p]} w_c \ge \sum_{c\in[p]} w_c$. Together with the lemma above, this shows that the range of possible values for L spans at most a factor

$\frac{\max_j m_j / \max_{c\in[p]} w_c}{\max_j m_j / \sum_{c\in[p]} w_c} = \frac{\sum_{c\in[p]} w_c}{\max_{c\in[p]} w_c} \le \frac{p \max_{c\in[p]} w_c}{\max_{c\in[p]} w_c} = p$

Since the range of possible values of L spans at most a factor p, we can find L by starting with the guess $\hat L = \max_j m_j / \sum_{c\in[p]} w_c$. We check whether the guess is correct by finding the edge packings $\{\mathbf{u}_c\}_{c\in[p]}$ for each machine and checking whether $\sum_{c\in[p]} \prod_{j\in[l]} (\hat L w_c / m_j)^{u_{c,j}}$ is at least 1. If this is not the case, we double our guess $\hat L$. Since the range of possible values of L spans only a factor p, we iterate this procedure at most $\log p$ times, as sketched below.
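A sketch of this doubling search (our own code; the per-machine edge packing is computed with the log-space linear program described in the next subsection, here via scipy):

```python
import math
from scipy.optimize import linprog

def machine_term(L, w_c, cards, atom_vars, k):
    """Minimum over edge packings u_c of prod_j (L*w_c/m_j)^{u_{c,j}}:
    an LP in log space (minimize sum_j u_j * log(L*w_c/m_j) subject to
    the packing constraints)."""
    l = len(cards)
    c = [math.log(L * w_c / m_j) for m_j in cards]
    A = [[1.0 if i in atom_vars[j] else 0.0 for j in range(l)]
         for i in range(k)]
    res = linprog(c, A_ub=A, b_ub=[1.0] * k, bounds=[(0.0, None)] * l)
    return math.exp(res.fun)

def find_load(weights, cards, atom_vars, k):
    """Doubling search for the lower bound L of Theorem 24: start at the
    floor of Lemma 25 and double until inequality (1) holds; at most
    log p doublings, since the range spans a factor p."""
    L = max(cards) / sum(weights)
    while sum(machine_term(L, w, cards, atom_vars, k) for w in weights) < 1.0:
        L *= 2.0
    return L
```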

7.2 Upper Bound

We now show how to match the lower bound for the cartesian product, the binary join, the star query, and the triangle query. For general full CQs, finding a matching algorithm remains an open problem. The difficulty is that the lower bound of Theorem 24 may assign a different edge packing $\mathbf{u}_c$ to each machine. A linear program for the edge packing of a machine can be obtained by minimizing the logarithm of $\prod_{j\in[l]} (L w_c / m_j)^{u_{c,j}}$. By considering the dual program, it is possible to find hyperrectangles $\Lambda_c$ such that the expected load of each machine matches the lower bound and the total volume of all $\Lambda_c$ covers Λ, similar to what was done in [2] for homogeneous machines. However, for the packing to work, we need all sides of the hyperrectangle $\Lambda_c$ to increase as the weight of the machine increases; it is not clear how to guarantee this, or whether it is possible. We have, however, been able to match the lower bound for specific queries, using the same algorithm as in previous sections but modifying how the shapes of the subspaces are picked. Here, we describe how to do this for the cartesian product and the binary join. Proofs of correctness are provided in the full paper [5]; there, we also show that a property similar to Lemma 6 holds, which means we can reuse the packing method presented earlier, and we also handle the star query and the triangle query.

Cartesian Product

We consider the cartesian product q(x,y) :- S1(x),S2(y).

Let L be the lower bound on the load. For machine c, let the side length of $\Lambda_c$ along dimension $j \in \{1, 2\}$ be

$\lambda_{c,j} := \min\left(\frac{L w_c}{M_j},\, 1\right) \cdot n = \left(\frac{L w_c}{M_j}\right)^{u_{c,j}} n$

where the edge packing of machine c sets $u_{c,j} = 1$ if $L w_c < M_j$ and $u_{c,j} = 0$ otherwise.

Binary Join

Next, consider the binary join.

q(x,y,z) :- S1(x,z),S2(y,z)

Let L be the lower bound on the load. Let the side lengths of Λc be the following:

$\lambda_{c,x} := n, \qquad \lambda_{c,y} := n, \qquad \lambda_{c,z} := \frac{L w_c}{\max(M_1, M_2)} \cdot n$
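Both choices are straightforward to compute; a small sketch (our own code, with the cap at n in the join's z-dimension added as a safeguard, since a side never needs to exceed the domain):

```python
def cartesian_sides(L, weights, M, n):
    """q(x,y) :- S1(x), S2(y): machine c covers a min(L*w_c/M_j, 1)
    fraction of dimension j, where M[j] is the encoding size of S_{j+1}."""
    return [[min(L * w / M_j, 1.0) * n for M_j in M] for w in weights]

def binary_join_sides(L, weights, M1, M2, n):
    """q(x,y,z) :- S1(x,z), S2(y,z): the full range on x and y, and a
    slice of z proportional to the machine's weight."""
    return [(n, n, min(L * w / max(M1, M2), 1.0) * n) for w in weights]
```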

Matching the Lower Bound

The theorems below show that the algorithms above match the lower bound. Proofs are provided in the full paper [5].

Theorem 26 (Dense).

Let q be one of the cartesian product, the binary join, the star query, and the triangle query over a θ-dense input I, where the arities $r_j$ of the relations need not be uniform. Let L be the lower bound on the load. Then we can evaluate q on heterogeneous machines with weights $w_1,\dots,w_p$ with load $O(L \log n)$.

Theorem 27 (Sparse).

Let q be one of the cartesian product, the binary join, the star query, and the triangle query over a matching database I where atom $S_j$ has cardinality $m_j$. Let L be the lower bound on the load. Then we can evaluate q on heterogeneous machines with weights $w_1,\dots,w_p$ with load $O(L \log n)$, with high probability.

8 Conclusion

In this paper, we studied the problem of computing full Conjunctive Queries in parallel on heterogeneous machines. Our algorithms are inspired by the HyperCube algorithm but take the new approach of optimally partitioning the space of possible output tuples among the machines. This yields an optimal algorithm for queries where all relations have the same cardinality, for both linear and more general cost functions, and optimal algorithms for specific queries when the relations have arbitrary cardinalities.

References

  • [1] Foto N. Afrati and Jeffrey D. Ullman. Optimizing joins in a map-reduce environment. In Ioana Manolescu, Stefano Spaccapietra, Jens Teubner, Masaru Kitsuregawa, Alain Léger, Felix Naumann, Anastasia Ailamaki, and Fatma Özcan, editors, EDBT 2010, 13th International Conference on Extending Database Technology, Lausanne, Switzerland, March 22-26, 2010, Proceedings, volume 426 of ACM International Conference Proceeding Series, pages 99–110. ACM, 2010. doi:10.1145/1739041.1739056.
  • [2] Paul Beame, Paraschos Koutris, and Dan Suciu. Communication steps for parallel query processing. J. ACM, 64(6):40:1–40:58, 2017. doi:10.1145/3125644.
  • [3] Spyros Blanas, Paraschos Koutris, and Anastasios Sidiropoulos. Topology-aware parallel data processing: Models, algorithms and systems at scale. In 10th Conference on Innovative Data Systems Research, CIDR 2020, Amsterdam, The Netherlands, January 12-15, 2020, Online Proceedings. www.cidrdb.org, 2020. URL: http://cidrdb.org/cidr2020/papers/p10-blanas-cidr20.pdf.
  • [4] Benoît Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. The snowflake elastic data warehouse. In Fatma Özcan, Georgia Koutrika, and Sam Madden, editors, Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 215–226. ACM, 2016. doi:10.1145/2882903.2903741.
  • [5] Simon Frisk and Paraschos Koutris. Parallel query processing with heterogeneous machines, 2025. arXiv:2501.08896.
  • [6] Xiao Hu. Cover or pack: New upper and lower bounds for massively parallel joins. In Leonid Libkin, Reinhard Pichler, and Paolo Guagliardo, editors, PODS’21: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Virtual Event, China, June 20-25, 2021, pages 181–198. ACM, 2021. doi:10.1145/3452021.3458319.
  • [7] Xiao Hu and Paraschos Koutris. Topology-aware parallel joins. Proc. ACM Manag. Data, 2(2):97, 2024. doi:10.1145/3651598.
  • [8] Xiao Hu, Paraschos Koutris, and Spyros Blanas. Algorithms for a topology-aware massively parallel computation model. In Leonid Libkin, Reinhard Pichler, and Paolo Guagliardo, editors, PODS’21: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Virtual Event, China, June 20-25, 2021, pages 199–214. ACM, 2021. doi:10.1145/3452021.3458318.
  • [9] Xiao Hu and Yufei Tao. Parallel acyclic joins: Optimal algorithms and cyclicity separation. J. ACM, 71(1):6:1–6:44, 2024. doi:10.1145/3633512.
  • [10] Xiao Hu and Ke Yi. Instance and output optimal parallel algorithms for acyclic joins. In Dan Suciu, Sebastian Skritek, and Christoph Koch, editors, Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 450–463. ACM, 2019. doi:10.1145/3294052.3319698.
  • [11] Bas Ketsman, Dan Suciu, and Yufei Tao. A near-optimal parallel algorithm for joining binary relations. Log. Methods Comput. Sci., 18(2), 2022. doi:10.46298/LMCS-18(2:6)2022.
  • [12] Paraschos Koutris, Paul Beame, and Dan Suciu. Worst-case optimal algorithms for parallel query processing. In Wim Martens and Thomas Zeume, editors, 19th International Conference on Database Theory, ICDT 2016, Bordeaux, France, March 15-18, 2016, volume 48 of LIPIcs, pages 8:1–8:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2016. doi:10.4230/LIPICS.ICDT.2016.8.
  • [13] Yufei Tao. A simple parallel algorithm for natural joins on binary relations. In Carsten Lutz and Jean Christoph Jung, editors, 23rd International Conference on Database Theory, ICDT 2020, March 30-April 2, 2020, Copenhagen, Denmark, volume 155 of LIPIcs, pages 25:1–25:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPICS.ICDT.2020.25.
  • [14] Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. Amazon aurora: Design considerations for high throughput cloud-native relational databases. In Semih Salihoglu, Wenchao Zhou, Rada Chirkova, Jun Yang, and Dan Suciu, editors, Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 1041–1052. ACM, 2017. doi:10.1145/3035918.3056101.