
Generalized Covers for Conjunctive Queries

Paraschos Koutris, University of Wisconsin-Madison, WI, USA
Abstract

Covers of query results were introduced as succinct lossless representations of join query outputs. A cover is a subset of the query result from which we can efficiently enumerate the output with constant delay and linear preprocessing time. However, covers are dependent on a single tree decomposition of the query. In this work, we generalize the notion of a cover to a set of multiple tree decompositions. We show that this generalization can potentially produce asymptotically smaller covers while maintaining the properties of constant-delay enumeration and linear preprocessing time. In particular, given a set of tree decompositions, we can determine exactly the asymptotic size of a minimum cover, which is tied to the notion of entropic width of the query. We also provide a simple greedy algorithm that computes this cover efficiently. Finally, we relate covers to semiring circuits when the semiring is idempotent.

Keywords and phrases:
Conjunctive Query, tree decomposition, cover
Copyright and License:
© Paraschos Koutris; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Database theory
Editors:
Sudeepa Roy and Ahmet Kara

1 Introduction

Join queries are one of the fundamental operations in relational databases. However, the output of a join query can be potentially huge when executed over a large database instance. This is particularly relevant when a join output needs to be sent over a network link in a distributed setting, or if the output result needs to be stored before being processed in a downstream task. In such scenarios, it is often desirable to construct concise representations of the query output such that we can reconstruct the query result as efficiently as possible when and where it is needed. Several succinct data structures have been proposed in the literature, including factorized databases [16] and compressed representations [7].

A cover is one such succinct and lossless representation of a join output, introduced by Kara and Olteanu [13]. A cover is simply a subset of the query result that, together with a given tree decomposition of the query, can efficiently reconstruct the full query output. Here, efficient means that after a preprocessing step that is linear in the size of the representation, we can enumerate the output with a constant delay guarantee. One of the key results in [13] is that, given a tree decomposition T of the query and a database instance of size N, we can always produce a cover of size O(N^𝖿𝗁𝗐(T)), where 𝖿𝗁𝗐(T) is the fractional hypertree width of T. Thus, choosing a tree decomposition of minimum width gives a cover of size O(N^𝖿𝗁𝗐), which can be shown to be asymptotically optimal. This bound is a large improvement over storing the full output, which can be as large as the AGM bound [2], i.e., Ω(N^ρ), where ρ is the fractional edge cover number of the query.

The main insight of this work is that it is possible to construct covers of even smaller asymptotic size if we define the cover to depend not on a single decomposition, but on multiple tree decompositions. This is analogous to the PANDA algorithm [14], which improves the runtime of query evaluation from O(N^𝖿𝗁𝗐) to O(N^𝗌𝗎𝖻𝗐) by considering multiple tree decompositions simultaneously. Here, 𝗌𝗎𝖻𝗐 is the submodular width of the query. Importantly, even though we allow a dependence on multiple tree decompositions, we can still show that the guarantee of linear-time preprocessing with constant-delay enumeration remains unchanged. Thus, this generalized notion of a cover produces an even more succinct and efficient representation of the query output.

In fact, we will show in this paper that by considering the (finite) set of all possible tree decompositions, we can construct covers of size O(N^𝖾𝗇𝗍𝗐), where 𝖾𝗇𝗍𝗐 is the entropic width of the query, a width measure that is at least as small as the submodular width. As a consequence, we show the existence of a data structure of size O(N^𝖾𝗇𝗍𝗐) from which we can enumerate the query output with constant delay. We should note here that, unfortunately, the time to construct this data structure can be much larger than O(N^𝖾𝗇𝗍𝗐), and so the cover construction does not yield a faster query evaluation algorithm.

From a theoretical point of view, this makes progress on the following fundamental question: what is the smallest possible compression of a query result that still guarantees constant-delay enumeration (i.e., fast decompression)? From a practical point of view, the enumeration guarantee depends on the number of tree decompositions, which can be exponentially large in the query. However, we show that as we add more decompositions the size of the cover can only improve; hence, even a small set of decompositions can lead to a large improvement in the cover size compared to considering only a single one. The algorithm we present that constructs a small cover – even though it needs access to the full query result – requires only a linear scan over the query result.

Our Contribution.

We summarize our contributions as follows:

  • We generalize the notion of a cover of a query result to consider multiple tree decompositions (Section 4). We prove several interesting properties for covers, and show that given a cover K we can spend O(|K|) preprocessing time to enumerate the query output with constant delay.

  • We present a simple greedy algorithm (Section 5) that, given the query output, produces an asymptotically optimal cover in time linear w.r.t. the query output. Moreover, we show an upper bound that uses not only the input size N as a parameter, but any combination of degree constraints, including different cardinalities, functional dependencies, and degree bounds.

  • We provide a lower bound (Section 6) that asymptotically matches our upper bound. The lower bound uses a technically interesting connection to disjunctive Datalog rules to find a worst-case instance for a cover.

  • Finally, we show in Section 7 how we can use a cover to produce a semiring circuit, of size linear in the cover size, for the provenance polynomial of the query over any idempotent semiring. A semiring circuit can be thought of as a factorization of the query output (more precisely, a factorization of the polynomial associated with the query output), with the additional flexibility that it can use different tree decompositions in its internal representation.

2 Related Work

In this section, we present work that relates to covers and in general efficient representations of query results.

Factorized Databases and Circuits.

As shown in [13], covers over a single tree decomposition are related to d-representations over the same tree decomposition. d-representations (and f-representations) are also lossless representations of query outputs, which allow for efficient enumeration, aggregation, and even ML model computation [18]. Query outputs can also be concisely stored using circuits [9, 1] – these circuits represent the polynomial corresponding to the join query and can be evaluated under different semiring semantics. As we will show later in the paper, analogous to the connection between covers and d-representations, there is also a direct connection between generalized covers and semiring circuits.

Compressed Representations of Outputs.

Recent work has also looked at how we can further decrease the size of a representation by trading off the enumeration delay guarantee [7, 19, 6]. Such a tradeoff allows for asymptotically smaller succinct data structures, but with an increased possible delay when outputting two consecutive outputs. Even though it would be theoretically possible to consider covers with weaker enumeration guarantees, in this work we focus only on constant delay.

Preprocessing Time and Enumeration.

Several algorithms on join computation use a two-phase framework where a preprocessing phase is followed by an output enumeration phase [3]. For instance, for any join query with fractional hypertree width 𝖿𝗁𝗐, we need O(N^𝖿𝗁𝗐) preprocessing time to achieve constant-delay enumeration. The intermediate data structure constructed by the preprocessing phase can also be viewed as a succinct representation of the result; however, it is critical in this case to also have an efficient algorithm to construct the data structure. Recent work has looked at preprocessing time/size and enumeration tradeoffs in this setting [11, 12]. In our paper, even though we construct a very small succinct representation, the construction time can exceed the size of the representation, and thus our result does not lead to a better query evaluation algorithm.

3 Preliminaries

In this section, we present useful notation and terminology.

Conjunctive Queries.

A Conjunctive Query (CQ) Q is an expression associated with a hypergraph ℋ=([n],ℰ), where [n]={1,…,n}, and a set U⊆[n]:

Q(𝐱U) ← ⋀_{e∈ℰ} Re(𝐱e)

where each Re is a relation of arity |e|, the variables x1, x2, …, xn take values in some discrete domain, and 𝐱e := (xi)_{i∈e}. We say that Q is Boolean if U=∅ and full if U=[n].

Given a tuple t over [n] and a subset S⊆[n], we will use t[S] to denote the projection of the tuple on the attributes in S.

Tree Decompositions.

We recall the notion of a tree decomposition.

Definition 1 (Tree Decomposition).

A tree decomposition of a hypergraph ℋ is a pair (𝒯,χ), where 𝒯 is a tree and χ maps each node t of 𝒯 to a subset χ(t) of V(ℋ), called a bag, such that:

  1. every hyperedge e∈E(ℋ) is a subset of χ(t) for some t∈V(𝒯); and

  2. for every vertex v∈V(ℋ), the set {t ∣ v∈χ(t)} is a non-empty connected subtree of 𝒯.

We say that a tree decomposition is non-redundant if no bag is a subset of another; otherwise it is redundant. We let 𝖳𝖣(ℋ) be the set of all non-redundant tree decompositions of ℋ. This set is known to be finite; it can be shown that |𝖳𝖣(ℋ)| ≤ n!, where n is the number of vertices of the hypergraph (Proposition 2.9 in [15]).
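The two conditions of Definition 1, as well as non-redundancy, can be checked mechanically. The following Python sketch is our own illustration (the function names are ours, not from the paper); it represents a decomposition by its tree edges and a bag map, and checks it against the 4-cycle hypergraph used later in Example 5.

```python
from itertools import permutations

def is_tree_decomposition(vertices, hyperedges, tree_edges, bags):
    """Check Definition 1. `tree_edges` is a set of frozensets {t1,t2} over
    tree nodes; `bags` maps each tree node to a set of hypergraph vertices."""
    # Condition 1: every hyperedge is contained in some bag.
    if not all(any(set(e) <= bags[t] for t in bags) for e in hyperedges):
        return False
    # Condition 2: for every vertex, the tree nodes whose bags contain it
    # form a non-empty connected subtree (checked here via a BFS).
    for v in vertices:
        nodes = {t for t in bags if v in bags[t]}
        if not nodes:
            return False
        start = next(iter(nodes))
        seen, frontier = {start}, [start]
        while frontier:
            cur = frontier.pop()
            for edge in tree_edges:
                if cur in edge:
                    (other,) = edge - {cur}
                    if other in nodes and other not in seen:
                        seen.add(other)
                        frontier.append(other)
        if seen != nodes:
            return False
    return True

def is_non_redundant(bags):
    """Non-redundancy: no bag is a subset of another bag."""
    return not any(bags[u] <= bags[w] for u, w in permutations(bags, 2))
```

For the 4-cycle decomposition with bags {1,2,3} and {1,3,4} joined by one tree edge, both checks succeed.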

Entropic Functions.

Let n be a positive integer. A function h: 2^[n] → ℝ₊ is called a set function on [n]={1,2,…,n}. Given a discrete random variable X with support 𝒳 and probability distribution p: 𝒳 → [0,1], its entropy is defined as H(X) = −∑_{x∈𝒳} p(x) log p(x). When the probability distribution is uniform over the support set, then H(X) = log|𝒳|. Given a set of random variables X1,…,Xk, the joint entropy H(X1,…,Xk) is defined as the entropy of the random variable Y that represents the tuple (X1,…,Xk). A set function is an entropic function of order n if there exist random variables A1,…,An such that h(S) = H((Ai)_{i∈S}) for every S⊆[n]. We denote by Γn the set of all entropic functions of order n, and by Γ¯n the topological closure of Γn.¹ (¹The topological closure of a set S is defined as the smallest closed set that contains S.) The important property of Γ¯n is that it is a convex cone and hence characterized by the (infinitely many) linear inequalities it satisfies.
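As a small numeric illustration of these definitions (our own sketch, not from the paper), the following code computes the joint entropy of a projection of the uniform distribution over a set of tuples; for a uniform distribution over m distinct values, it returns log₂ m.

```python
import math
from collections import Counter

def joint_entropy(tuples, coords):
    """Entropy (in bits) of the projection, onto the coordinates in
    `coords`, of the uniform distribution over `tuples`."""
    counts = Counter(tuple(t[i] for i in coords) for t in tuples)
    n = len(tuples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Uniform distribution over 4 distinct pairs: H = log2(4) = 2 bits.
tuples = [(0, 0), (0, 1), (1, 0), (1, 1)]
assert abs(joint_entropy(tuples, [0, 1]) - 2.0) < 1e-9
# The marginal on the first coordinate is uniform over 2 values: 1 bit.
assert abs(joint_entropy(tuples, [0]) - 1.0) < 1e-9
```

Evaluating `joint_entropy` on all subsets of coordinates yields exactly an entropic set function h of order n.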

Entropic Width.

Let 𝖣𝖢 be a set of triples (X, Y, N_{Y|X}) with X⊆Y⊆[n], which encodes a set of degree constraints. A database instance I over the relational schema {Re}_{e∈ℰ} satisfies the constraints if for every relation Re in I with X⊆Y⊆e and every tuple t defined over schema X, we have |πY(Re ⋉ t)| ≤ N_{Y|X}. A constraint of the form (∅, e, Ne) is simply a cardinality constraint that says that relation Re has size at most Ne. The degree constraints on an instance can be translated into constraints on entropic functions as follows:

𝖧𝖣𝖢 := { h: 2^[n] → ℝ₊ ∣ ∀(X, Y, N_{Y|X})∈𝖣𝖢: h(Y|X) ≤ log N_{Y|X} }

where h(Y|X) := h(Y) − h(X). For a given set of degree constraints 𝖧𝖣𝖢, we define the “scaled-up” degree constraints 𝖧𝖣𝖢×k as the same set of constraints but with all degree bounds multiplied by k. It will also be helpful to define the entropic constraints in the case where we are only given a uniform cardinality constraint (the input size):

𝖤𝖣 := { h: 2^[n] → ℝ₊ ∣ h(e) ≤ 1 for all e∈ℰ }

We now define the entropic width (defined in [9]) and degree-aware entropic width (defined in [15] as eda-subw) w.r.t. a set of tree decompositions 𝕋 respectively:

entw_𝕋(ℋ, 𝖤𝖣) := max_{h∈Γ¯n∩𝖤𝖣} min_{(𝒯,χ)∈𝕋} max_{v∈V(𝒯)} h(χ(v))
da-entw_𝕋(ℋ, 𝖧𝖣𝖢) := max_{h∈Γ¯n∩𝖧𝖣𝖢} min_{(𝒯,χ)∈𝕋} max_{v∈V(𝒯)} h(χ(v))

When 𝕋=𝖳𝖣(ℋ), we obtain the standard notions of (degree-aware) entropic width.

Computational Model.

We use the uniform-cost RAM model where data values as well as pointers to databases are of constant size. Our runtime analysis will consider data complexity (where the query is considered fixed) unless otherwise stated.

4 Generalized Covers

We start by generalizing the notion of a cover of a query result from one to multiple tree decompositions.

Definition 2 (Cover).

Let Q be a full CQ with hypergraph ℋ, D be an instance, and 𝕋 be a finite set of tree decompositions of ℋ. A relation K over schema [n] is a cover of Q(D) w.r.t. 𝕋 if

⋃_{(𝒯,χ)∈𝕋} ⋈_{t∈V(𝒯)} (π_{χ(t)} K) = Q(D).

In other words, if for every decomposition we project the cover to its bags and then join the bags together, the union of these outputs must form exactly Q(D). When the set 𝕋 consists of a single tree decomposition, we recover the notion of a cover from [13]. In our more general definition, the reconstruction of the output Q(D) is based not on a single tree decomposition, but on a set of tree decompositions.
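Definition 2 can be checked by brute force on small instances. The following Python sketch (our own illustration; tuples are indexed by 0-based variable position, bags are tuples of positions) computes the union of the bag-projection joins over the active domain and compares it with Q(D).

```python
from itertools import product

def proj(t, bag):
    """π_bag(t): project a tuple onto the positions in `bag`."""
    return tuple(t[i] for i in bag)

def in_join(K, bags, t):
    """t ∈ ⋈_{B ∈ bags} π_B(K): since t is a full tuple and the bags of a
    decomposition cover all variables, membership holds iff every bag
    projection of t appears in the corresponding projection of K."""
    return all(proj(t, B) in {proj(s, B) for s in K} for B in bags)

def is_cover(K, decompositions, Q_D, n):
    """Brute-force check of Definition 2 over the active domain: the union,
    over all decompositions, of the joins of bag projections of K must
    equal Q(D)."""
    domain = {v for t in (set(Q_D) | set(K)) for v in t}
    generated = {t for t in product(sorted(domain), repeat=n)
                 if any(in_join(K, bags, t) for bags in decompositions)}
    return generated == set(Q_D)
```

For instance, with one decomposition whose bags are (0,1) and (1,2), the output {(0,0,0), (1,1,1)} is covered by itself but not by the single tuple (0,0,0).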

Proposition 3.

Let Q be a full CQ, D be an instance, and 𝕋 be a set of tree decompositions of Q. If K is a cover of Q(D) w.r.t. 𝕋, then K ⊆ Q(D).

Proof.

This follows immediately from the fact that for every tree decomposition (𝒯,χ) and relational instance K, we have K ⊆ ⋈_{t∈V(𝒯)} (π_{χ(t)} K).

 Remark 4.

It is useful to contrast this with the original definition of a cover used in [13] for a single tree decomposition (𝒯,χ). In that definition, a cover K was equivalently defined as a set of tuples for which π_{χ(v)}(K) = π_{χ(v)}(Q(D)) for every v∈V(𝒯). However, such an equivalent characterization no longer holds in the case of multiple tree decompositions, since each decomposition may be responsible for producing only a (strict) subset of the query output.

We say that a cover K is minimal w.r.t. 𝕋 if there is no strict subset K′⊊K that is also a cover w.r.t. 𝕋.² (²We should note here that in [13] a cover always refers to a minimal cover. In this paper, however, we will also consider covers that may not be minimal.) A cover is minimum w.r.t. 𝕋 if it has the smallest possible size across all covers w.r.t. 𝕋.

Example 5.

We will use as an example the 4-cycle query:

Q(x1,x2,x3,x4) ← R1(x1,x2), R2(x2,x3), R3(x3,x4), R4(x4,x1).

This CQ has only two non-redundant tree decompositions: T1 with bags {1,2,3} and {1,3,4}, and T2 with bags {1,2,4} and {2,3,4}. We consider the following relational instance D, where each relation has 2n tuples and thus the input size is |D|=Θ(n). Note also that |Q(D)|=2n², hence the output size is quadratic w.r.t. the input size.

Suppose now that we want to find a cover w.r.t. {T1} only. Using a result from [13], we know that the size of any cover w.r.t. a single decomposition is at least the size of the largest bag projection of the output. For T1, both bags {1,2,3} and {1,3,4} have projections of size Θ(n²), hence the cover has size Ω(n²). A similar reasoning provides a lower bound of Ω(n²) for the size of any cover w.r.t. T2. Hence, no matter which tree decomposition we use, a cover w.r.t. a single tree decomposition has size Ω(n²).

Now, suppose we want to find a cover w.r.t. {T1,T2}, i.e., we will include both non-redundant tree decompositions. We claim that the following set K of size only O(n) is a cover w.r.t. {T1,T2}:

Indeed, partition D = Da ∪ Db, where Da, Db contain the tuples with a-values and b-values respectively, and similarly partition K into Ka, Kb. Then, one can verify that π124(Ka) ⋈ π234(Ka) = Q(Da) and π123(Kb) ⋈ π134(Kb) = Q(Db). In other words, we use decomposition T2 to guide the covering of the a-tuples, and T1 to guide the covering of the b-tuples. Note that π123(Ka) ⋈ π134(Ka) ≠ Q(Da).

As we will show later in the paper, for the 4-cycle query we can always produce a cover of size O(|D|3/2) using both tree decompositions. In this example, we are able to do even better and produce a cover of only linear size.

4.1 Basic Properties of Covers

The following propositions establish some basic facts about covers.

Proposition 6.

Let Q be a full CQ, D be an instance, and 𝕋, 𝕋′ be two sets of tree decompositions of Q such that 𝕋 ⊆ 𝕋′. If K is a cover of Q(D) w.r.t. 𝕋, then K is a cover of Q(D) w.r.t. 𝕋′ as well.

Proof.

Indeed, we can write the following:

Q(D) = ⋃_{(𝒯,χ)∈𝕋} ⋈_{v∈V(𝒯)} π_{χ(v)}K ⊆ ⋃_{(𝒯,χ)∈𝕋′} ⋈_{v∈V(𝒯)} π_{χ(v)}K
⋃_{(𝒯,χ)∈𝕋′} ⋈_{v∈V(𝒯)} π_{χ(v)}K ⊆ ⋃_{(𝒯,χ)∈𝕋′} ⋈_{v∈V(𝒯)} (π_{χ(v)}Q(D)) = Q(D)

In the above chain, the first inequality of the second line follows from Proposition 3, which implies K ⊆ Q(D), while the last equality follows from the fact that for every tree decomposition (𝒯,χ), ⋈_{v∈V(𝒯)} (π_{χ(v)}Q(D)) = Q(D).

The above proposition tells us that if we add more tree decompositions in a set 𝕋, the minimum cover can never increase in size. But how large does 𝕋 need to be to achieve the smallest possible cover (among all possible tree decomposition sets)?

Proposition 7.

Let Q be a full CQ, D be an instance, and 𝕋 be a finite set of tree decompositions of Q. If K covers Q(D) w.r.t. 𝕋, then it covers Q(D) w.r.t. the (finite) set of all non-redundant tree decompositions 𝖳𝖣(ℋ).

Proof.

From Proposition 6, we have that K covers Q(D) w.r.t. 𝕋 ∪ 𝖳𝖣(ℋ). Hence:

⋃_{(𝒯,χ)∈𝕋∪𝖳𝖣(ℋ)} ⋈_{v∈V(𝒯)} π_{χ(v)}K = Q(D)

To complete the proof, we will show that

⋃_{(𝒯,χ)∈𝖳𝖣(ℋ)} ⋈_{v∈V(𝒯)} π_{χ(v)}K = ⋃_{(𝒯,χ)∈𝕋∪𝖳𝖣(ℋ)} ⋈_{v∈V(𝒯)} π_{χ(v)}K

It is straightforward that the left-hand side is contained in the right-hand side. For the other direction, let (𝒯,χ) be a redundant tree decomposition in 𝕋. Then, there exists a non-redundant tree decomposition (𝒯′,χ′) in 𝖳𝖣(ℋ) such that for every bag v′∈V(𝒯′), there exists a bag v∈V(𝒯) with χ(v)=χ′(v′) (this decomposition is constructed by eliminating the bags that are contained in some other bag). We will show that for any K over schema [n], it holds that ⋈_{v∈V(𝒯)} π_{χ(v)}K ⊆ ⋈_{v′∈V(𝒯′)} π_{χ′(v′)}K. Indeed, take any tuple t ∈ ⋈_{v∈V(𝒯)} π_{χ(v)}K and consider some bag v′∈V(𝒯′). Then, there exists v∈V(𝒯) with χ(v)=χ′(v′). Hence, π_{χ′(v′)}(t) = π_{χ(v)}(t) ∈ π_{χ(v)}(K) = π_{χ′(v′)}(K). This implies that t ∈ ⋈_{v′∈V(𝒯′)} π_{χ′(v′)}K as well.

The above proposition says that we can achieve the best possible cover in terms of size if we take our set of tree decompositions to be 𝖳𝖣(ℋ), which is finite.

4.2 From Covers to Constant-Delay Enumeration

In this section, we show an important property of covers. Given a cover K of Q(D) w.r.t. some set of tree decompositions, we show that with preprocessing time only linear w.r.t. |K|, it is possible to enumerate the output of a query Q with constant delay. A constant-delay enumeration algorithm means that the time between outputting two consecutive tuples in Q(D) (as well as the time to produce the first tuple, and the time to finish after producing the last tuple) is O(1) w.r.t. data complexity.

Theorem 8.

Let Q be a full CQ, D be an instance, and 𝕋 be a finite set of tree decompositions of Q. Suppose we are given a cover K for Q(D) w.r.t. 𝕋. Then, with preprocessing time O(|K|) we can enumerate Q(D) with constant delay.

Proof.

We first describe the preprocessing phase. We will use K to construct a materialized instance RB for every bag B in every decomposition of 𝕋. In particular, we scan K, and for every tuple t∈K, every decomposition (𝒯,χ) in 𝕋, and every bag v∈V(𝒯), we add the projection t[χ(v)] to the bag relation R_{χ(v)}. At this point, since K is a cover for Q(D), we know that Q(D) = ⋃_{(𝒯,χ)∈𝕋} ⋈_{v∈V(𝒯)} R_{χ(v)}.

Once we have materialized each bag, computing the join J_{(𝒯,χ)} = ⋈_{v∈V(𝒯)} R_{χ(v)} corresponds to computing an acyclic CQ. Indeed, by the definition of a tree decomposition, the query formed by treating each bag as a relation must be acyclic. Thus, we can use the standard result [3] that, with linear preprocessing time O(∑_{v∈V(𝒯)} |R_{χ(v)}|) = O(|K|), we can enumerate the results in J_{(𝒯,χ)} with constant delay. The last thing we need to address is that the same output tuple may be produced via more than one tree decomposition in 𝕋. To deal with this, we can apply Cheater’s lemma [5], which tells us that we can enumerate the union of a constant number of constant-delay enumeration algorithms with constant delay. Alternatively, we can also apply the method of Durand and Strozecki [8] for enumeration of unions of sets, as done in [4].
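The preprocessing scan in the proof above is a single pass over the cover; the following Python sketch (our illustration, with bags as tuples of 0-based positions) materializes the hash-indexed bag relations R_B.

```python
from collections import defaultdict

def materialize_bags(K, decompositions):
    """Preprocessing step from the proof of Theorem 8: one linear scan
    over the cover K builds a hash-indexed relation R_B for every bag B
    of every decomposition, so that Q(D) is the union, over
    decompositions, of the acyclic joins of the materialized bags."""
    R = defaultdict(set)
    for t in K:
        for bags in decompositions:
            for B in bags:
                R[B].add(tuple(t[i] for i in B))
    return R
```

The enumeration phase then runs a constant-delay algorithm for each acyclic join over these materialized bags and deduplicates across decompositions (via Cheater’s lemma); we do not sketch that part here.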

The main message of the above theorem is that considering more tree decompositions does not hurt the constant-delay enumeration property of covers if we consider data complexity.

 Remark 9.

Although the delay is constant w.r.t. data complexity, it depends linearly on the number of tree decompositions |𝕋|; more precisely, the delay will be O(|𝕋|·|Q|). If we choose to include all non-redundant tree decompositions, i.e., 𝕋=𝖳𝖣(ℋ), then the delay can become exponentially large in the size of the query, since 𝖳𝖣(ℋ) can be as large as n!. However, in practice one may not need to consider all tree decompositions to obtain a concise representation of the output.

5 Finding Small Covers

In this section, we will seek to find the smallest possible cover we can obtain for Q(D) w.r.t. any finite set of tree decompositions 𝕋. We will be interested in providing worst-case guarantees on the size, and not an instance-optimal construction of the minimum-sized cover. In particular, we will show the following theorem.

Theorem 10.

Given a full CQ Q with hypergraph ℋ and a database instance D that satisfies the degree constraints 𝖣𝖢, there exists a cover of Q(D) w.r.t. a finite set of tree decompositions 𝕋 of size O(2^{da-entw_𝕋(ℋ,𝖧𝖣𝖢)}). Moreover, given the output of the query Q(D), we can construct this cover in time O(|Q(D)|).

When 𝕋=𝖳𝖣(ℋ), the size bound simply becomes O(2^{da-entw(ℋ,𝖧𝖣𝖢)}). To show the above theorem, we give in the rest of the section a constructive proof, based on a simple greedy algorithm (Algorithm 1). This algorithm is inspired by the construction used in [15] (proof of Lemma 4.1) to construct small models for disjunctive Datalog rules. Here, we adapt this idea to a CQ and do a direct analysis.

The algorithm starts with an empty set K¯ and iterates over the tuples t∈Q(D) (in any order). For every tuple t∈Q(D), if there exists a tree decomposition (𝒯,χ)∈𝕋 such that t ∈ ⋈_{v∈V(𝒯)} π_{χ(v)}(K¯), we do nothing; otherwise, we add t to K¯. The algorithm terminates when we have visited all output tuples. To bound the running time of Algorithm 1, we can implement the existence check in line 3 in (amortized) constant time per tuple by maintaining a hash table on the projection π_B(K¯) for each bag B, and then simply checking whether for every v∈V(𝒯) we have t[χ(v)] ∈ π_{χ(v)}(K¯) (for example, cuckoo hashing [17] gives us constant-time access and amortized constant-time insertions). Hence, we have the following bound on the running time.

Lemma 11.

Algorithm 1 runs in time O(|Q(D)|).

Algorithm 1 Greedy Cover.
Input: output Q(D) of a full CQ Q, set of tree decompositions 𝕋
Output: cover of Q(D) w.r.t. 𝕋
1  K¯ ← ∅
2  foreach t ∈ Q(D) do
3    if there is no (𝒯,χ) ∈ 𝕋 such that t ∈ ⋈_{v∈V(𝒯)} π_{χ(v)}(K¯) then
4      K¯ ← K¯ ∪ {t}
5  return K¯
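Algorithm 1, including the hash-table implementation of the check in line 3, can be sketched in a few lines of Python (our illustration; tuples are indexed by 0-based variable position, and each decomposition is given as its list of bags).

```python
from collections import defaultdict

def greedy_cover(Q_D, decompositions):
    """A sketch of Algorithm 1 (Greedy Cover).  A tuple t is skipped as
    soon as some decomposition already generates it, i.e. every bag
    projection of t is present in the corresponding hash table."""
    cover = []
    seen = defaultdict(set)          # bag B -> projection of `cover` on B
    for t in Q_D:                    # any order works
        if any(all(tuple(t[i] for i in B) in seen[B] for B in bags)
               for bags in decompositions):
            continue                 # some decomposition generates t
        cover.append(t)
        for bags in decompositions:
            for B in bags:
                seen[B].add(tuple(t[i] for i in B))
    return cover
```

On the four output tuples of Example 14 (with the two 4-cycle decompositions), this sketch keeps t1, t2, t3 and skips t4, matching the example.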

By construction of K¯, we also obtain:

Lemma 12.

The output K¯ of Algorithm 1 is a cover of Q(D) w.r.t. 𝕋.

Note that if we run Algorithm 1 with two different sets of tree decompositions 𝕋 ⊆ 𝕋′, the cover produced for 𝕋′ will be at most as large as the one produced for 𝕋. However, the delay in the enumeration of the results from the cover will increase. Hence, we essentially have a tradeoff between the compression size (the size of the cover) and the time to decompress (the delay).

Let 𝔅𝕋 be the set of all maps β: 𝕋 → 2^[n] such that β(𝒯,χ) = χ(v) for some v∈V(𝒯). Such a map β is called a bag selector: it chooses a bag from each tree decomposition in 𝕋.
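Enumerating the bag selectors is simply taking the Cartesian product of the bag lists of the decompositions; the following small Python sketch (our illustration) returns each selector as the tuple of chosen bags. For the 4-cycle query, with two decompositions of two bags each, there are 2×2 = 4 selectors.

```python
from itertools import product

def bag_selectors(decompositions):
    """All bag selectors β: one choice of bag per decomposition.  Each
    selector is returned as the tuple of its chosen bags."""
    return list(product(*decompositions))

T1 = [(0, 1, 2), (0, 2, 3)]   # 4-cycle decomposition T1 (0-based positions)
T2 = [(0, 1, 3), (1, 2, 3)]   # 4-cycle decomposition T2
assert len(bag_selectors([T1, T2])) == 4
```

Note that |𝔅𝕋| depends only on the query, so it is a constant w.r.t. data complexity.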

Lemma 13.

The set K¯ can be partitioned as {K¯β}_{β∈𝔅𝕋} such that for any two tuples t, t′ ∈ K¯β it holds that t[B] ≠ t′[B] for every B∈𝗂𝗆𝖺𝗀𝖾(β).

Proof.

We will construct the sets K¯β following the construction of K¯ in Algorithm 1. Initially, all K¯β are empty. Consider the point where a new tuple t is added to K¯ in line 4. Then, for every decomposition (𝒯,χ)∈𝕋 there must exist some bag v∈V(𝒯) such that t[χ(v)] ∉ π_{χ(v)}(K¯) (if there is more than one such bag, we choose one arbitrarily). Consider the bag selector β whose image is exactly this set of bags, and add t to K¯β. It is easy to see that, for every previously added tuple t′∈K¯β and every B∈𝗂𝗆𝖺𝗀𝖾(β), we have t[B] ∉ πB(K¯) at the time of insertion, and thus it must be that t[B] ≠ t′[B].

Example 14.

We show how Algorithm 1 and the above lemma work using Example 5. Recall that we have two tree decompositions: T1 with bags {1,2,3}, {1,3,4} and T2 with bags {1,2,4}, {2,3,4}. There are four bag selectors, with images β1={123,124}, β2={123,234}, β3={134,124}, and β4={134,234}. To make the exposition simpler, suppose the instance produces only four output tuples:

t1 = (a₁¹, a₂, a₃¹, a₄),  t2 = (a₁¹, a₂, a₃², a₄),  t3 = (a₁², a₂, a₃¹, a₄),  t4 = (a₁², a₂, a₃², a₄)

K¯ and all K¯β are initially empty. After reading the first tuple, we can assign it to any selector, say K¯β1={t1}. The second tuple is also added to K¯; however, we will not add it to K¯β1 because t1, t2 agree on the bag 124. Thus, K¯β2={t2}. For t3, we add it to K¯β1, so K¯β1={t1,t3}. Finally, t4 will not be added, since it can be generated via decomposition T2.

Equipped with the above two lemmas, we can now prove an upper bound on the size of the cover constructed via algorithm 1.

Theorem 15.

If K¯ is the output of Algorithm 1, then |K¯| = O(2^{da-entw_𝕋(ℋ,𝖧𝖣𝖢)}).

Proof.

From Lemma 13, we have that ∑_{β∈𝔅𝕋} |K¯β| = |K¯|. Hence, there exists a set K¯β with size |K¯β| ≥ |K¯|/|𝔅𝕋| such that for any two tuples t, t′ ∈ K¯β it holds that t[B] ≠ t′[B] for every B∈𝗂𝗆𝖺𝗀𝖾(β). The set K¯β ⊆ Q(D) satisfies all degree constraints, since it is a subset of Q(D), which also satisfies them. Now, take the probability distribution over tuples on [n] that is uniform over K¯β, that is, each tuple in K¯β is chosen with the same probability. Let h¯ denote the entropic function of this distribution. Note that by construction, h¯ ∈ Γ¯n∩𝖧𝖣𝖢. Moreover, because the projections of the tuples of K¯β on each bag B∈𝗂𝗆𝖺𝗀𝖾(β) are pairwise distinct, log|K¯β| = h¯(B).

Now, consider any tree decomposition (𝒯,χ)∈𝕋; then, there exists a bag v∈V(𝒯) such that B = χ(v) ∈ 𝗂𝗆𝖺𝗀𝖾(β), and hence:

log|K¯β| = h¯(B) ≤ max_{v∈V(𝒯)} h¯(χ(v)).

Taking the min over all tree decompositions in 𝕋, we can write:

log|K¯β| ≤ min_{(𝒯,χ)∈𝕋} max_{v∈V(𝒯)} h¯(χ(v))
≤ max_{h∈Γ¯n∩𝖧𝖣𝖢} min_{(𝒯,χ)∈𝕋} max_{v∈V(𝒯)} h(χ(v))
= da-entw_𝕋(ℋ, 𝖧𝖣𝖢).

The bound in the statement follows from the fact that |K¯β| ≥ |K¯|/|𝔅𝕋|.

Corollary 16.

Given a full CQ Q with hypergraph ℋ and a database instance D, there exists a cover of Q(D) w.r.t. 𝖳𝖣(ℋ) of size O(|D|^{entw(ℋ)}).

Example 17.

Continuing our running example for the 4-cycle query, we know that for this query entw = subw = 3/2. Hence, Algorithm 1 constructs a cover of size at most O(|D|^{3/2}); for this, it suffices to use as 𝕋 the only two non-redundant tree decompositions.

Combining the above corollary with Theorem 8, we obtain that we can construct a data structure of size only O(|D|^{entw}) that provides a constant-delay enumeration guarantee for the output Q(D). Note that the best-known such data structure has size O(|D|^{subw}), where subw is the submodular width of the query. However, it holds that entw ≤ subw [14].³ (³It is not known whether there is some query where there is a gap between the two width measures.) Of course, we do not know whether we can construct this data structure in time O(|D|^{entw}), since our algorithm needs time linear w.r.t. the output size, which can be asymptotically larger. However, as the next proposition shows, we can construct a (potentially larger) cover in time and size O(|D|^{subw}) using the PANDA algorithm as a blackbox.

Proposition 18.

Given a full CQ Q with hypergraph ℋ and a database instance D, there exists a cover of Q(D) w.r.t. 𝖳𝖣(ℋ) of size O(|D|^{subw(ℋ)}) that can be constructed in time Õ(|D|^{subw(ℋ)}).⁴ (⁴The notation Õ allows for a polylogarithmic factor in the input size |D|.)

6 Lower Bounds for Covers

To prove the lower bound, we will relate the size of a cover to the size of a model of a disjunctive Datalog rule, following a similar argument to the one used in [9] to connect the size of a semiring circuit to disjunctive Datalog. Following the definition in [15], a disjunctive Datalog rule over a hypergraph ℋ=([n],ℰ) is an expression of the form:

P: ⋁_{B∈ℬ} TB(𝐱B) ← ⋀_{e∈ℰ} Re(𝐱e)

where B⊆[n] for every B∈ℬ and ℬ⊆2^[n]. Each output relation TB in the head of the rule is called a target. Given an instance D, a model of a disjunctive Datalog rule is a tuple 𝐓=(TB)_{B∈ℬ} of relation instances such that for any tuple t over schema [n], if t[e]∈Re for all e∈ℰ, then there exists a target TB such that t[B]∈TB. The size of a model is defined to be max_B |TB|, and the output size |P(D)| of a disjunctive Datalog rule over an instance D is the minimum size over all models of the rule.
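The model condition is easy to check on small examples. The following Python sketch (our illustration; targets are maps from a bag, given as a tuple of 0-based positions, to a set of projected tuples) verifies it for the tuples satisfying the body of the rule.

```python
def is_model(targets, satisfying_tuples):
    """Model condition for a disjunctive rule: every tuple t that
    satisfies the body (for a full CQ body, every t in Q(D)) must have
    some target bag B with t[B] in T_B."""
    return all(any(tuple(t[i] for i in B) in T_B
                   for B, T_B in targets.items())
               for t in satisfying_tuples)

def model_size(targets):
    """The size of a model is the maximum target size."""
    return max(len(T_B) for T_B in targets.values())
```

For example, over the satisfying tuples {(0,0,0), (1,1,1)}, the targets T_{01} = {(0,0)} and T_{12} = {(1,1)} form a model of size 1, while T_{01} alone does not.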

Lemma 19.

Let Q be a full CQ, D be an instance, and 𝕋 be a finite set of tree decompositions of Q. Suppose K is a cover of Q(D) w.r.t. 𝕋, and let β be any bag selector for 𝕋. Consider the disjunctive rule

Pβ: ⋁_{B∈𝗂𝗆𝖺𝗀𝖾(β)} TB(𝐱B) ← ⋀_{e∈ℰ} Re(𝐱e).

Then, |K| ≥ |Pβ(D)|.

Proof.

We will construct a model of Pβ from the cover K by taking TB := πB(K) for every B∈𝗂𝗆𝖺𝗀𝖾(β). Clearly, |Pβ(D)| ≤ max_B |TB| ≤ |K|. To show that {TB}_{B∈𝗂𝗆𝖺𝗀𝖾(β)} is a model for Pβ, consider any tuple t∈Q(D). Then, since K is a cover, there exists a tree decomposition (𝒯,χ)∈𝕋 such that t ∈ ⋈_{v∈V(𝒯)} π_{χ(v)}(K). Since β is a bag selector, there exists B∈𝗂𝗆𝖺𝗀𝖾(β) for which β(𝒯,χ)=B, and for this bag, t[B] ∈ πB(K) = TB.
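The construction in this proof is one line of code. The Python sketch below (our illustration, reusing the 0-based positional encoding and the cover {t1, t2, t3} from Example 14) builds the targets T_B := π_B(K) for a selector image.

```python
def model_from_cover(K, selector_image):
    """Lemma 19's construction: T_B := π_B(K) for every bag B in
    image(β).  If K is a cover w.r.t. 𝕋 and β picks one bag per
    decomposition, this yields a model of size at most |K|."""
    return {B: {tuple(t[i] for i in B) for t in K} for B in selector_image}

K = {("a1_1", "a2", "a3_1", "a4"),
     ("a1_1", "a2", "a3_2", "a4"),
     ("a1_2", "a2", "a3_1", "a4")}
T = model_from_cover(K, [(0, 1, 2), (0, 1, 3)])
# The output tuple excluded from the cover is still covered via bag (0,1,3):
assert ("a1_2", "a2", "a4") in T[(0, 1, 3)]
assert max(len(v) for v in T.values()) <= len(K)
```

Every target has at most |K| tuples, which is exactly the inequality |Pβ(D)| ≤ |K| used in the proof.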

Example 20.

We continue our running example for the 4-cycle query with 𝕋={T1,T2}. Consider the bag selector β where we choose bag {1,2,3} from decomposition T1 and bag {1,2,4} from decomposition T2. Then, the following disjunctive Datalog rule is formed:

Pβ: T123(x1,x2,x3) ∨ T124(x1,x2,x4) ← R1(x1,x2), R2(x2,x3), R3(x3,x4), R4(x4,x1).

Lemma 19 tells us that any lower bound on the size of a model for Pβ is also a lower bound on the size of any cover w.r.t. 𝕋. Note that the rule that has only target T123 (or only T124) does not transfer its lower bound, since the disjunction in the head needs to include a bag from every tree decomposition in 𝕋.

Theorem 21.

Let Q be a full CQ, and 𝕋 be a finite set of tree decompositions of Q. Consider a set of constraints 𝖧𝖣𝖢. Then, for any ϵ>0 there exists a scale factor k and an instance D that satisfies the constraints 𝖧𝖣𝖢×k such that every cover K of Q(D) w.r.t. 𝕋 satisfies log|K| ≥ (1−ϵ)·da-entw_𝕋(ℋ, 𝖧𝖣𝖢×k).

Proof.

We will use the lower bound construction for the output of a disjunctive rule [15] and follow the proof of Theorem 3.7 in [9]. Let 𝐁𝕋 be the collection of images of all β∈𝔅𝕋. Then, we can bound da-entw_𝕋(ℋ, 𝖧𝖣𝖢×k) as follows:

da-entw_𝕋(ℋ, 𝖧𝖣𝖢×k) ≤ max_{𝐁∈𝐁𝕋} max_{h∈Γ¯n∩𝖧𝖣𝖢×k} min_{B∈𝐁} h(B)

For a bag selector β, consider the disjunctive rule Pβ as constructed in Lemma 19. We know from Lemma 4.4 in [15] that for any ϵ>0 there exists an integer kβ>0 and an instance Iβ such that

log|Pβ(Iβ)| ≥ (1−ϵ)·max_{h∈Γ¯n∩𝖧𝖣𝖢×kβ} min_{B∈𝗂𝗆𝖺𝗀𝖾(β)} h(B).

Consider now the k0 that maximizes kβ over all bag selectors in 𝔅𝕋, and let β0 be the bag selector that maximizes the right-hand side of the first inequality. Then,

log|Pβ0(Iβ0)| ≥ (1−ϵ)·da-entw_𝕋(ℋ, 𝖧𝖣𝖢×k0).

To conclude the proof, observe that from Lemma 19 we have that, for any cover K of Q(Iβ0) w.r.t. 𝕋, log|K| ≥ log|Pβ0(Iβ0)|.

This lower bound matches (asymptotically) the upper bound of the previous section. When the degree constraints correspond to a uniform cardinality constraint N on each relation and 𝕋 is the set of all non-redundant tree decompositions, then we obtain from Theorem 21 that there exists an instance where any cover has size Ω(N^{entw}).

7 From Covers to Semiring Circuits

In this section, we will show how we can go from a cover of a query result to a semiring circuit. Before we present the main result, we need to introduce some further terminology.

A semiring 𝕊=(𝐃,⊕,⊗,𝟎,𝟏) is a structure where ⊕ is the addition and ⊗ is the multiplication, such that (i) (𝐃,⊕,𝟎) and (𝐃,⊗,𝟏) are commutative monoids, (ii) multiplication is distributive over addition, and (iii) x⊗𝟎=𝟎 for every element x∈𝐃. The semiring is idempotent if x⊕x=x for every element x∈𝐃. An example of an idempotent semiring is the tropical semiring (ℝ∪{+∞}, min, +, +∞, 0).
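To make the semiring axioms concrete, here is a minimal Python sketch (our own illustration) of the tropical semiring, checking idempotence and the annihilation and identity laws.

```python
import math

class TropicalSemiring:
    """The (min, +) tropical semiring: ⊕ = min, ⊗ = +, 0̄ = +∞, 1̄ = 0.
    It is idempotent since min(x, x) = x."""
    zero = math.inf   # additive identity 0̄, absorbing for ⊗
    one = 0.0         # multiplicative identity 1̄

    @staticmethod
    def add(x, y):    # ⊕
        return min(x, y)

    @staticmethod
    def mul(x, y):    # ⊗
        return x + y

S = TropicalSemiring
assert S.add(3, 3) == 3               # idempotence: x ⊕ x = x
assert S.mul(5, S.zero) == math.inf   # x ⊗ 0̄ = 0̄
assert S.mul(5, S.one) == 5           # 1̄ is the multiplicative identity
```

Under this semiring, the sum-product polynomial defined below computes the minimum-weight witness of each output tuple.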

Given a full CQ with hypergraph ℋ, an instance D, and a semiring 𝕊, we add to each input tuple t from relation R_e an annotation x_t^e: this annotation can be thought of as a variable that takes values from the domain 𝐃 of the semiring 𝕊. For example, if 𝕊 is the Boolean semiring ({0,1}, ∨, ∧, 0, 1), the annotation captures the presence or absence of the tuple t. As another example, if 𝕊 is the counting semiring (ℕ, +, ×, 0, 1), the annotation captures the multiplicity of the tuple t. We can now define the sum-product polynomial as:

p_D := ⨁_{t ∈ Q(D)} ⨂_{e ∈ ℋ} x^e_{t[e]}.

The sum-product polynomial captures different types of provenance of the query, depending on the semiring we use to interpret it [10]. When 𝕊 is the Boolean semiring, the polynomial captures Boolean provenance, while if it is the counting semiring, it encodes how-provenance. Why-provenance can also be captured in this framework.
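As a toy example, the sum-product polynomial of the two-atom join Q(x, y, z) = R(x, y), S(y, z) can be evaluated by direct enumeration. The relations, annotations, and the helper `sum_product` below are made up for illustration; they are not a construction from the paper.

```python
# Evaluates p_D = ⊕_{t ∈ Q(D)} ⊗_e x^e_{t[e]} for R(x, y) ⋈ S(y, z) by brute force.
def sum_product(R, S, add, mul, zero):
    total = zero
    for (x, y), a in R.items():          # a = annotation of the R-tuple
        for (y2, z), b in S.items():     # b = annotation of the S-tuple
            if y == y2:                  # (x, y, z) is an output tuple of the join
                total = add(total, mul(a, b))
    return total

R = {(1, 2): True, (1, 3): False}
S = {(2, 4): True, (3, 5): True}

# Boolean semiring ({0,1}, ∨, ∧, 0, 1): is some output tuple truly present?
assert sum_product(R, S, lambda a, b: a or b, lambda a, b: a and b, False)
```

Swapping in `+`/`*` with zero `0` interprets the same polynomial over the counting semiring.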

We can concisely represent the sum-product polynomial using circuits. For our purposes, a circuit F over 𝕊 is a directed acyclic graph whose input nodes are the variables x^e_{t[e]} and the constants 𝟎, 𝟏. Every other node is labelled by a semiring operation (⊕ or ⊗) and has fan-in 2. An input gate of F is any gate with fan-in 0, and the output gate of F is the unique gate with fan-out 0. (In general, we could define a circuit to have multiple output gates, but for the purposes of this paper it suffices to consider circuits with a unique output gate.) The size of the circuit, denoted |F|, is the number of gates in the circuit. If we have a circuit F that computes the polynomial p_D, then we can evaluate p_D for any input values in time only O(|F|) by evaluating the circuit in a bottom-up fashion.
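The bottom-up evaluation can be sketched as follows; the list-of-gates encoding (gates given in topological order, referring back to earlier gates by index) is an assumption made for the example, not a construction from the paper.

```python
# Bottom-up evaluation of a semiring circuit given in topological order.
# Gate formats: ('var', name) | ('const', value) | ('add', i, j) | ('mul', i, j),
# where i, j index earlier gates. Runs in O(|F|): each gate is visited once.
def eval_circuit(gates, env, add, mul):
    val = []
    for g in gates:
        if g[0] == 'var':
            val.append(env[g[1]])              # input gate: look up variable
        elif g[0] == 'const':
            val.append(g[1])                   # constant gate (𝟎 or 𝟏)
        elif g[0] == 'add':
            val.append(add(val[g[1]], val[g[2]]))  # ⊕-gate, fan-in 2
        else:
            val.append(mul(val[g[1]], val[g[2]]))  # ⊗-gate, fan-in 2
    return val[-1]                             # unique output gate

# (x1 ⊗ x2) ⊕ x3 under the tropical semiring (min, +): min(1 + 2, 5) = 3
gates = [('var', 'x1'), ('var', 'x2'), ('var', 'x3'), ('mul', 0, 1), ('add', 3, 2)]
assert eval_circuit(gates, {'x1': 1, 'x2': 2, 'x3': 5}, min, lambda a, b: a + b) == 3
```

The same circuit evaluated over the Boolean semiring computes (x1 ∧ x2) ∨ x3.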

Our main result in this section shows a tight connection between covers and circuits. In particular, we show how to use a cover K to construct a semiring circuit of size O(|K|). Hence, small covers lead to small circuits.

Theorem 22.

Let Q be a full CQ with hypergraph , D be an instance, and 𝕋 be a finite set of tree decompositions of Q. Let K be a cover of Q(D) w.r.t. 𝕋. Then, in time O(|K|) we can construct a semiring circuit of size O(|K|) for the polynomial pD under any idempotent semiring.

Proof.

For a decomposition T = (𝒯, χ) ∈ 𝕋, we will first augment it with a mapping μ_T : ℰ(ℋ) → V(𝒯); this mapping assigns each hyperedge e of ℋ to a node v of the tree decomposition such that e ⊆ χ(v) (such a node always exists by the definition of a tree decomposition). For a node v ∈ V(𝒯) and a tuple t, we can now define a new variable

y^v_t := ⨂_{e : μ_T(e) = v} x^e_{t[e]}

In other words, the new variable corresponds to taking the semiring product of the annotations of the input tuples for the hyperedges assigned to the node v. If no hyperedge is assigned to v, then y^v_t := 𝟏. Any variable y^v_t can always be encoded via a constant-size subcircuit that computes a constant-size semiring product.

Using the definition of a cover, we can now write:

p_D = ⨁_{t ∈ Q(D)} ⨂_{e ∈ ℋ} x^e_{t[e]}
    = ⨁_{t ∈ ⋃_{(𝒯,χ) ∈ 𝕋} ⋈_{v ∈ V(𝒯)}(π_{χ(v)}K)} ⨂_{e ∈ ℋ} x^e_{t[e]}    (by definition of a cover)
    = ⨁_{(𝒯,χ) ∈ 𝕋} ( ⨁_{t ∈ ⋈_{v ∈ V(𝒯)}(π_{χ(v)}K)} ⨂_{e ∈ ℋ} x^e_{t[e]} )    (by the idempotence of ⊕)
    = ⨁_{(𝒯,χ) ∈ 𝕋} ( ⨁_{t ∈ ⋈_{v ∈ V(𝒯)}(π_{χ(v)}K)} ⨂_{v ∈ V(𝒯)} ⨂_{e : μ_T(e) = v} x^e_{t[e]} )    (by commutativity of ⊗)
    = ⨁_{(𝒯,χ) ∈ 𝕋} ( ⨁_{t ∈ ⋈_{v ∈ V(𝒯)}(π_{χ(v)}K)} ⨂_{v ∈ V(𝒯)} y^v_t )    (by definition of y^v_t)

At this point, we can see that it suffices to build a subcircuit for each tree decomposition (𝒯,χ)𝕋 and then sum their output gates to obtain the final output. Since |𝕋| is data-independent, the number of these sum-gates is a constant. Hence, our problem reduces to constructing a circuit for the polynomial

p_T := ⨁_{t ∈ ⋈_{v ∈ V(𝒯)}(π_{χ(v)}K)} ⨂_{v ∈ V(𝒯)} y^v_t.

This corresponds to constructing a circuit for the polynomial of an acyclic query with body ⋈_{v ∈ V(𝒯)} S_{χ(v)}(𝐱_{χ(v)}), where each input relation S_{χ(v)} is obtained by computing the projection π_{χ(v)}K of the cover. Any such projection can be computed in linear time O(|K|). Additionally, from Theorem 4.1 in [9], we know how to construct this circuit in time linear in the size of its input, which is O(|K|). Moreover, the size of the circuit is bounded by the input size, and thus is also O(|K|).
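The linear-time projection step π_{χ(v)}K used in the proof above can be sketched as follows; the dict-based tuple encoding and the helper name `project_cover` are ours, for illustration only.

```python
# Computes one projected relation π_bag(K) per bag, each in one pass over K.
def project_cover(K, bags):
    """K: list of dicts (the tuples of the cover); bags: list of attribute tuples.
    Returns a set of projected tuples per bag; duplicates collapse automatically."""
    return {bag: {tuple(t[a] for a in bag) for t in K} for bag in bags}

K = [{'x': 1, 'y': 2, 'z': 3}, {'x': 1, 'y': 2, 'z': 4}]
S = project_cover(K, [('x', 'y'), ('y', 'z')])
assert S[('x', 'y')] == {(1, 2)}           # the two cover tuples agree on (x, y)
assert S[('y', 'z')] == {(2, 3), (2, 4)}
```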

 Remark 23.

The idempotence of the semiring is a critical requirement for the above theorem, since otherwise the proof breaks. Indeed, if the same output tuple is produced by multiple tree decompositions, then splitting the outer sum ⊕ across decompositions would add that tuple's annotation more than once. Thus, we cannot use a cover to construct a semiring circuit for, say, the counting semiring (ℕ, +, ×, 0, 1), which is not idempotent.
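A tiny numeric sketch of this failure mode, with made-up values: a single output tuple produced by two decompositions has its annotation combined once per decomposition when the outer sum is split.

```python
# One output tuple t with annotation 5, produced by both of two decompositions.
annotation = 5
produced_by = 2

# Splitting ⊕ across decompositions combines t's annotation once per decomposition.
split_plus = sum(annotation for _ in range(produced_by))  # counting semiring: ⊕ is +
split_max = max(annotation for _ in range(produced_by))   # an idempotent ⊕, e.g. max

assert split_plus == 10   # overcounts: + is not idempotent
assert split_max == 5     # still correct: max(x, x) = x
```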

8 Conclusion

In this paper, we have shown that by generalizing covers to depend on multiple tree decompositions, we can construct covers of asymptotically smaller size. We have also provided matching worst-case lower bounds for general degree constraints.

Several research questions on covers remain open. A first question is how efficiently we can construct a small cover of size O(|D|^𝖾𝗇𝗍𝗐). Our current construction takes time O(|Q(D)|), which can be as large as Ω(|D|^ρ). It would be interesting to explore whether it is possible to reduce this time down to O(|D|^𝗌𝗎𝖻𝗐) using the PANDA algorithm as a black box.

A second direction is to consider algorithms that compute the instance-optimal cover (instead of the worst-case optimal). It is likely that this becomes a computationally hard problem – especially when we consider multiple tree decompositions – however, there might be cases where the problem of finding the smallest cover is tractable.

Finally, it is possible to consider a variant of a generalized cover where the outputs of the decompositions are pairwise disjoint. In other words, we want Q(D) to be partitioned into {I_i}_{i ∈ 𝕋} such that each I_i has a cover K_i w.r.t. the decomposition i. Such a “disjoint cover” {K_i}_{i ∈ 𝕋} would allow us to construct a semiring circuit for any semiring, even a non-idempotent one. The current greedy algorithm, however, does not come with a disjointness guarantee, and thus it is not clear how we can construct disjoint covers of small size.

References

  • [1] Antoine Amarilli and Florent Capelli. Tractable circuits in database theory. SIGMOD Rec., 53(2):6–20, 2024. doi:10.1145/3685980.3685982.
  • [2] Albert Atserias, Martin Grohe, and Dániel Marx. Size bounds and query plans for relational joins. In FOCS, pages 739–748. IEEE Computer Society, 2008. doi:10.1109/FOCS.2008.43.
  • [3] Guillaume Bagan, Arnaud Durand, and Etienne Grandjean. On acyclic conjunctive queries and constant delay enumeration. In CSL, volume 4646 of Lecture Notes in Computer Science, pages 208–222. Springer, 2007. doi:10.1007/978-3-540-74915-8_18.
  • [4] Christoph Berkholz and Nicole Schweikardt. Constant delay enumeration with fpt-preprocessing for conjunctive queries of bounded submodular width. In MFCS, volume 138 of LIPIcs, pages 58:1–58:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPICS.MFCS.2019.58.
  • [5] Nofar Carmeli and Markus Kröll. On the enumeration complexity of unions of conjunctive queries. ACM Trans. Database Syst., 46(2):5:1–5:41, 2021. doi:10.1145/3450263.
  • [6] Shaleen Deep, Xiao Hu, and Paraschos Koutris. General space-time tradeoffs via relational queries. In WADS, volume 14079 of Lecture Notes in Computer Science, pages 309–325. Springer, 2023. doi:10.1007/978-3-031-38906-1_21.
  • [7] Shaleen Deep and Paraschos Koutris. Compressed representations of conjunctive query results. In PODS, pages 307–322. ACM, 2018. doi:10.1145/3196959.3196979.
  • [8] Arnaud Durand and Yann Strozecki. Enumeration complexity of logical query problems with second-order variables. In CSL, volume 12 of LIPIcs, pages 189–202. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2011. doi:10.4230/LIPICS.CSL.2011.189.
  • [9] Austen Z. Fan, Paraschos Koutris, and Hangdong Zhao. Tight bounds of circuits for sum-product queries. Proc. ACM Manag. Data, 2(2):87, 2024. doi:10.1145/3651588.
  • [10] Todd J. Green, Gregory Karvounarakis, and Val Tannen. Provenance semirings. In PODS, pages 31–40. ACM, 2007. doi:10.1145/1265530.1265535.
  • [11] Ahmet Kara, Milos Nikolic, Dan Olteanu, and Haozhe Zhang. Trade-offs in static and dynamic evaluation of hierarchical queries. In PODS, pages 375–392. ACM, 2020. doi:10.1145/3375395.3387646.
  • [12] Ahmet Kara, Milos Nikolic, Dan Olteanu, and Haozhe Zhang. Evaluation trade-offs for acyclic conjunctive queries. In CSL, volume 252 of LIPIcs, pages 29:1–29:20. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.CSL.2023.29.
  • [13] Ahmet Kara and Dan Olteanu. Covers of query results. In ICDT, volume 98 of LIPIcs, pages 16:1–16:22. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2018. doi:10.4230/LIPICS.ICDT.2018.16.
  • [14] Mahmoud Abo Khamis, Hung Q. Ngo, and Dan Suciu. What do Shannon-type inequalities, submodular width, and disjunctive datalog have to do with one another? In PODS, pages 429–444. ACM, 2017. doi:10.1145/3034786.3056105.
  • [15] Mahmoud Abo Khamis, Hung Q. Ngo, and Dan Suciu. What do shannon-type inequalities, submodular width, and disjunctive datalog have to do with one another?, 2023. arXiv:1612.02503.
  • [16] Dan Olteanu and Jakub Závodný. Size bounds for factorised representations of query results. ACM Trans. Database Syst., 40(1):2:1–2:44, 2015. doi:10.1145/2656335.
  • [17] Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing. In ESA, volume 2161 of Lecture Notes in Computer Science, pages 121–133. Springer, 2001. doi:10.1007/3-540-44676-1_10.
  • [18] Maximilian Schleich, Dan Olteanu, and Radu Ciucanu. Learning linear regression models over factorized joins. In SIGMOD Conference, pages 3–18. ACM, 2016. doi:10.1145/2882903.2882939.
  • [19] Hangdong Zhao, Shaleen Deep, and Paraschos Koutris. Space-time tradeoffs for conjunctive queries with access patterns. In PODS, pages 59–68. ACM, 2023. doi:10.1145/3584372.3588675.

Appendix A Appendix

We show here the proof of Proposition 18.

Proof.

Consider 𝐁_𝖳𝖣 and, for each 𝐁 ∈ 𝐁_𝖳𝖣, the disjunctive Datalog rule

P : ⋁_{B ∈ 𝐁} T_B(𝐱_B) ← ⋀_{e ∈ ℋ} R_e(𝐱_e)

From the proof of Proposition 7.13 in [15], we can construct in time Õ(|D|^𝗌𝗎𝖻𝗐(ℋ)) a model (T_B)_{B ∈ 𝐁} of size O(|D|^𝗌𝗎𝖻𝗐(ℋ)) using the PANDA algorithm.

Next, for each bag χ(v) of every tree decomposition (𝒯, χ) ∈ 𝖳𝖣, we construct a table S_{χ(v)}(𝐱_{χ(v)}) (with schema 𝐱_{χ(v)}) as follows: (1) take the union of the output tables T_{χ(v)}(𝐱_{χ(v)}) over every disjunctive Datalog rule defined by some 𝐁 = 𝗂𝗆𝖺𝗀𝖾(β) ∈ 𝐁_𝖳𝖣 such that the bag selector β selects this bag χ(v) from (𝒯, χ); and (2) semijoin-reduce the union by every input relation R_e(𝐱_e), to prune off tuples that do not contribute to the output. It is easy to verify that S_{χ(v)}(𝐱_{χ(v)}) has size O(|D|^𝗌𝗎𝖻𝗐(ℋ)).
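The semijoin-reduce step can be sketched as follows; the helper `semijoin_reduce` and the positional tuple encoding are our assumptions, not from the paper.

```python
# Semijoin reduction S ⋉ R: keep only the tuples of S that agree with some
# tuple of R on the attributes shared by the two schemas.
def semijoin_reduce(S, S_attrs, R, R_attrs):
    """S, R: sets of tuples over schemas S_attrs, R_attrs (attribute-name tuples)."""
    shared = [a for a in S_attrs if a in R_attrs]
    # index R on the shared attributes once, then filter S in one pass
    r_keys = {tuple(t[R_attrs.index(a)] for a in shared) for t in R}
    return {t for t in S if tuple(t[S_attrs.index(a)] for a in shared) in r_keys}

S = {(1, 'a'), (2, 'b'), (3, 'c')}
R = {(1, 10), (3, 30)}
# shared attribute 'x': the tuple with x = 2 does not join with R and is pruned
assert semijoin_reduce(S, ('x', 'y'), R, ('x', 'z')) == {(1, 'a'), (3, 'c')}
```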

By Corollary 7.13 in [15], the query result can be written as the following union of CQs, where we have one CQ per tree decomposition:

Q(D) = ⋃_{(𝒯,χ) ∈ 𝖳𝖣} ⋈_{v ∈ V(𝒯)} S_{χ(v)}(𝐱_{χ(v)})

Let Q_{(𝒯,χ)} := ⋈_{v ∈ V(𝒯)} S_{χ(v)}(𝐱_{χ(v)}). Note that this corresponds to computing an acyclic join over relations with sizes bounded by O(|D|^𝗌𝗎𝖻𝗐(ℋ)). We can now use the standard construction of a cover (Lemma 23 in [13]) to construct in time Õ(|D|^𝗌𝗎𝖻𝗐(ℋ)) a cover K_{(𝒯,χ)} of Q_{(𝒯,χ)} of size O(|D|^𝗌𝗎𝖻𝗐(ℋ)). Finally, to obtain the cover K we simply take the union of all these covers, i.e., K := ⋃_{(𝒯,χ) ∈ 𝖳𝖣} K_{(𝒯,χ)}.