
Generalized Covers for Conjunctive Queries

Paraschos Koutris, University of Wisconsin-Madison, WI, USA
Abstract

Covers of query results were introduced as succinct lossless representations of join query outputs. A cover is a subset of the query result from which we can efficiently enumerate the output with constant delay and linear preprocessing time. However, covers are dependent on a single tree decomposition of the query. In this work, we generalize the notion of a cover to a set of multiple tree decompositions. We show that this generalization can potentially produce asymptotically smaller covers while maintaining the properties of constant-delay enumeration and linear preprocessing time. In particular, given a set of tree decompositions, we can determine exactly the asymptotic size of a minimum cover, which is tied to the notion of entropic width of the query. We also provide a simple greedy algorithm that computes this cover efficiently. Finally, we relate covers to semiring circuits when the semiring is idempotent.

Keywords and phrases:
Conjunctive Query, tree decomposition, cover
Copyright and License:
© Paraschos Koutris; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Database theory
Editors:
Sudeepa Roy and Ahmet Kara

1 Introduction

Join queries are one of the fundamental operations in relational databases. However, the output of a join query can be potentially huge when executed over a large database instance. This is particularly relevant when a join output needs to be sent over a network link in a distributed setting, or if the output result needs to be stored before being processed in a downstream task. In such scenarios, it is often desirable to construct concise representations of the query output such that we can reconstruct the query result as efficiently as possible when and where it is needed. Several succinct data structures have been proposed in the literature, including factorized databases [16] and compressed representations [7].

A cover is one such succinct and lossless representation of a join output, introduced by Kara and Olteanu [13]. A cover is simply a subset of the query result that, together with a given tree decomposition of the query, can efficiently reconstruct the full query output. Here, efficient means that after a preprocessing step that is linear in the size of the representation, we can enumerate the output with a constant delay guarantee. One of the key results in [13] is that, given a tree decomposition T of the query and a database instance of size N, we can always produce a cover of size O(N^𝖿𝗁𝗐(T)), where 𝖿𝗁𝗐(T) is the fractional hypertree width of T. Thus, choosing a tree decomposition of minimum width gives a cover of size O(N^𝖿𝗁𝗐), which can be shown to be asymptotically optimal. This bound is a large improvement over storing the full output, which can be as large as the AGM bound [2], i.e., Ω(N^ρ), where ρ is the fractional edge cover number of the query.

The main insight of this work is that it is possible to construct covers of even smaller asymptotic size if we define the cover to depend not on a single decomposition, but on multiple tree decompositions. This is analogous to the PANDA algorithm [14], which improves the runtime of query evaluation from O(N^𝖿𝗁𝗐) to O(N^𝗌𝗎𝖻𝗐) by considering multiple tree decompositions simultaneously. Here, 𝗌𝗎𝖻𝗐 is the submodular width of the query. Importantly, even though we allow a dependence on multiple tree decompositions, we can still show that the guarantee of linear-time preprocessing with constant-delay enumeration remains unchanged. Thus, this generalized notion of a cover produces an even more succinct and efficient representation of the query output.

In fact, we will show in this paper that by considering the (finite) set of all possible tree decompositions, we can construct covers of size O(N^𝖾𝗇𝗍𝗐), where 𝖾𝗇𝗍𝗐 is the entropic width of the query, a width measure that is at least as small as the submodular width. As a consequence, we show the existence of a data structure of size O(N^𝖾𝗇𝗍𝗐) from which we can enumerate the query output with constant delay. We should note here that, unfortunately, the time to construct this data structure can be much larger than O(N^𝖾𝗇𝗍𝗐), and so the cover construction does not yield a faster query evaluation algorithm.

From a theoretical point of view, this makes progress on the following fundamental question: what is the smallest possible compression of a query result that still guarantees constant-delay enumeration (i.e., fast decompression)? From a practical point of view, the enumeration guarantee depends on the number of tree decompositions, which can be exponentially large in the query. However, we show that as we add more decompositions the size of the cover can only improve; hence, even a small set of decompositions can lead to a large improvement in the cover size compared to considering only a single one. The algorithm we present that constructs a small cover – even though it needs access to the full query result – requires only a linear scan over the query result.

Our Contribution.

We summarize our contributions as follows:

  • We generalize the notion of a cover of a query result to consider multiple tree decompositions (Section 4). We prove several interesting properties for covers, and show that given a cover K we can spend O(|K|) preprocessing time to enumerate the query output with constant delay.

  • We present a simple greedy algorithm (Section 5) that, given the query output, produces an asymptotically optimal cover in time linear w.r.t. the query output. Moreover, we show an upper bound that uses not only the input size N as a parameter, but any combination of degree constraints, including different cardinalities, functional dependencies, and degree bounds.

  • We provide a lower bound (Section 6) that asymptotically matches our upper bound. The lower bound uses a technically interesting connection to disjunctive Datalog rules to find a worst-case instance for a cover.

  • Finally, we show in Section 7 how we can use a cover to produce a semiring circuit, of size linear in the cover size, for the provenance polynomial of the query over any idempotent semiring. A semiring circuit can be thought of as a factorization of the query output (more precisely, a factorization of the polynomial associated with the query output), with the additional flexibility that it can use different tree decompositions in its internal representation.

2 Related Work

In this section, we present work that relates to covers and in general efficient representations of query results.

Factorized Databases and Circuits.

As shown in [13], covers over a single tree decomposition are related to d-representations over the same tree decomposition. d-representations (and f-representations) are also lossless representations of query outputs, which allow for efficient enumeration, aggregation, and even ML model computation [18]. Query outputs can also be concisely stored using circuits [9, 1] – these circuits represent the polynomial corresponding to the join query and can be evaluated under different semiring semantics. As we will show later in the paper, analogous to the connection between covers and d-representations, there is also a direct connection between generalized covers and semiring circuits.

Compressed Representations of Outputs.

Recent work has also looked at how we can further decrease the size of a representation by trading off the enumeration delay guarantee [7, 19, 6]. Such a tradeoff allows for asymptotically smaller succinct data structures, but with an increased possible delay when outputting two consecutive outputs. Even though it would be theoretically possible to consider covers with weaker enumeration guarantees, in this work we focus only on constant delay.

Preprocessing Time and Enumeration.

Several algorithms on join computation use a two-phase framework where a preprocessing phase is followed by an output enumeration phase [3]. For instance, for any join query with fractional hypertree width 𝖿𝗁𝗐, we need O(N^𝖿𝗁𝗐) preprocessing time to achieve constant-delay enumeration. The intermediate data structure constructed by the preprocessing phase can also be viewed as a succinct representation of the result; however, it is critical in this case to also have an efficient algorithm to construct the data structure. Recent work has looked at preprocessing time/size and enumeration tradeoffs in this setting [11, 12]. In our paper, even though we construct a very small succinct representation, the construction time can exceed the size of the representation, and thus our result does not lead to a better query evaluation algorithm.

3 Preliminaries

In this section, we present useful notation and terminology.

Conjunctive Queries.

A Conjunctive Query (CQ) Q is an expression associated with a hypergraph ℋ=([n],ℰ), where [n]={1,…,n}, and a set U⊆[n]:

Q(𝐱U) ← ⋀_{e∈ℰ} Re(𝐱e)

where each Re is a relation of arity |e|, the variables x1, x2, …, xn take values in some discrete domain, and 𝐱e := (xi)_{i∈e}. We say that Q is Boolean if U=∅ and full if U=[n].

Given a tuple t over [n] and a subset S⊆[n], we will use t[S] to denote the projection of the tuple on the attributes in S.

Tree Decompositions.

We recall the notion of a tree decomposition.

Definition 1 (Tree Decomposition).

A tree decomposition of a hypergraph ℋ is a pair (𝒯,χ), where 𝒯 is a tree and χ maps each node t of 𝒯 to a subset χ(t) of V(ℋ), called a bag, such that:

  1. every hyperedge e∈E(ℋ) is a subset of χ(t) for some t∈V(𝒯); and

  2. for every vertex v∈V(ℋ), the set {t ∣ v∈χ(t)} is a non-empty connected subtree of 𝒯.

We say that a tree decomposition is non-redundant if no bag is a subset of another; otherwise it is redundant. We let 𝖳𝖣(ℋ) be the set of all non-redundant tree decompositions of ℋ. This set is known to be finite; it can be shown that |𝖳𝖣(ℋ)| ≤ n!, where n is the number of vertices of the hypergraph (Proposition 2.9 in [15]).
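The two conditions of Definition 1, as well as non-redundancy, can be checked mechanically. The following Python sketch is our own illustration (the function names are ours, not from the paper); it represents a decomposition by its tree edges and a bag map, and checks it against the 4-cycle hypergraph used later in Example 5.

```python
from itertools import permutations

def is_tree_decomposition(vertices, hyperedges, tree_edges, bags):
    """Check Definition 1. `tree_edges` is a set of frozensets {t1,t2} over
    tree nodes; `bags` maps each tree node to a set of hypergraph vertices."""
    # Condition 1: every hyperedge is contained in some bag.
    if not all(any(set(e) <= bags[t] for t in bags) for e in hyperedges):
        return False
    # Condition 2: for every vertex, the tree nodes whose bags contain it
    # form a non-empty connected subtree (checked here via a BFS).
    for v in vertices:
        nodes = {t for t in bags if v in bags[t]}
        if not nodes:
            return False
        start = next(iter(nodes))
        seen, frontier = {start}, [start]
        while frontier:
            cur = frontier.pop()
            for edge in tree_edges:
                if cur in edge:
                    (other,) = edge - {cur}
                    if other in nodes and other not in seen:
                        seen.add(other)
                        frontier.append(other)
        if seen != nodes:
            return False
    return True

def is_non_redundant(bags):
    """Non-redundancy: no bag is a subset of another bag."""
    return not any(bags[u] <= bags[w] for u, w in permutations(bags, 2))
```

For the 4-cycle decomposition with bags {1,2,3} and {1,3,4} joined by one tree edge, both checks succeed.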

Entropic Functions.

Let n be a positive integer. A function h: 2^[n] → ℝ₊ is called a set function on [n]={1,2,…,n}. Given a discrete random variable X with support 𝒳 and probability distribution p: 𝒳 → [0,1], its entropy is defined as H(X) = −∑_{x∈𝒳} p(x) log p(x). When the probability distribution is uniform over the support set, then H(X) = log|𝒳|. Given a set of random variables X1,…,Xk, the joint entropy H(X1,…,Xk) is defined as the entropy of the random variable Y that represents the tuple (X1,…,Xk). A set function is an entropic function of order n if there exist random variables A1,…,An such that h(S) = H((Ai)_{i∈S}) for every S⊆[n]. We denote by Γn the set of all entropic functions of order n, and by Γ¯n the topological closure of Γn.¹ (¹The topological closure of a set S is defined as the smallest closed set that contains S.) The important property of Γ¯n is that it is a convex cone and hence characterized by the (infinitely many) linear inequalities it satisfies.
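As a small numeric illustration of these definitions (our own sketch, not from the paper), the following code computes the joint entropy of a projection of the uniform distribution over a set of tuples; for a uniform distribution over m distinct values, it returns log₂ m.

```python
import math
from collections import Counter

def joint_entropy(tuples, coords):
    """Entropy (in bits) of the projection, onto the coordinates in
    `coords`, of the uniform distribution over `tuples`."""
    counts = Counter(tuple(t[i] for i in coords) for t in tuples)
    n = len(tuples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Uniform distribution over 4 distinct pairs: H = log2(4) = 2 bits.
tuples = [(0, 0), (0, 1), (1, 0), (1, 1)]
assert abs(joint_entropy(tuples, [0, 1]) - 2.0) < 1e-9
# The marginal on the first coordinate is uniform over 2 values: 1 bit.
assert abs(joint_entropy(tuples, [0]) - 1.0) < 1e-9
```

Evaluating `joint_entropy` on all subsets of coordinates yields exactly an entropic set function h of order n.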

Entropic Width.

Let 𝖣𝖢 be a set of triples (X, Y, N_{Y|X}) with X⊆Y⊆[n], which encodes a set of degree constraints. A database instance I over the relational schema {Re}_{e∈ℰ} satisfies the constraints if for every relation Re in I with X⊆Y⊆e and every tuple t defined over schema X, we have |πY(Re ⋉ t)| ≤ N_{Y|X}. A constraint of the form (∅, e, Ne) is simply a cardinality constraint that says that relation Re has size at most Ne. The degree constraints on an instance can be translated into constraints on entropic functions as follows:

𝖧𝖣𝖢 := { h: 2^[n] → ℝ₊ ∣ ∀(X, Y, N_{Y|X})∈𝖣𝖢: h(Y|X) ≤ log N_{Y|X} }

where h(Y|X) := h(Y) − h(X). For a given set of degree constraints 𝖧𝖣𝖢, we define the “scaled-up” degree constraints 𝖧𝖣𝖢×k as the same set of constraints but with all degree bounds multiplied by k. It will also be helpful to define the entropic constraints in the case where we are only given a uniform cardinality constraint (the input size):

𝖤𝖣 := { h: 2^[n] → ℝ₊ ∣ h(e) ≤ 1 for all e∈ℰ }

We now define the entropic width (defined in [9]) and degree-aware entropic width (defined in [15] as eda-subw) w.r.t. a set of tree decompositions 𝕋 respectively:

entw_𝕋(ℋ, 𝖤𝖣) := max_{h∈Γ¯n∩𝖤𝖣} min_{(𝒯,χ)∈𝕋} max_{v∈V(𝒯)} h(χ(v))
da-entw_𝕋(ℋ, 𝖧𝖣𝖢) := max_{h∈Γ¯n∩𝖧𝖣𝖢} min_{(𝒯,χ)∈𝕋} max_{v∈V(𝒯)} h(χ(v))

When 𝕋=𝖳𝖣(ℋ), we obtain the standard notions of (degree-aware) entropic width.

Computational Model.

We use the uniform-cost RAM model where data values as well as pointers to databases are of constant size. Our runtime analysis will consider data complexity (where the query is considered fixed) unless otherwise stated.

4 Generalized Covers

We start by generalizing the notion of a cover of a query result from one to multiple tree decompositions.

Definition 2 (Cover).

Let Q be a full CQ with hypergraph ℋ, D be an instance, and 𝕋 be a finite set of tree decompositions of ℋ. A relation K over schema [n] is a cover of Q(D) w.r.t. 𝕋 if

⋃_{(𝒯,χ)∈𝕋} ⋈_{t∈V(𝒯)} (π_{χ(t)} K) = Q(D).

In other words, if for every decomposition we project the cover to its bags and then join the bags together, the union of these outputs must form exactly Q(D). When the set 𝕋 consists of a single tree decomposition, we recover the notion of a cover from [13]. In our more general definition, the reconstruction of the output Q(D) is based not on a single tree decomposition, but on a set of tree decompositions.
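Definition 2 can be checked by brute force on small instances. The following Python sketch (our own illustration; tuples are indexed by 0-based variable position, bags are tuples of positions) computes the union of the bag-projection joins over the active domain and compares it with Q(D).

```python
from itertools import product

def proj(t, bag):
    """π_bag(t): project a tuple onto the positions in `bag`."""
    return tuple(t[i] for i in bag)

def in_join(K, bags, t):
    """t ∈ ⋈_{B ∈ bags} π_B(K): since t is a full tuple and the bags of a
    decomposition cover all variables, membership holds iff every bag
    projection of t appears in the corresponding projection of K."""
    return all(proj(t, B) in {proj(s, B) for s in K} for B in bags)

def is_cover(K, decompositions, Q_D, n):
    """Brute-force check of Definition 2 over the active domain: the union,
    over all decompositions, of the joins of bag projections of K must
    equal Q(D)."""
    domain = {v for t in (set(Q_D) | set(K)) for v in t}
    generated = {t for t in product(sorted(domain), repeat=n)
                 if any(in_join(K, bags, t) for bags in decompositions)}
    return generated == set(Q_D)
```

For instance, with one decomposition whose bags are (0,1) and (1,2), the output {(0,0,0), (1,1,1)} is covered by itself but not by the single tuple (0,0,0).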

Proposition 3.

Let Q be a full CQ, D be an instance, and 𝕋 be a set of tree decompositions of Q. If K is a cover of Q(D) w.r.t. 𝕋, then K ⊆ Q(D).

Proof.

This follows immediately from the fact that for every tree decomposition (𝒯,χ) and relational instance K, we have K ⊆ ⋈_{t∈V(𝒯)} (π_{χ(t)} K).

 Remark 4.

It is useful to contrast this with the original definition of a cover used in [13] for a single tree decomposition (𝒯,χ). In that definition, a cover K was equivalently defined as a set of tuples for which π_{χ(v)}(K) = π_{χ(v)}(Q(D)) for every v∈V(𝒯). However, such an equivalent characterization no longer holds in the case of multiple tree decompositions, since each decomposition may be responsible for producing only a (strict) subset of the query output.

We say that a cover K is minimal w.r.t. 𝕋 if there is no strict subset K′⊊K that is also a cover w.r.t. 𝕋.² (²We should note here that in [13] a cover always refers to a minimal cover. In this paper, however, we will also consider covers that may not be minimal.) A cover is minimum w.r.t. 𝕋 if it has the smallest possible size across all covers w.r.t. 𝕋.

Example 5.

We will use as an example the 4-cycle query:

Q(x1,x2,x3,x4) ← R1(x1,x2), R2(x2,x3), R3(x3,x4), R4(x4,x1).

This CQ has only two non-redundant tree decompositions: T1 with bags {1,2,3} and {1,3,4}, and T2 with bags {1,2,4} and {2,3,4}. We consider the following relational instance D, where each relation has 2n tuples and thus the input size is |D|=Θ(n). Note also that |Q(D)|=2n², hence the output size is quadratic w.r.t. the input size.

Suppose now that we want to find a cover w.r.t. {T1} only. Using a result from [13], we know that the size of any cover w.r.t. a single decomposition is at least the size of the largest bag projection of the output. For T1, both bags {1,2,3} and {1,3,4} have projections of size Θ(n²), hence the cover has size Ω(n²). A similar reasoning provides a lower bound of Ω(n²) for the size of any cover w.r.t. T2. Hence, no matter which tree decomposition we use, a cover w.r.t. a single tree decomposition has size Ω(n²).

Now, suppose we want to find a cover w.r.t. {T1,T2}, i.e., we will include both non-redundant tree decompositions. We claim that the following set K of size only O(n) is a cover w.r.t. {T1,T2}:

Indeed, partition D = Da ∪ Db, where Da, Db contain the tuples with a-values and b-values respectively, and similarly partition K into Ka, Kb. Then, one can verify that π124(Ka) ⋈ π234(Ka) = Q(Da) and π123(Kb) ⋈ π134(Kb) = Q(Db). In other words, we use decomposition T2 to guide the covering of the a-tuples, and T1 to guide the covering of the b-tuples. Note that π123(Ka) ⋈ π134(Ka) ≠ Q(Da).

As we will show later in the paper, for the 4-cycle query we can always produce a cover of size O(|D|3/2) using both tree decompositions. In this example, we are able to do even better and produce a cover of only linear size.

4.1 Basic Properties of Covers

The following propositions establish some basic facts about covers.

Proposition 6.

Let Q be a full CQ, D be an instance, and 𝕋, 𝕋′ be two sets of tree decompositions of Q such that 𝕋 ⊆ 𝕋′. If K is a cover of Q(D) w.r.t. 𝕋, then K is a cover of Q(D) w.r.t. 𝕋′ as well.

Proof.

Indeed, we can write the following:

Q(D) = ⋃_{(𝒯,χ)∈𝕋} ⋈_{v∈V(𝒯)} π_{χ(v)}K ⊆ ⋃_{(𝒯,χ)∈𝕋′} ⋈_{v∈V(𝒯)} π_{χ(v)}K
⋃_{(𝒯,χ)∈𝕋′} ⋈_{v∈V(𝒯)} π_{χ(v)}K ⊆ ⋃_{(𝒯,χ)∈𝕋′} ⋈_{v∈V(𝒯)} (π_{χ(v)}Q(D)) = Q(D)

In the above chain, the first inequality of the second line follows from Proposition 3, which implies K ⊆ Q(D), while the last equality follows from the fact that for every tree decomposition (𝒯,χ), ⋈_{v∈V(𝒯)} (π_{χ(v)}Q(D)) = Q(D).

The above proposition tells us that if we add more tree decompositions in a set 𝕋, the minimum cover can never increase in size. But how large does 𝕋 need to be to achieve the smallest possible cover (among all possible tree decomposition sets)?

Proposition 7.

Let Q be a full CQ, D be an instance, and 𝕋 be a finite set of tree decompositions of Q. If K covers Q(D) w.r.t. 𝕋, then it covers Q(D) w.r.t. the (finite) set of all non-redundant tree decompositions 𝖳𝖣(ℋ).

Proof.

From Proposition 6, we have that K covers Q(D) w.r.t. 𝕋 ∪ 𝖳𝖣(ℋ). Hence:

⋃_{(𝒯,χ)∈𝕋∪𝖳𝖣(ℋ)} ⋈_{v∈V(𝒯)} π_{χ(v)}K = Q(D)

To complete the proof, we will show that

⋃_{(𝒯,χ)∈𝖳𝖣(ℋ)} ⋈_{v∈V(𝒯)} π_{χ(v)}K = ⋃_{(𝒯,χ)∈𝕋∪𝖳𝖣(ℋ)} ⋈_{v∈V(𝒯)} π_{χ(v)}K

It is straightforward that the left-hand side is contained in the right-hand side. For the other direction, let (𝒯,χ) be a redundant tree decomposition in 𝕋. Then, there exists a non-redundant tree decomposition (𝒯′,χ′) in 𝖳𝖣(ℋ) such that for every bag v′∈V(𝒯′), there exists a bag v∈V(𝒯) with χ(v)=χ′(v′) (this decomposition is constructed by eliminating the bags that are contained in some other bag). We will show that for any K over schema [n], it holds that ⋈_{v∈V(𝒯)} π_{χ(v)}K ⊆ ⋈_{v′∈V(𝒯′)} π_{χ′(v′)}K. Indeed, take any tuple t ∈ ⋈_{v∈V(𝒯)} π_{χ(v)}K and consider some bag v′∈V(𝒯′). Then, there exists v∈V(𝒯) with χ(v)=χ′(v′). Hence, π_{χ′(v′)}(t) = π_{χ(v)}(t) ∈ π_{χ(v)}(K) = π_{χ′(v′)}(K). This implies that t ∈ ⋈_{v′∈V(𝒯′)} π_{χ′(v′)}K as well.

The above proposition says that we can achieve the best possible cover in terms of size if we take our set of tree decompositions to be 𝖳𝖣(ℋ), which is finite.

4.2 From Covers to Constant-Delay Enumeration

In this section, we show an important property of covers. Given a cover K of Q(D) w.r.t. some set of tree decompositions, we show that with preprocessing time only linear w.r.t. |K|, it is possible to enumerate the output of a query Q with constant delay. A constant-delay enumeration algorithm means that the time between outputting two consecutive tuples in Q(D) (as well as the time to produce the first tuple, and the time to finish after producing the last tuple) is O(1) w.r.t. data complexity.

Theorem 8.

Let Q be a full CQ, D be an instance, and 𝕋 be a finite set of tree decompositions of Q. Suppose we are given a cover K for Q(D) w.r.t. 𝕋. Then, with preprocessing time O(|K|) we can enumerate Q(D) with constant delay.

Proof.

We first describe the preprocessing phase. We will use K to construct a materialized instance RB for every bag B in every decomposition of 𝕋. In particular, we scan K, and for every tuple t∈K, every decomposition (𝒯,χ) in 𝕋, and every bag v∈V(𝒯), we add the projection t[χ(v)] to the bag relation R_{χ(v)}. At this point, since K is a cover for Q(D), we know that Q(D) = ⋃_{(𝒯,χ)∈𝕋} ⋈_{v∈V(𝒯)} R_{χ(v)}.

Once we have materialized each bag, computing the join J_{(𝒯,χ)} = ⋈_{v∈V(𝒯)} R_{χ(v)} corresponds to computing an acyclic CQ. Indeed, by the definition of a tree decomposition, the query formed by treating each bag as a relation must be acyclic. Thus, we can use the standard result [3] that, with linear preprocessing time O(∑_{v∈V(𝒯)} |R_{χ(v)}|) = O(|K|), we can enumerate the results in J_{(𝒯,χ)} with constant delay. The last thing we need to address is that the same output tuple may be produced via more than one tree decomposition in 𝕋. To deal with this, we can apply Cheater’s lemma [5], which tells us that we can enumerate the union of a constant number of constant-delay enumeration algorithms with constant delay. Alternatively, we can also apply the method of Durand and Strozecki [8] for enumeration of unions of sets, as done in [4].
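The preprocessing scan in the proof above is a single pass over the cover; the following Python sketch (our illustration, with bags as tuples of 0-based positions) materializes the hash-indexed bag relations R_B.

```python
from collections import defaultdict

def materialize_bags(K, decompositions):
    """Preprocessing step from the proof of Theorem 8: one linear scan
    over the cover K builds a hash-indexed relation R_B for every bag B
    of every decomposition, so that Q(D) is the union, over
    decompositions, of the acyclic joins of the materialized bags."""
    R = defaultdict(set)
    for t in K:
        for bags in decompositions:
            for B in bags:
                R[B].add(tuple(t[i] for i in B))
    return R
```

The enumeration phase then runs a constant-delay algorithm for each acyclic join over these materialized bags and deduplicates across decompositions (via Cheater’s lemma); we do not sketch that part here.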

The main message of the above theorem is that considering more tree decompositions does not hurt the constant-delay enumeration property of covers if we consider data complexity.

 Remark 9.

Although the delay is constant w.r.t. data complexity, it depends linearly on the number of tree decompositions |𝕋|; more precisely, the delay will be O(|𝕋|·|Q|). If we choose to include all non-redundant tree decompositions, i.e., 𝕋=𝖳𝖣(ℋ), then the delay can become exponentially large in the size of the query, since 𝖳𝖣(ℋ) can be as large as n!. However, in practice one may not need to consider all tree decompositions to obtain a concise representation of the output.

5 Finding Small Covers

In this section, we will seek to find the smallest possible cover we can obtain for Q(D) w.r.t. any finite set of tree decompositions 𝕋. We will be interested in providing worst-case guarantees on the size, and not an instance-optimal construction of the minimum-sized cover. In particular, we will show the following theorem.

Theorem 10.

Given a full CQ Q with hypergraph ℋ and a database instance D that satisfies the degree constraints 𝖣𝖢, there exists a cover of Q(D) w.r.t. a finite set of tree decompositions 𝕋 of size O(2^{da-entw_𝕋(ℋ,𝖧𝖣𝖢)}). Moreover, given the output of the query Q(D), we can construct this cover in time O(|Q(D)|).

When 𝕋=𝖳𝖣(ℋ), the size bound simply becomes O(2^{da-entw(ℋ,𝖧𝖣𝖢)}). To show the above theorem, we give in the rest of the section a constructive proof, based on a simple greedy algorithm (Algorithm 1). This algorithm is inspired by the construction used in [15] (proof of Lemma 4.1) to construct small models for disjunctive Datalog rules. Here, we adapt this idea to a CQ and do a direct analysis.

The algorithm starts with an empty set K¯ and iterates over the tuples t∈Q(D) (in any order). For every tuple t∈Q(D), if there exists a tree decomposition (𝒯,χ)∈𝕋 such that t ∈ ⋈_{v∈V(𝒯)} π_{χ(v)}(K¯), we do nothing; otherwise, we add t to K¯. The algorithm terminates when we have visited all output tuples. To bound the running time of Algorithm 1, we can implement the existence check in line 3 in (amortized) constant time per tuple by maintaining a hash table on the projection π_B(K¯) for each bag B, and then simply checking whether for every v∈V(𝒯) we have t[χ(v)] ∈ π_{χ(v)}(K¯) (for example, cuckoo hashing [17] gives us constant-time access and amortized constant-time insertions). Hence, we have the following bound on the running time.

Lemma 11.

Algorithm 1 runs in time O(|Q(D)|).

Algorithm 1 Greedy Cover.
Input: output Q(D) of a full CQ Q, set of tree decompositions 𝕋
Output: cover of Q(D) w.r.t. 𝕋
1  K¯ ← ∅
2  foreach t ∈ Q(D) do
3    if there is no (𝒯,χ) ∈ 𝕋 such that t ∈ ⋈_{v∈V(𝒯)} π_{χ(v)}(K¯) then
4      K¯ ← K¯ ∪ {t}
5  return K¯
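Algorithm 1, including the hash-table implementation of the check in line 3, can be sketched in a few lines of Python (our illustration; tuples are indexed by 0-based variable position, and each decomposition is given as its list of bags).

```python
from collections import defaultdict

def greedy_cover(Q_D, decompositions):
    """A sketch of Algorithm 1 (Greedy Cover).  A tuple t is skipped as
    soon as some decomposition already generates it, i.e. every bag
    projection of t is present in the corresponding hash table."""
    cover = []
    seen = defaultdict(set)          # bag B -> projection of `cover` on B
    for t in Q_D:                    # any order works
        if any(all(tuple(t[i] for i in B) in seen[B] for B in bags)
               for bags in decompositions):
            continue                 # some decomposition generates t
        cover.append(t)
        for bags in decompositions:
            for B in bags:
                seen[B].add(tuple(t[i] for i in B))
    return cover
```

On the four output tuples of Example 14 (with the two 4-cycle decompositions), this sketch keeps t1, t2, t3 and skips t4, matching the example.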

By construction of K¯, we also obtain:

Lemma 12.

The output K¯ of Algorithm 1 is a cover of Q(D) w.r.t. 𝕋.

Note that if we run Algorithm 1 with two different sets of tree decompositions 𝕋 ⊆ 𝕋′, the cover produced for 𝕋′ will be at most as large as the one produced for 𝕋. However, the delay in the enumeration of the results from the cover will increase. Hence, we essentially have a tradeoff between the compression size (the size of the cover) and the time to decompress (the delay).

Let 𝔅𝕋 be the set of all maps β: 𝕋 → 2^[n] such that β(𝒯,χ) = χ(v) for some v∈V(𝒯). Such a map β is called a bag selector: it chooses a bag from each tree decomposition in 𝕋.
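Enumerating the bag selectors is simply taking the Cartesian product of the bag lists of the decompositions; the following small Python sketch (our illustration) returns each selector as the tuple of chosen bags. For the 4-cycle query, with two decompositions of two bags each, there are 2×2 = 4 selectors.

```python
from itertools import product

def bag_selectors(decompositions):
    """All bag selectors β: one choice of bag per decomposition.  Each
    selector is returned as the tuple of its chosen bags."""
    return list(product(*decompositions))

T1 = [(0, 1, 2), (0, 2, 3)]   # 4-cycle decomposition T1 (0-based positions)
T2 = [(0, 1, 3), (1, 2, 3)]   # 4-cycle decomposition T2
assert len(bag_selectors([T1, T2])) == 4
```

Note that |𝔅𝕋| depends only on the query, so it is a constant w.r.t. data complexity.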

Lemma 13.

The set K¯ can be partitioned as {K¯β}_{β∈𝔅𝕋} such that for any two tuples t, t′ ∈ K¯β it holds that t[B] ≠ t′[B] for every B∈𝗂𝗆𝖺𝗀𝖾(β).

Proof.

We will construct the sets K¯β following the construction of K¯ in Algorithm 1. Initially, all K¯β are empty. Consider the point where a new tuple t is added to K¯ in line 4. Then, for every decomposition (𝒯,χ)∈𝕋 there must exist some bag v∈V(𝒯) such that t[χ(v)] ∉ π_{χ(v)}(K¯) (if there is more than one such bag, we choose one arbitrarily). Consider the bag selector β whose image is exactly this set of bags, and add t to K¯β. It is easy to see that, for every previously added tuple t′∈K¯β and every B∈𝗂𝗆𝖺𝗀𝖾(β), we have t[B] ∉ πB(K¯) at the time of insertion, and thus it must be that t[B] ≠ t′[B].

Example 14.

We show how Algorithm 1 and the above lemma work using Example 5. Recall that we have two tree decompositions: T1 with bags {1,2,3}, {1,3,4} and T2 with bags {1,2,4}, {2,3,4}. There are four bag selectors, with images β1={123,124}, β2={123,234}, β3={134,124}, and β4={134,234}. To make the exposition simpler, suppose the instance produces only four output tuples:

t1 = (a₁¹, a₂, a₃¹, a₄),  t2 = (a₁¹, a₂, a₃², a₄),  t3 = (a₁², a₂, a₃¹, a₄),  t4 = (a₁², a₂, a₃², a₄)

K¯ and all K¯β are initially empty. After reading the first tuple, we can assign it to any selector, say K¯β1={t1}. The second tuple is also added to K¯; however, we will not add it to K¯β1 because t1, t2 agree on the bag 124. Thus, K¯β2={t2}. For t3, we add it to K¯β1, so K¯β1={t1,t3}. Finally, t4 will not be added, since it can be generated via decomposition T2.

Equipped with the above two lemmas, we can now prove an upper bound on the size of the cover constructed via algorithm 1.

Theorem 15.

If K¯ is the output of Algorithm 1, then |K¯| = O(2^{da-entw_𝕋(ℋ,𝖧𝖣𝖢)}).

Proof.

From Lemma 13, we have that ∑_{β∈𝔅𝕋} |K¯β| = |K¯|. Hence, there exists a set K¯β with size |K¯β| ≥ |K¯|/|𝔅𝕋| such that for any two tuples t, t′ ∈ K¯β it holds that t[B] ≠ t′[B] for every B∈𝗂𝗆𝖺𝗀𝖾(β). The set K¯β ⊆ Q(D) satisfies all degree constraints, since it is a subset of Q(D), which also satisfies them. Now, take the probability distribution over tuples on [n] that is uniform over K¯β, that is, each tuple in K¯β is chosen with the same probability. Let h¯ denote the entropic function of this distribution. Note that by construction, h¯ ∈ Γ¯n∩𝖧𝖣𝖢. Moreover, because the projections of the tuples of K¯β on each bag B∈𝗂𝗆𝖺𝗀𝖾(β) are pairwise distinct, log|K¯β| = h¯(B).

Now, consider any tree decomposition (𝒯,χ)∈𝕋; then, there exists a bag v∈V(𝒯) such that B = χ(v) ∈ 𝗂𝗆𝖺𝗀𝖾(β), and hence:

log|K¯β| = h¯(B) ≤ max_{v∈V(𝒯)} h¯(χ(v)).

Taking the min over all tree decompositions in 𝕋, we can write:

log|K¯β| ≤ min_{(𝒯,χ)∈𝕋} max_{v∈V(𝒯)} h¯(χ(v))
≤ max_{h∈Γ¯n∩𝖧𝖣𝖢} min_{(𝒯,χ)∈𝕋} max_{v∈V(𝒯)} h(χ(v))
= da-entw_𝕋(ℋ, 𝖧𝖣𝖢).

The bound in the statement follows from the fact that |K¯β| ≥ |K¯|/|𝔅𝕋|.

Corollary 16.

Given a full CQ Q with hypergraph ℋ and a database instance D, there exists a cover of Q(D) w.r.t. 𝖳𝖣(ℋ) of size O(|D|^{entw(ℋ)}).

Example 17.

Continuing our running example for the 4-cycle query, we know that for this query entw = subw = 3/2. Hence, Algorithm 1 constructs a cover of size at most O(|D|^{3/2}); for this, it suffices to use as 𝕋 the only two non-redundant tree decompositions.

Combining the above corollary with Theorem 8, we obtain that we can construct a data structure of size only O(|D|^{entw}) that provides a constant-delay enumeration guarantee for the output Q(D). Note that the best-known such data structure has size O(|D|^{subw}), where subw is the submodular width of the query. However, it holds that entw ≤ subw [14].³ (³It is not known whether there is some query where there is a gap between the two width measures.) Of course, we do not know whether we can construct this data structure in time O(|D|^{entw}), since our algorithm needs time linear w.r.t. the output size, which can be asymptotically larger. However, as the next proposition shows, we can construct a (potentially larger) cover in time and size O(|D|^{subw}) using the PANDA algorithm as a blackbox.

Proposition 18.

Given a full CQ Q with hypergraph ℋ and a database instance D, there exists a cover of Q(D) w.r.t. 𝖳𝖣(ℋ) of size O(|D|^{subw(ℋ)}) that can be constructed in time Õ(|D|^{subw(ℋ)}).⁴ (⁴The notation Õ allows for a polylogarithmic factor in the input size |D|.)

6 Lower Bounds for Covers

To prove the lower bound, we will relate the size of a cover to the size of a model of a disjunctive Datalog rule, following a similar argument to the one used in [9] to connect the size of a semiring circuit to disjunctive Datalog. Following the definition in [15], a disjunctive Datalog rule over a hypergraph ℋ=([n],ℰ) is an expression of the form:

P: ⋁_{B∈ℬ} TB(𝐱B) ← ⋀_{e∈ℰ} Re(𝐱e)

where B⊆[n] for every B∈ℬ and ℬ⊆2^[n]. Each output relation TB in the head of the rule is called a target. Given an instance D, a model of a disjunctive Datalog rule is a tuple 𝐓=(TB)_{B∈ℬ} of relation instances such that for any tuple t over schema [n], if t[e]∈Re for all e∈ℰ, then there exists a target TB such that t[B]∈TB. The size of a model is defined to be max_B |TB|, and the output size |P(D)| of a disjunctive Datalog rule over an instance D is the minimum size over all models of the rule.
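The model condition is easy to check on small examples. The following Python sketch (our illustration; targets are maps from a bag, given as a tuple of 0-based positions, to a set of projected tuples) verifies it for the tuples satisfying the body of the rule.

```python
def is_model(targets, satisfying_tuples):
    """Model condition for a disjunctive rule: every tuple t that
    satisfies the body (for a full CQ body, every t in Q(D)) must have
    some target bag B with t[B] in T_B."""
    return all(any(tuple(t[i] for i in B) in T_B
                   for B, T_B in targets.items())
               for t in satisfying_tuples)

def model_size(targets):
    """The size of a model is the maximum target size."""
    return max(len(T_B) for T_B in targets.values())
```

For example, over the satisfying tuples {(0,0,0), (1,1,1)}, the targets T_{01} = {(0,0)} and T_{12} = {(1,1)} form a model of size 1, while T_{01} alone does not.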

Lemma 19.

Let Q be a full CQ, D be an instance, and 𝕋 be a finite set of tree decompositions of Q. Suppose K is a cover of Q(D) w.r.t. 𝕋, and let β be any bag selector for 𝕋. Consider the disjunctive rule

Pβ: ⋁_{B∈𝗂𝗆𝖺𝗀𝖾(β)} TB(𝐱B) ← ⋀_{e∈ℰ} Re(𝐱e).

Then, |K| ≥ |Pβ(D)|.

Proof.

We will construct a model of Pβ from the cover K by taking TB := πB(K) for every B∈𝗂𝗆𝖺𝗀𝖾(β). Clearly, |Pβ(D)| ≤ max_B |TB| ≤ |K|. To show that {TB}_{B∈𝗂𝗆𝖺𝗀𝖾(β)} is a model for Pβ, consider any tuple t∈Q(D). Then, since K is a cover, there exists a tree decomposition (𝒯,χ)∈𝕋 such that t ∈ ⋈_{v∈V(𝒯)} π_{χ(v)}(K). Since β is a bag selector, there exists B∈𝗂𝗆𝖺𝗀𝖾(β) for which β(𝒯,χ)=B, and for this bag, t[B] ∈ πB(K) = TB.
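The construction in this proof is one line of code. The Python sketch below (our illustration, reusing the 0-based positional encoding and the cover {t1, t2, t3} from Example 14) builds the targets T_B := π_B(K) for a selector image.

```python
def model_from_cover(K, selector_image):
    """Lemma 19's construction: T_B := π_B(K) for every bag B in
    image(β).  If K is a cover w.r.t. 𝕋 and β picks one bag per
    decomposition, this yields a model of size at most |K|."""
    return {B: {tuple(t[i] for i in B) for t in K} for B in selector_image}

K = {("a1_1", "a2", "a3_1", "a4"),
     ("a1_1", "a2", "a3_2", "a4"),
     ("a1_2", "a2", "a3_1", "a4")}
T = model_from_cover(K, [(0, 1, 2), (0, 1, 3)])
# The output tuple excluded from the cover is still covered via bag (0,1,3):
assert ("a1_2", "a2", "a4") in T[(0, 1, 3)]
assert max(len(v) for v in T.values()) <= len(K)
```

Every target has at most |K| tuples, which is exactly the inequality |Pβ(D)| ≤ |K| used in the proof.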

Example 20.

We continue our running example for the 4-cycle query with 𝕋={T1,T2}. Consider the bag selector β where we choose bag {1,2,3} from decomposition T1 and bag {1,2,4} from decomposition T2. Then, the following disjunctive Datalog rule is formed:

Pβ: T123(x1,x2,x3) ∨ T124(x1,x2,x4) ← R1(x1,x2), R2(x2,x3), R3(x3,x4), R4(x4,x1).

Lemma 19 tells us that any lower bound on the size of a model for Pβ is also a lower bound on the size of any cover w.r.t. 𝕋. Note that the rule that has only target T123 (or only T124) does not transfer its lower bound, since the disjunction in the head needs to include a bag from every tree decomposition in 𝕋.

Theorem 21.

Let Q be a full CQ, and 𝕋 be a finite set of tree decompositions of Q. Consider a set of constraints 𝖧𝖣𝖢. Then, for any ϵ>0 there exists a scale factor k and an instance D that satisfies the constraints 𝖧𝖣𝖢×k such that every cover K of Q(D) w.r.t. 𝕋 satisfies log|K| ≥ (1−ϵ)·da-entw_𝕋(ℋ, 𝖧𝖣𝖢×k).

Proof.

We will use the lower bound construction for the output of a disjunctive rule [15] and follow the proof of Theorem 3.7 in [9]. Let 𝐁𝕋 be the collection of images of all β∈𝔅𝕋. Then, we can bound da-entw_𝕋(ℋ, 𝖧𝖣𝖢×k) as follows:

da-entw_𝕋(ℋ, 𝖧𝖣𝖢×k) ≤ max_{𝐁∈𝐁𝕋} max_{h∈Γ¯n∩𝖧𝖣𝖢×k} min_{B∈𝐁} h(B)

For a bag selector β, consider the disjunctive rule Pβ as constructed in Lemma 19. We know from Lemma 4.4 in [15] that for any ϵ>0 there exists an integer kβ>0 and an instance Iβ such that

log|Pβ(Iβ)| ≥ (1−ϵ)·max_{h∈Γ¯n∩𝖧𝖣𝖢×kβ} min_{B∈𝗂𝗆𝖺𝗀𝖾(β)} h(B).

Consider now the k0 that maximizes kβ over all bag selectors in 𝔅𝕋, and let β0 be the bag selector that maximizes the right-hand side of the first inequality. Then,

log|Pβ0(Iβ0)| ≥ (1−ϵ)·da-entw_𝕋(ℋ, 𝖧𝖣𝖢×k0).

To conclude the proof, observe that from Lemma 19 we have that, for any cover K of Q(Iβ0) w.r.t. 𝕋, log|K| ≥ log|Pβ0(Iβ0)|.

This lower bound matches (asymptotically) the upper bound of the previous section. When the degree constraints correspond to a uniform cardinality constraint N on each relation and 𝕋 is the set of all non-redundant tree decompositions, then we obtain from Theorem 21 that there exists an instance where any cover has size Ω(N^{entw}).

7 From Covers to Semiring Circuits

In this section, we will show how we can go from a cover of a query result to a semiring circuit. Before we present the main result, we need to introduce some further terminology.

A semiring 𝕊=(𝐃,⊕,⊗,𝟎,𝟏) is a structure where ⊕ is the addition and ⊗ is the multiplication, such that (i) (𝐃,⊕,𝟎) and (𝐃,⊗,𝟏) are commutative monoids, (ii) multiplication is distributive over addition, and (iii) x⊗𝟎=𝟎 for every element x∈𝐃. The semiring is idempotent if x⊕x=x for every element x∈𝐃. An example of an idempotent semiring is the tropical semiring (ℝ∪{+∞}, min, +, +∞, 0).
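To make the semiring axioms concrete, here is a minimal Python sketch (our own illustration) of the tropical semiring, checking idempotence and the annihilation and identity laws.

```python
import math

class TropicalSemiring:
    """The (min, +) tropical semiring: ⊕ = min, ⊗ = +, 0̄ = +∞, 1̄ = 0.
    It is idempotent since min(x, x) = x."""
    zero = math.inf   # additive identity 0̄, absorbing for ⊗
    one = 0.0         # multiplicative identity 1̄

    @staticmethod
    def add(x, y):    # ⊕
        return min(x, y)

    @staticmethod
    def mul(x, y):    # ⊗
        return x + y

S = TropicalSemiring
assert S.add(3, 3) == 3               # idempotence: x ⊕ x = x
assert S.mul(5, S.zero) == math.inf   # x ⊗ 0̄ = 0̄
assert S.mul(5, S.one) == 5           # 1̄ is the multiplicative identity
```

Under this semiring, the sum-product polynomial defined below computes the minimum-weight witness of each output tuple.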

Given a full CQ with hypergraph ℋ, an instance D, and a semiring 𝕊, we add to each input tuple t from relation R_e an annotation x_t^e: this annotation can be thought of as a variable that takes values from the domain 𝐃 of the semiring 𝕊. For example, if 𝕊 is the Boolean semiring ({0,1}, ∨, ∧, 0, 1), the annotation captures the presence or absence of the tuple t. As another example, if 𝕊 is the counting semiring (ℕ, +, ×, 0, 1), the annotation captures the multiplicity of the tuple t. We can now define the sum-product polynomial as:

p_D := ⨁_{t ∈ Q(D)} ⨂_{e ∈ ℋ} x^e_{t[e]}.

The sum-product polynomial captures different types of provenance of the query, depending on the semiring we use to interpret it [10]. When 𝕊 is the Boolean semiring, the polynomial captures Boolean provenance, while if it is the counting semiring, it encodes how-provenance. Why-provenance can also be captured in this framework.
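As a toy example, the sum-product polynomial of the two-atom join Q(x, y, z) = R(x, y), S(y, z) can be evaluated by direct enumeration. The relations, annotations, and the helper `sum_product` below are made up for illustration; they are not a construction from the paper.

```python
# Evaluates p_D = ⊕_{t ∈ Q(D)} ⊗_e x^e_{t[e]} for R(x, y) ⋈ S(y, z) by brute force.
def sum_product(R, S, add, mul, zero):
    total = zero
    for (x, y), a in R.items():          # a = annotation of the R-tuple
        for (y2, z), b in S.items():     # b = annotation of the S-tuple
            if y == y2:                  # (x, y, z) is an output tuple of the join
                total = add(total, mul(a, b))
    return total

R = {(1, 2): True, (1, 3): False}
S = {(2, 4): True, (3, 5): True}

# Boolean semiring ({0,1}, ∨, ∧, 0, 1): is some output tuple truly present?
assert sum_product(R, S, lambda a, b: a or b, lambda a, b: a and b, False)
```

Swapping in `+`/`*` with zero `0` interprets the same polynomial over the counting semiring.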

We can concisely represent the sum-product polynomial using circuits. For our purposes, a circuit F over 𝕊 is a directed acyclic graph whose input nodes are the variables x^e_{t[e]} and the constants 𝟎, 𝟏. Every other node is labelled by a semiring operation (⊕ or ⊗) and has fan-in 2. An input gate of F is any gate with fan-in 0, and the output gate of F is the unique gate with fan-out 0. (In general, we could define a circuit to have multiple output gates, but for the purposes of this paper it suffices to consider circuits with a unique output gate.) The size of the circuit, denoted |F|, is the number of gates in the circuit. If we have a circuit F that computes the polynomial p_D, then we can evaluate p_D for any input values in time only O(|F|) by evaluating the circuit in a bottom-up fashion.
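The bottom-up evaluation can be sketched as follows; the list-of-gates encoding (gates given in topological order, referring back to earlier gates by index) is an assumption made for the example, not a construction from the paper.

```python
# Bottom-up evaluation of a semiring circuit given in topological order.
# Gate formats: ('var', name) | ('const', value) | ('add', i, j) | ('mul', i, j),
# where i, j index earlier gates. Runs in O(|F|): each gate is visited once.
def eval_circuit(gates, env, add, mul):
    val = []
    for g in gates:
        if g[0] == 'var':
            val.append(env[g[1]])              # input gate: look up variable
        elif g[0] == 'const':
            val.append(g[1])                   # constant gate (𝟎 or 𝟏)
        elif g[0] == 'add':
            val.append(add(val[g[1]], val[g[2]]))  # ⊕-gate, fan-in 2
        else:
            val.append(mul(val[g[1]], val[g[2]]))  # ⊗-gate, fan-in 2
    return val[-1]                             # unique output gate

# (x1 ⊗ x2) ⊕ x3 under the tropical semiring (min, +): min(1 + 2, 5) = 3
gates = [('var', 'x1'), ('var', 'x2'), ('var', 'x3'), ('mul', 0, 1), ('add', 3, 2)]
assert eval_circuit(gates, {'x1': 1, 'x2': 2, 'x3': 5}, min, lambda a, b: a + b) == 3
```

The same circuit evaluated over the Boolean semiring computes (x1 ∧ x2) ∨ x3.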

Our main result in this section shows a tight connection between covers and circuits. In particular, we show how to use a cover K to construct a semiring circuit of size O(|K|). Hence, small covers lead to small circuits.

Theorem 22.

Let Q be a full CQ with hypergraph , D be an instance, and 𝕋 be a finite set of tree decompositions of Q. Let K be a cover of Q(D) w.r.t. 𝕋. Then, in time O(|K|) we can construct a semiring circuit of size O(|K|) for the polynomial pD under any idempotent semiring.

Proof.

For a decomposition T = (𝒯, χ) ∈ 𝕋, we will first augment it with a mapping μ_T : ℰ(ℋ) → V(𝒯); this mapping assigns each hyperedge e of ℋ to a node v of the tree decomposition such that e ⊆ χ(v) (such a node always exists by the definition of a tree decomposition). For a node v ∈ V(𝒯) and a tuple t, we can now define a new variable

y^v_t := ⨂_{e : μ_T(e) = v} x^e_{t[e]}

In other words, the new variable corresponds to taking the semiring product of the annotations of the input tuples for the hyperedges assigned to the node v. If no hyperedge is assigned to v, then y^v_t := 𝟏. Any variable y^v_t can always be encoded via a constant-size subcircuit that computes a constant-size semiring product.

Using the definition of a cover, we can now write:

p_D = ⨁_{t ∈ Q(D)} ⨂_{e ∈ ℋ} x^e_{t[e]}
    = ⨁_{t ∈ ⋃_{(𝒯,χ) ∈ 𝕋} ⋈_{v ∈ V(𝒯)}(π_{χ(v)}K)} ⨂_{e ∈ ℋ} x^e_{t[e]}    (by definition of a cover)
    = ⨁_{(𝒯,χ) ∈ 𝕋} ( ⨁_{t ∈ ⋈_{v ∈ V(𝒯)}(π_{χ(v)}K)} ⨂_{e ∈ ℋ} x^e_{t[e]} )    (by the idempotence of ⊕)
    = ⨁_{(𝒯,χ) ∈ 𝕋} ( ⨁_{t ∈ ⋈_{v ∈ V(𝒯)}(π_{χ(v)}K)} ⨂_{v ∈ V(𝒯)} ⨂_{e : μ_T(e) = v} x^e_{t[e]} )    (by commutativity of ⊗)
    = ⨁_{(𝒯,χ) ∈ 𝕋} ( ⨁_{t ∈ ⋈_{v ∈ V(𝒯)}(π_{χ(v)}K)} ⨂_{v ∈ V(𝒯)} y^v_t )    (by definition of y^v_t)

At this point, we can see that it suffices to build a subcircuit for each tree decomposition (𝒯,χ)𝕋 and then sum their output gates to obtain the final output. Since |𝕋| is data-independent, the number of these sum-gates is a constant. Hence, our problem reduces to constructing a circuit for the polynomial

p_T := ⨁_{t ∈ ⋈_{v ∈ V(𝒯)}(π_{χ(v)}K)} ⨂_{v ∈ V(𝒯)} y^v_t.

This corresponds to constructing a circuit for the polynomial of an acyclic query with body ⋈_{v ∈ V(𝒯)} S_{χ(v)}(𝐱_{χ(v)}), where each input relation S_{χ(v)} is obtained by computing the projection π_{χ(v)}K of the cover. Any such projection can be computed in linear time O(|K|). Additionally, from Theorem 4.1 in [9], we know how to construct this circuit in time linear in the size of its input, which is O(|K|). Moreover, the size of the circuit is bounded by the input size, and thus is also O(|K|).
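The linear-time projection step π_{χ(v)}K used in the proof above can be sketched as follows; the dict-based tuple encoding and the helper name `project_cover` are ours, for illustration only.

```python
# Computes one projected relation π_bag(K) per bag, each in one pass over K.
def project_cover(K, bags):
    """K: list of dicts (the tuples of the cover); bags: list of attribute tuples.
    Returns a set of projected tuples per bag; duplicates collapse automatically."""
    return {bag: {tuple(t[a] for a in bag) for t in K} for bag in bags}

K = [{'x': 1, 'y': 2, 'z': 3}, {'x': 1, 'y': 2, 'z': 4}]
S = project_cover(K, [('x', 'y'), ('y', 'z')])
assert S[('x', 'y')] == {(1, 2)}           # the two cover tuples agree on (x, y)
assert S[('y', 'z')] == {(2, 3), (2, 4)}
```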

 Remark 23.

The idempotence of the semiring is a critical requirement for the above theorem, since otherwise the proof breaks. Indeed, if the same output tuple is produced by multiple tree decompositions, then splitting the outer sum ⊕ across decompositions would add that tuple's annotation more than once. Thus, we cannot use a cover to construct a semiring circuit for, say, the counting semiring (ℕ, +, ×, 0, 1), which is not idempotent.
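A tiny numeric sketch of this failure mode, with made-up values: a single output tuple produced by two decompositions has its annotation combined once per decomposition when the outer sum is split.

```python
# One output tuple t with annotation 5, produced by both of two decompositions.
annotation = 5
produced_by = 2

# Splitting ⊕ across decompositions combines t's annotation once per decomposition.
split_plus = sum(annotation for _ in range(produced_by))  # counting semiring: ⊕ is +
split_max = max(annotation for _ in range(produced_by))   # an idempotent ⊕, e.g. max

assert split_plus == 10   # overcounts: + is not idempotent
assert split_max == 5     # still correct: max(x, x) = x
```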

8 Conclusion

In this paper, we have shown that by generalizing covers to depend on multiple tree decompositions, we can construct covers of asymptotically smaller size. We have also provided matching worst-case lower bounds for general degree constraints.

Several research questions on covers remain open. A first question is how efficiently we can construct a small cover of size O(|D|^𝖾𝗇𝗍𝗐). Our current construction takes time O(|Q(D)|), which can be as large as Ω(|D|^ρ). It would be interesting to explore whether it is possible to reduce this time down to O(|D|^𝗌𝗎𝖻𝗐) using the PANDA algorithm as a black box.

A second direction is to consider algorithms that compute the instance-optimal cover (instead of the worst-case optimal). It is likely that this becomes a computationally hard problem – especially when we consider multiple tree decompositions – however, there might be cases where the problem of finding the smallest cover is tractable.

Finally, it is possible to consider a variant of a generalized cover where the outputs of the decompositions are pairwise disjoint. In other words, we want Q(D) to be partitioned into {I_i}_{i ∈ 𝕋} such that each I_i has a cover K_i w.r.t. the decomposition i. Such a “disjoint cover” {K_i}_{i ∈ 𝕋} would allow us to construct a semiring circuit for any semiring, even a non-idempotent one. The current greedy algorithm, however, does not come with a disjointness guarantee, and thus it is not clear how we can construct disjoint covers of small size.

References

  • [1] Antoine Amarilli and Florent Capelli. Tractable circuits in database theory. SIGMOD Rec., 53(2):6–20, 2024. doi:10.1145/3685980.3685982.
  • [2] Albert Atserias, Martin Grohe, and Dániel Marx. Size bounds and query plans for relational joins. In FOCS, pages 739–748. IEEE Computer Society, 2008. doi:10.1109/FOCS.2008.43.
  • [3] Guillaume Bagan, Arnaud Durand, and Etienne Grandjean. On acyclic conjunctive queries and constant delay enumeration. In CSL, volume 4646 of Lecture Notes in Computer Science, pages 208–222. Springer, 2007. doi:10.1007/978-3-540-74915-8_18.
  • [4] Christoph Berkholz and Nicole Schweikardt. Constant delay enumeration with fpt-preprocessing for conjunctive queries of bounded submodular width. In MFCS, volume 138 of LIPIcs, pages 58:1–58:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPICS.MFCS.2019.58.
  • [5] Nofar Carmeli and Markus Kröll. On the enumeration complexity of unions of conjunctive queries. ACM Trans. Database Syst., 46(2):5:1–5:41, 2021. doi:10.1145/3450263.
  • [6] Shaleen Deep, Xiao Hu, and Paraschos Koutris. General space-time tradeoffs via relational queries. In WADS, volume 14079 of Lecture Notes in Computer Science, pages 309–325. Springer, 2023. doi:10.1007/978-3-031-38906-1_21.
  • [7] Shaleen Deep and Paraschos Koutris. Compressed representations of conjunctive query results. In PODS, pages 307–322. ACM, 2018. doi:10.1145/3196959.3196979.
  • [8] Arnaud Durand and Yann Strozecki. Enumeration complexity of logical query problems with second-order variables. In CSL, volume 12 of LIPIcs, pages 189–202. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2011. doi:10.4230/LIPICS.CSL.2011.189.
  • [9] Austen Z. Fan, Paraschos Koutris, and Hangdong Zhao. Tight bounds of circuits for sum-product queries. Proc. ACM Manag. Data, 2(2):87, 2024. doi:10.1145/3651588.
  • [10] Todd J. Green, Gregory Karvounarakis, and Val Tannen. Provenance semirings. In PODS, pages 31–40. ACM, 2007. doi:10.1145/1265530.1265535.
  • [11] Ahmet Kara, Milos Nikolic, Dan Olteanu, and Haozhe Zhang. Trade-offs in static and dynamic evaluation of hierarchical queries. In PODS, pages 375–392. ACM, 2020. doi:10.1145/3375395.3387646.
  • [12] Ahmet Kara, Milos Nikolic, Dan Olteanu, and Haozhe Zhang. Evaluation trade-offs for acyclic conjunctive queries. In CSL, volume 252 of LIPIcs, pages 29:1–29:20. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.CSL.2023.29.
  • [13] Ahmet Kara and Dan Olteanu. Covers of query results. In ICDT, volume 98 of LIPIcs, pages 16:1–16:22. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2018. doi:10.4230/LIPICS.ICDT.2018.16.
  • [14] Mahmoud Abo Khamis, Hung Q. Ngo, and Dan Suciu. What do Shannon-type inequalities, submodular width, and disjunctive datalog have to do with one another? In PODS, pages 429–444. ACM, 2017. doi:10.1145/3034786.3056105.
  • [15] Mahmoud Abo Khamis, Hung Q. Ngo, and Dan Suciu. What do shannon-type inequalities, submodular width, and disjunctive datalog have to do with one another?, 2023. arXiv:1612.02503.
  • [16] Dan Olteanu and Jakub Závodný. Size bounds for factorised representations of query results. ACM Trans. Database Syst., 40(1):2:1–2:44, 2015. doi:10.1145/2656335.
  • [17] Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing. In ESA, volume 2161 of Lecture Notes in Computer Science, pages 121–133. Springer, 2001. doi:10.1007/3-540-44676-1_10.
  • [18] Maximilian Schleich, Dan Olteanu, and Radu Ciucanu. Learning linear regression models over factorized joins. In SIGMOD Conference, pages 3–18. ACM, 2016. doi:10.1145/2882903.2882939.
  • [19] Hangdong Zhao, Shaleen Deep, and Paraschos Koutris. Space-time tradeoffs for conjunctive queries with access patterns. In PODS, pages 59–68. ACM, 2023. doi:10.1145/3584372.3588675.

Appendix A Appendix

We show here the proof of Proposition 18.

Proof.

Consider 𝐁_𝖳𝖣 and, for each 𝐁 ∈ 𝐁_𝖳𝖣, the disjunctive Datalog rule

P : ⋁_{B ∈ 𝐁} T_B(𝐱_B) ← ⋀_{e ∈ ℋ} R_e(𝐱_e)

From the proof of Proposition 7.13 in [15], we can construct in time Õ(|D|^𝗌𝗎𝖻𝗐(ℋ)) a model (T_B)_{B ∈ 𝐁} of size O(|D|^𝗌𝗎𝖻𝗐(ℋ)) using the PANDA algorithm.

Next, for each bag χ(v) of every tree decomposition (𝒯, χ) ∈ 𝖳𝖣, we construct a table S_{χ(v)}(𝐱_{χ(v)}) (with schema 𝐱_{χ(v)}) as follows: (1) take the union of the output tables T_{χ(v)}(𝐱_{χ(v)}) over every disjunctive Datalog rule defined by some 𝐁 = 𝗂𝗆𝖺𝗀𝖾(β) ∈ 𝐁_𝖳𝖣 such that the bag selector β selects this bag χ(v) from (𝒯, χ); and (2) semijoin-reduce the union by every input relation R_e(𝐱_e), to prune off tuples that do not contribute to the output. It is easy to verify that S_{χ(v)}(𝐱_{χ(v)}) has size O(|D|^𝗌𝗎𝖻𝗐(ℋ)).
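The semijoin-reduce step can be sketched as follows; the helper `semijoin_reduce` and the positional tuple encoding are our assumptions, not from the paper.

```python
# Semijoin reduction S ⋉ R: keep only the tuples of S that agree with some
# tuple of R on the attributes shared by the two schemas.
def semijoin_reduce(S, S_attrs, R, R_attrs):
    """S, R: sets of tuples over schemas S_attrs, R_attrs (attribute-name tuples)."""
    shared = [a for a in S_attrs if a in R_attrs]
    # index R on the shared attributes once, then filter S in one pass
    r_keys = {tuple(t[R_attrs.index(a)] for a in shared) for t in R}
    return {t for t in S if tuple(t[S_attrs.index(a)] for a in shared) in r_keys}

S = {(1, 'a'), (2, 'b'), (3, 'c')}
R = {(1, 10), (3, 30)}
# shared attribute 'x': the tuple with x = 2 does not join with R and is pruned
assert semijoin_reduce(S, ('x', 'y'), R, ('x', 'z')) == {(1, 'a'), (3, 'c')}
```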

By Corollary 7.13 in [15], the query result can be written as the following union of CQs, where we have one CQ per tree decomposition:

Q(D) = ⋃_{(𝒯,χ) ∈ 𝖳𝖣} ⋈_{v ∈ V(𝒯)} S_{χ(v)}(𝐱_{χ(v)})

Let Q_{(𝒯,χ)} := ⋈_{v ∈ V(𝒯)} S_{χ(v)}(𝐱_{χ(v)}). Note that this corresponds to computing an acyclic join over relations with sizes bounded by O(|D|^𝗌𝗎𝖻𝗐(ℋ)). We can now use the standard construction of a cover (Lemma 23 in [13]) to construct in time Õ(|D|^𝗌𝗎𝖻𝗐(ℋ)) a cover K_{(𝒯,χ)} of Q_{(𝒯,χ)} of size O(|D|^𝗌𝗎𝖻𝗐(ℋ)). Finally, to obtain the cover K we simply take the union of all these covers, i.e., K := ⋃_{(𝒯,χ) ∈ 𝖳𝖣} K_{(𝒯,χ)}.