Generalized Covers for Conjunctive Queries
Abstract
Covers of query results were introduced as succinct lossless representations of join query outputs. A cover is a subset of the query result from which we can efficiently enumerate the output with constant delay and linear preprocessing time. However, covers are dependent on a single tree decomposition of the query. In this work, we generalize the notion of a cover to a set of multiple tree decompositions. We show that this generalization can potentially produce asymptotically smaller covers while maintaining the properties of constant-delay enumeration and linear preprocessing time. In particular, given a set of tree decompositions, we can determine exactly the asymptotic size of a minimum cover, which is tied to the notion of entropic width of the query. We also provide a simple greedy algorithm that computes this cover efficiently. Finally, we relate covers to semiring circuits when the semiring is idempotent.
Keywords and phrases: Conjunctive Query, tree decomposition, cover
2012 ACM Subject Classification: Theory of computation → Database theory
Editors: Sudeepa Roy and Ahmet Kara

1 Introduction
Join queries are one of the fundamental operations in relational databases. However, the output of a join query can be potentially huge when executed over a large database instance. This is particularly relevant when a join output needs to be sent over a network link in a distributed setting, or if the output result needs to be stored before being processed in a downstream task. In such scenarios, it is often desirable to construct concise representations of the query output such that we can reconstruct the query result as efficiently as possible when and where it is needed. Several succinct data structures have been proposed in the literature, including factorized databases [16] and compressed representations [7].
A cover is one such succinct and lossless representation of a join output that was introduced by Kara and Olteanu [13]. A cover is simply a subset of the query result that, together with a given tree decomposition of the query, can efficiently reconstruct the full query output. Here, efficient means that after a preprocessing step that is linear in the size of the representation, we can enumerate the output with a constant-delay guarantee. One of the key results in [13] is that, given a tree decomposition $\mathcal{T}$ of the query and a database instance of size $N$, we can always produce a cover of size $O(N^{\mathsf{fhw}(\mathcal{T})})$, where $\mathsf{fhw}(\mathcal{T})$ is the fractional hypertree width of $\mathcal{T}$. Thus, choosing a tree decomposition of minimum width gives a cover of size $O(N^{\mathsf{fhw}})$, and this can be shown to be asymptotically optimal. This bound is a large improvement over storing the full output, which can be as large as the AGM bound [2], i.e., $O(N^{\rho^*})$, where $\rho^*$ is the fractional edge cover number of the query.
The main insight of this work is that it is possible to construct covers of even smaller asymptotic size if we define the cover to depend not on a single decomposition, but on multiple tree decompositions. This is analogous to the PANDA algorithm [14], which improves the runtime of query evaluation from $O(N^{\mathsf{fhw}})$ to $\tilde{O}(N^{\mathsf{subw}})$ by simultaneously considering multiple tree decompositions. Here, $\mathsf{subw}$ is the submodular width of the query. Importantly, even though we allow a dependence on multiple tree decompositions, we can still show that the guarantee of linear-time preprocessing with constant-delay enumeration remains unchanged. Thus, this generalized notion of a cover produces an even more succinct and efficient representation of the query output.
In fact, we will show in this paper that by considering the (finite) set of all possible tree decompositions, we can construct covers of size $O(N^{\mathsf{entw}})$, where $\mathsf{entw}$ is the entropic width of the query, a width measure that is at least as small as the submodular width. As a consequence, we show the existence of a data structure of size $O(N^{\mathsf{entw}})$ from which we can enumerate the query output with constant delay. We should note here that, unfortunately, the time to construct this data structure can be much larger than $O(N^{\mathsf{entw}})$, and so the cover construction does not produce a faster query evaluation algorithm.
From a theoretical point of view, this makes progress on the following fundamental question: what is the smallest possible compression of a query result that still guarantees constant-delay enumeration (i.e., fast decompression)? From a practical point of view, the enumeration guarantee depends on the number of tree decompositions, which can be exponentially large in the query. However, we show that as we add more decompositions the size of the cover can only improve; hence, even a small set of decompositions can lead to a large improvement in the cover size compared to considering only a single one. Moreover, the algorithm we present for constructing a small cover – even though it needs access to the full query result – requires only a linear scan over the query result.
Our Contribution.
We summarize our contributions as follows:
- We generalize the notion of a cover of a query result to consider multiple tree decompositions (Section 4). We prove several interesting properties of covers, and show that given a cover $\mathcal{C}$ we can spend $O(|\mathcal{C}|)$ preprocessing time to enumerate the query output with constant delay.
- We present a simple greedy algorithm (Section 5) that, given the query output, produces an asymptotically optimal cover in time linear w.r.t. the query output. Moreover, we show an upper bound that uses not only the input size as a parameter, but any combination of degree constraints, including different cardinalities, functional dependencies, and degree bounds.
- We provide a lower bound (Section 6) that asymptotically matches our upper bound. The lower bound uses a technically interesting connection to disjunctive Datalog rules to find a worst-case instance for a cover.
- Finally, we show in Section 7 how we can use a cover to produce a semiring circuit, of size linear in the cover size, for the provenance polynomial of the query over any idempotent semiring. A semiring circuit can be thought of as a factorization of the query output (more precisely, of the polynomial associated with the query output), with the additional flexibility that it can consider different tree decompositions in its internal representation.
2 Related Work
In this section, we present work that relates to covers and, more generally, to efficient representations of query results.
Factorized Databases and Circuits.
As shown in [13], covers over a single tree decomposition are related to d-representations over the same tree decomposition. d-representations (and f-representations) are also lossless representations of query outputs, which allow for efficient enumeration, aggregation, and even ML model computation [18]. Query outputs can also be concisely stored using circuits [9, 1] – these circuits represent the polynomial corresponding to the join query and can be evaluated under different semiring semantics. As we will show later in the paper, analogous to the connection between covers and d-representations, there is also a direct connection between generalized covers and semiring circuits.
Compressed Representations of Outputs.
Recent work has also looked at how we can further decrease the size of a representation by trading off the enumeration delay guarantee [7, 19, 6]. Such a tradeoff allows for asymptotically smaller succinct data structures, but with an increased possible delay when outputting two consecutive outputs. Even though it would be theoretically possible to consider covers with weaker enumeration guarantees, in this work we focus only on constant delay.
Preprocessing Time and Enumeration.
Several algorithms for join computation use a two-phase framework where a preprocessing phase is followed by an output enumeration phase [3]. For instance, for any join query with fractional hypertree width $\mathsf{fhw}$, we need $O(N^{\mathsf{fhw}})$ preprocessing time to achieve constant-delay enumeration. The intermediate data structure constructed by the preprocessing phase can also be viewed as a succinct representation of the result; however, it is critical in this case to also have an efficient algorithm to construct the data structure. Recent work has looked at preprocessing time/size and enumeration tradeoffs in this setting [11, 12]. In our paper, even though we construct a very small succinct representation, the construction time can exceed the size of the representation and thus does not lead to a better query evaluation algorithm.
3 Preliminaries
In this section, we present useful notation and terminology.
Conjunctive Queries.
A Conjunctive Query (CQ) is an expression associated to a hypergraph $\mathcal{H} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is a set of variables and $\mathcal{E} \subseteq 2^{\mathcal{V}}$ a set of hyperedges, together with a head $\mathbf{y} \subseteq \mathcal{V}$:
$$Q(\mathbf{y}) \;:\!-\; \bigwedge_{F \in \mathcal{E}} R_F(\mathbf{x}_F),$$
where each $R_F$ is a relation of arity $|F|$, the variables take values in some discrete domain, and $\mathbf{x}_F = (x_v)_{v \in F}$. We say that $Q$ is Boolean if $\mathbf{y} = \emptyset$ and full if $\mathbf{y} = \mathcal{V}$.
Given a tuple $a$ over $\mathcal{V}$ and a subset $B \subseteq \mathcal{V}$, we will use $\pi_B(a)$ to denote the projection of the tuple on the attributes in $B$.
Tree Decompositions.
We recall the notion of a tree decomposition.
Definition 1 (Tree Decomposition).
A tree decomposition of a hypergraph $\mathcal{H} = (\mathcal{V}, \mathcal{E})$ is a pair $(T, \chi)$, where $T$ is a tree and $\chi$ maps each node $t$ of the tree to a subset $\chi(t)$ of $\mathcal{V}$, called a bag, such that:
1. every hyperedge $F \in \mathcal{E}$ is a subset of $\chi(t)$ for some node $t$; and
2. for every vertex $v \in \mathcal{V}$, the set $\{t \mid v \in \chi(t)\}$ forms a non-empty connected subtree of $T$.
We say that a tree decomposition is non-redundant if no bag is a subset of another; otherwise it is redundant. We let $\mathsf{TD}(\mathcal{H})$ be the set of all non-redundant tree decompositions of $\mathcal{H}$. This set is known to be finite; it can be shown that $|\mathsf{TD}(\mathcal{H})|$ is bounded by a function of $n$ alone, where $n$ is the number of vertices of the hypergraph (Proposition 2.9 [15]).
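The two conditions of Definition 1 can be verified mechanically on small examples. The following Python sketch uses an encoding of our own (bags as a dict from tree node to vertex set, the tree as an edge list) and checks both conditions by brute force:

```python
def is_tree_decomposition(vertices, hyperedges, tree_edges, bags):
    """Check Definition 1 on an explicit hypergraph (illustrative sketch)."""
    # Condition 1: every hyperedge is contained in some bag.
    if not all(any(set(e) <= bag for bag in bags.values()) for e in hyperedges):
        return False
    # Condition 2: for each vertex, the nodes whose bags contain it
    # must form a non-empty connected subtree of the tree.
    for v in vertices:
        nodes = {t for t, bag in bags.items() if v in bag}
        if not nodes:
            return False
        seen, stack = set(), [next(iter(nodes))]
        while stack:  # BFS restricted to `nodes`
            t = stack.pop()
            if t in seen:
                continue
            seen.add(t)
            for (a, b) in tree_edges:
                if t == a and b in nodes:
                    stack.append(b)
                if t == b and a in nodes:
                    stack.append(a)
        if seen != nodes:
            return False
    return True
```

For the 4-cycle hypergraph, the decomposition with bags $\{x_1,x_2,x_3\}$ and $\{x_1,x_3,x_4\}$ passes both checks, while splitting the vertices into $\{x_1,x_2\}$ and $\{x_3,x_4\}$ violates condition 1.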
Entropic Functions.
Let $n$ be a positive integer. A function $h : 2^{[n]} \to \mathbb{R}_{\geq 0}$ is called a set function on $[n]$. Given a discrete random variable $X$ with support in a finite set $D$ and probability distribution $p$, its entropy is defined as $H(X) = -\sum_{x \in D} p(x) \log p(x)$. When the probability distribution is uniform on the support set, then $H(X) = \log |D|$. Given a set of random variables $X_1, \dots, X_n$ and $S \subseteq [n]$, the joint entropy $H(X_S)$ is defined as the entropy of the random variable that represents the tuple $(X_i)_{i \in S}$. A set function $h$ is an entropic function of order $n$ if there exist random variables $X_1, \dots, X_n$ such that $h(S) = H(X_S)$ for any $S \subseteq [n]$. We denote by $\Gamma^*_n$ the set of all entropic functions of order $n$, and by $\bar{\Gamma}^*_n$ the topological closure of $\Gamma^*_n$. (The topological closure of a set is defined as the smallest closed set that contains it.) The important property of $\bar{\Gamma}^*_n$ is that it is a convex cone and hence characterized by the (infinitely many) linear inequalities it satisfies.
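The identity $H(X) = \log |D|$ for uniform distributions, and the way joint entropies of projections induce a set function, can be illustrated with a small Python sketch (empirical distributions, base-2 logarithms; the encoding is ours):

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (base 2) of the empirical distribution of `samples`."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def joint_entropy(tuples, S):
    """h(S): entropy of the projection of uniformly drawn `tuples` on coordinates S."""
    return entropy([tuple(t[i] for i in S) for t in tuples])
```

For instance, a uniform distribution over 4 distinct values has entropy exactly $\log_2 4 = 2$.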
Entropic Width.
Let $\mathsf{DC}$ be a set of triples $(X, Y, N_{Y|X})$ with $X \subsetneq Y \subseteq \mathcal{V}$ and $N_{Y|X} \geq 1$ that encodes a set of degree constraints. A database instance $\mathcal{D}$ over the relational schema satisfies the constraints if for every relation $R$ in $\mathcal{D}$ whose schema contains $Y$ and every tuple $t$ defined over schema $X$, we have $|\pi_Y(\sigma_{X = t}(R))| \leq N_{Y|X}$. A constraint of the form $(\emptyset, Y, N_Y)$ is simply a cardinality constraint that says that the relation over $Y$ has size at most $N_Y$. The degree constraints on an instance can be translated to constraints on entropic functions as follows:
$$\mathsf{H}_{\mathsf{DC}} = \{ h \mid h(Y) - h(X) \leq \log N_{Y|X} \text{ for every } (X, Y, N_{Y|X}) \in \mathsf{DC} \},$$
where $h$ ranges over set functions on $\mathcal{V}$. For a given set of degree constraints $\mathsf{DC}$ and integer $k \geq 1$, we define the "scaled-up" degree constraints $\mathsf{DC}^k$ as the same set of constraints but with all the degree bounds $\log N_{Y|X}$ multiplied by $k$. It will also be helpful to define the entropic constraints in the case where we are only given a uniform cardinality constraint $N$ (the input size):
$$\mathsf{H}_N = \{ h \mid h(F) \leq \log N \text{ for every } F \in \mathcal{E} \}.$$
We now define the entropic width (defined in [9]) and the degree-aware entropic width (defined in [15] as eda-subw) w.r.t. a set of tree decompositions $\mathbf{T}$, respectively:
$$\mathsf{entw}_{\mathbf{T}}(Q, N) = \sup_{h \in \bar{\Gamma}^*_n \cap \mathsf{H}_N} \; \min_{(T, \chi) \in \mathbf{T}} \; \max_{t \in T} h(\chi(t)), \qquad \mathsf{entw}_{\mathbf{T}}(Q, \mathsf{DC}) = \sup_{h \in \bar{\Gamma}^*_n \cap \mathsf{H}_{\mathsf{DC}}} \; \min_{(T, \chi) \in \mathbf{T}} \; \max_{t \in T} h(\chi(t)).$$
When $\mathbf{T} = \mathsf{TD}(\mathcal{H})$, we obtain the standard notions of (degree-aware) entropic width.
Computational Model.
We use the uniform-cost RAM model, where data values as well as pointers to them are of constant size. Our runtime analysis considers data complexity (where the query is considered fixed) unless otherwise stated.
4 Generalized Covers
We start by generalizing the notion of a cover of a query result from one to multiple tree decompositions.
Definition 2 (Cover).
Let $Q$ be a full CQ with hypergraph $\mathcal{H}$, $\mathcal{D}$ be an instance, and $\mathbf{T}$ be a finite set of tree decompositions of $\mathcal{H}$. A relation $\mathcal{C}$ over schema $\mathcal{V}$ is a cover of $Q(\mathcal{D})$ w.r.t. $\mathbf{T}$ if
$$Q(\mathcal{D}) = \bigcup_{(T, \chi) \in \mathbf{T}} \; \Join_{t \in T} \pi_{\chi(t)}(\mathcal{C}).$$
In other words, if for every decomposition we project the cover to its bags and then join the bags together, the union of these outputs must be exactly $Q(\mathcal{D})$. When the set $\mathbf{T}$ consists of a single tree decomposition, we recover the notion of a cover from [13]. In our more general definition, the reconstruction of the output is based not on a single tree decomposition, but on a set of tree decompositions.
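Definition 2 can be checked directly on toy instances. Below is a brute-force Python sketch (dict-tuples for rows, attribute lists for bags; the names and the exponential-time enumeration are ours, for illustration only):

```python
from itertools import product

def project(rel, attrs):
    """Projection of a set of dict-tuples onto a list of attributes."""
    return {tuple(t[a] for a in attrs) for t in rel}

def bag_join(rel, bags):
    """Join the projections of `rel` onto the given bags over the full schema.
    Brute force over the active domain of `rel`; small examples only."""
    attrs = sorted({a for bag in bags for a in bag})
    projs = [project(rel, bag) for bag in bags]
    domain = sorted({v for t in rel for v in t.values()})
    out = set()
    for vals in product(domain, repeat=len(attrs)):
        t = dict(zip(attrs, vals))
        if all(tuple(t[a] for a in bag) in p for bag, p in zip(bags, projs)):
            out.add(tuple(t[a] for a in attrs))
    return out

def is_cover(cover, output, decompositions):
    """Check Definition 2: the union of the bag-joins of `cover`,
    over all decompositions, must equal `output`."""
    out_tuples = {tuple(t[a] for a in sorted(t)) for t in output}
    produced = set()
    for bags in decompositions:
        produced |= bag_join(cover, bags)
    return produced == out_tuples
```

For a path query with bags $\{x,y\}$ and $\{y,z\}$, a 2-tuple subset can already cover a 4-tuple product-shaped output.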
Proposition 3.
Let $Q$ be a full CQ, $\mathcal{D}$ be an instance, and $\mathbf{T}$ be a set of tree decompositions of $\mathcal{H}$. If $\mathcal{C}$ is a cover of $Q(\mathcal{D})$ w.r.t. $\mathbf{T}$, then $\mathcal{C} \subseteq Q(\mathcal{D})$.
Proof.
This follows immediately from the fact that for every tree decomposition $(T, \chi)$ and relational instance $K$ over schema $\mathcal{V}$, we have $K \subseteq \Join_{t \in T} \pi_{\chi(t)}(K)$; hence $\mathcal{C}$ is contained in the union that defines $Q(\mathcal{D})$.
Remark 4.
It is useful to contrast this with the original definition of a cover used in [13] for a single tree decomposition $(T, \chi)$. There, a cover was equivalently defined as a set of tuples $\mathcal{C} \subseteq Q(\mathcal{D})$ for which $\pi_{\chi(t)}(\mathcal{C}) = \pi_{\chi(t)}(Q(\mathcal{D}))$ for every bag $\chi(t)$. However, such an equivalent characterization does not hold anymore in the case of multiple tree decompositions, since each decomposition may be responsible for producing only a (strict) subset of the query output.
We say that a cover $\mathcal{C}$ is minimal w.r.t. $\mathbf{T}$ if there is no strict subset of $\mathcal{C}$ that is also a cover w.r.t. $\mathbf{T}$. (We should note here that in [13] a cover always refers to a minimal cover. In this paper, however, we will also consider covers that may not be minimal.) A cover is minimum w.r.t. $\mathbf{T}$ if it has the smallest possible size across all covers w.r.t. $\mathbf{T}$.
Example 5.
We will use as an example the 4-cycle query: $Q(x_1, x_2, x_3, x_4) \,:\!-\, R_1(x_1, x_2), R_2(x_2, x_3), R_3(x_3, x_4), R_4(x_4, x_1)$.
This CQ has only two non-redundant tree decompositions: $\mathcal{T}_1$ with bags $\{x_1, x_2, x_3\}$ and $\{x_1, x_3, x_4\}$, and $\mathcal{T}_2$ with bags $\{x_2, x_3, x_4\}$ and $\{x_1, x_2, x_4\}$. We consider the following relational instance , where each relation has tuples and thus the input size is . Note also that , hence the output size is quadratic w.r.t. the input size.
Suppose now that we want to find a cover w.r.t. only. Using a result from [13], we know that the size of any cover w.r.t. a single decomposition is at least the size of the largest bag in the decomposition. For , both bags and have size , hence the cover has size . A similar reasoning provides a lower bound of for the size of any cover w.r.t. . Hence, no matter which tree decomposition we use, a cover w.r.t. a single tree decomposition has size .
Now, suppose we want to find a cover w.r.t. , i.e., we will include both non-redundant tree decompositions. We claim that the following set of size only is a cover w.r.t. :
Indeed, partition , where contain the tuples with -values respectively, and similarly partition into . Then, one can verify that and . In other words, we use decomposition to guide the covering of the -tuples, and to guide the covering of the -tuples. Note that .
As we will show later in the paper, for the 4-cycle query we can always produce a cover of size using both tree decompositions. In this example, we are able to do even better and produce a cover of only linear size.
4.1 Basic Properties of Covers
The following propositions establish some basic facts about covers.
Proposition 6.
Let $Q$ be a full CQ, $\mathcal{D}$ be an instance, and $\mathbf{T}_1 \subseteq \mathbf{T}_2$ be two sets of tree decompositions of $\mathcal{H}$. If $\mathcal{C}$ is a cover of $Q(\mathcal{D})$ w.r.t. $\mathbf{T}_1$, then $\mathcal{C}$ is a cover of $Q(\mathcal{D})$ w.r.t. $\mathbf{T}_2$ as well.
Proof.
Indeed, we can write the following:
$$\bigcup_{(T,\chi) \in \mathbf{T}_2} \Join_{t \in T} \pi_{\chi(t)}(\mathcal{C}) \;\subseteq\; \bigcup_{(T,\chi) \in \mathbf{T}_2} \Join_{t \in T} \pi_{\chi(t)}(Q(\mathcal{D})) \;=\; Q(\mathcal{D}) \;=\; \bigcup_{(T,\chi) \in \mathbf{T}_1} \Join_{t \in T} \pi_{\chi(t)}(\mathcal{C}) \;\subseteq\; \bigcup_{(T,\chi) \in \mathbf{T}_2} \Join_{t \in T} \pi_{\chi(t)}(\mathcal{C}).$$
In the last equation, the first inequality follows from Proposition 3, which implies $\mathcal{C} \subseteq Q(\mathcal{D})$, while the first equality follows from the fact that for every tree decomposition $(T, \chi)$, $\Join_{t \in T} \pi_{\chi(t)}(Q(\mathcal{D})) = Q(\mathcal{D})$.
The above proposition tells us that if we add more tree decompositions to a set $\mathbf{T}$, the minimum cover can never increase in size. But how large does $\mathbf{T}$ need to be to achieve the smallest possible cover (among all possible sets of tree decompositions)?
Proposition 7.
Let $Q$ be a full CQ, $\mathcal{D}$ be an instance, and $\mathbf{T}$ be a finite set of tree decompositions of $\mathcal{H}$. If $\mathcal{C}$ covers $Q(\mathcal{D})$ w.r.t. $\mathbf{T}$, then it covers $Q(\mathcal{D})$ w.r.t. the (finite) set $\mathsf{TD}(\mathcal{H})$ of all non-redundant tree decompositions.
Proof.
From Proposition 6, we have that $\mathcal{C}$ covers $Q(\mathcal{D})$ w.r.t. $\mathbf{T} \cup \mathsf{TD}(\mathcal{H})$. Hence:
$$Q(\mathcal{D}) = \bigcup_{(T,\chi) \in \mathbf{T} \cup \mathsf{TD}(\mathcal{H})} \Join_{t \in T} \pi_{\chi(t)}(\mathcal{C}).$$
To complete the proof, we will show that
$$\bigcup_{(T,\chi) \in \mathsf{TD}(\mathcal{H})} \Join_{t \in T} \pi_{\chi(t)}(\mathcal{C}) = \bigcup_{(T,\chi) \in \mathbf{T} \cup \mathsf{TD}(\mathcal{H})} \Join_{t \in T} \pi_{\chi(t)}(\mathcal{C}).$$
It is straightforward that the left-hand side is contained in the right-hand side. For the other direction, let $(T', \chi')$ be a redundant tree decomposition in $\mathbf{T}$. Then, there exists a non-redundant tree decomposition $(T, \chi)$ in $\mathsf{TD}(\mathcal{H})$ such that for every bag $\chi(t)$ of $T$, there exists a bag $\chi'(t')$ of $T'$ with $\chi(t) \subseteq \chi'(t')$ (this is constructed by eliminating the bags that are contained in some other bag). We will show that for any relation $K$ over schema $\mathcal{V}$, it holds that $\Join_{t' \in T'} \pi_{\chi'(t')}(K) \subseteq \Join_{t \in T} \pi_{\chi(t)}(K)$. Indeed, take any tuple $a \in \Join_{t' \in T'} \pi_{\chi'(t')}(K)$. Consider some bag $\chi(t)$ of $T$. Then, there exists $\chi'(t')$ with $\chi(t) \subseteq \chi'(t')$. Hence, $\pi_{\chi(t)}(a) \in \pi_{\chi(t)}(K)$. This implies that $a \in \Join_{t \in T} \pi_{\chi(t)}(K)$ as well.
The above proposition says that we can achieve the best possible cover in terms of size if we take our set of tree decompositions to be $\mathsf{TD}(\mathcal{H})$, which is finite.
4.2 From Covers to Constant-Delay Enumeration
In this section, we show an important property of covers. Given a cover $\mathcal{C}$ of $Q(\mathcal{D})$ w.r.t. some set of tree decompositions, we show that with preprocessing time only linear w.r.t. $|\mathcal{C}|$, it is possible to enumerate the output of the query with constant delay. Constant-delay enumeration means that the time between outputting two consecutive tuples of $Q(\mathcal{D})$ (as well as the time to produce the first tuple, and the time to finish after producing the last tuple) is $O(1)$ w.r.t. data complexity.
Theorem 8.
Let $Q$ be a full CQ, $\mathcal{D}$ be an instance, and $\mathbf{T}$ be a finite set of tree decompositions of $\mathcal{H}$. Suppose we are given a cover $\mathcal{C}$ for $Q(\mathcal{D})$ w.r.t. $\mathbf{T}$. Then, with preprocessing time $O(|\mathcal{C}|)$ we can enumerate $Q(\mathcal{D})$ with constant delay.
Proof.
We first describe the preprocessing phase. We will use $\mathcal{C}$ to construct a materialized instance for every bag in every decomposition of $\mathbf{T}$. In particular, we scan $\mathcal{C}$, and for every tuple $a \in \mathcal{C}$, every decomposition $(T, \chi)$ in $\mathbf{T}$, and every bag $\chi(t)$, we add the projection $\pi_{\chi(t)}(a)$ to the materialized bag relation. At this point, since $\mathcal{C}$ is a cover for $Q(\mathcal{D})$, we know that $Q(\mathcal{D}) = \bigcup_{(T,\chi) \in \mathbf{T}} \Join_{t \in T} \pi_{\chi(t)}(\mathcal{C})$.
Once we have materialized each bag, computing the join $\Join_{t \in T} \pi_{\chi(t)}(\mathcal{C})$ corresponds to computing an acyclic CQ. Indeed, by the definition of a tree decomposition, the query formed by treating each bag as a relation must be acyclic. Thus, we can use the standard result [3] that, with linear preprocessing time $O(|\mathcal{C}|)$, we can enumerate the results of each such join with constant delay. The last thing we need to address is that the same output tuple may be produced via more than one tree decomposition in $\mathbf{T}$. To deal with this, we can apply Cheater's lemma [5], which tells us that we can enumerate the union of a constant number of constant-delay enumeration algorithms with constant delay. Alternatively, we can also apply the method of Durand and Strozecki [8] for the enumeration of unions of sets, as done in [4].
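The preprocessing phase above — one linear scan over the cover that fills a hash-backed relation per bag — can be sketched as follows (the encoding of tuples and bags is ours):

```python
from collections import defaultdict

def materialize_bags(cover, decompositions):
    """Preprocessing sketch for Theorem 8: a single scan over the cover
    fills one hash-backed relation for every bag of every decomposition.
    `cover` is an iterable of dict-tuples; each decomposition is a list
    of bags, i.e. lists of attribute names."""
    bag_rels = defaultdict(set)  # key: (decomposition index, bag) -> set of tuples
    for t in cover:
        for d, bags in enumerate(decompositions):
            for bag in bags:
                bag_rels[(d, tuple(bag))].add(tuple(t[a] for a in bag))
    return bag_rels
```

The cost is one pass over $\mathcal{C}$ per decomposition, i.e. $O(|\mathcal{C}|)$ in data complexity; constant-delay enumeration of the acyclic bag-joins then proceeds on these materialized relations.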
The main message of the above theorem is that considering more tree decompositions does not hurt the constant-delay enumeration property of covers if we consider data complexity.
Remark 9.
Although the delay is constant w.r.t. data complexity, it depends linearly on the number of tree decompositions; more precisely, the delay will be $O(|\mathbf{T}|)$. If we choose to include all non-redundant tree decompositions, i.e., $\mathbf{T} = \mathsf{TD}(\mathcal{H})$, then the delay can become exponentially large in the size of the query, since $|\mathsf{TD}(\mathcal{H})|$ can grow exponentially with the size of the query. However, in practice one may not need to consider all tree decompositions to obtain a concise representation of the output.
5 Finding Small Covers
In this section, we will seek the smallest possible cover we can obtain for $Q(\mathcal{D})$ w.r.t. any finite set of tree decompositions $\mathbf{T}$. We will be interested in providing worst-case guarantees on the size, and not an instance-optimal construction of the minimum-sized cover. In particular, we will show the following theorem.
Theorem 10.
Given a full CQ $Q$ with hypergraph $\mathcal{H}$ and a database instance $\mathcal{D}$ that satisfies the degree constraints $\mathsf{DC}$, there exists a cover of $Q(\mathcal{D})$ w.r.t. a finite set of tree decompositions $\mathbf{T}$ of size $O(2^{\mathsf{entw}_{\mathbf{T}}(Q, \mathsf{DC})})$. Moreover, given the output of the query $Q(\mathcal{D})$, we can construct this cover in time $O(|Q(\mathcal{D})|)$.
When $\mathsf{DC}$ consists only of a uniform cardinality constraint $N$ on every relation, the size bound simply becomes $O(N^{\mathsf{entw}})$. To show the above theorem, we will give in the rest of the section a constructive proof, based on a simple greedy algorithm (Algorithm 1). This algorithm is inspired by the construction used in [15] (proof of Lemma 4.1) to construct small models for disjunctive Datalog rules. Here, we adapt this idea to a CQ and give a direct analysis.
The algorithm starts with an empty set $\mathcal{C}$, and iterates over the tuples of $Q(\mathcal{D})$ (in any order). For every tuple $a$, if there exists a tree decomposition $(T, \chi) \in \mathbf{T}$ such that $\pi_{\chi(t)}(a) \in \pi_{\chi(t)}(\mathcal{C})$ for every $t \in T$, we do nothing; otherwise, we add $a$ to $\mathcal{C}$. The algorithm terminates when we have visited all output tuples. To bound the running time of Algorithm 1, we can implement the existence check in line 3 in (amortized) constant time per tuple by maintaining a hash table on the projection of $\mathcal{C}$ on each bag $\chi(t)$, and then simply checking that for every $t \in T$, the projection $\pi_{\chi(t)}(a)$ is present in the corresponding hash table (for example, cuckoo hashing [17] gives us constant-time access and amortized constant-time insertions). Hence, we have the following bound on the running time.
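The greedy procedure just described can be sketched in a few lines of Python; the paper's hash tables become one set per bag (the encoding of tuples and bags is ours):

```python
def greedy_cover(output, decompositions):
    """Greedy cover construction (a sketch of Algorithm 1).

    `output` is an iterable of dict-tuples (the query result); each
    decomposition is a list of bags, i.e. lists of attribute names.
    A tuple is skipped iff some decomposition can already reproduce it
    from the bag projections of the tuples kept so far."""
    cover = []
    # One hash set per (decomposition, bag), as in the paper's analysis.
    seen = {(d, tuple(bag)): set()
            for d, bags in enumerate(decompositions) for bag in bags}
    for t in output:
        covered = any(
            all(tuple(t[a] for a in bag) in seen[(d, tuple(bag))] for bag in bags)
            for d, bags in enumerate(decompositions))
        if not covered:
            cover.append(t)
            for d, bags in enumerate(decompositions):
                for bag in bags:
                    seen[(d, tuple(bag))].add(tuple(t[a] for a in bag))
    return cover
```

Note that which tuples are kept depends on the scan order, but the result is always a cover; the analysis below bounds its worst-case size.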
Lemma 11.
Algorithm 1 runs in time $O(|Q(\mathcal{D})|)$.
By construction of , we also obtain:
Lemma 12.
The output of Algorithm 1 is a cover of $Q(\mathcal{D})$ w.r.t. $\mathbf{T}$.
Note that if we run Algorithm 1 with two different sets of tree decompositions $\mathbf{T}_1 \subseteq \mathbf{T}_2$, the cover produced for $\mathbf{T}_2$ will be at least as small as the one for $\mathbf{T}_1$. However, the delay in the enumeration of the results from the cover will increase. Hence, we essentially have a tradeoff between the compression size (the size of the cover) and the time to decompress (the delay).
Let $\Sigma$ be the set of all maps $\sigma$ such that $\sigma(T, \chi) = \chi(t)$ for some node $t \in T$, for every $(T, \chi) \in \mathbf{T}$. Such a map is called a bag selector: it chooses one bag from each tree decomposition in $\mathbf{T}$.
Lemma 13.
The set $\mathcal{C}$ can be partitioned as $\mathcal{C} = \bigcup_{\sigma \in \Sigma} \mathcal{C}_\sigma$ such that for any two distinct tuples $a, b \in \mathcal{C}_\sigma$ it holds that $\pi_{\sigma(T,\chi)}(a) \neq \pi_{\sigma(T,\chi)}(b)$ for every $(T, \chi) \in \mathbf{T}$.
Proof.
We will construct the sets $\mathcal{C}_\sigma$ following the construction of $\mathcal{C}$ in Algorithm 1. Initially, all $\mathcal{C}_\sigma$ are empty. Consider the point where a new tuple $a$ is added to $\mathcal{C}$ in line 4. Then, for every decomposition $(T, \chi) \in \mathbf{T}$ there must exist some bag $\chi(t)$ such that $\pi_{\chi(t)}(a) \notin \pi_{\chi(t)}(\mathcal{C})$ (if there is more than one such bag, we choose one arbitrarily). Consider the bag selector $\sigma$ whose image is exactly this set of bags, and add $a$ to $\mathcal{C}_\sigma$. It is easy to see that, for every tuple $b$ added to $\mathcal{C}_\sigma$ earlier and every $(T, \chi) \in \mathbf{T}$, the projection $\pi_{\sigma(T,\chi)}(b)$ is already in $\pi_{\sigma(T,\chi)}(\mathcal{C})$ when $a$ is inserted, and thus it must be that $\pi_{\sigma(T,\chi)}(a) \neq \pi_{\sigma(T,\chi)}(b)$.
Example 14.
We show how Algorithm 1 and the above lemma work using Example 5. Recall that we have the two tree decompositions $\mathcal{T}_1$ and $\mathcal{T}_2$, and hence there are four bag selectors $\sigma_1, \dots, \sigma_4$, one for each way of choosing one bag from each decomposition. To make the exposition simpler, suppose the instance produces only four output tuples:
and all are initially empty. After reading the first tuple, we can assign it to any selector, say . The second tuple is also added to ; however, we will not add it to because agree on the bag . Thus, . For , we will add it to , so . Finally, will not be added, since it can be generated via decomposition .
Equipped with the above two lemmas, we can now prove an upper bound on the size of the cover constructed via algorithm 1.
Theorem 15.
If $\mathcal{C}$ is the output of Algorithm 1, then $|\mathcal{C}| \leq |\Sigma| \cdot 2^{\mathsf{entw}_{\mathbf{T}}(Q, \mathsf{DC})}$.
Proof.
From Lemma 13, we have that $|\mathcal{C}| \leq |\Sigma| \cdot \max_{\sigma \in \Sigma} |\mathcal{C}_\sigma|$. Hence, there exists a set $S = \mathcal{C}_\sigma$ with size $|S| \geq |\mathcal{C}| / |\Sigma|$ such that for any two distinct tuples $a, b \in S$ it holds that $\pi_{\sigma(T,\chi)}(a) \neq \pi_{\sigma(T,\chi)}(b)$ for every $(T, \chi) \in \mathbf{T}$. The set $S$ satisfies all degree constraints, since it is a subset of $Q(\mathcal{D})$, which also satisfies all degree constraints. Now, take a probability distribution over $S$ that is uniformly random, that is, each tuple in $S$ is chosen with the same probability $1/|S|$. Let $h$ denote the entropic function of this distribution. Note that by construction, $h(\mathcal{V}) = \log |S|$. Moreover, because the projections of the tuples of $S$ on the bag $\sigma(T, \chi)$ are pairwise distinct, $h(\sigma(T, \chi)) = \log |S|$ for every $(T, \chi) \in \mathbf{T}$.
Now, consider any tree decomposition $(T, \chi) \in \mathbf{T}$; then, there exists a bag $\chi(t) = \sigma(T, \chi)$ such that:
$$\max_{t \in T} h(\chi(t)) \geq h(\sigma(T, \chi)) = \log |S|.$$
Taking the minimum over all tree decompositions in $\mathbf{T}$, we can write:
$$\min_{(T, \chi) \in \mathbf{T}} \max_{t \in T} h(\chi(t)) \geq \log |S|.$$
The bound in the statement follows from the fact that $h \in \bar{\Gamma}^*_n \cap \mathsf{H}_{\mathsf{DC}}$, and hence $\log |S| \leq \mathsf{entw}_{\mathbf{T}}(Q, \mathsf{DC})$.
Corollary 16.
Given a full CQ $Q$ with hypergraph $\mathcal{H}$ and a database instance $\mathcal{D}$ of size $N$, there exists a cover of $Q(\mathcal{D})$ w.r.t. $\mathsf{TD}(\mathcal{H})$ of size $O(N^{\mathsf{entw}(Q)})$.
Example 17.
Continuing our running example for the 4-cycle query, we know that for this query $\mathsf{entw}(Q) \leq 3/2$. Hence, Algorithm 1 constructs a cover that has size at most $O(N^{3/2})$; for this, it suffices to use the only two non-redundant tree decompositions $\mathcal{T}_1$ and $\mathcal{T}_2$.
Combining the above corollary with Theorem 8, we obtain that we can construct a data structure of size only $O(N^{\mathsf{entw}(Q)})$ such that we can provide a constant-delay enumeration guarantee for the output $Q(\mathcal{D})$. Note that the best-known such data structure has size $O(N^{\mathsf{subw}})$, where $\mathsf{subw}$ is the submodular width of the query. However, it holds that $\mathsf{entw}(Q) \leq \mathsf{subw}(Q)$ [14]. (It is not known whether there is some query where there is a gap between the two width measures.) Of course, we do not know whether we can construct this data structure in time $O(N^{\mathsf{entw}(Q)})$, since our algorithm needs time linear w.r.t. the output size, which can be asymptotically larger. However, as the next proposition shows, we can construct a (potentially larger) cover in time and size $\tilde{O}(N^{\mathsf{subw}})$ using the PANDA algorithm as a blackbox.
Proposition 18.
Given a full CQ $Q$ with hypergraph $\mathcal{H}$ and a database instance $\mathcal{D}$ of size $N$, there exists a cover of $Q(\mathcal{D})$ w.r.t. $\mathsf{TD}(\mathcal{H})$ of size $\tilde{O}(N^{\mathsf{subw}})$ that can be constructed in time $\tilde{O}(N^{\mathsf{subw}})$. (The $\tilde{O}$ notation allows for a polylogarithmic factor in the input size $N$.)
6 Lower Bounds for Covers
To prove the lower bound, we will relate the size of a cover to the size of a model of a disjunctive Datalog rule, following a similar argument as the one used in [9] to connect the size of a semiring circuit to disjunctive Datalog. Following the definition in [15], a disjunctive Datalog rule over a hypergraph $\mathcal{H} = (\mathcal{V}, \mathcal{E})$ is an expression of the form:
$$\bigvee_{B \in \mathcal{B}} A_B(\mathbf{x}_B) \;:\!-\; \bigwedge_{F \in \mathcal{E}} R_F(\mathbf{x}_F),$$
where $\mathcal{B} \subseteq 2^{\mathcal{V}}$. Each output relation $A_B$ in the head of the rule is called a target. Given an instance $\mathcal{D}$, a model of a disjunctive Datalog rule is a tuple of relation instances $(A_B)_{B \in \mathcal{B}}$ such that for any tuple $a$ over schema $\mathcal{V}$, if $\pi_F(a) \in R_F$ for all $F \in \mathcal{E}$, then there exists a target $A_B$ such that $\pi_B(a) \in A_B$. The size of a model is defined to be $\sum_{B \in \mathcal{B}} |A_B|$, and the output size of a disjunctive Datalog rule over an instance $\mathcal{D}$ is the minimum size over all models of the rule.
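The model condition can be read operationally: every tuple satisfying the body must be "caught" by at least one target. A minimal Python sketch (targets keyed by their schema, an encoding of our own):

```python
def is_model(output, targets):
    """Check the model condition of a disjunctive rule on the tuples
    satisfying the body.  `output`: iterable of dict-tuples over the full
    schema; `targets`: dict mapping an attribute tuple (a target's schema)
    to a set of tuples over those attributes."""
    return all(
        any(tuple(t[a] for a in schema) in rel for schema, rel in targets.items())
        for t in output)
```

The size of such a model is simply the sum of the target relation sizes, which is the quantity the lower bound argument below works with.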
Lemma 19.
Let $Q$ be a full CQ, $\mathcal{D}$ be an instance, and $\mathbf{T}$ be a finite set of tree decompositions of $\mathcal{H}$. Suppose $\mathcal{C}$ is a cover of $Q(\mathcal{D})$ w.r.t. $\mathbf{T}$, and let $\sigma$ be any bag selector for $\mathbf{T}$. Consider the disjunctive rule $\varphi_\sigma$:
$$\bigvee_{(T, \chi) \in \mathbf{T}} A_{\sigma(T,\chi)}(\mathbf{x}_{\sigma(T,\chi)}) \;:\!-\; \bigwedge_{F \in \mathcal{E}} R_F(\mathbf{x}_F).$$
Then, the output size of $\varphi_\sigma$ over $\mathcal{D}$ is at most $|\mathbf{T}| \cdot |\mathcal{C}|$.
Proof.
We will construct a model of $\varphi_\sigma$ from the cover $\mathcal{C}$ by taking $A_{\sigma(T,\chi)} = \pi_{\sigma(T,\chi)}(\mathcal{C})$ for every $(T, \chi) \in \mathbf{T}$. Clearly, the size of this model is at most $|\mathbf{T}| \cdot |\mathcal{C}|$. To show that it is a model for $\varphi_\sigma$, consider any tuple $a \in Q(\mathcal{D})$. Then, since $\mathcal{C}$ is a cover, there exists a tree decomposition $(T, \chi) \in \mathbf{T}$ such that $a \in \Join_{t \in T} \pi_{\chi(t)}(\mathcal{C})$. Since $\sigma$ is a bag selector, there exists $t \in T$ for which $\chi(t) = \sigma(T, \chi)$, and for this bag, $\pi_{\sigma(T,\chi)}(a) \in A_{\sigma(T,\chi)}$.
Example 20.
We continue our running example for the 4-cycle query with $\mathbf{T} = \{\mathcal{T}_1, \mathcal{T}_2\}$. Consider a bag selector where we choose one bag from decomposition $\mathcal{T}_1$ and one bag from decomposition $\mathcal{T}_2$. Then, the disjunctive Datalog rule whose head is the disjunction of these two target bags and whose body is the 4-cycle join is formed:
Lemma 19 tells us that any lower bound on the size of a model for this rule is also a lower bound on the size of any cover w.r.t. $\mathbf{T}$. Note that a rule with only one of the two targets does not transfer its lower bound, since the disjunction on the left-hand side needs to include all tree decompositions in $\mathbf{T}$.
Theorem 21.
Let $Q$ be a full CQ, and $\mathbf{T}$ be a finite set of tree decompositions of $\mathcal{H}$. Consider a set of degree constraints $\mathsf{DC}$. Then, for any $\epsilon > 0$ there exists a scale factor $k$ and an instance $\mathcal{D}$ that satisfies the constraints $\mathsf{DC}^k$ for which for every cover $\mathcal{C}$ of $Q(\mathcal{D})$ w.r.t. $\mathbf{T}$ we have: $|\mathcal{C}| = \Omega(2^{(1 - \epsilon)\, \mathsf{entw}_{\mathbf{T}}(Q, \mathsf{DC}^k)})$.
Proof.
We will use the lower bound construction for the output of a disjunctive rule [15] and follow the proof of Theorem 3.7 in [9]. Let $\mathcal{B}$ be the collection of the images of all bag selectors $\sigma \in \Sigma$. Then, we can bound the entropic width as follows:
For a bag selector $\sigma$, consider the disjunctive rule $\varphi_\sigma$ as constructed in Lemma 19. We know from Lemma 4.4 in [15] that for any $\epsilon > 0$ there exists an integer $k$ and an instance $\mathcal{D}$ satisfying the scaled-up constraints such that
Consider now the bag selector that maximizes the corresponding bound over all bag selectors in $\Sigma$, and let $\sigma$ be the bag selector that maximizes the right-hand side of the first inequality. Then,
To conclude the proof, observe that from Lemma 19 we have that for a cover $\mathcal{C}$, the output size of $\varphi_\sigma$ over $\mathcal{D}$ is at most $|\mathbf{T}| \cdot |\mathcal{C}|$.
This lower bound matches (asymptotically) the upper bound of the previous section. When the degree constraints correspond to a uniform cardinality constraint $N$ on each relation and $\mathbf{T}$ is the set of all non-redundant tree decompositions, then we obtain from Theorem 21 that for any $\epsilon > 0$ there exists an instance where any cover has size $\Omega(N^{\mathsf{entw}(Q) - \epsilon})$.
7 From Covers to Semiring Circuits
In this section, we will show how we can go from a cover of a query result to a semiring circuit. Before we present the main result, we need to introduce some further terminology.
A semiring is a structure $(\mathbf{S}, \oplus, \otimes, \mathbf{0}, \mathbf{1})$, where $\oplus$ is the addition and $\otimes$ is the multiplication, such that $(\mathbf{S}, \oplus, \mathbf{0})$ and $(\mathbf{S}, \otimes, \mathbf{1})$ are commutative monoids, multiplication is distributive over addition, and $\mathbf{0} \otimes e = \mathbf{0}$ for every element $e$. The semiring is also idempotent if $e \oplus e = e$ for every element $e$. An example of an idempotent semiring is the tropical semiring $(\mathbb{R}_{\geq 0} \cup \{\infty\}, \min, +, \infty, 0)$.
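For concreteness, a semiring can be packaged as a small structure carrying its two operations and neutral elements. The idempotence check below is only a spot check on sample elements, not a proof (Python sketch; all names are ours):

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    """A commutative semiring (S, plus, times, zero, one)."""
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any

    def is_idempotent_on(self, sample):
        """Spot-check e ⊕ e = e on a few sample elements."""
        return all(self.plus(e, e) == e for e in sample)

INF = float('inf')
tropical = Semiring(min, lambda a, b: a + b, INF, 0)            # idempotent
counting = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)  # not idempotent
```

The tropical semiring passes the idempotence check (min(e, e) = e), while the counting semiring fails it — the distinction that Theorem 22 below hinges on.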
Given a full CQ $Q$ with hypergraph $\mathcal{H} = (\mathcal{V}, \mathcal{E})$, an instance $\mathcal{D}$, and a semiring $\mathbf{S}$, we will add to each input tuple $a$ from relation $R_F$ an annotation $w(a)$: this annotation can be thought of as a variable that takes values from the domain of the semiring $\mathbf{S}$. For example, if $\mathbf{S}$ is the Boolean semiring $(\{\mathsf{true}, \mathsf{false}\}, \vee, \wedge, \mathsf{false}, \mathsf{true})$, the annotation captures the presence or absence of the tuple $a$. As another example, if $\mathbf{S}$ is the counting semiring $(\mathbb{N}, +, \times, 0, 1)$, the annotation captures the multiplicity of the tuple $a$. We can now define the sum-product polynomial as:
$$p_Q = \bigoplus_{a \in Q(\mathcal{D})} \; \bigotimes_{F \in \mathcal{E}} w(\pi_F(a)).$$
The sum-product polynomial captures different types of provenance of the query, depending on the semiring we use to interpret it [10]. When $\mathbf{S}$ is the Boolean semiring, the polynomial captures Boolean provenance, while if it is the counting semiring, it encodes how-provenance. Why-provenance can also be captured in this framework.
We can concisely represent the sum-product polynomial using circuits. For our purposes, we will define a circuit $C$ over $\mathbf{S}$ as a directed acyclic graph whose input nodes are the variables $w(a)$ and the constants $\mathbf{0}, \mathbf{1}$. Every other node is labelled by a semiring operation ($\oplus$ or $\otimes$) and has fan-in 2. An input gate of $C$ is any gate with fan-in 0, and the output gate of $C$ is the unique gate with fan-out 0. (In general, we could define a circuit to have multiple output gates, but for the purposes of this paper it suffices to consider circuits with a unique output gate.) The size of the circuit, denoted $|C|$, is the number of gates in the circuit. If we have a circuit that computes the polynomial $p_Q$, then we can evaluate $p_Q$ for any input values in time only $O(|C|)$ by evaluating the circuit in a bottom-up fashion.
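Bottom-up evaluation in time $O(|C|)$ can be sketched as follows, with the circuit encoded as a dict of gates (an encoding of our own; the paper only fixes fan-in 2 and a unique output gate):

```python
def eval_circuit(gates, output_gate, inputs, plus, times):
    """Evaluate a semiring circuit bottom-up, memoizing each gate once.

    `gates` maps a gate id to ('in', name), ('+', g1, g2) or ('*', g1, g2);
    `inputs` maps variable names to semiring values."""
    memo = {}
    def val(g):
        if g not in memo:
            node = gates[g]
            if node[0] == 'in':
                memo[g] = inputs[node[1]]
            else:
                op = plus if node[0] == '+' else times
                memo[g] = op(val(node[1]), val(node[2]))
        return memo[g]
    return val(output_gate)
```

The same circuit can be re-evaluated under different semirings by swapping in different `plus`/`times` operations, e.g. $(+, \times)$ for counting or $(\min, +)$ for the tropical semiring.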
Our main result in this section shows a tight connection between covers and circuits. In particular, we show that we can use a cover of size $|\mathcal{C}|$ to construct a semiring circuit of size $O(|\mathcal{C}|)$. Hence, small covers lead to small circuits.
Theorem 22.
Let $Q$ be a full CQ with hypergraph $\mathcal{H}$, $\mathcal{D}$ be an instance, and $\mathbf{T}$ be a finite set of tree decompositions of $\mathcal{H}$. Let $\mathcal{C}$ be a cover of $Q(\mathcal{D})$ w.r.t. $\mathbf{T}$. Then, in time $O(|\mathcal{C}|)$ we can construct a semiring circuit of size $O(|\mathcal{C}|)$ for the polynomial $p_Q$ under any idempotent semiring.
Proof.
For a decomposition $(T, \chi) \in \mathbf{T}$, we will first augment it with a mapping $\eta : \mathcal{E} \to T$ such that $F \subseteq \chi(\eta(F))$ for every hyperedge $F$; this mapping assigns each hyperedge of $\mathcal{H}$ to a node of the tree decomposition whose bag contains it. For a node $t \in T$ and a tuple $a$, we can now define a new variable
$$w_t(\pi_{\chi(t)}(a)) = \bigotimes_{F :\, \eta(F) = t} w(\pi_F(a)).$$
In other words, the new variable corresponds to taking the semiring product of the input annotations for the hyperedges we have assigned to the bag $\chi(t)$. If there is no hyperedge assigned to $t$, then $w_t(\pi_{\chi(t)}(a)) = \mathbf{1}$. Any variable $w_t(\cdot)$ can always be encoded via a constant-size subcircuit that computes a constant-size semiring product.
Using the definition of a cover and the idempotence of $\oplus$, we can now write:
$$p_Q = \bigoplus_{(T, \chi) \in \mathbf{T}} \; \bigoplus_{a \in \Join_{t \in T} \pi_{\chi(t)}(\mathcal{C})} \; \bigotimes_{t \in T} w_t(\pi_{\chi(t)}(a)).$$
At this point, we can see that it suffices to build a subcircuit for each tree decomposition and then sum their output gates to obtain the final output. Since $|\mathbf{T}|$ is data-independent, the number of these sum-gates is a constant. Hence, our problem reduces to constructing a circuit for the polynomial
$$p_{(T, \chi)} = \bigoplus_{a \in \Join_{t \in T} \pi_{\chi(t)}(\mathcal{C})} \; \bigotimes_{t \in T} w_t(\pi_{\chi(t)}(a)).$$
This corresponds to constructing a circuit for the polynomial of an acyclic query whose body joins the bag relations, where each input relation is constructed by computing the projection $\pi_{\chi(t)}(\mathcal{C})$ of the cover $\mathcal{C}$. Any such projection can be computed in linear time $O(|\mathcal{C}|)$. Additionally, from [9] (Theorem 4.1), we know how to construct this circuit in time linear w.r.t. the size of the input, which is $O(|\mathcal{C}|)$. Moreover, the size of the circuit is bounded by the input size, and thus is also $O(|\mathcal{C}|)$.
Remark 23.
The idempotence of the semiring is a critical requirement for the above theorem, since otherwise the proof breaks. Indeed, if the same output tuple is produced by multiple tree decompositions, its monomial is summed once per decomposition; without idempotence, we cannot correctly split $p_Q$ into per-decomposition polynomials as in the proof above. Thus, we cannot use a cover to construct a semiring circuit for, say, the counting semiring, which is not idempotent.
8 Conclusion
In this paper, we have shown that by generalizing covers to depend on multiple tree decompositions, we can construct covers of asymptotically smaller size. We have also provided matching worst-case lower bounds for general degree constraints.
Several research questions on covers remain open. A first question is how efficiently we can construct a small cover of size $O(N^{\mathsf{entw}})$. Our current construction takes time linear in the output size $|Q(\mathcal{D})|$, which can be as large as $O(N^{\rho^*})$. It would be interesting to explore whether it is possible to reduce this time down to $\tilde{O}(N^{\mathsf{entw}})$ using the PANDA algorithm as a blackbox.
A second direction is to consider algorithms that compute an instance-optimal cover (instead of a worst-case optimal one). This is likely a computationally hard problem, especially when we consider multiple tree decompositions; however, there might be cases where finding the smallest cover is tractable.
Finally, it is possible to consider a variant of a generalized cover where the outputs of the decompositions are pairwise disjoint. In other words, we want to be partitioned into such that each has a cover w.r.t. the decomposition . Such a “disjoint cover” would allow us to construct a semiring circuit for any semiring, even a non-idempotent one. The current greedy algorithm, however, does not provide this disjointness guarantee, so it is not clear how to construct disjoint covers of small size.
References
- [1] Antoine Amarilli and Florent Capelli. Tractable circuits in database theory. SIGMOD Rec., 53(2):6–20, 2024. doi:10.1145/3685980.3685982.
- [2] Albert Atserias, Martin Grohe, and Dániel Marx. Size bounds and query plans for relational joins. In FOCS, pages 739–748. IEEE Computer Society, 2008. doi:10.1109/FOCS.2008.43.
- [3] Guillaume Bagan, Arnaud Durand, and Etienne Grandjean. On acyclic conjunctive queries and constant delay enumeration. In CSL, volume 4646 of Lecture Notes in Computer Science, pages 208–222. Springer, 2007. doi:10.1007/978-3-540-74915-8_18.
- [4] Christoph Berkholz and Nicole Schweikardt. Constant delay enumeration with fpt-preprocessing for conjunctive queries of bounded submodular width. In MFCS, volume 138 of LIPIcs, pages 58:1–58:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPICS.MFCS.2019.58.
- [5] Nofar Carmeli and Markus Kröll. On the enumeration complexity of unions of conjunctive queries. ACM Trans. Database Syst., 46(2):5:1–5:41, 2021. doi:10.1145/3450263.
- [6] Shaleen Deep, Xiao Hu, and Paraschos Koutris. General space-time tradeoffs via relational queries. In WADS, volume 14079 of Lecture Notes in Computer Science, pages 309–325. Springer, 2023. doi:10.1007/978-3-031-38906-1_21.
- [7] Shaleen Deep and Paraschos Koutris. Compressed representations of conjunctive query results. In PODS, pages 307–322. ACM, 2018. doi:10.1145/3196959.3196979.
- [8] Arnaud Durand and Yann Strozecki. Enumeration complexity of logical query problems with second-order variables. In CSL, volume 12 of LIPIcs, pages 189–202. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2011. doi:10.4230/LIPICS.CSL.2011.189.
- [9] Austen Z. Fan, Paraschos Koutris, and Hangdong Zhao. Tight bounds of circuits for sum-product queries. Proc. ACM Manag. Data, 2(2):87, 2024. doi:10.1145/3651588.
- [10] Todd J. Green, Gregory Karvounarakis, and Val Tannen. Provenance semirings. In PODS, pages 31–40. ACM, 2007. doi:10.1145/1265530.1265535.
- [11] Ahmet Kara, Milos Nikolic, Dan Olteanu, and Haozhe Zhang. Trade-offs in static and dynamic evaluation of hierarchical queries. In PODS, pages 375–392. ACM, 2020. doi:10.1145/3375395.3387646.
- [12] Ahmet Kara, Milos Nikolic, Dan Olteanu, and Haozhe Zhang. Evaluation trade-offs for acyclic conjunctive queries. In CSL, volume 252 of LIPIcs, pages 29:1–29:20. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.CSL.2023.29.
- [13] Ahmet Kara and Dan Olteanu. Covers of query results. In ICDT, volume 98 of LIPIcs, pages 16:1–16:22. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2018. doi:10.4230/LIPICS.ICDT.2018.16.
- [14] Mahmoud Abo Khamis, Hung Q. Ngo, and Dan Suciu. What do Shannon-type inequalities, submodular width, and disjunctive datalog have to do with one another? In PODS, pages 429–444. ACM, 2017. doi:10.1145/3034786.3056105.
- [15] Mahmoud Abo Khamis, Hung Q. Ngo, and Dan Suciu. What do Shannon-type inequalities, submodular width, and disjunctive datalog have to do with one another?, 2023. arXiv:1612.02503.
- [16] Dan Olteanu and Jakub Závodný. Size bounds for factorised representations of query results. ACM Trans. Database Syst., 40(1):2:1–2:44, 2015. doi:10.1145/2656335.
- [17] Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing. In ESA, volume 2161 of Lecture Notes in Computer Science, pages 121–133. Springer, 2001. doi:10.1007/3-540-44676-1_10.
- [18] Maximilian Schleich, Dan Olteanu, and Radu Ciucanu. Learning linear regression models over factorized joins. In SIGMOD Conference, pages 3–18. ACM, 2016. doi:10.1145/2882903.2882939.
- [19] Hangdong Zhao, Shaleen Deep, and Paraschos Koutris. Space-time tradeoffs for conjunctive queries with access patterns. In PODS, pages 59–68. ACM, 2023. doi:10.1145/3584372.3588675.
A Appendix
We show here the proof of Proposition 18.
Proof.
Consider and for each , the disjunctive Datalog rule
From the proof of Proposition 7.13 in [15], we can construct in time a model of size using the PANDA algorithm.
Next, at each bag in every tree decomposition , we construct a table ( is its schema) as follows: (1) take the union over all output tables from every disjunctive Datalog rule defined by , if the bag selector selects this bag from ; and (2) semijoin-reduce the union by every input relation , to prune tuples that do not contribute to the output. It is easy to verify that is of size .
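Step (2), the semijoin reduction, can be sketched as follows (a minimal illustration; the dict-of-tuples relation encoding and the attribute names are assumptions, not the paper's notation). A bag tuple survives only if its projection onto the attributes shared with an input relation appears in that relation.

```python
# Sketch of a semijoin reduction: prune from the bag table every tuple
# whose projection onto the shared attributes has no match in the
# input relation. Relations are sets of tuples with explicit schemas.

def semijoin_reduce(bag, bag_schema, rel, rel_schema):
    """Keep only bag tuples that agree with some tuple of `rel` on the
    attributes common to both schemas."""
    shared = [a for a in bag_schema if a in rel_schema]
    # Project the input relation onto the shared attributes.
    proj = {tuple(t[rel_schema.index(a)] for a in shared) for t in rel}
    return {t for t in bag
            if tuple(t[bag_schema.index(a)] for a in shared) in proj}

# Bag table over (A, B, C); input relation R over (A, B).
bag = {(1, 10, 100), (2, 20, 200), (3, 30, 300)}
R = {(1, 10), (2, 99)}   # only (1, 10, 100) matches R on (A, B)
print(semijoin_reduce(bag, ['A', 'B', 'C'], R, ['A', 'B']))
# {(1, 10, 100)}
```

Each reduction is a single pass over the bag table with constant-time lookups, which is consistent with the linear-time bound claimed for this step.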
By Corollary 7.13 in [15], the query results can be written as the following union of CQs, where we have one CQ per tree decomposition:
Let . Note that this corresponds to computing an acyclic join over relations with sizes bounded by . We can now use the standard construction of a cover (Lemma 23 in [13]) to construct in time a cover of of size . Finally, to obtain the cover we simply take the union of all covers, i.e., .