Query Repairs
Abstract
We formalize and study the problem of repairing database queries based on user feedback in the form of a collection of labeled examples. We propose a framework based on the notion of a proximity pre-order, and we investigate and compare query repairs for conjunctive queries (CQs) using different such pre-orders. The proximity pre-orders we consider are based on query containment and on distance metrics for CQs.
Keywords and phrases:
Query Repairs, Databases, Conjunctive Queries, Data Examples, FittingFunding:
Balder ten Cate: Supported by EU Horizon 2020 Grant MSCA-101031081.Copyright and License:
![[Uncaptioned image]](x1.png)
2012 ACM Subject Classification:
Information systems Query languagesEditors:
Sudeepa Roy and Ahmet KaraSeries and Publisher:

1 Introduction
When querying a database, it may happen that the query result includes some undesired tuples and/or that some desired tuples are missing. In such cases, it is often necessary to adjust the query to ensure that the result aligns with expectations, i.e., it includes the desired tuples and omits the undesired ones.
Release | ||
---|---|---|
Babygirl | 2025 | DE |
Babygirl | 2025 | FR |
Nosferatu | 2025 | DE |
Nosferatu | 2024 | FR |
… |
Example 1.1.
Consider a database instance in Figure 1. A user, wanting to retrieve movies released in both Germany and France, issues the query . The query results include Babygirl but not Nosferatu. The user spots the latter as a missing answer, and wants to revise the query. A solution is to change the query to . A more radically different query such as would also account for the missing answer, but clearly fails to capture the user’s intention.
We propose a formalization of the above problem through the notion of query repairs, as follows. We assume that we are given a query and a set of labeled examples, by which we mean pairs with a database instance and a a tuple of values from the active domain of , labeled as positive or negative to indicate whether a is desired or undesired as an answer on input . In the above example, for instance, the input query is and there is a positively labeled example that fails to fit. A query repair, then, is a query that fits the given labeled examples and “differs from in a minimal way”. Different notions of query repair arise by using different means to formalize what it means for two queries to differ in a minimal way. Besides requiring that fits the labeled examples and differs minimally from , depending on the context, it may be natural to additionally require that or that . This leads to further refinements of the notion of a query repair, namely query generalization and query specialization, respectively, which we also investigate.
We propose a broad framework for defining what it means for two queries to differ in a minimal way, based on a proximity pre-order , i.e., a family of pre-orders (one for each query ), where asserts that query is at least as close to as . A query is then a -repair of a query if fits the given labeled examples and there is no query with the same property such that . We instantiate this framework for conjunctive queries (CQs), focussing mainly on two kinds of proximity pre-orders: the containment-of-difference proximity pre-order , based on query containment, and the edit-distance proximity pre-order , based on a distance metric between queries defined in terms of a suitably adapted version of edit distance. To be more precise, if for every instance , the symmetric difference of and is contained in the symmetric difference of and . Moreover, if the edit distance between the homomorphism core of and the homomorphism core of is no larger than the edit distance between the homomorphism core of and the homomorphism core of , modulo variable renaming.
Example 1.2 (Generalization).
Consider the CQ which returns all values that lie on a directed -cycle of length 4, and the instance that consists of the facts , i.e., is the directed -cycle of length 3. Clearly, . Let be the singleton set of examples consisting of labeled as a positive example. Which CQs qualify as repairs for or as generalizations for ? Note that since we only given positive examples, specializations to not seem to be a natural choice here.
It will turn out that there are two -repairs for : the CQ which expresses that lies on a directed -cycle of length 12 and ghe CQ which expresses that lies on a directed -cycle of length 3. Both of these are reasonable options. If we ask for -generalizations for , then only the first repair remains.
In contrast, there are precisely three - repairs of , each obtained from by dropping a different atom from the body. Also these are reasonable options. The same CQs are also the -generalizations for .
Example 1.3 (Specialization).
Consider the CQ , which returns all values that have an outgoing -path of length . Let be the set consisting of
-
a negative example with , and
-
a positive example with .
Which CQs qualify as specializations for ? In the same way in which generalizations are linked closely to the positive examples, specializations are linked closely to the negative examples. Note, however, that by itself the negative example in does not provide much guidance as to what would be a “good repair” as there are many possible options. The positive example gives (in this case, quite specific) additional guidance regarding the “direction” towards which we should look to find the repair.
It will turn out that there is precisely one -specialization, namely the very natural CQ . However, does not qualify as a -specialization, since the CQ also fits and is “closer” to in terms of query containment. In fact, as we will see, no -specialization for exists.
The problem of constructing a query that fits a given set of labeled data examples has been studied extensively and is known under different names such as reverse engineering, query learning, or fitting; see for instance [25] which offers a comparison of several fitting algorithms for CQs. Recently, in [9], extremal variants of the fitting problem for CQs were studied, including (weakly/strongly) most-general fitting and most-specific fitting. There, the input consists of a set of positive and negative examples and the task is to find a most specific CQ, or a most general CQ, that fits them. We can think of such extremal fitting problems as constrained versions of the fitting problem for CQs where an additional requirement is put on the output query. In the same spirit, the query repair problem can also be viewed as a constrained version of fitting where the input now includes, in addition, a CQ , and the output is required to be a fitting CQ that differs minimally from .111For the trivial proximity pre-order relating every CQ to every CQ, the query repair problem coincides with the fitting problem. As a part of our contributions, we will establish close relationships between query repair problems and extremal fitting problems.
Overview of contributions.
In Sect. 3, we formally define -query repairs, as well as -generalizations and -specializations, based on a given proximity pre-order . We also propose, for each of these, three algorithmic problems: verification, existence and construction. The remaining sections focus specifically on CQs.
In Sect. 4, we study the containment-based proximity pre-order . Besides examples of the resulting notions of generalization, specialization, and repair, our results, here, include:
-
(a)
structural characterizations that relate -generalizations and -specializations to most-specific fittings and most-general fittings, respectively (Theorems 4.4, 4.11, 4.12). These characterizations imply that there is always a unique -generalization (unless no suitable fitting CQ exists) while -specializations do not always exist.
-
(b)
based on this, results that identify the computational complexity of the verification, existence, and construction of -generalizations and -specializations.
-
(c)
results that relate -repairs to -specializations and -generalizations, allowing us to apply some of the above algorithmic results to the more general case of -repairs. However, we also illustrate that the behaviour of -repairs is often counterintuitive. For instance, -repairs need not exist and also there can be infinitely many -repairs. In contrast to -generalizations and -specializations, -repairs thus do not seem to be very natural.
In Sect. 5, we study proximity pre-orders based on distance metrics. In particular, we propose a distance metric for CQs based on edit distance that gives rise to a proximity pre-order . We show that there is always a non-empty and finite set of -repairs (respectively, -generalizations, and -specializations), unless the given examples do not admit a fitting CQ. Moreover, we shed light on the complexity of the construction and verification problems (Thm. 5.15). We also show that other, seemingly natural distance metric lead to repair notions that behave worse.
Outline.
Sect. 2 contains technical preliminaries. In Sect. 3, we define query repairs. In Sect. 4, we explore containment-based query repairs. In Sect. 5, we explore query repairs based on distance metrics. We conclude in Sect. 6 with a discussion of future directions.
Due to lack of space, most proofs are omitted. They can be found in the full version [27].
Related work.
Our notion of query repairs is in part inspired by the literature on database repairs introduced in [2]. There, one is given a database that is inconsistent in the sense that it violates one or more integrity constraints and the aim is to answer a given query over all possible repairs of , that, is, all databases consistent with the integrity constraints that “differ from in a minimal way”. Different notions of repairs, including set-based repairs and cardinality repairs, arise by formalizing in different ways what it means for two databases to “differ in a minimal way”. Research in this area has been rather active and fruitful [5].
There is extensive literature on approximating a query by some other query such that is contained in or is contained in . In the former case is often called an upper approximation or an upper envelope of , while in the latter case it is called a lower approximation, a lower envelope, or a relaxation of . For instance, [23] proposes an algorithm for relaxing the where clause of an over-constrained database query that returns an empty result. Naturally, one is interested in optimal (with respect to containment) such approximations, which are known as tight upper or lower envelopes. Lipski [17] studied upper and lower approximations in the context of databases with incomplete information, while Libkin [22] carried out a study of formal models of approximation in databases. A related body of work focused on the problem of using approximation to achieve more efficient query evaluation. In particular, approximations of Datalog queries by CQs or unions of CQs were investigated in [11, 12]. More recently, approximations of CQs by CQs of tractable combined complexity (such as acyclic CQs or CQs of bounded treewidth) were studied in [3, 4]. In a different, yet related direction, tight lower envelopes were used in the area of answering queries using views [14, 18], where such envelopes approximate a perfect rewriting. Upper and lower envelopes were also used as tractable approximations of the answers to ontology-mediated queries, both over consistent databases [16] and over inconsistent ones [6].
The literature on approximations summarized above is based on the notion of containment of one query to another. Notions of “closeness” or “similarity” of queries that are not based on containment have also been investigated. For example, a notion of closeness based on suitable combinations of precision and recall was used to study the problem of translating a query over some schema to a semantically similar query over a different schema [10]. Furthermore, a notion of semantic similarity of queries based on available query logs was explored in [7].
In the area of belief revision, a number of proposals have been made for model-based revision and update operators, in which a knowledge base is viewed semantically as a set of possible worlds (where a world is a propositional truth assignment), and update/revision is performed on sets of possible worlds. Various concrete update and revision operators have been proposed based on different notions of relative proximity for possible worlds, including using Hamming distance [13, 15] and containment-of-difference [24, 29].
In software engineering, automated program repair techniques seek to aid developers by suggesting likely correct patches for software bugs. They take as input a program and a specification of correctness criteria that the fixed program should meet. Most techniques assume that the correctness criteria are given by means of a test suite: one or more failing tests indicate a bug to be fixed, while passing tests indicate behavior that should not change. The desired output is a set of program changes that leads all tests to pass. See [21] for an overview.
2 Preliminaries
As usual, a schema is a set of relation symbols, each with associated arity. A database instance over is a finite set of facts of the form where is a relation symbol of arity and are values. We use to denote the set of all values used in . We can then view a query over a schema , semantically, as a function that maps each database instance over to a set of -tuples , where is the arity of the query. A query of arity zero is called a Boolean query. We write and say that is contained in if for all database instances . Two queries and are equivalent, written , if and .
A data example for a -ary query consists of a database instance together with a -tuple of values. We denote by the set of all data examples for which it holds that . A labeled example is a data example that is labeled as positive or as negative. By a collection of labeled examples we mean a pair , where and are sets of examples. Here, the data examples in are considered as positive examples, and the data examples in are considered as negative examples. A query fits if for each and for each . In other words, fits if and . Here, we assume that has the same arity as the data examples in and . We will often abuse notation and write that fits (or that fits ), meaning that fits (respectively, fits ).
We will be focusing specifically on conjunctive queries. By a -ary conjunctive query (CQ) over a schema , we mean an expression of the form where is a sequence of variables and each is a relational atom that uses a relation symbol from and no constants. Note: the restriction to queries without constants is not essential for our results (cf. [25, Remark 2.3]) but simplifies the presentation.
The variables in x are called answer variables and the other variables used in the atoms are the existential variables. Each answer variable is required to occur in at least one atom , a requirement known as the safety condition. For CQs of the same arity, we denote their conjunction by (where, for instance, the conjunction of and is ). With the size of a CQ, denoted , we mean the number of atoms in it. The query output is defined as usual, cf. any standard database textbook.
Every CQ has a canonical example , namely the data example , where is the database instance (over the same schema as ) whose active domain consists of the variables in and whose facts are the atomic formulas in .
Given data examples and over the same schema and with the same number of distinguished elements, a homomorphism is a map from to that preserves all facts and such that . When such a homomorphism exists, we say that “homomorphically maps to” and write . We say that and are homomorphically equivalent if and . It then holds that iff . Furthermore, the well-known Chandra-Merlin theorem states that holds iff .
A data example is said to be a core if every homomorphism is surjective. It is well known that for every data example there is a subinstance such that is a core and such that and are homomorphically equivalent. Moreover, such is unique up to isomorphism, and may be referred to as the core of , denoted . We say that a CQ is a core if its canonical example is a core.
The direct product of two database instances, denoted , is the database instance containing all facts over the domain such that is a fact of and is a fact of . This naturally extends to data examples: .
3 Query Repairs
Fix a query language . A proximity pre-order for is a family of pre-orders , one for every , satisfying the following conditions:
- Conservativeness
-
For all , .
- Syntax independence
-
Whenever , and , then iff .
Let , and let a collection of labeled examples (of the same arity as ). We call the pair an annotated -query. The following are the 3 main notions studied in this paper.
-
A -repair for is a query such that (i) fits , and (ii) there is no with that satisfies (i).
-
A -generalization for is a query such that (i) fits and and (ii) there is no with that satisfies (i).
-
A -specialization for is a query such that (i) fits and , and (ii) there is no with that satisfies (i).
(where is short for and ).
Note: under this definition, -specializations and -generalizations need not be -repairs.
Conservativeness and syntax-independence are minimal conditions on needed to yield intuitive behavior for query repairs: conservativeness ensures that if the input query already fits the given examples, then is its own -repair, while syntax independence ensures that equivalent queries have equivalent -repairs.
In the rest of this paper, we will restrict attention to CQs. That is, is the class of CQs.
Remark 3.1.
We will restrict our attention to repairs, generalizations, and specializations that use only the relation symbols which occur in . Note that when consists only of negative examples, a query repair or specialization could in principle contain relation symbols that do not occur in . We will disregard such repairs. It is not difficult, however, to adapt our results to the case where such additional symbols would be admitted.
Several algorithmic problems arise, all parameterized with a proximity pre-order .
-Repair-Verification Input: An annotated CQ and a CQ Output: Yes if is a -repair for , No otherwise
-Repair-Existence Input: an annotated CQ Output: Yes if there is a -repair for , No otherwise
-Repair-Construction Input: an annotated CQ for which a -repair exists Output: a -repair for
We will also study the analogous algorithmic problems for -generalization and -specialization, defined in the expected way.
In all of the above problems, the input queries and examples are assumed to be compatible in terms of their arity. Moreover, in our complexity analyses, we will assume a fixed (constant) query arity . This is in fact only necessary for some of the upper bounds in Sect. 4.2.
4 Containment-Based Approach
In this section, we study notions of query generalization, query specialization and query repair defined based on query containment. For generalization and specialization, it seems intuitively clear what the definition should be. Let be an annotated CQ. Then
-
a containment-based generalization for is a CQ that fits and such that and there is no CQ that fits with .
-
a containment-based specialization for is a CQ that fits and such that and there is no CQ that fits with .
It is less immediately clear what the right query containment-based definition of query repairs should be. As it turns out, the above notions of query generalization and query specialization can be viewed as query generalizations and query specializations with respect to the following natural proximity pre-order (cf. also [24, 29, 4]).
Definition 4.1 (Containment of Difference).
For queries , we write if , where denotes symmetric difference.
Recall that denotes the set of all positive examples of a query . Therefore, denotes the set of all examples on which disagrees with . Thus, means that the set of examples on which and disagree is a subset of the set of examples on which and disagree (cod stands for “containment of difference”). It is easy to see that is indeed a proximity pre-order. Moreover, it gives rise to the intended containment-based notions of query generalization and specialization:
Proposition 4.2.
For all annotated CQs and CQs ,
-
1.
is a -generalization for iff is a containment-based generalization for
-
2.
is a -specialization for iff is a containment-based specialization for .
It also follows that -generalizations and -specializations are -repairs. This furthermore suggests -repairs as a (seemingly) natural containment-based notion of query repair. Next, we will study -generalizations, -specializations, and -repairs. Our main findings can be summarized as follows: -generalizations and -specializations are well-behaved notions, although the latter do not always exist (Example 1.3) and can be too plentiful (Example 4.14). The associated existence, verification, and construction problems admit effective algorithms, although often of super-polynomial complexity. The more general -repairs exhibits counter-intuitive behavior.
4.1 Containment-Based Query Generalizations
The following example illustrates -generalizations.
Example 4.3.
Consider the schema consisting of unary relations . Let and let be the instance that consists of the facts . There is exactly one -generalization for , where is the set of positive examples , namely . This is, in fact, also the only -repair.
The next result show that there is a precise, two-way correspondence between -generalizations and most-specific fitting CQs. A most-specific fitting CQ for a collection of labeled examples is a fitting CQ such that for every fitting CQ , [9].
Theorem 4.4.
For all CQs and collections of labeled examples ,
-
1.
is a -generalization for iff is a most-specific fitting CQ for .
-
2.
is a most-specific fitting CQ for iff is a -generalization for ,
where denotes the maximally-constrained CQ over the relevant schema and of the relevant arity, i.e., the CQ .
As a consequence of this, we can leverage known results about most-specific fitting CQs. For instance, it is known that, for every collection of labeled examples , there is at most one most-specific fitting CQ up to equivalence, namely the CQ whose canonical example is the direct product of the positive examples in . This implies:
Corollary 4.5.
Let be an annotated CQ for which a fitting CQ with exists. Then there is, up to equivalence, exactly one -generalization for .
Example 4.6 (Example 1.2 revisited).
Using Thm. 4.4, we can verify the claim, in Example 1.2, that the CQ expressing “ lies on a directed -cycle of length 12” is the unique (up to equivalence) -generalization for . This is true because the canonical example of is homomorphically equivalent to the direct product of the positive example and the canonical example of the input CQ.
The complexity of various algorithmic problems pertaining to most-specific fitting CQs, as well as size bounds, were studied in [9]. From these, we obtain complexity results and size bounds for -generalizations. We include here also an analysis for the case where the input consists of positive examples only, which is particularly natural in the case of query-generalizations: it follows from Thm. 4.4 and what is said below it that negative examples have no effect on generalizations except that they may render them non-existent.
Corollary 4.7.
-
1.
-generalization-existence is coNExpTime-complete. For inputs consisting of a bounded number of examples, it is coNP-complete. For inputs consisting of positive examples only, it is in PTime.
-
2.
-generalization-verification is NExpTime-complete, even for inputs consisting of positive examples only. It is DP-complete for a bounded number of input examples.
-
3.
-generalization-construction is in ExpTime (and in PTime if the number of examples is bounded).
-
4.
Let . There is a sequence of examples of size polynomial in , such that (i) there is a -generalization for , where , and (ii) every -generalization for has size at least .
Remark 4.8.
Corollary 4.7(4) implies that the size of -repairs is in general exponential in the number of positive examples, and also that the size of -repairs for cannot be bounded by any function in the size of and the size of the smallest fitting CQ for .
4.2 Containment-Based Query Specializations
The following example illustrate -specializations.
Example 4.9.
Consider the CQ and let consist of
-
a negative example where consists only of the fact , and
-
a positive example where extends with the additional facts and .
There are two -repairs for , namely and . It can be shown with the help of Thm. 4.11 below that these are the only two -repairs.
An annotated CQ may lack a -specialization even when a fitting CQ exists:
Example 4.10.
Consider the Boolean CQ , and let consist of
-
a negative example consisting of facts , and
-
a positive example consisting of the fact .
The positive example, here, is strictly speaking redundant: every CQ over the relevant schema fits it. It is included only for intuition. There are CQs with that fit (for instance, is such a query), but there does not exist a -specialization for . This can be seen as follows: for every CQ that fits , by construction, the canonical example is a non-2-colorable graph. By a well-known result in graph theory, must then contain a cycle of odd length. By blowing up the length of this cycle (e.g. using the sparse incomparability lemma [20]), one can construct a fitting CQ such that .
In the previous subsection, we saw that -generalizations are closely related to most-specific fitting CQs. Similarly, -specializations are closely related to weakly most-general fitting CQs, where a weakly most-general fitting CQ for a collection of labeled examples is a fitting such that for every fitting CQ , implies [9].
Theorem 4.11.
Let be any annotated CQ with , such that has no repeated answer variables. Then, for all CQs , the following are equivalent:
-
1.
is a -specialization for ,
-
2.
fits and is equivalent to for some that is a weakly most-general fitting CQ for , where .
Moreover, in the direction from 1 to 2, can be chosen so that .
Again, there is also a converse reduction, but it is more cumbersome to state because for , there does not exist a -ary “most-general” CQ that is contained in all -ary CQs (due to the safety condition in the definition of CQs). Instead, we need to consider all CQs whose body only contains, for each one atom of the form , where is any relation symbol and are tuples of distinct fresh existential variables. We call any such CQ a minimally-constrained CQ. Note that, for any schema and arity, there are finitely many minimally-constrained CQs (up to renaming of variables).
Theorem 4.12.
For every CQ and collection of labeled examples, the following are equivalent:
-
1.
is a weakly most-general fitting CQ for ,
-
2.
is a -specialization for for all minimally-constrained CQs with .
For a given CQ , the set of all minimally-constrained CQs with can easily be constructed in in polynomial time (for fixed query arity). This implies that the above proposition can be viewed as a polynomial-time (Turing) reduction.
Example 4.13 (Example 1.3 revisited).
Let us revisit Example 1.3 from the introduction. There, we claimed that there is no -specialization for . In light of Thm. 4.11, it suffices to argue that there is no weakly most-general fitting CQ for . We will not give a formal proof here, it suffices to consider CQs that describe an oriented -paths of the form for increasing values of . The resulting infinite sequence of CQs (each of which fits ) can be used to disprove the existence of a weakly most-general fitting CQ, and hence of a -specialization for .
Weakly most-general fitting CQs were studied in depth in [9]. Based on results from [9] and the above reductions, we obtain a number of results. In particular, the following example shows that an annotated CQ may have infinitely many non-equivalent -specializations.
Example 4.14.
Consider the CQ and let consist of
-
a positive example consisting of the facts , and
-
a negative example consisting of the facts .
Note that the positive example is strictly speaking redundant as it belongs to for all CQs . It is added only for intuition. It follows from results in [9] that there are infinitely many non-equivalent weakly most-general fitting CQs for . Indeed, for every , the CQ is a weakly most-general fitting CQ for . It follows by Thm. 4.11 that there are infinitely many -specializations for .
We also obtain a number of complexity results.
Corollary 4.15.
-
1.
-specialization-existence is ExpTime-complete.
-
2.
-specialization-verification is in and is DP-hard already for a bounded number of input examples.
-
3.
-specialization-construction is in 2ExpTime.
-
4.
There is a sequence , where and are of total size polynomial in , such that (i) there is a -specialization for , and (ii) every -specialization for is of size at least .
We do not know what happens in case 1 and 3 if the number of input examples is bounded.
4.3 Containment-Based Query Repairs
We now move to the general setting of -repairs for an annotated CQ where we no longer require that or . Our first result on -repairs provides a reduction to the case with only positive examples or only negative examples.
Proposition 4.16.
Let be an annotated CQ with . Then a CQ is a -repair for if and only if one of following holds:
-
(a)
is a -repair for and fits , or
-
(b)
is a -repair for and fits .
where if and otherwise.
This is promising, as one might hope that case (a) above could be reduced to a statement about -generalizations, since we are repairing w.r.t. a set of positive examples, and likewise for case (b) and -specializations. For case (b) this approach indeed works:
Proposition 4.17.
For all annotated CQs , if consists of negative examples only, then every -repair for is a -specialization for .
The same, however, does not hold for case (a), as -repairs w.r.t. positive examples are not necessarily -generalizations:
Example 4.18.
Consider a schema consisting of unary relations . Let , and let be the instance consisting of the facts . There are, up to equivalence, two -repairs for where , namely the queries and . Of these, only the first is a -generalization. It seems counter-intuitive that is a -repair, since is, intuitively, closer to . However, there are instances on which agrees with but does not. An example is the instance consisting only of the fact .
The following result characterizes the -repairs for a CQ and set of positive examples.
Theorem 4.19.
Let be an annotated CQ where consists only of positive examples. Then a CQ is a -repair for if and only if one of the following holds:
-
1.
fits and is equivalent to , or
-
2.
does not fit , fits , and .
In case of Boolean CQs, item 2 can be replaced by
-
2’.
does not fit , and (cf. Figure 2).
Example 4.20 (Example 1.2 revisited).
Using Thm. 4.19, one can easily verify the claim that the CQ expressing “ lies on a directed -cycle of length 3” is a -repair for . The same holds (it can be shown, with some more work) for the CQ expressing “ lies on a directed -cycle of length 6”, whose canonical example lies homomorphically in-between (the cycle of length 3) and (the cycle of length 12).
Using Thm. 4.19, we can show that there can be infinitely many -repairs for a CQ and set of positive examples.
Example 4.21.
Over a schema consisting of a unary relation and a binary relation , consider the Boolean CQ and the set of positive examples , where consists of the fact . It follows from Thm. 4.19 that every CQ that fits is a -repair. There are infinitely many pairwise non-equivalent such CQs.
The next result shows another connection between -repairs and -generalizations:
Proposition 4.22.
The -generalizations for an annotated CQ are precisely the -repairs for , where extends with the positive example .
Example 4.18 and Example 4.21 show that -repairs can behave counterintuitively. Various algorithmic results regarding the verification, existence and construction of -repairs can be derived from the above characterizations and reductions, but we refrain from stating them here as they seem of little value given the problematic behaviour of -repairs.
5 Query Repairs Based on Distance Metrics
In this section, we study proximity pre-orders based on distance metrics. In particular, we propose and study a variant of edit distance for CQs. We also study proximity pre-orders based on several other natural distance metrics. Our main findings can be summarized as follows: edit distance, suitably defined, yields a proximity pre-order that avoids some of the problems of . We also show that other natural distance metrics induce proximity pre-orders that are less well behaved.
Definition 5.1 (Semantic distance metric).
A semantic distance metric for CQs is a function from pairs of CQs (of the same arity) to non-negative real numbers, satisfying:
-
1.
,
-
2.
iff and are equivalent,
-
3.
(the triangle inequality).
If all the conditions are met except for the only-if direction of 2, we say that is a weak semantic distance metric for CQs.
One can think of a semantic distance metric for CQs as a distance metric (in the standard sense) on the equivalence classes of CQs. Every semantic distance metric, and in fact every weak semantic distance metric, induces a pre-order.
Definition 5.2 (Pre-order induced by a semantic distance metric).
Let be a weak semantic distance metric for CQs. We define as follows: iff .
Proposition 5.3.
Let be a weak semantic distance metric for CQs. Then is a proximity pre-order.
We study -repairs for several distance metrics. Besides the algorithmic problems of repair existence, verification and construction, we also consider the following natural fitting problem that is closely related to query repairs based on distance metrics:
-bounded fitting existence Input: an annotated CQ and a distance bound . Output: Yes if there is a CQ that fits such that , No otherwise
5.1 Edit Distance
A naive definition of the edit distance of two CQs would be the number of atoms that appear in one of the two CQs but not in the other, that is, . It is easy to see that this is not a semantic distance metric: it is not invariant under logical equivalence, because simple syntactic changes such as renaming a quantified variable, which do not affect the semantics of the query, affect the edit distance. This can be fixed, however, by (i) taking homomorphism cores (i.e., minimizing the CQs), and (ii) working modulo bijective variable renamings.
This leads to the following definition. For simplicity, in this section we restrict attention to CQs whose sequence of answer variables is repetition-free (a restriction that could be lifted at the expense of a more intricate definition of edit distance, cf. Remark 5.16).
Definition 5.4 (Edit distance for CQs).
Given CQs and ,
where denotes the set of facts occurring in example and not in or vice versa.
Example 5.5.
Consider the Boolean CQs
Note that is not a core, but is. The core of is obtained by dropping the second and fourth atom. Thus . Thus, perhaps surprisingly, can be larger than the naive edit distance of and (which, in this case, is 2).
Proposition 5.6.
edit-dist is a semantic distance metric.
We next take a look at the complexity of computing edit distance.
Proposition 5.7.
Testing whether (on input ) is DP-hard and in . When restricted to input queries that are cores, it is NP-complete.
We now move on to studying the proximity pre-order . This pre-order has some useful structural properties. Let us say that a proximity pre-order is well-founded if for each CQ , every non-empty set of CQs has a -minimal element (i.e., there are no infinite descending chains ); and has the finite-basis property if for each CQ , every set of CQs has only finitely many -minimal elements, up to equivalence.
Proposition 5.8.
is well-founded and has the finite-basis property.
As an immediate consequence, we obtain:
Theorem 5.9.
Let be an annotated CQ.
-
1.
If there is any CQ that fits , then there is a -repair for .
-
2.
If there is any CQ that fits with , there is a -generalization for .
-
3.
If there is any CQ that fits with , there is a -specialization for .
Moreover, there are at most finitely many -repairs, -generalizations, and -specializations for , up to equivalence.
We now compare -repairs with -repairs.
Example 5.10.
This example serves to compare -generalizations with -generalizations. Consider the following Boolean CQ and positive example:
Let . There are four -generalizations for , namely
In contrast, there is a unique (up to equivalence) -generalization for , namely
A variation of this example shows that (i) there can be exponentially more -repairs than -repairs, and (ii) -repairs can be exponentially longer than -repairs.
Example 5.11.
Consider again Example 4.10 which shows that -specializations are not guaranteed to exist. There is a unique -specialization, namely .
Example 5.12.
In Example 4.21, where -repairs showed degenerative behavior, there exists a unique -repair, namely the (intuitively expected) query .
A -repair w.r.t. positive examples is not necessarily a -generalization:
Example 5.13.
Consider the following Boolean CQs and example:
Let with and . Then is the unique -repair for (having edit distance 2), but it is not a -generalization. On the other hand, is a -generalization for but not a -repair as it has edit distance 3 (due to the fact that it is not a core). Similarly, it can be shown that a -repair with respect to negative examples is not necessarily a -specialization (cf. the full version of this paper).
The following upper bound on the size of -repairs is implied by the definitions.
Proposition 5.14.
Let be an annotated CQ and a core CQ. If is an -repair for , then , where is the size of the smallest fitting CQ for . The same holds for -generalizations and for -specializations, where is then the size of the smallest fitting CQ that satisfies , respectively .
Prop. 5.14 stands in stark contrast with Remark 4.8 for -repairs. We remark that, while the smallest fitting CQ may be exponential in the size of the input examples [28], one may expect it to be typically much smaller in practice.
We now consider algorithmic problems for -repairs. By Thm. 5.9, the -repair-existence problem coincides with the fitting existence problem. In particular, the existence of a -repair for does not depend on . The verification and construction problems for -repairs are more interesting and do depend on . Of course, the existence of -generalizations and -specializations depends on as well.
Theorem 5.15.
-
1.
edit-dist-bounded-fitting-existence is -complete (provided the distance bound is given in unary). It is in NP if the input CQ is core and only positive examples are given.
-
2.
-repair-existence is coNExpTime-complete. It is coNP-complete for inputs that consist of a bounded number of examples. If the input contains only positive examples, or only negative examples, it is in PTime.
-
3.
-repair-verification is -hard and in .
Items 2 and 3 also hold if “repair” are replaced by “generalization” or “specialization”, except for the case of -generalization-existence with a bounded number of examples, where we only have a DP upper bound.
Finally, let us discuss -repair-construction. It follows from known results [8] that whenever a fitting CQ exists, there is one of size at most . A brute-force algorithm for -repair-construction simply enumerates CQs in the order of increasing size and, for each, checks if it is a -repair (cf. Thm. 5.15(2)). Since we are promised that an -repair exists, this process terminates and, by Prop. 5.14, yields a CQ of size at most . We do not know of an algorithm for constructing -repairs with asymptotically better running time, see Sect. 6 for further discussion.
Remark 5.16.
One can further refine edit distance by requiring that all equalities between variables in the body of the query are represented explicitly by means of equality atoms, and by counting these when computing the symmetric difference. For example, consider the CQs
Under Def. 5.4, , while under the more refined definition of edit distance (where we treat as shorthand for ), . We omit the details, but we believe that the above results continue to hold under such a more refined notion of edit distance.
5.2 Other Distance Metrics
Distance as size of the smallest distinguishing instance.
The next distance metric we consider is based on the smallest instance on which the two queries produce different answers.
Definition 5.17.
For CQs and , , where is the size of the smallest instance (measured by the number of facts) such that , or 0 if no such instance exists.
Proposition 5.18.
sdi-dist is a semantic distance metric, and in fact an ultrametric (i.e., it satisfies ).
Example 5.19.
Consider the CQs and . An instance that satisfies either isomorphically embeds a clique of size or else contains a “reflexive” fact of the form . Therefore, every example distinguishing from must contain at least facts. It follows that .
Two further relevant basic facts about sdi-dist are the following:
Proposition 5.20.
For all CQs , or .
Proposition 5.21.
Computing sdi-dist is NP-hard. More precisely, testing (on input CQs and natural number in unary) is NP-hard and is in .
We will now move on to study the pre-order . The next example shows that is not a good pre-order for query repairs, since it is not sufficiently discriminative.
Example 5.22.
Let , and let consist of
-
the negative example where , and
-
the positive example where .
The positive example is strictly speaking redundant: every CQ over the relevant schema fits it. It is added only for intuition. There are infinitely many pairwise non-equivalent CQs that fit with . Furthermore, by construction, every fitting CQ disagrees with on , and hence has . It follows that all infinitely-many fitting CQs are -repairs for , and there are infinitely many -specializations for as well.
A similar situation arises for -generalizations: let and let consist of the positive example where . There are infinitely many CQs (up to equivalence) that fit and they all disagree with on the single-fact instance .
This shows that there are annotated CQs with infinitely many -repairs (as well as -specializations and -generalizations). On the flip side, we have:
Proposition 5.23.
Let be an annotated CQ.
-
1.
If there is any CQ that fits , then there is a -repair for .
-
2.
If there is any CQ that fits with , there is a -generalization for .
-
3.
If there is any CQ that fits with , there is a -specialization for .
We omit a complexity-theoretic analysis of -repairs in light of Example 5.22.
Remark 5.24.
Dually to sdi-dist one can also consider the distance metric sdq-dist defined as the size of the smallest distinguishing query. More precisely, is the size of the smallest CQ (as measured by the number of atoms) that maps homomorphically to precisely one of . Unfortunately, sdq-dist fares no better than sdi-dist. For instance, for the annotated CQ from Example 5.22, all CQs that fit again have .
Distance as probability of disagreement.
Let be a discrete probability distribution over the space of all examples (an example distribution, for short). For instance, may be a uniform distribution over values in some pre-existing (unlabeled) database instance. We define . That is, is the probability of drawing an example on which and disagree. The same works for non-discrete probability distributions, as long as is measurable for each CQ . We restrict to discrete distributions for simplicity. It is easy to see that, for all example distributions , is a weak semantic distance metric.
Proposition 5.25.
can be computed in (for example distributions with finite support, specified as part of the input). Testing is -complete.
For the sake of readability, we will denote the proximity pre-order by .
Proposition 5.26.
Let be any example distribution. If has finite support, is well-founded. The same does not necessarily hold when has infinite support.
As a consequence, we obtain:
Proposition 5.27.
Let be an annotated CQ and an example distribution with finite support.
-
1.
If there is any CQ that fits , then there is a -repair for .
-
2.
If there is any CQ that fits with , there is a -generalization for .
-
3.
If there is any CQ that fits with , there is a -specialization for .
On the other hand, there can be infinitely many -repairs. Indeed, it is not difficult to show the following using a pigeon-hole argument:
Proposition 5.28.
Let be any example distribution with finite support. There are annotated CQs for which there are infinitely many -repairs, up to equivalence.
In light of this, we omit a complexity-theoretic analysis for -repairs.
6 Discussion
We proposed and studied notions of -generalizations, -specializations, and -repairs, parameterized by a proximity pre-order , providing a principled framework for example-driven query debugging and refinement (an idea partially inspired by the interactive schema-mapping design tool EIRINI [1]). We explored two ways to obtain a proximity pre-order for CQs: containment-based and edit distance-based. In each case, we assessed the behavior of the obtained repair notions through examples, and we studied the existence, verification and construction problems, as well as the size of repairs. Other algorithmic problems may be considered in followup work, such as repair enumeration (cf. [19]) and computing “possible” or “certain” answers across all repairs of a given CQ (as in [4] for query approximations).
Based on our findings, -generalizations and -refinements are reasonbly well-behaved (although the latter do not always exist) while unconstrained -repairs are not; -repairs behave favorably: they exist whenever a fitting CQ exists; there are always only finitely many repairs up to equivalence; and the main algorithmic tasks are decidable with typically lower complexity than for -repairs; in addition, -repairs tend to be of smaller size (Prop. 5.14). Although -repairs, too, in some cases display surprising behavior (Example 5.13), this may be an unavoidable consequence of our syntax-independence requirement taken together with the inherently syntactic nature of edit distance.
An important outstanding issue with -repairs is to design practical algorithms for constructing them. In [26], a SAT-based approach for computing minimal-size fitting -concepts (i.e., Berge-acyclic connected unary CQs) was proposed and evaluated, showing promising performance. While a SAT-solver may not be applicable here due to the higher complexity of the problem, we believe that inspiration can be taken from this approach. It may also be worthwhile to study approximation algorithms for distance-based query repairs, i.e, algorithms that produce fitting CQs that have near-minimal distance to the input CQ.
Naturally, these result are specific to the particular class of queries we considered: CQs. One may also study query repairs for other query classes (e.g., self-join free CQs, unions of CQs, or nested queries). Note, in particular, that self-join free CQs are cores and our examples of counterintuitive behavior of -repairs are based on CQs that are not cores. Moreover, the computational problems associated with -repairs are often of lower complexity for CQs that are cores.
It is also natural to let the space of candidate repairs depend on the input query in a stronger way. For instance, we may require that the join structure of the query remains fixed, and that repairs differ only in their WHERE-clause (cf. [23]).
Among different avenues for further research, let us mention one: query repair operations could be studied from a more structural perspective. For instance, how does repairing a query w.r.t. a collection of labeled examples compare to repairing it w.r.t. followed by repairing it w.r.t. ? And, assuming a “true” CQ , if we repair a given CQ repeatedly based on labeled examples for , does this process converge towards in a formal sense?
References
- [1] Bogdan Alexe, Balder ten Cate, Phokion G. Kolaitis, and Wang-Chiew Tan. Designing and refining schema mappings via data examples. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD ’11, pages 133–144, New York, NY, USA, 2011. Association for Computing Machinery. doi:10.1145/1989323.1989338.
- [2] Marcelo Arenas, Leopoldo E. Bertossi, and Jan Chomicki. Consistent query answers in inconsistent databases. In Victor Vianu and Christos H. Papadimitriou, editors, Proceedings of the Eighteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 31 - June 2, 1999, Philadelphia, Pennsylvania, USA, pages 68–79. ACM Press, 1999. doi:10.1145/303976.303983.
- [3] Pablo Barceló, Leonid Libkin, and Miguel Romero. Efficient approximations of conjunctive queries. SIAM J. Comput., 43(3):1085–1130, 2014. doi:10.1137/130911731.
- [4] Pablo Barceló, Mikaël Romero, and Thomas Zeume. A more general theory of static approximations for conjunctive queries. Theory of Computing Systems, 64(4):916–964, 2020. doi:10.1007/s00224-019-09924-0.
- [5] Leopoldo E. Bertossi. Database Repairing and Consistent Query Answering. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011. doi:10.2200/S00379ED1V01Y201108DTM020.
- [6] Meghyn Bienvenu and Riccardo Rosati. Tractable approximations of consistent query answering for robust ontology-based data access. In Francesca Rossi, editor, IJCAI 2013, Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, August 3-9, 2013, pages 775–781. IJCAI/AAAI, 2013. URL: http://www.aaai.org/ocs/index.php/IJCAI/IJCAI13/paper/view/6904.
- [7] Ilaria Bordino, Carlos Castillo, Debora Donato, and Aristides Gionis. Query similarity by projecting the query-flow graph. In Fabio Crestani, Stéphane Marchand-Maillet, Hsin-Hsi Chen, Efthimis N. Efthimiadis, and Jacques Savoy, editors, Proceeding of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010, Geneva, Switzerland, July 19-23, 2010, pages 515–522. ACM, 2010. doi:10.1145/1835449.1835536.
- [8] Balder ten Cate and Víctor Dalmau. The product homomorphism problem and applications. In Marcelo Arenas and Martín Ugarte, editors, 18th International Conference on Database Theory, ICDT 2015, March 23-27, 2015, Brussels, Belgium, volume 31 of LIPIcs, pages 161–176. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2015. doi:10.4230/LIPIcs.ICDT.2015.161.
- [9] Balder ten Cate, Victor Dalmau, Maurice Funk, and Carsten Lutz. Extremal fitting problems for conjunctive queries. In Proceedings of PODS 2023, 2023.
- [10] Kevin Chen-Chuan Chang and Hector Garcia-Molina. Approximate query mapping: Accounting for translation closeness. VLDB J., 10(2-3):155–181, 2001. doi:10.1007/S007780100042.
- [11] Surajit Chaudhuri. Finding nonrecursive envelopes for datalog predicates. In Catriel Beeri, editor, Proceedings of the Twelfth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 25-28, 1993, Washington, DC, USA, pages 135–146. ACM Press, 1993. doi:10.1145/153850.153862.
- [12] Surajit Chaudhuri and Phokion G. Kolaitis. Can datalog be approximated? J. Comput. Syst. Sci., 55(2):355–369, 1997. doi:10.1006/JCSS.1997.1528.
- [13] Mukesh Dalal. Investigations into a theory of knowledge base revision: preliminary report. In Proceedings of the Seventh AAAI National Conference on Artificial Intelligence, AAAI’88, pages 475–479. AAAI Press, 1988.
- [14] Oliver M. Duschka and Michael R. Genesereth. Answering recursive queries using views. In Alberto O. Mendelzon and Z. Meral Özsoyoglu, editors, Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 12-14, 1997, Tucson, Arizona, USA, pages 109–116. ACM Press, 1997. doi:10.1145/263661.263674.
- [15] Kenneth D. Forbus. Introducing actions into qualitative simulation. In Proceedings of the 11th International Joint Conference on Artificial Intelligence (IJCAI-89), pages 1273–1278, Detroit, MI, August 1989. Morgan Kaufmann. URL: http://ijcai.org/Proceedings/89-2/Papers/068.pdf.
- [16] Anneke Haga, Carsten Lutz, Leif Sabellek, and Frank Wolter. How to approximate ontology-mediated queries. In Proc. of KR, pages 323–333, 2021. doi:10.24963/KR.2021/31.
- [17] Witold Lipski Jr. On semantic issues connected with incomplete information databases. ACM Trans. Database Syst., 4(3):262–296, 1979. doi:10.1145/320083.320088.
- [18] Verena Kantere, George Orfanoudakis, Anastasios Kementsietsidis, and Timos K. Sellis. Query relaxation across heterogeneous data sources. In James Bailey, Alistair Moffat, Charu C. Aggarwal, Maarten de Rijke, Ravi Kumar, Vanessa Murdock, Timos K. Sellis, and Jeffrey Xu Yu, editors, Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM 2015, Melbourne, VIC, Australia, October 19 - 23, 2015, pages 473–482. ACM, 2015. doi:10.1145/2806416.2806529.
- [19] Benny Kimelfeld, Ester Livshits, and Liat Peterfreund. Counting and enumerating preferred database repairs. Theoretical Computer Science, 837:115–157, 2020. doi:10.1016/j.tcs.2020.05.016.
- [20] Gábor Kun. Constraints, MMSNP and expander relational structures. Combinatorica, 33(3):335–347, 2013. doi:10.1007/S00493-013-2405-4.
- [21] Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. Automated program repair. Commun. ACM, 62(12):56–65, November 2019. doi:10.1145/3318162.
- [22] Leonid Libkin. Models of approximation in databases. Theor. Comput. Sci., 190(2):167–210, 1998. doi:10.1016/S0304-3975(97)00090-X.
- [23] Ion Muslea and Thomas J. Lee. Online query relaxation via bayesian causal structures discovery. In Manuela M. Veloso and Subbarao Kambhampati, editors, Proceedings, The Twentieth National Conference on Artificial Intelligence and the Seventeenth Innovative Applications of Artificial Intelligence Conference, July 9-13, 2005, Pittsburgh, Pennsylvania, USA, pages 831–836. AAAI Press / The MIT Press, 2005. URL: http://www.aaai.org/Library/AAAI/2005/aaai05-131.php.
- [24] K. Satoh. Nonmonotonic reasoning by minimal belief revision. In Proceedings of the International Conference on Fifth Generation Computer Systems (FGCS’88), pages 455–462, 1988.
- [25] Balder ten Cate, Maurice Funk, Jean Christoph Jung, and Carsten Lutz. Fitting algorithms for conjunctive queries. SIGMOD Record, 52, 2023. doi:10.1145/3641832.3641834.
- [26] Balder ten Cate, Maurice Funk, Jean Christoph Jung, and Carsten Lutz. Sat-based pac learning of description logic concepts. In Edith Elkind, editor, Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, pages 3347–3355. International Joint Conferences on Artificial Intelligence Organization, August 2023. Main Track. doi:10.24963/ijcai.2023/373.
- [27] Balder ten Cate, Phokion Kolaitis, and Carsten Lutz. Query repairs, 2025. arXiv:2501.11162.
- [28] Ross Willard. Testing expressibility is hard. In David Cohen, editor, Principles and Practice of Constraint Programming - CP 2010 - 16th International Conference, CP 2010, St. Andrews, Scotland, UK, September 6-10, 2010. Proceedings, volume 6308 of Lecture Notes in Computer Science, pages 9–23. Springer, 2010. doi:10.1007/978-3-642-15396-9_4.
- [29] Marianne Winslett. Updating Logical Databases. Cambridge Tracts in Theoretical Computer Science. Cambridge University Press, 1990.