Repairing Databases over Metric Spaces with Coincidence Constraints
Abstract
Datasets often contain values that naturally reside in a metric space: numbers, strings, geographical locations, machine-learned embeddings in a vector space, and so on. We study the computational complexity of repairing inconsistent databases that violate integrity constraints, where the database values belong to an underlying metric space. The goal is to update the database values to retain consistency while minimizing the total distance between the original values and the repaired ones. We consider what we refer to as coincidence constraints, which include unary key constraints, inclusion constraints, foreign keys, and generally any restriction on the relationship between the numbers of cells of different labels (attributes) coinciding in a single value, for a fixed attribute set.
We begin by showing that the problem is APX-hard for general metric spaces. We then present an algorithm solving the problem optimally for tree metrics, which generalize both the line metric (i.e., where repaired values are numbers) and the discrete metric (i.e., where we simply count the number of changed values). Combining our algorithm for tree metrics and a classic result on probabilistic tree embeddings, we design a (high probability) logarithmic-ratio approximation for general metrics. We also study the variant of the problem where we limit the allowed change of each individual value. In this variant, it is already NP-complete to decide the existence of any legal repair for a general metric, and we present a polynomial-time repairing algorithm for the case of a line metric.
Keywords and phrases:
Database repairs, metric spaces, coincidence constraints, inclusion constraints, foreign-key constraintsFunding:
Benny Kimelfeld: Israel Science Foundation grant 768/19, German Research Foundation grant KI 2348/1-1.Copyright and License:
![[Uncaptioned image]](x1.png)
2012 ACM Subject Classification:
Information systems Data management systemsEditors:
Sudeepa Roy and Ahmet KaraSeries and Publisher:

1 Introduction
A pervasive problem encountered in databases is inconsistency, which means that the database violates some integrity constraints that are expected to hold in reality. Such an anomaly can arise due to mistaken data sources (e.g., Web resources), erroneous data recording (e.g., manual form filling), noisy data generation (e.g., machine learning), imprecise data integration (e.g., faulty entity resolution), and so on. Commercial tools for data cleaning and transformation (e.g., in data warehouse scenarios) are often based on human-specified constraints to guide the detection and repair of errors. Consequently, a well-studied computational problem that arises with inconsistent data is database repairing – suggesting a minimal intervention needed to correct the database so that all integrity constraints are satisfied. Studies of this general problem differ in their considered (i) types of integrity constraints, (ii) intervention models, and (iii) intervention cost measure.
Common types of integrity constraints include functional dependencies [26, 23, 28, 18, 8] (including key constraints), the more general denial constraints [5, 12, 13], and inclusion dependencies [8, 27, 12]. The intervention model refers to the operations applied to repair the database, and these are typically tuple deletion, tuple insertion, and value updates [4, 17]. In this work, we focus on the third and seek a so-called update repair. For this intervention model, finding an optimal repair is almost always shown to be computationally hard and algorithms developed are therefore mostly heuristic or focus on highly specialized cases [26, 23, 18, 30, 13, 8]. This state of affairs holds for various intervention cost measures studied, including the number of changed cells [26, 23], the sum of changes of numerical values [5], or the sum of user-provided costs of value updates [8, 13], often cast as probabilities of (independent) changes [18, 30].
To obtain provable guarantees for repair costs in a wide variety of update cost measures, in this work we exploit
the fact that database values commonly belong to a structured domain, namely a metric space, where update costs of cells abide by the laws of a metric space (positivity, symmetry, and the triangle inequality).
The obvious example of such a metric space is the space of numerical values [5]. Another simple example is the discrete metric, where any two different values are a unit distance apart (hence, the cost of a repair is the number of changed cells [26, 23]).
A more sophisticated example is the distance or travel time between map locations. Yet another example is textual values and distance given by edit distance, or the distance between textual embeddings in a Euclidean space [29].
In fact, several studies have proposed ways of embedding general database cells (including opaque keys) in Euclidean spaces, with the objective that semantically close cells should be mapped to geometrically close vectors, and vice versa [33, 10, 9].
Hence, there is a wealth of value types corresponding to metrics, so a unifying approach for repairing databases with metric values can apply to a wide range of applications.
Motivated by this observation, we introduce the following general computational problem.
Metric Database Repair Problem (informal; see Section 2)
A metric database consists of a set of cells, each with a label (attribute) and a value from a metric space.
A coincidence constraint lists, for each value, the allowed combinations of numbers of coincident cells of different labels.
A repair of an inconsistent database moves cells between points in the metric space (i.e., updates the cells’ values).
The goal is to compute a repair, if one exists, minimizing the total distance moved over all cells. The attribute set is fixed, but all else is given as input, including the (finite) metric space.
Coincidence constraints generalize (unary) inclusion constraints, key constraints, foreign-key constraints, as well as other constraints on cardinalities (e.g., the number of people associated with a company is at most the number of employees of the company according to some external trusted registry). Moreover, this class of constraints is closed under conjunction, disjunction, and negation. For example, we can phrase a constraint stating that every location of a team includes one driver (Driver.location cell), and either engineers (Eng.location cells) or salespeople (SP.location cells), but not both.
Contributions.
Our first contribution is a formalization of the problem introduced informally above, together with illustrative examples (Section 2). We then begin the complexity investigation of this problem by showing that, in the generality presented here, it is NP-hard and even APX-hard (Section 2.5).
Our main technical contribution is a polynomial-time algorithm for finding an optimal repair when the metric is a tree metric (Section 3.1). This algorithm implies the tractability of the problem for well-studied special cases of the tree metric, namely the line metric (i.e., the usual metric on numeric values) and the discrete metric where, as mentioned above, the cost of a repair is the number of changed cells. Moreover, combining our algorithm for tree metrics with classic results on randomly embedding a general metric into a tree metric [16, 3], we establish a (high-probability) logarithmic-factor approximation for general metrics (Section 3.2).
We also investigate two extensions of our work (Section 4). We first show how the model and results can be generalized to an infinite metric, such as the full Euclidean space (e.g., with the metric ). We then discuss the implication of imposing a bound on the amount of change allowed for each individual cell; that is, each point can move by at most some given distance . We show that the bound restriction makes the problem fundamentally harder for a general (finite) metric, since testing whether any repair exists is already NP-complete (even for a single label or two labels with a simple inclusion constraint). On the other hand, we devise a polynomial-time algorithm for finding an optimal repair for the line metric.
Related Work.
For repairs, we restrict the discussion to update repairs. For a broader view of database repair, see the excellent books [4, 17]. Upper and lower bounds have been established for the number of changes (which corresponds to the discrete metric) under functional dependencies [26, 23], which are incomparable with our coincidence constraints. Gilad, Imber and Kimelfeld [18] studied a relational model where each cell has a distribution over possible values (and cells are probabilistically independent), and the goal is to find a most probable instantiation that satisfies integrity constraints; one can translate an optimal repair in our model to a most probable world in their model, yet the metric properties are ignored in their work and, again, their study is restricted to functional dependencies.
Several studies investigated the computation of an optimal repair with a general distance function (that does not necessarily obey the distance/metric axioms); yet, their results are restricted to lower bounds (hardness) and heuristic algorithms without quality guarantees [13, 8]. While Chu et al. [13] focused on denial constraints (which are again incomparable to our coincidence constraints), the work of Bohannon et al. [8] is closer to our work, as they considered inclusion dependencies (in addition to functional dependencies). Bertossi et al. [5] studied the complexity of optimal repairs (“Database Fix Problem”) for numerical values under the Euclidean space (square of differences), and focused on denial constraints and aggregation constraints; their results for this problem are lower bounds (NP-completeness).
There is a weaker relevance of this work to frameworks where the repairs themselves (rather than the database values) are points of the metric space [1, 2], and work on consistent query answering (i.e., determining whether all repairs agree on a query answer) under keys and foreign keys [20].
From an opposite angle, there has been considerable work on approximate variations of database dependencies, where a database satisfies such a constraint if the database values are close to satisfying it. For example, if we assume that values belong to a metric space (as we do in this work), then such a constraint, also known as a metric constraint [24, 32] or a differential dependency [31, 25], may state that if two tuples are close on their attributes, they should also be close on their attributes, yielding a soft variation of the functional dependency [11]. This line of work is in the vain of approximate query answering (or “similarity joins”) where value equality (e.g., for equi-joins) is replaced with metric proximity [19, 34, 14, 15]. The work of Kaminsky, Pena, and Naumann [22] combines inclusion constraints and similarity between values within the problem of constraint discovery (with equality replaced by similarity); their work, as well as other work on the discovery of metric based approximate constraints [31, 25], is relevant here in the sense that it can be combined with ours in a larger flow that improves data quality by the discovery of (violated) integrity constraints, which are then handled by a repairing phase.
2 Formal Framework
We first describe our formal framework, from the basic concepts to the problem definition.
2.1 Metric Spaces
Recall that a metric space is a pair where is a set of points and is a function satisfying the following: positivity: for all and if and only if ; symmetry: for all ; and the triangle inequality: for all . When is finite, the metric can also be seen as having its distances correspond to the length of the shortest paths in an undirected graph, such as a complete graph on where each edge has weight .
Patient (P) | |
---|---|
pid | name |
437 | Anna |
487 | Bill |
719 | Carl |
799 | Darcy |
Registration (R) | |
---|---|
pid | time |
779 | 10:00 |
437 | 13:00 |
199 | 14:00 |
Vaccine (V) | |
---|---|
pid | nurse |
719 | 018 |
481 | 017 |
987 | 078 |
UsedShots (U) | |
nurse | #shots |
078 | 5 |
017 | 1 |
We will discuss three special cases of metric spaces , and we will distinguish between them via a subscript of the distance function .
-
A line metric , where and .
-
A discrete metric , where whenever .
-
A tree metric , where is a weighted tree with the vertex set , and is the weighted distance (i.e., sum of weights along the unique path) in between and .
2.2 Databases
We assume three countably infinite sets: is the set of all attributes, is a set of all cells, and is a set of values. Each cell is labeled with an attribute . If , then we call an -cell. By a domain we refer to a set of values.
A database is a mapping from a finite set of cells, denoted , to values. Hence, we view a database as a function . We denote by the set of values that occur in , that is, . We say that is a database over a domain if . If is a database and , then we denote by the restriction of to its -cells; hence, , and for all . If is a database and , then we denote by the set of cells with . We denote by the set of attributes that occur in , that is, .
A signature is a sequence of attributes. We denote by the attribute set . An instance of is a database such that . In the remainder of this paper, we always assume the presence of a signature , and a database that we mention is implicitly assumed to be an instance of .
Example 1.
The relational database of Figure 1 consists of a registry of vaccinations in an improvised medical facility. (We later discuss the errors in the relations.) From the “pid” columns we derive the database (in our model) of Figure 2(b) over the signature . Each cell is depicted by a rounded rectangle with its label written inside the box and its value connected to the rectangle by a dotted line. The number attached to each cell denotes the row number of the cell in Figure 1. (It is not part of the formal model and is given here just to help connecting between the figures.) For example, the pid attribute of row 1 of Patient gives rise to the left-top cell with and . We can similarly obtain the database over the signature from the nurse columns of the relations. This database is depicted on the left side of Figure 3. We use both and as running examples for this section.
2.3 Coincidence Constraints
Let be a signature. If is database and , then the sequence lists the number of -cells with the value for . We call the coincidence profile of (stating how many cells of each attribute coincide at ).
Let be a domain. A coincidence constraint over (w.r.t. ) is a function that maps every value to a subset , stating the allowed coincidence profiles for every value in . A database over satisfies , denoted , if for all ; that is, the coincidence profile of every value in is allowed by . Conversely, violates if for at least one value , and then we denote it by .
Note that the class of coincidence constraints is closed under conjunction (intersection), disjunction (union), and negation (complement). That is, if and are coincidence constraints over , then (that maps every to ) is a coincidence constraint over , and so are and . Also note that a may be infinite, but for a given database, only a finite subset is relevant (see Remark 6 later in the section).
For illustration, let , and let be a set of values. The key constraint states that no two -cells can have the same value. Hence, says that every value can have at most one -cell. The constraint can be expressed as the following function (which is the same on every point ):
The inclusion constraint states that every value in an -cell must also occur in an -cell. Hence, states that every value that has one or more -cells must also have one or more cells, and can be expressed by the following coincidence constraint:
Finally, the foreign-key constraint is the conjunction of and , hence expressed by the coincidence constraint .
A coincidence constraint may, in principle, pose a different restriction on every value of . We also consider the uniform case where is the same everywhere. Formally, we say that is uniform if for all pairs and of values in . In this case, we may write just instead of , and treat simply as a set of coincidence profiles . For example, each of , and is uniform since its definition does not depend on , hence we can write, for instance, .
Example 2.
We continue with our running example. The scenario we consider is that Figure 1 describes a relational database with a combination of clean data in the Patient and UsedShots relations (where administrators insert records), and noisy data in the Registration and Vaccine relations, where records are entered (manually or via OCR) from handwritten forms that may include mistakes or ambiguous letters.
Consider again the databases in Figure 2(b). derived from Figure 1. The constraints that we associate to the domain of is , stating that every vaccine is given to a registered patient who, in turn, is in the patient registry, and no two registrations, or two patient records, coincide in their value. For the relations of Figure 1 it means that pid is a foreign key from Vaccine to Registration and from Registration to Patient. In our formalism, this translates to the uniform constraint , where for very value . Note that the constraint is violated by since, for example, 719 is a value of a V.pid-cell but not an R.pid-cell, and 779 is a value of an R.pid-cell but not a P.pid-cell.
Example 3.
Still within our running example, for the domain of (Figure 3), recall that our signature is . Our constraint for this domain defined by
where is the number in the row of in the (clean) relation UsedShots of Figure 1, that is, and . Hence, states that every vaccinating nurse should occur in the registry of the used shots (i.e., ) and vice versa, and the number of times a nurse is recorded as giving a vaccine (that is, the number of V.nurse cells per nurse identifier) is at most the number of used shots recorded for that nurse. Note that is non-uniform since it differs for the two values 078 and 017. It is violated since the value 018 has a V.pid-cell but not a U.pid-cell.
2.4 Repairs
Let be a signature, be a metric space, and be a coincidence constraint over . An inconsistent database is a database over such that . A repair of is a database such that and . The cost of a repair is the cumulated distance that the values of undergo in the transformation to . We allow to differentiate the cost between different attributes, so we assume a global weight function .111For example, it may be the case that we trust -cells more than we trust -cells, so the same movement by distance can contribute differently to the cost of ; this will be reflected in . Hence, we define the cost of a repair , denoted , by
Example 4.
Continuing Example 2, assume that (or some large number), since P.pid-cells are assumed to be clean, and that and (since V-pid cells are deemed slightly more reliable than R.pid-cells). Figures 2(a) and 2(c) show optimal repairs of of Figure 2(b) with respect to and , respectively, where
-
(i.e., );
-
is the Hamming distance between strings (e.g., and );
-
is the discrete distance over (e.g., ).
Note that , since there are two changes of R.pid and three of V.pid, each of distance . (Changes are marked by red lines.) Also note that as there are two changes of R.pid and two of V.pid, but since:
(1) | ||||
(2) |
In particular, note that is superior to for the metric .
Example 5.
Continuing Example 3, recall that violates because the value 018 is associated with a V.nurse-cell but no U.pid-cells. Assume that since the U.nurse-cells are assumed to be clean, but . Under both the discrete metric and the Hamming distance, an optimal repair is obtained by changing the cell of 018 to 078, as illustrated at the right of Figure 3. Note that changing the value of to 017 would not be legal, since it would violate the constraint , as would be two (corresponding to two V.nurse-cells with the value 017 while .
2.5 Computational Problem
We study the complexity of computing a low-cost repair. Formally, we assume a fixed signature . The input consists of a finite metric space , a finite coincidence constraint over , and an inconsistent database over . The goal is to compute an optimal repair, that is, a repair such that for all repairs , or declare that no repair exists. We will also study the approximate version of finding an -optimal repair, where is a number (or a numeric function of the input), which is a repair such that for all repairs .
Remark 6.
The assumption that is fixed (i.e., concerning the data complexity the problem) is essential in this work, as it means that we can traverse in polynomial time all the profiles that are relevant to a repair: for a given database , these are the profiles where every number is at most . In particular, our polynomial-time algorithms do not deal with the manner in which the constraints are represented. In fact, for this reason, the constraint could even be infinite, and examples of such constraints are shown in Section 2.3; hence, the assumption that is given as part of the input is not a limitation.
It may be unclear upfront whether we can even test in polynomial time whether any repair exists (i.e., our version of the existence of repair problem [4]). It follows immediately from our later results on optimal repairs (e.g., Theorem 8) that this problem is, indeed, solvable in polynomial time. Yet, our first result in the next section states hardness of approximation, even when a repair is guaranteed to exist.
2.6 Hardness
The following theorem states that for general input metrics, it is NP-hard to approximate the optimal repair beyond some fixed ratio. Recall that the metric space is given explicitly as part of the input (e.g., as a graph). Be reminded that APX-hardness means that there is some constant such that there is no polynomial-time algorithm for computing an -approximation unless . The proof (in the full version [21]) is via a PTAS reduction from the problem of finding a minimum cover by 3-sets.
Theorem 7.
Let be a signature with . Minimizing the cost of a repair is APX-hard, even if the weight is uniform, is uniformly the inclusion constraint , and a repair is guaranteed to exist.
The remainder of this paper is therefore dedicated to obtaining optimal algorithms for metrics of interest (notably, the line metric and discrete metric), and providing provable approximation algorithms for general metrics.
3 Algorithms
In this section, we present our repair algorithms. We begin with an exact algorithm for tree metrics. This algorithm immediately implies the tractability of the line and discrete metrics (as we will show in Corollary 9). The exact algorithm for trees will also play a central role in the approximation algorithm for general metrics in the second half of this section.
3.1 Algorithm for Tree Metrics
Many problems in tree metrics are amenable to dynamic programming approaches, allowing for polynomial-time algorithms for problems that might be intractable on general metrics. This is also the case here. In this section, we devise a polynomial-time algorithm for finding an optimal repair in the case of a tree metric. Formally, we will prove that:
Theorem 8.
An optimal repair can be found in polynomial time (if exists), given a tree metric space , a coincidence constraint over , and an inconsistent database .
Theorem 8 implies, in particular, that we can test in polynomial time whether a repair exists (regardless of the metric): apply the theorem to an arbitrary tree metric over the points. Moreover, the tractability of tree metrics implies the same for other basic metrics:
Corollary 9.
An optimal repair can be found in polynomial time (if exists) in the case of a line metric , and in the case of the discrete metric over .
Proof.
Let . The corollary follows from Theorem 8 by casting the two metrics as tree metrics, as illustrated in Figure 4. The case of a line metric is straightforward; if (w.l.o.g.) , then is the same as the tree metric where is the path where the weight of each edge is . In the case of the discrete metric over , the tree is a star with the leaves and a new vertex as the center. The weight of every edge is . To make sure that (optimal) repairs do not place any cell in , we define .
Next, we prove Theorem 8 by showing the repairing algorithm. Throughout this section, we assume the signature and a given input . We further assume that the tree itself is given (or, otherwise, we can reconstruct it from ).
General idea.
Before we delve into the details of the algorithm, we give the general intuition behind it. The initial direction is to process the tree bottom-up, starting from the leaves, and compute the best repair for each subtree we process. However, an optimal repair for the subtree rooted at a vertex (if any exists) is not necessarily useful when processing the ancestors of , since an optimal solution for may need to transfer cells from to vertices outside of , or cells from outside of into (as illustrated in Figure 5). Nevertheless, due to the uniformity of the cost of moving cells of each type , it would suffice to find the best repair knowing just the number of cells moved inside/outside without knowing the identity of these cells. So, we compute the best repair under the assumption that we need to handle a specific number of -cells (which can be lower or higher than the number of -cells in ), for every attribute . Moreover, we consider all possible (yet polynomially many) combinations for the attributes in the signature .
Translation into a binary tree.
We first transform the tree into a binary tree where every internal vertex has precisely two children. We do so by introducing new vertices with distance zero to their parents. While this construction violates the property of non-zeroness of the metric, this does not matter for the algorithm, which is optimal even for a pseudometric (where distinct points can be of distance ). The end result is that every point in is a vertex of , and some vertices of are not in (as we introduced them for the construction). Let be the set of new vertices (hence, ). We extend to by defining for every and for every . This ensures that our solution places no cells in the new vertices, and so, the result is a legal repair.
Hereon, we will assume that is binary to begin with, and that and are, to begin with, the above constructed and .
Placement.
The algorithm deploys dynamic programming that processes bottom-up, leaf to root. For a vertex of , we denote by the subtree of rooted at , and by the subset of with cells having points inside .
Let . For , let be the number of -cells of , that is, . Let be integers, where each is in . A -placement in is a consistent database that is obtained by positioning cells of in , where the number of -cells is . Some of these cells may belong to and others may be outside of . Cells of the former kind are called resident cells and those of the latter kind are visitor cells. Note that the consistency of a -placement considers only and the resident and visitor cells, and ignores the rest of and the remaining cells of ; in particular, it may be the case that a -placement exists, but it cannot be extended to a full repair (e.g., since the other metric points cannot contain the remaining cells).
The cost of a -placement is the sum where:
In words, we consider the transformation of from ; if is a cell that moves from one vertex of to another, then is the distance between the two vertices. If is a resident cell that disappears, then is the cost of moving to the root. If is a visitor cell that occurs in , then is the cost of moving from the root to its vertex.
Given , we will compute, for each vertex , the minimum-cost -placement (among all sets of cells) in . This value will be stored as where, as a special case, the cost of the best repair of is , where is the root vertex of . If no -placement exists, then is . (As often done in dynamic programming, the actual repair is obtained by restoring the optimal placements that produce the least cost .)
Handling leaves.
When is a leaf, we define if (hence, the coincidence profile of is legal); otherwise, .
Handling internal vertices.
We now consider the case where is an internal vertex. For a vertex of and , we use to denote the number of -cells in the tree (as positioned in ), that is, the sum of over all vertices in the subtree .
To find an optimal -placement for setting , we compute the minimum cost for every coincidence profile of the vertex , and then take the least-cost entry across all profiles. In the remainder of this part, we fix .
Recall that is an internal vertex, and so, has precisely two children. Let us denote them by and . To obtain our optimal -placement, we can pull from (where ) a set of -cells of size for , or push to a set -cells of size for . (Note that there is no gain in pulling -cells and pushing -cells at the same time since the placement can use the pulled cell instead of the pushed cell with no additional cost, or even a lower cost.) We say uniformly that we pull from a set of -cells of size for , where the meaning of pulling cells, for , is pushing cells. Once we pull -cells from (or push -cells to) , we place the cells optimally in . To find the cost of that, we use a previously computed where (i.e., the number of -cells that remain in ). Hence, the total cost is given by:
(3) |
We will then iterate over all legal combinations of and take the minimal cost according to Equation 3. The combination is legal if, for all , the number of cells that we position in is indeed . This number consists of the number of -cells that remain in each , namely , plus the number of -cells that remain in ; hence, .
This concludes the description of the dynamic program. Clearly, the execution of the program terminates in polynomial time. (Recall that the signature is fixed and, in particular, is treated as a constant.) The dynamic program is correct in the sense that it constructs an optimal repair if any repair exists. This is proved by showing that the program is indeed solving the generalized optimization problem:
Lemma 10.
Let be a vertex of . When we process with , we compute the least cost of a -placement in , or if none exists.
The proof is straightforward from the description of the algorithm in this section.
We remark here that the algorithm is exponential in the size of the (fixed) signature , and the exponent is manifested in two parts of the algorithm: first, we assume that we can explicitly enumerate all sequences of each ; second, the dynamic program handles every combination for its placement optimization.
3.2 General (Finite) Metrics
Leveraging our algorithm for tree metrics and classic results for probabilistic tree embeddings, in this section we devise a logarithmic-ratio approximation for general (finite) metrics. Formally, we will prove that:
Theorem 11.
There is a polynomial-time randomized algorithm that, given a metric , a coincidence constraint , an inconsistent database , and an error probability , finds an -optimal repair with probability at least , if any repair exists.
In the remainder of this section, we prove Theorem 11. We fix the input for the rest of this section. The proof is based on the following classic tree embedding lemma [16] that allows for the translation of exact algorithms for tree metrics to -approximation algorithms for arbitrary metrics.
Lemma 12 ([16]).
Given a metric , there exists a polynomial-time samplable distribution over weighted trees defining tree metrics such that for every :
-
1.
for every in the support of .
-
2.
.
Blelloch, Gu, and Sun [6] show a near-linear-time counterpart of Lemma 12. We note that randomization and expectation are crucial for this theorem; for example, if we embed an -point cycle with unit-distance edges in any tree, then at least one pair of vertices will suffer an distortion [3, Theorem 7].
By combining Theorem 8 and Lemma 12, we will show how we obtain an approximation, with high probability, in polynomial time. We do so by repeatedly finding an optimal repair for multiple random trees , and taking the best outcome. More precisely, to obtain an -approximation with probability at least , we take the best outcome out of the repetitions of the following procedure:
-
1.
Select a random tree according to Lemma 12.
-
2.
Find an optimal repair for , and .
We show that the process gives an -approximation with probability at least . Let us denote by an optimal repair for the original metric . We use the following two lemmas.
Lemma 13.
.
Proof (Sketch).
We prove the lemma (in the archive version of this paper [21]) by analyzing , applying the linearity of expectation and Lemma 12.
Lemma 14.
for some constant . In words, is -optimal with probability at least .
Proof.
It follows immediately from the inequality of Lemma 13 and Markov’s inequality.
The two lemmas suffice to prove Theorem 11, since the randomized procedure can be seen as a Bernoulli trial that succeeds (i.e., produces a good approximation) with a probability of at least ; hence, we see at least one success with probability after independent trials.
4 Extensions
In this section, we study two extensions of our study: the case of an infinite metric, and bound restriction on the movement of each individual cell.
4.1 Infinite Metrics
Up to now, we have considered databases over a finite metric that is given explicitly as part of the input. In particular, a repair could use only values from the given point set . There are, however, natural situations where the metric is a known infinite metric that can provide additional points for repairs. We wish to be able to repair an inconsistent database by using arbitrary values from the infinite metric. An example is the -metric where is the Euclidian space and is the norm over , that is, .
To model the computational problem in the case of infinite metrics, we consider the case where the metric is fixed and infinite. Moreover, the coincidence constraint is fixed, and we restrict the discussion to a uniform (that maps every point in to the same, possibly infinite, set of coincidence profiles). We will further assume that contains the profile since, otherwise, it is impossible to satisfy using any database, as our databases are finite. Computationally, we only require polynomial-time computation of distances , given and , and membership testing in , given .
The following theorem states that we need not use points outside of if we are satisfied with a 2-approximation and the coincidence constraint is closed under addition. Note that a uniform coincidence constraint over is closed under addition if for every pair and of profiles in , the profile is also in . For illustration, referring to Section 2.3, the inclusion constraint is closed under addition, but the key constraint and the foreign-key constraint are not closed under addition.
Proposition 15.
Let be an infinite metric space, a uniform coincidence constraint, and an inconsistent database. If is closed under addition, then there exists a 2-optimal repair such that .
Proof (Sketch).
To establish from an optimal repair , we take every point in and move all cells in that point to the nearest cell in . The distance to each moved cell can then grow at most twice. See the archive version of the paper [21].
Note that it is necessary to make an assumption on the coincidence constraint in Proposition 15. If we remove the assumption, the statement is false simply because there may be no repair at all over the domain of . For example, if is the key constraint , then it may be necessary to introduce new metric points if has fewer points than -cells.
Proposition 15 implies that we can reduce the case of an infinite metric to the case of a finite one (assuming that the coincidence constraint is closed under addition). In particular, we can apply Theorem 11 to conclude a logarithmic approximation in the infinite case.
Theorem 16.
Let be an infinite metric space, and let be a uniform coincidence constraint that is closed under addition. There is a polynomial-time randomized algorithm that, given an inconsistent database and an error probability , finds an -optimal repair for , with probability at least .
The assumption on is crucial for the correctness of Theorem 16. For example, suppose that is the -metric and is a key constraint. In that case, we can construct a repair with an arbitrarily small cost, by generating enough points around those of . Hence, any -optimal solution must be an optimal solution, since our approximation ratio is multiplicative. In particular, we cannot get any approximation guarantee by using the tree embedding of Lemma 12.
4.2 Bound Restriction
Note that a low cost of a repair does not necessarily imply that an individual cell is moved to a point that is close to its origin. This is due to our choice to measure the cost of a repair as the sum of cell movements. It is clearly of interest to consider the variant of the problem where we limit the movement of individual cells by a threshold.
In this section, we consider the extension of our repairing problem where a bound is posed on the maximal movement of a cell. Formally, consider a metric space , a coincidence constraint , and an inconsistent database . For a threshold , a -repair is a repair such that for every . Our goal is now to find a -repair with a minimal .
However, the bound restriction is nontrivial to deal with. Our algorithms inherently fail to deal with this restriction. The dynamic-programming algorithm of Theorem 8 is based on the fact that we can remember the total movement of cells from/to the root, but not any information about individual cells. Moreover, the approximation using the random tree embedding of Lemma 12 cannot lead to the support of a bound restriction, since every random tree may (unavoidably) have a high distortion, meaning that two individual points and can be way farther in than in . In fact, even the existence of a repair (regardless of its cost) becomes an intractable problem in the presence of a bound constraint.
Proposition 17.
It is NP-complete to determine whether any -repair exists, given a (finite) metric , a coincidence constraint , an inconsistent database , and threshold . The problem remains NP-hard in each of the following cases:
-
1.
The signature consists of a single attribute and is uniform.
-
2.
The signature is and is the inclusion constraint .
Proof (Sketch).
For the first case we devise a reduction from exact cover by 3-sets, and for the second we use CNF satisfiability. See the archive version of the paper [21].
Algorithm for the Line Domain
In contrast to the hardness shown in Proposition 17, we can efficiently force a bound restriction in the case of a line metric.
Theorem 18.
An optimal -repair can be found in polynomial time, given , , and a line metric . The same holds true if is fixed and is fixed, uniform, and closed under addition.
In the remainder of this section, we prove Theorem 18 by presenting an algorithm for computing an optimal -repair. Throughout the section, we assume that is a set of numbers and is . We consider separately the cases where is finite and given as part of the input, and where is itself.
Given (finite) metric.
We first introduce some notation. Let be a database.
A subset of is a database is such that and for all . We write to state that is a subset of , and to denote the complement of , that is, the subset of with . If and is a repair of , then denotes the subset of with .
A prefix of is a database that comprises a prefix of ’s cells for each attribute; that is, it is a database such that if for some attribute , then for all with , it holds that . We write to denote that is a prefix of . We say that is a strict prefix of if and . If is a prefix of , then the complement is a suffix of .
We say that is contracted if consists of a single point, that is, for all and in . For a set , we say that satisfies (denoted ) if for a contracted database with . (Note that either all such contracted or none of them satisfy , since the common cell value has no impact on satisfying .)
The following lemma implies that, without loss of generality, we can assume that an optimal repair consists of two parts, as illustrated in Figure 6: one is an optimal repair of a strict prefix, and the other is a contraction of the suffix.
Lemma 19.
Let be an inconsistent database over a line metric . If any -repair exists, then there is an optimal -repair and a strict prefix of with the following properties.
-
1.
is an optimal -repair of .
-
2.
is contracted.
Proof (Sketch).
The proof (in the archive version [21]) is based on showing that two -cells cannot cross each other in their movement from to .
Note that has a polynomial number of prefixes. Also note that is a partial order over the prefixes of . Hence, due to Lemma 19, we can apply dynamic programming to compute an optimal -repair of every prefix of , where we traverse the prefixes in a topological order according to , starting with the empty set.
We compute as follows the cost of an optimal -repair of . (By recording the decisions throughout the procedure, we can derive the repair itself.) Let be the values of , with for all . For a prefix of , we denote by the cost of a -repair of such that for all , and is minimal among all -repairs of satisfying this property. Our final goal is to compute the value . We show how to compute using dynamic programming.
-
1.
If , then .
-
2.
If and , then .
-
3.
Otherwise, let be the set of all prefixes of such that and for all . Then:
The first case refers to the situation where is empty; hence, the cost is zero. The second case refers to the situation where we have cells but no points to locate cells in (hence, there is no -repair), so the cost is infinite. In the third case, we go over all possible prefixes of with . In this case, the repair is such that for all , while for every cell . In this case, is the minimal cost for , and to that we add the cost of changing the cells of to . Since is contracted, by checking that , we ensure that we do not violate consistency by placing all cells of together. Then, is responsible for checking that . This guarantees that we obtain a repair. Furthermore, we only consider the prefixes such that for all , and is responsible for checking that for all , which guarantees that we obtain a -repair.
The correctness of the above algorithm is a direct consequence of Lemma 19, and it is rather straightforward to show that its running time is polynomial in the size of the input.
The full line.
When , we can use the following lemma, which gives a reduction to the finite case that we discussed in the previous part.
Lemma 20.
Suppose that and is closed under addition. Let be an inconsistent database. If any -repair exists, then there is an optimal -repair such that .
Proof (Sketch).
We show (in the archive [21]) that if the repair uses a point outside the stated domain, then the entire set of cells in that point can be moved to a point in the domain without increasing the cost; this move is possible since is closed under addition.
5 Conclusions
We studied the problem of finding an optimal repair of an inconsistent database (i.e., a set of labeled cells) with respect to a coincidence constraint. We established that incorporating the metric space underlying the domain of values can lead to algorithms with efficiency and quality guarantees. In summary: the problem is APX-hard for general metrics but logarithmically approximable in polynomial time, and moreover the problem is solvable optimally in polynomial time for a tree metric (hence, for the common line and discrete metrics). We also discussed the case of an infinite metric and the addition of bound restrictions. The addition of the bound restrictions makes it NP-hard to test whether any legal repair exists, but an optimal repair can be found in polynomial time for the line metric.
Many directions are left for future work. First, for general metrics our lower bound is APX-hardness (i.e., some constant ratio) while the upper bound is logarithmic; hence, a gap remains. Next, as mentioned in Section 4, we left open the case of a tree metric with bound restrictions. Note, however, that even if this case is solved optimally in polynomial time, it is not at all clear that this tractability has implications on other metrics (e.g., via embedding) as it has in the absence of bound restrictions. Still, it is an important challenge to find natural metrics, beyond the line metric, where nontrivial upper bounds can be established.
Another direction for future work is to extend our work to constraints besides coincidence constraints, including others commonly studied for data quality management: functional dependencies (and their conditional enrichment [7]), denial constraints, non-unary inclusion constraints, and so on. Another direction is the extension to coincidence constraints with a non-fixed set of labels given as part of the input (i.e., the “combined complexity” variant of this problem), which requires a formalism for compactly expressing coincidence constraints, such as a set of inclusion (or key or foreign-key) constraints.
Finally, it is important to investigate the practical aspects of our work. How efficiently do the algorithms of this paper perform on common datasets? How practical is the dynamic program for the tree metric? How can we optimize it? What is the actual approximation ratio that takes place in a general metric? How does it change from one metric to another? How close is an (approximately) optimal repair to the correct database instance? These questions call for a careful implementation and experimental investigation as the next steps.
References
- [1] Ofer Arieli, Marc Denecker, and Maurice Bruynooghe. Distance semantics for database repair. Ann. Math. Artif. Intell., 50(3-4):389–415, 2007. doi:10.1007/S10472-007-9074-1.
- [2] Ofer Arieli and Anna Zamansky. A graded approach to database repair by context-aware distance semantics. Fuzzy Sets Syst., 298:4–21, 2016. doi:10.1016/J.FSS.2015.06.007.
- [3] Yair Bartal. Probabilistic approximations of metric spaces and its algorithmic applications. In FOCS, pages 184–193. IEEE Computer Society, 1996. doi:10.1109/SFCS.1996.548477.
- [4] Leopoldo E. Bertossi. Database Repairing and Consistent Query Answering. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011. doi:10.2200/S00379ED1V01Y201108DTM020.
- [5] Leopoldo E. Bertossi, Loreto Bravo, Enrico Franconi, and Andrei Lopatenko. The complexity and approximation of fixing numerical attributes in databases under integrity constraints. Inf. Syst., 33(4-5):407–434, 2008. doi:10.1016/J.IS.2008.01.005.
- [6] Guy E. Blelloch, Yan Gu, and Yihan Sun. Efficient construction of probabilistic tree embeddings. In ICALP, volume 80 of LIPIcs, pages 26:1–26:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPICS.ICALP.2017.26.
- [7] Philip Bohannon, Wenfei Fan, Floris Geerts, Xibei Jia, and Anastasios Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, pages 746–755. IEEE Computer Society, 2007. doi:10.1109/ICDE.2007.367920.
- [8] Philip Bohannon, Michael Flaster, Wenfei Fan, and Rajeev Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD Conference, pages 143–154. ACM, 2005. doi:10.1145/1066157.1066175.
- [9] Rajesh Bordawekar and Oded Shmueli. Using word embedding to enable semantic queries in relational databases. In DEEM@SIGMOD, pages 5:1–5:4. ACM, 2017. doi:10.1145/3076246.3076251.
- [10] Riccardo Cappuzzo, Paolo Papotti, and Saravanan Thirumuruganathan. Creating embeddings of heterogeneous relational datasets for data integration tasks. In SIGMOD Conference, pages 1335–1349. ACM, 2020. doi:10.1145/3318464.3389742.
- [11] Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. Relaxed functional dependencies - a survey of approaches. IEEE Trans. Knowl. Data Eng., 28(1):147–165, 2016. doi:10.1109/TKDE.2015.2472010.
- [12] Jan Chomicki and Jerzy Marcinkowski. Minimal-change integrity maintenance using tuple deletions. Inf. Comput., 197(1-2):90–121, 2005. doi:10.1016/J.IC.2004.04.007.
- [13] Xu Chu, Ihab F. Ilyas, and Paolo Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458–469. IEEE Computer Society, 2013. doi:10.1109/ICDE.2013.6544847.
- [14] Vlastislav Dohnal, Claudio Gennaro, and Pavel Zezula. A metric index for approximate text management. In ISDB, pages 37–42. Acta Press, 2002.
- [15] Vlastislav Dohnal, Claudio Gennaro, and Pavel Zezula. Similarity join in metric spaces using ed-index. In DEXA, volume 2736 of Lecture Notes in Computer Science, pages 484–493. Springer, 2003. doi:10.1007/978-3-540-45227-0_48.
- [16] Jittat Fakcharoenphol, Satish Rao, and Kunal Talwar. A tight bound on approximating arbitrary metrics by tree metrics. J. Comput. Syst. Sci., 69(3):485–497, 2004. doi:10.1016/J.JCSS.2004.04.011.
- [17] Wenfei Fan and Floris Geerts. Foundations of Data Quality Management. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2012. doi:10.2200/S00439ED1V01Y201207DTM030.
- [18] Amir Gilad, Aviram Imber, and Benny Kimelfeld. The consistency of probabilistic databases with independent cells. In ICDT, volume 255 of LIPIcs, pages 22:1–22:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.ICDT.2023.22.
- [19] Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491–500. Morgan Kaufmann, 2001. URL: http://www.vldb.org/conf/2001/P491.pdf.
- [20] Miika Hannula and Jef Wijsen. A dichotomy in consistent query answering for primary keys and unary foreign keys. In PODS, pages 437–449. ACM, 2022. doi:10.1145/3517804.3524157.
- [21] Youri Kaminsky, Benny Kimelfeld, Ester Livshits, Felix Naumann, and David Wajc. Repairing databases over metric spaces with coincidence constraints. CoRR, abs/2409.16713, 2024. doi:10.48550/arXiv.2409.16713.
- [22] Youri Kaminsky, Eduardo H. M. Pena, and Felix Naumann. Discovering similarity inclusion dependencies. Proc. ACM Manag. Data, 1(1):75:1–75:24, 2023. doi:10.1145/3588929.
- [23] Solmaz Kolahi and Laks V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, volume 361 of ACM International Conference Proceeding Series, pages 53–62. ACM, 2009. doi:10.1145/1514894.1514901.
- [24] Nick Koudas, Avishek Saha, Divesh Srivastava, and Suresh Venkatasubramanian. Metric functional dependencies. In ICDE, pages 1275–1278. IEEE Computer Society, 2009. doi:10.1109/ICDE.2009.219.
- [25] Selasi Kwashie, Jixue Liu, Jiuyong Li, and Feiyue Ye. Efficient discovery of differential dependencies through association rules mining. In ADC, volume 9093 of Lecture Notes in Computer Science, pages 3–15. Springer, 2015. doi:10.1007/978-3-319-19548-3_1.
- [26] Ester Livshits, Benny Kimelfeld, and Sudeepa Roy. Computing optimal repairs for functional dependencies. ACM Trans. Database Syst., 45(1):4:1–4:46, 2020. doi:10.1145/3360904.
- [27] Yasir Mahmood, Jonni Virtema, Timon Barlag, and Axel-Cyrille Ngonga Ngomo. Computing repairs under functional and inclusion dependencies via argumentation. In FoIKS, volume 14589 of Lecture Notes in Computer Science, pages 23–42. Springer, 2024. doi:10.1007/978-3-031-56940-1_2.
- [28] Dongjing Miao, Pengfei Zhang, Jianzhong Li, Ye Wang, and Zhipeng Cai. Approximation and inapproximability results on computing optimal repairs. VLDB J., 32(1):173–197, 2023. doi:10.1007/S00778-022-00738-0.
- [29] Rajvardhan Patil, Sorio Boit, Venkat N. Gudivada, and Jagadeesh Nandigam. A survey of text representation and embedding techniques in NLP. IEEE Access, 11:36120–36146, 2023. doi:10.1109/ACCESS.2023.3266377.
- [30] Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Ré, and Theodoros Rekatsinas. A formal framework for probabilistic unclean databases. In ICDT, volume 127 of LIPIcs, pages 6:1–6:18, 2019. doi:10.4230/LIPICS.ICDT.2019.6.
- [31] Shaoxu Song and Lei Chen. Differential dependencies: Reasoning and discovery. ACM Trans. Database Syst., 36(3):16:1–16:41, 2011. doi:10.1145/2000824.2000826.
- [32] Shaoxu Song, Lei Chen, and Hong Cheng. Parameter-free determination of distance thresholds for metric distance constraints. In ICDE, pages 846–857. IEEE Computer Society, 2012. doi:10.1109/ICDE.2012.46.
- [33] Jan Tönshoff, Neta Friedman, Martin Grohe, and Benny Kimelfeld. Stable tuple embeddings for dynamic databases. In ICDE, pages 1286–1299. IEEE, 2023. doi:10.1109/ICDE55515.2023.00103.
- [34] Minghe Yu, Guoliang Li, Dong Deng, and Jianhua Feng. String similarity search and join: a survey. Frontiers Comput. Sci., 10(3):399–417, 2016. doi:10.1007/S11704-015-5900-5.