
Targeted Least Cardinality Candidate Key for Relational Databases

Vasileios Nakos, National and Kapodistrian University of Athens, Greece; Archimedes Athena RC, Marousi, Greece
Hung Q. Ngo, Relational AI, Berkeley, CA, USA
Charalampos E. Tsourakakis, Relational AI, Berkeley, CA, USA
Abstract

Functional dependencies (FDs) are a central theme in databases, playing a major role in the design of database schemas and the optimization of queries [37]. In this work, we introduce the targeted least cardinality candidate key problem (TCAND). This problem is defined over a set of functional dependencies F and a target variable set T ⊆ V, and it aims to find the smallest set X ⊆ V such that the FD X → T can be derived from F. The TCAND problem generalizes the well-known NP-hard problem of finding the least cardinality candidate key [32], which has been previously demonstrated to be at least as difficult as the set cover problem.

We present an integer programming (IP) formulation for the TCAND problem, analogous to a layered set cover problem. We analyze its linear programming (LP) relaxation from two perspectives: we propose two approximation algorithms and investigate the integrality gap. Our findings indicate that the approximation upper bounds for our algorithms are not significantly improvable through LP rounding, a notable distinction from the standard Set Cover problem. Additionally, we discover that a generalization of the TCAND problem is equivalent to a variant of the Set Cover problem, named Red Blue Set Cover [8], which cannot be approximated within a sub-polynomial factor in polynomial time under plausible conjectures [11]. Despite the extensive history surrounding the issue of identifying the least cardinality candidate key, our research contributes new theoretical insights, novel algorithms, and demonstrates that the general TCAND problem poses complexities beyond those encountered in the Set Cover problem.

Keywords and phrases:
functional dependencies, candidate key, approximation algorithms, hardness
Copyright and License:
© Vasileios Nakos, Hung Q. Ngo, and Charalampos E. Tsourakakis; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Information systems → Database design and models
Editors:
Sudeepa Roy and Ahmet Kara

1 Introduction

Relational databases are a fundamental component of modern information systems, used to store and manage vast amounts of data across a wide range of industries and applications [37]. However, designing an effective database schema can be a complex task, requiring careful consideration of factors such as data integrity, performance, and scalability. One key tool in this process is the use of functional dependencies (FDs), which provide a way to describe the relationships between attributes in a database relation [16]. A functional dependency (FD) is a statement that indicates that the value of one or more attributes uniquely determines the value of another attribute. For example, consider a relation with attributes Student ID, Student Name, and Student Email. In this relation, each student ID is associated with a unique student name and email. This means that if we know a student's ID, we can determine their name and email. This relationship can be represented by the following FD: Student ID → Student Name, Student Email. Alternatively, we say that the Student ID variable functionally determines the name and email, meaning that each ID uniquely corresponds to one student name and email in the relation. A key is a set of one or more attributes that uniquely identifies a tuple (or record) within a relation. A key ensures that there are no duplicate records in the table and establishes a way to reference records for relational operations. A candidate key is any key that can serve as the primary key, i.e., a set of attributes that uniquely identifies a tuple within a relation. Finding a candidate key in a schema requires finding a set of attributes that functionally determine all the rest [17]. It is important to note that it is often customary to select a candidate key with the least number of attributes as the primary key. This approach simplifies the database design and can enhance storage efficiency and query speed, particularly when the primary key plays a role in indexing, joins, and various database tasks [37]. For example, in query optimization, fewer attributes in a key mean fewer columns to index and process during queries, which enhances the speed and efficiency of database operations. Furthermore, using the smallest possible key simplifies the design of the database schema, making it easier to understand and maintain. It also reduces the likelihood of errors in data management and makes integrity checks more efficient. This problem is known as the minimum candidate key problem [30, 32]. Furthermore, finding the Boyce–Codd normal form (BCNF) to eliminate redundancy is based on functional dependencies [37]. Understanding functional dependencies is essential for database designers and administrators to create efficient and effective database schemas that accurately store and manage data. Functional dependency analysis is a major topic with a variety of applications beyond schema design and query optimization, including cost estimation, order optimization, selectivity estimation, estimation of (intermediate) result sizes, and data cleaning, among others [5, 6, 12, 35, 23]. The DISTINCT clause appears frequently in SQL queries and often requires an expensive sort to remove duplicates. Functional dependency analysis can identify redundant DISTINCT clauses [34], thus significantly lowering the execution cost of a query.

In this work we revisit functional dependency analysis and make several key contributions to this longstanding line of work.

  • Novel formulation: We introduce a well-motivated, novel formulation that generalizes the classic problem of finding the least cardinality key of a schema [30, 32]. We refer to this problem as the targeted least cardinality candidate key generation problem, or TCAND for short.

    Problem 1 (Targeted Least Cardinality Candidate Key (TCAND)).

    Given a set of functional dependencies F and a set of target variables T ⊆ V, the aim is to determine a set of variables X ⊆ V with the smallest cardinality, such that the closure X^+ satisfies the condition T ⊆ X^+. In other terms, the functional dependency X → T is logically implied by the set F.

  • Hardness: We study in depth the hardness of approximation of the TCAND formulation. Despite the long history and importance of the problem in its general form when T = V [32], surprisingly little is known about the approximability of the problem. As we show, the known approximation lower bound [1] is tight only for some special cases, while the general case is significantly harder. We achieve this result by establishing a novel connection with the Red Blue Set Cover problem that allows us to leverage recent progress to establish a powerful hardness result assuming the Dense-vs-Random conjecture [9].

  • Exact IP formulation: We present an exact integer programming (IP) formulation that represents the TCAND problem as a layered set cover problem. Intuitively, each layer corresponds to a round of FD inference. With recent advancements in integer programming solver software, our formulation can be practical for real-world use on instances of moderate size. From a theory perspective, it serves as the basis for designing approximation algorithms.

  • Approximation algorithms: We design two approximation algorithms based on the linear programming relaxation of our integer programming (IP) formulation. Both algorithms rely on solving a variant of the TCAND problem we introduce, parameterized by the number of rounds of inference D.

    Problem 2 (D-round-TCAND).

    Given a set of functional dependencies F and a set of target variables T ⊆ V, we want to find a least cardinality set of variables X ⊆ V whose closure X^+ includes T, i.e., X^+ ⊇ T, by performing at most D rounds of FD inference.

    Our approach is based on approximating Problem 2 for D=1; this gives a natural approximation algorithm for the D-round TCAND problem with an exponential dependence on D. It is worth noting that Problem 1 is a special case of Problem 2 by setting D=n as we explain later in detail. We also show that our approximation guarantee is asymptotically tight by studying the integrality gap of the LP relaxation.

  • Equivalence with the Red Blue Set Cover problem: We discover an equivalence between the TCAND problem and the Red Blue Set Cover problem [8], a variant of the Set Cover problem. This discovery holds dual significance for the TCAND problem. Firstly, it introduces an additional approximation algorithm developed by Chlamtáč et al. [11] for the Red Blue Set Cover problem, and secondly, it establishes an inapproximability result.

2 Preliminaries

Functional dependencies.

A relational database schema R, represented as R(A_1, …, A_n), consists of a set of attributes or variables. An instance of R, denoted by r(R), is a set of tuples, where each tuple is an element of the Cartesian product of the attributes' domains, i.e., r(R) ⊆ dom(A_1) × ⋯ × dom(A_n). We also use V = {A_1, …, A_n}, or simply V = [n], to denote the set of attributes/variables. Each element t ∈ r(R) is referred to as a tuple. Functional dependencies (FDs for short) are properties of the semantics of the attributes in R [17]. Specifically, an FD is a constraint between two sets of attributes X, Y ⊆ V. We say that X functionally determines Y, meaning that if two tuples have the same value for all the attributes in X, they must also have the same values for all the attributes in Y. This is denoted as X → Y. We refer to an FD X → Y as regular when |Y| = 1. Any irregular FD X → y_1 y_2 ⋯ y_k is equivalent to the set of regular FDs X → y_i for i = 1, …, k. An input set of FDs F may logically imply more FDs. For instance, the set F = {a → b, b → c} logically implies a → c. The set of all possible valid FDs for F is its closure F^+. The inference of valid FDs is performed using Armstrong's axioms, which are sound and complete [16]. In practice, |F| ≪ |F^+|. Finding a smaller set of FDs F′ than F such that F′^+ = F^+ is known as the canonical cover problem and can be solved efficiently [37]. Other compression schemes are also available, see [3]. The attribute closure X^+ is the set of attributes that are functionally determined by X. In contrast to the FD closure, computing the closure X^+ of a subset of attributes X ⊆ V is solvable in linear time [37].
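To make the linear-time claim concrete, the following is a minimal sketch of the standard counting-based attribute-closure computation; the encoding of FDs as (left side, right side) pairs is purely illustrative.

```python
# A minimal sketch of the linear-time attribute-closure computation.
# FDs are encoded as (left_side, rhs) pairs with |rhs| = 1, e.g.,
# ({'a', 'b'}, 'c') encodes the FD ab -> c. Names are illustrative.
from collections import deque

def attribute_closure(X, fds):
    closure = set(X)
    missing = [len(ls) for ls, _ in fds]    # unseen left-side attributes per FD
    watch = {}                              # attribute -> indices of FDs using it
    for k, (ls, _) in enumerate(fds):
        for a in ls:
            watch.setdefault(a, []).append(k)
    queue = deque(closure)
    while queue:
        a = queue.popleft()
        for k in watch.get(a, []):
            missing[k] -= 1
            if missing[k] == 0:             # the FD fires: its left side is covered
                rhs = fds[k][1]
                if rhs not in closure:
                    closure.add(rhs)
                    queue.append(rhs)
    return closure

# Example: F = {a -> b, b -> c}, so {a}+ = {a, b, c}.
print(attribute_closure({'a'}, [({'a'}, 'b'), ({'b'}, 'c')]))
```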

A key of a relation r is a subset X ⊆ V that satisfies two conditions: (i) uniqueness, meaning no two distinct tuples in r have identical values for the attributes in X, and (ii) minimality, meaning no proper subset of X satisfies the uniqueness property. When X is a key, the functional dependency (FD) X → V is valid, or equivalently, the closure of X, denoted X^+, is equal to V. A subset of attributes X is considered a super-key if it satisfies the uniqueness property but not the minimality property. If a relation has more than one minimal key, each of these keys is referred to as a candidate key of R. Finding the least cardinality key is NP-hard. Specifically, deciding if there exists a key of cardinality k is NP-complete via a straightforward reduction from vertex cover [32]. Furthermore, in terms of approximation algorithms, very little is known. Specifically, Akutsu and Bao [1] proved that the problem is at least as hard as the set cover problem [18], but they do not discuss algorithmic upper bounds.

Set Cover and Red Blue Set Cover.

The Set Cover problem is a quintessential problem in computer science, renowned for its wide applicability and fundamental role in computational complexity theory. It was one of Karp's 21 NP-complete problems [24], serving as a cornerstone for the study of approximation algorithms and computational intractability. The problem is defined as follows: given a universe U = {u_1, u_2, …, u_n} and a collection of subsets S = {S_1, S_2, …, S_m} where each S_i ⊆ U, the Set Cover problem seeks to find a minimum subset C ⊆ [m] of set indices such that ⋃_{i∈C} S_i = U. Feige showed that the Set Cover problem cannot be approximated in polynomial time to within a factor of (1 − o(1)) ln n unless NP has quasi-polynomial time algorithms [18]. This inapproximability result was further improved by Dinur and Steurer, who showed optimal inapproximability by proving that it cannot be approximated to (1 − o(1)) ln n unless P = NP [14]. We use the latter result in Theorem 6. The Set Cover problem admits an approximation within a factor of O(log n) utilizing either a straightforward greedy strategy or a randomized algorithm based on linear programming (LP) rounding techniques. Additionally, there is a deterministic LP-based algorithm that guarantees an f-factor approximation, with f denoting the maximum frequency with which an element of the universe appears in the set collection S [42]. The Set Cover problem is not only fundamental in computer science but also has a wide range of applications, as discussed in [13].
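For concreteness, a minimal sketch of the greedy strategy follows; the instance encoding is illustrative.

```python
# A minimal sketch of the greedy O(log n)-approximation for Set Cover.
def greedy_set_cover(universe, sets):
    """universe: set of elements; sets: dict mapping a set name to its elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set covering the largest number of still-uncovered elements.
        best = max(sets, key=lambda name: len(sets[name] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the universe is not coverable")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

# Example with U = {1,...,5}: the greedy picks are S1, then S3.
print(greedy_set_cover({1, 2, 3, 4, 5},
                       {"S1": {1, 2, 3}, "S2": {3, 4}, "S3": {4, 5}}))
```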

As we show, a variant of the Set Cover problem that plays an important role for the TCAND problem is the Red Blue Set Cover problem introduced by Carr et al. [8]: given a universe U = R ∪ B, where R and B are disjoint sets representing red and blue elements respectively, and a collection of subsets S = {S_1, S_2, …, S_m} where each S_i ⊆ U, the Red Blue Set Cover problem seeks to find a subset C ⊆ S such that ⋃_{S_i∈C} S_i ⊇ B, i.e., all blue elements are covered, while the number of covered red elements |(⋃_{S_i∈C} S_i) ∩ R| is minimized. Recently, Chlamtáč et al. [11] proved the following state-of-the-art approximation result for the Red Blue Set Cover problem.

Theorem 1 (Chlamtáč et al. [11]).

There exists an O(m^{1/3} log^{4/3} n · log k)-approximation algorithm for the Red Blue Set Cover problem, where m is the number of sets, n is the number of red elements, and k is the number of blue elements.

To discuss the inapproximability of the Red Blue Set Cover problem, we need to introduce the Dense-vs-Random Conjecture [7]. Given a graph G(V,E), the density of a subgraph induced by S ⊆ V is defined as ρ(S) = e(S)/|S|, which is the ratio of the number of edges to the number of nodes in the subgraph [20]. The densest k-subgraph (DkS) problem asks for the densest subgraph with exactly k nodes; it is NP-hard, with the best-known approximation ratio being Ω(1/n^{1/4+ε}) for any ε > 0 [7]. This approximability result is far off from the best-known hardness result, which assumes the Exponential Time Hypothesis (ETH): if ETH holds, then DkS cannot be approximated within a ratio of n^{1/(log log n)^c} for some c > 0 [33]. Define the log-density of a graph with n nodes as log_n(D_avg), where D_avg represents the average degree. The Dense-vs-Random Conjecture on graphs [9] states that it is hard to distinguish between the following two cases: 1) G = G(n,p) where p = n^{α−1} (and thus the graph has log-density concentrated around α), and 2) G is adversarially chosen so that the densest k-subgraph has log-density β, where k^β ≫ pk (and thus the average degree inside this subgraph is approximately k^β). In this context, G(n,p) represents the Erdős–Rényi random graph model [15, 19].

Conjecture 2 ([7, 10, 9]).

For all 0 < α < 1, for all sufficiently small ε > 0, and for all k ≤ √n, we cannot solve Dense vs Random with log-density α and planted log-density β in polynomial time (w.h.p.) when β < α − ε.

Using an extension of the above conjecture on hypergraphs, Chlamtáč et al. [11] proved the following inapproximability result.

Theorem 3 (Chlamtáč et al. [11]).

Assuming the Hypergraph Dense-vs-Random Conjecture, for every ε > 0, no polynomial-time algorithm achieves better than an O(m^{1/4−ε} log^2 k) approximation for the Red Blue Set Cover problem, where m is the number of sets and k is the number of blue elements.

Integrality Gap.

The integrality gap of an integer program is the worst-case ratio, over all instances of the problem, of the value of an optimal solution to the integer programming formulation to the value of an optimal solution to its linear programming relaxation [42, 41]. Notice that in the case of a minimization problem, the integrality gap satisfies

max_{instances Q} OPT_IP(Q) / OPT_LP(Q) ≥ 1.

This gap provides insight into how closely the LP relaxation approximates the IP and is a measure of the performance of approximation algorithms. A small integrality gap (i.e., close to 1) indicates that the LP relaxation is a good approximation of the IP, while a large gap suggests the LP relaxation might not yield a good approximation and that other approximation strategies may be needed. It is often used as a benchmark to understand the effectiveness of approximation algorithms and to determine the best possible approximation ratio that can be achieved by any polynomial-time algorithm for NP-hard problems. Lovász proved that the integrality gap of the straightforward LP formulation of Set Cover is O(log n) [31].

Equitable coloring.

In Section 3, we make use of the following theorem, known as the equitable coloring theorem, which was proved by Hajnal and Szemerédi [22].

Lemma 4 (Hajnal-Szemerédi [22]).

Every graph with n vertices and maximum vertex degree at most k is (k+1)-colorable with all color classes of size ⌊n/(k+1)⌋ or ⌈n/(k+1)⌉.

There exist efficient algorithms for finding such a coloring, e.g., [27, 28]. The best known algorithm, due to Kierstead et al. [28], runs in O(kn^2) time and returns such an equitable coloring. We apply the Hajnal–Szemerédi theorem to establish a Chernoff-type concentration result under conditions of limited dependencies, as demonstrated in works like [29, 36].

3 Proposed Methods

3.1 Integer Programming Formulation

Figure 1: Visual representation of IP (1).

We formulate the TCAND problem as an integer program (IP) that provides an optimal solution. This formulation serves as the basis of our LP-based approximation algorithms. Notationally, we define [k,n] := {k, k+1, …, n} and [n] = [1,n]. We assume, without loss of generality, that each FD's right-hand side has size exactly 1. Figure 1 visualizes the variables introduced by our IP. Specifically, we introduce a set of n^2 + n variables x_i^d for d ∈ [0,n], i ∈ [n]. The bottom level corresponds to the set of Boolean variables x_i^n. We constrain the variables in this layer corresponding to the target variables (blue dotted circles) to be equal to 1. In general, the set of variables {x_i^{n−D}} will indicate which variables we should include in our output if we allow for D rounds of inference, for D ∈ [n]. For any layer k = n, …, 1, a variable x_i^k will be set to 1 if and only if either x_i^{k−1} = 1 or there exists an FD of the form i_1, …, i_r → i where all the variables on the left side in the previous layer k−1 are equal to 1, i.e., x_{i_1}^{k−1} = ⋯ = x_{i_r}^{k−1} = 1. In this case, we shall say that the FD is activated. We can express this logic as

x_i^k = OR(x_i^{k−1}, AND(x_{i_1}^{k−1}, …, x_{i_r}^{k−1}), …),

where we have an AND term for each FD of the form i_1 ⋯ i_r → i, as linear constraints. The logical operators OR and AND can be easily expressed as linear constraints [43]. Towards this formulation, we introduce a new set of variables z_{LS}^d, each of which indicates whether an FD with left-hand side LS can be activated in round d, i.e., all the variables participating in LS in the previous layer are already set to 1. Putting those constraints together along with the objective function, we arrive at the following IP for D-round TCAND, where 1 ≤ D ≤ n.

minimize  Σ_{j=1}^{n} x_j^{n−D}
subject to  x_i^d ≤ x_i^{d−1} + Σ_{LS : LS→i} z_{LS}^d,   d ∈ [n−D+1, n], i ∈ [n]
            z_{LS}^d ≤ x_j^{d−1},   ∀ LS → i, j ∈ LS, d ∈ [n−D+1, n]
            x_i^n = 1,   ∀ i ∈ T ⊆ [n]    (1)

By setting D = n, the IP is guaranteed to return an optimal solution. This is because applying the FD rules at most n times is sufficient: if an iteration fails to activate any additional variables, subsequent iterations will not alter the outcome, and a variable, once set, does not change.
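As an illustration of how IP (1) can be handed to a solver, the following is a minimal sketch using PuLP with its default CBC backend (an assumed dependency; any IP solver works), with FDs encoded as (left side, right-hand side) pairs.

```python
# A sketch of IP (1) in PuLP; all names and the instance encoding are illustrative.
import pulp

def tcand_ip(n, fds, targets, D):
    """n attributes 0..n-1; fds: list of (LS, i) with LS a set of attributes
    and i the right-hand side; targets: set of attributes; D: rounds."""
    prob = pulp.LpProblem("TCAND", pulp.LpMinimize)
    layers = range(n - D, n + 1)
    x = {(i, d): pulp.LpVariable(f"x_{i}_{d}", cat="Binary")
         for i in range(n) for d in layers}
    z = {(k, d): pulp.LpVariable(f"z_{k}_{d}", cat="Binary")
         for k in range(len(fds)) for d in layers if d > n - D}
    # Objective: size of the selected seed set at the top layer n - D.
    prob += pulp.lpSum(x[i, n - D] for i in range(n))
    for d in layers:
        if d == n - D:
            continue
        for i in range(n):
            # x_i^d <= x_i^{d-1} + sum of z over FDs with right-hand side i
            prob += x[i, d] <= x[i, d - 1] + pulp.lpSum(
                z[k, d] for k, (ls, rhs) in enumerate(fds) if rhs == i)
        for k, (ls, rhs) in enumerate(fds):
            for j in ls:
                # z_LS^d <= x_j^{d-1} for every j in LS
                prob += z[k, d] <= x[j, d - 1]
    for i in targets:
        prob += x[i, n] == 1   # targets must be active at the bottom layer
    prob.solve(pulp.PULP_CBC_CMD(msg=0))
    return {i for i in range(n) if x[i, n - D].value() > 0.5}
```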

3.2 Simple FDs

Before addressing the general case of the problem, we focus on an important special case in this section. Notice that, without any loss of generality, the right side of any FD consists of a single attribute. We denote the set of all left sides of FDs as 𝒮 = {LS ⊆ V : (LS → i) ∈ F, i ∈ V}. If an FD has a single variable on the left side, we refer to this FD as simple. Otherwise, we refer to it as non-simple. We call a set of FDs F simple if all FDs in F are simple. If there exists at least one non-simple FD in F, we refer to F as regular or non-simple. For example, the set of FDs {a → b, b → c, c → d} is simple because all FDs are simple. The set of FDs {a → b, bc → d} is regular/non-simple since the FD bc → d is non-simple. We also define the following natural graph for any set of FDs. It is worth mentioning that similar notions exist in the literature, see [4] and [38, Definition 3].

Definition 5.

Let F be a set of FDs. We define the corresponding FD-graph G_F (or simply G) as follows: for each variable i appearing in F we create a vertex v_i. There exists a directed edge (v_i, v_j) when there exists an FD of the form X → y with i ∈ X, y = j.

Designing approximation algorithms is easier for simple FDs. Intuitively, when dealing with a simple set of FDs, the FD-graph precisely mirrors the input data. For example, if the FDs are simple and the FD-graph is strongly connected, any single variable i is an optimal solution for any target set T, as the closure of i is the set of all variables, i.e., {i}^+ = V. In general, however, the FD-graph may not be strongly connected. It is a well-known fact that any directed graph can be decomposed as a directed acyclic graph (DAG) of strongly connected components (SCCs) [40], and this decomposition requires linear time in the size of the graph. We use this fact to prove a refined approximation result and establish that the lower bound established by Akutsu and Bao [1] is tight, see also [18]. We present this result in the subsequent theorem.

Theorem 6.

Let F be a set of simple FDs and let g be the number of strongly connected components (SCCs) of G_F. Then there exists a polynomial time ln g-approximation algorithm for the TCAND problem. Furthermore, it cannot be approximated within (1 − o(1)) ln g unless P = NP.

Proof.

We define for each node v_i in the FD-graph G_F the set Reach(v_i) of nodes that are reachable from v_i. Then, S_i := Reach(v_i) ∩ T is the set of target variables reachable from v_i. Notice that for any v_i, v_j that belong to the same SCC, the sets S_i, S_j are the same, i.e., S_i = S_j. Thus, we define Ŝ_i, i = 1, …, g, to be the set of target variables reachable from each SCC. It is straightforward to observe that solving the Set Cover problem for the instance defined by {Ŝ_1 ∩ T, …, Ŝ_g ∩ T} and universe T yields the optimal solution. This allows us to use the well-known (ln g − ln ln g + Θ(1))-approximation greedy algorithm for Set Cover [39]. By Dinur and Steurer [14], the TCAND problem for simple FDs cannot be approximated to (1 − o(1)) ln g unless P = NP. Notice that the number of SCCs g can be significantly less than n. Furthermore, our proof directly yields that the above bound can be further tightened to the logarithm of the number of sources in the SCC DAG of the FD-graph, which is trivially upper bounded by the number of SCCs. This is because it suffices to pick at most one node from each strongly connected component of G_F that is a source; this can be any node of that component. We state this as the following corollary.

Corollary 7.

Let F be a set of simple FDs and let s be the number of source nodes in the SCC DAG of the FD-graph G_F. Then there exists a polynomial time ln s-approximation algorithm for the TCAND problem. Furthermore, it cannot be approximated within (1 − o(1)) ln s unless P = NP.
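The proof of Theorem 6 is constructive; the following is a minimal sketch using networkx (an assumed dependency), combining the SCC condensation with the greedy Set Cover step.

```python
# A sketch of the simple-FD algorithm from the proof of Theorem 6.
import networkx as nx

def simple_tcand(variables, fds, targets):
    """fds: list of (u, v) pairs, each encoding a simple FD u -> v."""
    G = nx.DiGraph()
    G.add_nodes_from(variables)
    G.add_edges_from(fds)
    C = nx.condensation(G)                 # DAG of strongly connected components
    # Targets reachable from each SCC (the sets S-hat_i in the proof).
    reach = {}
    for comp in C.nodes:
        downstream = {comp} | nx.descendants(C, comp)
        members = set().union(*(C.nodes[d]["members"] for d in downstream))
        reach[comp] = members & set(targets)
    # Greedy Set Cover over the SCC reach-sets, one representative node per SCC.
    uncovered, solution = set(targets), []
    while uncovered:
        best = max(C.nodes, key=lambda comp: len(reach[comp] & uncovered))
        if not reach[best] & uncovered:
            raise ValueError("targets not reachable from the given FDs")
        solution.append(next(iter(C.nodes[best]["members"])))
        uncovered -= reach[best]
    return solution
```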

3.3 LP Relaxation and Approximation Algorithms

Our building block for the D-round TCAND problem is solving the 1-round TCAND problem and then iterating this algorithm for D layers. Given its importance and for the reader’s convenience, we introduce it as a special case with simplified notation compared to the IP (1).

Single layer/1-round TCAND.

Our goal in the single-layer problem is to choose a set of variables S ⊆ [n] such that their one-step closure includes the target variables. By the term one-step closure, we mean that an attribute i is active if it is either included in S or if there exist attributes i_1, …, i_r such that they are all active (i.e., in S) and there exists an FD in F of the form i_1 i_2 ⋯ i_r → i. The bottom row with the squares in Figure 2 encodes the target attributes; for each attribute we have a Boolean variable x_i, and this is set to 1 for each target variable (blue filled squares); the rest may be 0 or 1 depending on which set of variables we activate on the top row. To encode the output set S we define Boolean variables y_i for i = 1, …, n. Our goal is to minimize the number of variables we include in S, or, in terms of the y variables, the sum Σ_{i=1}^n y_i. The constraint of covering the target variables is expressed as x_i = 1 for each target variable i ∈ T. The connection between the y and x variables, as explained also earlier, is expressed as x_i = OR(y_i, AND(y_{i_1}, …, y_{i_r}), …), where we include in the OR all FDs of the form i_1 ⋯ i_r → i as the AND of the corresponding left-side variables i_1, …, i_r. Figure 2 illustrates a single-layer version of the TCAND problem. Each column corresponds to an attribute i, i = 1, …, n.

Figure 2: The Boolean variables x_1, …, x_n denote whether an attribute i is active, i.e., included in the closure of the selected set of variables. The target variables are all set to 1.
LP Relaxation.

We explicitly express the linear programming relaxation of the integer program (1), incorporating a single layer or round of FD inference. In Sections 3.3.1 and 3.3.2 we present a deterministic and a randomized rounding algorithm based on the following LP relaxation, and we show how we apply them to the D-round TCAND problem. For simplicity, we assume, without any loss of generality, that the set of FDs F includes the valid FDs i → i for each i ∈ [n].

minimize  Σ_{j=1}^{n} y_j
subject to  y_i + Σ_{LS : LS→i} z_{LS} ≥ 1,   ∀ i ∈ T
            z_{LS} ≤ y_j,   ∀ j ∈ LS where LS → i, i ∈ T
            y_j, z_{LS} ∈ [0,1],   ∀ j ∈ V, ∀ LS of the form LS → i, i ∈ T    (2)

3.3.1 Deterministic rounding

Our deterministic rounding approximation algorithm for the single-round TCAND problem solves the LP relaxation (2) and outputs all attributes i for which the corresponding variable y_i is at least a certain threshold. The threshold value is determined by the input. Define f_i as the number of input FDs of the form X → i, i.e., f_i = |{X ⊆ V : X → i ∈ F}|. Let f denote the maximum value of f_i across all elements in V, i.e., f = max_{i∈V} f_i. In simple terms, f is the maximum number of FDs in our collection F that share the same right-hand side. The threshold value is set equal to 1/(f+1). This simple process is outlined for completeness as Algorithm 1, and the approximation guarantee is stated as Theorem 8.

Algorithm 1 (f+1)-approximation algorithm for the 1-round TCAND (Problem 2 with D = 1).
 Solve the LP relaxation (2)
 I ← {i : y_i ≥ 1/(f+1)}
 Output I
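As an illustration, a minimal sketch of Algorithm 1 using scipy's LP solver (an assumed dependency; any LP solver works) is given below; the instance encoding is illustrative.

```python
# A sketch of Algorithm 1: solve LP (2), then threshold at 1/(f+1).
import numpy as np
from scipy.optimize import linprog

def deterministic_round(n, fds, targets):
    """n attributes 0..n-1; fds: list of (LS, i) with LS a set and i the rhs."""
    fds = [(set(ls), i) for ls, i in fds if i in targets]  # only useful FDs in 1 round
    nv = n + len(fds)                # variables: y_0..y_{n-1}, then one z per FD
    c = np.zeros(nv); c[:n] = 1.0    # minimize sum_j y_j
    A_ub, b_ub = [], []
    for i in targets:                # coverage: y_i + sum_{LS -> i} z_LS >= 1
        row = np.zeros(nv); row[i] = -1.0
        for k, (ls, rhs) in enumerate(fds):
            if rhs == i:
                row[n + k] = -1.0
        A_ub.append(row); b_ub.append(-1.0)
    for k, (ls, rhs) in enumerate(fds):   # z_LS <= y_j for every j in LS
        for j in ls:
            row = np.zeros(nv); row[n + k] = 1.0; row[j] = -1.0
            A_ub.append(row); b_ub.append(0.0)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0.0, 1.0)] * nv, method="highs")
    # f = maximum number of FDs sharing the same right-hand side
    f = max(sum(1 for _, rhs in fds if rhs == i) for i in targets)
    return {i for i in range(n) if res.x[i] >= 1.0 / (f + 1) - 1e-9}
```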
Theorem 8.

Algorithm 1 is an (f+1)-approximation for the 1-round TCAND problem.

Proof.
Feasibility.

First we prove that T ⊆ I^+. We note that for all i ∈ T the inequality

y_i + Σ_{LS : LS→i} z_{LS} ≥ 1

implies that at least one of the (at most) f+1 summands will be at least 1/(f+1). If that summand is y_i, then we add i to I, and thus i is trivially in the closure I^+. Otherwise, z_{LS} ≥ 1/(f+1) for some LS, and due to the linear constraints, y_j ≥ 1/(f+1) for all j ∈ LS. This means that all such j will be added to I, and therefore i will be in the closure I^+.

Approximation guarantees.

The cost of the solution is upper bounded as follows:

|I| ≤ (f+1) Σ_{i∈I} y_i ≤ (f+1)·OPT_LP ≤ (f+1)·OPT_IP.

This completes our proof.

We now state our main result for the D-round TCAND problem as Theorem 9. Our proof is constructive: to solve the D-round TCAND problem, we iteratively apply Algorithm 1 D times, allowing for D rounds of FD inference. Our algorithm is described as Algorithm 2. We return to the indexing notation previously used in Figure 1. Observe that in this case the objective becomes the minimization of Σ_{j=1}^n x_j^{n−D}, where D ≤ n. Next, we present Theorem 9, which demonstrates how the approximation algorithm we developed for the 1-round TCAND can be applied to the D-round TCAND.

Algorithm 2 (f+1)^D-approximation algorithm for the D-round TCAND (Problem 2).
 Solve the LP relaxation (1) to obtain fractional values for all variables x, z
 I ← {i : x_i^{n−D} ≥ 1/(f+1)^D}
 Output I
Theorem 9 (Approximating TCAND).

There exists a polynomial time (f+1)^D-approximation algorithm for the D-round TCAND problem.

Proof.

Similar to the proof of Theorem 8, we first establish feasibility, ensuring that the closure of the output set I contains all target variables T, and then we prove the approximation guarantee.

Feasibility.

For each target variable i ∈ T, we have x_i^n = 1. By the LP inequality,

1 = x_i^n ≤ x_i^{n−1} + Σ_{LS : LS→i} z_{LS}^{n−1},    (3)

and with the same reasoning as in Theorem 8, either x_i^{n−1} ≥ 1/(f+1) or z_{LS}^{n−1} ≥ 1/(f+1) for some LS → i. The latter inequality yields that

x_j^{n−1} ≥ 1/(f+1) for all j ∈ LS.

Thus, one of the sets {x_i^{n−1}}, {x_j^{n−1}}_{j∈LS} for some FD LS → i must have all its elements at least 1/(f+1). We apply the same reasoning to the previous layer n−2, with the difference that the left-hand side of Inequality (3) is now 1/(f+1). This implies that for a variable x_j^{n−1} that is at least 1/(f+1), either x_j^{n−2} ≥ 1/(f+1) × 1/(f+1) = 1/(f+1)^2, or x_{j′}^{n−2} ≥ 1/(f+1)^2 for all j′ ∈ LS′ for some LS′ → j with z_{LS′}^{n−2} ≥ 1/(f+1)^2. Using backwards induction and the same averaging argument, we obtain that I is a feasible solution.

Approximation guarantees.

We simply observe that the output size |I| satisfies

|I| ≤ (f+1)^D Σ_{i∈I} x_i^{n−D} ≤ (f+1)^D·OPT_LP ≤ (f+1)^D·OPT_IP,

which yields the desired bound.

3.3.2 Randomized rounding

Our randomized algorithm relies again on solving the LP relaxation (2) to obtain the {y_i}_{i∈[n]} values, and on the KKMS algorithm [28] as a subroutine for finding an equitable Hajnal–Szemerédi partition as described in Lemma 4. The algorithm is outlined in pseudocode as Algorithm 3. Let 𝒮 be the set of all left-side sets of variables that appear in F, i.e., 𝒮 = {LS : LS → i ∈ F}. Define Δ = max_{LS∈𝒮} |{LS′ ∈ 𝒮 : LS ∩ LS′ ≠ ∅}| to be the maximum number of FDs whose left side shares at least one common attribute with the left side of any given FD in F. The algorithm initializes a set OUT that will contain a set of variables whose closure contains the target set T with high probability. The algorithm considers each target element separately. To determine which variables we include in the set OUT, we use the constructive polynomial time algorithm for the Hajnal–Szemerédi Lemma 4. This allows us to partition the FDs into sets whose left sides share no attribute. This ensures that the joint distribution of those FDs factors over the individual left sides due to independence; we elaborate more on this in the following paragraph. Among those sets we choose one set with the property that the sum of its z_{LS} values is at least 1/(Δ+1); such a set is guaranteed to exist by a simple averaging argument.

Specifically, let us consider the solution to the 1-round TCAND LP relaxation, yielding fractional values y_1, …, y_n. Focusing on a specific target element t ∈ T, we define F_t = {LS ∈ 𝒮 : LS → t} as the set of FDs with variable t on their right-hand side. Viewing the left sides of these FDs as a collection of hyperedges, our objective is to randomly select attributes such that at least one hyperedge "survives" after sampling, namely all of its variables are included in OUT. This ensures that t is included in the closure of the randomized output. As mentioned earlier, analyzing this randomized procedure involves dependencies; the joint distribution of the survival of two hyperedges does not factor over the variables in them, since they may overlap, e.g., X → t and Y → t may share attributes, i.e., X ∩ Y ≠ ∅. This interdependence adds complexity to the analysis of a randomized rounding approach akin to the Set Cover algorithm [42, Section 1.7], and we address it using Lemma 4. We state our main result as the next theorem.

Algorithm 3 2(Δ+1) log n-approximation algorithm for the 1-round TCAND (Problem 2 with D = 1).
 Solve the LP relaxation (2) to obtain the {y_i}_{i∈[n]} values
 Compute Δ = max_{LS∈𝒮} |{LS′ ∈ 𝒮 : LS ∩ LS′ ≠ ∅}|
 OUT ← ∅
for each target element t ∈ T do
  Find a set S_j of FDs {LS → t} with the properties that (i) no two FDs share an attribute and (ii) Σ_{LS∈S_j} z_{LS} ≥ 1/(Δ+1), using the KKMS algorithm [28].
  For each variable k appearing in S_j, toss 2(Δ+1) log n coins, each with success probability y_k.
  if success at least once for variable k then
   OUT ← OUT ∪ {k}
  end if
end for
 Return OUT
Theorem 10.

There exists a polynomial-time c(Δ+1) log n-approximation algorithm that solves the 1-round TCAND problem with high probability, where c > 1 is a constant.

Proof.

Define E_i to be the bad event that target element i is not covered by Algorithm 3. Fix any target element i and consider a meta-graph G where each node represents the left side of some FD LS → i, and two nodes LS_j, LS_k are connected iff LS_j ∩ LS_k ≠ ∅. Recall that Δ = max_{LS} |{LS′ : LS ∩ LS′ ≠ ∅}|, and thus the maximum degree in G is upper-bounded by Δ. We invoke the equitable coloring Lemma 4 on G to obtain color classes S_1, …, S_{Δ+1} of size (essentially) n/(Δ+1). By grouping the terms z_{LS} according to color classes we obtain Σ_{j=1}^{Δ+1} Σ_{LS∈S_j} z_{LS} ≥ 1. For at least one of the color classes j, the summation term satisfies Σ_{LS∈S_j} z_{LS} ≥ 1/(Δ+1). Let j be such an index. Observe that all the FDs within S_j share no attributes, since by the equitable coloring theorem they form an independent set in the meta-graph G. Thus, we obtain from the independence of the coin tossing

Pr[i not activated] = Π_{LS∈S_j} (1 − Π_{k∈LS} y_k) ≤ Π_{LS∈S_j} exp(−Π_{k∈LS} y_k) = exp(−Σ_{LS∈S_j} Π_{k∈LS} y_k) ≤ e^{−1/(Δ+1)}.

For each attribute j we toss a biased coin with success probability y_j, c(Δ+1) log n times independently, where c > 1 is a constant. If success occurs at least once, we include j in our output. The probability p_j that an element j is included in the output satisfies p_j = 1 − (1 − y_j)^{c(Δ+1) log n} ≤ c(Δ+1)(log n) y_j. It is straightforward to show that using this procedure

Pr[E_i] ≤ e^{−c(log n)(Δ+1)/(Δ+1)} = 1/n^c.

Furthermore, the expected cost of Algorithm 3 is upper bounded as follows:

E[OUTPUT COST] ≤ Σ_{i=1}^n p_i ≤ Σ_{i=1}^n c(Δ+1)(log n) y_i = c(Δ+1)(log n)·OPT_LP ≤ c(Δ+1)(log n)·OPT_IP.

Since Pr[∃ i ∈ T not activated] = Pr[⋃_i E_i], by a union bound we conclude that all target variables are activated with high probability for any constant c > 1:

Pr[⋃_i E_i] ≤ |T| · max_{i∈T} Pr[i not activated] ≤ n · (1/n^c) = o(1).

Using the rule of conditional probability and the fact that the good event ⋂_i Ē_i (i.e., all target variables are covered) holds with high probability, we obtain the desired result:

E[OUTPUT COST | ⋂_i Ē_i] ≤ O((Δ+1) log n)·OPT_IP.

We apply our randomized procedure for D layers in the same manner as the deterministic algorithm, achieving an approximation guarantee of (c(Δ+1) log n)^D. The only distinction from the deterministic approximation algorithm is ensuring a success probability of 1 − o(1). This requirement is easily met, as the failure probability for a single application of the FD rules is 1/n^c for some sufficiently large constant c. By applying a union bound over the D rounds, since D ≤ n, we achieve the desired result.
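For illustration, the following sketches the per-target randomized rounding; as a simplification, it substitutes a greedy (Δ+1)-coloring of the meta-graph for the KKMS equitable-coloring subroutine, which still yields at most Δ+1 independent classes and hence, by the averaging argument, a class with z-mass at least 1/(Δ+1). All names are illustrative, and Δ is computed per target rather than globally.

```python
# A sketch of the coin-tossing step of Algorithm 3 for a single target t.
import math, random

def round_target(t, fds, y, z, n, c=2.0):
    """fds: list of frozensets, the left sides of FDs with right-hand side t;
    y: dict attr -> LP value; z: dict FD index -> LP value; n: #attributes."""
    if not fds:
        return set()
    # Greedily color the meta-graph (two FDs adjacent iff left sides intersect).
    color = {}
    for k, ls in enumerate(fds):
        used = {color[j] for j, ls2 in enumerate(fds[:k]) if ls & ls2}
        color[k] = min(set(range(len(fds) + 1)) - used)
    delta = max(sum(1 for ls2 in fds if ls & ls2) - 1 for ls in fds)
    # Pick the color class with the largest z-mass (>= 1/(delta+1) by averaging).
    classes = {}
    for k, col in color.items():
        classes.setdefault(col, []).append(k)
    best = max(classes.values(), key=lambda ks: sum(z[k] for k in ks))
    # Toss c*(delta+1)*log n biased coins per variable of the chosen class.
    out = set()
    trials = max(1, math.ceil(c * (delta + 1) * math.log(n)))
    for k in best:
        for attr in fds[k]:
            if any(random.random() < y[attr] for _ in range(trials)):
                out.add(attr)
    return out
```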

Remarks.

Solving the TCAND problem using an integer program (IP) can be a practical approach for small to medium-scale instances. In a query optimizer, where speed is crucial, this method can be effectively applied to smaller instances. Between the two approximation algorithms we presented, the deterministic algorithm is more straightforward to implement, as the randomized algorithm relies on the complex algorithm due to Kierstead et al. [28] for finding an equitable coloring on a meta-graph of the left sides of FDs. From an approximation guarantee perspective, the two algorithms cannot be directly compared, due to their different parameterizations. While the parameters clearly cannot take arbitrary values (e.g., the number of FDs |F| is upper bounded by nf, given that each variable i ∈ [n] appears on the right-hand side of at most f FDs, and Δ ≤ |F| − 1), the values f+1 and (Δ+1) log n can each be larger or smaller depending on the specific instance. We present two extreme scenarios to illustrate this claim, though such situations are unlikely to occur in practice. For example, when f = Θ(|F|) = O(n) (e.g., a constant number of variables appear on the right side of f = O(n) FDs each) and Δ = O(1) (e.g., the left sides are singleton sets), the value (Δ+1) log n is asymptotically less than f+1. At the other extreme, there exist instances where Δ = Θ(|F|) (e.g., there exists one variable common to all the left sides of the FDs) and f is small, making the value f+1 smaller than (Δ+1) log n. In relatively large practical instances, these extreme scenarios do not occur. However, the parameters generally lean towards the deterministic algorithm side, as Δ tends to be large due to the presence of a few common variables on the left side of most FDs, while f ≪ |F|. For these reasons, we find the deterministic algorithm to be more practical. However, since it requires solving an LP, there remains room for developing well-performing heuristics for systems.

3.3.3 1-round TCAND is equivalent to Red Blue Set Cover

We discover that the general case (regular FDs) of the 1-round TCAND problem is equivalent to a variant of the Set Cover problem referred to as Red Blue Set Cover [21, 2, 8]. Our discovery is important for two reasons: (i) from the equivalence reduction we obtain a new algorithm from [11], and (ii) we obtain an inapproximability result showing that polynomial time very likely restricts the approximation factor of the problem to roughly |F|^{1/4} at best. It is important to note that the number of input FDs, denoted as |F|, can range from a constant to an exponential function of n. Therefore, our bound does not directly provide a limit based solely on n. Instead, it offers a new form of approximation guarantee. Our findings are summarized in the following theorem.

Theorem 11 (Algorithm and Inapproximability of 1-round TCAND).

The 1-round TCAND problem admits a polynomial time Õ(|F|^{1/3})-approximation algorithm and, additionally, does not admit a polynomial time algorithm with approximation ratio better than Õ(|F|^{1/4−ε}) unless the Dense-vs-Random conjecture [9] fails. The hardness result carries over to D-round TCAND for any D ∈ [n].

We now prove Theorem 11 by establishing an equivalence between the 1-round TCAND problem and Red Blue Set Cover, and then utilizing the recent progress in [11]. We adopt the same convention regarding the use of red/blue colors as in the description of the Red Blue Set Cover problem in Section 2.

Proof.

Given a collection of FDs F and a target set T, we create an instance of Red Blue Set Cover as follows. Firstly, for each element i ∈ T, we introduce a new variable i^(new). Our universe U will be composed of these new variables in addition to the original ones. The new variables will be colored blue, whereas the old variables will be colored red. Now, for every FD LS → i we create a set S_LS := LS ∪ {i^(new)} (note that due to the 1-round scenario, we may discard each FD of the form LS → i with i ∉ T, so i^(new) is well defined). We claim that any solution to Red Blue Set Cover of size k yields a solution to 1-round TCAND of size k and vice versa. Indeed, if we have a collection of sets 𝒮′ that covers all blue points and k red points, we may as well pick the variables corresponding to those red points as our solution. By definition, for every S_LS ∈ 𝒮′ we have covered LS, and since all blue points are covered, for every i ∈ T there exists an FD of the form LS → i with a totally covered LS, which corresponds exactly to T being inferred. Additionally, every solution to 1-round TCAND of size k can be mapped to a Red Blue Set Cover solution by noting that every chosen variable corresponds to a red point; thus we may pick all the sets all of whose red variables are chosen in the solution. These sets cover all the blue points, by the fact that all variables in T can be inferred from the original instance and every blue point corresponds to a variable in T. For the converse, consider an instance of Red Blue Set Cover where U := R ∪ B is the union of a red set R and a blue set B. We can construct an instance of 1-round TCAND as follows: for each set S := {e_1, …, e_t}, we create a functional dependency (FD) S∖B → S∩B. We then set T := B and solve the 1-round TCAND problem for this instance. Note that this process may generate FDs of the form ∅ → W for some non-empty set W ⊆ B.

A solution to the constructed 1-round TCAND instance of size k means that we pick only variables corresponding to red points, as every left-hand side is of the form S∖B ⊆ R, and in particular we pick at most k of them. Let R_sol be those points, and pick the sets S for which S∖B ⊆ R_sol. This means that the chosen sets do not cover red points outside of R_sol and hence cover at most k red points. Additionally, every blue point b ∈ B is covered by the definition of the target T: an activated FD S∖B → b means that we have picked the set S and b ∈ B participates in an activated FD. The other direction is analogous. Finally, note that the number of sets created in the above reductions equals the number of FDs, i.e., |F|. Invoking Theorems 1 and 3 by Chlamtáč et al. [11] yields the desired results.
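To make the forward direction concrete, the following sketch builds the Red Blue Set Cover instance from a 1-round TCAND instance exactly as in the proof (fresh blue copies of the targets, one set per useful FD); the encoding is illustrative.

```python
# A sketch of the reduction from 1-round TCAND to Red Blue Set Cover.
def tcand_to_red_blue(fds, targets):
    """fds: list of (LS, i) pairs with LS a set of attributes and i the right
    side; targets: the set T. Returns the blue elements and the set collection;
    the original attributes play the role of the red elements."""
    blue = {("new", t) for t in targets}          # fresh blue copy of each target
    sets = []
    for ls, i in fds:
        if i in targets:                          # FDs with i not in T are discarded
            sets.append(set(ls) | {("new", i)})   # S_LS := LS ∪ {i_new}
    return blue, sets
```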

4 Integrality Gaps

We complement Theorem 9 by showing that the integrality gap of the LP relaxation of the problem is at least ((f−1)/2)^{D−1}, which is close to the guarantee we achieve. Specifically, we prove the following results.

Theorem 12.

(Integrality gap for D-round-TCAND) For f ≥ 3, the integrality gap of the linear programming formulation for the D-round-TCAND problem is at least

((n/(D+1) − 1)/2)^D = Ω(n/D)^D.

To be more precise, it is at least ((f−1)/2)^{D−1}, where f := max_{i∈V} |{X ⊆ V : X → i}|, whenever f ≥ 3. The result holds even when each FD in F has a left-hand side with cardinality at most 2. For f = 2, the problem cannot be approximated within a factor of 2 − ε for any ε > 0, assuming the Unique Games Conjecture [26].

It is worth pointing out that the integrality gap is not an artifact of our modeling of the problem since, as we showed in Theorem 11, there exists no polynomial time algorithm that can achieve an approximation ratio better than |F|^{1/4−δ} with δ > 0 under the Dense-vs-Random Conjecture [9], even for D = 1. Before presenting the proof of Theorem 12, we first establish Theorem 13, which is a simpler version of the theorem. This preliminary result helps to elucidate the fundamental concept driving the overall proof. In essence, our proof is analogous to the integrality gap of the standard LP relaxation for vertex cover, which assigns each variable a value of 1/2 in a clique of n nodes. This approach results in an integrality gap of (n−1)/(n/2) → 2 as n → ∞ [42]. The adaptation to obtain the more general Theorem 12 is straightforward afterwards.

Theorem 13.

The integrality gap of our LP formulation for the D-round TCAND problem is at least 2^D.

Proof.

We create an instance of the D-round TCAND problem as follows. Consider the variable set V := V^(0) ∪ V^(1) ∪ ⋯ ∪ V^(D), where V^(r) := {(i, r)}_{i∈[5]}. Note that we use tuples as variable names. We insert FDs into F from some pairs of variables in V^(r) to variables in V^(r+1), as follows, for r = 0, …, D−1. For each variable (k, r+1) in V^(r+1), we create two FDs of the form (i, r), (j, r) → (k, r+1). Note that there are (5 choose 2) = 10 pairs of variables in layer r. These 10 pairs are then partitioned into 5 sets, each containing 2 pairs. Each set corresponds to the left-hand sides of the FDs for a specific variable in the next layer, V^(r+1). In other words, for each variable in V^(r+1), there are two pairs of variables from V^(r) that determine it through FDs. Finally, we set T = V^(D) as the target set for this instance. Notice that T ⊆ (V^(0))^+.

A feasible fractional solution to our LP formulation is to assign the fractional weight c := 2^{−D} to every variable in V^(0), i.e., set x_{(i,0)}^{(0)} = 2^{−D} for all i ∈ [5]. One can verify that z_{LS}^{(1)} := c is feasible for every LS → (i,1), and hence x_{(i,1)}^{(1)} can be as large as Σ_{LS→(i,1)} z_{LS}^{(1)} = 2c for all i, by the fact that every (i,r) variable can be inferred from exactly two FDs. Inductively, we have that x_{(i,d)}^{(d)} can be as large as 2^d c, and hence in the last layer we have x_{(i,D)}^{(D)} = 2^D c = 1, meaning that every element in the last layer, which is exactly our target set, satisfies the target constraint of our LP. This shows that the total fractional cost is 5·2^{−D}, whereas the integral cost is 5, yielding the 2^D integrality gap.

It is important to note that our proof employs an inductive approach. However, the underlying intuition becomes more apparent when we consider the process starting from the final layer D and moving backwards. In the final layer, each variable is a target variable, so its corresponding LP variable is set to 1 because of the target constraints. Given that each variable can be inferred from two FDs, an unfavorable yet feasible scenario for the LP involves equally distributing the weight between these two FDs, and this pattern continues in the preceding layers. The complete proof of Theorem 12 essentially extends this concept by adjusting the number of FDs associated with each variable.
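For concreteness, the following sketch generates the layered instance from the proof of Theorem 13; the particular way the 10 pairs of a layer are split into 5 groups of 2 is one illustrative choice.

```python
# A sketch of the Theorem 13 gap instance: D+1 layers of 5 variables each;
# every variable in layer r+1 is determined by two pairs from layer r.
from itertools import combinations

def gap_instance(D):
    fds, targets = [], [(k, D) for k in range(5)]   # T = V^(D)
    for r in range(D):
        pairs = list(combinations(range(5), 2))      # the 10 pairs of layer r
        for k in range(5):                            # variable (k, r+1)
            for (i, j) in pairs[2 * k: 2 * k + 2]:    # two FDs per variable
                fds.append(({(i, r), (j, r)}, (k, r + 1)))
    return fds, targets
```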

Proof of Theorem 12.

To obtain the general lower bound on the integrality gap, we note that it suffices to set V^(r) := {(i, r)}_{i∈[g]} for some parameter g. Then we can force each variable in V^(0), V^(1), …, V^(D) to participate on the right-hand side of (g choose 2)/g = (g−1)/2 FDs. Thus, the base of the exponent in the fractional magnification per level is (g−1)/2, in contrast to 2, and we may obtain both results by setting g := n/(D+1) (for a total of g(D+1) = n nodes) or g = f = max_{i∈V} |{X ⊆ V : X → i}| for any f ≥ 3.

For f = 2, we observe that the problem is as hard as the Vertex Cover problem, so it cannot be approximated within a factor of 2 − ε for any ε > 0, assuming the Unique Games Conjecture [25]. To prove the equivalence, given a graph G(V,E) we create a non-target variable x_v for each vertex v ∈ V and a target variable x_e for each edge e ∈ E, adding the FD x_u x_v → x_e if e := (u,v). We first observe that any vertex cover in G corresponds trivially to a set of variables which covers all target variables {x_e}_{e∈E}. For the other direction, from every set S of variables of size k from which {x_e}_{e∈E} can be inferred, we can create a set S̃ of size |S̃| ≤ |S| ≤ k by substituting each x_e ∈ S, e := (u,v), with x_u (if x_u is not already in the set), and noting that S̃ is a vertex cover for G. The above instances rule out any hope of designing efficient approximation algorithms for TCAND using linear-programming-based approaches.

5 Conclusion

In this work, we introduce the TCAND problem, a generalization of the well-known minimum candidate key problem [30, 32]. The TCAND problem plays an important role in semantic query optimization. We demonstrate that TCAND, in its general form, is a layered set-cover problem, with each layer representing a stage of FD inference using the given FDs. We formulate the TCAND as an integer program and explore its LP relaxation, both from the perspective of algorithm design and the analysis of integrality gaps. We also examine specific cases of the TCAND problem, such as scenarios where all FDs have at most one attribute on their left-hand side. In cases with one round of inference, the problem aligns with the Red Blue Set Cover, a variant of Set Cover known for its inapproximability. Our study opens a gateway to a host of compelling challenges, with the development of practical heuristics as a promising direction for future research.

References

  • [1] Tatsuya Akutsu and Feng Bao. Approximating minimum keys and optimal substructure screens. In International Computing and Combinatorics Conference, pages 290–299. Springer, 1996. doi:10.1007/3-540-61332-3_163.
  • [2] Michael Alekhnovich, Sam Buss, Shlomo Moran, and Toniann Pitassi. Minimum propositional proof length is NP-hard to linearly approximate. The Journal of Symbolic Logic, 66(1):171–191, 2001. doi:10.2307/2694916.
  • [3] Gabriele Ausiello, Alessandro D’Atri, and Domenico Saccà. Minimal representations of directed hypergraphs and their application to database design. In Algorithm design for computer system design, pages 125–157. Springer, 1984.
  • [4] Giorgio Ausiello, Alessandro D’Atri, and Domenico Saccà. Graph algorithms for functional dependency manipulation. Journal of the ACM (JACM), 30(4):752–766, 1983. doi:10.1145/2157.322404.
  • [5] Gautam Bhargava, Piyush Goel, and Bala Iyer. Efficient processing of outer joins and aggregate junctions. In Proceedings of the Twelfth International Conference on Data Engineering, pages 441–449. IEEE, 1996.
  • [6] Gautam Bhargava, Piyush Goel, and Balakrishna R Iyer. Simplification of outer joins. In Proceedings of the 1995 conference of the Centre for Advanced Studies on Collaborative research, page 7, 1995. URL: https://dl.acm.org/citation.cfm?id=781922.
  • [7] Aditya Bhaskara, Moses Charikar, Eden Chlamtáč, Uriel Feige, and Aravindan Vijayaraghavan. Detecting high log-densities: an O(n^{1/4}) approximation for densest k-subgraph. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 201–210, 2010. doi:10.48550/arXiv.1001.2891.
  • [8] Robert D. Carr, Srinivas Doddi, Goran Konjevod, and Madhav Marathe. On the red-blue set cover problem. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’00, pages 345–353, USA, 2000. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=338219.338271.
  • [9] Eden Chlamtáč, Michael Dinitz, and Yury Makarychev. Minimizing the union: Tight approximations for small set bipartite vertex expansion. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 881–899. SIAM, 2017. doi:10.1137/1.9781611974782.56.
  • [10] Eden Chlamtáč, Michael Dinitz, and Robert Krauthgamer. Everywhere-sparse spanners via dense subgraphs. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, pages 758–767. IEEE, 2012. doi:10.48550/arXiv.1205.0144.
  • [11] Eden Chlamtáč, Yury Makarychev, and Ali Vakilian. Approximating red-blue set cover and minimum monotone satisfying assignment. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2023). Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023.
  • [12] Xu Chu, Ihab F Ilyas, Paolo Papotti, and Yin Ye. Ruleminer: Data quality rules discovery. In 2014 IEEE 30th International Conference on Data Engineering, pages 1222–1225. IEEE, 2014. doi:10.1109/ICDE.2014.6816746.
  • [13] Graham Cormode, Howard Karloff, and Anthony Wirth. Set cover algorithms for very large datasets. In Proceedings of the 19th ACM international conference on Information and knowledge management, pages 479–488, 2010. doi:10.1145/1871437.1871501.
  • [14] Irit Dinur and David Steurer. Analytical approach to parallel repetition. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 624–633, 2014. doi:10.1145/2591796.2591884.
  • [15] Paul Erdős, Alfréd Rényi, et al. On the evolution of random graphs. Publ. math. inst. hung. acad. sci, 5(1):17–60, 1960.
  • [16] Ronald Fagin. Functional dependencies in a relational database and propositional logic. IBM Journal of research and development, 21(6):534–544, 1977.
  • [17] Ronald Fagin and Moshe Vardi. The theory of data dependencies—an overview. In International Colloquium on Automata, Languages, and Programming, pages 1–22. Springer, 1984.
  • [18] Uriel Feige. A threshold of ln n for approximating set cover. Journal of the ACM (JACM), 45(4):634–652, 1998. doi:10.1145/285055.285059.
  • [19] Alan Frieze and Michał Karoński. Random graphs and networks: a first course. Cambridge University Press, 2023.
  • [20] Aristides Gionis and Charalampos E Tsourakakis. Dense subgraph discovery: KDD 2015 tutorial. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2313–2314, 2015. doi:10.1145/2783258.2789987.
  • [21] Michael Goldwasser and Rajeev Motwani. Intractability of assembly sequencing: Unit disks in the plane. In Algorithms and Data Structures: 5th International Workshop, WADS’97 Halifax, Nova Scotia, Canada August 6–8, 1997 Proceedings 5, pages 307–320. Springer, 1997. doi:10.1007/3-540-63307-3_70.
  • [22] András Hajnal and Endre Szemerédi. Proof of a conjecture of P. Erdős. Combinatorial theory and its applications, 2:601–623, 1970.
  • [23] Ihab F Ilyas, Xu Chu, et al. Trends in cleaning relational data: Consistency and deduplication. Foundations and Trends® in Databases, 5(4):281–393, 2015. doi:10.1561/1900000045.
  • [24] Richard M Karp. Reducibility among combinatorial problems. Springer, 2010. doi:10.1007/978-3-540-68279-0_8.
  • [25] Subhash Khot and Oded Regev. Vertex cover might be hard to approximate to within 2 − ε. Journal of Computer and System Sciences, 74(3):335–349, 2008. doi:10.1109/CCC.2003.1214437.
  • [26] Subhash Khot and Nisheeth K Vishnoi. On the unique games conjecture. In FOCS, volume 5, page 3, 2005. doi:10.1109/SFCS.2005.61.
  • [27] Hal A Kierstead and Alexandr V Kostochka. A short proof of the Hajnal–Szemerédi theorem on equitable colouring. Combinatorics, Probability and Computing, 17(2):265–270, 2008. doi:10.1017/S0963548307008619.
  • [28] Henry A Kierstead, Alexandr V Kostochka, Marcelo Mydlarz, and Endre Szemerédi. A fast algorithm for equitable coloring. Combinatorica, 30(2):217–224, 2010. doi:10.1007/s00493-010-2483-5.
  • [29] Mihail N Kolountzakis, Gary L Miller, Richard Peng, and Charalampos E Tsourakakis. Efficient triangle counting in large graphs via degree-based vertex partitioning. Internet Mathematics, 8(1-2):161–185, 2012. doi:10.1080/15427951.2012.625260.
  • [30] Witold Lipski. Two NP-complete problems related to information retrieval. In International Conference on Fundamentals of Computation Theory, pages 452–458. Springer, 1977. doi:10.1007/978-3-662-40153-8_52.
  • [31] László Lovász. On the ratio of optimal integral and fractional covers. Discrete mathematics, 13(4):383–390, 1975. doi:10.1016/0012-365X(75)90058-8.
  • [32] Claudio L Lucchesi and Sylvia L Osborn. Candidate keys for relations. Journal of Computer and System Sciences, 17(2):270–279, 1978. doi:10.1016/0022-0000(78)90009-0.
  • [33] Pasin Manurangsi. Almost-polynomial ratio eth-hardness of approximating densest k-subgraph. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 954–961, 2017. doi:10.1145/3055399.3055412.
  • [34] Glenn N Paulley and Per-Åke Larson. Exploiting uniqueness in query optimization. In CASCON First Decade High Impact Papers, CASCON ’10, pages 127–145, USA, 2010. IBM Corp. doi:10.1145/1925805.1925812.
  • [35] Glenn Norman Paulley. Exploiting Functional Dependence in Query Optimization. PhD thesis, University of Waterloo, 2001.
  • [36] Sriram Pemmaraju and Aravind Srinivasan. The randomized coloring procedure with symmetry-breaking. In Automata, Languages and Programming: 35th International Colloquium, ICALP 2008, Reykjavik, Iceland, July 7-11, 2008, Proceedings, Part I 35, pages 306–319. Springer, 2008. doi:10.1007/978-3-540-70575-8_26.
  • [37] Raghu Ramakrishnan and Johannes Gehrke. Database management systems, volume 3. McGraw-Hill New York, 2003.
  • [38] Hossein Saiedian and Thomas Spencer. An efficient algorithm to compute the candidate keys of a relational database schema. The Computer Journal, 39(2):124–132, 1996. doi:10.1093/COMJNL/39.2.124.
  • [39] Petr Slavík. A tight analysis of the greedy algorithm for set cover. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pages 435–441, 1996. doi:10.1145/237814.237991.
  • [40] Robert Tarjan. Depth-first search and linear graph algorithms. SIAM journal on computing, 1(2):146–160, 1972. doi:10.1137/0201010.
  • [41] Vijay V Vazirani. Approximation algorithms, volume 1. Springer, 2001.
  • [42] David P Williamson and David B Shmoys. The design of approximation algorithms. Cambridge university press, 2011.
  • [43] Laurence A Wolsey. Integer programming. John Wiley & Sons, 2020.