**Published in:** LIPIcs, Volume 290, 27th International Conference on Database Theory (ICDT 2024)

We study the fine-grained complexity of conjunctive queries with grouping and aggregation. For some common aggregate functions (e.g., min, max, count, sum), such a query can be phrased as an ordinary conjunctive query over a database annotated with a suitable commutative semiring. Specifically, we investigate the ability to evaluate such queries by constructing in log-linear time a data structure that provides logarithmic-time direct access to the answers ordered by a given lexicographic order. This task is nontrivial since the number of answers might be larger than log-linear in the size of the input, and so, the data structure needs to provide a compact representation of the space of answers.
In the absence of aggregation and annotation, past research provides a sufficient tractability condition on queries and orders. For queries without self-joins, this condition is not just sufficient, but also necessary (under conventional lower-bound assumptions in fine-grained complexity). We show that all past results continue to hold for annotated databases, assuming that the annotation itself is not part of the lexicographic order. On the other hand, we show infeasibility for the case of count-distinct, which does not have any efficient representation as a commutative semiring. We then investigate the ability to include the aggregate and annotation outcome in the lexicographic order. Among the hardness results, standing out as tractable is the case of a semiring with an idempotent addition, such as those of min and max. Notably, this case also captures count-distinct over a logarithmic-size domain.
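
The semiring phrasing of aggregates mentioned in this abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the names `Semiring`, `evaluate`, and the sample query `Q(x) :- R(x, y), S(y, z)` are assumptions chosen for illustration, and the nested-loop join only shows the semantics, not the log-linear direct-access structure the paper studies.

```python
from collections import namedtuple

# A commutative semiring: addition, multiplication, and their identities.
# (Hypothetical helper type, introduced only for this illustration.)
Semiring = namedtuple("Semiring", "add mul zero one")

COUNT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
MIN_PLUS = Semiring(min, lambda a, b: a + b, float("inf"), 0)   # idempotent addition
MAX_PLUS = Semiring(max, lambda a, b: a + b, float("-inf"), 0)  # idempotent addition


def evaluate(R, S, K):
    """Evaluate Q(x) :- R(x, y), S(y, z) over K-annotated relations.

    R and S map tuples to annotations. Joined tuples multiply their
    annotations; projecting away y and z adds the products per group x.
    """
    out = {}
    for (x, y), a in R.items():
        for (y2, _z), b in S.items():
            if y == y2:
                out[x] = K.add(out.get(x, K.zero), K.mul(a, b))
    return out


# Counting: annotate every fact with 1.
R = {("a", 1): 1, ("a", 2): 1, ("b", 2): 1}
S = {(1, "p"): 1, (2, "p"): 1, (2, "q"): 1}
counts = evaluate(R, S, COUNT)

# Min / max of a weight stored on S: annotate R with the multiplicative identity.
R0 = {t: 0 for t in R}
Sw = {(1, "p"): 5, (2, "p"): 3, (2, "q"): 7}
minima = evaluate(R0, Sw, MIN_PLUS)
maxima = evaluate(R0, Sw, MAX_PLUS)
```

In this sketch the group `a` has three joining pairs, so its count is 3, while min and max simply pick the smallest and largest weight among those pairs.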

Idan Eldar, Nofar Carmeli, and Benny Kimelfeld. Direct Access for Answers to Conjunctive Queries with Aggregation. In 27th International Conference on Database Theory (ICDT 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 290, pp. 4:1-4:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

@InProceedings{eldar_et_al:LIPIcs.ICDT.2024.4,
  author = {Eldar, Idan and Carmeli, Nofar and Kimelfeld, Benny},
  title = {{Direct Access for Answers to Conjunctive Queries with Aggregation}},
  booktitle = {27th International Conference on Database Theory (ICDT 2024)},
  pages = {4:1--4:20},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-312-6},
  ISSN = {1868-8969},
  year = {2024},
  volume = {290},
  editor = {Cormode, Graham and Shekelyan, Michael},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2024.4},
  URN = {urn:nbn:de:0030-drops-197861},
  doi = {10.4230/LIPIcs.ICDT.2024.4},
  annote = {Keywords: aggregate queries, conjunctive queries, provenance semirings, commutative semirings, annotated databases, direct access, ranking function, answer orderings, query classification}
}

**Published in:** LIPIcs, Volume 290, 27th International Conference on Database Theory (ICDT 2024)

When extracting a relation of spans (intervals) from a text document, a common practice is to filter out tuples of the relation that are deemed dominated by others. The domination rule is defined as a partial order that varies along different systems and tasks. For example, we may state that a tuple is dominated by tuples that extend it by assigning additional attributes, or assigning larger intervals. The result of filtering the relation would then be the skyline according to this partial order. As this filtering may remove most of the extracted tuples, we study whether we can improve the performance of the extraction by compiling the domination rule into the extractor.
To this aim, we introduce the skyline operator for declarative information extraction tasks expressed as document spanners. We show that this operator can be expressed via regular operations when the domination partial order can itself be expressed as a regular spanner, which covers several natural domination rules. Yet, we show that the skyline operator incurs a computational cost (under combined complexity). First, there are cases where the operator requires an exponential blowup on the number of states needed to represent the spanner as a sequential variable-set automaton. Second, the evaluation may become computationally hard. Our analysis more precisely identifies classes of domination rules for which the combined complexity is tractable or intractable.
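
As a point of reference for the filtering practice described in this abstract, the following Python sketch computes a skyline after extraction, which is the baseline the paper aims to improve on by compiling the domination rule into the extractor. The function names and the containment-based domination rule are assumptions made for illustration; spans are modelled as `(start, end)` pairs.

```python
def contained_in(small, large):
    """Example domination rule: `small` is dominated by a strictly larger
    interval `large` that contains it."""
    return small != large and large[0] <= small[0] and small[1] <= large[1]


def skyline(items, dominated_by):
    """Keep the items that no other item dominates (quadratic post-filter)."""
    return [
        t for t in items
        if not any(dominated_by(t, u) for u in items if u != t)
    ]


spans = [(0, 3), (0, 5), (2, 5), (6, 8)]
maximal_spans = skyline(spans, contained_in)
```

Here `(0, 3)` and `(2, 5)` are both contained in `(0, 5)`, so only `(0, 5)` and `(6, 8)` survive. Any other strict partial order can be passed as `dominated_by`.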

Antoine Amarilli, Benny Kimelfeld, Sébastien Labbé, and Stefan Mengel. Skyline Operators for Document Spanners. In 27th International Conference on Database Theory (ICDT 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 290, pp. 7:1-7:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

@InProceedings{amarilli_et_al:LIPIcs.ICDT.2024.7,
  author = {Amarilli, Antoine and Kimelfeld, Benny and Labb\'{e}, S\'{e}bastien and Mengel, Stefan},
  title = {{Skyline Operators for Document Spanners}},
  booktitle = {27th International Conference on Database Theory (ICDT 2024)},
  pages = {7:1--7:18},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-312-6},
  ISSN = {1868-8969},
  year = {2024},
  volume = {290},
  editor = {Cormode, Graham and Shekelyan, Michael},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2024.7},
  URN = {urn:nbn:de:0030-drops-197898},
  doi = {10.4230/LIPIcs.ICDT.2024.7},
  annote = {Keywords: Information Extraction, Document Spanners, Query Evaluation}
}

**Published in:** LIPIcs, Volume 290, 27th International Conference on Database Theory (ICDT 2024)

We propose and study a framework for quantifying the importance of the choices of parameter values to the result of a query over a database. These parameters occur as constants in logical queries, such as conjunctive queries. In our framework, the importance of a parameter is its SHAP score. This score is a popular instantiation of the game-theoretic Shapley value for measuring the importance of feature values in machine learning models. We make the case for the rationale of using this score by explaining the intuition behind SHAP, and by showing that we arrive at this score in two different, apparently opposing, approaches to quantifying the contribution of a parameter.
The application of the SHAP score requires two components in addition to the query and the database: (a) a probability distribution over the combinations of parameter values, and (b) a utility function that measures the similarity between the result for the original parameters and the result for hypothetical parameters. The main question addressed in the paper is the complexity of calculating the SHAP score for different distributions and similarity measures. We first address the case of probabilistically independent parameters. The problem is hard if we consider a fragment of queries that is hard to evaluate (as one would expect), and even for the fragment of acyclic conjunctive queries. In some cases, though, one can efficiently list all relevant parameter combinations, and then the SHAP score can be computed in polynomial time under reasonable general conditions. Also tractable is the case of full acyclic conjunctive queries for certain (natural) similarity functions. We extend our results to conjunctive queries with inequalities between variables and parameters. Finally, we discuss a simple approximation technique for the case of correlated parameters.

Martin Grohe, Benny Kimelfeld, Peter Lindner, and Christoph Standke. The Importance of Parameters in Database Queries. In 27th International Conference on Database Theory (ICDT 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 290, pp. 14:1-14:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

@InProceedings{grohe_et_al:LIPIcs.ICDT.2024.14,
  author = {Grohe, Martin and Kimelfeld, Benny and Lindner, Peter and Standke, Christoph},
  title = {{The Importance of Parameters in Database Queries}},
  booktitle = {27th International Conference on Database Theory (ICDT 2024)},
  pages = {14:1--14:17},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-312-6},
  ISSN = {1868-8969},
  year = {2024},
  volume = {290},
  editor = {Cormode, Graham and Shekelyan, Michael},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2024.14},
  URN = {urn:nbn:de:0030-drops-197966},
  doi = {10.4230/LIPIcs.ICDT.2024.14},
  annote = {Keywords: SHAP score, query parameters, Shapley value}
}

**Published in:** LIPIcs, Volume 255, 26th International Conference on Database Theory (ICDT 2023)

A path query extracts vertex tuples from a labeled graph, based on the words that are formed by the paths connecting the vertices. We study the computational complexity of measuring the contribution of edges and vertices to an answer to a path query, focusing on the class of conjunctive regular path queries. To measure this contribution, we adopt the traditional Shapley value from cooperative game theory. This value has been recently proposed and studied in the context of relational database queries and has uses in a plethora of other domains.
We first study the contribution of edges and show that the exact Shapley value is almost always hard to compute. Specifically, it is #P-hard to calculate the contribution of an edge whenever at least one (non-redundant) conjunct allows for a word of length three or more. In the case of regular path queries (i.e., no conjunction), the problem is tractable if the query has only words of length at most two; hence, this property fully characterizes the tractability of the problem. On the other hand, if we allow for an approximation error, then it is straightforward to obtain an efficient scheme (FPRAS) for an additive approximation. Yet, a multiplicative approximation is harder to obtain. We establish that in the case of conjunctive regular path queries, a multiplicative approximation of the Shapley value of an edge can be computed in polynomial time if and only if all query atoms are finite languages (assuming non-redundancy and conventional complexity limitations). We also study the analogous situation where we wish to determine the contribution of a vertex, rather than an edge, and establish complexity results of similar nature.

Majd Khalil and Benny Kimelfeld. The Complexity of the Shapley Value for Regular Path Queries. In 26th International Conference on Database Theory (ICDT 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 255, pp. 11:1-11:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

@InProceedings{khalil_et_al:LIPIcs.ICDT.2023.11,
  author = {Khalil, Majd and Kimelfeld, Benny},
  title = {{The Complexity of the Shapley Value for Regular Path Queries}},
  booktitle = {26th International Conference on Database Theory (ICDT 2023)},
  pages = {11:1--11:19},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-270-9},
  ISSN = {1868-8969},
  year = {2023},
  volume = {255},
  editor = {Geerts, Floris and Vandevoort, Brecht},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2023.11},
  URN = {urn:nbn:de:0030-drops-177535},
  doi = {10.4230/LIPIcs.ICDT.2023.11},
  annote = {Keywords: Path queries, regular path queries, graph databases, Shapley value}
}

**Published in:** LIPIcs, Volume 255, 26th International Conference on Database Theory (ICDT 2023)

A probabilistic database with attribute-level uncertainty consists of relations where cells of some attributes may hold probability distributions rather than deterministic content. Such databases arise, implicitly or explicitly, in the context of noisy operations such as missing data imputation, where we automatically fill in missing values, column prediction, where we predict unknown attributes, and database cleaning (and repairing), where we replace the original values due to detected errors or violation of integrity constraints. We study the computational complexity of problems that regard the selection of cell values in the presence of integrity constraints. More precisely, we focus on functional dependencies and study three problems: (1) deciding whether the constraints can be satisfied by any choice of values, (2) finding a most probable such choice, and (3) calculating the probability of satisfying the constraints. The data complexity of these problems is determined by the combination of the set of functional dependencies and the collection of uncertain attributes. We give full classifications into tractable and intractable complexities for several classes of constraints, including a single dependency, matching constraints, and unary functional dependencies.

Amir Gilad, Aviram Imber, and Benny Kimelfeld. The Consistency of Probabilistic Databases with Independent Cells. In 26th International Conference on Database Theory (ICDT 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 255, pp. 22:1-22:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

@InProceedings{gilad_et_al:LIPIcs.ICDT.2023.22,
  author = {Gilad, Amir and Imber, Aviram and Kimelfeld, Benny},
  title = {{The Consistency of Probabilistic Databases with Independent Cells}},
  booktitle = {26th International Conference on Database Theory (ICDT 2023)},
  pages = {22:1--22:19},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-270-9},
  ISSN = {1868-8969},
  year = {2023},
  volume = {255},
  editor = {Geerts, Floris and Vandevoort, Brecht},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2023.22},
  URN = {urn:nbn:de:0030-drops-177644},
  doi = {10.4230/LIPIcs.ICDT.2023.22},
  annote = {Keywords: Probabilistic databases, attribute-level uncertainty, functional dependencies, most probable database}
}

**Published in:** LIPIcs, Volume 186, 24th International Conference on Database Theory (ICDT 2021)

Regular expressions with capture variables, also known as "regex-formulas", extract relations of spans (intervals identified by their start and end indices) from text. In turn, the class of regular document spanners is the closure of the regex formulas under the Relational Algebra. We investigate the computational complexity of querying text by aggregate functions, such as sum, average, and quantile, on top of regular document spanners. To this end, we formally define aggregate functions over regular document spanners and analyze the computational complexity of exact and approximate computation. More precisely, we show that in a restricted case, all studied aggregate functions can be computed in polynomial time. In general, however, even though exact computation is intractable, some aggregates can still be approximated with fully polynomial-time randomized approximation schemes (FPRAS).

Johannes Doleschal, Noa Bratman, Benny Kimelfeld, and Wim Martens. The Complexity of Aggregates over Extractions by Regular Expressions. In 24th International Conference on Database Theory (ICDT 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 186, pp. 10:1-10:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

@InProceedings{doleschal_et_al:LIPIcs.ICDT.2021.10,
  author = {Doleschal, Johannes and Bratman, Noa and Kimelfeld, Benny and Martens, Wim},
  title = {{The Complexity of Aggregates over Extractions by Regular Expressions}},
  booktitle = {24th International Conference on Database Theory (ICDT 2021)},
  pages = {10:1--10:20},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-179-5},
  ISSN = {1868-8969},
  year = {2021},
  volume = {186},
  editor = {Yi, Ke and Wei, Zhewei},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2021.10},
  URN = {urn:nbn:de:0030-drops-137181},
  doi = {10.4230/LIPIcs.ICDT.2021.10},
  annote = {Keywords: Information extraction, document spanners, regular expressions, aggregation functions}
}

**Published in:** LIPIcs, Volume 186, 24th International Conference on Database Theory (ICDT 2021)

Quantifying the inconsistency of a database is motivated by various goals including reliability estimation for new datasets and progress indication in data cleaning. Another goal is to attribute to individual tuples a level of responsibility for the overall inconsistency, and thereby prioritize tuples in the explanation or inspection of dirt. Therefore, inconsistency quantification and attribution have been a subject of much research in Knowledge Representation and, more recently, in Databases. As in many other fields, a conventional responsibility sharing mechanism is the Shapley value from cooperative game theory. In this paper, we carry out a systematic investigation of the complexity of the Shapley value in common inconsistency measures for functional-dependency (FD) violations. For several measures we establish a full classification of the FD sets into tractable and intractable classes with respect to Shapley-value computation. We also study the complexity of approximation in intractable cases.

Ester Livshits and Benny Kimelfeld. The Shapley Value of Inconsistency Measures for Functional Dependencies. In 24th International Conference on Database Theory (ICDT 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 186, pp. 15:1-15:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

@InProceedings{livshits_et_al:LIPIcs.ICDT.2021.15,
  author = {Livshits, Ester and Kimelfeld, Benny},
  title = {{The Shapley Value of Inconsistency Measures for Functional Dependencies}},
  booktitle = {24th International Conference on Database Theory (ICDT 2021)},
  pages = {15:1--15:19},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-179-5},
  ISSN = {1868-8969},
  year = {2021},
  volume = {186},
  editor = {Yi, Ke and Wei, Zhewei},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2021.15},
  URN = {urn:nbn:de:0030-drops-137230},
  doi = {10.4230/LIPIcs.ICDT.2021.15},
  annote = {Keywords: Shapley value, inconsistent databases, functional dependencies, database repairs}
}

**Published in:** LIPIcs, Volume 186, 24th International Conference on Database Theory (ICDT 2021)

A common interpretation of soft constraints penalizes the database for every violation of every constraint, where the penalty is the cost (weight) of the constraint. A computational challenge is that of finding an optimal subset: a collection of database tuples that minimizes the total penalty when each tuple has a cost of being excluded. When the constraints are strict (i.e., have an infinite cost), this subset is a "cardinality repair" of an inconsistent database; in soft interpretations, this subset corresponds to a "most probable world" of a probabilistic database, a "most likely intention" of a probabilistic unclean database, and so on. Within the class of functional dependencies, the complexity of finding a cardinality repair is thoroughly understood. Yet, very little is known about the complexity of finding an optimal subset for the more general soft semantics. This paper makes significant progress in this direction. In addition to general insights about the hardness and approximability of the problem, we present algorithms for two special cases: a single functional dependency, and a bipartite matching. The latter is the problem of finding an optimal "almost matching" of a bipartite graph where a penalty is paid for every lost edge and every violation of monogamy.
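
The optimal-subset objective described in this abstract can be made concrete with a brute-force Python sketch for a single soft functional dependency. This is an exponential-time illustration, not one of the paper's algorithms. It assumes that a violation is a pair of kept tuples that agree on the left-hand side and differ on the right-hand side, that weights are finite, and that the names `violations` and `optimal_subset` are hypothetical.

```python
from itertools import combinations


def violations(kept, lhs, rhs):
    """Count pairs of kept tuples that agree on column lhs but differ on rhs."""
    return sum(
        1
        for s, t in combinations(kept, 2)
        if s[lhs] == t[lhs] and s[rhs] != t[rhs]
    )


def optimal_subset(tuples, deletion_cost, fd_weight, lhs=0, rhs=1):
    """Try every subset; the cost is the total cost of excluded tuples plus
    fd_weight for each violating pair that remains (finite weights only)."""
    indices = range(len(tuples))
    best_cost, best_kept = None, None
    for k in range(len(tuples) + 1):
        for kept in combinations(indices, k):
            removed = sum(deletion_cost[i] for i in indices if i not in kept)
            penalty = fd_weight * violations([tuples[i] for i in kept], lhs, rhs)
            cost = removed + penalty
            if best_cost is None or cost < best_cost:
                best_cost, best_kept = cost, kept
    return best_cost, [tuples[i] for i in best_kept]


rows = [("a", 1), ("a", 2), ("a", 2)]
costs = [3, 3, 3]
cheap_fd = optimal_subset(rows, costs, fd_weight=1)   # tolerating violations wins
costly_fd = optimal_subset(rows, costs, fd_weight=5)  # deleting the odd tuple wins
```

With a light constraint weight it is cheaper to keep all three rows and pay for two violating pairs; with a heavy weight the optimum drops the row that disagrees with the other two.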

Nofar Carmeli, Martin Grohe, Benny Kimelfeld, Ester Livshits, and Muhammad Tibi. Database Repairing with Soft Functional Dependencies. In 24th International Conference on Database Theory (ICDT 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 186, pp. 16:1-16:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

@InProceedings{carmeli_et_al:LIPIcs.ICDT.2021.16,
  author = {Carmeli, Nofar and Grohe, Martin and Kimelfeld, Benny and Livshits, Ester and Tibi, Muhammad},
  title = {{Database Repairing with Soft Functional Dependencies}},
  booktitle = {24th International Conference on Database Theory (ICDT 2021)},
  pages = {16:1--16:17},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-179-5},
  ISSN = {1868-8969},
  year = {2021},
  volume = {186},
  editor = {Yi, Ke and Wei, Zhewei},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2021.16},
  URN = {urn:nbn:de:0030-drops-137245},
  doi = {10.4230/LIPIcs.ICDT.2021.16},
  annote = {Keywords: Database inconsistency, database repairs, integrity constraints, soft constraints, functional dependencies}
}

**Published in:** LIPIcs, Volume 186, 24th International Conference on Database Theory (ICDT 2021)

The reliability of a Boolean Conjunctive Query (CQ) over a tuple-independent probabilistic database is the probability that the CQ is satisfied when the tuples of the database are sampled one by one, independently, with their associated probability. For queries without self-joins (repeated relation symbols), the data complexity of this problem is fully characterized in a known dichotomy: reliability can be computed in polynomial time for hierarchical queries, and is #P-hard for non-hierarchical queries. Hierarchical queries also characterize the tractability of queries for other tasks: having read-once lineage formulas, supporting insertion/deletion updates to the database in constant time, and having a tractable computation of tuples' Shapley and Banzhaf values.
In this work, we investigate a fundamental counting problem for CQs without self-joins: how many sets of facts from the input database satisfy the query? This is equivalent to the uniform case of the query reliability problem, where the probability of every tuple is required to be 1/2. Of course, for hierarchical queries, uniform reliability is in polynomial time, like the reliability problem. However, it is an open question whether being hierarchical is necessary for the uniform reliability problem to be in polynomial time. In fact, the complexity of the problem has been unknown even for the simplest non-hierarchical CQs without self-joins.
We solve this open question by showing that uniform reliability is #P-complete for every non-hierarchical CQ without self-joins. Hence, we establish that being hierarchical also characterizes the tractability of unweighted counting of the satisfying tuple subsets. We also consider the generalization to query reliability where all tuples of the same relation have the same probability, and give preliminary results on the complexity of this problem.
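
The counting problem in this abstract can be stated in a few lines of Python. The sketch below counts satisfying fact subsets by exhaustive enumeration for the query `Q() :- R(x), S(x, y), T(y)`, a standard example of a non-hierarchical CQ without self-joins. It is exponential in the database size, consistent with the hardness result, and the fact encoding and function names are assumptions for illustration.

```python
from itertools import combinations


def satisfies(facts):
    """Check Q() :- R(x), S(x, y), T(y) on a collection of facts.

    Facts are tuples tagged with their relation name, e.g. ("S", 1, 2).
    """
    r_values = {f[1] for f in facts if f[0] == "R"}
    t_values = {f[1] for f in facts if f[0] == "T"}
    return any(
        f[0] == "S" and f[1] in r_values and f[2] in t_values for f in facts
    )


def count_satisfying_subsets(db):
    """Count the subsets of db on which the query holds (brute force)."""
    db = list(db)
    return sum(
        1
        for k in range(len(db) + 1)
        for subset in combinations(db, k)
        if satisfies(subset)
    )


def uniform_reliability(db):
    """Probability that Q holds when every fact is kept with probability 1/2."""
    return count_satisfying_subsets(db) / 2 ** len(db)


db = [("R", 1), ("S", 1, 2), ("T", 2), ("S", 1, 3), ("T", 3)]
satisfying = count_satisfying_subsets(db)
reliability = uniform_reliability(db)
```

For this five-fact database, `R(1)` must be present together with at least one of the pairs `S(1, 2), T(2)` or `S(1, 3), T(3)`, which gives 7 satisfying subsets out of 32.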

Antoine Amarilli and Benny Kimelfeld. Uniform Reliability of Self-Join-Free Conjunctive Queries. In 24th International Conference on Database Theory (ICDT 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 186, pp. 17:1-17:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

@InProceedings{amarilli_et_al:LIPIcs.ICDT.2021.17,
  author = {Amarilli, Antoine and Kimelfeld, Benny},
  title = {{Uniform Reliability of Self-Join-Free Conjunctive Queries}},
  booktitle = {24th International Conference on Database Theory (ICDT 2021)},
  pages = {17:1--17:17},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-179-5},
  ISSN = {1868-8969},
  year = {2021},
  volume = {186},
  editor = {Yi, Ke and Wei, Zhewei},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2021.17},
  URN = {urn:nbn:de:0030-drops-137252},
  doi = {10.4230/LIPIcs.ICDT.2021.17},
  annote = {Keywords: Hierarchical conjunctive queries, query reliability, tuple-independent database, counting problems, #P-hardness}
}

Invited Talk

**Published in:** LIPIcs, Volume 155, 23rd International Conference on Database Theory (ICDT 2020)

Probabilistic databases are commonly known in the form of the tuple-independent model, where the validity of every tuple is an independent random event. Conceptually, the notion is more general, as a probabilistic database refers to any probability distribution over ordinary databases. A central computational problem is that of marginal inference for database queries: what is the probability that a given tuple is a query answer? In this talk, I will discuss recent developments in several research directions that, collectively, position probabilistic databases as the common and natural foundation of various challenges at the core of data analytics. Examples include reasoning about uncertain preferences from conventional distributions such as the Mallows model, data cleaning and repairing in probabilistic paradigms such as the HoloClean system, and the explanation of query answers through concepts from cooperative game theory such as the Shapley value and the Banzhaf Power Index. While these challenges manifest different facets of probabilistic databases, I will show how they interrelate and, moreover, how they relate to the basic theory of inference over tuple-independent databases.

Benny Kimelfeld. Facets of Probabilistic Databases (Invited Talk). In 23rd International Conference on Database Theory (ICDT 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 155, p. 1:1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

@InProceedings{kimelfeld:LIPIcs.ICDT.2020.1,
  author = {Kimelfeld, Benny},
  title = {{Facets of Probabilistic Databases}},
  booktitle = {23rd International Conference on Database Theory (ICDT 2020)},
  pages = {1:1--1:1},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-139-9},
  ISSN = {1868-8969},
  year = {2020},
  volume = {155},
  editor = {Lutz, Carsten and Jung, Jean Christoph},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2020.1},
  URN = {urn:nbn:de:0030-drops-119258},
  doi = {10.4230/LIPIcs.ICDT.2020.1},
  annote = {Keywords: Probabilistic databases, data cleaning, preference models, Shapley value}
}

**Published in:** LIPIcs, Volume 155, 23rd International Conference on Database Theory (ICDT 2020)

The framework of document spanners abstracts the task of information extraction from text as a function that maps every document (a string) into a relation over the document’s spans (intervals identified by their start and end indices). For instance, the regular spanners are the closure under the Relational Algebra (RA) of the regular expressions with capture variables, and the expressive power of the regular spanners is precisely captured by the class of vset-automata, a restricted class of transducers that mark the endpoints of selected spans.
In this work, we embark on the investigation of document spanners that can annotate extractions with auxiliary information such as confidence, support, and confidentiality measures. To this end, we adopt the abstraction of provenance semirings by Green et al., where tuples of a relation are annotated with the elements of a commutative semiring, and where the annotation propagates through the (positive) RA operators via the semiring operators. Hence, the proposed spanner extension, referred to as an annotator, maps every string into an annotated relation over the spans. As a specific instantiation, we explore weighted vset-automata that, similarly to weighted automata and transducers, attach semiring elements to transitions. We investigate key aspects of expressiveness, such as the closure under the positive RA, and key aspects of computational complexity, such as the enumeration of annotated answers and their ranked enumeration in the case of numeric semirings. For a number of these problems, fundamental properties of the underlying semiring, such as positivity, are crucial for establishing tractability.

Johannes Doleschal, Benny Kimelfeld, Wim Martens, and Liat Peterfreund. Weight Annotation in Information Extraction. In 23rd International Conference on Database Theory (ICDT 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 155, pp. 8:1-8:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

@InProceedings{doleschal_et_al:LIPIcs.ICDT.2020.8,
  author = {Doleschal, Johannes and Kimelfeld, Benny and Martens, Wim and Peterfreund, Liat},
  title = {{Weight Annotation in Information Extraction}},
  booktitle = {23rd International Conference on Database Theory (ICDT 2020)},
  pages = {8:1--8:18},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-139-9},
  ISSN = {1868-8969},
  year = {2020},
  volume = {155},
  editor = {Lutz, Carsten and Jung, Jean Christoph},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2020.8},
  URN = {urn:nbn:de:0030-drops-119325},
  doi = {10.4230/LIPIcs.ICDT.2020.8},
  annote = {Keywords: Information extraction, regular document spanners, weighted automata, provenance semirings, K-relations}
}

**Published in:** LIPIcs, Volume 155, 23rd International Conference on Database Theory (ICDT 2020)

We investigate the application of the Shapley value to quantifying the contribution of a tuple to a query answer. The Shapley value is a widely known numerical measure in cooperative game theory and in many applications of game theory for assessing the contribution of a player to a coalition game. It was established in the 1950s and is theoretically justified as the unique wealth-distribution measure that satisfies some natural axioms. While this value has been investigated in several areas, it has received little attention in data management. We study this measure in the context of conjunctive and aggregate queries by defining corresponding coalition games. We provide algorithmic and complexity-theoretic results on the computation of Shapley-based contributions to query answers; and for the hard cases we present approximation algorithms.
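
To illustrate the coalition game behind this abstract, here is a small Python sketch that computes the Shapley value of a fact directly from the textbook subset formula. It is a brute-force illustration, not the paper's algorithm. The assumptions are that all facts are players, that the query is the Boolean CQ `Q() :- R(x), S(x)`, and that the game's value is 1 when the query holds on the coalition and 0 otherwise; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def query_holds(facts):
    """Value function: 1 if Q() :- R(x), S(x) holds on the coalition, else 0."""
    r_values = {f[1] for f in facts if f[0] == "R"}
    return int(any(f[0] == "S" and f[1] in r_values for f in facts))


def shapley(players, value, target):
    """Shapley value of `target`: the weighted average of its marginal
    contribution over all coalitions of the other players."""
    others = [p for p in players if p != target]
    n = len(players)
    total = Fraction(0)
    for k in range(len(others) + 1):
        weight = Fraction(factorial(k) * factorial(n - k - 1), factorial(n))
        for coalition in combinations(others, k):
            s = frozenset(coalition)
            total += weight * (value(s | {target}) - value(s))
    return total


facts = [("R", 1), ("S", 1), ("R", 2), ("S", 2)]
scores = {f: shapley(facts, query_holds, f) for f in facts}
```

The four facts play symmetric roles in this database, so each receives a score of 1/4, and the scores sum to the value of the full database, which is 1.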

Ester Livshits, Leopoldo Bertossi, Benny Kimelfeld, and Moshe Sebag. The Shapley Value of Tuples in Query Answering. In 23rd International Conference on Database Theory (ICDT 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 155, pp. 20:1-20:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

@InProceedings{livshits_et_al:LIPIcs.ICDT.2020.20,
  author = {Livshits, Ester and Bertossi, Leopoldo and Kimelfeld, Benny and Sebag, Moshe},
  title = {{The Shapley Value of Tuples in Query Answering}},
  booktitle = {23rd International Conference on Database Theory (ICDT 2020)},
  pages = {20:1--20:19},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-139-9},
  ISSN = {1868-8969},
  year = {2020},
  volume = {155},
  editor = {Lutz, Carsten and Jung, Jean Christoph},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2020.20},
  URN = {urn:nbn:de:0030-drops-119442},
  doi = {10.4230/LIPIcs.ICDT.2020.20},
  annote = {Keywords: Shapley value, query answering, conjunctive queries, aggregate queries}
}

Document

**Published in:** Dagstuhl Reports, Volume 9, Issue 5 (2019)

This report documents the program and the outcomes of Dagstuhl Seminar 19211 "Enumeration in Data Management". The goal of the seminar was to bring together researchers from various fields of computer science, including the Databases, Computational Logic, and Algorithms communities, and establish the means of collaboration towards considerable progress on the topic. Specifically, we aimed at understanding the recent developments, identifying the important open problems, and initiating collaborative efforts towards solutions thereof. In addition, we aimed to build and disseminate a toolkit for data-centric enumeration problems, including algorithmic techniques, proof techniques, and important indicator problems. Towards the objectives, the seminar included tutorials on the topic, invited talks, presentations of open problems, working groups on the open problems, discussions on platforms to compile the community knowledge, and the construction of various skeletons of such compilations.

Endre Boros, Benny Kimelfeld, Reinhard Pichler, and Nicole Schweikardt. Enumeration in Data Management (Dagstuhl Seminar 19211). In Dagstuhl Reports, Volume 9, Issue 5, pp. 89-109, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

@Article{boros_et_al:DagRep.9.5.89, author = {Boros, Endre and Kimelfeld, Benny and Pichler, Reinhard and Schweikardt, Nicole}, title = {{Enumeration in Data Management (Dagstuhl Seminar 19211)}}, pages = {89--109}, journal = {Dagstuhl Reports}, ISSN = {2192-5283}, year = {2019}, volume = {9}, number = {5}, editor = {Boros, Endre and Kimelfeld, Benny and Pichler, Reinhard and Schweikardt, Nicole}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/DagRep.9.5.89}, URN = {urn:nbn:de:0030-drops-113822}, doi = {10.4230/DagRep.9.5.89}, annote = {Keywords: constant delay, databases, dynamic complexity, enumeration, polynomial delay, query evaluation} }

Document

**Published in:** LIPIcs, Volume 127, 22nd International Conference on Database Theory (ICDT 2019)

Most theoretical frameworks that focus on data errors and inconsistencies follow logic-based reasoning. Yet, practical data cleaning tools need to incorporate statistical reasoning to be effective in real-world data cleaning tasks. Motivated by empirical successes, we propose a formal framework for unclean databases, where two types of statistical knowledge are incorporated: The first represents a belief of how intended (clean) data is generated, and the second represents a belief of how noise is introduced in the actual observed database. To capture this noisy channel model, we introduce the concept of a Probabilistic Unclean Database (PUD), a triple that consists of a probabilistic database that we call the intention, a probabilistic data transformator that we call the realization and captures how noise is introduced, and an observed unclean database that we call the observation. We define three computational problems in the PUD framework: cleaning (infer the most probable intended database, given a PUD), probabilistic query answering (compute the probability of an answer tuple over the unclean observed database), and learning (estimate the most likely intention and realization models of a PUD, given examples as training data). We illustrate the PUD framework on concrete representations of the intention and realization, show that they generalize traditional concepts of repairs such as cardinality and value repairs, draw connections to consistent query answering, and prove tractability results. We further show that parameters can be learned in some practical instantiations, and in fact, prove that under certain conditions we can learn a PUD directly from a single dirty database without any need for clean examples.

Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Ré, and Theodoros Rekatsinas. A Formal Framework for Probabilistic Unclean Databases. In 22nd International Conference on Database Theory (ICDT 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 127, pp. 6:1-6:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

@InProceedings{desa_et_al:LIPIcs.ICDT.2019.6, author = {De Sa, Christopher and Ilyas, Ihab F. and Kimelfeld, Benny and R\'{e}, Christopher and Rekatsinas, Theodoros}, title = {{A Formal Framework for Probabilistic Unclean Databases}}, booktitle = {22nd International Conference on Database Theory (ICDT 2019)}, pages = {6:1--6:18}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-101-6}, ISSN = {1868-8969}, year = {2019}, volume = {127}, editor = {Barcelo, Pablo and Calautti, Marco}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2019.6}, URN = {urn:nbn:de:0030-drops-103083}, doi = {10.4230/LIPIcs.ICDT.2019.6}, annote = {Keywords: Unclean databases, data cleaning, probabilistic databases, noisy channel} }

Document

**Published in:** LIPIcs, Volume 127, 22nd International Conference on Database Theory (ICDT 2019)

A document spanner models a program for Information Extraction (IE) as a function that takes as input a text document (string over a finite alphabet) and produces a relation of spans (intervals in the document) over a predefined schema. A well-studied language for expressing spanners is that of the regular spanners: relational algebra over regex formulas, which are regular expressions with capture variables. Equivalently, the regular spanners are the ones expressible in non-recursive Datalog over regex formulas (which extract relations that constitute the extensional database). This paper explores the expressive power of recursive Datalog over regex formulas. We show that such programs can express precisely the document spanners computable in polynomial time. We compare this expressiveness to known formalisms such as the closure of regex formulas under the relational algebra and string equality. Finally, we extend our study to a recently proposed framework that generalizes both the relational model and the document spanners.

Liat Peterfreund, Balder ten Cate, Ronald Fagin, and Benny Kimelfeld. Recursive Programs for Document Spanners. In 22nd International Conference on Database Theory (ICDT 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 127, pp. 13:1-13:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

@InProceedings{peterfreund_et_al:LIPIcs.ICDT.2019.13, author = {Peterfreund, Liat and Cate, Balder ten and Fagin, Ronald and Kimelfeld, Benny}, title = {{Recursive Programs for Document Spanners}}, booktitle = {22nd International Conference on Database Theory (ICDT 2019)}, pages = {13:1--13:18}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-101-6}, ISSN = {1868-8969}, year = {2019}, volume = {127}, editor = {Barcelo, Pablo and Calautti, Marco}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2019.13}, URN = {urn:nbn:de:0030-drops-103155}, doi = {10.4230/LIPIcs.ICDT.2019.13}, annote = {Keywords: Information Extraction, Document Spanners, Polynomial Time, Recursion, Regular Expressions, Datalog} }

Document

Complete Volume

**Published in:** LIPIcs, Volume 98, 21st International Conference on Database Theory (ICDT 2018)

LIPIcs, Volume 98, ICDT'18, Complete Volume

21st International Conference on Database Theory (ICDT 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 98, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

@Proceedings{kimelfeld_et_al:LIPIcs.ICDT.2018, title = {{LIPIcs, Volume 98, ICDT'18, Complete Volume}}, booktitle = {21st International Conference on Database Theory (ICDT 2018)}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-063-7}, ISSN = {1868-8969}, year = {2018}, volume = {98}, editor = {Kimelfeld, Benny and Amsterdamer, Yael}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2018}, URN = {urn:nbn:de:0030-drops-86795}, doi = {10.4230/LIPIcs.ICDT.2018}, annote = {Keywords: Information systems, Data management systems, Information systems, Database design and models, Information systems, Database query processing} }

Document

**Published in:** Dagstuhl Manifestos, Volume 7, Issue 1 (2018)

The area of Principles of Data Management (PDM) has made crucial contributions to the development of formal frameworks for understanding and managing data and knowledge. This work has involved a rich cross-fertilization between PDM and other disciplines in mathematics and computer science, including logic, complexity theory, and knowledge representation. We anticipate on-going expansion of PDM research as the technology and applications involving data management continue to grow and evolve. In particular, the lifecycle of Big Data Analytics raises a wealth of challenge areas that PDM can help with.
In this report we identify some of the most important research directions where the PDM community has the potential to make significant contributions. This is done from three perspectives: potential practical relevance, results already obtained, and research questions that appear surmountable in the short and medium term.

Serge Abiteboul, Marcelo Arenas, Pablo Barceló, Meghyn Bienvenu, Diego Calvanese, Claire David, Richard Hull, Eyke Hüllermeier, Benny Kimelfeld, Leonid Libkin, Wim Martens, Tova Milo, Filip Murlak, Frank Neven, Magdalena Ortiz, Thomas Schwentick, Julia Stoyanovich, Jianwen Su, Dan Suciu, Victor Vianu, and Ke Yi. Research Directions for Principles of Data Management (Dagstuhl Perspectives Workshop 16151). In Dagstuhl Manifestos, Volume 7, Issue 1, pp. 1-29, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

@Article{abiteboul_et_al:DagMan.7.1.1, author = {Abiteboul, Serge and Arenas, Marcelo and Barcel\'{o}, Pablo and Bienvenu, Meghyn and Calvanese, Diego and David, Claire and Hull, Richard and H\"{u}llermeier, Eyke and Kimelfeld, Benny and Libkin, Leonid and Martens, Wim and Milo, Tova and Murlak, Filip and Neven, Frank and Ortiz, Magdalena and Schwentick, Thomas and Stoyanovich, Julia and Su, Jianwen and Suciu, Dan and Vianu, Victor and Yi, Ke}, title = {{Research Directions for Principles of Data Management (Dagstuhl Perspectives Workshop 16151)}}, pages = {1--29}, journal = {Dagstuhl Manifestos}, ISSN = {2193-2433}, year = {2018}, volume = {7}, number = {1}, editor = {Abiteboul, Serge and Arenas, Marcelo and Barcel\'{o}, Pablo and Bienvenu, Meghyn and Calvanese, Diego and David, Claire and Hull, Richard and H\"{u}llermeier, Eyke and Kimelfeld, Benny and Libkin, Leonid and Martens, Wim and Milo, Tova and Murlak, Filip and Neven, Frank and Ortiz, Magdalena and Schwentick, Thomas and Stoyanovich, Julia and Su, Jianwen and Suciu, Dan and Vianu, Victor and Yi, Ke}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/DagMan.7.1.1}, URN = {urn:nbn:de:0030-drops-86772}, doi = {10.4230/DagMan.7.1.1}, annote = {Keywords: database theory, principles of data management, query languages, efficient query processing, query optimization, heterogeneous data, uncertainty, knowledge-enriched data management, machine learning, workflows, human-related data, ethics} }

Document

Front Matter

**Published in:** LIPIcs, Volume 98, 21st International Conference on Database Theory (ICDT 2018)

Front Matter, Table of Contents, Preface, Conference Organization

21st International Conference on Database Theory (ICDT 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 98, pp. 0:i-0:xvi, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

@InProceedings{kimelfeld_et_al:LIPIcs.ICDT.2018.0, author = {Kimelfeld, Benny and Amsterdamer, Yael}, title = {{Front Matter, Table of Contents, Preface, Conference Organization}}, booktitle = {21st International Conference on Database Theory (ICDT 2018)}, pages = {0:i--0:xvi}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-063-7}, ISSN = {1868-8969}, year = {2018}, volume = {98}, editor = {Kimelfeld, Benny and Amsterdamer, Yael}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2018.0}, URN = {urn:nbn:de:0030-drops-85938}, doi = {10.4230/LIPIcs.ICDT.2018.0}, annote = {Keywords: Front Matter, Table of Contents, Preface, Conference Organization} }

Document

**Published in:** LIPIcs, Volume 68, 20th International Conference on Database Theory (ICDT 2017)

In its traditional definition, a repair of an inconsistent database is a consistent database that differs from the inconsistent one in a "minimal way." Often, repairs are not equally legitimate, as it is desired to prefer one over another; for example, one fact is regarded as more reliable than another, or a more recent fact should be preferred to an earlier one.
Motivated by these considerations, researchers have introduced and investigated the framework of preferred repairs, in the context of denial constraints and subset repairs. There, a priority relation between facts is lifted towards a priority relation between consistent databases, and repairs are restricted to the ones that are optimal in the lifted sense.
Three notions of lifting (and optimal repairs) have been proposed: Pareto, global, and completion.
In this paper we investigate the complexity of deciding whether the priority relation suffices to clean the database unambiguously, or in other words, whether there is exactly one optimal repair. We show that the different lifting semantics entail highly different complexities. Under Pareto optimality, the problem is coNP-complete, in data complexity, for every set of functional dependencies (FDs), except for the tractable case of (equivalence to) one FD per relation. Under global optimality, one FD per relation is still tractable, but we establish Pi-2-p-completeness for a relation with two FDs. In contrast, under completion optimality the problem is solvable in polynomial time for every set of FDs. In fact, we present a polynomial-time algorithm for arbitrary conflict hypergraphs. We further show that under a general assumption of transitivity, this algorithm solves the problem even for global optimality. The algorithm is extremely simple, but its proof of correctness is quite intricate.

Benny Kimelfeld, Ester Livshits, and Liat Peterfreund. Detecting Ambiguity in Prioritized Database Repairing. In 20th International Conference on Database Theory (ICDT 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 68, pp. 17:1-17:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

@InProceedings{kimelfeld_et_al:LIPIcs.ICDT.2017.17, author = {Kimelfeld, Benny and Livshits, Ester and Peterfreund, Liat}, title = {{Detecting Ambiguity in Prioritized Database Repairing}}, booktitle = {20th International Conference on Database Theory (ICDT 2017)}, pages = {17:1--17:20}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-024-8}, ISSN = {1868-8969}, year = {2017}, volume = {68}, editor = {Benedikt, Michael and Orsi, Giorgio}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2017.17}, URN = {urn:nbn:de:0030-drops-70489}, doi = {10.4230/LIPIcs.ICDT.2017.17}, annote = {Keywords: inconsistent databases, preferred repairs, data cleaning, functional dependencies, conflict hypergraph} }

Document

**Published in:** LIPIcs, Volume 48, 19th International Conference on Database Theory (ICDT 2016)

Probabilistic programming languages are used for developing statistical models, and they typically consist of two components: a specification of a stochastic process (the prior), and a specification of observations that restrict the probability space to a conditional subspace (the posterior). Use cases of such formalisms include the development of algorithms in machine learning and artificial intelligence. We propose and investigate an extension of Datalog for specifying statistical models, and establish a declarative probabilistic-programming paradigm over databases. Our proposed extension provides convenient mechanisms to include common numerical probability functions; in particular, conclusions of rules may contain values drawn from such functions. The semantics of a program is a probability distribution over the possible outcomes of the input database with respect to the program. Observations are naturally incorporated by means of integrity constraints over the extensional and intensional relations. The resulting semantics is robust under different chases and invariant to rewritings that preserve logical equivalence.

Vince Barany, Balder ten Cate, Benny Kimelfeld, Dan Olteanu, and Zografoula Vagena. Declarative Probabilistic Programming with Datalog. In 19th International Conference on Database Theory (ICDT 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 48, pp. 7:1-7:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)

@InProceedings{barany_et_al:LIPIcs.ICDT.2016.7, author = {Barany, Vince and ten Cate, Balder and Kimelfeld, Benny and Olteanu, Dan and Vagena, Zografoula}, title = {{Declarative Probabilistic Programming with Datalog}}, booktitle = {19th International Conference on Database Theory (ICDT 2016)}, pages = {7:1--7:19}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-002-6}, ISSN = {1868-8969}, year = {2016}, volume = {48}, editor = {Martens, Wim and Zeume, Thomas}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2016.7}, URN = {urn:nbn:de:0030-drops-57761}, doi = {10.4230/LIPIcs.ICDT.2016.7}, annote = {Keywords: Chase, Datalog, probability measure space, probabilistic programming} }