eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
0
0
10.4230/LIPIcs.ICDT.2019
article
LIPIcs, Volume 127, ICDT'19, Complete Volume
Barcelo, Pablo
1
Calautti, Marco
2
Department of Computer Science, Universidad de Chile, CL
School of Informatics, University of Edinburgh, UK
LIPIcs, Volume 127, ICDT'19, Complete Volume
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019/LIPIcs.ICDT.2019.pdf
Computing Methodologies, Knowledge Representation and Reasoning, Theory of computation, Data modeling, Incomplete, inconsistent and uncertain databases, Information systems, Data management systems, Data streams, Database query processing, Incomplete data, Inconsistent data, Relational database model
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
0:i
0:xvi
10.4230/LIPIcs.ICDT.2019.0
article
Front Matter, Table of Contents, Preface, Conference Organization
Barcelo, Pablo
1
Calautti, Marco
2
Department of Computer Science, Universidad de Chile, CL
School of Informatics, University of Edinburgh, UK
Front Matter, Table of Contents, Preface, Conference Organization
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.0/LIPIcs.ICDT.2019.0.pdf
Front Matter
Table of Contents
Preface
Conference Organization
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
1:1
1:1
10.4230/LIPIcs.ICDT.2019.1
article
Learning Models over Relational Databases (Invited Talk)
Olteanu, Dan
1
Department of Computer Science, University of Oxford, Oxford, UK
In this talk, I will make the case for a first-principles approach to machine learning over relational databases that exploits recent development in database systems and theory.
The input to learning classification and regression models is defined by feature extraction queries over relational databases. The mainstream approach to learning over relational data is to materialize the training dataset, export it out of the database, and then learn over it using statistical software packages. These three steps are expensive and unnecessary. Instead, one can cast the machine learning problem as a database problem by decomposing the learning task into a batch of aggregates over the feature extraction query and by computing this batch over the input database.
The performance of this database-centric approach benefits tremendously from structural properties of the relational data and of the feature extraction query; such properties may be algebraic (semi-ring), combinatorial (hypertree width), or statistical (sampling). It also benefits from database systems techniques such as factorized query evaluation and query compilation. For a variety of models, including factorization machines, decision trees, and support vector machines, this approach may come with lower computational complexity than the materialization of the training dataset used by the mainstream approach. Recent results show that this translates to several orders-of-magnitude speed-up over state-of-the-art systems such as TensorFlow, R, Scikit-learn, and mlpack.
While these initial results are promising, much more awaits discovery.
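The decomposition described above can be sketched in a few lines: a toy example (with made-up relations R(a, b) and S(a, c)) that computes an aggregate needed for learning, a sum of products over the join, by pushing partial sums through the join key instead of materializing the join.

```python
from collections import defaultdict

# Hypothetical relations R(a, b) and S(a, c); the join key is a.
R = [(1, 2.0), (1, 3.0), (2, 5.0)]
S = [(1, 10.0), (1, 20.0), (2, 7.0)]

# Materialized approach: build R join S, then sum b*c over the result.
joined = [(b, c) for (a1, b) in R for (a2, c) in S if a1 == a2]
agg_materialized = sum(b * c for (b, c) in joined)

# Factorized approach: compute per-key partial sums in each relation,
# then combine them, never materializing the (possibly much larger) join.
sum_b = defaultdict(float)
sum_c = defaultdict(float)
for a, b in R:
    sum_b[a] += b
for a, c in S:
    sum_c[a] += c
agg_factorized = sum(sum_b[a] * sum_c[a] for a in sum_b if a in sum_c)

assert agg_materialized == agg_factorized  # 5.0*30.0 + 5.0*7.0 = 185.0
```

The same pattern extends to the batches of aggregates used for regression models, which is where the complexity gains over materialization come from.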
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.1/LIPIcs.ICDT.2019.1.pdf
In-database analytics
Data complexity
Feature extraction queries
Database dependencies
Model reparameterization
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
2:1
2:1
10.4230/LIPIcs.ICDT.2019.2
article
The Power of Relational Learning (Invited Talk)
Getoor, Lise
1
Computer Science Department, University of California, Santa Cruz, US
We live in a richly interconnected world and, not surprisingly, we generate richly interconnected data. From smart cities to social media to financial networks to biological networks, data is relational. While database theory is built on strong relational foundations, the same is not true for machine learning. The majority of machine learning methods flatten data into a single table before performing any processing. Further, database theory is also built on a bedrock of declarative representations. The same is not true for machine learning, in particular deep learning, which results in black-box, uninterpretable and unexplainable models. In this talk, I will introduce the field of statistical relational learning, an alternative machine learning approach based on declarative relational representations paired with probabilistic models. I’ll describe our work on probabilistic soft logic, a probabilistic programming language that is ideally suited to richly connected, noisy data. Our recent results show that by building on state-of-the-art optimization methods in a distributed implementation, we can solve very large relational learning problems orders of magnitude faster than existing approaches.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.2/LIPIcs.ICDT.2019.2.pdf
Machine learning
Probabilistic soft logic
Relational model
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
3:1
3:17
10.4230/LIPIcs.ICDT.2019.3
article
The Power of the Terminating Chase (Invited Talk)
Krötzsch, Markus
1
https://orcid.org/0000-0002-9172-2601
Marx, Maximilian
1
https://orcid.org/0000-0003-1479-0341
Rudolph, Sebastian
1
https://orcid.org/0000-0002-1609-2080
TU Dresden, Germany
The chase has become a staple of modern database theory with applications in data integration, query optimisation, data exchange, ontology-based query answering, and many other areas. Most application scenarios and implementations require the chase to terminate and produce a finite universal model, and a large arsenal of sufficient termination criteria is available to guarantee this (generally undecidable) condition. In this invited tutorial, we therefore ask about the expressive power of logical theories for which the chase terminates. Specifically, which database properties can be recognised by such theories, i.e., which Boolean queries can they realise? For the skolem (semi-oblivious) chase, and almost any known termination criterion, this expressivity is just that of plain Datalog. Surprisingly, this limitation of most prior research does not apply to the chase in general. Indeed, we show that standard-chase terminating theories can realise queries with data complexities ranging from PTime to non-elementary that are out of reach for the terminating skolem chase. A "Datalog-first" standard chase that prioritises applications of rules without existential quantifiers makes modelling simpler and, we conjecture, computationally more efficient. This is one of the many open questions raised by our insights, and we conclude with an outlook on the research opportunities in this area.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.3/LIPIcs.ICDT.2019.3.pdf
Existential rules
Tuple-generating dependencies
all-instances chase termination
expressive power
data complexity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
4:1
4:18
10.4230/LIPIcs.ICDT.2019.4
article
Counting Triangles under Updates in Worst-Case Optimal Time
Kara, Ahmet
1
Ngo, Hung Q.
2
Nikolic, Milos
3
Olteanu, Dan
1
Zhang, Haozhe
1
Department of Computer Science, University of Oxford, Oxford, UK
RelationalAI, Inc., Berkeley, CA, USA
School of Informatics, University of Edinburgh, Edinburgh, UK
We consider the problem of incrementally maintaining the triangle count query under single-tuple updates to the input relations. We introduce an approach that exhibits a space-time tradeoff such that the space-time product is quadratic in the size of the input database and the update time can be as low as the square root of this size. This lowest update time is worst-case optimal conditioned on the Online Matrix-Vector Multiplication conjecture.
The classical and factorized incremental view maintenance approaches are recovered as special cases of our approach within the space-time tradeoff. In particular, they require linear-time maintenance under updates, which is suboptimal. Our approach can also count all triangles in a static database in the worst-case optimal time needed for enumerating them.
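The classical linear-time maintenance mentioned above is easy to state: on inserting edge (u, v), every common neighbour of u and v closes a new triangle. A minimal sketch of this baseline (the paper's worst-case optimal algorithm, with its space-time tradeoff, is considerably more involved):

```python
from collections import defaultdict

# Classical incremental maintenance of the triangle count under
# single-edge insertions: each update costs time linear in the degrees,
# which is the suboptimal baseline the paper improves on.
adj = defaultdict(set)
triangles = 0

def insert_edge(u, v):
    global triangles
    # Every common neighbour of u and v closes a new triangle.
    triangles += len(adj[u] & adj[v])
    adj[u].add(v)
    adj[v].add(u)

for u, v in [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]:
    insert_edge(u, v)
print(triangles)  # 2: triangles {1,2,3} and {2,3,4}
```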
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.4/LIPIcs.ICDT.2019.4.pdf
incremental view maintenance
amortized analysis
data skew
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
5:1
5:18
10.4230/LIPIcs.ICDT.2019.5
article
A Formal Framework for Complex Event Processing
Grez, Alejandro
1
2
Riveros, Cristian
1
2
Ugarte, Martín
2
Pontificia Universidad Católica de Chile, Santiago, Chile
Millennium Institute for Foundational Research on Data, Santiago, Chile
Complex Event Processing (CEP) has emerged as the unifying field for technologies that require processing and correlating distributed data sources in real-time. CEP finds applications in diverse domains, which has resulted in a large number of proposals for expressing and processing complex events. However, existing CEP languages lack a clear semantics, making them hard to understand and generalize. Moreover, there are no general techniques for evaluating CEP query languages with clear performance guarantees.
In this paper we embark on the task of giving a rigorous and efficient framework to CEP. We propose a formal language for specifying complex events, called CEL, that contains the main features used in the literature and has a denotational and compositional semantics. We also formalize the so-called selection strategies, which had only been presented as by-design extensions to existing frameworks. With a well-defined semantics at hand, we discuss how to efficiently process complex events by evaluating CEL formulas with unary filters. We start by studying the syntactical properties of CEL and propose rewriting optimization techniques for simplifying the evaluation of formulas. Then, we introduce a formal computational model for CEP, called complex event automata (CEA), and study how to compile CEL formulas with unary filters into CEA. Furthermore, we provide efficient algorithms for evaluating CEA over event streams using constant time per event followed by constant-delay enumeration of the results. Finally, we gather the main results of this work to present an efficient and declarative framework for CEP.
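To fix intuitions, a deliberately tiny illustration of the kind of query in scope: a sequence pattern with a unary filter, evaluated over a stream (the pattern, event types, and values are invented; CEL and CEA in the paper are far more general and support selection strategies and constant-delay output).

```python
# Toy CEP evaluation: match "an A-event with value > 10, followed
# later by a B-event", reporting the positions of the matched pair.
stream = [("A", 5), ("B", 1), ("A", 42), ("C", 0), ("B", 7)]

matches = []
open_partial = []  # positions of A-events that passed the unary filter
for i, (etype, val) in enumerate(stream):
    if etype == "A" and val > 10:     # unary filter on A
        open_partial.append(i)
    elif etype == "B":
        matches.extend((a, i) for a in open_partial)
print(matches)  # [(2, 4)]
```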
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.5/LIPIcs.ICDT.2019.5.pdf
Complex event processing
streaming evaluation
constant delay enumeration
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
6:1
6:18
10.4230/LIPIcs.ICDT.2019.6
article
A Formal Framework for Probabilistic Unclean Databases
De Sa, Christopher
1
Ilyas, Ihab F.
2
Kimelfeld, Benny
3
Ré, Christopher
4
Rekatsinas, Theodoros
5
Cornell University, Ithaca, NY, USA
University of Waterloo, Waterloo, ON, Canada
Technion - Israel Institute of Technology, Haifa, Israel
Stanford University, Stanford, CA, USA
University of Wisconsin - Madison, Madison, WI, USA
Most theoretical frameworks that focus on data errors and inconsistencies follow logic-based reasoning. Yet, practical data cleaning tools need to incorporate statistical reasoning to be effective in real-world data cleaning tasks. Motivated by empirical successes, we propose a formal framework for unclean databases, where two types of statistical knowledge are incorporated: The first represents a belief of how intended (clean) data is generated, and the second represents a belief of how noise is introduced in the actual observed database. To capture this noisy channel model, we introduce the concept of a Probabilistic Unclean Database (PUD), a triple that consists of a probabilistic database that we call the intention, a probabilistic data transformator that we call the realization, which captures how noise is introduced, and an observed unclean database that we call the observation. We define three computational problems in the PUD framework: cleaning (infer the most probable intended database, given a PUD), probabilistic query answering (compute the probability of an answer tuple over the unclean observed database), and learning (estimate the most likely intention and realization models of a PUD, given examples as training data). We illustrate the PUD framework on concrete representations of the intention and realization, show that they generalize traditional concepts of repairs such as cardinality and value repairs, draw connections to consistent query answering, and prove tractability results. We further show that parameters can be learned in some practical instantiations, and, in fact, prove that under certain conditions we can learn a PUD directly from a single dirty database without any need for clean examples.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.6/LIPIcs.ICDT.2019.6.pdf
Unclean databases
data cleaning
probabilistic databases
noisy channel
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
7:1
7:19
10.4230/LIPIcs.ICDT.2019.7
article
On the Expressive Power of Linear Algebra on Graphs
Geerts, Floris
1
University of Antwerp, Antwerp, Belgium
Most graph query languages are rooted in logic. By contrast, in this paper we consider graph query languages rooted in linear algebra. More specifically, we consider MATLANG, a matrix query language recently introduced, in which some basic linear algebra functionality is supported. We investigate the problem of characterising equivalence of graphs, represented by their adjacency matrices, for various fragments of MATLANG. A complete picture is painted of the impact of the linear algebra operations in MATLANG on their ability to distinguish graphs.
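One concrete quantity of the flavour studied above, computable from the adjacency matrix with matrix multiplication and trace (operations of the kind MATLANG supports, though this snippet is plain NumPy, not MATLANG syntax): tr(A^k) counts closed walks of length k, and for k = 3 it equals six times the triangle count, so it already distinguishes some graphs.

```python
import numpy as np

# tr(A^k) = number of closed walks of length k in the graph with
# adjacency matrix A. For k = 3: tr(A^3) = 6 * (number of triangles).
def closed_walks(A, k):
    return int(np.trace(np.linalg.matrix_power(A, k)))

triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])  # K3: one triangle
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])      # P3: no triangle

print(closed_walks(triangle, 3))  # 6
print(closed_walks(path, 3))      # 0
```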
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.7/LIPIcs.ICDT.2019.7.pdf
matrix query languages
graph queries
graph theory
logics
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
8:1
8:16
10.4230/LIPIcs.ICDT.2019.8
article
Fragments of Bag Relational Algebra: Expressiveness and Certain Answers
Console, Marco
1
Guagliardo, Paolo
1
https://orcid.org/0000-0003-0756-5787
Libkin, Leonid
1
School of Informatics, University of Edinburgh, United Kingdom
While all relational database systems are based on the bag data model, much of theoretical research still views relations as sets. Recent attempts to provide theoretical foundations for modern data management problems under the bag semantics concentrated on applications that need to deal with incomplete relations, i.e., relations populated by constants and nulls. Our goal is to provide a complete characterization of the complexity of query answering over such relations in fragments of bag relational algebra.
The main challenges that we face are twofold. First, bag relational algebra has more operations than its set analog (e.g., additive union, max-union, min-intersection, duplicate elimination) and the relationship between various fragments is not fully known. Thus we first fill this gap. Second, we look at query answering over incomplete data, which again is more complex than in the set case: rather than certainty and possibility of answers, we now have numerical information about occurrences of tuples. We then fully classify the complexity of finding this information in all the fragments of bag relational algebra.
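The bag operations named above can be seen concretely on small multisets of tuples; Python's Counter happens to implement exactly these multiset operators, so a two-relation toy example suffices.

```python
from collections import Counter

# Two bag relations over a single attribute, with multiplicities.
R = Counter({("a",): 2, ("b",): 1})
S = Counter({("a",): 1, ("c",): 3})

print(R + S)   # additive union: multiplicities add
print(R | S)   # max-union: pointwise maximum of multiplicities
print(R & S)   # min-intersection: pointwise minimum
print(set(R))  # duplicate elimination: the underlying set
```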
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.8/LIPIcs.ICDT.2019.8.pdf
bag semantics
relational algebra
expressivity
certain answers
complexity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
9:1
9:19
10.4230/LIPIcs.ICDT.2019.9
article
Categorical Range Reporting with Frequencies
Ganguly, Arnab
1
Munro, J. Ian
2
Nekrich, Yakov
2
Shah, Rahul
3
Thankachan, Sharma V.
4
Dept. of Computer Science, University of Wisconsin, Whitewater, USA
Cheriton School of Computer Science, University of Waterloo, Canada
Dept. of Computer Science, Baton Rouge, USA
Dept. of Computer Science, University of Central Florida
In this paper, we consider a variant of the color range reporting problem called color reporting with frequencies. Our goal is to pre-process a set of colored points into a data structure, so that given a query range Q, we can report all colors that appear in Q, along with their respective frequencies. In other words, for each reported color, we also output the number of times it occurs in Q. We describe an external-memory data structure that uses O(N(1 + log^2 D / log N)) words and answers one-dimensional queries in O(1 + K/B) I/Os, where N is the total number of points in the data structure, D is the total number of colors in the data structure, K is the number of reported colors, and B is the block size.
Next we turn to an approximate version of this problem: report all colors that appear in the query range; for every reported color, we provide a constant-factor approximation of its frequency. We consider color reporting with approximate frequencies in two dimensions. Our data structure uses O(N) space and answers two-dimensional queries in O(log_B N + log^* B + K/B) I/Os in the special case when the query range is bounded on two sides. As a corollary, we can also answer one-dimensional approximate queries within the same time and space bounds.
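The query semantics is easy to state with a linear scan over invented sample points (the point of the paper is, of course, a data structure that avoids this scan and meets the I/O bounds above):

```python
from collections import Counter

# One-dimensional color reporting with frequencies, by brute force.
points = [(1, "red"), (3, "blue"), (4, "red"), (7, "blue"), (9, "red")]

def query(lo, hi):
    # Report each color occurring in [lo, hi] with its frequency.
    return Counter(color for x, color in points if lo <= x <= hi)

print(query(2, 8))  # Counter({'blue': 2, 'red': 1})
```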
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.9/LIPIcs.ICDT.2019.9.pdf
Data Structures
Range Reporting
Range Counting
Categorical Range Reporting
Orthogonal Range Query
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
10:1
10:20
10.4230/LIPIcs.ICDT.2019.10
article
Approximating Distance Measures for the Skyline
Kumar, Nirman
1
Raichel, Benjamin
2
Sintos, Stavros
3
Van Buskirk, Gregory
2
Department of Computer Science, University of Memphis, TN, USA
Department of Computer Science, University of Texas at Dallas, TX, USA
Department of Computer Science, Duke University, Durham, NC, USA
In multi-parameter decision making, data is usually modeled as a set of points whose dimension is the number of parameters, and the skyline or Pareto points represent the possible optimal solutions for various optimization problems. The structure and computation of such points have been well studied, particularly in the database community. As the skyline can be quite large in high dimensions, one often seeks a compact summary. In particular, for a given integer parameter k, a subset of k points is desired which best approximates the skyline under some measure. Various measures have been proposed, but they mostly treat the skyline as a discrete object. By viewing the skyline as a continuous geometric hull, we propose a new measure that evaluates the quality of a subset by the Hausdorff distance of its hull to the full hull. We argue that in many ways our measure more naturally captures what it means to approximate the skyline.
For our new geometric skyline approximation measure, we provide a plethora of results. Specifically, we provide (1) a near linear time exact algorithm in two dimensions, (2) APX-hardness results for dimensions three and higher, (3) approximation algorithms for related variants of our problem, and (4) a practical and efficient heuristic which uses our geometric insights into the problem, as well as various experimental results to show the efficacy of our approach.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.10/LIPIcs.ICDT.2019.10.pdf
Skyline
Pareto optimal
Approximation
Hardness
Multi-criteria decision making
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
11:1
11:20
10.4230/LIPIcs.ICDT.2019.11
article
Index-Based, High-Dimensional, Cosine Threshold Querying with Optimality Guarantees
Li, Yuliang
1
2
Wang, Jianguo
2
Pullman, Benjamin
2
Bandeira, Nuno
2
Papakonstantinou, Yannis
2
Megagon Labs, Mountain View, California, USA
UC San Diego, San Diego, California, USA
Given a database of vectors, a cosine threshold query returns all vectors in the database having cosine similarity to a query vector above a given threshold. These queries arise naturally in many applications, such as document retrieval, image search, and mass spectrometry. The present paper considers the efficient evaluation of such queries, providing novel optimality guarantees and exhibiting good performance on real datasets. We take as a starting point Fagin’s well-known Threshold Algorithm (TA), which can be used to answer cosine threshold queries as follows: an inverted index is first built from the database vectors during pre-processing; at query time, the algorithm traverses the index partially to gather a set of candidate vectors to be later verified against the similarity threshold. However, directly applying TA in its raw form misses significant optimization opportunities. Indeed, we first show that one can take advantage of the fact that the vectors can be assumed to be normalized, to obtain an improved, tight stopping condition for index traversal and to efficiently compute it incrementally. Then we show that one can take advantage of data skewness to obtain better traversal strategies. In particular, we show a novel traversal strategy that exploits a common data skewness condition which holds in multiple domains including mass spectrometry, documents, and image databases. We show that under the skewness assumption, the new traversal strategy has a strong, near-optimal performance guarantee. The techniques developed in the paper are quite general since they can be applied to a large class of similarity functions beyond cosine.
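The normalization observation above is worth seeing concretely: once database and query vectors are unit-length, cosine similarity is a plain dot product, so a cosine threshold query is a dot-product threshold query. This brute-force scan on synthetic data only states the semantics; the paper's contribution is the index traversal with near-optimal guarantees.

```python
import numpy as np

# Synthetic database of 100 vectors in 8 dimensions, row-normalized.
rng = np.random.default_rng(0)
db = rng.random((100, 8))
db /= np.linalg.norm(db, axis=1, keepdims=True)

def cosine_threshold_query(q, theta):
    q = q / np.linalg.norm(q)
    scores = db @ q  # equals cosine similarity: all rows are unit-length
    return np.flatnonzero(scores >= theta)

q = rng.random(8)
hits = cosine_threshold_query(q, 0.9)  # indices of vectors above threshold
```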
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.11/LIPIcs.ICDT.2019.11.pdf
Vector databases
Similarity search
Cosine
Threshold Algorithm
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
12:1
12:18
10.4230/LIPIcs.ICDT.2019.12
article
An Experimental Study of the Treewidth of Real-World Graph Data
Maniu, Silviu
1
Senellart, Pierre
2
3
4
Jog, Suraj
5
LRI, CNRS, Université Paris-Sud, Université Paris-Saclay, Orsay, France
DI ENS, ENS, CNRS, PSL University, Paris, France
Inria Paris, France
LTCI, Télécom ParisTech, Paris, France
University of Illinois at Urbana–Champaign, Urbana-Champaign, USA
Treewidth is a parameter that measures how tree-like a relational instance is, and whether it can reasonably be decomposed into a tree. Many computation tasks are known to be tractable on databases of small treewidth, but computing the treewidth of a given instance is intractable. This article is the first large-scale experimental study of treewidth and tree decompositions of real-world database instances (25 datasets from 8 different domains, with sizes ranging from a few thousand to a few million vertices). The goal is to determine which data, if any, can benefit from the wealth of algorithms for databases of small treewidth. For each dataset, we obtain upper and lower bound estimates of its treewidth, and study the properties of its tree decompositions. We show in particular that, even when treewidth is high, using partial tree decompositions can result in data structures that can assist algorithms.
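A standard way to obtain the kind of treewidth upper bound such a study relies on is a greedy elimination ordering; the min-degree heuristic below is one common choice (the true treewidth may be lower than the width it reports, and the study's exact tooling may differ).

```python
# Min-degree elimination heuristic: repeatedly eliminate a vertex of
# minimum degree, turning its neighbourhood into a clique. The largest
# neighbourhood eliminated is an upper bound on the treewidth.
def mindegree_width(adj):
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    width = 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))  # min-degree vertex
        nbrs = adj.pop(v)
        width = max(width, len(nbrs))
        for a in nbrs:                           # clique-ify the neighbourhood
            adj[a].discard(v)
            adj[a] |= nbrs - {a}
    return width

# A cycle on 5 vertices has treewidth 2; the heuristic is exact here.
cycle5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
print(mindegree_width(cycle5))  # 2
```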
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.12/LIPIcs.ICDT.2019.12.pdf
Treewidth
Graph decompositions
Experiments
Query processing
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
13:1
13:18
10.4230/LIPIcs.ICDT.2019.13
article
Recursive Programs for Document Spanners
Peterfreund, Liat
1
Cate, Balder ten
2
Fagin, Ronald
3
Kimelfeld, Benny
1
Technion, Haifa 32000, Israel
Google, Inc., Mountain View, CA 94043, USA
IBM Research - Almaden, San Jose, CA 95120, USA
A document spanner models a program for Information Extraction (IE) as a function that takes as input a text document (string over a finite alphabet) and produces a relation of spans (intervals in the document) over a predefined schema. A well-studied language for expressing spanners is that of the regular spanners: relational algebra over regex formulas, which are regular expressions with capture variables. Equivalently, the regular spanners are the ones expressible in non-recursive Datalog over regex formulas (which extract relations that constitute the extensional database). This paper explores the expressive power of recursive Datalog over regex formulas. We show that such programs can express precisely the document spanners computable in polynomial time. We compare this expressiveness to known formalisms such as the closure of regex formulas under the relational algebra and string equality. Finally, we extend our study to a recently proposed framework that generalizes both the relational model and the document spanners.
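The span-producing flavour of regex formulas can be glimpsed with Python's named groups standing in for capture variables (only a glimpse: Python yields leftmost non-overlapping matches, whereas spanner semantics enumerates all matches; the document and pattern are invented).

```python
import re

# Each match binds the capture variables x and y to spans, i.e.
# (start, end) offset intervals over the input document.
doc = "Ann met Bob. Carl met Dana."
pattern = re.compile(r"(?P<x>[A-Z][a-z]+) met (?P<y>[A-Z][a-z]+)")

spans = [(m.span("x"), m.span("y")) for m in pattern.finditer(doc)]
print(spans)  # [((0, 3), (8, 11)), ((13, 17), (22, 26))]
```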
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.13/LIPIcs.ICDT.2019.13.pdf
Information Extraction
Document Spanners
Polynomial Time
Recursion
Regular Expressions
Datalog
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
14:1
14:19
10.4230/LIPIcs.ICDT.2019.14
article
Parallel-Correctness and Parallel-Boundedness for Datalog Programs
Neven, Frank
1
Schwentick, Thomas
2
Spinrath, Christopher
2
Vandevoort, Brecht
1
Hasselt University and transnational University of Limburg, The Netherlands
Dortmund University, Germany
Recently, Ketsman et al. started the investigation of the parallel evaluation of recursive queries in the Massively Parallel Communication (MPC) model. Among other things, it was shown that parallel-correctness and parallel-boundedness for general Datalog programs are undecidable, by a reduction from the undecidable containment problem for Datalog. Furthermore, economic policies were introduced as a means to specify data distribution in a recursive setting. In this paper, we extend the latter framework to account for more general distributed evaluation strategies in terms of communication policies. We then show that the undecidability of parallel-correctness runs deeper: it already holds for fragments of Datalog, e.g., monadic and frontier-guarded Datalog, with a decidable containment problem, under relatively simple evaluation strategies. These simple evaluation strategies are defined w.r.t. data-moving distribution constraints. We then investigate restrictions of economic policies that yield decidability. In particular, we show that parallel-correctness is 2EXPTIME-complete for monadic and frontier-guarded Datalog under hash-based economic policies. Next, we consider restrictions of data-moving constraints and show that parallel-correctness and parallel-boundedness are 2EXPTIME-complete for frontier-guarded Datalog. Interestingly, distributed evaluation no longer preserves the usual containment relationships between fragments of Datalog. Indeed, not every monadic Datalog program is equivalent to a frontier-guarded one in the distributed setting. We illustrate the latter by considering two alternative settings, in one of which parallel-correctness is decidable for frontier-guarded Datalog but undecidable for monadic Datalog.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.14/LIPIcs.ICDT.2019.14.pdf
Datalog
distributed databases
distributed evaluation
decision problems
complexity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
15:1
15:18
10.4230/LIPIcs.ICDT.2019.15
article
The First Order Truth Behind Undecidability of Regular Path Queries Determinacy
Głuch, Grzegorz
1
Marcinkowski, Jerzy
1
Ostropolski-Nalewaja, Piotr
1
Institute of Computer Science, University of Wrocław, Poland
In our paper [Głuch, Marcinkowski, Ostropolski-Nalewaja, LICS ACM, 2018] we solved an old problem stated in [Calvanese, De Giacomo, Lenzerini, Vardi, PODS ACM, 2000], showing that query determinacy is undecidable for Regular Path Queries. Here a strong generalisation of this result is shown, and, we think, a very unexpected one. We prove that no regularity is needed: determinacy remains undecidable even for finite unions of conjunctive path queries.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.15/LIPIcs.ICDT.2019.15.pdf
database theory
query
view
determinacy
recursive path queries
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
16:1
16:19
10.4230/LIPIcs.ICDT.2019.16
article
Datalog: Bag Semantics via Set Semantics
Bertossi, Leopoldo
1
2
3
Gottlob, Georg
4
5
Pichler, Reinhard
5
RelationalAI Inc., USA
Carleton University, Ottawa, Canada
Member of the "Millennium Institute for Foundational Research on Data" (IMFD, Chile)
University of Oxford, UK
TU Wien, Austria
Duplicates in data management are common and problematic. In this work, we present a translation of Datalog under bag semantics into a well-behaved extension of Datalog, the so-called warded Datalog^+/-, under set semantics. From a theoretical point of view, this allows us to reason on bag semantics by making use of the well-established theoretical foundations of set semantics. From a practical point of view, this allows us to handle the bag semantics of Datalog by powerful, existing query engines for the required extension of Datalog. This use of Datalog^+/- is extended to give a set semantics to duplicates in Datalog^+/- itself. We investigate the properties of the resulting Datalog^+/- programs, the problem of deciding multiplicities, and expressibility of some bag operations. Moreover, the proposed translation has the potential for interesting applications such as to Multiset Relational Algebra and the semantic web query language SPARQL with bag semantics.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.16/LIPIcs.ICDT.2019.16.pdf
Datalog
duplicates
multisets
query answering
chase
Datalog+/-
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
17:1
17:18
10.4230/LIPIcs.ICDT.2019.17
article
Oblivious Chase Termination: The Sticky Case
Calautti, Marco
1
Pieris, Andreas
1
School of Informatics, University of Edinburgh, UK
The chase procedure is one of the most fundamental algorithmic tools in database theory. A key algorithmic task is uniform chase termination, i.e., given a set of tuple-generating dependencies (tgds), is it the case that the chase under this set of tgds terminates, for every input database? In view of the fact that this problem is undecidable, no matter which version of the chase we consider, it is natural to ask whether well-behaved classes of tgds, introduced in different contexts such as ontological reasoning, make our problem decidable. In this work, we consider a prominent decidability paradigm for tgds, called stickiness. We show that for sticky sets of tgds, uniform chase termination is decidable if we focus on the (semi-)oblivious chase, and we pinpoint its exact complexity: PSpace-complete in general, and NLogSpace-complete for predicates of bounded arity. These complexity results are obtained via graph-based syntactic characterizations of chase termination that are of independent interest.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.17/LIPIcs.ICDT.2019.17.pdf
Chase procedure
tuple-generating dependencies
stickiness
termination
computational complexity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
18:1
18:19
10.4230/LIPIcs.ICDT.2019.18
article
A Single Approach to Decide Chase Termination on Linear Existential Rules
Leclère, Michel
1
Mugnier, Marie-Laure
1
Thomazo, Michaël
2
Ulliana, Federico
1
University of Montpellier, CNRS, Inria, LIRMM, France
Inria, DI ENS, ENS, CNRS, PSL University, France
Existential rules, long known as tuple-generating dependencies in database theory, have been intensively studied in the last decade as a powerful formalism to represent ontological knowledge in the context of ontology-based query answering. A knowledge base is then composed of an instance that contains incomplete data and a set of existential rules, and answers to queries are logically entailed from the knowledge base. This has brought renewed attention to the fundamental chase tool and to the different chase variants proposed in the literature. It is well known that the problem of determining, given a chase variant and a set of existential rules, whether the chase will halt on any instance, is undecidable. Hence, a crucial issue is whether it becomes decidable for known subclasses of existential rules. In this work, we consider linear existential rules with atomic head, a simple yet important subclass of existential rules that generalizes inclusion dependencies. We show the decidability of the all-instance chase termination problem on these rules for three main chase variants, namely the semi-oblivious, restricted, and core chase. To obtain these results, we introduce a novel approach based on so-called derivation trees and a single notion of forbidden pattern. Besides the theoretical interest of a unified approach and new proofs for the semi-oblivious and core chase variants, we provide the first positive decidability results concerning the termination of the restricted chase, proving that chase termination on linear existential rules with atomic head is decidable for both versions of the problem: Does every chase sequence terminate? Does some chase sequence terminate?
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.18/LIPIcs.ICDT.2019.18.pdf
Chase
Tuple Generating Dependencies
Existential rules
Decidability
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
19:1
19:14
10.4230/LIPIcs.ICDT.2019.19
article
Additive First-Order Queries
Berger, Gerald
1
Otto, Martin
2
Pieris, Andreas
3
Surinx, Dimitri
4
Van den Bussche, Jan
4
TU Wien, Austria
TU Darmstadt, Germany
University of Edinburgh, Scotland
Hasselt University, Belgium
A database query q is called additive if q(A U B) = q(A) U q(B) for domain-disjoint input databases A and B. Additivity allows the computation of the query result to be parallelised over the connected components of the input database. We define the "connected formulas" as a syntactic fragment of first-order logic, and show that a first-order query is additive if and only if it is expressible by a connected formula. This characterisation specialises to the guarded fragment of first-order logic. We also show that additivity is decidable for formulas of the guarded fragment, establish the computational complexity, and do the same for positive-existential formulas. Our results hold when restricting attention to finite structures, as is common in database theory, but also hold in the unrestricted setting.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.19/LIPIcs.ICDT.2019.19.pdf
Expressive power
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
20:1
20:18
10.4230/LIPIcs.ICDT.2019.20
article
Characterizing Tractability of Simple Well-Designed Pattern Trees with Projection
Mengel, Stefan
1
Skritek, Sebastian
2
CNRS, CRIL UMR 8188, Lens, France
Faculty of Informatics, TU Wien, Vienna, Austria
We study the complexity of evaluating well-designed pattern trees, a query language extending conjunctive queries with the ability to declare parts of the query optional. This possibility of optional parts is important for obtaining meaningful results over incomplete data sources, as is common in Semantic Web settings.
Recently, a structural characterization of the classes of well-designed pattern trees that can be evaluated in polynomial time was shown. However, projection - a central feature of many query languages - was not considered in this study. We work towards closing this gap by giving a characterization of all tractable classes of simple well-designed pattern trees with projection (under some common complexity-theoretic assumptions). Since well-designed pattern trees correspond to the fragment of well-designed {AND, OPTIONAL}-SPARQL queries, this gives a complete description of the tractable classes of queries with projection in this fragment that can be characterized by the underlying graph structures of the queries.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.20/LIPIcs.ICDT.2019.20.pdf
SPARQL
well-designed pattern trees
query evaluation
FPT
characterizing tractable classes
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
21:1
21:19
10.4230/LIPIcs.ICDT.2019.21
article
Boolean Tensor Decomposition for Conjunctive Queries with Negation
Abo Khamis, Mahmoud
1
Ngo, Hung Q.
1
Olteanu, Dan
2
Suciu, Dan
3
RelationalAI, Berkeley, USA
Department of Computer Science, University of Oxford, UK
Department of Computer Science and Engineering, University of Washington, USA
We propose an approach for answering conjunctive queries with negation, where the negated relations have bounded degree. Its data complexity matches that of the InsideOut and PANDA algorithms for the positive subquery of the input query and is expressed in terms of the fractional hypertree width and the submodular width, respectively. Its query complexity depends on the structure of the conjunction of negated relations; in general, it is exponential in the number of join variables occurring in negated relations, yet it becomes polynomial for several classes of queries.
This approach relies on several contributions. We show how to rewrite queries with negation on bounded-degree relations into equivalent conjunctive queries with not-all-equal (NAE) predicates, which are a multi-dimensional analog of disequality (!=). We then generalize the known color-coding technique to conjunctions of NAE predicates and explain it via a Boolean tensor decomposition of conjunctions of NAE predicates. This decomposition can be achieved via a probabilistic construction that can be derandomized efficiently.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.21/LIPIcs.ICDT.2019.21.pdf
color-coding
combined complexity
negation
query evaluation
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
22:1
22:19
10.4230/LIPIcs.ICDT.2019.22
article
Constant-Delay Enumeration for Nondeterministic Document Spanners
Amarilli, Antoine
1
2
3
https://orcid.org/0000-0002-7977-4441
Bourhis, Pierre
4
5
https://orcid.org/0000-0001-5699-0320
Mengel, Stefan
6
7
https://orcid.org/0000-0003-1386-8784
Niewerth, Matthias
8
https://orcid.org/0000-0003-2032-5374
LTCI, France
Télécom ParisTech, France
Université Paris-Saclay, France
CNRS, CRIStAL UMR 9189, France
Inria Lille, France
CNRS, France
CRIL UMR 8188, Lens, France
University of Bayreuth, Germany
We consider the information extraction framework known as document spanners, and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm which is tractable in combined complexity, i.e., in the sizes of the input document and the VA, while ensuring the best possible data complexity bounds in the input document size, i.e., constant delay in the document size. Several recent works at PODS'18 proposed such algorithms but with linear delay in the document size or with an exponential dependency on the size of the (generally nondeterministic) input VA. In particular, Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, in particular for the restricted case of so-called extended VAs.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.22/LIPIcs.ICDT.2019.22.pdf
enumeration
spanners
automata
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
23:1
23:19
10.4230/LIPIcs.ICDT.2019.23
article
Consistent Query Answering for Primary Keys in Logspace
Koutris, Paraschos
1
Wijsen, Jef
2
University of Wisconsin-Madison, WI, USA
University of Mons, Belgium
We study the complexity of consistent query answering on databases that may violate primary key constraints. A repair of such a database is any consistent database that can be obtained by deleting a minimal set of tuples. For every Boolean query q, CERTAINTY(q) is the problem that takes a database as input and asks whether q evaluates to true on every repair. In [Koutris and Wijsen, ACM TODS, 2017], the authors show that for every self-join-free Boolean conjunctive query q, the problem CERTAINTY(q) is either in P or coNP-complete, and it is decidable which of the two cases applies. In this paper, we sharpen this result by showing that for every self-join-free Boolean conjunctive query q, the problem CERTAINTY(q) is either expressible in symmetric stratified Datalog (with some aggregation operator) or coNP-complete. Since symmetric stratified Datalog is in L, we thus obtain a complexity-theoretic dichotomy between L and coNP-complete. Another new finding of practical importance is that CERTAINTY(q) is on the logspace side of the dichotomy for queries q where all join conditions express foreign-to-primary key matches, which is undoubtedly the most common type of join condition.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.23/LIPIcs.ICDT.2019.23.pdf
conjunctive queries
consistent query answering
Datalog
primary keys
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-03-19
127
24:1
24:18
10.4230/LIPIcs.ICDT.2019.24
article
Learning Definable Hypotheses on Trees
Grienenberger, Emilie
1
Ritzert, Martin
2
ENS Paris-Saclay, 61 Avenue du Président Wilson, 94230 Cachan, France
RWTH Aachen University, Templergraben 55, 52062 Aachen, Germany
We study the problem of learning properties of nodes in tree structures. Those properties are specified by logical formulas, such as formulas from first-order or monadic second-order logic. We think of the tree as a database encoding a large dataset and therefore aim for learning algorithms which depend at most sublinearly on the size of the tree. We present a learning algorithm for quantifier-free formulas where the running time only depends polynomially on the number of training examples, but not on the size of the background structure. By a previous result on strings we know that for general first-order or monadic second-order (MSO) formulas a sublinear running time cannot be achieved. However, we show that by building an index on the tree in a linear time preprocessing phase, we can achieve a learning algorithm for MSO formulas with a logarithmic learning phase.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.24/LIPIcs.ICDT.2019.24.pdf
monadic second-order logic
trees
query learning