eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
0
0
10.4230/LIPIcs.ICDT.2016
article
LIPIcs, Volume 48, ICDT'16, Complete Volume
Martens, Wim
Zeume, Thomas
LIPIcs, Volume 48, ICDT'16, Complete Volume
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016/LIPIcs.ICDT.2016.pdf
Database Management, Normal forms, Schema and subschema, Query languages, Query processing, Relational databases, Distributed databases, Heterogeneous Databases, Online Information Services, Miscellaneous – Privacy, Office Automation: Workflow management
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
0:i
0:xvi
10.4230/LIPIcs.ICDT.2016.0
article
Front Matter, Table of Contents, Preface, Conference Organization, External Reviewers, List of Authors
Martens, Wim
Zeume, Thomas
Front Matter, Table of Contents, Preface, Conference Organization, External Reviewers, List of Authors
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.0/LIPIcs.ICDT.2016.0.pdf
Front Matter
Table of Contents
Preface
Conference Organization
External Reviewers
List of Authors
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
1:1
1:2
10.4230/LIPIcs.ICDT.2016.1
article
The ICDT 2016 Test of Time Award Announcement
Afrati, Foto N.
David, Claire
Gottlob, Georg
We describe the 2016 ICDT Test of Time Award which is awarded to Chandra Chekuri and Anand Rajaraman for their 1997 ICDT paper on "Conjunctive Query Containment Revisited".
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.1/LIPIcs.ICDT.2016.1.pdf
conjunctive query
treewidth
NP-hardness
rewriting
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
2:1
2:2
10.4230/LIPIcs.ICDT.2016.2
article
Scale Independence: Using Small Data to Answer Queries on Big Data (Invited Talk)
Geerts, Floris
Large datasets introduce challenges to the scalability of query answering. Given a query Q and a dataset D, it is often prohibitively costly to compute the query answers Q(D) when D is big. To this end, one may want to use heuristics, "quick and dirty" algorithms which return approximate answers. However, in many applications it is a must to find exact query answers. So, how can we efficiently compute Q(D) when D is big or when we only have limited resources?
One idea is to find a small subset D_Q of D such that Q(D_Q)=Q(D) where the size of D_Q is independent of the size of the underlying dataset D. Intuitively, when such a D_Q can be found for a query Q, the query is said to be scale independent (Armbrust et al. 2011, Armbrust et al. 2013, Fan et al. 2014). Indeed, for answering such queries the size of the underlying database does not matter, i.e., query processing is independent of the scale of the database.
In this talk, I will survey various formalisms that enable large classes of queries to be scale independent. These formalisms primarily rely on the availability of access constraints, a combination of indexes and cardinality constraints, on the data (Fan et al. 2015, Fan et al. 2014). We will take a closer look at how, in the presence of such constraints, queries can often be compiled into efficient query plans that access a bounded amount of data (Cao et al. 2014, Fan et al. 2015), and how these techniques relate to query processing in the presence of access patterns (Benedikt et al. 2015, Benedikt et al. 2014, Deutsch et al. 2007). Finally, we illustrate that scale independent queries are quite common in practice and that they indeed can be efficiently answered on big datasets when access constraints are present (Cao et al. 2015, Cao et al. 2014).
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.2/LIPIcs.ICDT.2016.2.pdf
Scale independence
Access constraints
Query processing
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
3:1
3:1
10.4230/LIPIcs.ICDT.2016.3
article
Top-k Indexes Made Small and Sweet (Invited Talk)
Tao, Yufei
Top-k queries have become extremely popular in the database community. Such a query, which is issued on a set of elements each carrying a real-valued weight, returns the k elements with the highest weights among all the elements that satisfy a predicate. As usual, an index structure is necessary to answer a query substantially faster than accessing the whole input set.
The existing research on top-k queries can be classified into two categories. The first one, which is system-oriented, aims to devise indexes that are simple to understand and easy to implement. These indexes, typically designed with heuristics, are reasonably fast in practical applications, but do not necessarily offer strong performance guarantees - in other words, they are small but not sweet. The other category, which is theory-oriented, aims to develop indexes that promise attractive bounds on the space consumption and query overhead (sometimes also update cost). These indexes, unfortunately, are often excessively sophisticated in the adopted techniques, and are rarely applied in practice - they are sweet but not small.
This talk will discuss the progress of an on-going project that strives to take down the barrier between the two categories, by crafting a framework for acquiring simple top-k indexes with excellent performance guarantees - namely, small and sweet. This is achieved with reductions that produce top-k indexes automatically from the existing data structures for conventional reporting queries on unweighted elements (i.e., finding all elements satisfying a predicate), and/or the existing data structures for top-1 queries. Our reductions promise nearly no performance deterioration with respect to those existing structures, are general enough to be applicable to a huge variety of top-k problems, and work in both the external memory model and the RAM model.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.3/LIPIcs.ICDT.2016.3.pdf
Data Structures
Top-k
External Memory
RAM
Reductions
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
4:1
4:12
10.4230/LIPIcs.ICDT.2016.4
article
New Algorithms for Heavy Hitters in Data Streams (Invited Talk)
Woodruff, David P.
An old and fundamental problem in databases and data streams is that of finding the heavy hitters, also known as the top-k, most popular items, frequent items, elephants, or iceberg queries. There are several variants of this problem, which quantify what it means for an item to be frequent, including what are known as the l_1-heavy hitters and l_2-heavy hitters. There are a number of algorithmic solutions for these problems, starting with the work of Misra and Gries, as well as the CountMin and CountSketch data structures, among others.
In this paper (accompanying an invited talk) we cover several recent results developed in this area, which improve upon the classical solutions to these problems. In particular, we develop new algorithms for finding l_1-heavy hitters and l_2-heavy hitters, with significantly less memory required than what was known, and which are optimal in a number of parameter regimes.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.4/LIPIcs.ICDT.2016.4.pdf
data streams
heavy hitters
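For intuition, the classical Misra-Gries algorithm cited in the abstract above (one of the starting points the paper improves on) can be sketched as follows; this is a textbook illustration, not the paper's new algorithms:

```python
def misra_gries(stream, k):
    """Misra-Gries summary: guarantees that every item occurring
    more than len(stream)/k times survives, using at most k-1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement all counters; drop any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
# "a" occurs 5 > 9/3 times, so for k = 3 it must appear in the summary.
print(misra_gries(stream, 3))
```

The summary is one-pass and uses O(k) space, which is exactly the regime the l_1-heavy hitters results refine.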
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
5:1
5:18
10.4230/LIPIcs.ICDT.2016.5
article
Beyond Well-designed SPARQL
Kaminski, Mark
Kostylev, Egor V.
SPARQL is the standard query language for RDF data. The distinctive feature of SPARQL is the OPTIONAL operator, which allows for partial answers when complete answers are not available due to lack of information. However, optional matching is computationally expensive - query answering is PSPACE-complete. The well-designed fragment of SPARQL achieves much better computational properties by restricting the use of optional matching - query answering becomes coNP-complete. However, well-designed SPARQL captures far from all real-life queries - in fact, only about half of the queries over DBpedia that use OPTIONAL are well-designed.
In the present paper, we study queries outside of well-designed SPARQL. We introduce the class of weakly well-designed queries that subsumes well-designed queries and includes most common meaningful non-well-designed queries: our analysis shows that the new fragment captures about 99% of DBpedia queries with OPTIONAL. At the same time, query answering for weakly well-designed SPARQL remains coNP-complete, and our fragment is in a certain sense maximal for this complexity. We show that the fragment's expressive power is strictly in-between well-designed and full SPARQL. Finally, we provide an intuitive normal form for weakly well-designed queries and study the complexity of containment and equivalence.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.5/LIPIcs.ICDT.2016.5.pdf
RDF
Query languages
SPARQL
Optional matching
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
6:1
6:17
10.4230/LIPIcs.ICDT.2016.6
article
A Framework for Estimating Stream Expression Cardinalities
Dasgupta, Anirban
Lang, Kevin J.
Rhodes, Lee
Thaler, Justin
Given m distributed data streams A_1,..., A_m, we consider the problem of estimating the number of unique identifiers in streams defined by set expressions over A_1,..., A_m. We identify a broad class of algorithms for solving this problem, and show that the estimators output by any algorithm in this class are perfectly unbiased and satisfy strong variance bounds. Our analysis unifies and generalizes a variety of earlier results in the literature. To demonstrate its generality, we describe several novel sampling algorithms in our class, and show that they achieve a novel tradeoff between accuracy, space usage, update speed, and applicability.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.6/LIPIcs.ICDT.2016.6.pdf
sketching
data stream algorithms
mergeability
distinct elements
set operations
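The kind of mergeable distinct-count sketch that the abstract above generalizes can be illustrated with a textbook k-minimum-values (KMV) sketch; the class and parameter names here are illustrative assumptions, not the authors' estimators:

```python
import hashlib

def _hash01(x):
    """Deterministic hash of x into [0, 1)."""
    h = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64

class KMV:
    """k-minimum-values sketch for distinct counting; two sketches with
    the same k can be merged to estimate the distinct count of a union."""
    def __init__(self, k):
        self.k = k
        self.mins = set()  # up to k smallest hash values seen

    def add(self, x):
        self.mins.add(_hash01(x))
        if len(self.mins) > self.k:
            self.mins.remove(max(self.mins))

    def merge(self, other):
        out = KMV(self.k)
        out.mins = set(sorted(self.mins | other.mins)[:self.k])
        return out

    def estimate(self):
        if len(self.mins) < self.k:
            return len(self.mins)  # exact while under capacity
        return (self.k - 1) / max(self.mins)

a, b = KMV(64), KMV(64)
for i in range(10):
    a.add(i)
for i in range(5, 15):
    b.add(i)
print(a.merge(b).estimate())  # 15 distinct elements in the union
```

While the sketch holds fewer than k hashes it is exact; past capacity it returns the standard (k-1)/h_k estimate, where h_k is the k-th smallest hash value.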
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
7:1
7:19
10.4230/LIPIcs.ICDT.2016.7
article
Declarative Probabilistic Programming with Datalog
Barany, Vince
ten Cate, Balder
Kimelfeld, Benny
Olteanu, Dan
Vagena, Zografoula
Probabilistic programming languages are used for developing statistical models, and they typically consist of two components: a specification of a stochastic process (the prior), and a specification of observations that restrict the probability space to a conditional subspace (the posterior). Use cases of such formalisms include the development of algorithms in machine learning and artificial intelligence. We propose and investigate an extension of Datalog for specifying statistical models, and establish a declarative probabilistic-programming paradigm over databases. Our proposed extension provides convenient mechanisms to include common numerical probability functions; in particular, conclusions of rules may contain values drawn from such functions. The semantics of a program is a probability distribution over the possible outcomes of the input database with respect to the program. Observations are naturally incorporated by means of integrity constraints over the extensional and intensional relations. The resulting semantics is robust under different chases and invariant to rewritings that preserve logical equivalence.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.7/LIPIcs.ICDT.2016.7.pdf
Chase
Datalog
probability measure space
probabilistic programming
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
8:1
8:18
10.4230/LIPIcs.ICDT.2016.8
article
Worst-Case Optimal Algorithms for Parallel Query Processing
Koutris, Paraschos
Beame, Paul
Suciu, Dan
In this paper, we study the communication complexity for the problem of computing a conjunctive query on a large database in a parallel setting with p servers. In contrast to previous work, where upper and lower bounds on the communication were specified for particular structures of data (either data without skew, or data with specific types of skew), in this work we focus on worst-case analysis of the communication cost. The goal is to find worst-case optimal parallel algorithms, similar to the work of (Ngo et al. 2012) for sequential algorithms.
We first show that for a single round we can obtain an optimal worst-case algorithm. The optimal load for a conjunctive query q when all relations have size equal to M is O(M/p^{1/psi^*}), where psi^* is a new query-related quantity called the edge quasi-packing number, which is different from both the edge packing number and edge cover number of the query hypergraph. For multiple rounds, we present algorithms that are optimal for several classes of queries. Finally, we show a surprising connection to the external memory model, which allows us to translate parallel algorithms to external memory algorithms. This technique allows us to recover (within a polylogarithmic factor) several recent results on the I/O complexity for computing join queries, and also obtain optimal algorithms for other classes of queries.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.8/LIPIcs.ICDT.2016.8.pdf
conjunctive query
parallel computation
worst-case bounds
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
9:1
9:17
10.4230/LIPIcs.ICDT.2016.9
article
Parallel-Correctness and Containment for Conjunctive Queries with Union and Negation
Geck, Gaetano
Ketsman, Bas
Neven, Frank
Schwentick, Thomas
Single-round multiway join algorithms first reshuffle data over many servers and then evaluate the query at hand in a parallel and communication-free way. A key question is whether a given distribution policy for the reshuffle is adequate for computing a given query, also referred to as parallel-correctness. This paper extends the study of the complexity of parallel-correctness and its constituents, parallel-soundness and parallel-completeness, to unions of conjunctive queries with and without negation. As a by-product it is shown that the containment problem for conjunctive queries with negation is coNEXPTIME-complete.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.9/LIPIcs.ICDT.2016.9.pdf
Conjunctive queries
distributed evaluation
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
10:1
10:17
10.4230/LIPIcs.ICDT.2016.10
article
A Formal Study of Collaborative Access Control in Distributed Datalog
Abiteboul, Serge
Bourhis, Pierre
Vianu, Victor
We formalize and study a declaratively specified collaborative access control mechanism for data dissemination in a distributed environment. Data dissemination is specified using distributed datalog. Access control is also defined by datalog-style rules, at the relation level for extensional relations, and at the tuple level for intensional ones, based on the derivation of tuples. The model also includes a mechanism for "declassifying" data, which allows circumventing overly restrictive access control. We consider the complexity of determining whether a peer is allowed to access a given fact, and address the problem of achieving the goal of disseminating certain information under some access control policy. We also investigate the problem of information leakage, which occurs when a peer is able to infer facts to which the peer is not allowed access by the policy. Finally, we consider access control extended to facts equipped with provenance information, motivated by the many applications where such information is required. We provide semantics for access control with provenance, and establish the complexity of determining whether a peer may access a given fact together with its provenance. This work is motivated by the access control of the Webdamlog system, whose core features it formalizes.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.10/LIPIcs.ICDT.2016.10.pdf
Distributed datalog
access control
provenance
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
11:1
11:17
10.4230/LIPIcs.ICDT.2016.11
article
It's All a Matter of Degree: Using Degree Information to Optimize Multiway Joins
Joglekar, Manas R.
Ré, Christopher M.
We optimize multiway equijoins on relational tables using degree information. We give a new bound that uses degree information to more tightly bound the maximum output size of a query. On real data, our bound on the number of triangles in a social network can be up to 95 times tighter than existing worst case bounds. We show that using only a constant amount of degree information, we are able to obtain join algorithms with a running time that has a smaller exponent than existing algorithms - for any database instance. We also show that this degree information can be obtained in nearly linear time, which yields asymptotically faster algorithms in the serial setting and lower-communication algorithms in the MapReduce setting.
In the serial setting, the data complexity of join processing can be expressed as a function O(IN^x + OUT) in terms of input size IN and output size OUT in which x depends on the query. An upper bound for x is given by fractional hypertreewidth. We are interested in situations in which we can get algorithms for which x is strictly smaller than the fractional hypertreewidth. We say that a join can be processed in subquadratic time if x < 2. Building on the AYZ algorithm for processing cycle joins in quadratic time, for a restricted class of joins which we call 1-series-parallel graphs, we obtain a complete decision procedure for identifying subquadratic solvability (subject to the 3-SUM problem requiring quadratic time). Our 3-SUM based quadratic lower bound is tight, making it the only known tight bound for joins that does not require any assumption about the matrix multiplication exponent omega. We also give a MapReduce algorithm that meets our improved communication bound and handles essentially optimal parallelism.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.11/LIPIcs.ICDT.2016.11.pdf
Joins
Degree
MapReduce
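As a small illustration of how degree information helps triangle processing (the motivating example in the abstract above), the classic degree-ordered wedge-checking algorithm counts triangles in O(m^{3/2}) time; this textbook routine is not the paper's degree-based join algorithm:

```python
from collections import defaultdict
from itertools import combinations

def count_triangles(edges):
    """Counts triangles by orienting each edge from its lower-degree
    endpoint to its higher-degree endpoint, then testing each wedge;
    the degree ordering bounds every out-degree by O(sqrt(m))."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    rank = lambda v: (deg[v], v)  # break degree ties by vertex id
    out = defaultdict(set)
    edgeset = set(frozenset(e) for e in edges)
    for u, v in edges:
        if rank(u) < rank(v):
            out[u].add(v)
        else:
            out[v].add(u)
    count = 0
    for u in out:
        for v, w in combinations(out[u], 2):
            if frozenset((v, w)) in edgeset:
                count += 1
    return count

# The complete graph K4 contains 4 triangles.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(count_triangles(edges))  # 4
```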
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
12:1
12:18
10.4230/LIPIcs.ICDT.2016.12
article
Filtering With the Crowd: CrowdScreen Revisited
Groz, Benoit
Levin, Ezra
Meilijson, Isaac
Milo, Tova
Filtering a set of items, based on a set of properties that can be verified by humans, is a common application of CrowdSourcing. When the workers are error-prone, each item is presented to multiple users, to limit the probability of misclassification. Since the Crowd is a relatively expensive resource, minimizing the number of questions per item may naturally result in big savings. Several algorithms to address this minimization problem have been presented in the CrowdScreen framework by Parameswaran et al. However, those algorithms do not scale well and therefore cannot be used in scenarios where high accuracy is required in spite of high user error rates. The goal of this paper is thus to devise algorithms that can cope with such situations. To achieve this, we provide new theoretical insights to the problem, then use them to develop a new efficient algorithm. We also propose novel optimizations for the algorithms of CrowdScreen that improve their scalability. We complement our theoretical study by an experimental evaluation of the algorithms on a large set of synthetic parameters as well as real-life crowdsourcing scenarios, demonstrating the advantages of our solution.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.12/LIPIcs.ICDT.2016.12.pdf
CrowdSourcing
filtering
algorithms
sprt
hypothesis testing
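As a toy illustration of why asking several error-prone workers per item helps (the setting of the abstract above), one can compute the misclassification probability of a simple majority vote; the fixed vote count and the 10% error rate are illustrative assumptions, not the paper's adaptive strategies:

```python
from math import comb

def majority_error(n, p):
    """Probability that a majority of n independent workers, each
    erring with probability p, misclassifies an item (n odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# With a 10% per-worker error rate, 5 votes already push the
# misclassification probability below 1%.
print(majority_error(1, 0.10))
print(majority_error(5, 0.10))
```

Minimizing the number of questions while meeting an error target, as CrowdScreen does, amounts to replacing this fixed-size vote with an adaptive stopping rule.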
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
13:1
13:18
10.4230/LIPIcs.ICDT.2016.13
article
Streaming Partitioning of Sequences and Trees
Konrad, Christian
We study streaming algorithms for partitioning integer sequences and trees. In the case of trees, we suppose that the input tree is provided by a stream consisting of a depth-first-traversal of the input tree. This captures the problem of partitioning XML streams, among other problems.
We show that both problems admit deterministic (1+epsilon)-approximation streaming algorithms, where a single pass is sufficient for integer sequences and two passes are required for trees. The space complexity for partitioning integer sequences is O((1/epsilon) * p * log(nm)) and for partitioning trees is O((1/epsilon) * p^2 * log(nm)), where n is the length of the input stream, m is the maximal weight of an element in the stream, and p is the number of partitions to be created.
Furthermore, for the problem of partitioning integer sequences, we show that computing an optimal solution in one pass requires Omega(n) space, and computing a (1+epsilon)-approximation in one pass requires Omega((1/epsilon) * log(n)) space, rendering our algorithm tight for instances with p,m in O(1).
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.13/LIPIcs.ICDT.2016.13.pdf
Streaming Algorithms
XML Documents
Data Partitioning
Communication Complexity
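The offline version of the partitioning objective in the abstract above (split a sequence into p contiguous parts minimizing the maximum part weight) can be sketched with binary search over the bottleneck value plus a greedy feasibility check, assuming integer weights; the paper's streaming algorithms approximate this without storing the whole sequence:

```python
def min_max_partition(seq, p):
    """Optimal bottleneck for partitioning a sequence of non-negative
    integer weights into at most p contiguous parts."""
    def feasible(bound):
        parts, load = 1, 0
        for w in seq:
            if w > bound:
                return False
            if load + w > bound:
                parts, load = parts + 1, w
            else:
                load += w
        return parts <= p

    lo, hi = max(seq), sum(seq)
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

# Best 2-way split of [1,2,3,4,5] is [1,2,3] | [4,5], bottleneck 9.
print(min_max_partition([1, 2, 3, 4, 5], 2))  # 9
```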
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
14:1
14:18
10.4230/LIPIcs.ICDT.2016.14
article
Dynamic Graph Queries
Muñoz, Pablo
Vortmeier, Nils
Zeume, Thomas
Graph databases in many applications - semantic web, transport or biological networks among others - are not only large, but also frequently modified. Evaluating graph queries in this dynamic context is a challenging task, as those queries often combine first-order and navigational features.
Motivated by recent results on maintaining dynamic reachability, we study the dynamic evaluation of traditional query languages for graphs in the descriptive complexity framework. Our focus is on maintaining regular path queries, and extensions thereof, by first-order formulas. In particular we are interested in path queries defined by non-regular languages and in extended conjunctive regular path queries (which allow comparing labels of paths based on word relations). Further we study the closely related problems of maintaining distances in graphs and reachability in product graphs.
In this preliminary study we obtain upper bounds for those problems in restricted settings, such as undirected and acyclic graphs, or under insertions only, and negative results regarding quantifier-free update formulas. In addition we point out interesting directions for further research.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.14/LIPIcs.ICDT.2016.14.pdf
Dynamic descriptive complexity
graph databases
graph products
reachability
path queries
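For intuition on the insertions-only setting discussed in the abstract above, undirected reachability under edge insertions can be maintained incrementally with a union-find structure; this toy sketch is a standard illustration, not the paper's first-order update formulas:

```python
class IncrementalReachability:
    """Maintains undirected reachability under edge insertions
    using union-find with path compression."""
    def __init__(self):
        self.parent = {}

    def _find(self, v):
        self.parent.setdefault(v, v)
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def insert_edge(self, u, v):
        ru, rv = self._find(u), self._find(v)
        if ru != rv:
            self.parent[ru] = rv

    def reachable(self, u, v):
        return self._find(u) == self._find(v)

g = IncrementalReachability()
g.insert_edge(1, 2)
g.insert_edge(3, 4)
print(g.reachable(1, 3))  # False
g.insert_edge(2, 3)
print(g.reachable(1, 4))  # True
```

Each update and query is nearly constant-time amortized; the hard cases studied in the paper (directed graphs, deletions, label constraints) are precisely where such simple connectivity merging no longer suffices.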
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
15:1
15:19
10.4230/LIPIcs.ICDT.2016.15
article
Verification of Evolving Graph-structured Data under Expressive Path Constraints
Calvanese, Diego
Ortiz, Magdalena
Šimkus, Mantas
Integrity constraints play a central role in databases and, among other applications, are fundamental for preserving data integrity when databases evolve as a result of operations manipulating the data. In this context, an important task is that of static verification, which consists in deciding whether a given set of constraints is preserved after the execution of a given sequence of operations, for every possible database satisfying the initial constraints. In this paper, we consider constraints over graph-structured data formulated in an expressive Description Logic (DL) that allows for regular expressions over binary relations and their inverses, generalizing many of the well-known path constraint languages proposed for semi-structured data in the last two decades. In this setting, we study the problem of static verification, for operations expressed in a simple yet flexible language built from additions and deletions of complex DL expressions. We establish undecidability of the general setting, and identify suitable restricted fragments for which we obtain tight complexity results, building on techniques developed in our previous work for simpler DLs. As a by-product, we obtain new (un)decidability results for the implication problem of path constraints, and improve previous upper bounds on the complexity of the problem.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.15/LIPIcs.ICDT.2016.15.pdf
Path constraints
Description Logics
Graph databases
Static verification
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
16:1
16:18
10.4230/LIPIcs.ICDT.2016.16
article
Query Stability in Monotonic Data-Aware Business Processes
Savkovic, Ognjen
Marengo, Elisa
Nutt, Werner
Organizations continuously accumulate data, often according to some business processes. If one poses a query over such data for decision support, it is important to know whether the query is stable, that is, whether the answers will stay the same or may change in the future because business processes may add further data. We investigate query stability for conjunctive queries. To this end, we define a formalism that combines an explicit representation of the control flow of a process with a specification of how data is read and inserted into the database. We consider different restrictions of the process model and the state of the system, such as negation in conditions, cyclic executions, read access to written data, presence of pending process instances, and the possibility to start fresh process instances. We identify for which restriction combinations stability of conjunctive queries is decidable and provide encodings into variants of Datalog that are optimal with respect to the worst-case complexity of the problem.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.16/LIPIcs.ICDT.2016.16.pdf
Business Processes
Query Stability
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
17:1
17:17
10.4230/LIPIcs.ICDT.2016.17
article
Document Spanners: From Expressive Power to Decision Problems
Freydenberger, Dominik D.
Holldack, Mario
We examine document spanners, a formal framework for information extraction that was introduced by Fagin et al. (PODS 2013). A document spanner is a function that maps an input string to a relation over spans (intervals of positions of the string). We focus on document spanners that are defined by regex formulas, which are basically regular expressions that map matched subexpressions to corresponding spans, and on core spanners, which extend the former by standard algebraic operators and string equality selection.
First, we compare the expressive power of core spanners to three models - namely, patterns, word equations, and a rich and natural subclass of extended regular expressions (regular expressions with a repetition operator). These results are then used to analyze the complexity of query evaluation and various aspects of static analysis of core spanners. Finally, we examine the relative succinctness of different kinds of representations of core spanners and relate this to the simplification of core spanners that are extended with difference operators.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.17/LIPIcs.ICDT.2016.17.pdf
Information extraction
document spanners
regular expressions
regex
patterns
word equations
decision problems
descriptional complexity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
18:1
18:18
10.4230/LIPIcs.ICDT.2016.18
article
Algorithms for Provisioning Queries and Analytics
Assadi, Sepehr
Khanna, Sanjeev
Li, Yang
Tannen, Val
Provisioning is a technique for avoiding repeated expensive computations in what-if analysis. Given a query, an analyst formulates k hypotheticals, each retaining some of the tuples of a database instance, possibly overlapping, and she wishes to answer the query under various scenarios, where a scenario is defined by a subset of the hypotheticals that are "turned on". We say that a query admits compact provisioning if given any database instance and any k hypotheticals, one can create a poly-size (in k) sketch that can then be used to answer the query under any of the 2^k possible scenarios without accessing the original instance.
In this paper, we focus on provisioning complex queries that combine relational algebra (the logical component), grouping, and statistics/analytics (the numerical component). We first show that queries that compute quantiles or linear regression (as well as simpler queries that compute count and sum/average of positive values) can be compactly provisioned to provide (multiplicative) approximate answers to an arbitrary precision. In contrast, exact provisioning for each of these statistics requires the sketch size to be exponential in k. We then establish that for any complex query whose logical component is a positive relational algebra query, as long as the numerical component can be compactly provisioned, the complex query itself can be compactly provisioned. On the other hand, introducing negation or recursion in the logical component again requires the sketch size to be exponential in k. While our positive results use algorithms that do not access the original instance after a scenario is known, we prove our lower bounds even for the case when, knowing the scenario, limited access to the instance is allowed.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.18/LIPIcs.ICDT.2016.18.pdf
What-if Analysis
Provisioning
Data Compression
Approximate Query Answering
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
19:1
19:17
10.4230/LIPIcs.ICDT.2016.19
article
Limits of Schema Mappings
Kolaitis, Phokion G.
Pichler, Reinhard
Sallinger, Emanuel
Savenkov, Vadim
Schema mappings have been extensively studied in the context of data exchange and data integration, where they have turned out to be the right level of abstraction for formalizing data interoperability tasks. Up to now and for the most part, schema mappings have been studied as static objects, in the sense that each time the focus has been on a single schema mapping of interest or, in the case of composition, on a pair of schema mappings of interest.
In this paper, we adopt a dynamic viewpoint and embark on a study of sequences of schema mappings and of the limiting behavior of such sequences. To this effect, we first introduce a natural notion of distance on sets of finite target instances that expresses how "close" two sets of target instances are as regards the certain answers of conjunctive queries on these sets. Using this notion of distance, we investigate pointwise limits and uniform limits of sequences of schema mappings, as well as the companion notions of pointwise Cauchy and uniformly Cauchy sequences of schema mappings. We obtain a number of results about the limits of sequences of GAV schema mappings and the limits of sequences of LAV schema mappings that reveal striking differences between these two classes of schema mappings. We also consider the completion of the metric space of sets of target instances and obtain concrete representations of limits of sequences of schema mappings in terms of generalized schema mappings, i.e., schema mappings with infinite target instances as solutions to (finite) source instances.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.19/LIPIcs.ICDT.2016.19.pdf
Limit
Pointwise convergence
Uniform convergence
Schema mapping
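The abstract above introduces a distance on sets of target instances based on the certain answers of conjunctive queries. The paper's exact definition is not reproduced here; the following is only an illustrative sketch of the common recipe for such metrics, assuming the conjunctive queries are enumerated as q_0, q_1, ... and each set of target instances is represented by its sequence of certain answers:

```python
def distance(certain_x, certain_y):
    """Toy distance between two sets of target instances.

    `certain_x` and `certain_y` are sequences whose k-th entry is the
    certain answer of query q_k on the respective set of instances.
    Following a standard construction (an assumption for illustration,
    not the paper's definition), the distance is 2**-k for the first
    index k where the certain answers differ, and 0 if they agree on
    every query -- so sets agreeing on more (smaller) queries are closer.
    """
    for k, (ans_x, ans_y) in enumerate(zip(certain_x, certain_y)):
        if ans_x != ans_y:
            return 2.0 ** -k
    return 0.0
```

Under such a metric, pointwise and uniform limits of sequences of schema mappings can be phrased in the usual analytic way: a sequence converges when the distances between the induced sets of solutions go to zero.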
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
20:1
20:18
10.4230/LIPIcs.ICDT.2016.20
article
Reasoning About Integrity Constraints for Tree-Structured Data
Czerwinski, Wojciech
David, Claire
Murlak, Filip
Parys, Pawel
We study a class of integrity constraints for tree-structured data modelled as data trees, whose nodes have a label from a finite alphabet and store a data value from an infinite data domain. The constraints require each tuple of nodes selected by a conjunctive query (using navigational axes and labels) to satisfy a positive combination of equalities and a positive combination of inequalities over the stored data values. Such constraints are instances of the general framework of XML-to-relational constraints proposed recently by Niewerth and Schwentick. They cover some common classes of constraints, including W3C XML Schema key and unique constraints, as well as domain restrictions and denial constraints, but cannot express inclusion constraints, such as reference keys. Our main result is that consistency of such integrity constraints with respect to a given schema (modelled as a tree automaton) is decidable. An easy extension gives decidability for the entailment problem. Equivalently, we show that validity and containment of unions of conjunctive queries using navigational axes, labels, data equalities and inequalities is decidable, as long as none of the conjunctive queries uses both equalities and inequalities; without this restriction, both problems are known to be undecidable. In the context of XML data exchange, our result can be used to establish decidability for a consistency problem for XML schema mappings. All the decision procedures are doubly exponential, with matching lower bounds. The complexity may be lowered to singly exponential, when conjunctive queries are replaced by tree patterns, and the number of data comparisons is bounded.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.20/LIPIcs.ICDT.2016.20.pdf
data trees
integrity constraints
unions of conjunctive queries
schema mappings
entailment
containment
consistency
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
21:1
21:18
10.4230/LIPIcs.ICDT.2016.21
article
Complexity of Repair Checking and Consistent Query Answering
Arming, Sebastian
Pichler, Reinhard
Sallinger, Emanuel
Inconsistent databases (i.e., databases violating some given set of integrity constraints) may arise in many applications such as, for instance, data integration. Hence, the handling of inconsistent data has evolved as an active field of research. In this paper, we consider two fundamental problems in this context: Repair Checking (RC) and Consistent Query Answering (CQA).
So far, these problems have mainly been studied from the point of view of data complexity (where all parts of the input except for the database are considered as fixed). While combined complexity (where all parts of the input are allowed to vary) has also been considered for some kinds of integrity constraints, for several other kinds it has been left unexplored. Moreover, a more detailed analysis (keeping other parts of the input fixed - e.g., the constraints only) is completely missing.
The goal of our work is a thorough analysis of the complexity of the RC and CQA problems. Our contribution is a complete picture of the complexity of these problems for a wide range of integrity constraints. Our analysis thus allows us to get a better understanding of the true sources of complexity.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.21/LIPIcs.ICDT.2016.21.pdf
inconsistency
consistent query answering
complexity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
22:1
22:18
10.4230/LIPIcs.ICDT.2016.22
article
On the Complexity of Enumerating the Answers to Well-designed Pattern Trees
Kröll, Markus
Pichler, Reinhard
Skritek, Sebastian
Well-designed pattern trees (wdPTs) have been introduced as an extension of conjunctive queries to allow for partial matching - analogously to the OPTIONAL operator of the semantic web query language SPARQL. Several computational problems of wdPTs have been studied in recent years, such as the evaluation problem in various settings, the counting problem, as well as static analysis tasks including the containment and equivalence problems. Restrictions needed to achieve tractability of these tasks have also been proposed. In contrast, the problem of enumerating the answers to a wdPT has been largely ignored so far. In this work, we embark on a systematic study of the complexity of the enumeration problem of wdPTs. As our main result, we identify several tractable and intractable cases of this problem, both from a classical complexity point of view and from a parameterized complexity point of view.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.22/LIPIcs.ICDT.2016.22.pdf
SPARQL
Pattern Trees
CQs
Enumeration
Complexity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2016-03-14
48
23:1
23:17
10.4230/LIPIcs.ICDT.2016.23
article
A Practically Efficient Algorithm for Generating Answers to Keyword Search Over Data Graphs
Golenberg, Konstantin
Sagiv, Yehoshua
In keyword search over a data graph, an answer is a non-redundant subtree that contains all the keywords of the query. A naive approach to producing all the answers by increasing height is to generalize Dijkstra's algorithm to enumerating all acyclic paths by increasing weight. The idea of freezing is introduced so that (most) non-shortest paths are generated only if they are actually needed for producing answers. The resulting algorithm for generating subtrees, called GTF, is subtle and its proof of correctness is intricate. Extensive experiments show that GTF outperforms existing systems, even ones that for efficiency's sake are incomplete (i.e., cannot produce all the answers). In particular, GTF is scalable and performs well even on large data graphs and when many answers are needed.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol048-icdt2016/LIPIcs.ICDT.2016.23/LIPIcs.ICDT.2016.23.pdf
Keyword search over data graphs
subtree enumeration by height
top-k answers
efficiency
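The abstract above mentions, as the naive baseline, generalizing Dijkstra's algorithm to enumerate all acyclic paths by increasing weight. A minimal sketch of that baseline (not of GTF itself, and not the paper's freezing optimization) is the following, assuming the graph is given as an adjacency map of `(neighbor, weight)` pairs:

```python
import heapq

def paths_by_weight(graph, source):
    """Yield acyclic paths from `source` in order of increasing total weight.

    This is the naive generalization of Dijkstra's algorithm: instead of
    keeping only the best path per node, every acyclic path stays on the
    priority queue, so paths are popped off in weight order.  Without an
    optimization such as the paper's freezing, the queue can grow
    exponentially, which is exactly the inefficiency GTF addresses.
    """
    heap = [(0, (source,))]  # (total weight, path as a tuple of nodes)
    while heap:
        weight, path = heapq.heappop(heap)
        yield weight, path
        for neighbor, edge_weight in graph.get(path[-1], []):
            if neighbor not in path:  # keep paths acyclic
                heapq.heappush(heap, (weight + edge_weight, path + (neighbor,)))
```

For example, on the graph `{'a': [('b', 1), ('c', 3)], 'b': [('c', 1)]}` the generator yields the paths `('a',)`, `('a','b')`, `('a','b','c')`, `('a','c')` with weights 0, 1, 2, 3.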