eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
1
438
10.4230/LIPIcs.ICDT.2021
article
LIPIcs, Volume 186, ICDT 2021, Complete Volume
Yi, Ke
1
https://orcid.org/0000-0002-2178-3716
Wei, Zhewei
2
https://orcid.org/0000-0003-3620-5086
The Hong Kong University of Science and Technology, Hong Kong
Renmin University of China, China
LIPIcs, Volume 186, ICDT 2021, Complete Volume
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021/LIPIcs.ICDT.2021.pdf
LIPIcs, Volume 186, ICDT 2021, Complete Volume
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
0:i
0:xvi
10.4230/LIPIcs.ICDT.2021.0
article
Front Matter, Table of Contents, Preface, Conference Organization
Yi, Ke
1
https://orcid.org/0000-0002-2178-3716
Wei, Zhewei
2
https://orcid.org/0000-0003-3620-5086
The Hong Kong University of Science and Technology, Hong Kong
Renmin University of China, China
Front Matter, Table of Contents, Preface, Conference Organization
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.0/LIPIcs.ICDT.2021.0.pdf
Front Matter
Table of Contents
Preface
Conference Organization
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
1:1
1:1
10.4230/LIPIcs.ICDT.2021.1
article
Explainability Queries for ML Models and its Connections with Data Management Problems (Invited Talk)
Barceló, Pablo
1
Universidad Católica de Chile, Macul, Chile
In this talk I will present two recent examples of my research on explainability problems over machine learning (ML) models. In rough terms, these explainability problems deal with specific queries one poses over an ML model in order to obtain meaningful justifications for its results. Both of the examples I will present deal with “local” and “post-hoc” explainability queries. Here “local” means that we intend to explain the output of the ML model for a particular input, while “post-hoc” refers to the fact that the explanation is obtained after the model is trained. In the process I will also establish connections with problems studied in data management, with the intention of suggesting new possibilities for cross-fertilization between the area and ML.
The first example I will present refers to computing explanations with scores based on Shapley values, in particular with the recently proposed, and already influential, SHAP-score. This score provides a measure of how different features in the input contribute to the output of the ML model. We provide a detailed analysis of the complexity of this problem for different classes of Boolean circuits. In particular, we show that the problem of computing SHAP-scores is tractable as long as the circuit is deterministic and decomposable, but becomes computationally hard if either of these restrictions is lifted. The tractability part of this result provides a generalization of a recent result stating that, for Boolean hierarchical conjunctive queries, the Shapley value of the contribution of a tuple in the database to the final result can be computed in polynomial time.
The second example I will present refers to the comparison of different ML models in terms of important families of (local and post-hoc) explainability queries. For the models, I will consider multi-layer perceptrons and binary decision diagrams. The main object of study will be the computational complexity of the aforementioned queries over such models. The obtained results will show an interesting theoretical counterpart to conventional wisdom’s claims on interpretability. This work also suggests the need for developing query languages that support the process of retrieving explanations from ML models, and for obtaining general tractability results for such languages over specific classes of models.
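The Shapley-based scores discussed above can be illustrated by a brute-force computation over feature subsets. The following sketch is purely illustrative (exponential in the number of features, unlike the tractable circuit-based algorithms of the talk); the model, feature names, and baseline are hypothetical.

```python
from itertools import combinations
from math import factorial

def shap_score(model, features, instance, baseline, i):
    """Exact Shapley-style score of feature i by subset enumeration.
    A coalition S evaluates the model with features in S set to the
    instance's values and all other features set to the baseline."""
    n = len(features)
    others = [f for f in features if f != i]

    def v(coalition):
        x = {f: (instance[f] if f in coalition else baseline[f]) for f in features}
        return model(x)

    total = 0.0
    for k in range(n):
        for S in combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            total += weight * (v(set(S) | {i}) - v(set(S)))
    return total

# hypothetical toy model: the conjunction x1 AND x2
model = lambda x: int(x["x1"] and x["x2"])
feats = ["x1", "x2"]
score = shap_score(model, feats, {"x1": 1, "x2": 1}, {"x1": 0, "x2": 0}, "x1")
```

For this symmetric conjunction, each of the two features receives half of the total contribution.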
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.1/LIPIcs.ICDT.2021.1.pdf
ML models
Explainability
Shapley values
decision trees
OBDDs
deterministic and decomposable Boolean circuits
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
2:1
2:1
10.4230/LIPIcs.ICDT.2021.2
article
Comparing Apples and Oranges: Fairness and Diversity in Ranking (Invited Talk)
Stoyanovich, Julia
1
New York University, NY, USA
Algorithmic rankers take a collection of candidates as input and produce a ranking (permutation) of the candidates as output. The simplest kind of ranker is score-based; it computes a score of each candidate independently and returns the candidates in score order. Another common kind of ranker is learning-to-rank, where supervised learning is used to predict the ranking of unseen candidates. For both kinds of rankers, we may output the entire permutation or only the highest scoring k candidates, the top-k. Set selection is a special case of ranking that ignores the relative order among the top-k.
In the past few years, there has been much work on incorporating fairness and diversity requirements into algorithmic rankers, with contributions coming from the data management, algorithms, information retrieval, and recommender systems communities. In my talk I will offer a broad perspective that connects formalizations and algorithmic approaches across subfields, grounding them in a common narrative around the value frameworks that motivate specific fairness- and diversity-enhancing interventions. I will discuss some recent and ongoing work, and will outline future research directions where the data management community is well-positioned to make lasting impact, especially if we attack these problems with our rich theory-meets-systems toolkit.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.2/LIPIcs.ICDT.2021.2.pdf
fairness
diversity
ranking
set selection
responsible data management
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
3:1
3:23
10.4230/LIPIcs.ICDT.2021.3
article
Box Covers and Domain Orderings for Beyond Worst-Case Join Processing
Alway, Kaleb
1
Blais, Eric
1
Salihoglu, Semih
1
University of Waterloo, Canada
Recent beyond worst-case optimal join algorithms Minesweeper and its generalization Tetris have brought the theory of indexing and join processing together by developing a geometric framework for joins. These algorithms take as input an index ℬ, referred to as a box cover, that stores output gaps that can be inferred from traditional indexes, such as B+ trees or tries, on the input relations. The performance of these algorithms depends heavily on the certificate of ℬ, which is the smallest subset of gaps in ℬ whose union covers all of the gaps in the output space of a query Q. Different box covers can have certificates of different sizes, and the sizes of both the box covers and the certificates depend heavily on the ordering of the domain values of the attributes in Q. We study how to generate box covers that contain small certificates so as to guarantee efficient runtimes for these algorithms. First, given a query Q over a set of relations of size N and a fixed set of domain orderings for the attributes, we give an Õ(N)-time algorithm called GAMB which generates a box cover for Q that is guaranteed to contain the smallest certificate across all box covers for Q. Second, we show that finding a domain ordering that minimizes the box cover and certificate sizes is NP-hard, via a reduction from the 2-consecutive block minimization problem on Boolean matrices. Our third contribution is an Õ(N)-time approximation algorithm called ADORA to compute domain orderings under which one can compute a box cover of size Õ(K^r), where K is the minimum box cover size for Q under any domain ordering and r is the maximum arity of any relation. This guarantees certificates of size Õ(K^r). We combine ADORA and GAMB with Tetris to form a new algorithm we call TetrisReordered, which provides several new beyond worst-case bounds. On infinite families of queries, TetrisReordered’s runtimes are unboundedly better than the bounds stated in prior work.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.3/LIPIcs.ICDT.2021.3.pdf
Beyond worst-case join algorithms
Tetris
Box covers
Domain orderings
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
4:1
4:19
10.4230/LIPIcs.ICDT.2021.4
article
A Purely Regular Approach to Non-Regular Core Spanners
Schmid, Markus L.
1
https://orcid.org/0000-0001-5137-1504
Schweikardt, Nicole
1
https://orcid.org/0000-0001-5705-1675
Humboldt-Universität zu Berlin, Germany
The regular spanners (characterised by vset-automata) are closed under the algebraic operations of union, join, and projection, and have desirable algorithmic properties. The core spanners, introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (PODS 2013, JACM 2015) as a formalisation of the core functionality of the query language AQL used in IBM’s SystemT, additionally need string-equality selections, and it has been shown by Freydenberger and Holldack (ICDT 2016, Theory of Computing Systems 2018) that this leads to high complexity and even undecidability of the typical problems in static analysis and query evaluation. We propose an alternative approach to core spanners: by incorporating the string-equality selections directly into the regular language that represents the underlying regular spanner (instead of treating them as algebraic operations on the table extracted by the regular spanner), we obtain a fragment of core spanners that, while having slightly weaker expressive power than the full class of core spanners, arguably still covers the intuitive applications of string-equality selections for information extraction and has much better upper complexity bounds for the typical problems in static analysis and query evaluation.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.4/LIPIcs.ICDT.2021.4.pdf
Document spanners
regular expressions with backreferences
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
5:1
5:19
10.4230/LIPIcs.ICDT.2021.5
article
Ranked Enumeration of Conjunctive Query Results
Deep, Shaleen
1
Koutris, Paraschos
1
Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA
We study the problem of enumerating answers of Conjunctive Queries ranked according to a given ranking function. Our main contribution is a novel algorithm with small preprocessing time, logarithmic delay, and non-trivial space usage during execution. To allow for efficient enumeration, we exploit certain properties of ranking functions that frequently occur in practice. To this end, we introduce the notions of decomposable and compatible (w.r.t. a query decomposition) ranking functions, which allow for partial aggregation of tuple scores in order to efficiently enumerate the output. We complement the algorithmic results with lower bounds that justify why restrictions on the structure of ranking functions are necessary. Our results extend and improve upon a long line of work that has studied ranked enumeration from both a theoretical and practical perspective.
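The notion of ranked enumeration with small delay can be illustrated on the simplest case: enumerating the join of two sorted lists in increasing order of a decomposable (here, sum) ranking function. This is a standard priority-queue frontier sketch, not the paper's algorithm for general conjunctive queries.

```python
import heapq

def ranked_join(A, B):
    """Enumerate all pairs (a, b) with a in A, b in B in increasing
    order of a + b, with O(log k) delay per answer after sorting.
    The frontier of candidate index pairs is kept in a min-heap."""
    A, B = sorted(A), sorted(B)
    heap = [(A[0] + B[0], 0, 0)]
    seen = {(0, 0)}
    while heap:
        s, i, j = heapq.heappop(heap)
        yield (A[i], B[j], s)
        # push the two successors of (i, j) in the index grid
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(A) and nj < len(B) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (A[ni] + B[nj], ni, nj))

sums = [s for _, _, s in ranked_join([1, 3], [2, 4])]
```

The sum function is "decomposable" in the sense used above: scores of partial tuples can be aggregated independently, which is what makes the frontier-based strategy work.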
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.5/LIPIcs.ICDT.2021.5.pdf
Query result enumeration
joins
ranking
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
6:1
6:23
10.4230/LIPIcs.ICDT.2021.6
article
Towards Optimal Dynamic Indexes for Approximate (and Exact) Triangle Counting
Lu, Shangqi
1
Tao, Yufei
1
The Chinese University of Hong Kong, China
In ICDT'19, Kara, Ngo, Nikolic, Olteanu, and Zhang gave a structure which maintains the number T of triangles in an undirected graph G = (V, E) under edge insertions/deletions in G. Using O(m) space (m = |E|), their structure supports an update in O(√m log m) amortized time, which is optimal (up to polylog factors) subject to the OMv-conjecture (Henzinger, Krinninger, Nanongkai, and Saranurak, STOC'15). Aiming to improve the update efficiency, we study:
- the optimal tradeoff between update time and approximation quality. We require a structure to provide the (ε, Γ)-guarantee: when queried, it should return an estimate t of T that has relative error at most ε if T ≥ Γ, or an absolute error at most ε ⋅ Γ, otherwise. We prove that, under any ε ≤ 0.49 and subject to the OMv-conjecture, no structure can guarantee O(m^{0.5-δ}/Γ) expected amortized update time and O(m^{2/3-δ}) query time simultaneously for any constant δ > 0; this is true for Γ = m^c of any constant c in [0, 1/2). We match the lower bound with a structure that ensures Õ((1/ε)³ ⋅ √m/Γ) amortized update time with high probability, and O(1) query time.
- (for exact counting) how to achieve arboricity-sensitive update time. For any 1 ≤ Γ ≤ √m, we describe a structure of O(min{α m + m log m, (m/Γ)²}) space that maintains T precisely, and supports an update in Õ(min{α + Γ, √m}) amortized time, where α is the largest arboricity of G in history (and does not need to be known). Our structure reconstructs the aforementioned ICDT'19 result up to polylog factors by setting Γ = √m, but achieves Õ(m^{0.5-δ}) update time as long as α = O(m^{0.5-δ}).
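The maintenance task described above can be illustrated with a naive exact counter: on inserting edge (u, v), the triangle count grows by the number of common neighbors of u and v. This simple baseline has O(min(deg u, deg v)) update cost, not the amortized guarantees of the structures discussed in the abstract.

```python
from collections import defaultdict

class TriangleCounter:
    """Maintain the exact triangle count T of an undirected graph
    under edge insertions and deletions (naive baseline)."""
    def __init__(self):
        self.adj = defaultdict(set)
        self.T = 0

    def insert(self, u, v):
        # every common neighbor of u and v closes a new triangle
        self.T += len(self.adj[u] & self.adj[v])
        self.adj[u].add(v)
        self.adj[v].add(u)

    def delete(self, u, v):
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        # every remaining common neighbor had formed a triangle with (u, v)
        self.T -= len(self.adj[u] & self.adj[v])

tc = TriangleCounter()
for e in [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]:
    tc.insert(*e)
# the graph now contains the triangles {1,2,3} and {2,3,4}
```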
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.6/LIPIcs.ICDT.2021.6.pdf
Triangle Counting
Data Structures
Lower Bounds
Graph Algorithms
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
7:1
7:18
10.4230/LIPIcs.ICDT.2021.7
article
Grammars for Document Spanners
Peterfreund, Liat
1
2
DI ENS, ENS, CNRS, PSL University, Paris, France
Inria, Paris, France
We propose a new grammar-based language for defining information extractors from documents (text) that is built upon the well-studied framework of document spanners for extracting structured data from text. While previously studied formalisms for document spanners are mainly based on regular expressions, we use an extension of context-free grammars, called extraction grammars, to define the new class of context-free spanners. Extraction grammars are simply context-free grammars extended with variables that capture interval positions of the document, namely spans. While regular expressions are efficient for tokenizing and tagging, context-free grammars are also efficient for capturing structural properties. Indeed, we show that context-free spanners are strictly more expressive than their regular counterparts. We reason about the expressive power of our new class and present a pushdown-automata model that captures it. We show that extraction grammars can be evaluated with polynomial data complexity. Nevertheless, as the degree of the polynomial depends on the query, we present an enumeration algorithm for unambiguous extraction grammars that, after quintic preprocessing, outputs the results sequentially, without repetitions, and with constant delay between every two consecutive ones.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.7/LIPIcs.ICDT.2021.7.pdf
Information Extraction
Document Spanners
Context-Free Grammars
Constant-Delay Enumeration
Regular Expressions
Pushdown Automata
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
8:1
8:18
10.4230/LIPIcs.ICDT.2021.8
article
Input-Output Disjointness for Forward Expressions in the Logic of Information Flows
Aamer, Heba
1
https://orcid.org/0000-0003-0460-8534
Van den Bussche, Jan
1
https://orcid.org/0000-0003-0072-3252
Hasselt University, Belgium
Last year we introduced the logic FLIF (forward logic of information flows) as a declarative language for specifying complex compositions of information sources with limited access patterns. The key insight of this approach is to view a system of information sources as a graph, where the nodes are valuations of variables, so that accesses to information sources can be modeled as edges in the graph. This allows the use of XPath-like navigational graph query languages. Indeed, a well-behaved fragment of FLIF, called io-disjoint FLIF, was shown to be equivalent to the executable fragment of first-order logic. It remained open, however, how io-disjoint FLIF compares to general FLIF. In this paper we close this gap by showing that general FLIF expressions can always be put into io-disjoint form.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.8/LIPIcs.ICDT.2021.8.pdf
Composition
expressive power
variable substitution
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
9:1
9:24
10.4230/LIPIcs.ICDT.2021.9
article
Conjunctive Queries: Unique Characterizations and Exact Learnability
ten Cate, Balder
1
https://orcid.org/0000-0002-2538-5846
Dalmau, Victor
2
https://orcid.org/0000-0002-9365-7372
Google, Mountain View, CA, USA
Universitat Pompeu Fabra, Barcelona, Spain
We answer the question of which conjunctive queries are uniquely characterized by polynomially many positive and negative examples, and how to construct such examples efficiently. As a consequence, we obtain a new efficient exact learning algorithm for a class of conjunctive queries. At the core of our contributions lie two new polynomial-time algorithms for constructing frontiers in the homomorphism lattice of finite structures. We also discuss implications for the unique characterizability and learnability of schema mappings and of description logic concepts.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.9/LIPIcs.ICDT.2021.9.pdf
Conjunctive Queries
Homomorphisms
Frontiers
Unique Characterizations
Exact Learnability
Schema Mappings
Description Logic
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
10:1
10:20
10.4230/LIPIcs.ICDT.2021.10
article
The Complexity of Aggregates over Extractions by Regular Expressions
Doleschal, Johannes
1
2
https://orcid.org/0000-0002-7045-7298
Bratman, Noa
3
Kimelfeld, Benny
3
Martens, Wim
1
Universität Bayreuth, Germany
Hasselt University, Belgium
Technion - Israel Institute of Technology, Haifa, Israel
Regular expressions with capture variables, also known as "regex-formulas", extract relations of spans (intervals identified by their start and end indices) from text. In turn, the class of regular document spanners is the closure of the regex formulas under the Relational Algebra. We investigate the computational complexity of querying text by aggregate functions, such as sum, average, and quantile, on top of regular document spanners. To this end, we formally define aggregate functions over regular document spanners and analyze the computational complexity of exact and approximate computation. More precisely, we show that in a restricted case, all studied aggregate functions can be computed in polynomial time. In general, however, even though exact computation is intractable, some aggregates can still be approximated with fully polynomial-time randomized approximation schemes (FPRAS).
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.10/LIPIcs.ICDT.2021.10.pdf
Information extraction
document spanners
regular expressions
aggregation functions
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
11:1
11:22
10.4230/LIPIcs.ICDT.2021.11
article
Answer Counting Under Guarded TGDs
Feier, Cristina
1
Lutz, Carsten
1
Przybyłko, Marcin
1
Department of Computer Science, Universität Bremen, Germany
We study the complexity of answer counting for ontology-mediated queries and for querying under constraints, considering conjunctive queries and unions thereof (UCQs) as the query language and guarded TGDs as the ontology and constraint language, respectively. Our main result is a classification according to whether answer counting is fixed-parameter tractable (FPT), W[1]-equivalent, #W[1]-equivalent, #W[2]-hard, or #A[2]-equivalent, lifting a recent classification for UCQs without ontologies and constraints due to Dell et al. [Holger Dell et al., 2019]. The classification pertains to various structural measures, namely treewidth, contract treewidth, starsize, and linked matching number. Our results rest on the assumption that the arity of relation symbols is bounded by a constant and, in the case of ontology-mediated querying, that all symbols from the ontology and query can occur in the data (so-called full data schema). We also study the meta-problems for the mentioned structural measures, that is, to decide whether a given ontology-mediated query or constraint-query specification is equivalent to one for which the structural measure is bounded.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.11/LIPIcs.ICDT.2021.11.pdf
Ontology-Mediated Querying
Querying under Constraints
Answer Counting
Parameterized Complexity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
12:1
12:20
10.4230/LIPIcs.ICDT.2021.12
article
Maximum Coverage in the Data Stream Model: Parameterized and Generalized
McGregor, Andrew
1
Tench, David
2
Vu, Hoa T.
3
University of Massachusetts Amherst, MA, USA
Stony Brook University, NY, USA
San Diego State University, CA, USA
We present algorithms for the Max Coverage and Max Unique Coverage problems in the data stream model. The input to both problems is m subsets of a universe of size n and a value k ∈ [m]. In Max Coverage, the problem is to find a collection of at most k sets such that the number of elements covered by at least one set is maximized. In Max Unique Coverage, the problem is to find a collection of at most k sets such that the number of elements covered by exactly one set is maximized. These problems are closely related to a range of graph problems including matching, partial vertex cover, and capacitated maximum cut. In the data stream model, we assume k is given and the sets are revealed online. Our goal is to design single-pass algorithms that use space that is sublinear in the input size. Our main algorithmic results are:
- If the sets have size at most d, there exist single-pass algorithms using O(d^{d+1} k^d) space that solve both problems exactly. This is optimal up to polylogarithmic factors for constant d.
- If each element appears in at most r sets, we present single-pass algorithms using Õ(k² r/ε³) space that return a (1+ε)-approximation in the case of Max Coverage. We also present a single-pass algorithm using slightly more memory, i.e., Õ(k³ r/ε⁴) space, that computes a (1+ε)-approximation of Max Unique Coverage. In contrast to the above results, when d and r are arbitrary, any constant-pass (1+ε)-approximation algorithm for either problem requires Ω(ε^{-2}m) space, but a single-pass O(ε^{-2}mk)-space algorithm exists. In fact, any constant-pass algorithm with an approximation better than e/(e-1) and e^{1-1/k} for Max Coverage and Max Unique Coverage, respectively, requires Ω(m/k²) space when d and r are unrestricted. En route, we also obtain an algorithm for a parameterized version of the streaming Set Cover problem.
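For reference, the offline baseline against which these streaming results are measured is the classic greedy (1 - 1/e)-approximation for Max Coverage. The sketch below is that standard offline greedy, not one of the single-pass algorithms of the abstract.

```python
def greedy_max_coverage(sets, k):
    """Classic offline greedy for Max Coverage: repeatedly pick the set
    with the largest marginal coverage. Achieves a (1 - 1/e) factor."""
    covered, chosen = set(), []
    for _ in range(k):
        # index of the set adding the most uncovered elements
        best = max(range(len(sets)), key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:
            break  # no set adds anything new
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

chosen, covered = greedy_max_coverage([{1, 2, 3}, {3, 4}, {5}], 2)
```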
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.12/LIPIcs.ICDT.2021.12.pdf
Data streams
maximum coverage
maximum unique coverage
set cover
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
13:1
13:25
10.4230/LIPIcs.ICDT.2021.13
article
Diverse Data Selection under Fairness Constraints
Moumoulidou, Zafeiria
1
McGregor, Andrew
1
https://orcid.org/0000-0002-2124-160X
Meliou, Alexandra
1
College of Information and Computer Sciences, University of Massachusetts Amherst, MA, USA
Diversity is an important principle in data selection and summarization, facility location, and recommendation systems. Our work focuses on maximizing diversity in data selection, while offering fairness guarantees. In particular, we offer the first study that augments the Max-Min diversification objective with fairness constraints. More specifically, given a universe 𝒰 of n elements that can be partitioned into m disjoint groups, we aim to retrieve a k-sized subset that maximizes the pairwise minimum distance within the set (diversity) and contains a pre-specified number k_i of elements from each group i (fairness). We show that this problem is NP-complete even in metric spaces, and we propose three novel algorithms, linear in n, that provide strong theoretical approximation guarantees for different values of m and k. Finally, we extend our algorithms and analysis to the case where groups can be overlapping.
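The unconstrained Max-Min objective above is commonly approximated by the greedy "farthest point" heuristic (GMM), a 2-approximation; the sketch below shows that baseline without the fairness quotas k_i, which are the subject of the paper.

```python
def greedy_max_min(points, k, dist):
    """GMM greedy for unconstrained Max-Min diversification: start from
    an arbitrary point and repeatedly add the point farthest from the
    current selection. 2-approximation in metric spaces."""
    S = [points[0]]
    while len(S) < k:
        farthest = max((q for q in points if q not in S),
                       key=lambda q: min(dist(q, s) for s in S))
        S.append(farthest)
    return S

# 1-D example with absolute-value distance
sel = greedy_max_min([0, 1, 5, 9], 3, lambda a, b: abs(a - b))
```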
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.13/LIPIcs.ICDT.2021.13.pdf
data selection
diversity maximization
fairness constraints
approximation algorithms
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
14:1
14:17
10.4230/LIPIcs.ICDT.2021.14
article
Enumeration Algorithms for Conjunctive Queries with Projection
Deep, Shaleen
1
Hu, Xiao
2
Koutris, Paraschos
1
Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA
Department of Computer Sciences, Duke University, Durham, NC, USA
We investigate the enumeration of query results for an important subset of CQs with projections, namely star and path queries. The task is to design data structures and algorithms that allow for efficient enumeration with delay guarantees after a preprocessing phase. Our main contribution is a series of results based on the idea of interleaving precomputed output with further join processing to maintain delay guarantees, which may be of independent interest. In particular, we design combinatorial algorithms that provide instance-specific delay guarantees in linear preprocessing time. These algorithms improve upon the currently best known results. Further, we show how existing results can be improved upon by using fast matrix multiplication. We also present new results involving a tradeoff between preprocessing time and delay guarantees for the enumeration of path queries that contain projections. Evaluating a CQ with projection in which the join attribute is projected away is equivalent to Boolean matrix multiplication. Our results can therefore also be interpreted as sparse, output-sensitive matrix multiplication with delay guarantees.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.14/LIPIcs.ICDT.2021.14.pdf
Query result enumeration
joins
ranking
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
15:1
15:19
10.4230/LIPIcs.ICDT.2021.15
article
The Shapley Value of Inconsistency Measures for Functional Dependencies
Livshits, Ester
1
Kimelfeld, Benny
1
Technion - Israel Institute of Technology, Haifa, Israel
Quantifying the inconsistency of a database is motivated by various goals including reliability estimation for new datasets and progress indication in data cleaning. Another goal is to attribute to individual tuples a level of responsibility to the overall inconsistency, and thereby prioritize tuples in the explanation or inspection of dirt. Therefore, inconsistency quantification and attribution have been a subject of much research in Knowledge Representation and, more recently, in Databases. As in many other fields, a conventional responsibility sharing mechanism is the Shapley value from cooperative game theory. In this paper, we carry out a systematic investigation of the complexity of the Shapley value in common inconsistency measures for functional-dependency (FD) violations. For several measures we establish a full classification of the FD sets into tractable and intractable classes with respect to Shapley-value computation. We also study the complexity of approximation in intractable cases.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.15/LIPIcs.ICDT.2021.15.pdf
Shapley value
inconsistent databases
functional dependencies
database repairs
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
16:1
16:17
10.4230/LIPIcs.ICDT.2021.16
article
Database Repairing with Soft Functional Dependencies
Carmeli, Nofar
1
Grohe, Martin
2
Kimelfeld, Benny
1
Livshits, Ester
1
Tibi, Muhammad
1
Technion - Israel Institute of Technology, Haifa, Israel
RWTH Aachen University, Germany
A common interpretation of soft constraints penalizes the database for every violation of every constraint, where the penalty is the cost (weight) of the constraint. A computational challenge is that of finding an optimal subset: a collection of database tuples that minimizes the total penalty when each tuple has a cost of being excluded. When the constraints are strict (i.e., have an infinite cost), this subset is a "cardinality repair" of an inconsistent database; in soft interpretations, this subset corresponds to a "most probable world" of a probabilistic database, a "most likely intention" of a probabilistic unclean database, and so on. Within the class of functional dependencies, the complexity of finding a cardinality repair is thoroughly understood. Yet, very little is known about the complexity of finding an optimal subset for the more general soft semantics. This paper makes significant progress in this direction. In addition to general insights about the hardness and approximability of the problem, we present algorithms for two special cases: a single functional dependency, and a bipartite matching. The latter is the problem of finding an optimal "almost matching" of a bipartite graph where a penalty is paid for every lost edge and every violation of monogamy.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.16/LIPIcs.ICDT.2021.16.pdf
Database inconsistency
database repairs
integrity constraints
soft constraints
functional dependencies
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
17:1
17:17
10.4230/LIPIcs.ICDT.2021.17
article
Uniform Reliability of Self-Join-Free Conjunctive Queries
Amarilli, Antoine
1
https://orcid.org/0000-0002-7977-4441
Kimelfeld, Benny
2
LTCI, Télécom Paris, Institut Polytechnique de Paris, France
Technion - Israel Institute of Technology, Haifa, Israel
The reliability of a Boolean Conjunctive Query (CQ) over a tuple-independent probabilistic database is the probability that the CQ is satisfied when the tuples of the database are sampled one by one, independently, with their associated probability. For queries without self-joins (repeated relation symbols), the data complexity of this problem is fully characterized in a known dichotomy: reliability can be computed in polynomial time for hierarchical queries, and is #P-hard for non-hierarchical queries. Hierarchical queries also characterize the tractability of queries for other tasks: having read-once lineage formulas, supporting insertion/deletion updates to the database in constant time, and having a tractable computation of tuples' Shapley and Banzhaf values.
In this work, we investigate a fundamental counting problem for CQs without self-joins: how many sets of facts from the input database satisfy the query? This is equivalent to the uniform case of the query reliability problem, where the probability of every tuple is required to be 1/2. Of course, for hierarchical queries, uniform reliability is in polynomial time, like the reliability problem. However, it is an open question whether being hierarchical is necessary for the uniform reliability problem to be in polynomial time. In fact, the complexity of the problem has been unknown even for the simplest non-hierarchical CQs without self-joins.
We solve this open question by showing that uniform reliability is #P-complete for every non-hierarchical CQ without self-joins. Hence, we establish that being hierarchical also characterizes the tractability of unweighted counting of the satisfying tuple subsets. We also consider the generalization to query reliability where all tuples of the same relation have the same probability, and give preliminary results on the complexity of this problem.
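For intuition, the counting problem above can be checked by brute force on a tiny instance: enumerate every subset of facts and test whether the Boolean CQ holds. The relations R and S and their facts below are hypothetical toy data, and the exhaustive enumeration is exponential in the number of facts, so this is an illustration only, not the polynomial-time algorithm for hierarchical queries.

```python
from itertools import chain, combinations

# Toy database with relations R(x) and S(x, y) (hypothetical data).
R = {("a",), ("b",)}
S = {("a", "c"), ("b", "c")}
facts = [("R", t) for t in R] + [("S", t) for t in S]

def satisfies(subset):
    """Does the Boolean CQ  q() :- R(x), S(x, y)  hold on this fact set?"""
    rs = {t for (rel, t) in subset if rel == "R"}
    ss = {t for (rel, t) in subset if rel == "S"}
    return any((x,) in rs for (x, y) in ss)

# Count the subsets of facts that satisfy q; uniform reliability is this
# count divided by 2^(number of facts), i.e. every tuple has probability 1/2.
count = sum(
    satisfies(sub)
    for sub in chain.from_iterable(
        combinations(facts, k) for k in range(len(facts) + 1)
    )
)
reliability = count / 2 ** len(facts)
```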
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.17/LIPIcs.ICDT.2021.17.pdf
Hierarchical conjunctive queries
query reliability
tuple-independent database
counting problems
#P-hardness
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
18:1
18:19
10.4230/LIPIcs.ICDT.2021.18
article
Efficient Differentially Private F₀ Linear Sketching
Pagh, Rasmus
1
2
https://orcid.org/0000-0002-1516-9306
Stausholm, Nina Mesing
3
2
https://orcid.org/0000-0002-4322-7163
University of Copenhagen, Denmark
BARC, Copenhagen, Denmark
IT University of Copenhagen, Denmark
A powerful feature of linear sketches is that from sketches of two data vectors, one can compute the sketch of the difference between the vectors. This allows us to answer fine-grained questions about the difference between two data sets. In this work we consider how to construct sketches for weighted F₀, i.e., the summed weights of the elements in the data set, that are small, differentially private, and computationally efficient. Let a weight vector w ∈ (0,1]^u be given. For x ∈ {0,1}^u we are interested in estimating ||x∘w||₁ where ∘ is the Hadamard product (entrywise product).
Building on a technique of Kushilevitz et al. (STOC 1998), we introduce a sketch (depending on w) that is linear over GF(2), mapping a vector x ∈ {0,1}^u to Hx ∈ {0,1}^τ for a matrix H sampled from a suitable distribution ℋ. Differential privacy is achieved by using randomized response, flipping each bit of Hx with probability p < 1/2. That is, for a vector φ ∈ {0,1}^τ where Pr[(φ)_j = 1] = p independently for each entry j, we consider the noisy sketch Hx + φ, where the addition of noise happens over GF(2). We show that for every choice of 0 < β < 1 and ε = O(1) there exists p < 1/2 and a distribution ℋ of linear sketches of size τ = O(log²(u)ε^{-2}β^{-2}) such that:
1) For random H∼ℋ and noise vector φ, given Hx + φ we can compute an estimate of ||x∘w||₁ that is accurate within a factor 1±β, plus additive error O(log(u)ε^{-2}β^{-2}), with probability 1-u^{-1}, and
2) For every H∼ℋ, Hx + φ is ε-differentially private over the randomness in φ. The special case w = (1,…,1) is unweighted F₀. Previously, Mir et al. (PODS 2011) and Kenthapadi et al. (J. Priv. Confidentiality 2013) had described a differentially private way of sketching unweighted F₀, but the algorithms for calibrating noise to their sketches are not computationally efficient, either using quasipolynomial time in the sketch size or superlinear time in the universe size u.
For fixed ε the size of our sketch is polynomially related to the lower bound of Ω(log(u)β^{-2}) bits by Jayram & Woodruff (Trans. Algorithms 2013). The additive error is comparable to the bound of Ω(1/ε) of Hardt & Talwar (STOC 2010). An application of our sketch is that two sketches can be added to form a noisy sketch of the form H(x₁+x₂) + (φ₁+φ₂), which allows us to estimate ||(x₁+x₂)∘w||₁. Since addition is over GF(2), this is the weight of the symmetric difference of the vectors x₁ and x₂. Recent work has shown how to privately and efficiently compute an estimate for the symmetric difference size of two sets using (non-linear) sketches such as FM-sketches and Bloom filters, but these methods have an error bound no better than O(√m̄), where m̄ is an upper bound on ||x₁||₀ and ||x₂||₀. This improves previous work when β = o(1/√m̄) and log(u)/ε = m̄^{o(1)}.
In conclusion our results both improve the efficiency of existing methods for unweighted F₀ estimation and extend to a weighted generalization. We also give a distributed streaming implementation for estimating the size of the union between two input streams.
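The GF(2)-linearity that the abstract exploits can be demonstrated in a few lines. The sketch below is a simplification: a uniformly random binary matrix stands in for the paper's distribution ℋ (which depends on the weight vector w), and u, τ, and p are arbitrary toy values. It only checks the structural identity that adding two noisy sketches gives a noisy sketch of the symmetric difference.

```python
import random

random.seed(0)
u, tau, p = 16, 8, 0.25  # universe size, sketch size, flip probability (toy values)

# A uniformly random binary matrix stands in for a draw from the paper's
# distribution H (the actual construction depends on the weight vector w).
H = [[random.randint(0, 1) for _ in range(u)] for _ in range(tau)]

def sketch(x):
    """Linear sketch over GF(2): (Hx) mod 2."""
    return [sum(h * xi for h, xi in zip(row, x)) % 2 for row in H]

def noise():
    """Randomized-response noise: each bit is 1 independently with probability p."""
    return [1 if random.random() < p else 0 for _ in range(tau)]

def xor(a, b):
    return [ai ^ bi for ai, bi in zip(a, b)]

x1 = [random.randint(0, 1) for _ in range(u)]
x2 = [random.randint(0, 1) for _ in range(u)]
phi1, phi2 = noise(), noise()

# GF(2)-linearity: the sum of two noisy sketches is a noisy sketch of the
# symmetric difference x1 xor x2, with combined noise phi1 xor phi2.
lhs = xor(xor(sketch(x1), phi1), xor(sketch(x2), phi2))
rhs = xor(sketch(xor(x1, x2)), xor(phi1, phi2))
```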
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.18/LIPIcs.ICDT.2021.18.pdf
Differential Privacy
Linear Sketches
Weighted F0 Estimation
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
19:1
19:20
10.4230/LIPIcs.ICDT.2021.19
article
Fine-Grained Complexity of Regular Path Queries
Casel, Katrin
1
https://orcid.org/0000-0001-6146-8684
Schmid, Markus L.
2
https://orcid.org/0000-0001-5137-1504
Hasso Plattner Institute, Universität Potsdam, Germany
Humboldt-Universität zu Berlin, Germany
A regular path query (RPQ) is a regular expression q that returns all node pairs (u, v) from a graph database that are connected by an arbitrary path labelled with a word from L(q). The obvious algorithmic approach to RPQ evaluation (called the PG-approach), i.e., constructing the product graph between an NFA for q and the graph database, is appealing due to its simplicity and also leads to efficient algorithms. However, it is unclear whether the PG-approach is optimal. We address this question by thoroughly investigating which upper complexity bounds can be achieved by the PG-approach, and we complement these with conditional lower bounds (in the sense of the fine-grained complexity framework). A special focus is put on enumeration and delay bounds, as well as the data complexity perspective. A main insight is that we can achieve optimal (or near-optimal) algorithms with the PG-approach, but the delay for enumeration is rather high (linear in the database). We explore three successful approaches towards enumeration with sub-linear delay: super-linear preprocessing, approximations of the solution sets, and restricted classes of RPQs.
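The PG-approach itself is easy to sketch: run a BFS over the product of an NFA for q and the graph database. The edge list and the NFA for the RPQ a b* below are hypothetical toy inputs; the sketch returns the full answer set rather than enumerating it with delay guarantees.

```python
from collections import deque

# Toy graph database: labelled edges (source, label, target) — hypothetical data.
edges = [(0, "a", 1), (1, "b", 2), (0, "a", 2), (2, "b", 3)]

# NFA for the RPQ  a b* : transition map, initial state, final states.
nfa_delta = {("s0", "a"): {"s1"}, ("s1", "b"): {"s1"}}
initial, final = "s0", {"s1"}

def evaluate_rpq():
    """PG-approach: BFS over the product graph of the NFA and the database."""
    adj = {}
    for (v, lbl, w) in edges:
        adj.setdefault(v, []).append((lbl, w))
    nodes = {v for e in edges for v in (e[0], e[2])}
    answers = set()
    for u in nodes:                      # one BFS per candidate source node
        seen = {(initial, u)}
        queue = deque(seen)
        while queue:
            q, v = queue.popleft()
            if q in final:               # a word of L(q) labels a path u -> v
                answers.add((u, v))
            for (lbl, w) in adj.get(v, []):
                for q2 in nfa_delta.get((q, lbl), ()):
                    if (q2, w) not in seen:
                        seen.add((q2, w))
                        queue.append((q2, w))
    return answers
```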
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.19/LIPIcs.ICDT.2021.19.pdf
Graph Databases
Regular Path Queries
Enumeration
Fine-Grained Complexity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
20:1
20:19
10.4230/LIPIcs.ICDT.2021.20
article
Ranked Enumeration of MSO Logic on Words
Bourhis, Pierre
1
Grez, Alejandro
2
3
Jachiet, Louis
4
Riveros, Cristian
2
3
CNRS Lille, CRIStAL UMR 9189, University of Lille, INRIA Lille, France
Pontificia Universidad Católica de Chile, Santiago, Chile
Millennium Institute for Foundational Research on Data, Santiago, Chile
LTCI, IP Paris, France
In recent years, enumeration algorithms with bounded delay have attracted a lot of attention for several data management tasks. Given a query and the data, the task is to preprocess the data and then enumerate all the answers to the query one by one and without repetitions. This enumeration scheme is typically useful when the solutions are processed on the fly or when we want to stop the enumeration once the pertinent solutions have been found. However, with the current schemes, there is no restriction on the order in which the solutions are given, and this order usually depends on the techniques used rather than on their relevance to the user.
In this paper we study the enumeration of monadic second-order logic (MSO) over words when the solutions are ranked. We present a framework based on MSO cost functions that allows us to express MSO formulae on words with a cost associated with each solution. We then demonstrate the generality of our framework, which subsumes, for instance, document spanners and adds ranking to them. The main technical result of the paper is an algorithm that efficiently enumerates all the solutions of a formula in increasing order of cost, namely with a linear preprocessing phase and logarithmic delay between solutions. The novelty of this algorithm lies in its use of functional data structures, in particular extending functional Brodal queues to suit the ranked enumeration of MSO on words.
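The flavour of ranked enumeration with logarithmic delay can be conveyed by a much simpler stand-in: solutions are non-empty sets of word positions ranked by total position cost, and Python's binary-heap `heapq` replaces the paper's functional Brodal queues (forgoing their persistence guarantees). This is the classic lazy subset enumeration, not the paper's MSO construction.

```python
import heapq

def ranked_subsets(costs):
    """Yield (total cost, index set) for every non-empty set of positions,
    in increasing order of total cost; costs must be positive."""
    order = sorted(range(len(costs)), key=lambda i: costs[i])
    w = [costs[i] for i in order]
    # Heap entries: (total cost, rank of the largest chosen position, chosen indices).
    heap = [(w[0], 0, (order[0],))]
    while heap:
        total, i, sel = heapq.heappop(heap)
        yield total, set(sel)
        if i + 1 < len(w):
            # Extend the set with the next-cheapest position ...
            heapq.heappush(heap, (total + w[i + 1], i + 1, sel + (order[i + 1],)))
            # ... or replace its largest position by the next-cheapest one.
            heapq.heappush(heap, (total - w[i] + w[i + 1], i + 1,
                                  sel[:-1] + (order[i + 1],)))
    # Each pop does O(log n) heap work, mirroring logarithmic delay per answer.
```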
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.20/LIPIcs.ICDT.2021.20.pdf
Persistent data structures
Query evaluation
Enumeration algorithms
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
21:1
21:22
10.4230/LIPIcs.ICDT.2021.21
article
Approximate Similarity Search Under Edit Distance Using Locality-Sensitive Hashing
McCauley, Samuel
1
Williams College, Williamstown, MA, USA
Edit distance similarity search, also called approximate pattern matching, is a fundamental problem with widespread database applications. The goal of the problem is to preprocess n strings of length d, to quickly answer queries q of the form: if there is a database string within edit distance r of q, return a database string within edit distance cr of q.
Previous approaches to this problem either rely on very large (superconstant) approximation ratios c, or very small search radii r. Outside of a narrow parameter range, these solutions are not competitive with trivially searching through all n strings.
In this work we give a simple and easy-to-implement hash function that can quickly answer queries for a wide range of parameters. Specifically, our strategy can answer queries in time Õ(d·3^r·n^{1/c}). The best known practical results require c ≫ r to achieve any correctness guarantee; meanwhile, the best known theoretical results are very involved and difficult to implement, and require query time that can be loosely bounded below by 24^r. Our results significantly broaden the range of parameters for which there exist nontrivial theoretical bounds, while retaining the practicality of a locality-sensitive hash function.
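The trivial baseline the abstract compares against — scanning all n strings with a dynamic-programming edit-distance check — is easy to make concrete. This sketch is that baseline only, not the paper's hash function; the example strings and parameters are hypothetical.

```python
def edit_distance(s, t):
    """Classic O(|s|*|t|) dynamic program for Levenshtein distance."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + (s[i - 1] != t[j - 1]))  # substitution
        prev = cur
    return prev[n]

def trivial_query(database, q, r, c):
    """Linear-scan baseline for the (r, cr)-similarity search problem:
    returns a string within distance c*r of q, and is guaranteed to
    succeed whenever some database string is within distance r."""
    for s in database:
        if edit_distance(s, q) <= c * r:
            return s
    return None
```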
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.21/LIPIcs.ICDT.2021.21.pdf
edit distance
approximate pattern matching
approximate nearest neighbor
similarity search
locality-sensitive hashing
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-03-11
186
22:1
22:25
10.4230/LIPIcs.ICDT.2021.22
article
Locality-Aware Distribution Schemes
Sundarmurthy, Bruhathi
1
Koutris, Paraschos
1
Naughton, Jeffrey
1
University of Wisconsin-Madison, Madison, WI, USA
One of the bottlenecks in parallel query processing is the cost of shuffling data across nodes in a cluster. Ideally, given a distribution of the data across the nodes and a query, we want to execute the query by performing only local computation and no communication: in this case, the query is called parallel-correct with respect to the data distribution. Previous work studied this problem for Conjunctive Queries in the case where the distribution scheme is oblivious, i.e., the location of each tuple depends only on the tuple and is independent of the instance. In this work, we show that oblivious schemes have a fundamental theoretical limitation, and initiate the formal study of distribution schemes that are locality-aware. In particular, we focus on a class of distribution schemes called co-hash distribution schemes, which are widely used in parallel systems. In co-hash partitioning, some tables are initially hashed, and the remaining tables are co-located so that a join condition is always satisfied. Given a co-hash distribution scheme, we formally study the complexity of deciding various desirable properties, including obliviousness and redundancy. Then, for a given Conjunctive Query and co-hash scheme, we determine the computational complexity of deciding whether the query is parallel-correct. We also explore a stronger notion of correctness, called parallel disjoint correctness, which guarantees that the query result will be disjointly partitioned across nodes, i.e., there is no duplication of results.
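The idea behind co-hash partitioning can be shown in miniature: hash one table on its join attribute and place each tuple of the other table on the node its join partner hashes to, so the join condition is satisfied locally. The relations R(x, y) and S(y, z), their tuples, and the node count below are hypothetical toy choices.

```python
# Toy co-hash scheme: R is hashed on its join attribute y, and each S tuple
# is co-located via y, so R(x, y) ⋈ S(y, z) needs no data shuffling.
NODES = 4

def node_of_r(t):          # R is the hashed table
    x, y = t
    return hash(y) % NODES

def node_of_s(t):          # S is co-located through the join attribute y
    y, z = t
    return hash(y) % NODES

R = [(1, "a"), (2, "b"), (3, "a")]
S = [("a", 10), ("b", 20)]

# Every joining pair lands on the same node: for this distribution the join
# is parallel-correct using only local computation.
local_join_ok = all(
    node_of_r(r) == node_of_s(s)
    for r in R for s in S
    if r[1] == s[0]
)
```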
https://drops.dagstuhl.de/storage/00lipics/lipics-vol186-icdt2021/LIPIcs.ICDT.2021.22/LIPIcs.ICDT.2021.22.pdf
partitioning
parallel correctness
join queries