Search Results

Documents authored by Ré, Christopher M.



Ré, Christopher M.

Document
GYM: A Multiround Distributed Join Algorithm

Authors: Foto N. Afrati, Manas R. Joglekar, Christopher M. Re, Semih Salihoglu, and Jeffrey D. Ullman

Published in: LIPIcs, Volume 68, 20th International Conference on Database Theory (ICDT 2017)


Abstract
Multiround algorithms are now commonly used in distributed data processing systems, yet the extent to which algorithms can benefit from running more rounds is not well understood. This paper answers this question for several rounds for the problem of computing the equijoin of n relations. Given any query Q with width w, intersection width iw, input size IN, output size OUT, and a cluster of machines with M = Ω(IN^{1/ε}) memory available per machine, where ε > 1 and w ≥ 1 are constants, we show that: 1. Q can be computed in O(n) rounds with O(n(IN^w + OUT)^2/M) communication cost with high probability. 2. Q can be computed in O(log(n)) rounds with O(n(IN^{max(w, 3iw)} + OUT)^2/M) communication cost with high probability. Intersection width is a new notion we introduce for queries and generalized hypertree decompositions (GHDs) of queries that captures how connected the adjacent components of the GHDs are. We achieve our first result by introducing a distributed and generalized version of Yannakakis's algorithm, called GYM. GYM takes as input any GHD of Q with width w and depth d, and computes Q in O(d + log(n)) rounds and O(n(IN^w + OUT)^2/M) communication cost. We achieve our second result by showing how to construct GHDs of Q with width max(w, 3iw) and depth O(log(n)). We describe another technique to construct GHDs with larger widths and lower depths, demonstrating other tradeoffs one can make between communication and the number of rounds.
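For orientation, below is a minimal serial sketch of Yannakakis's classical semijoin algorithm, which GYM distributes and generalizes from the join trees of acyclic queries to arbitrary GHDs. The path query R(a,b) ⋈ S(b,c) ⋈ T(c,d), its join tree, and the toy data are illustrative assumptions, not taken from the paper.

# Serial sketch of Yannakakis's algorithm on an acyclic join; GYM runs the
# analogous semijoin passes over a GHD in O(d + log(n)) distributed rounds.
# Relations are sets of tuples paired with their attribute schemas.

def semijoin(r, r_schema, s, s_schema):
    """Keep the tuples of r that agree with some tuple of s on shared attributes."""
    shared = [a for a in r_schema if a in s_schema]
    keys = {tuple(t[s_schema.index(a)] for a in shared) for t in s}
    return {t for t in r if tuple(t[r_schema.index(a)] for a in shared) in keys}

def join(r, r_schema, s, s_schema):
    """Pairwise hash join; the output schema is r_schema plus the new attributes of s."""
    shared = [a for a in r_schema if a in s_schema]
    extra = [a for a in s_schema if a not in r_schema]
    index = {}
    for t in s:
        index.setdefault(tuple(t[s_schema.index(a)] for a in shared), []).append(t)
    out = set()
    for t in r:
        for u in index.get(tuple(t[r_schema.index(a)] for a in shared), []):
            out.add(t + tuple(u[s_schema.index(a)] for a in extra))
    return out, r_schema + extra

# Toy instance of the path query R(a,b) ⋈ S(b,c) ⋈ T(c,d).
R, Rs = {(1, 2), (1, 3), (4, 5)}, ["a", "b"]
S, Ss = {(2, 7), (3, 8)}, ["b", "c"]
T, Ts = {(7, 9), (8, 9)}, ["c", "d"]

# Bottom-up semijoin pass on the join tree rooted at R, then top-down,
# so every surviving tuple contributes to at least one output tuple.
S = semijoin(S, Ss, T, Ts)
R = semijoin(R, Rs, S, Ss)
S = semijoin(S, Ss, R, Rs)
T = semijoin(T, Ts, S, Ss)

# The final join over the reduced relations then costs O(IN + OUT) serially.
RS, RSs = join(R, Rs, S, Ss)
out, out_schema = join(RS, RSs, T, Ts)
print(out_schema, sorted(out))  # ['a', 'b', 'c', 'd'] [(1, 2, 7, 9), (1, 3, 8, 9)]

The two semijoin passes are what keep the final join linear in IN + OUT on acyclic queries; GYM's contribution is executing each such pass in a constant number of bulk-synchronous rounds over the (up to IN^w-sized) bags of a width-w GHD.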

Cite as

Foto N. Afrati, Manas R. Joglekar, Christopher M. Re, Semih Salihoglu, and Jeffrey D. Ullman. GYM: A Multiround Distributed Join Algorithm. In 20th International Conference on Database Theory (ICDT 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 68, pp. 4:1-4:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


BibTeX

@InProceedings{afrati_et_al:LIPIcs.ICDT.2017.4,
  author =	{Afrati, Foto N. and Joglekar, Manas R. and Re, Christopher M. and Salihoglu, Semih and Ullman, Jeffrey D.},
  title =	{{GYM: A Multiround Distributed Join Algorithm}},
  booktitle =	{20th International Conference on Database Theory (ICDT 2017)},
  pages =	{4:1--4:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-024-8},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{68},
  editor =	{Benedikt, Michael and Orsi, Giorgio},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2017.4},
  URN =		{urn:nbn:de:0030-drops-70462},
  doi =		{10.4230/LIPIcs.ICDT.2017.4},
  annote =	{Keywords: Joins, Yannakakis, Bulk Synchronous Processing, GHDs}
}
Document
It's All a Matter of Degree: Using Degree Information to Optimize Multiway Joins

Authors: Manas R. Joglekar and Christopher M. Ré

Published in: LIPIcs, Volume 48, 19th International Conference on Database Theory (ICDT 2016)


Abstract
We optimize multiway equijoins on relational tables using degree information. We give a new bound that uses degree information to more tightly bound the maximum output size of a query. On real data, our bound on the number of triangles in a social network can be up to 95 times tighter than existing worst case bounds. We show that using only a constant amount of degree information, we are able to obtain join algorithms with a running time that has a smaller exponent than existing algorithms - for any database instance. We also show that this degree information can be obtained in nearly linear time, which yields asymptotically faster algorithms in the serial setting and lower communication algorithms in the MapReduce setting. In the serial setting, the data complexity of join processing can be expressed as a function O(IN^x + OUT) in terms of input size IN and output size OUT in which x depends on the query. An upper bound for x is given by fractional hypertreewidth. We are interested in situations in which we can get algorithms for which x is strictly smaller than the fractional hypertreewidth. We say that a join can be processed in subquadratic time if x < 2. Building on the AYZ algorithm for processing cycle joins in quadratic time, for a restricted class of joins which we call 1-series-parallel graphs, we obtain a complete decision procedure for identifying subquadratic solvability (subject to the 3-SUM problem requiring quadratic time). Our 3-SUM based quadratic lower bound is tight, making it the only known tight bound for joins that does not require any assumption about the matrix multiplication exponent omega. We also give a MapReduce algorithm that meets our improved communication bound and handles essentially optimal parallelism.
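As a concrete illustration of why degree information helps, the sketch below compares the classical worst-case output bound for the triangle query, IN^{3/2}, against a simple degree-aware bound, (1/3) · Σ_{(u,v) ∈ E} min(deg(u), deg(v)). The degree-aware bound is an illustrative stand-in rather than the exact bound from the paper, and the star graph is a toy input chosen to exhibit the degree skew that worst-case bounds ignore.

# Worst-case vs. degree-aware output bounds for the triangle query on a
# skewed toy graph. Each triangle through an edge (u,v) needs a common
# neighbor of u and v, of which there are at most min(deg(u), deg(v)),
# and each triangle is counted once per edge, hence the division by 3.

from collections import defaultdict

def triangle_bounds(edges):
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    worst_case = len(edges) ** 1.5                                # IN^(3/2)
    degree_aware = sum(min(deg[u], deg[v]) for u, v in edges) / 3
    return worst_case, degree_aware

# A star: one hub joined to 1000 leaves, so IN = 1000 and no triangle exists.
star = [(0, i) for i in range(1, 1001)]
print(triangle_bounds(star))  # ≈31623 worst case vs. ≈333 degree-aware; truth is 0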

Cite as

Manas R. Joglekar and Christopher M. Ré. It's All a Matter of Degree: Using Degree Information to Optimize Multiway Joins. In 19th International Conference on Database Theory (ICDT 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 48, pp. 11:1-11:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)


BibTeX

@InProceedings{joglekar_et_al:LIPIcs.ICDT.2016.11,
  author =	{Joglekar, Manas R. and R\'{e}, Christopher M.},
  title =	{{It's All a Matter of Degree: Using Degree Information to Optimize Multiway Joins}},
  booktitle =	{19th International Conference on Database Theory (ICDT 2016)},
  pages =	{11:1--11:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-002-6},
  ISSN =	{1868-8969},
  year =	{2016},
  volume =	{48},
  editor =	{Martens, Wim and Zeume, Thomas},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2016.11},
  URN =		{urn:nbn:de:0030-drops-57800},
  doi =		{10.4230/LIPIcs.ICDT.2016.11},
  annote =	{Keywords: Joins, Degree, MapReduce}
}

Ré, Christopher

Document
Track A: Algorithms, Complexity and Games
Sparse Recovery for Orthogonal Polynomial Transforms

Authors: Anna Gilbert, Albert Gu, Christopher Ré, Atri Rudra, and Mary Wootters

Published in: LIPIcs, Volume 168, 47th International Colloquium on Automata, Languages, and Programming (ICALP 2020)


Abstract
In this paper we consider the following sparse recovery problem. We have query access to a vector 𝐱 ∈ ℝ^N such that x̂ = 𝐅 𝐱 is k-sparse (or nearly k-sparse) for some orthogonal transform 𝐅. The goal is to output an approximation (in an 𝓁₂ sense) to x̂ in sublinear time. This problem has been well-studied in the special case that 𝐅 is the Discrete Fourier Transform (DFT), and a long line of work has resulted in sparse Fast Fourier Transforms that run in time O(k ⋅ polylog N). However, for transforms 𝐅 other than the DFT (or closely related transforms like the Discrete Cosine Transform), the question is much less settled. In this paper we give sublinear-time algorithms - running in time poly(k log(N)) - for solving the sparse recovery problem for orthogonal transforms 𝐅 that arise from orthogonal polynomials. More precisely, our algorithm works for any 𝐅 that is an orthogonal polynomial transform derived from Jacobi polynomials. The Jacobi polynomials are a large class of classical orthogonal polynomials (and include Chebyshev and Legendre polynomials as special cases), and show up extensively in applications like numerical analysis and signal processing. One caveat of our work is that we require an assumption on the sparsity structure of the sparse vector, although we note that vectors with random support have this property with high probability. Our approach is to give a very general reduction from the k-sparse sparse recovery problem to the 1-sparse sparse recovery problem that holds for any flat orthogonal polynomial transform; then we solve this one-sparse recovery problem for transforms derived from Jacobi polynomials. Frequently, sparse FFT algorithms are described as implementing such a reduction; however, the technical details of such works are quite specific to the Fourier transform and moreover the actual implementations of these algorithms do not use the 1-sparse algorithm as a black box. In this work we give a reduction that works for a broad class of orthogonal polynomial families, and which uses any 1-sparse recovery algorithm as a black box.
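The numpy sketch below mimics the shape of the black-box reduction described above: hash the N coefficients into O(k²) buckets so that each nonzero is isolated with constant probability, then hand every bucket to a 1-sparse recovery oracle. The bucket filter is applied exactly in the coefficient domain here, which is precisely the step the paper instead realizes with sublinearly many queries to 𝐱 using properties of Jacobi polynomials, so this sketch captures only the reduction's logic, not its running time; the cosine transform and all parameter choices are assumptions for the demo.

# k-sparse recovery reduced to 1-sparse recovery, simulated with numpy.
import numpy as np

rng = np.random.default_rng(0)
N, k = 256, 4

# An orthonormal cosine (DCT-II-style) matrix stands in for a general
# orthogonal polynomial transform F, so that x_hat = F @ x.
idx = np.arange(N)
F = np.sqrt(2.0 / N) * np.cos(np.pi * np.outer(idx, idx + 0.5) / N)
F[0, :] /= np.sqrt(2.0)

# A k-sparse coefficient vector with random support, and its time-domain x.
support = rng.choice(N, size=k, replace=False)
x_hat = np.zeros(N)
x_hat[support] = rng.normal(size=k)
x = F.T @ x_hat

def one_sparse_recover(y_hat):
    """Trivial 1-sparse oracle: location and value of the single peak."""
    i = int(np.argmax(np.abs(y_hat)))
    return i, y_hat[i]

def recover(F, x, k, rng):
    B = 4 * k * k                          # enough buckets to isolate nonzeros
    buckets = rng.integers(0, B, size=len(x))
    estimate = np.zeros(len(x))
    for b in range(B):
        y_hat = (F @ x) * (buckets == b)   # exact bucket filter: the simulated step
        i, v = one_sparse_recover(y_hat)
        if abs(v) > 1e-9:
            estimate[i] = v
    return estimate

# One hashing isolates all k nonzeros only with constant probability, so
# repeat with fresh randomness, as sparse recovery algorithms typically do.
for trial in range(10):
    estimate = recover(F, x, k, rng)
    if np.allclose(estimate, x_hat):
        break
print(trial, np.allclose(estimate, x_hat))  # succeeds in an early trial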

Cite as

Anna Gilbert, Albert Gu, Christopher Ré, Atri Rudra, and Mary Wootters. Sparse Recovery for Orthogonal Polynomial Transforms. In 47th International Colloquium on Automata, Languages, and Programming (ICALP 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 168, pp. 58:1-58:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{gilbert_et_al:LIPIcs.ICALP.2020.58,
  author =	{Gilbert, Anna and Gu, Albert and R\'{e}, Christopher and Rudra, Atri and Wootters, Mary},
  title =	{{Sparse Recovery for Orthogonal Polynomial Transforms}},
  booktitle =	{47th International Colloquium on Automata, Languages, and Programming (ICALP 2020)},
  pages =	{58:1--58:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-138-2},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{168},
  editor =	{Czumaj, Artur and Dawar, Anuj and Merelli, Emanuela},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2020.58},
  URN =		{urn:nbn:de:0030-drops-124653},
  doi =		{10.4230/LIPIcs.ICALP.2020.58},
  annote =	{Keywords: Orthogonal polynomials, Jacobi polynomials, sublinear algorithms, sparse recovery}
}
Document
A Formal Framework for Probabilistic Unclean Databases

Authors: Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Ré, and Theodoros Rekatsinas

Published in: LIPIcs, Volume 127, 22nd International Conference on Database Theory (ICDT 2019)


Abstract
Most theoretical frameworks that focus on data errors and inconsistencies follow logic-based reasoning. Yet, practical data cleaning tools need to incorporate statistical reasoning to be effective in real-world data cleaning tasks. Motivated by empirical successes, we propose a formal framework for unclean databases, where two types of statistical knowledge are incorporated: The first represents a belief of how intended (clean) data is generated, and the second represents a belief of how noise is introduced in the actual observed database. To capture this noisy channel model, we introduce the concept of a Probabilistic Unclean Database (PUD), a triple that consists of a probabilistic database that we call the intention, a probabilistic data transformator that we call the realization and that captures how noise is introduced, and an observed unclean database that we call the observation. We define three computational problems in the PUD framework: cleaning (infer the most probable intended database, given a PUD), probabilistic query answering (compute the probability of an answer tuple over the unclean observed database), and learning (estimate the most likely intention and realization models of a PUD, given examples as training data). We illustrate the PUD framework on concrete representations of the intention and realization, show that they generalize traditional concepts of repairs such as cardinality and value repairs, draw connections to consistent query answering, and prove tractability results. We further show that parameters can be learned in some practical instantiations, and in fact, prove that under certain conditions we can learn a PUD directly from a single dirty database without any need for clean examples.
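A toy instantiation of the PUD triple may help fix ideas: below, the intention is an (unnormalized) prior favoring databases that satisfy a simple integrity constraint, the realization is a cell-wise noisy channel, and cleaning is MAP inference by brute-force enumeration. The three-value domain, flip-noise model, and constraint-based prior are illustrative assumptions, not the parametric models studied in the paper.

# Cleaning in a toy PUD: argmax over clean databases I of Pr[I] * Pr[obs | I].
from itertools import product

DOMAIN = ["red", "green", "blue"]
FLIP = 0.1   # realization: a cell is replaced by a uniform wrong value w.p. FLIP

def intention_prob(db):
    """Unnormalized prior over clean databases: strongly prefer databases in
    which all rows agree, a stand-in for a real integrity constraint."""
    return 1.0 if len(set(db)) == 1 else 0.01

def realization_prob(clean, dirty):
    """Pr[dirty | clean] under the cell-wise noisy channel."""
    p = 1.0
    for c, d in zip(clean, dirty):
        p *= (1 - FLIP) if c == d else FLIP / (len(DOMAIN) - 1)
    return p

def clean_map(dirty):
    """The cleaning problem: the most probable intended database given the PUD."""
    candidates = product(DOMAIN, repeat=len(dirty))
    return max(candidates, key=lambda db: intention_prob(db) * realization_prob(db, dirty))

observation = ("red", "red", "blue", "red")   # one cell was likely corrupted
print(clean_map(observation))                  # ('red', 'red', 'red', 'red')

The other two problems from the abstract fit the same triple: probabilistic query answering sums rather than maximizes over clean databases, and learning fits FLIP and the prior's parameters from examples.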

Cite as

Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Ré, and Theodoros Rekatsinas. A Formal Framework for Probabilistic Unclean Databases. In 22nd International Conference on Database Theory (ICDT 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 127, pp. 6:1-6:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


BibTeX

@InProceedings{desa_et_al:LIPIcs.ICDT.2019.6,
  author =	{De Sa, Christopher and Ilyas, Ihab F. and Kimelfeld, Benny and R\'{e}, Christopher and Rekatsinas, Theodoros},
  title =	{{A Formal Framework for Probabilistic Unclean Databases}},
  booktitle =	{22nd International Conference on Database Theory (ICDT 2019)},
  pages =	{6:1--6:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-101-6},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{127},
  editor =	{Barcelo, Pablo and Calautti, Marco},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2019.6},
  URN =		{urn:nbn:de:0030-drops-103083},
  doi =		{10.4230/LIPIcs.ICDT.2019.6},
  annote =	{Keywords: Unclean databases, data cleaning, probabilistic databases, noisy channel}
}
Document
08421 Working Group: Report of the Probabilistic Databases Benchmarking

Authors: Christoph Koch, Peter J. Haas, H.-J. Lenz, Dan Olteanu, Christopher Re, Maurice van Keulen, and Jeff Z. Pan

Published in: Dagstuhl Seminar Proceedings, Volume 8421, Uncertainty Management in Information Systems (2009)


Abstract
The results of the probabilistic database benchmark working group.

Cite as

Christoph Koch, Peter J. Haas, H.-J. Lenz, Dan Olteanu, Christopher Re, Maurice van Keulen, and Jeff Z. Pan. 08421 Working Group: Report of the Probabilistic Databases Benchmarking. In Uncertainty Management in Information Systems. Dagstuhl Seminar Proceedings, Volume 8421, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2009)


BibTeX

@InProceedings{koch_et_al:DagSemProc.08421.7,
  author =	{Koch, Christoph and Haas, Peter J. and Lenz, H.-J. and Olteanu, Dan and Re, Christopher and van Keulen, Maurice and Pan, Jeff Z.},
  title =	{{08421 Working Group: Report of the Probabilistic Databases Benchmarking}},
  booktitle =	{Uncertainty Management in Information Systems},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2009},
  volume =	{8421},
  editor =	{Christoph Koch and Birgitta K\"{o}nig-Ries and Volker Markl and Maurice van Keulen},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.08421.7},
  URN =		{urn:nbn:de:0030-drops-19367},
  doi =		{10.4230/DagSemProc.08421.7},
  annote =	{Keywords: Probabilistic databases, benchmark}
}
