A Formal Framework for Probabilistic Unclean Databases

De Sa, Christopher; Ilyas, Ihab F.; Kimelfeld, Benny; Ré, Christopher; Rekatsinas, Theodoros

doi:10.4230/LIPIcs.ICDT.2019.6

Abstract

Most theoretical frameworks that focus on data errors and inconsistencies follow logic-based reasoning. Yet practical data cleaning tools need to incorporate statistical reasoning to be effective in real-world data cleaning tasks. Motivated by empirical successes, we propose a formal framework for unclean databases in which two types of statistical knowledge are incorporated: the first represents a belief of how intended (clean) data is generated, and the second represents a belief of how noise is introduced in the actual observed database. To capture this noisy channel model, we introduce the concept of a Probabilistic Unclean Database (PUD), a triple that consists of a probabilistic database that we call the intention, a probabilistic data transformator that we call the realization and that captures how noise is introduced, and an observed unclean database that we call the observation. We define three computational problems in the PUD framework: cleaning (infer the most probable intended database, given a PUD), probabilistic query answering (compute the probability of an answer tuple over the unclean observed database), and learning (estimate the most likely intention and realization models of a PUD, given examples as training data). We illustrate the PUD framework on concrete representations of the intention and realization, show that they generalize traditional concepts of repairs such as cardinality and value repairs, draw connections to consistent query answering, and prove tractability results. We further show that parameters can be learned in some practical instantiations and, in fact, prove that under certain conditions we can learn a PUD directly from a single dirty database without any need for clean examples.
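
To make the noisy channel reading of a PUD concrete, the following is a minimal sketch of cleaning and of a posterior computation on a toy single-attribute example. It is not the representation used in the paper: the value domain, the prior in INTENTION, the NOISE rate, the uniform corruption model, and the assumption that cells are independent are all hypothetical choices made only for illustration.

```python
# Toy noisy-channel model; every name and number here is an illustrative assumption.
INTENTION = {"Paris": 0.6, "Rome": 0.4}      # prior over intended (clean) values
OBSERVABLE = ["Paris", "Rome", "Pariss"]     # values that may appear in the dirty data
NOISE = 0.1                                  # assumed chance that a cell is corrupted


def realization(clean_value, observed):
    """P(observed | clean_value): keep the value, or corrupt it uniformly."""
    if observed == clean_value:
        return 1.0 - NOISE
    return NOISE / (len(OBSERVABLE) - 1)


def posterior(observed):
    """P(clean value | observed) by Bayes' rule over the finite intention."""
    joint = {v: p * realization(v, observed) for v, p in INTENTION.items()}
    total = sum(joint.values())
    return {v: w / total for v, w in joint.items()}


def clean(observation):
    """Most probable intended value per observed cell (cells assumed independent)."""
    cleaned = []
    for obs in observation:
        post = posterior(obs)
        cleaned.append(max(post, key=post.get))
    return cleaned


observation = ["Paris", "Pariss", "Rome"]
repaired = clean(observation)
```

Here `clean` plays the role of the cleaning problem (most probable intention given the observation), while `posterior` gives the kind of per-value probability that probabilistic query answering would build on. Learning would correspond to estimating INTENTION and NOISE from data rather than fixing them by hand.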

