A Formal Framework for Probabilistic Unclean Databases

Authors: Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Ré, Theodoros Rekatsinas




File

LIPIcs.ICDT.2019.6.pdf
  • Filesize: 0.7 MB
  • 18 pages

Author Details

Christopher De Sa
  • Cornell University, Ithaca, NY, USA
Ihab F. Ilyas
  • University of Waterloo, Waterloo, ON, Canada
Benny Kimelfeld
  • Technion - Israel Institute of Technology, Haifa, Israel
Christopher Ré
  • Stanford University, Stanford, CA, USA
Theodoros Rekatsinas
  • University of Wisconsin - Madison, Madison, WI, USA

Cite As

Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Ré, and Theodoros Rekatsinas. A Formal Framework for Probabilistic Unclean Databases. In 22nd International Conference on Database Theory (ICDT 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 127, pp. 6:1-6:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.ICDT.2019.6

Abstract

Most theoretical frameworks that focus on data errors and inconsistencies follow logic-based reasoning. Yet, practical data cleaning tools need to incorporate statistical reasoning to be effective in real-world data cleaning tasks. Motivated by empirical successes, we propose a formal framework for unclean databases, where two types of statistical knowledge are incorporated: The first represents a belief of how intended (clean) data is generated, and the second represents a belief of how noise is introduced in the actual observed database. To capture this noisy channel model, we introduce the concept of a Probabilistic Unclean Database (PUD), a triple that consists of a probabilistic database that we call the intention, a probabilistic data transformator that we call the realization and captures how noise is introduced, and an observed unclean database that we call the observation. We define three computational problems in the PUD framework: cleaning (infer the most probable intended database, given a PUD), probabilistic query answering (compute the probability of an answer tuple over the unclean observed database), and learning (estimate the most likely intention and realization models of a PUD, given examples as training data). We illustrate the PUD framework on concrete representations of the intention and realization, show that they generalize traditional concepts of repairs such as cardinality and value repairs, draw connections to consistent query answering, and prove tractability results. We further show that parameters can be learned in some practical instantiations, and in fact, prove that under certain conditions we can learn a PUD directly from a single dirty database without any need for clean examples.
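The cleaning problem described in the abstract can be illustrated with a minimal sketch of the noisy channel idea on a single cell. Everything concrete here is an assumption made for illustration and is not taken from the paper: the prior `PRIOR` stands in for an intention model, the function `channel` stands in for a realization model that keeps a value with probability `1 - eps` and otherwise replaces it uniformly at random, and `clean_cell` is a hypothetical helper that returns the most probable intended value by Bayes' rule.

```python
# Toy noisy-channel cleaning for one cell (illustrative assumptions only).

# Hypothetical intention model: prior belief over the clean value of a cell.
PRIOR = {"Madison": 0.6, "Waterloo": 0.3, "Haifa": 0.1}


def channel(observed, clean, eps=0.1):
    """Hypothetical realization model: probability of observing `observed`
    when the intended value is `clean`. The value survives with probability
    1 - eps; otherwise it is replaced by one of the other values uniformly."""
    if observed == clean:
        return 1.0 - eps
    return eps / (len(PRIOR) - 1)


def clean_cell(observed, eps=0.1):
    """Return the most probable intended value and the full posterior,
    using posterior(clean) proportional to prior(clean) * channel(observed | clean)."""
    scores = {c: PRIOR[c] * channel(observed, c, eps) for c in PRIOR}
    total = sum(scores.values())
    posterior = {c: s / total for c, s in scores.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior


if __name__ == "__main__":
    # Low noise: the observation is trusted.
    print(clean_cell("Haifa", eps=0.1)[0])
    # High noise: the prior over intended values dominates.
    print(clean_cell("Haifa", eps=0.5)[0])
```

Under these assumed numbers, an observed "Haifa" is kept when the noise level is low (`eps=0.1`) and is repaired to "Madison" when the noise level is high (`eps=0.5`), which shows how the two kinds of statistical knowledge trade off against each other. The framework in the paper operates on whole databases with integrity constraints; this per-cell version only conveys the underlying inference step.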

Subject Classification

ACM Subject Classification
  • Theory of computation → Data modeling
  • Theory of computation → Incomplete, inconsistent, and uncertain databases
Keywords
  • Unclean databases
  • data cleaning
  • probabilistic databases
  • noisy channel

