Integrity Constraints Revisited: From Exact to Approximate Implication

Authors Batya Kenig, Dan Suciu



PDF
Thumbnail PDF

File

LIPIcs.ICDT.2020.18.pdf
  • Filesize: 0.57 MB
  • 20 pages

Document Identifiers

Author Details

Batya Kenig
  • University of Washington, Seattle, WA, USA
Dan Suciu
  • University of Washington, Seattle, WA, USA

Cite As Get BibTex

Batya Kenig and Dan Suciu. Integrity Constraints Revisited: From Exact to Approximate Implication. In 23rd International Conference on Database Theory (ICDT 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 155, pp. 18:1-18:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020) https://doi.org/10.4230/LIPIcs.ICDT.2020.18

Abstract

Integrity constraints such as functional dependencies (FD), and multi-valued dependencies (MVD) are fundamental in database schema design. Likewise, probabilistic conditional independences (CI) are crucial for reasoning about multivariate probability distributions. The implication problem studies whether a set of constraints (antecedents) implies another constraint (consequent), and has been investigated in both the database and the AI literature, under the assumption that all constraints hold exactly. However, many applications today consider constraints that hold only approximately. In this paper we define an approximate implication as a linear inequality between the degree of satisfaction of the antecedents and consequent, and we study the relaxation problem: when does an exact implication relax to an approximate implication? We use information theory to define the degree of satisfaction, and prove several results. First, we show that any implication from a set of data dependencies (MVDs+FDs) can be relaxed to a simple linear inequality with a factor at most quadratic in the number of variables; when the consequent is an FD, the factor can be reduced to 1. Second, we prove that there exists an implication between CIs that does not admit any relaxation; however, we prove that every implication between CIs relaxes "in the limit". Finally, we show that the implication problem for differential constraints in market basket analysis also admits a relaxation with a factor equal to 1. Our results recover, and sometimes extend, several previously known results about the implication problem: implication of MVDs can be checked by considering only 2-tuple relations, and the implication of differential constraints for frequent item sets can be checked by considering only databases containing a single transaction.

Subject Classification

ACM Subject Classification
  • Theory of computation → Database theory
  • Theory of computation → Database constraints theory
Keywords
  • Integrity constraints
  • The implication problem

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. William Ward Armstrong and Claude Delobel. Decomposition and Functional Dependencies in Relations. ACM Trans. Database Syst., 5(4):404-430, 1980. URL: https://doi.org/10.1145/320610.320620.
  2. Catriel Beeri, Ronald Fagin, and John H. Howard. A Complete Axiomatization for Functional and Multivalued Dependencies in Database Relations. In Proceedings of the 1977 ACM SIGMOD International Conference on Management of Data, Toronto, Canada, August 3-5, 1977., pages 47-61, 1977. URL: https://doi.org/10.1145/509404.509414.
  3. Tobias Bleifuß, Susanne Bülow, Johannes Frohnhofen, Julian Risch, Georg Wiese, Sebastian Kruse, Thorsten Papenbrock, and Felix Naumann. Approximate Discovery of Functional Dependencies for Large Datasets. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016, pages 1803-1812, 2016. URL: https://doi.org/10.1145/2983323.2983781.
  4. Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004. URL: https://doi.org/10.1017/CBO9780511804441.
  5. Toon Calders and Jan Paredaens. Axiomatization of Frequent Sets. In Jan Van den Bussche and Victor Vianu, editors, Database Theory - ICDT 2001, pages 204-218, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg. Google Scholar
  6. Terence Chan. Recent Progresses in Characterising Information Inequalities. Entropy, 13(2):379-401, 2011. URL: https://doi.org/10.3390/e13020379.
  7. Xu Chu, Ihab F. Ilyas, Paolo Papotti, and Yin Ye. RuleMiner: Data quality rules discovery. In IEEE 30th International Conference on Data Engineering, Chicago, ICDE 2014, IL, USA, March 31 - April 4, 2014, pages 1222-1225, 2014. URL: https://doi.org/10.1109/ICDE.2014.6816746.
  8. Mehmet M. Dalkilic and Edward L. Roberston. Information Dependencies. In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '00, pages 245-253, New York, NY, USA, 2000. ACM. URL: https://doi.org/10.1145/335168.336059.
  9. Randall Dougherty, Christopher F. Freiling, and Kenneth Zeger. Linear rank inequalities on five or more variables. CoRR, abs/0910.0284, 2009. URL: http://arxiv.org/abs/0910.0284.
  10. Ronald Fagin. Horn clauses and database dependencies. J. ACM, 29(4):952-985, 1982. URL: https://doi.org/10.1145/322344.322347.
  11. Abram Friesen and Pedro Domingos. The Sum-Product Theorem: A Foundation for Learning Tractable Models. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1909-1918, New York, New York, USA, June 2016. PMLR. URL: http://proceedings.mlr.press/v48/friesen16.html.
  12. Dan Geiger and Judea Pearl. Logical and Algorithmic Properties of Conditional Independence and Graphical Models. The Annals of Statistics, 21(4):2001-2021, 1993. URL: http://www.jstor.org/stable/2242326.
  13. Chris Giannella and Edward L. Robertson. On approximation measures for functional dependencies. Inf. Syst., 29(6):483-507, 2004. URL: https://doi.org/10.1016/j.is.2003.10.006.
  14. Marc Gyssens, Mathias Niepert, and Dirk Van Gucht. On the completeness of the semigraphoid axioms for deriving arbitrary from saturated conditional independence statements. Inf. Process. Lett., 114(11):628-633, 2014. URL: https://doi.org/10.1016/j.ipl.2014.05.010.
  15. Christian Herrmann. Corrigendum to "On the undecidability of implications between embedded multivalued database dependencies" [Inform. and Comput. 122(1995) 221-235]. Inf. Comput., 204(12):1847-1851, 2006. URL: https://doi.org/10.1016/j.ic.2006.09.002.
  16. Ihab F. Ilyas and Xu Chu. Trends in Cleaning Relational Data: Consistency and Deduplication. Foundations and Trends in Databases, 5(4):281-393, 2015. URL: https://doi.org/10.1561/1900000045.
  17. Tarik Kaced and Andrei E. Romashchenko. Conditional Information Inequalities for Entropic and Almost Entropic Points. IEEE Trans. Information Theory, 59(11):7149-7167, 2013. URL: https://doi.org/10.1109/TIT.2013.2274614.
  18. Batya Kenig and Dan Suciu. Integrity Constraints Revisited: From Exact to Approximate Implication. CoRR, abs/1812.09987, 2018. URL: http://arxiv.org/abs/1812.09987.
  19. Juha Kontinen, Sebastian Link, and Jouko Väänänen. Independence in Database Relations. In Leonid Libkin, Ulrich Kohlenbach, and Ruy de Queiroz, editors, Logic, Language, Information, and Computation, pages 179-193, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. Google Scholar
  20. Sebastian Kruse and Felix Naumann. Efficient Discovery of Approximate Dependencies. PVLDB, 11(7):759-772, 2018. URL: https://doi.org/10.14778/3192965.3192968.
  21. Tony T. Lee. An Information-Theoretic Analysis of Relational Databases - Part I: Data Dependencies and Information Metric. IEEE Trans. Software Eng., 13(10):1049-1061, 1987. URL: https://doi.org/10.1109/TSE.1987.232847.
  22. Chen Li, editor. Proceedings of the Twenty-fourth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 13-15, 2005, Baltimore, Maryland, USA. ACM, 2005. URL: http://dl.acm.org/citation.cfm?id=1065167.
  23. David Maier. Theory of Relational Databases. Computer Science Pr, 1983. Google Scholar
  24. Francesco M. Malvestuto. Statistical treatment of the information content of a database. Inf. Syst., 11(3):211-223, 1986. URL: https://doi.org/10.1016/0306-4379(86)90029-3.
  25. F. Matúš. Infinitely Many Information Inequalities. In 2007 IEEE International Symposium on Information Theory, pages 41-44, June 2007. URL: https://doi.org/10.1109/ISIT.2007.4557201.
  26. Alejandro Molina, Antonio Vergari, Nicola Di Mauro, Sriraam Natarajan, Floriana Esposito, and Kristian Kersting. Mixed Sum-Product Networks: A Deep Architecture for Hybrid Domains. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 3828-3835, 2018. URL: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16865.
  27. Judea Pearl and Azaria Paz. Graphoids: Graph-Based Logic for Reasoning about Relevance Relations or When would x tell you more about y if you already know z? In ECAI, pages 357-363, 1986. Google Scholar
  28. Jean-Philippe Pellet and André Elisseeff. Using Markov Blankets for Causal Structure Learning. Journal of Machine Learning Research, 9:1295-1342, 2008. URL: https://doi.org/10.1145/1390681.1442776.
  29. Hoifung Poon and Pedro M. Domingos. Sum-Product Networks: A New Deep Architecture. In UAI 2011, Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, Barcelona, Spain, July 14-17, 2011, pages 337-346, 2011. URL: https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&smnu=2&article_id=2194&proceeding_id=27.
  30. Yehoshua Sagiv, Claude Delobel, D. Scott Parker, Jr., and Ronald Fagin. An Equivalence Between Relational Database Dependencies and a Fragment of Propositional Logic. J. ACM, 28(3):435-453, July 1981. URL: https://doi.org/10.1145/322261.322263.
  31. Babak Salimi, Johannes Gehrke, and Dan Suciu. Bias in OLAP Queries: Detection, Explanation, and Removal. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pages 1021-1035, 2018. URL: https://doi.org/10.1145/3183713.3196914.
  32. Bassem Sayrafi, Dirk Van Gucht, and Marc Gyssens. The implication problem for measure-based constraints. Inf. Syst., 33(2):221-239, 2008. URL: https://doi.org/10.1016/j.is.2007.07.005.
  33. Bassem Sayrafi and Dirk Van Gucht. Differential Constraints. In Proceedings of the Twenty-fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '05, pages 348-357, New York, NY, USA, 2005. ACM. URL: https://doi.org/10.1145/1065167.1065213.
  34. Yannis Sismanis, Paul Brown, Peter J. Haas, and Berthold Reinwald. GORDIAN: efficient and scalable discovery of composite keys. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006, pages 691-702, 2006. URL: http://dl.acm.org/citation.cfm?id=1164187.
  35. Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT press, 2000. Google Scholar
  36. Milan Studený. Conditional Independence Relations have no finite complete characterization. In 11th Prague Conf. Information Theory, Statistical Decision Foundation and Random Processes, pages 377-396. Norwell, MA, 1990. Google Scholar
  37. Milan Studený. Convex cones in finite-dimensional real vector spaces. Kybernetika, 29(2):180-200, 1993. URL: http://www.kybernetika.cz/content/1993/2/180.
  38. S. K. Michael Wong, Cory J. Butz, and Dan Wu. On the implication problem for probabilistic conditional independency. IEEE Trans. Systems, Man, and Cybernetics, Part A, 30(6):785-805, 2000. URL: https://doi.org/10.1109/3468.895901.
  39. Raymond W. Yeung. A framework for linear information inequalities. IEEE Trans. Information Theory, 43(6):1924-1934, 1997. URL: https://doi.org/10.1109/18.641556.
  40. Raymond W. Yeung. Information Theory and Network Coding. Springer Publishing Company, Incorporated, 1 edition, 2008. Google Scholar
  41. Zhen Zhang and Raymond W. Yeung. A non-Shannon-type conditional inequality of information quantities. IEEE Trans. Information Theory, 43(6):1982-1986, 1997. URL: https://doi.org/10.1109/18.641561.
  42. Zhen Zhang and Raymond W. Yeung. On Characterization of Entropy Function via Information Inequalities. IEEE Trans. Information Theory, 44(4):1440-1452, 1998. URL: https://doi.org/10.1109/18.681320.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail