Infinite Probabilistic Databases

Authors: Martin Grohe, Peter Lindner




File

LIPIcs.ICDT.2020.16.pdf
  • Filesize: 0.59 MB
  • 20 pages

Author Details

Martin Grohe
  • RWTH Aachen University, Germany
Peter Lindner
  • RWTH Aachen University, Germany

Acknowledgements

We are grateful to Sam Staton for insightful discussions related to this work, and for pointing us to point processes. We also thank Peter J. Haas for discussions on the open-world assumption and the math behind the MCDB system.

Cite As

Martin Grohe and Peter Lindner. Infinite Probabilistic Databases. In 23rd International Conference on Database Theory (ICDT 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 155, pp. 16:1-16:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.ICDT.2020.16

Abstract

Probabilistic databases (PDBs) are used to model uncertainty in data in a quantitative way. In the standard formal framework, PDBs are finite probability spaces over relational database instances. It has been argued convincingly that this is not compatible with an open-world semantics (Ceylan et al., KR 2016) and with application scenarios that are modeled by continuous probability distributions (Dalvi et al., CACM 2009). We recently introduced a model of PDBs as infinite probability spaces that addresses these issues (Grohe and Lindner, PODS 2019). While that work was mainly concerned with countably infinite probability spaces, our focus here is on uncountable spaces. Such an extension is necessary to model typical continuous probability distributions that appear in many applications. However, an extension beyond countable probability spaces raises nontrivial foundational issues concerned with the measurability of events and queries and ultimately with the question whether queries have a well-defined semantics. It turns out that so-called finite point processes are the appropriate model from probability theory for dealing with probabilistic databases. This model allows us to construct suitable (uncountable) probability spaces of database instances in a systematic way. Our main technical results are measurability statements for relational algebra queries as well as aggregate queries and Datalog queries.
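
To make the finite point process view in the abstract concrete, the following is a minimal sketch and not the authors' construction. It assumes a hypothetical relation Temp(sensor, celsius) with a continuous attribute, a hand-picked distribution over instance sizes, and facts drawn independently; all names and numbers are illustrative. An instance is sampled by first drawing the number of facts and then drawing each fact, and the probability of a query event is estimated by Monte Carlo sampling. That this probability is well defined for such events is what the measurability results are about.

```python
import random


def sample_instance(rng, count_probs, sample_fact):
    """Draw one database instance from a finite point process.

    count_probs[n] is the probability that the instance has n facts;
    sample_fact(rng) draws a single fact. Drawing facts i.i.d. is a
    simplifying assumption of this sketch.
    """
    n = rng.choices(range(len(count_probs)), weights=count_probs)[0]
    return [sample_fact(rng) for _ in range(n)]  # a bag (list) of facts


def sample_temperature_fact(rng):
    # Hypothetical relation Temp(sensor, celsius); celsius is continuous.
    return ("Temp", rng.choice(["s1", "s2"]), rng.gauss(20.0, 2.0))


def count_hot(instance, threshold=22.0):
    # Aggregate query: COUNT of facts whose value exceeds the threshold.
    return sum(1 for (_, _, celsius) in instance if celsius > threshold)


def estimate_probability(event, n_samples=10_000, seed=0):
    # Monte Carlo estimate of the probability of a set of instances.
    rng = random.Random(seed)
    count_probs = [0.1, 0.3, 0.4, 0.2]  # assumed sizes 0..3
    hits = sum(
        event(sample_instance(rng, count_probs, sample_temperature_fact))
        for _ in range(n_samples)
    )
    return hits / n_samples


# Estimated probability that at least one reading exceeds 22 degrees.
p = estimate_probability(lambda db: count_hot(db) >= 1)
```

The space of possible instances here is uncountable because celsius ranges over the reals, which is the setting the abstract says countable models cannot capture.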

Subject Classification

ACM Subject Classification
  • Mathematics of computing → Probabilistic representations
  • Theory of computation → Incomplete, inconsistent, and uncertain databases
  • Theory of computation → Database query languages (principles)
Keywords
  • Probabilistic Databases
  • Possible Worlds Semantics
  • Query Measurability
  • Relational Algebra
  • Aggregate Queries

References

  1. Serge Abiteboul, T.-H. Hubert Chan, Evgeny Kharlamov, Werner Nutt, and Pierre Senellart. Capturing Continuous Data and Answering Aggregate Queries in Probabilistic XML. ACM Transactions on Database Systems (TODS), 36(4):25:1-25:45, 2011. URL: https://doi.org/10.1145/1804669.1804679.
  2. Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison-Wesley, Boston, MA, USA, 1st edition, 1995.
  3. Charu C. Aggarwal and Philip S. Yu. A Survey of Uncertain Data Algorithms and Applications. IEEE Transactions on Knowledge and Data Engineering (TKDE), 21(5):609-623, 2009. URL: https://doi.org/10.1109/TKDE.2008.190.
  4. Parag Agrawal and Jennifer Widom. Continuous Uncertainty in Trio. In Proceedings of the 3rd VLDB Workshop on Management of Uncertain Data (MUD '09), pages 17-32, Enschede, The Netherlands, 2009. Centre for Telematics and Information Technology (CTIT).
  5. Joseph Albert. Algebraic Properties of Bag Data Types. In Proceedings of the 17th International Conference on Very Large Databases (VLDB 1991), pages 211-219, San Francisco, CA, USA, 1991. Morgan Kaufmann Publishers Inc.
  6. Adrian Baddeley. Spatial Point Processes and Their Applications. In Wolfgang Weil, editor, Stochastic Geometry, Lecture Notes in Mathematics, chapter 1, pages 1-75. Springer, Berlin, Heidelberg, Germany, 1st edition, 2007.
  7. Vince Bárány, Balder Ten Cate, Benny Kimelfeld, Dan Olteanu, and Zografoula Vagena. Declarative Probabilistic Programming with Datalog. ACM Transactions on Database Systems (TODS), 42(4), 2017.
  8. Daniel Barbará, Héctor García-Molina, and Daryl Porter. The Management of Probabilistic Data. IEEE Transactions on Knowledge and Data Engineering, 4(5):487-502, 1992. URL: https://doi.org/10.1109/69.166990.
  9. Stefan Borgwardt, İsmail İlkan Ceylan, and Thomas Lukasiewicz. Ontology-Mediated Queries for Probabilistic Databases. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI '17), pages 1063-1069, Palo Alto, CA, USA, 2017. AAAI Press.
  10. Stefan Borgwardt, İsmail İlkan Ceylan, and Thomas Lukasiewicz. Recent Advances in Querying Probabilistic Knowledge Bases. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI '18), pages 5420-5426. International Joint Conferences on Artificial Intelligence, 2018. URL: https://doi.org/10.24963/ijcai.2018/765.
  11. Stefan Borgwardt, İsmail İlkan Ceylan, and Thomas Lukasiewicz. Ontology-Mediated Query Answering over Log-Linear Probabilistic Data. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, volume 33, Palo Alto, CA, USA, 2019. AAAI Press. URL: https://doi.org/10.1609/aaai.v33i01.33012711.
  12. Jihad Boulos, Nilesh Dalvi, Bhushan Mandhani, Shobhit Mathur, Chris Ré, and Dan Suciu. MYSTIQ: A System for Finding more Answers by Using Probabilities. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (SIGMOD '05), pages 891-893, New York, NY, USA, 2005. ACM. URL: https://doi.org/10.1145/1066157.1066277.
  13. Nicolas Bourbaki. General Topology. Chapters 5-10. Springer, Berlin and Heidelberg, Germany, 1st edition, 1989. Original French edition published by MASSON, Paris, 1974.
  14. Nicolas Bourbaki. General Topology. Chapters 1-4. Springer, Berlin and Heidelberg, Germany, 1st edition, 1995. Original French edition published by MASSON, Paris, 1971. URL: https://doi.org/10.1007/978-3-642-61701-0.
  15. Roger Cavallo and Michael Pittarelli. The Theory of Probabilistic Databases. In Proceedings of the 13th International Conference on Very Large Data Bases (VLDB '87), pages 71-81, San Francisco, CA, USA, 1987. Morgan Kaufmann.
  16. İsmail İlkan Ceylan, Adnan Darwiche, and Guy Van den Broeck. Open-World Probabilistic Databases. In Proceedings of the Fifteenth International Conference on Principles of Knowledge Representation and Reasoning (KR '16), pages 339-348, Palo Alto, CA, USA, 2016. AAAI Press.
  17. Reynold Cheng, Dmitri V. Kalashnikov, and Sunil Prabhakar. Evaluating Probabilistic Queries over Imprecise Data. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD '03), pages 551-562, New York, NY, USA, 2003. ACM. URL: https://doi.org/10.1145/872757.872823.
  18. Daryl John Daley and David Vere-Jones. An Introduction to the Theory of Point Processes, Volume I: Elementary Theory and Models. Probability and its Applications. Springer, New York, NY, USA, 2nd edition, 2003. URL: https://doi.org/10.1007/b97277.
  19. Daryl John Daley and David Vere-Jones. An Introduction to the Theory of Point Processes, Volume II: General Theory and Structure. Probability and its Applications. Springer, New York, NY, USA, 2nd edition, 2008. URL: https://doi.org/10.1007/978-0-387-49835-5.
  20. Nilesh Dalvi, Christopher Ré, and Dan Suciu. Probabilistic Databases: Diamonds in the Dirt. Communications of the ACM, 52(7):86-94, 2009. URL: https://doi.org/10.1145/1538788.1538810.
  21. Umeshwar Dayal, Nathan Goodman, and Randy Howard Katz. An Extended Relational Algebra with Control over Duplicate Elimination. In Proceedings of the 1st ACM SIGACT-SIGMOD Symposium on Principles of Database Systems (PODS '82), pages 117-123, New York, NY, USA, 1982. ACM.
  22. Luc De Raedt, Kristian Kersting, Sriraam Natarajan, and David Poole. Statistical Relational Artificial Intelligence: Logic, Probability, and Computation. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, San Rafael, CA, USA, 2016.
  23. Christoph Degen. Finite Point Processes and Their Application to Target Tracking. PhD thesis, Rheinische Friedrich-Wilhelms-Universität Bonn, 2015.
  24. Amol Deshpande, Carlos Guestrin, Samuel R. Madden, Joseph M. Hellerstein, and Wei Hong. Model-Driven Data Acquisition in Sensor Networks. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB '04), pages 588-599, St. Louis, 2004. Morgan Kaufmann. URL: https://doi.org/10.1016/B978-012088469-8.50053-X.
  25. Daniel Deutch, Christoph Koch, and Tova Milo. On Probabilistic Fixpoint and Markov Chain Query Languages. In Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '10), pages 215-226, New York, NY, USA, 2010. ACM.
  26. Debabrata Dey and Sumit Sarkar. A Probabilistic Relational Model and Algebra. ACM Transactions on Database Systems (TODS), 21(3), 1996. URL: https://doi.org/10.1145/232753.232796.
  27. Anton Faradjian, Johannes Gehrke, and Philippe Bonnett. GADT: A Probability Space ADT for Representing and Querying the Physical World. In Proceedings of the 18th International Conference on Data Engineering (ICDE '02), pages 201-211. IEEE Computing Society, 2002. URL: https://doi.org/10.1109/ICDE.2002.994710.
  28. Robert Fink, Larisa Han, and Dan Olteanu. Aggregation in Probabilistic Databases via Knowledge Compilation. In Proceedings of the 38th International Conference on Very Large Data Bases (VLDB '12), volume 5, pages 490-501. VLDB Endowment, 2012. URL: https://doi.org/10.14778/2140436.2140445.
  29. David H. Fremlin. Measure Theory, Volume 4: Topological Measure Spaces. Torres Fremlin, Colchester, UK, 2nd edition, 2013.
  30. David H. Fremlin. Measure Theory, Volume 2: Broad Foundations. Torres Fremlin, Colchester, UK, 2nd printing edition, 2016.
  31. Tal Friedman and Guy Van den Broeck. On Constrained Open-World Probabilistic Databases. In The 1st Conference on Automated Knowledge Base Construction (AKBC), 2019.
  32. Bert E. Fristedt and Lawrence F. Gray. A Modern Approach to Probability Theory. Probability and its Applications. Birkhäuser, Cambridge, MA, USA, 1st edition, 1997.
  33. Norbert Fuhr. Probabilistic Datalog - A Logic for Powerful Retrieval Methods. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), pages 282-290, New York, NY, USA, 1995. ACM.
  34. Norbert Fuhr and Thomas Rölleke. A Probabilistic Relational Algebra for the Integration of Information Retrieval and Database Systems. ACM Transactions on Information Systems (TOIS), 15(1):32-66, 1997. URL: https://doi.org/10.1145/239041.239045.
  35. Erol Gelenbe and George Hebrail. A Probability Model of Uncertainty in Data Bases. In 1986 IEEE Second International Conference on Data Engineering, pages 328-333. IEEE, 1986. URL: https://doi.org/10.1109/ICDE.1986.7266237.
  36. Todd J. Green. Models for Incomplete and Probabilistic Information. In Charu C. Aggarwal, editor, Managing and Mining Uncertain Data, volume 35 of Advances in Database Systems, chapter 2, pages 9-43. Springer, Boston, MA, USA, 2009. URL: https://doi.org/10.1007/978-0-387-09690-2_2.
  37. Martin Grohe and Peter Lindner. Infinite Probabilistic Databases, 2019. arXiv e-prints, URL: https://arxiv.org/abs/1904.06766.
  38. Martin Grohe and Peter Lindner. Probabilistic Databases with an Infinite Open-World Assumption. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS '19), pages 17-31, New York, NY, USA, 2019. ACM. Extended version available at arXiv: https://arxiv.org/abs/1807.00607. URL: https://doi.org/10.1145/3294052.3319681.
  39. Stéphane Grumbach, Leonid Libkin, Tova Milo, and Limsoon Wong. Query Languages for Bags: Expressive Power and Complexity. ACM SIGACT News, 1996(2):30-44, 1996. URL: https://doi.org/10.1145/235767.235770.
  40. Stéphane Grumbach and Tova Milo. Towards Tractable Algebras for Bags. Journal of Computer and System Sciences, 52(3):570-588, 1996. URL: https://doi.org/10.1006/jcss.1996.0042.
  41. Ravi Jampani, Fei Xu, Mingxi Wu, Luis Perez, Chris Jermaine, and Peter J. Haas. MCDB: A Monte Carlo Approach to Managing Uncertain Data. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD '08), pages 687-700, New York, NY, USA, 2008. ACM Press. URL: https://doi.org/10.1145/1376616.1376686.
  42. Ravi Jampani, Fei Xu, Mingxi Wu, Luis Perez, Chris Jermaine, and Peter J. Haas. The Monte Carlo Database System: Stochastic Analysis Close to the Data. ACM Transactions on Database Systems (TODS), 36(3):18:1-18:41, 2011. URL: https://doi.org/10.1145/2000824.2000828.
  43. Olav Kallenberg. Foundations of Modern Probability. Probability and its Applications. Springer, New York, NY, USA, 1st edition, 1997.
  44. Oliver Kennedy and Christoph Koch. PIP: A Database System for Great and Small Expectations. In Proceedings of the 26th International Conference on Data Engineering (ICDE '10), pages 157-168, Washington, DC, USA, 2010. IEEE.
  45. Christoph Koch. On Query Algebras for Probabilistic Databases. ACM SIGMOD Record, 37(4):78-85, 2008.
  46. Christoph Koch. MayBMS: A System for Managing Large Probabilistic Databases. In Charu C. Aggarwal, editor, Managing and Mining Uncertain Data, volume 35 of Advances in Database Systems, chapter 6, pages 149-184. Springer, Boston, MA, USA, 2009. URL: https://doi.org/10.1007/978-0-387-09690-2_6.
  47. Christoph Koch and Dan Olteanu. Conditioning Probabilistic Databases. In Proceedings of the 34th International Conference on Very Large Data Bases (VLDB '08), volume 1, pages 313-325. VLDB Endowment, 2008. URL: https://doi.org/10.14778/1453856.1453894.
  48. Günter Last and Matthew Penrose. Lectures on the Poisson Process. Institute of Mathematical Statistics Textbook. Cambridge University Press, Cambridge, UK, 2017. URL: https://doi.org/10.1017/9781316104477.
  49. Odile Macchi. The Coincidence Approach to Stochastic Point Processes. Advances in Applied Probability, 7(1):83-122, 1975. URL: https://doi.org/10.2307/1425855.
  50. Ronald P. S. Mahler. Statistical Multisource-Multitarget Information Fusion. Artech House, Inc., Norwood, MA, USA, 2007.
  51. Brian Milch, Bhaskara Marthi, Stuart Russell, David Sontag, David L. Ong, and Andrey Kolobov. BLOG: Probabilistic Models with Unknown Objects. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI '05), St. Louis, MO, USA, 2005. Morgan Kaufmann.
  52. Brian Christopher Milch. Probabilistic Models with Unknown Objects. PhD thesis, University of California, Berkeley, 2006.
  53. José Enrique Moyal. The General Theory of Stochastic Population Processes. Acta Mathematica, 108:1-31, 1962. URL: https://doi.org/10.1007/BF02545761.
  54. Raghotham Murthy, Robert Ikeda, and Jennifer Widom. Making Aggregation Work in Uncertain and Probabilistic Databases. IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(8):1261-1273, 2011. URL: https://doi.org/10.1109/TKDE.2010.166.
  55. Jian Pei, Bin Jiang, Xuemin Lin, and Yidong Yuan. Probabilistic Skylines on Uncertain Data. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB '07), pages 15-26. VLDB Endowment, 2007.
  56. Michael Pittarelli. An Algebra for Probabilistic Databases. IEEE Transactions on Knowledge and Data Engineering (TKDE), 6(2):293-303, 1994.
  57. Raymond Reiter. On Closed World Data Bases. In Herve Gallaire and Jack Minker, editors, Logic and Data Bases, pages 55-76. Plenum Press, New York, NY, USA, 1st edition, 1978.
  58. Matthew Richardson and Pedro Domingos. Markov Logic Networks. Machine Learning, 62(1-2):107-136, 2006. URL: https://doi.org/10.1007/s10994-006-5833-1.
  59. Robert Ross, V. S. Subrahmanian, and John Grant. Aggregate Operators in Probabilistic Databases. Journal of the ACM (JACM), 52(1):54-101, 2005. URL: https://doi.org/10.1145/1044731.1044734.
  60. Sarvjeet Singh, Chris Mayfield, Rahul Shah, Sunil Prabhakar, Susanne Hambrusch, Jennifer Neville, and Reynold Cheng. Database Support for Probabilistic Attributes and Tuples. In 2008 IEEE 24th International Conference on Data Engineering (ICDE '08), pages 1053-1061, Washington, DC, USA, 2008. IEEE Computer Society. URL: https://doi.org/10.1109/ICDE.2008.4497514.
  61. Parag Singla and Pedro Domingos. Markov Logic in Infinite Domains. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence (UAI '07), pages 368-375, Arlington, VA, USA, 2007. AUAI Press.
  62. Shashi Mohan Srivastava. A Course on Borel Sets. Graduate Texts in Mathematics. Springer, New York, NY, USA, 1st edition, 1998. URL: https://doi.org/10.1007/b98956.
  63. Dan Suciu, Dan Olteanu, Christopher Ré, and Christoph Koch. Probabilistic Databases. Synthesis Lectures on Data Management. Morgan & Claypool, San Rafael, CA, USA, 1st edition, 2011. URL: https://doi.org/10.2200/S00362ED1V01Y201105DTM016.
  64. Thanh T. L. Tran, Liping Peng, Yanlei Diao, Andrew McGregor, and Anna Liu. CLARO: Modeling and Processing Uncertain Data Streams. The VLDB Journal, 21(5):651-676, 2012. URL: https://doi.org/10.1007/s00778-011-0261-7.
  65. Guy Van den Broeck and Dan Suciu. Query Processing on Probabilistic Data: A Survey. Foundations and Trends® in Databases, 7(3-4):197-341, 2017. URL: https://doi.org/10.1561/1900000052.
  66. Brend Wanders and Maurice van Keulen. Revisiting the Formal Foundation of Probabilistic Databases. In Proceedings of the 2015 Conference of the International Fuzzy Systems Association and the European Society for Fuzzy Logic and Technology (IFSA-EUSFLAT '15), Advances in Intelligent Systems Research, pages 289-296, Paris, France, 2015. Atlantis Press.
  67. Yijie Wang, Xiaoyong Li, Xiaoling Li, and Yuan Wang. A Survey of Queries over Uncertain Data. Knowledge and Information Systems, 37(3):485-530, 2013. URL: https://doi.org/10.1007/s10115-013-0638-6.
  68. Jennifer Widom. Trio: A System for Data, Uncertainty, and Lineage. In Charu C. Aggarwal, editor, Managing and Mining Uncertain Data, volume 35 of Advances in Database Systems, chapter 5, pages 113-148. Springer, Boston, MA, USA, 2009. URL: https://doi.org/10.1007/978-0-387-09690-2_5.
  69. Eugene Wong. A Statistical Approach to Incomplete Information in Database Systems. ACM Transactions on Database Systems (TODS), 7(3):470-488, 1982. URL: https://doi.org/10.1145/319732.319747.
  70. Yi Wu, Siddharth Srivastava, Nicholas Hay, Simon Du, and Stuart Russell. Discrete-Continuous Mixtures in Probabilistic Programming: Generalized Semantics and Inference Algorithms. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), volume 80 of Proceedings of Machine Learning Research, pages 5343-5352. PMLR, 2018.
  71. Esteban Zimányi. Query Evaluation in Probabilistic Relational Databases. Theoretical Computer Science, 171(1):179-219, 1997. URL: https://doi.org/10.1016/S0304-3975(96)00129-6.