On an Information Theoretic Approach to Cardinality Estimation (Invited Talk)

Author Hung Q. Ngo



PDF
Thumbnail PDF

File

LIPIcs.ICDT.2022.1.pdf
  • Filesize: 0.78 MB
  • 21 pages

Document Identifiers

Author Details

Hung Q. Ngo
  • RelationalAI Inc., Berkeley, CA, USA

Acknowledgements

I would like to thank the technical program committee and the program chair Dan Olteanu of ICDT 2022 for inviting me to give a keynote talk at the conference

Cite AsGet BibTex

Hung Q. Ngo. On an Information Theoretic Approach to Cardinality Estimation (Invited Talk). In 25th International Conference on Database Theory (ICDT 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 220, pp. 1:1-1:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.ICDT.2022.1

Abstract

This article is a companion to an invited talk at ICDT'2022 with the same title. Cardinality estimation is among the most important problems in query optimization. It is well-documented that, when query plans go haywire, in most cases one can trace the root cause to the cardinality estimator being far off. In particular, traditional cardinality estimation based on selectivity estimation may sometimes under-estimate cardinalities by orders of magnitudes, because the independence or the uniformity assumptions do not typically hold. This talk outlines an approach to cardinality estimation that is "model-free" from a statistical stand-point. Being model-free means the approach tries to avoid making any distributional assumptions. Our approach is information-theoretic, and generalizes recent results on worst-case output size bounds of queries, allowing the estimator to take into account histogram information from the input relations. The estimator turns out to be the objective of a maximization problem subject to concave constraints, over an exponential number of variables. We then explain how the estimator can be computed in polynomial time for some fragment of these constraints. Overall, the talk introduces a new direction to address the classic problem of cardinality estimation that is designed to circumvent some of the pitfalls of selectivity-based estimation. We will also explain connections to other fundamental problems in information theory and database theory regarding information inequalities. The talk is based on (published and unpublished) joint works with Mahmoud Abo Khamis, Sungjin Im, Hossein Keshavarz, Phokion Kolaitis, Ben Moseley, XuanLong Nguyen, Kirk Pruhs, Dan Suciu, and Alireza Samadian Zakaria

Subject Classification

ACM Subject Classification
  • Information systems → Query optimization
  • Information systems → Query planning
  • Information systems → Join algorithms
Keywords
  • Cardinality Estimation
  • Information Theory
  • Polymatroid Bound
  • Worst-case Optimal Join

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. MS SQL server documentation, 2021. URL: https://docs.microsoft.com/en-us/sql/relational-databases/performance/cardinality-estimation-sql-server?view=sql-server-ver15.
  2. Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. Profiling relational data: a survey. VLDB J., 24(4):557-581, 2015. URL: https://doi.org/10.1007/s00778-015-0389-y.
  3. Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison-Wesley, 1995. URL: http://webdam.inria.fr/Alice/.
  4. Mahmoud Abo Khamis, Phokion G. Kolaitis, Hung Q. Ngo, and Dan Suciu. Bag query containment and information theory. In Dan Suciu, Yufei Tao, and Zhewei Wei, editors, Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2020, Portland, OR, USA, June 14-19, 2020, pages 95-112. ACM, 2020. URL: https://doi.org/10.1145/3375395.3387645.
  5. Mahmoud Abo Khamis, Phokion G. Kolaitis, Hung Q. Ngo, and Dan Suciu. Decision problems in information theory. In Artur Czumaj, Anuj Dawar, and Emanuela Merelli, editors, 47th International Colloquium on Automata, Languages, and Programming, ICALP 2020, July 8-11, 2020, Saarbrücken, Germany (Virtual Conference), volume 168 of LIPIcs, pages 106:1-106:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. URL: https://doi.org/10.4230/LIPIcs.ICALP.2020.106.
  6. Mahmoud Abo Khamis, Hung Q. Ngo, and Dan Suciu. Computing join queries with functional dependencies. In Tova Milo and Wang-Chiew Tan, editors, Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 327-342. ACM, 2016. URL: https://doi.org/10.1145/2902251.2902289.
  7. Mahmoud Abo Khamis, Hung Q. Ngo, and Dan Suciu. What do shannon-type inequalities, submodular width, and disjunctive datalog have to do with one another? In Emanuel Sallinger, Jan Van den Bussche, and Floris Geerts, editors, Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017, Chicago, IL, USA, May 14-19, 2017, pages 429-444. ACM, 2017. URL: https://doi.org/10.1145/3034786.3056105.
  8. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Gary L. Miller, editor, Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, Philadelphia, Pennsylvania, USA, May 22-24, 1996, pages 20-29. ACM, 1996. URL: https://doi.org/10.1145/237814.237823.
  9. Albert Atserias, Martin Grohe, and Dániel Marx. Size bounds and query plans for relational joins. In FOCS, pages 739-748. IEEE Computer Society, 2008. URL: https://doi.org/10.1109/FOCS.2008.43.
  10. Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, Cambridge, 2004. URL: https://doi.org/10.1017/CBO9780511804441.
  11. Walter Cai, Magdalena Balazinska, and Dan Suciu. Pessimistic cardinality estimation: Tighter upper bounds for intermediate join cardinalities. In Peter A. Boncz, Stefan Manegold, Anastasia Ailamaki, Amol Deshpande, and Tim Kraska, editors, Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 18-35. ACM, 2019. URL: https://doi.org/10.1145/3299869.3319894.
  12. Yu Chen and Ke Yi. Random sampling and size estimation over cyclic joins. In Carsten Lutz and Jean Christoph Jung, editors, 23rd International Conference on Database Theory, ICDT 2020, March 30-April 2, 2020, Copenhagen, Denmark, volume 155 of LIPIcs, pages 7:1-7:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. URL: https://doi.org/10.4230/LIPIcs.ICDT.2020.7.
  13. Zhiyuan Chen, Flip Korn, Nick Koudas, and S. Muthukrishnan. Selectivity estimation for boolean queries. In Victor Vianu and Georg Gottlob, editors, Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, May 15-17, 2000, Dallas, Texas, USA, pages 216-225. ACM, 2000. URL: https://doi.org/10.1145/335168.335225.
  14. F. R. K. Chung, R. L. Graham, P. Frankl, and J. B. Shearer. Some intersection theorems for ordered sets and graphs. J. Combin. Theory Ser. A, 43(1):23-37, 1986. URL: https://doi.org/10.1016/0097-3165(86)90019-1.
  15. Graham Cormode, Minos N. Garofalakis, Peter J. Haas, and Chris Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Found. Trends Databases, 4(1-3):1-294, 2012. URL: https://doi.org/10.1561/1900000004.
  16. Saumya K. Debray and Nai-Wei Lin. Static estimation of query sizes in horn programs. In Serge Abiteboul and Paris C. Kanellakis, editors, ICDT'90, Third International Conference on Database Theory, Paris, France, December 12-14, 1990, Proceedings, volume 470 of Lecture Notes in Computer Science, pages 514-528. Springer, 1990. URL: https://doi.org/10.1007/3-540-53507-1_99.
  17. Kyle Deeds, Dan Suciu, Magda Balazinska, and Walter Cai. Degree sequence bound for join cardinality estimation, 2022. URL: http://arxiv.org/abs/2201.04166.
  18. Alin Dobra. Histograms revisited: when are histograms the best approximation method for aggregates over joins? In Chen Li, editor, Proceedings of the Twenty-fourth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 13-15, 2005, Baltimore, Maryland, USA, pages 228-237. ACM, 2005. URL: https://doi.org/10.1145/1065167.1065196.
  19. Cristian Estan and Jeffrey F. Naughton. End-biased samples for join cardinality estimation. In Ling Liu, Andreas Reuter, Kyu-Young Whang, and Jianjun Zhang, editors, Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, 3-8 April 2006, Atlanta, GA, USA, page 20. IEEE Computer Society, 2006. URL: https://doi.org/10.1109/ICDE.2006.61.
  20. Ehud Friedgut and Jeff Kahn. On the number of copies of one hypergraph in another. Israel J. Math., 105:251-256, 1998. URL: https://doi.org/10.1007/BF02780332.
  21. Allen Van Gelder. Multiple join size estimation by virtual domains. In Catriel Beeri, editor, Proceedings of the Twelfth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 25-28, 1993, Washington, DC, USA, pages 180-189. ACM Press, 1993. URL: https://doi.org/10.1145/153850.153872.
  22. Georg Gottlob, Stephanie Tien Lee, Gregory Valiant, and Paul Valiant. Size and treewidth bounds for conjunctive queries. J. ACM, 59(3):16, 2012. URL: https://doi.org/10.1145/2220357.2220363.
  23. Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, and Arun N. Swami. Fixed-precision estimation of join selectivity. In Catriel Beeri, editor, Proceedings of the Twelfth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 25-28, 1993, Washington, DC, USA, pages 190-201. ACM Press, 1993. URL: https://doi.org/10.1145/153850.153875.
  24. Peter J. Haas, Jeffrey F. Naughton, and Arun N. Swami. On the relative cost of sampling for join selectivity estimation. In Victor Vianu, editor, Proceedings of the Thirteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 24-26, 1994, Minneapolis, Minnesota, USA, pages 14-24. ACM Press, 1994. URL: https://doi.org/10.1145/182591.182594.
  25. Hazar Harmouch and Felix Naumann. Cardinality estimation: An experimental survey. Proc. VLDB Endow., 11(4):499-512, 2017. URL: https://doi.org/10.1145/3186728.3164145.
  26. Axel Hertzschuch, Claudio Hartmann, Dirk Habich, and Wolfgang Lehner. Simplicity done right for join ordering. In 11th Conference on Innovative Data Systems Research, CIDR 2021, Virtual Event, January 11-15, 2021, Online Proceedings. www.cidrdb.org, 2021. URL: http://cidrdb.org/cidr2021/papers/cidr2021_paper01.pdf.
  27. Dawei Huang, Dong Young Yoon, Seth Pettie, and Barzan Mozafari. Join on samples: A theoretical guide for practitioners. Proc. VLDB Endow., 13(4):547-560, 2019. URL: https://doi.org/10.14778/3372716.3372726.
  28. Sungjin Im, Ben Moseley, Hung Q. Ngo, Kirk Pruhs, and Alireza Samadian. On the complexity of computing the polymatroid bound, 2022. Manuscript. Google Scholar
  29. Yannis E. Ioannidis. The history of histograms (abridged). In Johann Christoph Freytag, Peter C. Lockemann, Serge Abiteboul, Michael J. Carey, Patricia G. Selinger, and Andreas Heuer, editors, Proceedings of 29th International Conference on Very Large Data Bases, VLDB 2003, Berlin, Germany, September 9-12, 2003, pages 19-30. Morgan Kaufmann, 2003. URL: https://doi.org/10.1016/B978-012722442-8/50011-2.
  30. Yannis E. Ioannidis and Stavros Christodoulakis. On the propagation of errors in the size of join results. In James Clifford and Roger King, editors, Proceedings of the 1991 ACM SIGMOD International Conference on Management of Data, Denver, Colorado, May 29-31, 1991., pages 268-277. ACM Press, 1991. URL: https://doi.org/10.1145/115790.115835.
  31. H. V. Jagadish, Raymond T. Ng, and Divesh Srivastava. Substring selectivity estimation. In Victor Vianu and Christos H. Papadimitriou, editors, Proceedings of the Eighteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 31 - June 2, 1999, Philadelphia, Pennsylvania, USA, pages 249-260. ACM Press, 1999. URL: https://doi.org/10.1145/303976.304001.
  32. Sai Vikneshwar Mani Jayaraman, Corey Ropell, and Atri Rudra. Worst-case optimal binary join algorithms under general 𝓁_p constraints. CoRR, abs/2112.01003, 2021. URL: http://arxiv.org/abs/2112.01003.
  33. E. T. Jaynes. Information theory and statistical mechanics. Phys. Rev. (2), 106:620-630, 1957. Google Scholar
  34. Stasys Jukna. Extremal combinatorics. Texts in Theoretical Computer Science. An EATCS Series. Springer, Heidelberg, second edition, 2011. With applications in computer science. URL: https://doi.org/10.1007/978-3-642-17364-6.
  35. Hossein Keshavarz, Hung Q. Ngo, and XuanLong Nguyen. Output size bounds under histogrammed frequency moment constraints, 2022. Manuscript. Google Scholar
  36. Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter A. Boncz, and Alfons Kemper. Learned cardinalities: Estimating correlated joins with deep learning. In 9th Biennial Conference on Innovative Data Systems Research, CIDR 2019, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings. www.cidrdb.org, 2019. URL: http://cidrdb.org/cidr2019/papers/p101-kipf-cidr19.pdf.
  37. Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. Optimal histograms for hierarchical range queries. In Victor Vianu and Georg Gottlob, editors, Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, May 15-17, 2000, Dallas, Texas, USA, pages 196-204. ACM, 2000. URL: https://doi.org/10.1145/335168.335223.
  38. Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter A. Boncz, Alfons Kemper, and Thomas Neumann. How good are query optimizers, really? Proc. VLDB Endow., 9(3):204-215, 2015. URL: https://doi.org/10.14778/2850583.2850594.
  39. Viktor Leis, Bernhard Radke, Andrey Gubichev, Alfons Kemper, and Thomas Neumann. Cardinality estimation done right: Index-based join sampling. In 8th Biennial Conference on Innovative Data Systems Research, CIDR 2017, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings. www.cidrdb.org, 2017. URL: http://cidrdb.org/cidr2017/papers/p9-leis-cidr17.pdf.
  40. Viktor Leis, Bernhard Radke, Andrey Gubichev, Atanas Mirchev, Peter A. Boncz, Alfons Kemper, and Thomas Neumann. Query optimization through the looking glass, and what we found running the join order benchmark. VLDB J., 27(5):643-668, 2018. URL: https://doi.org/10.1007/s00778-017-0480-7.
  41. Richard J. Lipton and Jeffrey F. Naughton. Query size estimation by adaptive sampling. In Daniel J. Rosenkrantz and Yehoshua Sagiv, editors, Proceedings of the Ninth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, April 2-4, 1990, Nashville, Tennessee, USA, pages 40-46. ACM Press, 1990. URL: https://doi.org/10.1145/298514.298540.
  42. G. Lohman. Is query optimization a solved problem?, 2014. URL: http://wp.sigmod.org/?p=1075.
  43. Volker Markl, Peter J. Haas, Marcel Kutsch, Nimrod Megiddo, Utkarsh Srivastava, and Tam Minh Tran. Consistent selectivity estimation via maximum entropy. VLDB J., 16(1):55-76, 2007. URL: https://doi.org/10.1007/s00778-006-0030-1.
  44. Guido Moerkotte, Thomas Neumann, and Gabriele Steidl. Preventing bad plans by bounding the impact of cardinality estimation errors. Proc. VLDB Endow., 2(1):982-993, 2009. URL: https://doi.org/10.14778/1687627.1687738.
  45. M. Muralikrishna and David J. DeWitt. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In Haran Boral and Per-Åke Larson, editors, Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 1-3, 1988, pages 28-36. ACM Press, 1988. URL: https://doi.org/10.1145/50202.50205.
  46. S. Muthukrishnan, Viswanath Poosala, and Torsten Suel. On rectangular partitionings in two dimensions: Algorithms, complexity, and applications. In Catriel Beeri and Peter Buneman, editors, Database Theory - ICDT '99, 7th International Conference, Jerusalem, Israel, January 10-12, 1999, Proceedings, volume 1540 of Lecture Notes in Computer Science, pages 236-256. Springer, 1999. URL: https://doi.org/10.1007/3-540-49257-7_16.
  47. Hung Q. Ngo. Worst-case optimal join algorithms: Techniques, results, and open problems. In Jan Van den Bussche and Marcelo Arenas, editors, Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018, pages 111-124. ACM, 2018. URL: https://doi.org/10.1145/3196959.3196990.
  48. Hung Q. Ngo, Ely Porat, Christopher Ré, and Atri Rudra. Worst-case optimal join algorithms: [extended abstract]. In Michael Benedikt, Markus Krötzsch, and Maurizio Lenzerini, editors, Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012, Scottsdale, AZ, USA, May 20-24, 2012, pages 37-48. ACM, 2012. Google Scholar
  49. Hung Q. Ngo, Ely Porat, Christopher Ré, and Atri Rudra. Worst-case optimal join algorithms: [extended abstract]. In PODS, pages 37-48, 2012. Google Scholar
  50. J. Radhakrishnan. Entropy and counting. In J. C. Misra, editor, Computational Mathematics, Modelling and Algorithms, pages 146-168. Narosa Pub House, 2003. Google Scholar
  51. Raghu Ramakrishnan and Johannes Gehrke. Database management systems (3. ed.). McGraw-Hill, 2003. Google Scholar
  52. Christopher Ré and Dan Suciu. Understanding cardinality estimation using entropy maximization. In Jan Paredaens and Dirk Van Gucht, editors, Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2010, June 6-11, 2010, Indianapolis, Indiana, USA, pages 53-64. ACM, 2010. URL: https://doi.org/10.1145/1807085.1807095.
  53. P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD international conference on Management of data, SIGMOD '79, pages 23-34, New York, NY, USA, 1979. ACM. Google Scholar
  54. Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, and Thomas G. Price. Access path selection in a relational database management system. In Philip A. Bernstein, editor, Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, Boston, Massachusetts, USA, May 30 - June 1, pages 23-34. ACM, 1979. URL: https://doi.org/10.1145/582095.582099.
  55. Richard P. Stanley. Enumerative combinatorics. Volume 1, volume 49 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, second edition, 2012. Google Scholar
  56. Todd L. Veldhuizen. Triejoin: A simple, worst-case optimal join algorithm. In Nicole Schweikardt, Vassilis Christophides, and Vincent Leroy, editors, Proc. 17th International Conference on Database Theory (ICDT), Athens, Greece, March 24-28, 2014, pages 96-106. OpenProceedings.org, 2014. URL: https://doi.org/10.5441/002/icdt.2014.13.
  57. Zongheng Yang, Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan, Xi Chen, Pieter Abbeel, Joseph M. Hellerstein, Sanjay Krishnan, and Ion Stoica. Deep unsupervised cardinality estimation. Proc. VLDB Endow., 13(3):279-292, 2019. URL: https://doi.org/10.14778/3368289.3368294.
  58. Raymond W. Yeung. Information Theory and Network Coding. Springer Publishing Company, Incorporated, 1 edition, 2008. Google Scholar
  59. Zhen Zhang and Raymond W. Yeung. On characterization of entropy function via information inequalities. IEEE Transactions on Information Theory, 44(4):1440-1452, 1998. URL: https://doi.org/10.1109/18.681320.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail