
# Computing Data Distribution from Query Selectivities


## Cite As

Pankaj K. Agarwal, Rahul Raychaudhury, Stavros Sintos, and Jun Yang. Computing Data Distribution from Query Selectivities. In 27th International Conference on Database Theory (ICDT 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 290, pp. 18:1-18:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.ICDT.2024.18

## Abstract

We are given a set 𝒵 = {(R_1,s_1), …, (R_n,s_n)}, where each R_i is a range in ℝ^d, such as a rectangle or a ball, and s_i ∈ [0,1] denotes its selectivity. The goal is to compute a small-size discrete data distribution 𝒟 = {(q_1,w_1), …, (q_m,w_m)}, where q_j ∈ ℝ^d and w_j ∈ [0,1] for each 1 ≤ j ≤ m, and ∑_{1≤j≤m} w_j = 1, such that 𝒟 is the most consistent with 𝒵, i.e., err_p(𝒟,𝒵) = (1/n) ∑_{i=1}^n |s_i - ∑_{j=1}^m w_j⋅1(q_j ∈ R_i)|^p is minimized. In a database setting, 𝒵 corresponds to a workload of range queries over some table, together with their observed selectivities (i.e., the fraction of tuples returned), and 𝒟 can be used as a compact model for approximating the data distribution within the table without accessing the underlying contents. In this paper, we obtain both upper and lower bounds for this problem. In particular, we show that the problem of finding the best data distribution from selectivity queries is NP-complete. On the positive side, we describe a Monte Carlo algorithm that constructs, in time O((n+δ^{-d}) δ^{-2} polylog n), a discrete distribution 𝒟̃ of size O(δ^{-2}), such that err_p(𝒟̃,𝒵) ≤ min_𝒟 err_p(𝒟,𝒵)+δ (for p = 1,2,∞), where the minimum is taken over all discrete distributions. We also establish conditional lower bounds, which strongly indicate the infeasibility of relative approximations as well as of removing the exponential dependency on the dimension for additive approximations. This suggests that significant improvements to our algorithm are unlikely.
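To make the objective concrete, the following is a minimal sketch of how err_p(𝒟,𝒵) could be evaluated for a given distribution and workload. It is not the paper's algorithm: it only computes the error, assumes the ranges are axis-aligned boxes, and handles finite p only (the p = ∞ case is not covered). The names `contains`, `selectivity`, and `err_p` are hypothetical and chosen for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, ...]
Box = Tuple[Point, Point]                 # (lower corner, upper corner), axis-aligned
Distribution = List[Tuple[Point, float]]  # pairs (q_j, w_j) with weights summing to 1
Workload = List[Tuple[Box, float]]        # pairs (R_i, s_i)


def contains(box: Box, q: Point) -> bool:
    """Indicator 1(q in R) for a closed axis-aligned box R (an assumed range type)."""
    lo, hi = box
    return all(l <= x <= h for l, x, h in zip(lo, q, hi))


def selectivity(dist: Distribution, box: Box) -> float:
    """Selectivity of the box under the distribution: total weight of points inside it."""
    return sum(w for q, w in dist if contains(box, q))


def err_p(dist: Distribution, workload: Workload, p: int = 1) -> float:
    """Average of |s_i - selectivity(R_i)|^p over the workload, for finite p."""
    n = len(workload)
    return sum(abs(s - selectivity(dist, box)) ** p for box, s in workload) / n


# Toy instance in d = 2: two weighted points and three box queries.
dist = [((0.25, 0.25), 0.5), ((0.75, 0.75), 0.5)]
workload = [
    (((0.0, 0.0), (0.5, 0.5)), 0.5),  # matched exactly by the first point
    (((0.0, 0.0), (1.0, 1.0)), 1.0),  # matched exactly by both points
    (((0.5, 0.5), (1.0, 1.0)), 0.3),  # distribution gives 0.5, so the error is 0.2
]
e1 = err_p(dist, workload, p=1)
e2 = err_p(dist, workload, p=2)
```

In this toy instance only the third query contributes error, so `e1` is about 0.2/3 and `e2` about 0.04/3. Evaluating the error this way costs O(nm) indicator tests; the difficulty the abstract addresses lies in choosing the points and weights, not in evaluating a fixed candidate.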

## Subject Classification

##### ACM Subject Classification
• Theory of computation → Computational geometry
##### Keywords
• selectivity queries
• discrete distributions
• Multiplicative Weights Update
• eps-approximation
• learnable functions
• depth problem
• arrangement

