The Importance of Parameters in Database Queries

Grohe, Martin; Kimelfeld, Benny; Lindner, Peter; Standke, Christoph

doi:10.4230/LIPIcs.ICDT.2024.14

Abstract

We propose and study a framework for quantifying the importance of the choices of parameter values to the result of a query over a database. These parameters occur as constants in logical queries, such as conjunctive queries. In our framework, the importance of a parameter is its SHAP score. This score is a popular instantiation of the game-theoretic Shapley value for measuring the importance of feature values in machine learning models. We make the case for using this score by explaining the intuition behind SHAP, and by showing that we arrive at this score via two different, apparently opposing, approaches to quantifying the contribution of a parameter.
The application of the SHAP score requires two components in addition to the query and the database: (a) a probability distribution over the combinations of parameter values, and (b) a utility function that measures the similarity between the result for the original parameters and the result for hypothetical parameters. The main question addressed in the paper is the complexity of calculating the SHAP score for different distributions and similarity measures. We first address the case of probabilistically independent parameters. The problem is hard if we consider a fragment of queries that is hard to evaluate (as one would expect), and even for the fragment of acyclic conjunctive queries. In some cases, though, one can efficiently list all relevant parameter combinations, and then the SHAP score can be computed in polynomial time under reasonable general conditions. Also tractable is the case of full acyclic conjunctive queries for certain (natural) similarity functions. We extend our results to conjunctive queries with inequalities between variables and parameters. Finally, we discuss a simple approximation technique for the case of correlated parameters.
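To make the ingredients above concrete, here is a minimal brute-force sketch of a SHAP score for query parameters in the independent case. It is an illustration under stated assumptions, not the algorithm from the paper: the toy relation, the query, the choice of Jaccard similarity as the utility function, the marginal distributions, and all function names are hypothetical. It uses the standard Shapley weighting over subsets of parameters, where the value of a subset J is the expected similarity when the parameters in J are fixed to their original values and the remaining ones are drawn from their marginals. It enumerates all subsets and all parameter combinations, so it only works for very small inputs.

```python
from fractions import Fraction
from itertools import combinations, product
from math import factorial


def jaccard(a, b):
    """Similarity of two answer sets (an assumed utility; 1 if both are empty)."""
    if not a and not b:
        return Fraction(1)
    return Fraction(len(a & b), len(a | b))


def expected_similarity(query, db, ref, marginals, fixed):
    """E[similarity] when parameters in `fixed` keep their reference value
    and all other parameters are drawn independently from their marginals."""
    free = [i for i in range(len(ref)) if i not in fixed]
    ref_result = query(db, ref)
    total = Fraction(0)
    for choice in product(*(marginals[i].items() for i in free)):
        params, prob = list(ref), Fraction(1)
        for i, (value, p) in zip(free, choice):
            params[i] = value
            prob *= p
        total += prob * jaccard(ref_result, query(db, tuple(params)))
    return total


def shap_scores(query, db, ref, marginals):
    """Shapley value of each parameter in the game J -> expected_similarity(J)."""
    n = len(ref)
    scores = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        score = Fraction(0)
        for k in range(n):
            weight = Fraction(factorial(k) * factorial(n - k - 1), factorial(n))
            for subset in combinations(others, k):
                fixed = set(subset)
                with_i = expected_similarity(query, db, ref, marginals, fixed | {i})
                without_i = expected_similarity(query, db, ref, marginals, fixed)
                score += weight * (with_i - without_i)
        scores.append(score)
    return scores


# Hypothetical toy data: Employee(name, dept, salary).
DB = [("ann", "cs", 50), ("bob", "cs", 70), ("eve", "math", 70)]


def query(db, params):
    """Names of employees in department `dept` earning at least `min_salary`."""
    dept, min_salary = params
    return {name for name, d, s in db if d == dept and s >= min_salary}


REF = ("cs", 60)  # the parameter values whose importance we want to measure
HALF = Fraction(1, 2)
MARGINALS = [{"cs": HALF, "math": HALF}, {40: HALF, 60: HALF}]

scores = shap_scores(query, DB, REF, MARGINALS)
print(dict(zip(["dept", "min_salary"], scores)))
```

By the efficiency property of the Shapley value, the scores sum to the difference between the expected similarity with all parameters fixed (which is 1 here) and the expected similarity with none fixed. Exact rational arithmetic via `Fraction` keeps that identity checkable without floating-point error.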

