Algorithms for Provisioning Queries and Analytics
Provisioning is a technique for avoiding repeated expensive computations in what-if analysis. Given a query, an analyst formulates k hypotheticals, each retaining some of the tuples of a database instance (possibly overlapping), and wishes to answer the query under scenarios, where a scenario is defined by the subset of hypotheticals that are "turned on". We say that a query admits compact provisioning if, given any database instance and any k hypotheticals, one can create a sketch of size polynomial in k that can then be used to answer the query under any of the 2^k possible scenarios without accessing the original instance.
In this paper, we focus on provisioning complex queries that combine relational algebra (the logical component), grouping, and statistics/analytics (the numerical component). We first show that queries that compute quantiles or linear regression (as well as simpler queries that compute count and sum/average of positive values) can be compactly provisioned to provide (multiplicative) approximate answers to an arbitrary precision. In contrast, exact provisioning for each of these statistics requires the sketch size to be exponential in k. We then establish that for any complex query whose logical component is a positive relational algebra query, as long as the numerical component can be compactly provisioned, the complex query itself can be compactly provisioned. On the other hand, introducing negation or recursion in the logical component again requires the sketch size to be exponential in k. While our positive results use algorithms that do not access the original instance after a scenario is known, we prove our lower bounds even for the case when, knowing the scenario, limited access to the instance is allowed.
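The provisioning setup described in the abstract can be illustrated with a deliberately simple statistic. The sketch below is not one of the paper's algorithms: it uses max, which is not among the statistics the abstract discusses, only because a per-hypothetical maximum is an obvious sketch of size k. It assumes that a scenario retains the union of the tuples kept by its turned-on hypotheticals. The toy instance, the hypotheticals, and all function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy instance: one numeric value per tuple.
instance = [3, 17, 8, 42, 5, 23]

# Hypothetical hypotheticals: each retains a set of tuple positions; they may overlap.
hypotheticals = [
    {0, 1, 2},
    {2, 3},
    {4, 5, 1},
]


def build_max_sketch(instance, hypotheticals):
    """Store one number per hypothetical: the max of the values it retains."""
    return [max((instance[i] for i in h), default=None) for h in hypotheticals]


def answer_from_sketch(sketch, scenario):
    """Answer max under a scenario using only the sketch (no instance access)."""
    vals = [sketch[j] for j in scenario if sketch[j] is not None]
    return max(vals, default=None)


def answer_from_instance(instance, hypotheticals, scenario):
    """Reference answer: recompute max over the union of retained tuples.

    Assumption: a scenario keeps a tuple if any turned-on hypothetical keeps it.
    """
    retained = set().union(*(hypotheticals[j] for j in scenario))
    return max((instance[i] for i in retained), default=None)


def all_scenarios(k):
    """Enumerate all 2^k subsets of the k hypotheticals."""
    for r in range(k + 1):
        yield from combinations(range(k), r)


sketch = build_max_sketch(instance, hypotheticals)
mismatches = [
    s
    for s in all_scenarios(len(hypotheticals))
    if answer_from_sketch(sketch, s) != answer_from_instance(instance, hypotheticals, s)
]
print(sketch, mismatches)
```

The sketch holds k numbers yet agrees with direct recomputation on all 2^k scenarios, because the max over a union equals the max of the per-set maxima. The same trick fails for count or sum: adding per-hypothetical counts double-counts tuples retained by several hypotheticals, which is consistent with the abstract's statement that exact provisioning of those statistics needs a sketch of size exponential in k.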
What-if Analysis
Provisioning
Data Compression
Approximate Query Answering
18:1-18:18
Regular Paper
Sepehr Assadi
Sanjeev Khanna
Yang Li
Val Tannen
10.4230/LIPIcs.ICDT.2016.18
S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In STOC, pages 20-29. ACM, 1996.
Sepehr Assadi, Sanjeev Khanna, Yang Li, and Val Tannen. Algorithms for provisioning queries and analytics. CoRR, abs/1512.06143, 2015. URL: http://arxiv.org/abs/1512.06143.
Andrey Balmin, Thanos Papadimitriou, and Yannis Papakonstantinou. Hypothetical queries in an OLAP environment. In VLDB, pages 220-231, 2000. URL: http://www.vldb.org/conf/2000/P220.pdf.
Ziv Bar-Yossef, TS Jayram, Ravi Kumar, D Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In RANDOM. Springer, 2002.
Kenneth L Clarkson and David P Woodruff. Numerical linear algebra in the streaming model. In STOC, pages 205-214. ACM, 2009.
G. Cormode, S. Muthukrishnan, and W. Zhuang. What’s different: Distributed, continuous monitoring of duplicate-resilient aggregates on data streams. In ICDE, pages 20-31, 2006. URL: http://dx.doi.org/10.1109/ICDE.2006.173.
Graham Cormode, Flip Korn, S Muthukrishnan, and Divesh Srivastava. Space-and time-efficient deterministic algorithms for biased quantiles over data streams. In PODS, pages 263-272. ACM, 2006.
Graham Cormode and S Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1), 2005.
Graham Cormode and S. Muthukrishnan. Space efficient mining of multigraph streams. In PODS, pages 271-282, 2005.
Graham Cormode, S Muthukrishnan, and Ke Yi. Algorithms for distributed functional monitoring. ACM Transactions on Algorithms (TALG), 7(2):21, 2011.
Graham Cormode, S Muthukrishnan, Ke Yi, and Qin Zhang. Optimal sampling from distributed streams. In PODS, pages 77-86. ACM, 2010.
Daniel Deutch, Zachary G Ives, Tova Milo, and Val Tannen. Caravan: Provisioning for what-if analysis. In CIDR, 2013.
Petros Drineas, Michael W Mahoney, and S Muthukrishnan. Sampling algorithms for ℓ₂ regression and applications. In SODA, pages 1127-1136. ACM, 2006.
Petros Drineas, Michael W Mahoney, S Muthukrishnan, and Tamás Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219-249, 2011.
Philippe Flajolet and G Nigel Martin. Probabilistic counting algorithms for data base applications. Journal of computer and system sciences, 31(2):182-209, 1985.
Shahram Ghandeharizadeh, Richard Hull, and Dean Jacobs. Heraclitus: Elevating deltas to be first-class citizens in a database programming language. ACM Trans. Database Syst., 21(3):370-426, 1996. URL: http://dx.doi.org/10.1145/232753.232801.
Anna C Gilbert, Yannis Kotidis, S Muthukrishnan, and Martin J Strauss. How to summarize the universe: Dynamic maintenance of quantiles. In PVLDB, pages 454-465. VLDB Endowment, 2002.
T.J. Green. Containment of conjunctive queries on annotated relations. Theory Comput. Syst., 49(2), 2011.
T.J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, pages 31-40, 2007.
Greenplum DB (Pivotal). URL: http://pivotal.io/big-data/pivotal-greenplum-database.
Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile summaries. ACM SIGMOD Record, 30(2):58-66, 2001.
Michael B Greenwald and Sanjeev Khanna. Power-conserving computation of order-statistics over sensor networks. In PODS, pages 275-285. ACM, 2004.
Anupam Gupta and Francis X Zane. Counting inversions in lists. In SODA, pages 253-254. Society for Industrial and Applied Mathematics, 2003.
Joseph M. Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. The MADlib analytics library or MAD skills, the SQL. PVLDB, 5(12):1700-1711, 2012. URL: http://vldb.org/pvldb/vol5/p1700_joehellerstein_vldb2012.pdf.
Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge university press, 2012.
T. Imielinski and W. Lipski. Incomplete information in relational databases. J. ACM, 31(4), 1984.
Daniel M Kane, Jelani Nelson, and David P Woodruff. An optimal algorithm for the distinct elements problem. In PODS, pages 41-52. ACM, 2010.
The MADlib Project. URL: http://madlib.net.
Michael W Mahoney. Randomized algorithms for matrices and data. Foundations and Trends® in Machine Learning, 3(2):123-224, 2011.
Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G Lindsay. Approximate medians and other quantiles in one pass and with limited memory. ACM SIGMOD Record, 27(2):426-435, 1998.
Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In FOCS, pages 143-152. IEEE, 2006.
D. Suciu, D. Olteanu, C. Ré, and C. Koch. Probabilistic Databases. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011.
David P Woodruff. Sketching as a tool for numerical linear algebra. arXiv:1411.4357, 2014.
David P Woodruff and Qin Zhang. Tight bounds for distributed functional monitoring. In STOC, pages 941-960. ACM, 2012.
Ke Yi and Qin Zhang. Optimal tracking of distributed heavy hitters and quantiles. Algorithmica, 65(1):206-223, 2013.
Creative Commons Attribution 3.0 Unported license
https://creativecommons.org/licenses/by/3.0/legalcode