Algorithms for Provisioning Queries and Analytics

Authors Sepehr Assadi, Sanjeev Khanna, Yang Li, Val Tannen



Cite As

Sepehr Assadi, Sanjeev Khanna, Yang Li, and Val Tannen. Algorithms for Provisioning Queries and Analytics. In 19th International Conference on Database Theory (ICDT 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 48, pp. 18:1-18:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016) https://doi.org/10.4230/LIPIcs.ICDT.2016.18

Abstract

Provisioning is a technique for avoiding repeated expensive computations in what-if analysis. Given a query, an analyst formulates k hypotheticals, each retaining some of the tuples of a database instance, possibly overlapping, and she wishes to answer the query under scenarios, where a scenario is defined by a subset of the hypotheticals that are "turned on". We say that a query admits compact provisioning if given any database instance and any k hypotheticals, one can create a poly-size (in k) sketch that can then be used to answer the query under any of the 2^k possible scenarios without accessing the original instance.
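To make the setting concrete, the following is a minimal sketch of the problem statement only, not of any provisioning algorithm. It assumes that a scenario retains every tuple kept by at least one turned-on hypothetical (union semantics, which the abstract does not spell out). The toy instance, the three predicates, and the function names are hypothetical. It answers a sum query under all 2^k scenarios by re-reading the instance each time, which is the repeated work provisioning aims to avoid.

```python
from itertools import combinations

# Toy instance: each tuple is (id, amount). All data and names are illustrative.
instance = [(1, 10), (2, 20), (3, 30), (4, 40)]

# k = 3 hypotheticals; each is a predicate marking the tuples it retains.
# The retained sets overlap, as the abstract allows.
hypotheticals = [
    lambda t: t[1] <= 20,     # h0 retains tuples 1 and 2
    lambda t: t[0] % 2 == 0,  # h1 retains tuples 2 and 4
    lambda t: t[1] >= 30,     # h2 retains tuples 3 and 4
]

def apply_scenario(instance, hypotheticals, scenario):
    """Keep tuples retained by at least one turned-on hypothetical.

    Union semantics is an assumption made for this sketch.
    """
    return [t for t in instance if any(hypotheticals[i](t) for i in scenario)]

def total_amount(tuples):
    """The query: sum of the amount column."""
    return sum(amount for _, amount in tuples)

def all_scenarios(k):
    """Enumerate all 2^k subsets of the k hypotheticals."""
    for r in range(k + 1):
        for subset in combinations(range(k), r):
            yield frozenset(subset)

# Naive what-if analysis: one pass over the instance per scenario.
answers = {
    s: total_amount(apply_scenario(instance, hypotheticals, s))
    for s in all_scenarios(len(hypotheticals))
}
```

A provisioning sketch would have to reproduce (or approximate) every entry of `answers` without access to `instance`, while using space polynomial in k.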

In this paper, we focus on provisioning complex queries that combine relational algebra (the logical component), grouping, and statistics/analytics (the numerical component). We first show that queries that compute quantiles or linear regression (as well as simpler queries that compute count and sum/average of positive values) can be compactly provisioned to provide (multiplicative) approximate answers to an arbitrary precision. In contrast, exact provisioning for each of these statistics requires the sketch size to be exponential in k. We then establish that for any complex query whose logical component is a positive relational algebra query, as long as the numerical component can be compactly provisioned, the complex query itself can be compactly provisioned. On the other hand, introducing negation or recursion in the logical component again requires the sketch size to be exponential in k. While our positive results use algorithms that do not access the original instance after a scenario is known, we prove our lower bounds even for the case when, knowing the scenario, limited access to the instance is allowed.
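As an illustration of why exact answers are costly, here is a simple exact sketch for a count query. It is not the paper's construction, and it again assumes union semantics for scenarios. It groups tuples by their signature, meaning the set of hypotheticals that retain them, and stores one counter per signature. With k hypotheticals there can be up to 2^k - 1 non-empty signatures, so this sketch can grow exponentially in k (capped by the instance size), in line with the exponential lower bound for exact provisioning stated above. The approximate, poly-size sketches from the paper are not shown here. All names and data are hypothetical.

```python
from collections import Counter

def build_signature_sketch(instance, hypotheticals):
    """Count tuples per signature (the set of hypotheticals retaining them)."""
    sketch = Counter()
    for t in instance:
        signature = frozenset(i for i, h in enumerate(hypotheticals) if h(t))
        if signature:  # a tuple retained by no hypothetical appears in no scenario
            sketch[signature] += 1
    return sketch

def count_under_scenario(sketch, scenario):
    """Exact count under a scenario, computed from the sketch alone.

    A tuple survives if its signature shares a hypothetical with the scenario
    (assumed union semantics).
    """
    scenario = frozenset(scenario)
    return sum(c for signature, c in sketch.items() if signature & scenario)

# Illustrative data: integers 1..12 and three overlapping hypotheticals.
instance = list(range(1, 13))
hypotheticals = [
    lambda x: x % 2 == 0,  # h0 retains even numbers
    lambda x: x % 3 == 0,  # h1 retains multiples of 3
    lambda x: x > 9,       # h2 retains 10, 11, 12
]

sketch = build_signature_sketch(instance, hypotheticals)
```

Once `sketch` is built, `count_under_scenario(sketch, {0, 1})` answers the query for the scenario that turns on h0 and h1 without touching `instance` again.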

Subject Classification

Keywords
  • What-if Analysis
  • Provisioning
  • Data Compression
  • Approximate Query Answering
