Random Sampling and Size Estimation Over Cyclic Joins

Authors: Yu Chen, Ke Yi



File

LIPIcs.ICDT.2020.7.pdf
  • Filesize: 0.73 MB
  • 18 pages

Author Details

Yu Chen
  • Hong Kong University of Science and Technology, Hong Kong
Ke Yi
  • Hong Kong University of Science and Technology, Hong Kong

Cite As

Yu Chen and Ke Yi. Random Sampling and Size Estimation Over Cyclic Joins. In 23rd International Conference on Database Theory (ICDT 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 155, pp. 7:1-7:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.ICDT.2020.7

Abstract

Computing joins is expensive, and often unnecessary when the output size is large. In 1999, Chaudhuri et al. [Surajit Chaudhuri et al., 1999] posed the problem of random sampling over joins as a potentially effective approach to avoiding computing the join in full, while obtaining important statistical information about the join results. Unfortunately, no significant progress has been made in the last 20 years, except for the case of acyclic joins. In this paper, we present the first non-trivial result on sampling over cyclic joins. We show that after a linear-time preprocessing step, a join result can be drawn uniformly at random in expected time O(IN^ρ/OUT), where IN^ρ is known as the AGM bound of the join and OUT is its output size. This result holds for all joins on binary relations, as well as certain joins on relations of higher arity. We further show how this algorithm immediately leads to a join size estimation algorithm with the same running time.
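To make the sampling and size estimation ideas concrete, here is a minimal Python sketch for the triangle join R(A,B) ⋈ S(B,C) ⋈ T(A,C), the simplest cyclic join on binary relations. It is an illustration only and not the algorithm from the paper: it is a plain rejection sampler that returns each join result with probability 1/(|R|·sqrt(|S|·|T|)) per trial, which is uniform but does not in general reach the O(IN^ρ/OUT) expected time stated above. The function names (`build_index`, `try_sample`, `estimate_size`) are hypothetical, and the relations are assumed to be non-empty, duplicate-free lists of pairs.

```python
import random
from collections import defaultdict
from math import sqrt


def build_index(S, T):
    """Linear-time preprocessing: group S(B, C) by B and T(A, C) by A."""
    s_by_b, t_by_a = defaultdict(list), defaultdict(list)
    for b, c in S:
        s_by_b[b].append(c)
    for a, c in T:
        t_by_a[a].append(c)
    return s_by_b, t_by_a, set(S), set(T)


def try_sample(R, S, T, index, rng):
    """Run one trial; return a join result (a, b, c) or None on rejection."""
    s_by_b, t_by_a, s_set, t_set = index
    a, b = rng.choice(R)  # uniform tuple of R (assumed non-empty)
    from_s, from_t = s_by_b.get(b, []), t_by_a.get(a, [])
    # Extend (a, b) using the shorter of the two candidate lists for C.
    use_s = len(from_s) <= len(from_t)
    candidates = from_s if use_s else from_t
    d = len(candidates)
    if d == 0:
        return None
    c = rng.choice(candidates)
    # Check the remaining relation so that (a, b, c) is a real join result.
    joins = (a, c) in t_set if use_s else (b, c) in s_set
    if not joins:
        return None
    # d is the smaller list length, so d <= sqrt(|S| * |T|) and the ratio
    # below is a valid probability. Accepting with it cancels the 1/d factor,
    # so every triangle is returned with the same probability per trial.
    if rng.random() < d / sqrt(len(S) * len(T)):
        return (a, b, c)
    return None


def estimate_size(R, S, T, trials, rng):
    """Estimate the join size as (acceptance rate) * |R| * sqrt(|S| * |T|)."""
    index = build_index(S, T)
    hits = sum(try_sample(R, S, T, index, rng) is not None
               for _ in range(trials))
    return hits / trials * len(R) * sqrt(len(S) * len(T))
```

Repeating `try_sample` until it returns a tuple yields one uniform sample, and the acceptance rate over many trials gives an unbiased size estimate. This mirrors, in a simplified setting, how a sampler with a known per-trial success probability leads directly to a size estimator.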

Subject Classification

ACM Subject Classification
  • Theory of computation → Database theory
Keywords
  • Random sampling
  • joins
  • join size estimation

References

  1. Swarup Acharya, Phillip B Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. Join synopses for approximate query answering. In Proc. ACM SIGMOD International Conference on Management of Data, 1999.
  2. Sepehr Assadi, Mikhail Kapralov, and Sanjeev Khanna. A Simple Sublinear-Time Algorithm for Counting Arbitrary Subgraphs via Edge Sampling. In Proc. Innovations in Theoretical Computer Science, 2019.
  3. Albert Atserias, Martin Grohe, and Dániel Marx. Size bounds and query plans for relational joins. SIAM Journal on Computing, 42(4):1737-1767, 2013.
  4. S. K. Bera and A. Chakrabarti. Towards tighter space bounds for counting triangles and other substructures in graph streams. In Symposium on Theoretical Aspects of Computer Science, 2017.
  5. Andreas Björklund, Rasmus Pagh, Virginia V. Williams, and Uri Zwick. Listing triangles. In Proc. International Colloquium on Automata, Languages, and Programming, 2014.
  6. P. Bratley, B. L. Fox, and L. E. Schrage. A Guide to Simulation. Springer Verlag, 1983.
  7. Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. On Random Sampling over Joins. In Proc. ACM SIGMOD International Conference on Management of Data, 1999.
  8. T. Eden, A. Levi, D. Ron, and C. Seshadhri. Approximately counting triangles in sublinear time. In Proc. IEEE Symposium on Foundations of Computer Science, 2015.
  9. Talya Eden, Dana Ron, and C. Seshadhri. On Approximating the Number of k-cliques in Sublinear Time. In Proc. ACM Symposium on Theory of Computing, 2018.
  10. G. Gottlob, M. Grohe, N. Musliu, M. Samer, and F. Scarcello. Hypertree decompositions: structure, algorithms, and applications. In Lecture Notes in Computer Science, volume 3787, pages 1-15. Springer, 2005.
  11. Mahmoud Abo Khamis, Hung Q. Ngo, and Dan Suciu. What do Shannon-type inequalities, submodular width, and disjunctive datalog have to do with one another? In Proc. ACM Symposium on Principles of Database Systems, 2017.
  12. Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. Wander Join: Online Aggregation via Random Walks. In Proc. ACM SIGMOD International Conference on Management of Data, 2016.
  13. Hung Q Ngo, Ely Porat, Christopher Ré, and Atri Rudra. Worst-case optimal join algorithms. In Proc. ACM Symposium on Principles of Database Systems, pages 37-48, 2012.
  14. Hung Q Ngo, Christopher Ré, and Atri Rudra. Skew strikes back: New developments in the theory of join algorithms. ACM SIGMOD Record, 42(4):5-16, 2014.
  15. C Seshadhri, Ali Pinar, and Tamara G Kolda. Triadic measures on graphs: the power of wedge sampling. In Proc. SIAM International Conference on Data Mining, 2013.
  16. Todd Veldhuizen. Leapfrog Triejoin: A Simple, Worst-Case Optimal Join Algorithm. In Proc. International Conference on Database Theory, 2014.
  17. Pinghui Wang, Junzhou Zhao, Xiangliang Zhang, Zhenguo Li, Jiefeng Cheng, John CS Lui, Don Towsley, Jing Tao, and Xiaohong Guan. MOSS-5: A Fast Method of Approximating Counts of 5-Node Graphlets in Large Graphs. IEEE Transactions on Knowledge and Data Engineering, 2017.
  18. Mihalis Yannakakis. Algorithms for acyclic database schemes. In Proc. International Conference on Very Large Data Bases, pages 82-94, 1981.
  19. Zhuoyue Zhao, Robert Christensen, Feifei Li, Xiao Hu, and Ke Yi. Random Sampling over Joins Revisited. In Proc. ACM SIGMOD International Conference on Management of Data, 2018.