Random Sampling and Size Estimation Over Cyclic Joins

Chen, Yu; Yi, Ke

doi:10.4230/LIPIcs.ICDT.2020.7

Abstract

Computing joins is expensive, and often unnecessary when the output size is large. In 1999, Chaudhuri et al. [Chaudhuri et al., 1999] posed the problem of random sampling over joins as a potentially effective approach to avoiding computing the join in full, while obtaining important statistical information about the join results. Unfortunately, no significant progress has been made in the last 20 years, except for the case of acyclic joins. In this paper, we present the first non-trivial result on sampling over cyclic joins. We show that after a linear-time preprocessing step, a join result can be drawn uniformly at random in expected time O(IN^ρ/OUT), where IN^ρ is known as the AGM bound of the join and OUT is its output size. This result holds for all joins on binary relations, as well as certain joins on relations of higher arity. We further show how this algorithm immediately leads to a join size estimation algorithm with the same running time.
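The link between uniform sampling and size estimation mentioned in the abstract can be illustrated with a naive baseline. The sketch below is not the paper's algorithm: it is plain rejection sampling on the triangle join R(a,b), S(b,c), T(c,a), assuming duplicate-free relations stored as Python lists. It needs an expected |R|·|S|·|T|/OUT trials per sample, far above the O(IN^ρ/OUT) bound stated above. The function names `try_once`, `sample_join`, and `estimate_size` are hypothetical and chosen only for this illustration.

```python
import random

def try_once(R, S, T, rng):
    """One rejection-sampling trial for the triangle join R(a,b), S(b,c), T(c,a).

    Picks one tuple from each relation independently and uniformly at random.
    Returns the joined tuple (a, b, c) if the three picks agree, else None.
    Assumes each relation is a duplicate-free list of pairs, so every join
    result corresponds to exactly one triple of picks and is therefore
    accepted with the same probability 1 / (|R| * |S| * |T|).
    """
    (a, b), (b2, c), (c2, a2) = rng.choice(R), rng.choice(S), rng.choice(T)
    if b == b2 and c == c2 and a == a2:
        return (a, b, c)
    return None

def sample_join(R, S, T, rng, max_trials=1_000_000):
    """Repeat trials until one is accepted; return (result, trials_used).

    Returns (None, max_trials) if no trial succeeds, e.g. for an empty join.
    """
    for trials in range(1, max_trials + 1):
        result = try_once(R, S, T, rng)
        if result is not None:
            return result, trials
    return None, max_trials

def estimate_size(R, S, T, rng, trials=50_000):
    """Estimate OUT as (size of the sampled space) * (observed acceptance rate)."""
    hits = sum(try_once(R, S, T, rng) is not None for _ in range(trials))
    return len(R) * len(S) * len(T) * hits / trials

if __name__ == "__main__":
    # Toy instance (made up for illustration).
    R = [(1, 2), (1, 3), (2, 3)]
    S = [(2, 3), (3, 1), (3, 4)]
    T = [(3, 1), (1, 2), (4, 1)]
    rng = random.Random(0)
    print(sample_join(R, S, T, rng))
    print(estimate_size(R, S, T, rng))
```

Each trial succeeds with probability OUT/(|R|·|S|·|T|), so the acceptance rate scaled by the size of the sampled space is an unbiased estimate of OUT. The same reasoning is why a faster uniform sampler translates into a faster size estimator.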

Cite As

Yu Chen and Ke Yi. Random Sampling and Size Estimation Over Cyclic Joins. In 23rd International Conference on Database Theory (ICDT 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 155, pp. 7:1-7:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020) https://doi.org/10.4230/LIPIcs.ICDT.2020.7

Author Details

Yu Chen

Hong Kong University of Science and Technology, Hong Kong

Ke Yi

Hong Kong University of Science and Technology, Hong Kong

Funding

This work has been supported by HKRGC under grants 16202317, 16201318, and 16201819.

Supplementary Materials

Video of the Presentation: https://doi.org/10.5446/46833

References

Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. Join synopses for approximate query answering. In Proc. ACM SIGMOD International Conference on Management of Data, 1999.
Sepehr Assadi, Mikhail Kapralov, and Sanjeev Khanna. A Simple Sublinear-Time Algorithm for Counting Arbitrary Subgraphs via Edge Sampling. In Proc. Innovations in Theoretical Computer Science, 2019.
Albert Atserias, Martin Grohe, and Dániel Marx. Size bounds and query plans for relational joins. SIAM Journal on Computing, 42(4):1737-1767, 2013.
S. K. Bera and A. Chakrabarti. Towards tighter space bounds for counting triangles and other substructures in graph streams. In Symposium on Theoretical Aspects of Computer Science, 2017.
Andreas Björklund, Rasmus Pagh, Virginia V. Williams, and Uri Zwick. Listing triangles. In Proc. International Colloquium on Automata, Languages, and Programming, 2014.
P. Bratley, B. L. Fox, and L. E. Schrage. A Guide to Simulation. Springer Verlag, 1983.
Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. On Random Sampling over Joins. In Proc. ACM SIGMOD International Conference on Management of Data, 1999.
T. Eden, A. Levi, D. Ron, and C. Seshadhri. Approximately counting triangles in sublinear time. In Proc. IEEE Symposium on Foundations of Computer Science, 2015.
Talya Eden, Dana Ron, and C. Seshadhri. On Approximating the Number of k-cliques in Sublinear Time. In Proc. ACM Symposium on Theory of Computing, 2018.
G. Gottlob, M. Grohe, N. Musliu, M. Samer, and F. Scarcello. Hypertree decompositions: structure, algorithms, and applications. In Lecture Notes in Computer Science, volume 3787, pages 1-15. Springer, 2005.
Mahmoud Abo Khamis, Hung Q. Ngo, and Dan Suciu. What do Shannon-type inequalities, submodular width, and disjunctive datalog have to do with one another? In Proc. ACM Symposium on Principles of Database Systems, 2017.
Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. Wander Join: Online Aggregation via Random Walks. In Proc. ACM SIGMOD International Conference on Management of Data, 2016.
Hung Q. Ngo, Ely Porat, Christopher Ré, and Atri Rudra. Worst-case optimal join algorithms. In Proc. ACM Symposium on Principles of Database Systems, pages 37-48, 2012.
Hung Q. Ngo, Christopher Ré, and Atri Rudra. Skew strikes back: New developments in the theory of join algorithms. ACM SIGMOD Record, 42(4):5-16, 2014.
C. Seshadhri, Ali Pinar, and Tamara G. Kolda. Triadic measures on graphs: the power of wedge sampling. In Proc. SIAM International Conference on Data Mining, 2013.
Todd Veldhuizen. Leapfrog Triejoin: A Simple, Worst-Case Optimal Join Algorithm. In Proc. International Conference on Database Theory, 2014.
Pinghui Wang, Junzhou Zhao, Xiangliang Zhang, Zhenguo Li, Jiefeng Cheng, John C. S. Lui, Don Towsley, Jing Tao, and Xiaohong Guan. MOSS-5: A Fast Method of Approximating Counts of 5-Node Graphlets in Large Graphs. IEEE Transactions on Knowledge and Data Engineering, 2017.
Mihalis Yannakakis. Algorithms for acyclic database schemes. In Proc. International Conference on Very Large Data Bases, pages 82-94, 1981.
Zhuoyue Zhao, Robert Christensen, Feifei Li, Xiao Hu, and Ke Yi. Random Sampling over Joins Revisited. In Proc. ACM SIGMOD International Conference on Management of Data, 2018.
