Computing joins is expensive, and often unnecessary when the output size is large. In 1999, Chaudhuri et al. [Surajit Chaudhuri et al., 1999] posed the problem of random sampling over joins as a potentially effective approach to avoiding computing the join in full, while obtaining important statistical information about the join results. Unfortunately, no significant progress has been made in the last 20 years, except for the case of acyclic joins. In this paper, we present the first non-trivial result on sampling over cyclic joins. We show that after a linear-time preprocessing step, a join result can be drawn uniformly at random in expected time O(IN^ρ/OUT), where IN^ρ is known as the AGM bound of the join and OUT is its output size. This result holds for all joins on binary relations, as well as certain joins on relations of higher arity. We further show how this algorithm immediately leads to a join size estimation algorithm with the same running time.
@InProceedings{chen_et_al:LIPIcs.ICDT.2020.7, author = {Chen, Yu and Yi, Ke}, title = {{Random Sampling and Size Estimation Over Cyclic Joins}}, booktitle = {23rd International Conference on Database Theory (ICDT 2020)}, pages = {7:1--7:18}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-139-9}, ISSN = {1868-8969}, year = {2020}, volume = {155}, editor = {Lutz, Carsten and Jung, Jean Christoph}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2020.7}, URN = {urn:nbn:de:0030-drops-119315}, doi = {10.4230/LIPIcs.ICDT.2020.7}, annote = {Keywords: Random sampling, joins, join size estimation} }
Feedback for Dagstuhl Publishing