A Framework for Estimating Stream Expression Cardinalities

Dasgupta, Anirban; Lang, Kevin J.; Rhodes, Lee; Thaler, Justin

doi:10.4230/LIPIcs.ICDT.2016.6

Keywords

sketching
data stream algorithms
mergeability
distinct elements
set operations

Abstract

Given m distributed data streams A_1,..., A_m, we consider the problem of estimating the number of unique identifiers in streams defined by set expressions over A_1,..., A_m. We identify a broad class of algorithms for solving this problem, and show that the estimators output by any algorithm in this class are perfectly unbiased and satisfy strong variance bounds. Our analysis unifies and generalizes a variety of earlier results in the literature. To demonstrate its generality, we describe several novel sampling algorithms in our class, and show that they achieve a novel tradeoff between accuracy, space usage, update speed, and applicability.
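To make the problem statement concrete, the following is a minimal illustrative sketch of one simple hash-threshold sampling scheme for estimating the cardinality of a set expression over streams. It is not taken from the paper and is not presented as the authors' algorithm. The names `unit_hash`, `ThresholdSketch`, and `combine` are hypothetical, and the sketch assumes that identifiers are strings and that SHA-256 behaves like a uniform random hash. Each stream keeps the hash values that fall below a threshold `theta`; a set expression is evaluated on the retained hashes after lowering every operand to the smallest threshold, and the count is scaled by `1 / theta`.

```python
import hashlib


def unit_hash(identifier: str) -> float:
    """Map an identifier to a pseudo-random point in [0, 1) (assumed uniform)."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    # Keep 53 bits so the value is exactly representable as a float below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


class ThresholdSketch:
    """Hypothetical sketch: retains hashes of identifiers that fall below theta."""

    def __init__(self, theta: float = 1.0, hashes=None):
        self.theta = theta
        self.hashes = set(hashes or ())

    def update(self, identifier: str) -> None:
        h = unit_hash(identifier)
        if h < self.theta:  # sample each distinct identifier with probability theta
            self.hashes.add(h)

    def estimate(self) -> float:
        # Scale the number of retained distinct hashes by the sampling rate.
        return len(self.hashes) / self.theta


def combine(a: ThresholdSketch, b: ThresholdSketch, op) -> ThresholdSketch:
    """Apply a set operation (set.union, set.intersection, set.difference)."""
    theta = min(a.theta, b.theta)  # both operands are only complete below this value
    result = op(a.hashes, b.hashes)
    return ThresholdSketch(theta, {h for h in result if h < theta})


if __name__ == "__main__":
    a, b = ThresholdSketch(theta=0.25), ThresholdSketch(theta=0.5)
    for i in range(10_000):
        a.update(f"user-{i}")
    for i in range(5_000, 15_000):
        b.update(f"user-{i}")
    print("union        ~", combine(a, b, set.union).estimate())
    print("intersection ~", combine(a, b, set.intersection).estimate())
    print("difference   ~", combine(a, b, set.difference).estimate())
```

With `theta = 1.0` nothing is discarded and the estimates are exact; smaller thresholds trade accuracy for space. A fixed threshold is used here only for brevity and lets the sample grow with the stream, whereas a bounded-space variant would have to adapt the threshold as items arrive.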

Cite As

Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A Framework for Estimating Stream Expression Cardinalities. In 19th International Conference on Database Theory (ICDT 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 48, pp. 6:1-6:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016) https://doi.org/10.4230/LIPIcs.ICDT.2016.6
