Cardinality Estimation Using Gumbel Distribution

Łukasiewicz, Aleksander; Uznański, Przemysław

doi:10.4230/LIPIcs.ESA.2022.76

Subject Classification

ACM Subject Classification

Theory of computation → Sketching and sampling

Keywords

Streaming algorithms
Cardinality estimation
Sketching
Gumbel distribution

Abstract

Cardinality estimation is the task of approximating the number of distinct elements in a large dataset with possibly repeating elements. LogLog and HyperLogLog (cf. Durand and Flajolet [ESA 2003], Flajolet et al. [Discrete Math Theor. 2007]) are small-space sketching schemes for cardinality estimation that have strong theoretical performance guarantees and are highly effective in practice. This makes them a popular solution with many implementations in big-data systems (e.g. Algebird, Apache DataSketches, BigQuery, Presto and Redis). However, despite their simple and elegant formulation, the analyses of both LogLog and HyperLogLog are extremely involved, spanning tens of pages of analytic combinatorics and complex function analysis. We propose a modification to both LogLog and HyperLogLog that replaces the discrete geometric distribution with the continuous Gumbel distribution. This leads to a very short, simple and elementary analysis of estimation guarantees, and smoother behavior of the estimator.
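
To make the role of the Gumbel distribution concrete, the following is a minimal illustrative sketch and not the authors' algorithm. It relies only on standard facts: if U is uniform on (0, 1), then -ln(-ln U) follows the standard Gumbel distribution; the maximum of n independent standard Gumbel variables is again Gumbel, shifted by ln n; and exp(-M) for such a maximum M is exponentially distributed with rate n. Everything else here is an assumption made for illustration: the class name GumbelSketch, the use of BLAKE2b as the hash function, one independent hash per register for every element (rather than the bucketing used by LogLog-style schemes), and registers stored as unquantized floats.

```python
import hashlib
import math


def _uniform(item: str, j: int) -> float:
    """Hash (item, register index) to a float strictly inside (0, 1)."""
    digest = hashlib.blake2b(f"{j}:{item}".encode("utf-8"), digest_size=8).digest()
    # Keep 52 bits so that v + 0.5 is exactly representable and the result is < 1.
    v = int.from_bytes(digest, "big") >> 12
    return (v + 0.5) / 2**52


def _gumbel(item: str, j: int) -> float:
    """Deterministic standard Gumbel value for (item, j) via inverse transform."""
    return -math.log(-math.log(_uniform(item, j)))


class GumbelSketch:
    """Hypothetical illustration: k registers, each holding a running maximum."""

    def __init__(self, k: int) -> None:
        if k < 2:
            raise ValueError("need at least 2 registers")
        self.registers = [-math.inf] * k

    def add(self, item: str) -> None:
        # A repeated item produces the same values, so duplicates change nothing.
        for j in range(len(self.registers)):
            g = _gumbel(item, j)
            if g > self.registers[j]:
                self.registers[j] = g

    def merge(self, other: "GumbelSketch") -> None:
        # The maximum is associative, so the sketch of a union is the
        # register-wise maximum of the two sketches.
        if len(other.registers) != len(self.registers):
            raise ValueError("register counts differ")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self) -> float:
        if any(m == -math.inf for m in self.registers):
            return 0.0  # empty sketch
        # exp(-M_j) are i.i.d. exponential with rate n, so their sum is
        # Gamma(k, rate n) and (k - 1) / sum is an unbiased estimate of n.
        k = len(self.registers)
        return (k - 1) / sum(math.exp(-m) for m in self.registers)
```

In this simplified form each update costs k hash evaluations and each register is a full float, so it is a way to see why the analysis becomes elementary rather than a space-efficient replacement for HyperLogLog.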

Cite As

Aleksander Łukasiewicz and Przemysław Uznański. Cardinality Estimation Using Gumbel Distribution. In 30th Annual European Symposium on Algorithms (ESA 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 244, pp. 76:1-76:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022) https://doi.org/10.4230/LIPIcs.ESA.2022.76

Author Details

Aleksander Łukasiewicz

Faculty of Mathematics and Computer Science, University of Wrocław, Poland

Przemysław Uznański

Faculty of Mathematics and Computer Science, University of Wrocław, Poland

References

Algebird HyperLogLog implementation. Accessed: 2022-04-21. URL: https://twitter.github.io/algebird/datatypes/approx/hyperloglog.html.
Counting uniques faster in BigQuery with HyperLogLog++. Accessed: 2022-04-21. URL: https://cloud.google.com/blog/products/gcp/counting-uniques-faster-in-bigquery-with-hyperloglog.
HyperLogLog Sketch. Accessed: 2022-04-21. URL: https://datasketches.apache.org/docs/HLL/HLL.html.
Presto HyperLogLog function. Accessed: 2022-04-21. URL: https://prestodb.github.io/docs/current/functions/hyperloglog.html.
Redis PFCOUNT command. Accessed: 2022-04-21. URL: https://redis.io/commands/pfcount.
Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In STOC, pages 20-29, 1996. URL: https://doi.org/10.1145/237814.237823.
Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In RANDOM 2002, pages 1-10, 2002. URL: https://doi.org/10.1007/3-540-45726-7_1.
Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In SODA 2002, pages 623-632. ACM/SIAM, 2002. URL: http://dl.acm.org/citation.cfm?id=545381.545464.
Kevin Beyer, Rainer Gemulla, Peter J Haas, Berthold Reinwald, and Yannis Sismanis. Distinct-value synopses for multiset operations. Communications of the ACM, 52(10):87-95, 2009.
Joshua Brody and Amit Chakrabarti. A multi-round communication lower bound for gap hamming and some consequences. In CCC 2009, pages 358-368, 2009. URL: https://doi.org/10.1109/CCC.2009.31.
Jarosław Błasiok. Optimal streaming and tracking distinct elements with high probability. In SODA 2018, pages 2432-2448, 2018. URL: https://doi.org/10.1137/1.9781611975031.156.
Aiyou Chen, Jin Cao, Larry Shepp, and Tuan Nguyen. Distinct counting with a self-learning bitmap. Journal of the American Statistical Association, 106(495):879-890, 2011.
Peter Clifford and Ioana A Cosma. A statistical analysis of probabilistic counting algorithms. Scandinavian Journal of Statistics, 39(1):1-14, 2012.
Edith Cohen. All-distances sketches, revisited: Hip estimators for massive graphs analysis. IEEE Transactions on Knowledge and Data Engineering, 27(9):2320-2334, 2015.
Laurens De Haan and Ana Ferreira. Extreme value theory: an introduction. Springer Science & Business Media, 2007.
Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities (extended abstract). In ESA 2003, pages 605-617, 2003. URL: https://doi.org/10.1007/978-3-540-39658-1_55.
Otmar Ertl. New cardinality estimation algorithms for hyperloglog sketches. CoRR, abs/1702.01284, 2017. URL: http://arxiv.org/abs/1702.01284.
Cristian Estan, George Varghese, and Michael E. Fisk. Bitmap algorithms for counting active flows on high-speed links. IEEE/ACM Trans. Netw., 14(5):925-937, 2006. URL: https://doi.org/10.1145/1217709.
Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In Discrete Mathematics and Theoretical Computer Science, pages 137-156. Discrete Mathematics and Theoretical Computer Science, 2007.
Philippe Flajolet and G. Nigel Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182-209, 1985. URL: https://doi.org/10.1016/0022-0000(85)90041-8.
Lucas Gerin and Philippe Chassaing. Efficient estimation of the cardinality of large data sets. Discrete Mathematics & Theoretical Computer Science, 2006.
Phillip B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB 2001, pages 541-550, 2001. URL: http://www.vldb.org/conf/2001/P541.pdf.
Phillip B. Gibbons and Srikanta Tirthapura. Estimating simple functions on the union of data streams. In SPAA 2001, pages 281-291, 2001. URL: https://doi.org/10.1145/378580.378687.
Frédéric Giroire. Order statistics and estimating cardinalities of massive data sets. Discret. Appl. Math., 157(2):406-427, 2009. URL: https://doi.org/10.1016/j.dam.2008.06.020.
Emil Julius Gumbel. Les valeurs extrêmes des distributions statistiques. In Annales de l'Institut Henri Poincaré, volume 5(2), pages 115-158, 1935.
Stefan Heule, Marc Nunkesser, and Alexander Hall. Hyperloglog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In EDBT 2013, pages 683-692, 2013. URL: https://doi.org/10.1145/2452376.2452456.
Piotr Indyk and David P. Woodruff. Tight lower bounds for the distinct elements problem. In FOCS 2003, pages 283-288, 2003. URL: https://doi.org/10.1109/SFCS.2003.1238202.
T. S. Jayram and David P. Woodruff. Optimal bounds for johnson-lindenstrauss transforms and streaming problems with sub-constant error. In SODA 2011, pages 1-10, 2011. URL: https://doi.org/10.1137/1.9781611973082.1.
Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In PODS 2010, pages 41-52, 2010. URL: https://doi.org/10.1145/1807085.1807094.
Jérémie Lumbroso. An optimal cardinality estimation algorithm based on order statistics and its full analysis. Discrete Mathematics & Theoretical Computer Science, 2010.
Seth Pettie and Dingyu Wang. Information theoretic limits of cardinality estimation: Fisher meets shannon. In STOC 2021, pages 556-569. ACM, 2021.
Seth Pettie, Dingyu Wang, and Longhui Yin. Non-mergeable sketching for cardinality estimation. In ICALP 2021, volume 198 of LIPIcs, pages 104:1-104:20, 2021.
Wojciech Szpankowski. Average case analysis of algorithms on sequences, volume 50. John Wiley & Sons, 2011.
Daniel Ting. Streamed approximate counting of distinct elements: beating optimal batch methods. In KDD 2014, pages 442-451. ACM, 2014. URL: https://doi.org/10.1145/2623330.2623669.
Alfredo Viola, Conrado Martínez, Jérémie Lumbroso, and Ahmed Helmi. Data streams as random permutations: the distinct element problem. Discrete Mathematics & Theoretical Computer Science, 2012.
David P. Woodruff. Optimal space lower bounds for all frequency moments. In SODA 2004, pages 167-175, 2004. URL: http://dl.acm.org/citation.cfm?id=982792.982817.
Qingjun Xiao, You Zhou, and Shigang Chen. Better with fewer bits: Improving the performance of cardinality estimation of large data streams. In INFOCOM 2017, pages 1-9, 2017.
