Aleksander Łukasiewicz, Przemysław Uznański
Creative Commons Attribution 4.0 International license
Cardinality estimation is the task of approximating the number of distinct elements in a large dataset with possibly repeating elements. LogLog and HyperLogLog (cf. Durand and Flajolet [ESA 2003], Flajolet et al. [Discrete Math Theor. 2007]) are small-space sketching schemes for cardinality estimation that have strong theoretical performance guarantees and are also highly effective in practice. This makes them a highly popular solution with many implementations in big-data systems (e.g. Algebird, Apache DataSketches, BigQuery, Presto and Redis). However, despite their simple and elegant formulation, the analyses of both LogLog and HyperLogLog are extremely involved, spanning tens of pages of analytic combinatorics and complex function analysis. We propose a modification to both LogLog and HyperLogLog that replaces the discrete geometric distribution with the continuous Gumbel distribution. This leads to a very short, simple and elementary analysis of estimation guarantees, and to smoother behavior of the estimator.
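To make the idea in the abstract concrete, here is a minimal sketch of a Gumbel-based distinct counter. It is an illustration under stated assumptions, not the authors' exact scheme: the names `_uniform`, `gumbel_sketch` and `estimate` are hypothetical, each item is hashed once per register (k hashes per item) rather than using stochastic averaging, and registers hold full floats rather than a compact encoding. It relies on two standard facts: the maximum of n independent Gumbel(0, 1) values is Gumbel(ln n, 1), and if M is Gumbel(ln n, 1) then exp(-M) is exponentially distributed with rate n.

```python
import hashlib
import math


def _uniform(item: str, seed: int) -> float:
    """Hash (seed, item) to a float strictly inside (0, 1). Hypothetical helper."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    # Keep 52 bits so that adding 0.5 is exact and the result never hits 0 or 1.
    return ((int.from_bytes(digest, "big") >> 12) + 0.5) / 2**52


def gumbel_sketch(items, k: int = 256):
    """Keep, in each of k registers, the largest Gumbel value seen so far.

    Assumption: one hash per (item, register) pair, for simplicity.
    Repeated items hash to the same values, so duplicates never change a register.
    """
    registers = [-math.inf] * k
    for item in items:
        for j in range(k):
            # Inverse-CDF transform: uniform on (0, 1) -> Gumbel(0, 1).
            g = -math.log(-math.log(_uniform(item, j)))
            if g > registers[j]:
                registers[j] = g
    return registers


def estimate(registers) -> float:
    """Estimate the number of distinct items from the registers.

    The sum of exp(-M_j) over k registers is Gamma(k, rate n),
    so (k - 1) / sum is an unbiased estimator of n.
    """
    k = len(registers)
    return (k - 1) / sum(math.exp(-m) for m in registers)


if __name__ == "__main__":
    stream = [f"user-{i % 1000}" for i in range(5000)]  # 1000 distinct items
    print(round(estimate(gumbel_sketch(stream))))
```

With k registers, the relative standard error of this estimator is roughly 1/sqrt(k - 2), so k = 256 gives about 6%. Because the registers hold continuous values, the estimate varies smoothly with the register contents.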
@InProceedings{lukasiewicz_et_al:LIPIcs.ESA.2022.76,
author = {{\L}ukasiewicz, Aleksander and Uzna\'{n}ski, Przemys{\l}aw},
title = {{Cardinality Estimation Using Gumbel Distribution}},
booktitle = {30th Annual European Symposium on Algorithms (ESA 2022)},
pages = {76:1--76:13},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-247-1},
ISSN = {1868-8969},
year = {2022},
volume = {244},
editor = {Chechik, Shiri and Navarro, Gonzalo and Rotenberg, Eva and Herman, Grzegorz},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2022.76},
URN = {urn:nbn:de:0030-drops-170140},
doi = {10.4230/LIPIcs.ESA.2022.76},
annote = {Keywords: Streaming algorithms, Cardinality estimation, Sketching, Gumbel distribution}
}