Cardinality Estimation Using Gumbel Distribution

Authors Aleksander Łukasiewicz , Przemysław Uznański



PDF
Thumbnail PDF

File

LIPIcs.ESA.2022.76.pdf
  • Filesize: 0.92 MB
  • 13 pages

Document Identifiers

Author Details

Aleksander Łukasiewicz
  • Faculty of Mathematics and Computer Science, University of Wrocław, Poland
Przemysław Uznański
  • Faculty of Mathematics and Computer Science, University of Wrocław, Poland

Acknowledgements

We would like to thank Seth Pettie for useful remarks that helped us improve the paper.

Cite AsGet BibTex

Aleksander Łukasiewicz and Przemysław Uznański. Cardinality Estimation Using Gumbel Distribution. In 30th Annual European Symposium on Algorithms (ESA 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 244, pp. 76:1-76:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.ESA.2022.76

Abstract

Cardinality estimation is the task of approximating the number of distinct elements in a large dataset with possibly repeating elements. LogLog and HyperLogLog (c.f. Durand and Flajolet [ESA 2003], Flajolet et al. [Discrete Math Theor. 2007]) are small space sketching schemes for cardinality estimation, which have both strong theoretical guarantees of performance and are highly effective in practice. This makes them a highly popular solution with many implementations in big-data systems (e.g. Algebird, Apache DataSketches, BigQuery, Presto and Redis). However, despite having simple and elegant formulation, both the analysis of LogLog and HyperLogLog are extremely involved - spanning over tens of pages of analytic combinatorics and complex function analysis. We propose a modification to both LogLog and HyperLogLog that replaces discrete geometric distribution with the continuous Gumbel distribution. This leads to a very short, simple and elementary analysis of estimation guarantees, and smoother behavior of the estimator.

Subject Classification

ACM Subject Classification
  • Theory of computation → Sketching and sampling
Keywords
  • Streaming algorithms
  • Cardinality estimation
  • Sketching
  • Gumbel distribution

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Algebird HyperLogLog implementation. Accessed: 2022-04-21. URL: https://twitter.github.io/algebird/datatypes/approx/hyperloglog.html.
  2. Counting uniques faster in BigQuery with HyperLogLog++. Accessed: 2022-04-21, URL: https://cloud.google.com/blog/products/gcp/counting-uniques-faster-in-bigquery-with-hyperloglog.
  3. HyperLogLog Sketch. Accessed: 2022-04-21. URL: https://datasketches.apache.org/docs/HLL/HLL.html.
  4. Presto HyperLogLog function. Accessed: 2022-04-21. URL: https://prestodb.github.io/docs/current/functions/hyperloglog.html.
  5. Redis PFCOUNT command. Accessed: 2022-04-21. URL: https://redis.io/commands/pfcount.
  6. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In STOC, pages 20-29, 1996. URL: https://doi.org/10.1145/237814.237823.
  7. Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In RANDOM 2002, pages 1-10, 2002. URL: https://doi.org/10.1007/3-540-45726-7_1.
  8. Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In SODA 2002, pages 623-632. ACM/SIAM, 2002. URL: http://dl.acm.org/citation.cfm?id=545381.545464.
  9. Kevin Beyer, Rainer Gemulla, Peter J Haas, Berthold Reinwald, and Yannis Sismanis. Distinct-value synopses for multiset operations. Communications of the ACM, 52(10):87-95, 2009. Google Scholar
  10. Joshua Brody and Amit Chakrabarti. A multi-round communication lower bound for gap hamming and some consequences. In CCC 2009, pages 358-368, 2009. URL: https://doi.org/10.1109/CCC.2009.31.
  11. Jarosław Błasiok. Optimal streaming and tracking distinct elements with high probability. In SODA 2018, pages 2432-2448, 2018. URL: https://doi.org/10.1137/1.9781611975031.156.
  12. Aiyou Chen, Jin Cao, Larry Shepp, and Tuan Nguyen. Distinct counting with a self-learning bitmap. Journal of the American Statistical Association, 106(495):879-890, 2011. Google Scholar
  13. Peter Clifford and Ioana A Cosma. A statistical analysis of probabilistic counting algorithms. Scandinavian Journal of Statistics, 39(1):1-14, 2012. Google Scholar
  14. Edith Cohen. All-distances sketches, revisited: Hip estimators for massive graphs analysis. IEEE Transactions on Knowledge and Data Engineering, 27(9):2320-2334, 2015. Google Scholar
  15. Laurens De Haan and Ana Ferreira. Extreme value theory: an introduction. Springer Science & Business Media, 2007. Google Scholar
  16. Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities (extended abstract). In ESA 2003, pages 605-617, 2003. URL: https://doi.org/10.1007/978-3-540-39658-1_55.
  17. Otmar Ertl. New cardinality estimation algorithms for hyperloglog sketches. CoRR, abs/1702.01284, 2017. URL: http://arxiv.org/abs/1702.01284.
  18. Cristian Estan, George Varghese, and Michael E. Fisk. Bitmap algorithms for counting active flows on high-speed links. IEEE/ACM Trans. Netw., 14(5):925-937, 2006. URL: https://doi.org/10.1145/1217709.
  19. Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In Discrete Mathematics and Theoretical Computer Science, pages 137-156. Discrete Mathematics and Theoretical Computer Science, 2007. Google Scholar
  20. Philippe Flajolet and G. Nigel Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182-209, 1985. URL: https://doi.org/10.1016/0022-0000(85)90041-8.
  21. Lucas Gerin and Philippe Chassaing. Efficient estimation of the cardinality of large data sets. Discrete Mathematics & Theoretical Computer Science, 2006. Google Scholar
  22. Phillip B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB 2001, pages 541-550, 2001. URL: http://www.vldb.org/conf/2001/P541.pdf.
  23. Phillip B. Gibbons and Srikanta Tirthapura. Estimating simple functions on the union of data streams. In SPAA 2001, pages 281-291, 2001. URL: https://doi.org/10.1145/378580.378687.
  24. Frédéric Giroire. Order statistics and estimating cardinalities of massive data sets. Discret. Appl. Math., 157(2):406-427, 2009. URL: https://doi.org/10.1016/j.dam.2008.06.020.
  25. Emil Julius Gumbel. Les valeurs extrêmes des distributions statistiques. In Annales de l'Institut Henri Poincaré, volume 5(2), pages 115-158, 1935. Google Scholar
  26. Stefan Heule, Marc Nunkesser, and Alexander Hall. Hyperloglog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In EDBT 2013, pages 683-692, 2013. URL: https://doi.org/10.1145/2452376.2452456.
  27. Piotr Indyk and David P. Woodruff. Tight lower bounds for the distinct elements problem. In FOCS 2003, pages 283-288, 2003. URL: https://doi.org/10.1109/SFCS.2003.1238202.
  28. T. S. Jayram and David P. Woodruff. Optimal bounds for johnson-lindenstrauss transforms and streaming problems with sub-constant error. In SODA 2011, pages 1-10, 2011. URL: https://doi.org/10.1137/1.9781611973082.1.
  29. Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In PODS 2010, pages 41-52, 2010. URL: https://doi.org/10.1145/1807085.1807094.
  30. Jérémie Lumbroso. An optimal cardinality estimation algorithm based on order statistics and its full analysis. Discrete Mathematics & Theoretical Computer Science, 2010. Google Scholar
  31. Seth Pettie and Dingyu Wang. Information theoretic limits of cardinality estimation: Fisher meets shannon. In STOC 2021, pages 556-569. ACM, 2021. Google Scholar
  32. Seth Pettie, Dingyu Wang, and Longhui Yin. Non-mergeable sketching for cardinality estimation. In ICALP 2021, volume 198 of LIPIcs, pages 104:1-104:20, 2021. Google Scholar
  33. Wojciech Szpankowski. Average case analysis of algorithms on sequences, volume 50. John Wiley & Sons, 2011. Google Scholar
  34. Daniel Ting. Streamed approximate counting of distinct elements: beating optimal batch methods. In KDD 2014, pages 442-451. ACM, 2014. URL: https://doi.org/10.1145/2623330.2623669.
  35. Alfredo Viola, Conrado Martínez, Jérémie Lumbroso, and Ahmed Helmi. Data streams as random permutations: the distinct element problem. Discrete Mathematics & Theoretical Computer Science, 2012. Google Scholar
  36. David P. Woodruff. Optimal space lower bounds for all frequency moments. In SODA 2004, pages 167-175, 2004. URL: http://dl.acm.org/citation.cfm?id=982792.982817.
  37. Qingjun Xiao, You Zhou, and Shigang Chen. Better with fewer bits: Improving the performance of cardinality estimation of large data streams. In INFOCOM 2017, pages 1-9, 2017. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail