Bit-Array-Based Alternatives to HyperLogLog

Authors: Svante Janson, Jérémie Lumbroso, Robert Sedgewick




File

LIPIcs.AofA.2024.5.pdf
  • Filesize: 1.15 MB
  • 19 pages

Author Details

Svante Janson
  • Department of Mathematics, Uppsala University, Sweden
Jérémie Lumbroso
  • Department of Computer Science, University of Pennsylvania, Philadelphia, PA, USA
Robert Sedgewick
  • Department of Computer Science, Princeton University, NJ, USA

Acknowledgements

This work is dedicated to the memory of Philippe Flajolet. We would like to thank Martin Pépin and two anonymous reviewers for their helpful comments on our initial submission; and Seth Pettie and Jelani Nelson for feedback on this paper. We would also like to thank our colleagues, Conrado Martínez, Sampath Kannan, Val Tannen, and Pedro Paredes for their interest and feedback; and our students, Alex Iriza and Alex Baroody for their discussions and implementation work on earlier versions of these algorithms. Finally, we would like to thank the editors, Cécile Mailler and Sebastian Wild, for their service to the community.

Cite As

Svante Janson, Jérémie Lumbroso, and Robert Sedgewick. Bit-Array-Based Alternatives to HyperLogLog. In 35th International Conference on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms (AofA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 302, pp. 5:1-5:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.AofA.2024.5

Abstract

We present a family of algorithms for the problem of estimating the number of distinct items in an input stream that are simple to implement and are appropriate for practical applications. Our algorithms are a logical extension of the series of algorithms developed by Flajolet and his coauthors starting in 1983 that culminated in the widely used HyperLogLog algorithm. These algorithms divide the input stream into M substreams and lead to a time-accuracy tradeoff where a constant number of bits per substream are saved to achieve a relative accuracy proportional to 1/√M. Our algorithms use just one or two bits per substream. Their effectiveness is demonstrated by a proof of approximate normality, with explicit expressions for standard errors that inform parameter settings and allow proper quantitative comparisons with other methods. Hypotheses about performance are validated through experiments using a realistic input stream, with the conclusion that our algorithms are more accurate than HyperLogLog when using the same amount of memory, and they use two-thirds as much memory as HyperLogLog to achieve a given accuracy.
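To make the idea of "one bit per substream" concrete, here is a minimal sketch of linear counting, a classical bit-array estimator. It is not the algorithms of this paper, whose details are not given in the abstract; it only illustrates the general setting the abstract describes: a hash function assigns each item to one of M substreams, a single bit is kept per substream, and the estimate is read off the fraction of bits still zero. The function name `linear_count`, the parameter `m`, and the choice of BLAKE2b as the hash are assumptions made for this illustration.

```python
import hashlib
import math


def linear_count(items, m=4096):
    """Estimate the number of distinct items using one bit per substream.

    Illustrative linear-counting sketch (an assumption for this example),
    not the estimator analyzed in the paper.
    """
    bits = bytearray(m)  # one byte per "bit" to keep the code simple
    for item in items:
        # Hash the item and use the hash to pick one of the m substreams.
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        bits[int.from_bytes(digest, "big") % m] = 1
    zeros = m - sum(bits)
    if zeros == 0:
        raise ValueError("bit array saturated; increase m")
    # With n distinct items, the expected fraction of zero bits is about
    # exp(-n / m), so n is estimated by -m * ln(zeros / m).
    return -m * math.log(zeros / m)


# A stream of 5000 items containing 1000 distinct values.
stream = [i % 1000 for i in range(5000)]
estimate = linear_count(stream)
print(round(estimate))
```

Repeated items set the same bit again, so the estimate depends only on the set of distinct values. This simple estimator works only while the number of distinct items is comparable to `m`; handling much larger cardinalities in small memory is the problem that HyperLogLog and the algorithms of this paper address.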

Subject Classification

ACM Subject Classification
  • Theory of computation → Sketching and sampling
Keywords
  • Cardinality estimation
  • sketching
  • HyperLogLog


References

  1. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Gary L. Miller, editor, Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, Philadelphia, Pennsylvania, USA, May 22-24, 1996, pages 20-29. ACM, 1996. URL: https://doi.org/10.1145/237814.237823.
  2. A. D. Barbour, Lars Holst, and Svante Janson. Poisson Approximation. Oxford University Press, 1992.
  3. Kai-Min Chung, Michael Mitzenmacher, and Salil P. Vadhan. Why simple hash functions work: Exploiting the entropy in a data stream. Theory Comput., 9:897-945, 2013. URL: https://doi.org/10.4086/TOC.2013.V009A030.
  4. Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities (extended abstract). In Giuseppe Di Battista and Uri Zwick, editors, Algorithms - ESA 2003, 11th Annual European Symposium, Budapest, Hungary, September 16-19, 2003, Proceedings, volume 2832 of Lecture Notes in Computer Science, pages 605-617. Springer, 2003. URL: https://doi.org/10.1007/978-3-540-39658-1_55.
  5. Philippe Flajolet and G. Nigel Martin. Probabilistic counting. In 24th Annual Symposium on Foundations of Computer Science, Tucson, Arizona, USA, 7-9 November 1983, pages 76-82. IEEE Computer Society, 1983. URL: https://doi.org/10.1109/SFCS.1983.46.
  6. Philippe Flajolet and G. Nigel Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182-209, 1985. URL: https://doi.org/10.1016/0022-0000(85)90041-8.
  7. Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In Philippe Jacquet, editor, AofA 07, Proceedings of the 2007 Conference on Analysis of Algorithms, Juan-les-Pins, France, June 18-22, 2007, DMTCS Proceedings volume AH, pages 127-146. DMTCS, 2007. URL: https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf.
  8. Allan Gut. Probability: A Graduate Course (2nd edition). Springer Texts in Statistics, 75, 2013.
  9. Stefan Heule, Marc Nunkesser, and Alexander Hall. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Giovanna Guerrini and Norman W. Paton, editors, Joint 2013 EDBT/ICDT Conferences, EDBT '13 Proceedings, Genoa, Italy, March 18-22, 2013, pages 683-692. ACM, 2013. URL: https://doi.org/10.1145/2452376.2452456.
  10. Piotr Indyk and David P. Woodruff. Tight lower bounds for the distinct elements problem. In 44th Symposium on Foundations of Computer Science (FOCS 2003), 11-14 October 2003, Cambridge, MA, USA, Proceedings, pages 283-288. IEEE Computer Society, 2003. URL: https://doi.org/10.1109/SFCS.2003.1238202.
  11. Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Jan Paredaens and Dirk Van Gucht, editors, Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2010, June 6-11, 2010, Indianapolis, Indiana, USA, pages 41-52. ACM, 2010. URL: https://doi.org/10.1145/1807085.1807094.
  12. Matti Karppa and Rasmus Pagh. HyperLogLogLog: cardinality estimation with one log more. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 753-761, 2022.
  13. Jérémie Lumbroso. An optimal cardinality estimation algorithm based on order statistics and its full analysis. Discrete Mathematics & Theoretical Computer Science, AM, 2010. URL: https://doi.org/10.46298/dmtcs.2780.
  14. Jérémie Lumbroso. How Flajolet processed streams with coin flips. CoRR, abs/1805.00612, 2018. URL: https://arxiv.org/abs/1805.00612.
  15. Jérémie Lumbroso and Conrado Martínez. Affirmative Sampling: Theory and Applications. In Mark Daniel Ward, editor, 33rd International Conference on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms (AofA 2022), volume 225 of Leibniz International Proceedings in Informatics (LIPIcs), pages 12:1-12:17, Dagstuhl, Germany, 2022. Schloss Dagstuhl - Leibniz-Zentrum für Informatik. URL: https://doi.org/10.4230/LIPIcs.AofA.2022.12.
  16. Tal Ohayon. ExtendedHyperLogLog: Analysis of a new cardinality estimator. CoRR, abs/2106.06525, 2021. URL: https://arxiv.org/abs/2106.06525.
  17. Seth Pettie and Dingyu Wang. Information theoretic limits of cardinality estimation: Fisher meets Shannon. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 556-569, 2021.
  18. Robert Sedgewick and Philippe Flajolet. An Introduction to the Analysis of Algorithms, Second Edition. Addison-Wesley-Longman, 2013.
  19. Dingyu Wang and Seth Pettie. Better cardinality estimators for HyperLogLog, PCSA, and beyond. In Floris Geerts, Hung Q. Ngo, and Stavros Sintos, editors, Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2023, Seattle, WA, USA, June 18-23, 2023, pages 317-327. ACM, 2023. URL: https://doi.org/10.1145/3584372.3588680.