Document Open Access Logo

An Embarrassingly Parallel Optimal-Space Cardinality Estimation Algorithm

Author Emin Karayel



PDF
Thumbnail PDF

File

LIPIcs.APPROX-RANDOM.2023.35.pdf
  • Filesize: 0.96 MB
  • 22 pages

Document Identifiers

Author Details

Emin Karayel
  • Department of Computer Science, Technische Universität München, Germany

Acknowledgements

Thanks to the anonymous reviewers for their careful review and helpful comments and to Tobias Nipkow for many helpful discussions.

Cite AsGet BibTex

Emin Karayel. An Embarrassingly Parallel Optimal-Space Cardinality Estimation Algorithm. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 275, pp. 35:1-35:22, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.APPROX/RANDOM.2023.35

Abstract

In 2020 Błasiok (ACM Trans. Algorithms 16(2) 3:1-3:28) constructed an optimal space streaming algorithm for the cardinality estimation problem with the space complexity of O(ε^{-2} ln(δ^{-1}) + ln n) where ε, δ and n denote the relative accuracy, failure probability and universe size, respectively. However, his solution requires the stream to be processed sequentially. On the other hand, there are algorithms that admit a merge operation; they can be used in a distributed setting, allowing parallel processing of sections of the stream, and are highly relevant for large-scale distributed applications. The best-known such algorithm, unfortunately, has a space complexity exceeding Ω(ln(δ^{-1}) (ε^{-2} ln ln n + ln n)). This work presents a new algorithm that improves on the solution by Błasiok, preserving its space complexity, but with the benefit that it admits such a merge operation, thus providing an optimal solution for the problem for both sequential and parallel applications. Orthogonally, the new algorithm also improves algorithmically on Błasiok’s solution (even in the sequential setting) by reducing its implementation complexity and requiring fewer distinct pseudo-random objects.

Subject Classification

ACM Subject Classification
  • Mathematics of computing → Probabilistic algorithms
  • Theory of computation → Pseudorandomness and derandomization
  • Theory of computation → Distributed algorithms
  • Theory of computation → Sketching and sampling
Keywords
  • Distinct Elements
  • Distributed Algorithms
  • Randomized Algorithms
  • Expander Graphs
  • Derandomization
  • Sketching

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Archive of Formal Proofs. https://isa-afp.org. Accessed: 2023-07-03.
  2. Rohit Agrawal. Samplers and Extractors for Unbounded Functions. In Dimitris Achlioptas and László A. Végh, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019), volume 145 of Leibniz International Proceedings in Informatics (LIPIcs), pages 59:1-59:21, Dagstuhl, Germany, 2019. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. URL: https://doi.org/10.4230/LIPIcs.APPROX-RANDOM.2019.59.
  3. Noga Alon, Uriel Feige, Avi Wigderson, and David Zuckerman. Derandomized graph products. computational complexity, 5:60-75, 1995. URL: https://doi.org/10.1007/BF01277956.
  4. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137-147, 1999. URL: https://doi.org/10.1006/jcss.1997.1545.
  5. Frank J. Balbach. The cook-levin theorem. Archive of Formal Proofs, January 2023. , Formal proof development. URL: https://isa-afp.org/entries/Cook_Levin.html.
  6. Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In Randomization and Approximation Techniques in Computer Science, pages 1-10. Springer Berlin Heidelberg, 2002. URL: https://doi.org/10.1007/3-540-45726-7_1.
  7. Daniel K. Blandford and Guy E. Blelloch. Compact dictionaries for variable-length keys and data with applications. ACM Trans. Algorithms, 4(2), May 2008. URL: https://doi.org/10.1145/1361192.1361194.
  8. Jarosław Błasiok. Optimal streaming and tracking distinct elements with high probability. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, pages 2432-2448, 2018. URL: https://doi.org/10.1137/1.9781611975031.156.
  9. Jarosław Błasiok. Optimal streaming and tracking distinct elements with high probability. ACM Trans. Algorithms, 16(1):3:1-3:28, 2020. URL: https://doi.org/10.1145/3309193.
  10. Gil Cohen, Noam Peri, and Amnon Ta-Shma. Expander random walks: A fourier-analytic approach. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2021, pages 1643-1655, New York, NY, USA, 2021. URL: https://doi.org/10.1145/3406325.3451049.
  11. Jeffrey Dean and Sanjay Ghemawat. Mapreduce: A flexible data processing tool. Commun. ACM, 53(1):72-77, January 2010. URL: https://doi.org/10.1145/1629175.1629198.
  12. Manuel Eberl and Lawrence C. Paulson. The prime number theorem. Archive of Formal Proofs, September 2018. , Formal proof development. URL: https://isa-afp.org/entries/Prime_Number_Theorem.html.
  13. P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194-203, 1975. Google Scholar
  14. Philippe Flajolet and G. Nigel Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182-209, 1985. URL: https://doi.org/10.1016/0022-0000(85)90041-8.
  15. Ian Foster. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison-Wesley Longman Publishing Co., Inc., USA, 1995. Google Scholar
  16. Ankit Garg, Yin Tat Lee, Zhao Song, and Nikhil Srivastava. A matrix expander chernoff bound. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 1102-1114, New York, NY, USA, 2018. URL: https://doi.org/10.1145/3188745.3188890.
  17. Phillip B. Gibbons and Srikanta Tirthapura. Estimating simple functions on the union of data streams. In Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '01, pages 281-291, 2001. URL: https://doi.org/10.1145/378580.378687.
  18. David Gillman. A chernoff bound for random walks on expander graphs. SIAM Journal on Computing, 27(4):1203-1220, 1998. URL: https://doi.org/10.1137/S0097539794268765.
  19. Louis Golowich. A new berry-esseen theorem for expander walks. Electron. Colloquium Comput. Complex., TR22, 2022. Google Scholar
  20. Louis Golowich and Salil Vadhan. Pseudorandomness of expander random walks for symmetric functions and permutation branching programs. In Proceedings of the 37th Computational Complexity Conference, CCC '22, Dagstuhl, Germany, 2022. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. URL: https://doi.org/10.4230/LIPIcs.CCC.2022.27.
  21. Venkatesan Guruswami, Christopher Umans, and Salil Vadhan. Unbalanced expanders and randomness extractors from parvaresh-vardy codes. J. ACM, 56(4), July 2009. URL: https://doi.org/10.1145/1538902.1538904.
  22. Russell Impagliazzo and Valentine Kabanets. Constructive proofs of concentration bounds. In Maria Serna, Ronen Shaltiel, Klaus Jansen, and José Rolim, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 617-631, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg. URL: https://doi.org/10.1007/978-3-642-15369-3_46.
  23. T. S. Jayram and David P. Woodruff. Optimal bounds for johnson-lindenstrauss transforms and streaming problems with subconstant error. ACM Trans. Algorithms, 9(3), June 2013. URL: https://doi.org/10.1145/2483699.2483706.
  24. Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '10, pages 41-52, New York, 2010. URL: https://doi.org/10.1145/1807085.1807094.
  25. Emin Karayel. Distributed distinct elements. Archive of Formal Proofs, April 2023. , Formal proof development. URL: https://isa-afp.org/entries/Distributed_Distinct_Elements.html.
  26. Emin Karayel. An embarrassingly parallel optimal-space cardinality estimation algorithm, 2023. URL: https://arxiv.org/abs/2307.00985.
  27. Emin Karayel. Expander graphs. Archive of Formal Proofs, March 2023. , Formal proof development. URL: https://isa-afp.org/entries/Expander_Graphs.html.
  28. Pascal Lezaud. Chernoff-type bound for finite Markov chains. The Annals of Applied Probability, 8(3):849-867, 1998. URL: https://doi.org/10.1214/aoap/1028903453.
  29. Jack Murtagh, Omer Reingold, Aaron Sidford, and Salil Vadhan. Deterministic Approximation of Random Walks in Small Space. In Dimitris Achlioptas and László A. Végh, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019), volume 145 of Leibniz International Proceedings in Informatics (LIPIcs), pages 42:1-42:22, Dagstuhl, Germany, 2019. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. URL: https://doi.org/10.4230/LIPIcs.APPROX-RANDOM.2019.42.
  30. Assaf Naor, Shravas Rao, and Oded Regev. Concentration of markov chains with bounded moments. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 56(3):2270-2280, 2020. URL: https://doi.org/10.1214/19-AIHP1039.
  31. Tobias Nipkow, Lawrence C Paulson, and Markus Wenzel. Isabelle/HOL: A Proof Assistant for Higher-Order Logic, volume 2283 of Lecture Notes in Computer Science. Springer-Verlag, Berlin, Heidelberg, first edition, 2002. Google Scholar
  32. Daniel Paulin. Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic Journal of Probability, 20:1-32, 2015. URL: https://doi.org/10.1214/EJP.v20-4039.
  33. Seth Pettie and Dingyu Wang. Information theoretic limits of cardinality estimation: Fisher meets shannon. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2021, pages 556-569, New York, NY, USA, 2021. Association for Computing Machinery. URL: https://doi.org/10.1145/3406325.3451032.
  34. Shravas Rao. A hoeffding inequality for markov chains. Electronic Communications in Probability, 24:1-11, 2019. URL: https://doi.org/10.1214/19-ECP219.
  35. Shravas Rao and Oded Regev. A sharp tail bound for the expander random sampler, 2017. URL: https://arxiv.org/abs/1703.10205.
  36. Omer Reingold, Thomas Steinke, and Salil Vadhan. Pseudorandomness for regular branching programs via fourier analysis. In Prasad Raghavendra, Sofya Raskhodnikova, Klaus Jansen, and José D. P. Rolim, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 655-670, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. URL: https://doi.org/10.1007/978-3-642-40328-6_45.
  37. Salil P. Vadhan. Pseudorandomness. Foundations and Trends® in Theoretical Computer Science, 7(1-3):1-336, 2012. URL: https://doi.org/10.1561/0400000010.
  38. Roy Wagner. Tail estimates for sums of variables sampled by a random walk. Comb. Probab. Comput., 17(2):307-316, March 2008. URL: https://doi.org/10.1017/S0963548307008772.
  39. Mark N. Wegman and J. Lawrence Carter. New hash functions and their use in authentication and set equality. Journal of Computer and System Sciences, 22(3):265-279, 1981. URL: https://doi.org/10.1016/0022-0000(81)90033-7.
  40. David Woodruff. Optimal space lower bounds for all frequency moments. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '04, pages 167-175, USA, 2004. Society for Industrial and Applied Mathematics. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail