Distinct Elements in Streams: An Algorithm for the (Text) Book

Authors Sourav Chakraborty , N. V. Vinodchandran¹ , Kuldeep S. Meel



PDF
Thumbnail PDF

File

LIPIcs.ESA.2022.34.pdf
  • Filesize: 0.69 MB
  • 6 pages

Document Identifiers

Author Details

Sourav Chakraborty
  • Indian Statistical Institute Kolkata, India
N. V. Vinodchandran¹
  • University of Nebraska-Lincoln, NE, USA
Kuldeep S. Meel
  • National University of Singapore, Singapore

Cite AsGet BibTex

Sourav Chakraborty, N. V. Vinodchandran¹, and Kuldeep S. Meel. Distinct Elements in Streams: An Algorithm for the (Text) Book. In 30th Annual European Symposium on Algorithms (ESA 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 244, pp. 34:1-34:6, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.ESA.2022.34

Abstract

Given a data stream 𝒟 = ⟨ a₁, a₂, …, a_m ⟩ of m elements where each a_i ∈ [n], the Distinct Elements problem is to estimate the number of distinct elements in 𝒟. Distinct Elements has been a subject of theoretical and empirical investigations over the past four decades resulting in space optimal algorithms for it. All the current state-of-the-art algorithms are, however, beyond the reach of an undergraduate textbook owing to their reliance on the usage of notions such as pairwise independence and universal hash functions. We present a simple, intuitive, sampling-based space-efficient algorithm whose description and the proof are accessible to undergraduates with the knowledge of basic probability theory.

Subject Classification

ACM Subject Classification
  • Theory of computation → Sketching and sampling
Keywords
  • F₀ Estimation
  • Streaming
  • Sampling

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137-147, 1999. URL: https://doi.org/10.1006/jcss.1997.1545.
  2. Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In Proc. of RANDOM, pages 1-10, 2002. URL: https://doi.org/10.1007/3-540-45726-7_1.
  3. Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms,, pages 623-632, 2002. URL: http://dl.acm.org/citation.cfm?id=545381.545464.
  4. Jaroslaw Blasiok. Optimal streaming and tracking distinct elements with high probability. In Proc. of SODA, 2018. URL: https://doi.org/10.1137/1.9781611975031.156.
  5. Joshua Brody and Amit Chakrabarti. A multi-round communication lower bound for gap hamming and some consequences. In Proceedings of the 24th Annual IEEE Conference on Computational Complexity, CCC 2009, pages 358-368. IEEE Computer Society, 2009. URL: https://doi.org/10.1109/CCC.2009.31.
  6. Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities (extended abstract). In Algorithms - ESA 2003, 11th Annual European Symposium, volume 2832, pages 605-617, 2003. URL: https://doi.org/10.1007/978-3-540-39658-1_55.
  7. Cristian Estan, George Varghese, and Michael E. Fisk. Bitmap algorithms for counting active flows on high-speed links. IEEE/ACM Trans. Netw., 14(5):925-937, 2006. URL: https://doi.org/10.1145/1217709.
  8. Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In Discrete Mathematics and Theoretical Computer Science, pages 137-156, 2007. URL: https://doi.org/10.46298/dmtcs.3545.
  9. Philippe Flajolet and G. Nigel Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182-209, 1985. URL: https://doi.org/10.1016/0022-0000(85)90041-8.
  10. Phillip B. Gibbons and Srikanta Tirthapura. Estimating simple functions on the union of data streams. In Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA, pages 281-291, 2001. URL: https://doi.org/10.1145/378580.378687.
  11. Piotr Indyk and David P. Woodruff. Tight lower bounds for the distinct elements problem. In Proc. of FOCS, pages 283-288. IEEE Computer Society, 2003. URL: https://doi.org/10.1109/SFCS.2003.1238202.
  12. Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS, pages 41-52, 2010. URL: https://doi.org/10.1145/1807085.1807094.
  13. Kuldeep S. Meel, N. V. Vinodchandran, and Sourav Chakraborty. Estimating the size of union of sets in streaming models. In PODS'21: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 126-137, 2021. URL: https://doi.org/10.1145/3452021.3458333.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail