Space-Optimal Profile Estimation in Data Streams with Applications to Symmetric Functions

Authors: Justin Y. Chen, Piotr Indyk, David P. Woodruff




File

LIPIcs.ITCS.2024.32.pdf
  • Filesize: 0.78 MB
  • 22 pages

Author Details

Justin Y. Chen
  • Massachusetts Institute of Technology, Cambridge, MA, USA
Piotr Indyk
  • Massachusetts Institute of Technology, Cambridge, MA, USA
David P. Woodruff
  • Carnegie Mellon University, Pittsburgh, PA, USA

Cite As

Justin Y. Chen, Piotr Indyk, and David P. Woodruff. Space-Optimal Profile Estimation in Data Streams with Applications to Symmetric Functions. In 15th Innovations in Theoretical Computer Science Conference (ITCS 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 287, pp. 32:1-32:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.ITCS.2024.32

Abstract

We revisit the problem of estimating the profile (also known as the rarity) in the data stream model. Given a sequence of m elements from a universe of size n, its profile is a vector ϕ whose i-th entry ϕ_i is the number of distinct elements that appear in the stream exactly i times. A classic paper by Datar and Muthukrishnan from 2002 gave an algorithm which estimates any single entry ϕ_i up to an additive error of ± ε D using O((1/ε²)(log n + log m)) bits of space, where D is the number of distinct elements in the stream. In this paper, we considerably improve on this result by designing an algorithm which simultaneously estimates many coordinates of the profile vector ϕ up to small overall error. We give an algorithm which, with constant probability, produces an estimated profile ϕ̂ with the following guarantees in terms of space and estimation error:
  a) For any constant τ, with O(1/ε² + log n) bits of space, ∑_{i = 1}^τ |ϕ_i - ϕ̂_i| ≤ ε D.
  b) With O((1/ε²) log(1/ε) + log n + log log m) bits of space, ∑_{i = 1}^m |ϕ_i - ϕ̂_i| ≤ ε m.
In addition to bounding the error across multiple coordinates, our space bounds separate the terms that depend on 1/ε from those that depend on n and m. We prove matching lower bounds on space in both regimes. Applying our profile estimation algorithm yields estimates within error ± ε D of several symmetric functions of frequencies in O(1/ε² + log n) bits. This generalizes space-optimal algorithms for the distinct elements problem to other problems, including estimating the Huber and Tukey losses as well as frequency cap statistics.
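To make the definitions in the abstract concrete, the following is a minimal Python sketch that computes the exact profile of a small stream, the two error measures that appear in the guarantees, and a frequency cap statistic derived from the profile. It is not the paper's streaming algorithm: it stores every element and so uses space linear in the number of distinct elements. The function names (`profile`, `l1_profile_error`, `symmetric_from_profile`), the toy stream, and the estimate `phi_hat` are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def profile(stream):
    """Exact profile: phi[i] = number of distinct elements appearing exactly i times.

    Illustration only; this keeps a counter per distinct element,
    unlike a small-space streaming sketch.
    """
    freq = Counter(stream)         # element -> frequency
    return Counter(freq.values())  # frequency i -> number of distinct elements

def l1_profile_error(phi, phi_hat, upto):
    """Sum over i = 1..upto of |phi_i - phi_hat_i| (missing entries count as 0)."""
    return sum(abs(phi.get(i, 0) - phi_hat.get(i, 0)) for i in range(1, upto + 1))

def symmetric_from_profile(phi, g):
    """Sum of g(f_x) over distinct elements x, rewritten as sum_i phi_i * g(i)."""
    return sum(count * g(i) for i, count in phi.items())

# Toy stream: a appears 3 times, b twice, c, d, e once each.
stream = ["a", "b", "a", "c", "d", "a", "b", "e"]
phi = profile(stream)

D = sum(phi.values())                   # number of distinct elements
m = sum(i * c for i, c in phi.items())  # stream length

# A made-up estimate, used only to show how the error is measured.
phi_hat = {1: 2, 2: 2, 3: 1}
err = l1_profile_error(phi, phi_hat, upto=m)

# Frequency cap statistic with cap 2: sum over x of min(f_x, 2).
cap2 = symmetric_from_profile(phi, lambda f: min(f, 2))

print(dict(phi), D, m, err, cap2)
```

The last helper relies on the identity that any function of the form ∑_x g(f_x), summed over the distinct elements x with frequencies f_x, equals ∑_i ϕ_i g(i). This is one way to see why an accurate profile estimate translates into estimates of symmetric functions of the frequencies, such as the frequency cap statistic above.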

Subject Classification

ACM Subject Classification
  • Theory of computation → Streaming, sublinear and near linear time algorithms
  • Theory of computation → Sketching and sampling
Keywords
  • Streaming and Sketching Algorithms
  • Sublinear Algorithms

References

  1. Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for optimal distribution property estimation. In International Conference on Machine Learning. PMLR, 2017.
  2. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, STOC '96, pages 20-29, New York, NY, USA, 1996. Association for Computing Machinery. URL: https://doi.org/10.1145/237814.237823.
  3. Nima Anari, Moses Charikar, Kirankumar Shiragur, and Aaron Sidford. The Bethe and Sinkhorn permanents of low rank matrices and implications for profile maximum likelihood. In Conference on Learning Theory, pages 93-158. PMLR, 2021.
  4. Daniel K. Blandford and Guy E. Blelloch. Compact dictionaries for variable-length keys and data with applications. ACM Trans. Algorithms, 4:17:1-17:25, 2008.
  5. Jaroslaw Blasiok. Optimal streaming and tracking distinct elements with high probability. ACM Trans. Algorithms, 16(1):3:1-3:28, 2020. URL: https://doi.org/10.1145/3309193.
  6. Vladimir Braverman, Ran Gelles, and Rafail Ostrovsky. How to catch l2-heavy-hitters on sliding windows. Theoretical Computer Science, 554:82-94, 2014.
  7. Luciana S. Buriol, Debora Donato, Stefano Leonardi, and Tobias Matzner. Using data stream algorithms for computing properties of large graphs. In Workshop on Massive Geometric Data Sets (MASSIVE’05), pages 9-14, 2005.
  8. J. Lawrence Carter and Mark N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18(2):143-154, 1979. URL: https://doi.org/10.1016/0022-0000(79)90044-8.
  9. Moses Charikar, Kirankumar Shiragur, and Aaron Sidford. Efficient profile maximum likelihood for universal symmetric property estimation. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 780-791, New York, NY, USA, 2019. Association for Computing Machinery. URL: https://doi.org/10.1145/3313276.3316398.
  10. Rayan Chikhi and Paul Medvedev. Informed and automated k-mer size selection for genome assembly. Bioinformatics, 30(1):31-37, 2014.
  11. Edith Cohen. Stream sampling for frequency cap statistics. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 159-168, 2015.
  12. Edith Cohen. Hyperloglog hyperextended: Sketches for concave sublinear frequency statistics. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, pages 105-114, New York, NY, USA, 2017. Association for Computing Machinery. URL: https://doi.org/10.1145/3097983.3098020.
  13. Edith Cohen and Ofir Geri. Sampling sketches for concave sublinear functions of frequencies. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 1361-1371, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/61b4a64be663682e8cb037d9719ad8cd-Abstract.html.
  14. Graham Cormode, Senthilmurugan Muthukrishnan, and Irina Rozenbaum. Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In VLDB, volume 5, pages 25-36, 2005.
  15. Mayur Datar and S. Muthukrishnan. Estimating rarity and similarity over data stream windows. In European Symposium on Algorithms, pages 323-335. Springer, 2002.
  16. Charlie Dickens. Personal communication, 2023.
  17. Cynthia Dwork, Moni Naor, Toniann Pitassi, Guy N. Rothblum, and Sergey Yekhanin. Pan-private streaming algorithms. In ICS, pages 66-80, 2010.
  18. Guy Feigenblat, Ely Porat, and Ariel Shiftan. Exponential time improvement for min-wise based algorithms. In Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms, pages 57-66. SIAM, 2011.
  19. Philippe Flajolet and G. Nigel Martin. Probabilistic counting algorithms for database applications. Journal of Computer and System Sciences, 31(2):182-209, 1985.
  20. Badih Ghazi, Ben Kreuter, Ravi Kumar, Pasin Manurangsi, Jiayu Peng, Evgeny Skvortsov, Yao Wang, and Craig Wright. Multiparty reach and frequency histogram: Private, secure, and practical. Proceedings on Privacy Enhancing Technologies, 2022:373-395, January 2022. URL: https://doi.org/10.2478/popets-2022-0019.
  21. Peter J. Huber. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1):73-101, 1964. URL: https://doi.org/10.1214/aoms/1177703732.
  22. Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM), 53(3):307-323, 2006.
  23. Piotr Indyk and David Woodruff. Optimal approximations of the frequency moments of data streams. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 202-208, 2005.
  24. Rajesh Jayaram, David P. Woodruff, and Samson Zhou. Truly perfect samplers for data streams and sliding windows. In Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS '22, pages 29-40, New York, NY, USA, 2022. Association for Computing Machinery. URL: https://doi.org/10.1145/3517804.3524139.
  25. Praneeth Kacham, Rasmus Pagh, Mikkel Thorup, and David P. Woodruff. Pseudorandom hashing for space-bounded computation with applications to streaming. In Proceedings of the 64th Annual Symposium on Foundations of Computer Science (FOCS), 2023.
  26. Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 41-52, 2010.
  27. Vijay Karamcheti, Davi Geiger, Zvi Kedem, and S. Muthukrishnan. Detecting malicious network traffic using inverse distributions of packet contents. In Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data, pages 165-170, 2005.
  28. Robert H. Morris. Counting large numbers of events in small registers. Communications of the ACM, 21(10):840-842, 1978.
  29. Jelani Nelson and Huacheng Yu. Optimal bounds for approximate counting. In Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS '22, pages 119-127, New York, NY, USA, 2022. Association for Computing Machinery.
  30. William J. J. Rey. Introduction to Robust and Quasi-Robust Statistical Methods. Universitext. Springer, Berlin, Heidelberg, 1983.
  31. Gregory Valiant and Paul Valiant. Estimating the unseen: improved estimators for entropy and other properties. Journal of the ACM (JACM), 64(6):1-41, 2017.