# Space-Optimal Profile Estimation in Data Streams with Applications to Symmetric Functions

## File

LIPIcs.ITCS.2024.32.pdf
• Filesize: 0.78 MB
• 22 pages

## Cite As

Justin Y. Chen, Piotr Indyk, and David P. Woodruff. Space-Optimal Profile Estimation in Data Streams with Applications to Symmetric Functions. In 15th Innovations in Theoretical Computer Science Conference (ITCS 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 287, pp. 32:1-32:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.ITCS.2024.32

## Abstract

We revisit the problem of estimating the profile (also known as the rarity) in the data stream model. Given a sequence of m elements from a universe of size n, its profile is a vector ϕ whose i-th entry ϕ_i represents the number of distinct elements that appear in the stream exactly i times. A classic paper by Datar and Muthukrishnan from 2002 gave an algorithm which estimates any entry ϕ_i up to an additive error of ± ε D using O(1/ε² (log n + log m)) bits of space, where D is the number of distinct elements in the stream. In this paper, we considerably improve on this result by designing an algorithm which simultaneously estimates many coordinates of the profile vector ϕ up to small overall error. We give an algorithm which, with constant probability, produces an estimated profile ϕ̂ with the following guarantees in terms of space and estimation error:

a) For any constant τ, with O(1/ε² + log n) bits of space, ∑_{i=1}^τ |ϕ_i - ϕ̂_i| ≤ ε D.
b) With O((1/ε²) log(1/ε) + log n + log log m) bits of space, ∑_{i=1}^m |ϕ_i - ϕ̂_i| ≤ ε m.

In addition to bounding the error across multiple coordinates, our space bounds separate the terms that depend on 1/ε from those that depend on n and m. We prove matching lower bounds on space in both regimes. Applying our profile estimation algorithm gives estimates within error ± ε D of several symmetric functions of frequencies in O(1/ε² + log n) bits. This generalizes space-optimal algorithms for the distinct elements problem to other problems, including estimating the Huber and Tukey losses as well as frequency cap statistics.
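To make the definitions in the abstract concrete, the following is a minimal sketch that computes the exact profile of a small stream by storing every frequency. It is not the paper's small-space algorithm; it only illustrates the quantities ϕ, D, m, the summed error ∑ |ϕ_i - ϕ̂_i|, and how a symmetric function such as a frequency cap statistic can be read off the profile. The function names (`profile`, `l1_error`, `frequency_cap`) and the example stream are assumptions introduced here for illustration.

```python
from collections import Counter


def profile(stream):
    """Exact profile: maps i to the number of distinct elements seen exactly i times."""
    freq = Counter(stream)               # element -> frequency
    return dict(Counter(freq.values()))  # frequency i -> phi_i


def l1_error(phi, phi_hat, upto):
    """Sum of |phi_i - phi_hat_i| over i = 1..upto (missing entries count as 0)."""
    return sum(abs(phi.get(i, 0) - phi_hat.get(i, 0)) for i in range(1, upto + 1))


def frequency_cap(phi, cap):
    """Sum over distinct elements of min(frequency, cap), computed from the profile alone."""
    return sum(min(i, cap) * count for i, count in phi.items())


# Hypothetical example stream: a appears 3 times, b twice, c and d once each.
stream = ["a", "b", "a", "c", "b", "a", "d"]
phi = profile(stream)

D = sum(phi.values())                     # number of distinct elements: 4
m = sum(i * c for i, c in phi.items())    # stream length: 7

# A made-up estimate that undercounts the singletons by one.
phi_hat = {1: 1, 2: 1, 3: 1}

print(sorted(phi.items()))                # [(1, 2), (2, 1), (3, 1)]
print(D, m)                               # 4 7
print(l1_error(phi, phi_hat, upto=3))     # 1
print(frequency_cap(phi, cap=2))          # 6
```

Because any symmetric function of the frequencies depends only on how many elements have each frequency, it can be evaluated from ϕ without the element identities, which is why an accurate profile estimate carries over to such functions.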

## Subject Classification

##### ACM Subject Classification
• Theory of computation → Streaming, sublinear and near linear time algorithms
• Theory of computation → Sketching and sampling
##### Keywords
• Streaming and Sketching Algorithms
• Sublinear Algorithms

## References

1. Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for optimal distribution property estimation. In International Conference on Machine Learning. PMLR, 2017.
2. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, STOC '96, pages 20-29, New York, NY, USA, 1996. Association for Computing Machinery. URL: https://doi.org/10.1145/237814.237823.
3. Nima Anari, Moses Charikar, Kirankumar Shiragur, and Aaron Sidford. The Bethe and Sinkhorn permanents of low rank matrices and implications for profile maximum likelihood. In Conference on Learning Theory, pages 93-158. PMLR, 2021.
4. Daniel K. Blandford and Guy E. Blelloch. Compact dictionaries for variable-length keys and data with applications. ACM Trans. Algorithms, 4:17:1-17:25, 2008.
5. Jaroslaw Blasiok. Optimal streaming and tracking distinct elements with high probability. ACM Trans. Algorithms, 16(1):3:1-3:28, 2020. URL: https://doi.org/10.1145/3309193.
6. Vladimir Braverman, Ran Gelles, and Rafail Ostrovsky. How to catch ℓ₂-heavy-hitters on sliding windows. Theoretical Computer Science, 554:82-94, 2014.
7. Luciana S. Buriol, Debora Donato, Stefano Leonardi, and Tobias Matzner. Using data stream algorithms for computing properties of large graphs. In Workshop on Massive Geometric Data Sets (MASSIVE’05), pages 9-14, 2005.
8. J. Lawrence Carter and Mark N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18(2):143-154, 1979. URL: https://doi.org/10.1016/0022-0000(79)90044-8.
9. Moses Charikar, Kirankumar Shiragur, and Aaron Sidford. Efficient profile maximum likelihood for universal symmetric property estimation. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 780-791, New York, NY, USA, 2019. Association for Computing Machinery. URL: https://doi.org/10.1145/3313276.3316398.
10. Rayan Chikhi and Paul Medvedev. Informed and automated k-mer size selection for genome assembly. Bioinformatics, 30(1):31-37, 2014.
11. Edith Cohen. Stream sampling for frequency cap statistics. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 159-168, 2015.
12. Edith Cohen. HyperLogLog hyperextended: Sketches for concave sublinear frequency statistics. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, pages 105-114, New York, NY, USA, 2017. Association for Computing Machinery. URL: https://doi.org/10.1145/3097983.3098020.
13. Edith Cohen and Ofir Geri. Sampling sketches for concave sublinear functions of frequencies. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 1361-1371, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/61b4a64be663682e8cb037d9719ad8cd-Abstract.html.
14. Graham Cormode, Senthilmurugan Muthukrishnan, and Irina Rozenbaum. Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In VLDB, volume 5, pages 25-36, 2005.
15. Mayur Datar and S. Muthukrishnan. Estimating rarity and similarity over data stream windows. In European Symposium on Algorithms, pages 323-335. Springer, 2002.
16. Charlie Dickens. Personal communication, 2023.
17. Cynthia Dwork, Moni Naor, Toniann Pitassi, Guy N. Rothblum, and Sergey Yekhanin. Pan-private streaming algorithms. In ICS, pages 66-80, 2010.
18. Guy Feigenblat, Ely Porat, and Ariel Shiftan. Exponential time improvement for min-wise based algorithms. In Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms, pages 57-66. SIAM, 2011.
19. Philippe Flajolet and G. Nigel Martin. Probabilistic counting algorithms for database applications. Journal of Computer and System Sciences, 31(2):182-209, 1985.
20. Badih Ghazi, Ben Kreuter, Ravi Kumar, Pasin Manurangsi, Jiayu Peng, Evgeny Skvortsov, Yao Wang, and Craig Wright. Multiparty reach and frequency histogram: Private, secure, and practical. Proceedings on Privacy Enhancing Technologies, 2022:373-395, January 2022. URL: https://doi.org/10.2478/popets-2022-0019.
21. Peter J. Huber. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1):73-101, 1964. URL: https://doi.org/10.1214/aoms/1177703732.
22. Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM), 53(3):307-323, 2006.
23. Piotr Indyk and David Woodruff. Optimal approximations of the frequency moments of data streams. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 202-208, 2005.
24. Rajesh Jayaram, David P. Woodruff, and Samson Zhou. Truly perfect samplers for data streams and sliding windows. In Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS '22, pages 29-40, New York, NY, USA, 2022. Association for Computing Machinery. URL: https://doi.org/10.1145/3517804.3524139.
25. Praneeth Kacham, Rasmus Pagh, Mikkel Thorup, and David P. Woodruff. Pseudorandom hashing for space-bounded computation with applications to streaming. In Proceedings of the 64th Annual Symposium on Foundations of Computer Science (FOCS), 2023.
26. Daniel M Kane, Jelani Nelson, and David P Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 41-52, 2010.
27. Vijay Karamcheti, Davi Geiger, Zvi Kedem, and S Muthukrishnan. Detecting malicious network traffic using inverse distributions of packet contents. In Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data, pages 165-170, 2005.
28. Robert H. Morris. Counting large numbers of events in small registers. Communications of the ACM, 21(10):840-842, 1978.
29. Jelani Nelson and Huacheng Yu. Optimal bounds for approximate counting. In Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS '22, pages 119-127, New York, NY, USA, 2022. Association for Computing Machinery.
30. William J. J. Rey. Introduction to Robust and Quasi-Robust Statistical Methods. Universitext. Springer, Berlin, Heidelberg, 1983.
31. Gregory Valiant and Paul Valiant. Estimating the unseen: improved estimators for entropy and other properties. Journal of the ACM (JACM), 64(6):1-41, 2017.