We revisit the problem of estimating the profile (also known as the rarity) in the data stream model. Given a sequence of m elements from a universe of size n, its profile is a vector ϕ whose i-th entry ϕ_i represents the number of distinct elements that appear in the stream exactly i times. A classic paper by Datar and Muthukrishan from 2002 gave an algorithm which estimates any entry ϕ_i up to an additive error of ± ε D using O(1/ε² (log n + log m)) bits of space, where D is the number of distinct elements in the stream. In this paper, we considerably improve on this result by designing an algorithm which simultaneously estimates many coordinates of the profile vector ϕ up to small overall error. We give an algorithm which, with constant probability, produces an estimated profile ϕˆ with the following guarantees in terms of space and estimation error: b) For any constant τ, with O(1 / ε² + log n) bits of space, ∑_{i = 1}^τ |ϕ_i - ϕˆ_i| ≤ ε D. c) With O(1/ ε²log (1/ε) + log n + log log m) bits of space, ∑_{i = 1}^m |ϕ_i - ϕˆ_i| ≤ ε m. In addition to bounding the error across multiple coordinates, our space bounds separate the terms that depend on 1/ε and those that depend on n and m. We prove matching lower bounds on space in both regimes. Application of our profile estimation algorithm gives estimates within error ± ε D of several symmetric functions of frequencies in O(1/ε² + log n) bits. This generalizes space-optimal algorithms for the distinct elements problems to other problems including estimating the Huber and Tukey losses as well as frequency cap statistics.
@InProceedings{chen_et_al:LIPIcs.ITCS.2024.32, author = {Chen, Justin Y. and Indyk, Piotr and Woodruff, David P.}, title = {{Space-Optimal Profile Estimation in Data Streams with Applications to Symmetric Functions}}, booktitle = {15th Innovations in Theoretical Computer Science Conference (ITCS 2024)}, pages = {32:1--32:22}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-309-6}, ISSN = {1868-8969}, year = {2024}, volume = {287}, editor = {Guruswami, Venkatesan}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ITCS.2024.32}, URN = {urn:nbn:de:0030-drops-195605}, doi = {10.4230/LIPIcs.ITCS.2024.32}, annote = {Keywords: Streaming and Sketching Algorithms, Sublinear Algorithms} }
Feedback for Dagstuhl Publishing