Document Open Access Logo

Nearly Optimal Distinct Elements and Heavy Hitters on Sliding Windows

Authors Vladimir Braverman, Elena Grigorescu, Harry Lang, David P. Woodruff, Samson Zhou



PDF
Thumbnail PDF

File

LIPIcs.APPROX-RANDOM.2018.7.pdf
  • Filesize: 0.62 MB
  • 22 pages

Document Identifiers

Author Details

Vladimir Braverman
  • Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Elena Grigorescu
  • Department of Computer Science, Purdue University, West Lafayette, IN, USA
Harry Lang
  • Department of Mathematics, Johns Hopkins University, Baltimore, MD, USA
David P. Woodruff
  • School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Samson Zhou
  • Department of Computer Science, Purdue University, West Lafayette, IN, USA

Cite AsGet BibTex

Vladimir Braverman, Elena Grigorescu, Harry Lang, David P. Woodruff, and Samson Zhou. Nearly Optimal Distinct Elements and Heavy Hitters on Sliding Windows. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 116, pp. 7:1-7:22, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.APPROX-RANDOM.2018.7

Abstract

We study the distinct elements and l_p-heavy hitters problems in the sliding window model, where only the most recent n elements in the data stream form the underlying set. We first introduce the composable histogram, a simple twist on the exponential (Datar et al., SODA 2002) and smooth histograms (Braverman and Ostrovsky, FOCS 2007) that may be of independent interest. We then show that the composable histogram{} along with a careful combination of existing techniques to track either the identity or frequency of a few specific items suffices to obtain algorithms for both distinct elements and l_p-heavy hitters that are nearly optimal in both n and epsilon. Applying our new composable histogram framework, we provide an algorithm that outputs a (1+epsilon)-approximation to the number of distinct elements in the sliding window model and uses O{1/(epsilon^2) log n log (1/epsilon)log log n+ (1/epsilon) log^2 n} bits of space. For l_p-heavy hitters, we provide an algorithm using space O{(1/epsilon^p) log^2 n (log^2 log n+log 1/epsilon)} for 0<p <=2, improving upon the best-known algorithm for l_2-heavy hitters (Braverman et al., COCOON 2014), which has space complexity O{1/epsilon^4 log^3 n}. We also show complementing nearly optimal lower bounds of Omega ((1/epsilon) log^2 n+(1/epsilon^2) log n) for distinct elements and Omega ((1/epsilon^p) log^2 n) for l_p-heavy hitters, both tight up to O{log log n} and O{log 1/epsilon} factors.

Subject Classification

ACM Subject Classification
  • Theory of computation → Streaming, sublinear and near linear time algorithms
Keywords
  • Streaming algorithms
  • sliding windows
  • heavy hitters
  • distinct elements

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137-147, 1999. A preliminary version appeared in the Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing (STOC), 1996. Google Scholar
  2. Arvind Arasu and Gurmeet Singh Manku. Approximate counts and quantiles over sliding windows. In Proceedings of the Twenty-third ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 286-296, 2004. Google Scholar
  3. Brian Babcock, Mayur Datar, Rajeev Motwani, and Liadan O'Callaghan. Maintaining variance and k-medians over data stream windows. In Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), pages 234-243, 2003. Google Scholar
  4. Nagender Bandi, Divyakant Agrawal, and Amr El Abbadi. Fast algorithms for heavy distinct hitters using associative memories. In 27th IEEE International Conference on Distributed Computing Systems (ICDCS), page 6, 2007. Google Scholar
  5. Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. J. Comput. Syst. Sci., 68(4):702-732, 2004. A preliminary version appeared in the Proceedings of the 43rd Symposium on Foundations of Computer Science (FOCS), 2002. Google Scholar
  6. Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In Randomization and Approximation Techniques, 6th International Workshop, RANDOM, Proceedings, pages 1-10, 2002. Google Scholar
  7. Ran Ben-Basat, Gil Einziger, Roy Friedman, and Yaron Kassner. Heavy hitters in streams and sliding windows. In 35th Annual IEEE International Conference on Computer Communications, INFOCOM, pages 1-9, 2016. Google Scholar
  8. Radu Berinde, Piotr Indyk, Graham Cormode, and Martin J. Strauss. Space-optimal heavy hitters with strong error bounds. ACM Trans. Database Syst., 35(4):26:1-26:28, 2010. A preliminary version appeared in the Proceedings of the Twenty-Eigth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2009. Google Scholar
  9. Jaroslaw Blasiok. Optimal streaming and tracking distinct elements with high probability. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 2432-2448, 2018. Google Scholar
  10. Jaroslaw Blasiok, Jian Ding, and Jelani Nelson. Continuous monitoring of 𝓁_p norms in data streams. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM, pages 32:1-32:13, 2017. Google Scholar
  11. Vladimir Braverman. Sliding window algorithms, 2016. Google Scholar
  12. Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang, and David P. Woodruff. Bptree: An 𝓁_2 heavy hitters algorithm using constant memory. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS, pages 361-376, 2017. Google Scholar
  13. Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, and David P. Woodruff. Beating countsketch for heavy hitters in insertion streams. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC, pages 740-753, 2016. Google Scholar
  14. Vladimir Braverman, Petros Drineas, Jalaj Upadhyay, and Samson Zhou. Numerical linear algebra in the sliding window model. CoRR, abs/1805.03765, 2018. URL: http://arxiv.org/abs/1805.03765.
  15. Vladimir Braverman, Ran Gelles, and Rafail Ostrovsky. How to catch 𝓁_2-heavy-hitters on sliding windows. Theor. Comput. Sci., 554:82-94, 2014. A preliminary version appeared in the Proceedings of Computing and Combinatorics, 19th International Conference (COCOON), 2013. Google Scholar
  16. Vladimir Braverman, Elena Grigorescu, Harry Lang, David P. Woodruff, and Samson Zhou. Nearly optimal distinct elements and heavy hitters on sliding windows. CoRR, abs/1805.00212, 2018. URL: http://arxiv.org/abs/1805.00212.
  17. Vladimir Braverman, Harry Lang, Keith Levin, and Morteza Monemizadeh. Clustering on sliding windows in polylogarithmic space. In 35th IARCS Annual Conference on Foundation of Software Technology and Theoretical Computer Science, FSTTCS, pages 350-364, 2015. Google Scholar
  18. Vladimir Braverman, Harry Lang, Keith Levin, and Morteza Monemizadeh. Clustering problems on sliding windows. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 1374-1390, 2016. Google Scholar
  19. Vladimir Braverman and Rafail Ostrovsky. Smooth histograms for sliding windows. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS) Proceedings, pages 283-293, 2007. Google Scholar
  20. Vladimir Braverman, Rafail Ostrovsky, and Alan Roytman. Zero-one laws for sliding windows and universal sketches. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM, pages 573-590, 2015. Google Scholar
  21. Yousra Chabchoub, Christine Fricker, and Hanene Mohamed. Analysis of a bloom filter algorithm via the supermarket model. In 21st International Teletraffic Congress, ITC, pages 1-8, 2009. Google Scholar
  22. Amit Chakrabarti, Graham Cormode, and Andrew McGregor. A near-optimal algorithm for estimating the entropy of a stream. ACM Trans. Algorithms, 6(3):51:1-51:21, 2010. Google Scholar
  23. Amit Chakrabarti, Subhash Khot, and Xiaodong Sun. Near-optimal lower bounds on the multi-party communication complexity of set disjointness. In 18th Annual IEEE Conference on Computational Complexity, pages 107-117, 2003. Google Scholar
  24. Amit Chakrabarti and Oded Regev. An optimal lower bound on the communication complexity of gap-hamming-distance. SIAM J. Comput., 41(5):1299-1317, 2012. A preliminary version appeared in the Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011. Google Scholar
  25. Timothy M. Chan and Bashir S. Sadjad. Geometric optimization problems over sliding windows. Int. J. Comput. Geometry Appl., 16(2-3):145-158, 2006. A preliminary version appeared in the Proceedings of Algorithms and Computation, 15th International Symposium (ISAAC), 2004. Google Scholar
  26. Moses Charikar, Kevin C. Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3-15, 2004. A preliminary version appeared in the Proceedings of the Automata, Languages and Programming, 29th International Colloquium (ICALP), 2002. Google Scholar
  27. Jiecao Chen, Huy L. Nguyen, and Qin Zhang. Submodular maximization over sliding windows. CoRR, abs/1611.00129, 2016. Google Scholar
  28. Yun Chi, Haixun Wang, Philip S. Yu, and Richard R. Muntz. Catch the moment: maintaining closed frequent itemsets over a data stream sliding window. Knowl. Inf. Syst., 10(3):265-294, 2006. A preliminary version appeared in the Proceedings of the 4th IEEE International Conference on Data Mining (ICDM), 2004. Google Scholar
  29. Graham Cormode. The continuous distributed monitoring model. SIGMOD Record, 42(1):5-14, 2013. Google Scholar
  30. Graham Cormode and Minos N. Garofalakis. Streaming in a connected world: querying and tracking distributed data streams. In EDBT, page 745, 2008. Google Scholar
  31. Graham Cormode, Flip Korn, S. Muthukrishnan, and Divesh Srivastava. Finding hierarchical heavy hitters in streaming data. TKDD, 1(4):2:1-2:48, 2008. Google Scholar
  32. Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58-75, 2005. A preliminary version appeared in the Proceedings of the 6th Latin American Symposium (LATIN), 2004. Google Scholar
  33. Graham Cormode and S. Muthukrishnan. What’s new: finding significant differences in network data streams. IEEE/ACM Transactions on Networking, 13(6):1219-1232, 2005. Google Scholar
  34. Michael S. Crouch, Andrew McGregor, and Daniel Stubbs. Dynamic graphs in the sliding-window model. In Algorithms - ESA 2013 - 21st Annual European Symposium, Proceedings, pages 337-348, 2013. Google Scholar
  35. Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Maintaining stream statistics over sliding windows. SIAM J. Comput., 31(6):1794-1813, 2002. A preliminary version appeared in the Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2002. Google Scholar
  36. Mayur Datar and S. Muthukrishnan. Estimating rarity and similarity over data stream windows. In Algorithms - ESA 2002, 10th Annual European Symposium, Proceedings, pages 323-334, 2002. Google Scholar
  37. Erik D. Demaine, Alejandro López-Ortiz, and J. Ian Munro. Frequency estimation of internet packet streams with limited space. In Algorithms - ESA, 10th Annual European Symposium, Proceedings, pages 348-360, 2002. Google Scholar
  38. Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities (extended abstract). In Algorithms - ESA, 11th Annual European Symposium, Proceedings, pages 605-617, 2003. Google Scholar
  39. Alessandro Epasto, Silvio Lattanzi, Sergei Vassilvitskii, and Morteza Zadimoghaddam. Submodular optimization over sliding windows. In Proceedings of the 26th International Conference on World Wide Web, WWW, pages 421-430, 2017. Google Scholar
  40. Cristian Estan and George Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst., 21(3):270-313, 2003. Google Scholar
  41. Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, and Jeffrey D. Ullman. Computing iceberg queries efficiently. In VLDB'98, Proceedings of 24rd International Conference on Very Large Data Bases, pages 299-310, 1998. Google Scholar
  42. Joan Feigenbaum, Sampath Kannan, and Jian Zhang. Computing diameter in the streaming and sliding-window models. Algorithmica, 41(1):25-41, 2005. Google Scholar
  43. Philippe Flajolet, Eric Fusy, Olivier Gandouet, and Frederic Meunier. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In AofA: Analysis of Algorithms, page 137–156, 2007. Google Scholar
  44. Philippe Flajolet and G. Nigel Martin. Probabilistic counting. In 24th Annual Symposium on Foundations of Computer Science, pages 76-82, 1983. Google Scholar
  45. Phillip B. Gibbons and Srikanta Tirthapura. Estimating simple functions on the union of data streams. In SPAA, pages 281-291, 2001. Google Scholar
  46. Phillip B. Gibbons and Srikanta Tirthapura. Distributed streams algorithms for sliding windows. In SPAA, pages 63-72, 2002. Google Scholar
  47. Parikshit Gopalan and Jaikumar Radhakrishnan. Finding duplicates in a data stream. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 402-411, 2009. Google Scholar
  48. Nicholas J. A. Harvey, Jelani Nelson, and Krzysztof Onak. Sketching and streaming entropy via approximation theory. In 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS, pages 489-498, 2008. Google Scholar
  49. Regant Y. S. Hung and Hing-Fung Ting. Finding heavy hitters over the sliding window of a weighted data stream. In LATIN: Theoretical Informatics, 8th Latin American Symposium, Proceedings, pages 699-710, 2008. Google Scholar
  50. Piotr Indyk and David P. Woodruff. Tight lower bounds for the distinct elements problem. In 44th Symposium on Foundations of Computer Science (FOCS), pages 283-288, 2003. Google Scholar
  51. Piotr Indyk and David P. Woodruff. Optimal approximations of the frequency moments of data streams. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pages 202-208, 2005. Google Scholar
  52. Hossein Jowhari, Mert Saglam, and Gábor Tardos. Tight bounds for lp samplers, finding duplicates in streams, and related problems. In Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 49-58, 2011. Google Scholar
  53. Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS, pages 41-52, 2010. Google Scholar
  54. Abhishek Kumar and Jun (Jim) Xu. Sketch guided sampling - using on-line estimates of flow size for adaptive data collection. In INFOCOM 2006. 25th IEEE International Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies, 2006. Google Scholar
  55. Kasper Green Larsen, Jelani Nelson, Huy L. Nguyen, and Mikkel Thorup. Heavy hitters via cluster-preserving clustering. In IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS, pages 61-70, 2016. Google Scholar
  56. Lap-Kei Lee and H. F. Ting. A simpler and more efficient deterministic scheme for finding frequent items over sliding windows. In Proceedings of the Twenty-Fifth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 290-297, 2006. Google Scholar
  57. Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. PVLDB, 5(12):1699, 2012. A preliminary version appeared in the Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), 2002. Google Scholar
  58. Peter Bro Miltersen, Noam Nisan, Shmuel Safra, and Avi Wigderson. On data structures and asymmetric communication complexity. In Proceedings of the Twenty-Seventh Annual ACM Symposium on Theory of Computing, pages 103-111, 1995. Google Scholar
  59. Jayadev Misra and David Gries. Finding repeated elements. Sci. Comput. Program., 2(2):143-152, 1982. Google Scholar
  60. Morteza Monemizadeh and David P. Woodruff. 1-pass relative-error 𝓁_p-sampling with applications. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 1143-1160, 2010. Google Scholar
  61. Miles Osborne, Sean Moran, Richard McCreadie, Alexander Von Lunen, Martin Sykora, Elizabeth Cano, Neil Ireson, Craig MacDonald, Iadh Ounis, Yulan He, Tom Jackson, Fabio Ciravegna, and Ann O'Brien. Real-time detection, tracking and monitoring of automatically discovered events in social media. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014. Google Scholar
  62. Subhabrata Sen and Jia Wang. Analyzing peer-to-peer traffic across large networks. IEEE/ACM Trans. Netw., 12(2):219-232, 2004. Google Scholar
  63. Mikkel Thorup and Yin Zhang. Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation. SIAM J. Comput., 41(2):293-331, 2012. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail