Document Open Access Logo

Towards Optimal Moment Estimation in Streaming and Distributed Models

Authors Rajesh Jayaram, David P. Woodruff



PDF
Thumbnail PDF

File

LIPIcs.APPROX-RANDOM.2019.29.pdf
  • Filesize: 0.62 MB
  • 21 pages

Document Identifiers

Author Details

Rajesh Jayaram
  • Carnegie Mellon University, Pittsburgh, PA, USA
David P. Woodruff
  • Carnegie Mellon University, Pittsburgh, PA, USA

Cite AsGet BibTex

Rajesh Jayaram and David P. Woodruff. Towards Optimal Moment Estimation in Streaming and Distributed Models. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 145, pp. 29:1-29:21, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.APPROX-RANDOM.2019.29

Abstract

One of the oldest problems in the data stream model is to approximate the p-th moment ||X||_p^p = sum_{i=1}^n X_i^p of an underlying non-negative vector X in R^n, which is presented as a sequence of poly(n) updates to its coordinates. Of particular interest is when p in (0,2]. Although a tight space bound of Theta(epsilon^-2 log n) bits is known for this problem when both positive and negative updates are allowed, surprisingly there is still a gap in the space complexity of this problem when all updates are positive. Specifically, the upper bound is O(epsilon^-2 log n) bits, while the lower bound is only Omega(epsilon^-2 + log n) bits. Recently, an upper bound of O~(epsilon^-2 + log n) bits was obtained under the assumption that the updates arrive in a random order. We show that for p in (0, 1], the random order assumption is not needed. Namely, we give an upper bound for worst-case streams of O~(epsilon^-2 + log n) bits for estimating |X |_p^p. Our techniques also give new upper bounds for estimating the empirical entropy in a stream. On the other hand, we show that for p in (1,2], in the natural coordinator and blackboard distributed communication topologies, there is an O~(epsilon^-2) bit max-communication upper bound based on a randomized rounding scheme. Our protocols also give rise to protocols for heavy hitters and approximate matrix product. We generalize our results to arbitrary communication topologies G, obtaining an O~(epsilon^2 log d) max-communication upper bound, where d is the diameter of G. Interestingly, our upper bound rules out natural communication complexity-based approaches for proving an Omega(epsilon^-2 log n) bit lower bound for p in (1,2] for streaming algorithms. In particular, any such lower bound must come from a topology with large diameter.

Subject Classification

ACM Subject Classification
  • Theory of computation → Streaming, sublinear and near linear time algorithms
Keywords
  • Streaming
  • Sketching
  • Message Passing
  • Moment Estimation

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pages 20-29. ACM, 1996. Google Scholar
  2. Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming algorithms from precision sampling. arXiv preprint, 2010. URL: http://arxiv.org/abs/1011.1263.
  3. Chrisil Arackaparambil, Joshua Brody, and Amit Chakrabarti. Functional monitoring without monotonicity. In International Colloquium on Automata, Languages, and Programming, pages 95-106. Springer, 2009. Google Scholar
  4. Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 1-16. ACM, 2002. Google Scholar
  5. Maria Florina Balcan, Yingyu Liang, Le Song, David Woodruff, and Bo Xie. Communication efficient distributed kernel principal component analysis. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 725-734. ACM, 2016. Google Scholar
  6. Ziv Bar-Yossef, Thathachar S Jayram, Ravi Kumar, and D Sivakumar. An information statistics approach to data stream and communication complexity. Journal of Computer and System Sciences, 68(4):702-732, 2004. Google Scholar
  7. Jarosław Błasiok, Jian Ding, and Jelani Nelson. Continuous monitoring of lp norms in data streams. arXiv preprint, 2017. URL: http://arxiv.org/abs/1704.06710.
  8. Mark Braverman, Faith Ellen, Rotem Oshman, Toniann Pitassi, and Vinod Vaikuntanathan. A tight bound for set disjointness in the message-passing model. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 668-677. IEEE, 2013. Google Scholar
  9. Vladimir Braverman, Stephen R Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang, and David P Woodruff. BPTree: an L2 heavy hitters algorithm using constant memory. arXiv preprint, 2016. URL: http://arxiv.org/abs/1603.00759.
  10. Vladimir Braverman, Stephen R Chestnut, David P Woodruff, and Lin F Yang. Streaming space complexity of nearly all functions of one variable on frequency vectors. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 261-276. ACM, 2016. Google Scholar
  11. Vladimir Braverman, Jonathan Katzman, Charles Seidell, and Gregory Vorsanger. An Optimal Algorithm for Large Frequency Moments Using O n^(1-2/k) Bits. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2014). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2014. Google Scholar
  12. Vladimir Braverman and Rafail Ostrovsky. Recursive sketching for frequency moments. arXiv preprint, 2010. URL: http://arxiv.org/abs/1011.2571.
  13. Vladimir Braverman, Emanuele Viola, David Woodruff, and Lin F Yang. Revisiting frequency moment estimation in random order streams. arXiv preprint, 2018. URL: http://arxiv.org/abs/1803.02270.
  14. Amit Chakrabarti, Graham Cormode, and Andrew McGregor. A near-optimal algorithm for estimating the entropy of a stream. ACM Transactions on Algorithms (TALG), 6(3):51, 2010. Google Scholar
  15. Amit Chakrabarti, Graham Cormode, and Andrew McGregor. Robust Lower Bounds for Communication and Stream Computation. Theory of Computing, 12(1):1-35, 2016. Google Scholar
  16. Amit Chakrabarti, Subhash Khot, and Xiaodong Sun. Near-optimal lower bounds on the multi-party communication complexity of set disjointness. In 18th IEEE Annual Conference on Computational Complexity, 2003. Proceedings., pages 107-117. IEEE, 2003. Google Scholar
  17. Amit Chakrabarti and Oded Regev. An optimal lower bound on the communication complexity of gap-hamming-distance. SIAM Journal on Computing, 41(5):1299-1317, 2012. Google Scholar
  18. Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Automata, languages and programming, pages 784-784, 2002. Google Scholar
  19. Arkadev Chattopadhyay, Jaikumar Radhakrishnan, and Atri Rudra. Topology matters in communication. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 631-640. IEEE, 2014. Google Scholar
  20. Jiecao Chen, He Sun, David Woodruff, and Qin Zhang. Communication-optimal distributed clustering. In Advances in Neural Information Processing Systems, pages 3727-3735, 2016. Google Scholar
  21. Peter Clifford and Ioana Cosma. A simple sketching algorithm for entropy estimation over streaming data. In Artificial Intelligence and Statistics, pages 196-206, 2013. Google Scholar
  22. Graham Cormode, Piotr Indyk, Nick Koudas, and S Muthukrishnan. Fast mining of massive tabular data via approximate distance computations. In Proceedings 18th International Conference on Data Engineering, pages 605-614. IEEE, 2002. Google Scholar
  23. Graham Cormode and Hossein Jowhari. L p Samplers and Their Applications: A Survey. ACM Computing Surveys (CSUR), 52(1):16, 2019. Google Scholar
  24. Graham Cormode, S Muthukrishnan, and Irina Rozenbaum. Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In Proceedings of the 31st international conference on Very large data bases, pages 25-36. VLDB Endowment, 2005. Google Scholar
  25. Graham Cormode, S Muthukrishnan, and Ke Yi. Algorithms for distributed functional monitoring. ACM Transactions on Algorithms (TALG), 7(2):21, 2011. Google Scholar
  26. Philippe Flajolet. Approximate counting: a detailed analysis. BIT Numerical Mathematics, 25(1):113-134, 1985. Google Scholar
  27. Phillip B Gibbons and Yossi Matias. New sampling-based summary statistics for improving approximate query answers. In ACM SIGMOD Record, volume 27, pages 331-342. ACM, 1998. Google Scholar
  28. Phillip B Gibbons, Yossi Matias, and Viswanath Poosala. Fast Incremental Maintenance of Approximate Histograms. In Proceedings of the 23rd International Conference on Very Large Data Bases, pages 466-475. Morgan Kaufmann Publishers Inc., 1997. Google Scholar
  29. Anna C Gilbert, Yannis Kotidis, S Muthukrishnan, and Martin J Strauss. How to summarize the universe: Dynamic maintenance of quantiles. In VLDB'02: Proceedings of the 28th International Conference on Very Large Databases, pages 454-465. Elsevier, 2002. Google Scholar
  30. Andre Gronemeier. Asymptotically optimal lower bounds on the nih-multi-party information. arXiv preprint, 2009. URL: http://arxiv.org/abs/0902.1609.
  31. Nicholas JA Harvey, Jelani Nelson, and Krzysztof Onak. Sketching and streaming entropy via approximation theory. In 2008 49th Annual IEEE Symposium on Foundations of Computer Science, pages 489-498. IEEE, 2008. Google Scholar
  32. Nicholas JA Harvey, Jelani Nelson, and Krzysztof Onak. Streaming algorithms for estimating entropy. In 2008 IEEE Information Theory Workshop, pages 227-231. IEEE, 2008. Google Scholar
  33. Ling Huang, XuanLong Nguyen, Minos Garofalakis, Joseph M Hellerstein, Michael I Jordan, Anthony D Joseph, and Nina Taft. Communication-efficient online detection of network-wide anomalies. In INFOCOM 2007. 26th IEEE International Conference on Computer Communications. IEEE, pages 134-142. IEEE, 2007. Google Scholar
  34. Zengfeng Huang, Ke Yi, and Qin Zhang. Randomized algorithms for tracking distributed count, frequencies, and ranks. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems, pages 295-306. ACM, 2012. Google Scholar
  35. Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM), 53(3):307-323, 2006. Google Scholar
  36. Piotr Indyk and David Woodruff. Optimal approximations of the frequency moments of data streams. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 202-208. ACM, 2005. Google Scholar
  37. Rajesh Jayaram, Gokarna Sharma, Srikanta Tirthapura, and David P. Woodruff. Weighted Reservoir Sampling from Distributed Streams. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, SIGMOD/PODS '19, 2019. Google Scholar
  38. Rajesh Jayaram and David P Woodruff. Perfect lp sampling in a data stream. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 544-555. IEEE, 2018. Google Scholar
  39. Thathachar S Jayram, Ravi Kumar, and D Sivakumar. The One-Way Communication Complexity of Hamming Distance. Theory of Computing, 4(1):129-135, 2008. Google Scholar
  40. Thathachar S Jayram and David P Woodruff. The data stream space complexity of cascaded norms. In 2009 50th Annual IEEE Symposium on Foundations of Computer Science, pages 765-774. IEEE, 2009. Google Scholar
  41. TS Jayram. Hellinger strikes back: A note on the multi-party information complexity of AND. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 562-573. Springer, 2009. Google Scholar
  42. Hossein Jowhari, Mert Sağlam, and Gábor Tardos. Tight Bounds for Lp Samplers, Finding Duplicates in Streams, and Related Problems. In Proceedings of the Thirtieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '11, pages 49-58, New York, NY, USA, 2011. ACM. URL: https://doi.org/10.1145/1989284.1989289.
  43. Daniel M Kane and Jelani Nelson. Sparser johnson-lindenstrauss transforms. Journal of the ACM (JACM), 61(1):4, 2014. Google Scholar
  44. Daniel M Kane, Jelani Nelson, Ely Porat, and David P Woodruff. Fast moment estimation in data streams in optimal space. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 745-754. ACM, 2011. Google Scholar
  45. Daniel M Kane, Jelani Nelson, and David P Woodruff. On the exact space complexity of sketching and streaming small norms. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 1161-1178. SIAM, 2010. Google Scholar
  46. Michael Kapralov, Jelani Nelson, Jakub Pachocki, Zhengyu Wang, David P Woodruff, and Mobin Yahyazadeh. Optimal lower bounds for universal relation, and for samplers and finding duplicates in streams. arXiv preprint, 2017. URL: http://arxiv.org/abs/1704.00633.
  47. Ping Li and Cun-Hui Zhang. A new algorithm for compressed counting with applications in shannon entropy estimation in dynamic data. In Proceedings of the 24th Annual Conference on Learning Theory, pages 477-496, 2011. Google Scholar
  48. Yi Li and David P Woodruff. A tight lower bound for high frequency moment estimation with small error. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 623-638. Springer, 2013. Google Scholar
  49. Andrew McGregor, A Pavan, Srikanta Tirthapura, and David P Woodruff. Space-Efficient Estimation of Statistics Over Sub-Sampled Streams. Algorithmica, 74(2):787-811, 2016. Google Scholar
  50. Morteza Monemizadeh and David P Woodruff. 1-pass relative-error lp-sampling with applications. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 1143-1160. SIAM, 2010. Google Scholar
  51. Robert Morris. Counting large numbers of events in small registers. Communications of the ACM, 21(10):840-842, 1978. Google Scholar
  52. Shanmugavelayutham Muthukrishnan et al. Data streams: Algorithms and applications. Foundations and Trendsregistered in Theoretical Computer Science, 1(2):117-236, 2005. Google Scholar
  53. Noam Nisan. Pseudorandom generators for space-bounded computation. Combinatorica, 12(4):449-461, 1992. Google Scholar
  54. J. P. Nolan. Stable Distributions - Models for Heavy Tailed Data. Birkhauser, Boston, 2018. In progress, Chapter 1 online at http://fs2.american.edu/jpnolan/www/stable/stable.html. Google Scholar
  55. Frank Olken. Random sampling from databases. PhD thesis, University of California, Berkeley, 1993. Google Scholar
  56. Srikanta Tirthapura and David P Woodruff. Optimal random sampling from distributed streams revisited. In International Symposium on Distributed Computing, pages 283-297. Springer, 2011. Google Scholar
  57. Omri Weinstein and David P Woodruff. The simultaneous communication of disjointness with applications to data streams. In International Colloquium on Automata, Languages, and Programming, pages 1082-1093. Springer, 2015. Google Scholar
  58. David Woodruff. Optimal space lower bounds for all frequency moments. In Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, pages 167-175. Society for Industrial and Applied Mathematics, 2004. Google Scholar
  59. David P Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trendsregistered in Theoretical Computer Science, 10(1-2):1-157, 2014. Google Scholar
  60. David P Woodruff and Qin Zhang. Tight bounds for distributed functional monitoring. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 941-960. ACM, 2012. Google Scholar
  61. David P Woodruff and Qin Zhang. Distributed Statistical Estimation of Matrix Products with Applications. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 383-394. ACM, 2018. Google Scholar
  62. David P Woodruff and Peilin Zhong. Distributed low rank approximation of implicit functions of a matrix. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 847-858. IEEE, 2016. Google Scholar
  63. Ke Yi and Qin Zhang. Optimal tracking of distributed heavy hitters and quantiles. Algorithmica, 65(1):206-223, 2013. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail