Stochastic Streams: Sample Complexity vs. Space Complexity

Authors Michael Crouch, Andrew McGregor, Gregory Valiant, David P. Woodruff



PDF
Thumbnail PDF

File

LIPIcs.ESA.2016.32.pdf
  • Filesize: 489 kB
  • 15 pages

Document Identifiers

Author Details

Michael Crouch
Andrew McGregor
Gregory Valiant
David P. Woodruff

Cite As Get BibTex

Michael Crouch, Andrew McGregor, Gregory Valiant, and David P. Woodruff. Stochastic Streams: Sample Complexity vs. Space Complexity. In 24th Annual European Symposium on Algorithms (ESA 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 57, pp. 32:1-32:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016) https://doi.org/10.4230/LIPIcs.ESA.2016.32

Abstract

We address the trade-off between the computational resources needed to process a large data set and the number of samples available from the data set. Specifically, we consider the following abstraction: we receive a potentially infinite stream of IID samples from some unknown distribution D, and  are tasked with computing some function f(D). If the stream is observed for time t, how much memory, s, is required to estimate f(D)? We refer to t as the sample complexity and s as the space complexity. The main focus of this paper is investigating the trade-offs between the space and  sample complexity. We study these trade-offs for several canonical problems studied in the data stream model: estimating the collision probability, i.e., the second moment of a distribution, deciding if a graph is connected, and approximating the dimension of an unknown subspace. Our results are based on techniques for simulating different classical sampling procedures in this model, emulating random walks given a sequence of IID samples, as well as leveraging a characterization between communication bounded protocols and statistical query algorithms.

Subject Classification

Keywords
  • data streams
  • sample complexity
  • frequency moments
  • graph connectivity
  • subspace approximation

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. The complexity of estimating rényi entropy. CoRR, abs/1408.1000, 2014. Google Scholar
  2. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137-147, 1999. Google Scholar
  3. Alexandr Andoni, Andrew McGregor, Krzysztof Onak, and Rina Panigrahy. Better bounds for frequency moments in random-order streams. CoRR, abs/0808.2222, 2008. Google Scholar
  4. Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. Sampling algorithms: lower bounds and applications. In STOC, pages 266-275, 2001. URL: http://dx.doi.org/10.1145/380752.380810.
  5. Greg Barnes and Uriel Feige. Short Random Walks on Graphs. SIAM Journal on Discrete Mathematics, 9(1):19, 1996. URL: http://dx.doi.org/10.1137/S0895480194264988.
  6. Marc Bury and Chris Schwiegelshohn. Sublinear estimation of weighted matchings in dynamic data streams. CoRR, abs/1505.02019, 2015. URL: http://arxiv.org/abs/1505.02019.
  7. Amit Chakrabarti, Graham Cormode, and Andrew McGregor. Robust lower bounds for communication and stream computation. In STOC, pages 641-650, 2008. URL: http://dx.doi.org/10.1145/1374376.1374470.
  8. Amit Chakrabarti, T. S. Jayram, and Mihai Patrascu. Tight lower bounds for selection in randomly ordered streams. In SODA, pages 720-729, 2008. URL: http://dx.doi.org/10.1145/1347082.1347161.
  9. Steve Chien, Katrina Ligett, and Andrew McGregor. Space-efficient estimation of robust statistics and distribution testing. In ICS, pages 251-265, 2010. Google Scholar
  10. Kenneth L. Clarkson and David P. Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC 2009, Bethesda, MD, USA, May 31 - June 2, 2009, pages 205-214, 2009. Google Scholar
  11. Thomas M. Cover, Michael A. Freedman, and Martin E. Hellman. Optimal finite memory learning algorithms for the finite sample problem. Information and Control, 30(1):49-85, 1976. Google Scholar
  12. Klim Efremenko and Omer Reingold. How Well Do Random Walks Parallelize? In APPROX-RANDOM, volume 5687, pages 476-489, Berlin, Heidelberg, 2009. URL: http://dx.doi.org/10.1007/978-3-642-03685-9.
  13. Uriel Feige. A fast randomized logspace algorithm for graph connectivity. Theor. Comput. Sci., 169(2):147-160, 1996. URL: http://dx.doi.org/10.1016/S0304-3975(96)00118-1.
  14. Uriel Feige. A Spectrum of Time-Space Trade-offs for Undirected s-t Connectivity. Journal of Computer and System Sciences, 54(2):305-316, April 1997. URL: http://dx.doi.org/10.1006/jcss.1997.1471.
  15. R. A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 222:309-368, 1922. Google Scholar
  16. I.J. Good. Surprise indexes and p-values. J. Statistical Computation and Simulation, 32:90-92, 1989. Google Scholar
  17. Michael Greenwald and Sanjeev Khanna. Efficient online computation of quantile summaries. In ACM International Conference on Management of Data, pages 58-66, 2001. URL: http://dx.doi.org/10.1145/375663.375670.
  18. Sudipto Guha and Zhiyi Huang. Revisiting the direct sum theorem and space lower bounds in random order streams. In ICALP, pages 513-524, 2009. URL: http://dx.doi.org/10.1007/978-3-642-02927-1_43.
  19. Sudipto Guha and Andrew McGregor. Space-efficient sampling. In AISTATS, pages 169-176, 2007. Google Scholar
  20. Sudipto Guha and Andrew McGregor. Stream order and order statistics: Quantile estimation in random-order streams. SIAM Journal on Computing, 38(5):2044-2059, 2009. URL: http://dx.doi.org/10.1137/07069328X.
  21. Martin E Hellman and Thomas M Cover. Learning with finite memory. The Annals of Mathematical Statistics, pages 765-782, 1970. Google Scholar
  22. Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In PODS, pages 41-52, 2010. URL: http://dx.doi.org/10.1145/1807085.1807094.
  23. Jack Koplowitz. Necessary and sufficient memory size for m-hypothesis testing. Information Theory, IEEE Transactions on, 21(1):44-46, 1975. Google Scholar
  24. Frank Thomson Leighton and Ronald L. Rivest. Estimating a probability using finite memory. IEEE Trans. Information Theory, 32(6):733-742, 1986. Google Scholar
  25. Yi Li, Huy L. Nguyen, and David P. Woodruff. On sketching matrix norms and the top singular vector. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014, pages 1562-1581, 2014. Google Scholar
  26. Andrew McGregor, A. Pavan, Srikanta Tirthapura, and David P. Woodruff. Space-efficient estimation of statistics over sub-sampled streams. In PODS, pages 273-282, 2012. URL: http://dx.doi.org/10.1145/2213556.2213594.
  27. Andrew McGregor and Paul Valiant. The shifting sands algorithm. In ACM-SIAM Symposium on Discrete Algorithms, pages 453-458, 2012. Google Scholar
  28. Michael Mitzenmacher and Eli Upfal. Probability and computing - randomized algorithms and probabilistic analysis. Cambridge University Press, 2005. Google Scholar
  29. J. Ian Munro and Mike Paterson. Selection and sorting with limited storage. Theor. Comput. Sci., 12:315-323, 1980. Google Scholar
  30. S. Muthukrishnan. Stochastic data streams. In MFCS, page 55, 2009. URL: http://dx.doi.org/10.1007/978-3-642-03816-7_5.
  31. AS Nemirovsky and DB Yudin. Problem Complexity and Method Efficiency in Optimization. J. Wiley @ Sons, New York, 1983. Google Scholar
  32. Yurii Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103(1):127-152, 2005. Google Scholar
  33. Piyush Rai, Hal Daumé III, and Suresh Venkatasubramanian. Streamed learning: One-pass SVMs. In IJCAI, pages 1211-1216, 2009. Google Scholar
  34. Ran Raz. Fast learning requires good memory: A time-space lower bound for parity learning. CoRR, abs/1602.05161, 2016. Google Scholar
  35. Omer Reingold. Undirected connectivity in log-space. Journal of the ACM, 55(4):1-24, September 2008. URL: http://dx.doi.org/10.1145/1391289.1391291.
  36. Atri Rudra and Steve Uurtamo. Data stream algorithms for codeword testing. In Automata, Languages and Programming, pages 629-640. Springer, 2010. Google Scholar
  37. Florin Rusu and Alin Dobra. Sketching sampled data streams. In ICDE, pages 381-392, 2009. URL: http://dx.doi.org/10.1109/ICDE.2009.31.
  38. Ohad Shamir. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 163-171, 2014. Google Scholar
  39. J. Steinhardt, G. Valiant, and S. Wager. Memory, communication, and statistical queries. Manuscript, 2015. Google Scholar
  40. TheJournal.ie. 50 billion instant messages expected to be sent each day in 2014. URL: http://businessetc.thejournal.ie/sms-messaging-falls-1262017-Jan2014/.
  41. David P. Woodruff. The average-case complexity of counting distinct elements. In ICDT, pages 284-295, 2009. URL: http://dx.doi.org/10.1145/1514894.1514928.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail