Stochastic Streams: Sample Complexity vs. Space Complexity

Crouch, Michael; McGregor, Andrew; Valiant, Gregory; Woodruff, David P.

doi:10.4230/LIPIcs.ESA.2016.32

File

Subject Classification

Keywords

data streams
sample complexity
frequency moments
graph connectivity
subspace approximation

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

Document

0

Metadata

Abstract

We address the trade-off between the computational resources needed to process a large data set and the number of samples available from the data set. Specifically, we consider the following abstraction: we receive a potentially infinite stream of IID samples from some unknown distribution D, and are tasked with computing some function f(D). If the stream is observed for time t, how much memory, s, is required to estimate f(D)? We refer to t as the sample complexity and s as the space complexity. The main focus of this paper is investigating the trade-offs between the space and sample complexity. We study these trade-offs for several canonical problems studied in the data stream model: estimating the collision probability, i.e., the second moment of a distribution, deciding if a graph is connected, and approximating the dimension of an unknown subspace. Our results are based on techniques for simulating different classical sampling procedures in this model, emulating random walks given a sequence of IID samples, as well as leveraging a characterization between communication bounded protocols and statistical query algorithms.

Cite As Get BibTex

Michael Crouch, Andrew McGregor, Gregory Valiant, and David P. Woodruff. Stochastic Streams: Sample Complexity vs. Space Complexity. In 24th Annual European Symposium on Algorithms (ESA 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 57, pp. 32:1-32:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016) https://doi.org/10.4230/LIPIcs.ESA.2016.32

Author Details

Michael Crouch

Andrew McGregor

Gregory Valiant

David P. Woodruff

References

Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. The complexity of estimating rényi entropy. CoRR, abs/1408.1000, 2014.
Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137-147, 1999.
Alexandr Andoni, Andrew McGregor, Krzysztof Onak, and Rina Panigrahy. Better bounds for frequency moments in random-order streams. CoRR, abs/0808.2222, 2008.
Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. Sampling algorithms: lower bounds and applications. In STOC, pages 266-275, 2001. URL: http://dx.doi.org/10.1145/380752.380810.
Greg Barnes and Uriel Feige. Short Random Walks on Graphs. SIAM Journal on Discrete Mathematics, 9(1):19, 1996. URL: http://dx.doi.org/10.1137/S0895480194264988.
Marc Bury and Chris Schwiegelshohn. Sublinear estimation of weighted matchings in dynamic data streams. CoRR, abs/1505.02019, 2015. URL: http://arxiv.org/abs/1505.02019.
Amit Chakrabarti, Graham Cormode, and Andrew McGregor. Robust lower bounds for communication and stream computation. In STOC, pages 641-650, 2008. URL: http://dx.doi.org/10.1145/1374376.1374470.
Amit Chakrabarti, T. S. Jayram, and Mihai Patrascu. Tight lower bounds for selection in randomly ordered streams. In SODA, pages 720-729, 2008. URL: http://dx.doi.org/10.1145/1347082.1347161.
Steve Chien, Katrina Ligett, and Andrew McGregor. Space-efficient estimation of robust statistics and distribution testing. In ICS, pages 251-265, 2010.
Kenneth L. Clarkson and David P. Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC 2009, Bethesda, MD, USA, May 31 - June 2, 2009, pages 205-214, 2009.
Thomas M. Cover, Michael A. Freedman, and Martin E. Hellman. Optimal finite memory learning algorithms for the finite sample problem. Information and Control, 30(1):49-85, 1976.
Klim Efremenko and Omer Reingold. How Well Do Random Walks Parallelize? In APPROX-RANDOM, volume 5687, pages 476-489, Berlin, Heidelberg, 2009. URL: http://dx.doi.org/10.1007/978-3-642-03685-9.
Uriel Feige. A fast randomized logspace algorithm for graph connectivity. Theor. Comput. Sci., 169(2):147-160, 1996. URL: http://dx.doi.org/10.1016/S0304-3975(96)00118-1.
Uriel Feige. A Spectrum of Time-Space Trade-offs for Undirected s-t Connectivity. Journal of Computer and System Sciences, 54(2):305-316, April 1997. URL: http://dx.doi.org/10.1006/jcss.1997.1471.
R. A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 222:309-368, 1922.
I.J. Good. Surprise indexes and p-values. J. Statistical Computation and Simulation, 32:90-92, 1989.
Michael Greenwald and Sanjeev Khanna. Efficient online computation of quantile summaries. In ACM International Conference on Management of Data, pages 58-66, 2001. URL: http://dx.doi.org/10.1145/375663.375670.
Sudipto Guha and Zhiyi Huang. Revisiting the direct sum theorem and space lower bounds in random order streams. In ICALP, pages 513-524, 2009. URL: http://dx.doi.org/10.1007/978-3-642-02927-1_43.
Sudipto Guha and Andrew McGregor. Space-efficient sampling. In AISTATS, pages 169-176, 2007.
Sudipto Guha and Andrew McGregor. Stream order and order statistics: Quantile estimation in random-order streams. SIAM Journal on Computing, 38(5):2044-2059, 2009. URL: http://dx.doi.org/10.1137/07069328X.
Martin E Hellman and Thomas M Cover. Learning with finite memory. The Annals of Mathematical Statistics, pages 765-782, 1970.
Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In PODS, pages 41-52, 2010. URL: http://dx.doi.org/10.1145/1807085.1807094.
Jack Koplowitz. Necessary and sufficient memory size for m-hypothesis testing. Information Theory, IEEE Transactions on, 21(1):44-46, 1975.
Frank Thomson Leighton and Ronald L. Rivest. Estimating a probability using finite memory. IEEE Trans. Information Theory, 32(6):733-742, 1986.
Yi Li, Huy L. Nguyen, and David P. Woodruff. On sketching matrix norms and the top singular vector. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014, pages 1562-1581, 2014.
Andrew McGregor, A. Pavan, Srikanta Tirthapura, and David P. Woodruff. Space-efficient estimation of statistics over sub-sampled streams. In PODS, pages 273-282, 2012. URL: http://dx.doi.org/10.1145/2213556.2213594.
Andrew McGregor and Paul Valiant. The shifting sands algorithm. In ACM-SIAM Symposium on Discrete Algorithms, pages 453-458, 2012.
Michael Mitzenmacher and Eli Upfal. Probability and computing - randomized algorithms and probabilistic analysis. Cambridge University Press, 2005.
J. Ian Munro and Mike Paterson. Selection and sorting with limited storage. Theor. Comput. Sci., 12:315-323, 1980.
S. Muthukrishnan. Stochastic data streams. In MFCS, page 55, 2009. URL: http://dx.doi.org/10.1007/978-3-642-03816-7_5.
AS Nemirovsky and DB Yudin. Problem Complexity and Method Efficiency in Optimization. J. Wiley @ Sons, New York, 1983.
Yurii Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103(1):127-152, 2005.
Piyush Rai, Hal Daumé III, and Suresh Venkatasubramanian. Streamed learning: One-pass SVMs. In IJCAI, pages 1211-1216, 2009.
Ran Raz. Fast learning requires good memory: A time-space lower bound for parity learning. CoRR, abs/1602.05161, 2016.
Omer Reingold. Undirected connectivity in log-space. Journal of the ACM, 55(4):1-24, September 2008. URL: http://dx.doi.org/10.1145/1391289.1391291.
Atri Rudra and Steve Uurtamo. Data stream algorithms for codeword testing. In Automata, Languages and Programming, pages 629-640. Springer, 2010.
Florin Rusu and Alin Dobra. Sketching sampled data streams. In ICDE, pages 381-392, 2009. URL: http://dx.doi.org/10.1109/ICDE.2009.31.
Ohad Shamir. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 163-171, 2014.
J. Steinhardt, G. Valiant, and S. Wager. Memory, communication, and statistical queries. Manuscript, 2015.
TheJournal.ie. 50 billion instant messages expected to be sent each day in 2014. URL: http://businessetc.thejournal.ie/sms-messaging-falls-1262017-Jan2014/.
David P. Woodruff. The average-case complexity of counting distinct elements. In ICDT, pages 284-295, 2009. URL: http://dx.doi.org/10.1145/1514894.1514928.

Stochastic Streams: Sample Complexity vs. Space Complexity

Authors Michael Crouch, Andrew McGregor, Gregory Valiant, David P. Woodruff

File

Document Identifiers

Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message