Streaming Algorithms with Large Approximation Factors

Authors Yi Li , Honghao Lin, David P. Woodruff, Yuheng Zhang



PDF
Thumbnail PDF

File

LIPIcs.APPROX-RANDOM.2022.13.pdf
  • Filesize: 0.91 MB
  • 23 pages

Document Identifiers

Author Details

Yi Li
  • Division of Mathematical Sciences, Nanyang Technological University, Singapore
Honghao Lin
  • Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, USA
David P. Woodruff
  • Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, USA
Yuheng Zhang
  • Zhiyuan College, Shanghai Jiao Tong University, China

Cite AsGet BibTex

Yi Li, Honghao Lin, David P. Woodruff, and Yuheng Zhang. Streaming Algorithms with Large Approximation Factors. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 245, pp. 13:1-13:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.APPROX/RANDOM.2022.13

Abstract

We initiate a broad study of classical problems in the streaming model with insertions and deletions in the setting where we allow the approximation factor α to be much larger than 1. Such algorithms can use significantly less memory than the usual setting for which α = 1+ε for an ε ∈ (0,1). We study large approximations for a number of problems in sketching and streaming, assuming that the underlying n-dimensional vector has all coordinates bounded by M throughout the data stream: 1) For the 𝓁_p norm/quasi-norm, 0 < p ≤ 2, we show that obtaining a poly(n)-approximation requires the same amount of memory as obtaining an O(1)-approximation for any M = n^Θ(1), which holds even for randomly ordered streams or for streams in the bounded deletion model. 2) For estimating the 𝓁_p norm, p > 2, we show an upper bound of O(n^{1-2/p} (log n log M)/α²) bits for an α-approximation, and give a matching lower bound for linear sketches. 3) For the 𝓁₂-heavy hitters problem, we show that the known lower bound of Ω(k log nlog M) bits for identifying (1/k)-heavy hitters holds even if we are allowed to output items that are 1/(α k)-heavy, provided the algorithm succeeds with probability 1-O(1/n). We also obtain a lower bound for linear sketches that is tight even for constant failure probability algorithms. 4) For estimating the number 𝓁₀ of distinct elements, we give an n^{1/t}-approximation algorithm using O(tlog log M) bits of space, as well as a lower bound of Ω(t) bits, both excluding the storage of random bits, where n is the dimension of the underlying frequency vector and M is an upper bound on the magnitude of its coordinates. 5) For α-approximation to the Schatten-p norm, we give near-optimal Õ(n^{2-4/p}/α⁴) sketching dimension for every even integer p and every α ≥ 1, while for p not an even integer we obtain near-optimal sketching dimension once α = Ω(n^{1/q-1/p}), where q is the largest even integer less than p. The latter is surprising as it is unknown what the complexity of Schatten-p norm estimation is for constant approximation; we show once the approximation factor is at least n^{1/q-1/p}, we can obtain near-optimal sketching bounds.

Subject Classification

ACM Subject Classification
  • Theory of computation → Streaming, sublinear and near linear time algorithms
Keywords
  • streaming algorithms
  • 𝓁_p norm
  • heavy hitters
  • distinct elements

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. The aqua approximate query answering system. In Alex Delis, Christos Faloutsos, and Shahram Ghandeharizadeh, editors, SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, pages 574-576, June 1-3, 1999, Philadelphia, Pennsylvania, USA, 1999. ACM Press. Google Scholar
  2. Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In Proc. 20th International Conference on Very Large Data Bases, pages 487-499, 1994. Google Scholar
  3. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137-147, 1999. URL: https://doi.org/10.1006/jcss.1997.1545.
  4. Alexandr Andoni. High frequency moments via max-stability. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017, pages 6364-6368. IEEE, 2017. URL: https://doi.org/10.1109/ICASSP.2017.7953381.
  5. Alexandr Andoni, Khanh Do Ba, Piotr Indyk, and David P. Woodruff. Efficient sketches for earth-mover distance, with applications. In 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2009, October 25-27, 2009, Atlanta, Georgia, USA, pages 324-330. IEEE Computer Society, 2009. URL: https://doi.org/10.1109/FOCS.2009.25.
  6. Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming algorithms via precision sampling. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, pages 363-372. IEEE, 2011. Google Scholar
  7. Sepehr Assadi, Sanjeev Khanna, and Yang Li. On estimating maximum matching size in graph streams. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1723-1742. SIAM, 2017. Google Scholar
  8. Khanh Do Ba, Piotr Indyk, Eric Price, and David P. Woodruff. Lower bounds for sparse recovery. In Moses Charikar, editor, Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 1190-1197. SIAM, 2010. URL: https://doi.org/10.1137/1.9781611973075.95.
  9. Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. J. Comput. Syst. Sci., 68(4):702-732, 2004. Google Scholar
  10. Ziv Bar-Yossef, Thathachar S Jayram, Robert Krauthgamer, and Ravi Kumar. The sketching complexity of pattern matching. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 261-272. Springer, 2004. Google Scholar
  11. Kevin S. Beyer and Raghu Ramakrishnan. Bottom-up computation of sparse and iceberg CUBEs. In Proc. ACM SIGMOD International Conference on Management of Data, pages 359-370, 1999. Google Scholar
  12. Mark Braverman, Sumegha Garg, and David P Woodruff. The coin problem with applications to data streams. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 318-329. IEEE, 2020. Google Scholar
  13. Mark Braverman, Sumegha Garg, and Or Zamir. Tight space complexity of the coin problem. Electron. Colloquium Comput. Complex., 2021. URL: https://eccc.weizmann.ac.il/report/2021/083.
  14. Vladimir Braverman, Moses Charikar, William Kuszmaul, David P. Woodruff, and Lin F. Yang. The one-way communication complexity of dynamic time warping distance. In Gill Barequet and Yusu Wang, editors, 35th International Symposium on Computational Geometry, SoCG 2019, June 18-21, 2019, Portland, Oregon, USA, volume 129 of LIPIcs, pages 16:1-16:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019. Google Scholar
  15. Vladimir Braverman, Stephen R. Chestnut, Robert Krauthgamer, Yi Li, David P. Woodruff, and Lin F. Yang. Matrix norms in data streams: Faster, multi-pass and row-order. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, volume 80 of Proceedings of Machine Learning Research, pages 648-657, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, 2018. PMLR. Google Scholar
  16. Vladimir Braverman, Robert Krauthgamer, Aditya Krishnan, and Roi Sinoff. Schatten norms in matrix streams: Hello sparsity, goodbye dimension. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, volume 119 of Proceedings of Machine Learning Research, pages 1100-1110, Virtual Event, 2020. PMLR. Google Scholar
  17. Paul Brown, Peter Haas, Jussi Myllymaki, Hamid Pirahesh, Berthold Reinwald, and Yannis Sismanis. Toward automated large-scale information integration and discovery. In Data Management in a Connected World, pages 161-180. Springer, 2005. Google Scholar
  18. Amit Chakrabarti and Sagar Kale. Strong fooling sets for multi-player communication with applications to deterministic estimation of stream statistics. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 41-50. IEEE, 2016. Google Scholar
  19. Graham Cormode and S Muthukrishnan. Space efficient mining of multigraph streams. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 271-282, 2005. Google Scholar
  20. Tamraparni Dasu, Theodore Johnson, Shanmugauelayut Muthukrishnan, and Vladislav Shkapenyuk. Mining database structure; or, how to build a data quality browser. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data, pages 240-251, 2002. Google Scholar
  21. David J DeWitt, Jeffrey F Naughton, Donovan A Schneider, and Srinivasan Seshadri. Practical skew handling in parallel joins. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 1992. Google Scholar
  22. Elbert Du, Michael Mitzenmacher, David P. Woodruff, and Guang Yang. Separating k-Player from t-Player One-Way Communication, with Applications to Data Streams. arXiv e-prints, page arXiv:1905.07135, May 2019. URL: http://arxiv.org/abs/1905.07135.
  23. Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, and Jeffrey D. Ullman. Computing iceberg queries efficiently. In Proc. 24rd International Conference on Very Large Data Bases, pages 299-310, 1998. Google Scholar
  24. Schkolnick Finkelstein, Mario Schkolnick, and Paolo Tiberio. Physical database design for relational databases. ACM Transactions on Database Systems (TODS), 13(1):91-128, 1988. Google Scholar
  25. Sumit Ganguly and David P. Woodruff. High probability frequency moment sketches. In Ioannis Chatzigiannakis, Christos Kaklamanis, Dániel Marx, and Donald Sannella, editors, 45th International Colloquium on Automata, Languages, and Programming, ICALP 2018, July 9-13, 2018, Prague, Czech Republic, volume 107 of LIPIcs, pages 58:1-58:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018. Google Scholar
  26. Jiawei Han, Jian Pei, Guozhu Dong, and Ke Wang. Efficient computation of iceberg cubes with complex measures. In Proc. 2001 ACM SIGMOD International Conference on Management of Data,, pages 1-12, 2001. Google Scholar
  27. Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In Proc. 2000 ACM SIGMOD International Conference on Management of Data, pages 1-12, 2000. Google Scholar
  28. Christian Hidber. Online association rule mining. In SIGMOD 1999, Proc. ACM SIGMOD International Conference on Management of Data, pages 145-156, 1999. Google Scholar
  29. Piotr Indyk and Ali Vakilian. Tight trade-offs for the maximum k-coverage problem in the general streaming model. In Dan Suciu, Sebastian Skritek, and Christoph Koch, editors, Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 200-217. ACM, 2019. Google Scholar
  30. Rajesh Jayaram and David P. Woodruff. Data streams with bounded deletions. CoRR, abs/1803.08777, 2018. URL: http://arxiv.org/abs/1803.08777.
  31. T. S. Jayram and David P. Woodruff. The data stream space complexity of cascaded norms. In 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2009, October 25-27, 2009, Atlanta, Georgia, USA, pages 765-774. IEEE Computer Society, 2009. URL: https://doi.org/10.1109/FOCS.2009.82.
  32. Tanqiu Jiang, Yi Li, Honghao Lin, Yisong Ruan, and David P Woodruff. Learning-augmented data stream algorithms. In International Conference on Learning Representations, 2019. Google Scholar
  33. Hossein Jowhari, Mert Saglam, and Gábor Tardos. Tight bounds for L_p samplers, finding duplicates in streams, and related problems. In Maurizio Lenzerini and Thomas Schwentick, editors, Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, June 12-16, 2011, Athens, Greece, pages 49-58. ACM, 2011. URL: https://doi.org/10.1145/1989284.1989289.
  34. John Kallaugher and Eric Price. Separations and equivalences between turnstile streaming and linear sketching. In Konstantin Makarychev, Yury Makarychev, Madhur Tulsiani, Gautam Kamath, and Julia Chuzhoy, editors, Proccedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, Chicago, IL, USA, June 22-26, 2020, pages 1223-1236. ACM, 2020. Google Scholar
  35. Daniel M. Kane, Jelani Nelson, and David P. Woodruff. On the exact space complexity of sketching and streaming small norms. In Moses Charikar, editor, Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 1161-1178. SIAM, 2010. Google Scholar
  36. Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Jan Paredaens and Dirk Van Gucht, editors, Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2010, June 6-11, 2010, Indianapolis, Indiana, USA, pages 41-52. ACM, 2010. URL: https://doi.org/10.1145/1807085.1807094.
  37. Aditya Krishnan, Sidhanth Mohanty, and David P. Woodruff. On sketching the q to p norms. CoRR, abs/1806.06429, 2018. URL: http://arxiv.org/abs/1806.06429.
  38. Rafał Latała. Estimation of moments of sums of independent real random variables. The Annals of Probability, 25(3):1502-1513, 1997. URL: https://doi.org/10.1214/aop/1024404522.
  39. Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991. Google Scholar
  40. Yi Li, Huy L. Nguyen, and David P. Woodruff. Turnstile streaming algorithms might as well be linear sketches. In David B. Shmoys, editor, Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31 - June 03, 2014, pages 174-183. ACM, 2014. Google Scholar
  41. Yi Li, Huy L. Nguyen, and David P. Woodruff. On approximating matrix norms in data streams. SIAM J. Comput., 48(6):1643-1697, 2019. URL: https://doi.org/10.1137/17M1152255.
  42. Yi Li and David P. Woodruff. Tight bounds for sketching the operator norm, Schatten norms, and subspace embeddings. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2016). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016. Google Scholar
  43. Yi Li and David P Woodruff. Embeddings of Schatten norms with applications to data streams. In Ioannis Chatzigiannakis, Piotr Indyk, Fabian Kuhn, and Anca Muscholl, editors, Proceedings of ICALP, volume 80 of LIPIcs, pages 60:1-60:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017. URL: https://doi.org/10.4230/LIPIcs.ICALP.2017.60.
  44. Sriram Padmanabhan, Bishwaranjan Bhattacharjee, Tim Malkemus, Leslie Cranston, and Matthew Huras. Multi-dimensional clustering: A new data layout scheme in db2. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 637-641, 2003. Google Scholar
  45. Eric Price and David P. Woodruff. (1 + ε)-approximate sparse recovery. In Rafail Ostrovsky, editor, IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS 2011, Palm Springs, CA, USA, October 22-25, 2011, pages 295-304. IEEE Computer Society, 2011. URL: https://doi.org/10.1109/FOCS.2011.92.
  46. Ashok Savasere, Edward Omiecinski, and Shamkant B. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. 21th International Conference on Very Large Data Bases, pages 432-444, 1995. Google Scholar
  47. P Griffiths Selinger, Morton M Astrahan, Donald D Chamberlin, Raymond A Lorie, and Thomas G Price. Access path selection in a relational database management system. In Readings in Artificial Intelligence and Databases, pages 511-522. Elsevier, 1989. Google Scholar
  48. Amit Shukla, Prasad Deshpande, Jeffrey F Naughton, and Karthikeyan Ramasamy. Storage estimation for multidimensional aggregates in the presence of hierarchies. In VLDB, volume 96, pages 522-531. Citeseer, 1996. Google Scholar
  49. Srikanta Tirthapura and David Woodruff. Rectangle-efficient aggregation in spatial data streams. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems, pages 283-294. ACM, 2012. Google Scholar
  50. Hannu Toivonen. Sampling large databases for association rules. In Proc. 22th International Conference on Very Large Data Bases, pages 134-145, 1996. Google Scholar
  51. Omri Weinstein and David P. Woodruff. The simultaneous communication of disjointness with applications to data streams. In Magnús M. Halldórsson, Kazuo Iwama, Naoki Kobayashi, and Bettina Speckmann, editors, Automata, Languages, and Programming - 42nd International Colloquium, ICALP 2015, Kyoto, Japan, July 6-10, 2015, Proceedings, Part I, volume 9134 of Lecture Notes in Computer Science, pages 1082-1093. Springer, 2015. Google Scholar
  52. David P. Woodruff and Guang Yang. Separating k-player from t-player one-way communication, with applications to data streams. In Christel Baier, Ioannis Chatzigiannakis, Paola Flocchini, and Stefano Leonardi, editors, 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, July 9-12, 2019, Patras, Greece, volume 132 of LIPIcs, pages 97:1-97:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail