Document Open Access Logo

A Simple Proof of a New Set Disjointness with Applications to Data Streams

Authors Akshay Kamath, Eric Price, David P. Woodruff



PDF
Thumbnail PDF

File

LIPIcs.CCC.2021.37.pdf
  • Filesize: 1.27 MB
  • 24 pages

Document Identifiers

Author Details

Akshay Kamath
  • University of Texas at Austin, TX, USA
Eric Price
  • University of Texas at Austin, TX, USA
David P. Woodruff
  • Carnegie Mellon University, Pittsburgh, PA, USA

Acknowledgements

The authors would like to thank the anonymous reviewers of a previous version of this paper for helpful suggestions that significantly improved the presentation.

Cite AsGet BibTex

Akshay Kamath, Eric Price, and David P. Woodruff. A Simple Proof of a New Set Disjointness with Applications to Data Streams. In 36th Computational Complexity Conference (CCC 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 200, pp. 37:1-37:24, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.CCC.2021.37

Abstract

The multiplayer promise set disjointness is one of the most widely used problems from communication complexity in applications. In this problem there are k players with subsets S¹, …, S^k, each drawn from {1, 2, …, n}, and we are promised that either the sets are (1) pairwise disjoint, or (2) there is a unique element j occurring in all the sets, which are otherwise pairwise disjoint. The total communication of solving this problem with constant probability in the blackboard model is Ω(n/k). We observe for most applications, it instead suffices to look at what we call the "mostly" set disjointness problem, which changes case (2) to say there is a unique element j occurring in at least half of the sets, and the sets are otherwise disjoint. This change gives us a much simpler proof of an Ω(n/k) randomized total communication lower bound, avoiding Hellinger distance and Poincare inequalities. Our proof also gives strong lower bounds for high probability protocols, which are much larger than what is possible for the set disjointness problem. Using this we show several new results for data streams: 1) for 𝓁₂-Heavy Hitters, any O(1)-pass streaming algorithm in the insertion-only model for detecting if an ε-𝓁₂-heavy hitter exists requires min(1/(ε²)log((ε²n)/δ), 1/(ε)n^{1/2}) bits of memory, which is optimal up to a log n factor. For deterministic algorithms and constant ε, this gives an Ω(n^{1/2}) lower bound, improving the prior Ω(log n) lower bound. We also obtain lower bounds for Zipfian distributions. 2) for 𝓁_p-Estimation, p > 2, we show an O(1)-pass Ω(n^{1-2/p} log(1/δ)) bit lower bound for outputting an O(1)- approximation with probability 1-δ, in the insertion-only model. This is optimal, and the best previous lower bound was Ω(n^{1-2/p} + log(1/δ)). 3) for low rank approximation of a sparse matrix in ℝ^{d× n}, if we see the rows of a matrix one at a time in the row-order model, each row having O(1) non-zero entries, any deterministic algorithm requires Ω(√d) memory to output an O(1)-approximate rank-1 approximation. Finally, we consider strict and general turnstile streaming models, and show separations between sketching lower bounds and non-sketching upper bounds for the heavy hitters problem.

Subject Classification

ACM Subject Classification
  • Theory of computation → Lower bounds and information complexity
Keywords
  • Streaming algorithms
  • heavy hitters
  • communication complexity
  • information complexity

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In VLDB'94, Proceedings of 20th International Conference on Very Large Data Bases, September 12-15, 1994, Santiago de Chile, Chile, pages 487-499, 1994. Google Scholar
  2. Yuqing Ai, Wei Hu, Yi Li, and David P Woodruff. New characterizations in turnstile streams with applications. In LIPIcs-Leibniz International Proceedings in Informatics, volume 50. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016. Google Scholar
  3. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137-147, 1999. Google Scholar
  4. Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5, Madison, Wisconsin, USA, pages 1-16, 2002. Google Scholar
  5. Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. J. Comput. Syst. Sci., 68(4):702-732, 2004. URL: https://doi.org/10.1016/j.jcss.2003.11.006.
  6. Paul Beame and Trinh Huynh. Multiparty communication complexity and threshold circuit size of $$1sfac^0. SIAM Journal on Computing, 41(3):484-518, 2012. Google Scholar
  7. Paul Beame, Toniann Pitassi, Nathan Segerlind, and Avi Wigderson. A strong direct product theorem for corruption and the multiparty communication complexity of disjointness. Computational Complexity, 15(4):391-432, 2006. Google Scholar
  8. Radu Berinde, Piotr Indyk, Graham Cormode, and Martin J. Strauss. Space-optimal heavy hitters with strong error bounds. ACM Trans. Database Syst., 35(4):26, 2010. URL: https://doi.org/10.1145/1862919.1862923.
  9. Kevin S. Beyer and Raghu Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. In SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadelphia, Pennsylvania, USA., pages 359-370, 1999. Google Scholar
  10. Christos Boutsidis, David P Woodruff, and Peilin Zhong. Optimal principal component analysis in distributed and streaming models. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 236-249, 2016. Google Scholar
  11. Mark Braverman, Ankit Garg, Tengyu Ma, Huy L Nguyen, and David P Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 1011-1020, 2016. Google Scholar
  12. Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang, and David P. Woodruff. Bptree: an 𝓁₂ heavy hitters algorithm using constant memory. CoRR, abs/1603.00759, 2016. Google Scholar
  13. Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, and David P. Woodruff. Beating countsketch for heavy hitters in insertion streams. STOC, 2016. Google Scholar
  14. Vladimir Braverman, Jonathan Katzman, Charles Seidell, and Gregory Vorsanger. An optimal algorithm for large frequency moments using o(n^(1-2/k)) bits. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2014, September 4-6, 2014, Barcelona, Spain, pages 531-544, 2014. Google Scholar
  15. Yousra Chabchoub, Christine Fricker, and Hanene Mohamed. Analysis of a bloom filter algorithm via the supermarket model. In 21st International Teletraffic Congress, ITC 2009, Paris, France, September 15-17, 2009, pages 1-8, 2009. Google Scholar
  16. Amit Chakrabarti, Graham Cormode, and Andrew McGregor. A near-optimal algorithm for estimating the entropy of a stream. ACM Transactions on Algorithms, 6(3), 2010. Google Scholar
  17. Amit Chakrabarti and Sagar Kale. Strong fooling sets for multi-player communication with applications to deterministic estimation of stream statistics. In IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA, pages 41-50, 2016. Google Scholar
  18. Amit Chakrabarti, Subhash Khot, and Xiaodong Sun. Near-optimal lower bounds on the multi-party communication complexity of set disjointness. In 18th IEEE Annual Conference on Computational Complexity, 2003. Proceedings., pages 107-117. IEEE, 2003. Google Scholar
  19. Ho-Leung Chan, Tak-Wah Lam, Lap-Kei Lee, Jiangwei Pan, Hing-Fung Ting, and Qin Zhang. Edit distance to monotonicity in sliding windows. In International Symposium on Algorithms and Computation, pages 564-573. Springer, 2011. Google Scholar
  20. Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3-15, 2004. Google Scholar
  21. Arkadev Chattopadhyay and Anil Ada. Multiparty communication complexity of disjointness. arXiv preprint, 2008. URL: http://arxiv.org/abs/0801.3624.
  22. Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. Power-law distributions in empirical data. SIAM review, 51(4):661-703, 2009. Google Scholar
  23. A. Cohen, W. Dahmen, and R. DeVore. Compressed sensing and best k-term approximation. J. Amer. Math. Soc, 22(1):211-231, 2009. Google Scholar
  24. Graham Cormode. Open problem in data streams and related topics. IITK Workshop on Algorithms for Data Streams, 2006. Google Scholar
  25. Graham Cormode and Marios Hadjieleftheriou. Finding frequent items in data streams. PVLDB, 1(2):1530-1541, 2008. Google Scholar
  26. Graham Cormode and S Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58-75, 2005. Google Scholar
  27. Erik D Demaine, Alejandro López-Ortiz, and J Ian Munro. Frequency estimation of internet packet streams with limited space. In Algorithms—ESA 2002, pages 348-360. Springer, 2002. Google Scholar
  28. Khanh Do Ba, Piotr Indyk, Eric Price, and David P. Woodruff. Lower bounds for sparse recovery. CoRR, abs/1106.0365, 2011. Google Scholar
  29. Cristian Estan and George Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst., 21(3):270-313, 2003. Google Scholar
  30. Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, and Jeffrey D. Ullman. Computing iceberg queries efficiently. In VLDB'98, Proceedings of 24rd International Conference on Very Large Data Bases, August 24-27, 1998, New York City, New York, USA, pages 299-310, 1998. Google Scholar
  31. Sumit Ganguly. Deterministically estimating data stream frequencies. In Ding-Zhu Du, Xiaodong Hu, and Panos M. Pardalos, editors, Combinatorial Optimization and Applications, Third International Conference, COCOA 2009, Huangshan, China, June 10-12, 2009. Proceedings, volume 5573 of Lecture Notes in Computer Science, pages 301-312. Springer, 2009. Google Scholar
  32. Ankit Garg, Tengyu Ma, and Huy Nguyen. On communication cost of distributed statistical estimation and dimensionality. In Advances in Neural Information Processing Systems, pages 2726-2734, 2014. Google Scholar
  33. Mina Ghashami, Edo Liberty, and Jeff M Phillips. Efficient frequent directions algorithm for sparse matrices. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 845-854, 2016. Google Scholar
  34. Mina Ghashami, Edo Liberty, Jeff M Phillips, and David P Woodruff. Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762-1792, 2016. Google Scholar
  35. Mina Ghashami and Jeff M Phillips. Relative errors for deterministic low-rank matrix approximations. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 707-717. SIAM, 2014. Google Scholar
  36. Anna C Gilbert, Hung Q Ngo, Ely Porat, Atri Rudra, and Martin J Strauss. L2/l2-foreach sparse recovery with low risk. arXiv preprint, 2013. URL: http://arxiv.org/abs/1304.6232.
  37. Parikshit Gopalan and Jaikumar Radhakrishnan. Finding duplicates in a data stream. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 402-411, 2009. Google Scholar
  38. Vince Grolmusz. The bns lower-bound for multiparty protocols is nearly optimal. Information and computation, 112(1):51-54, 1994. Google Scholar
  39. André Gronemeier. Asymptotically optimal lower bounds on the nih-multi-party information complexity of the and-function and disjointness. In Susanne Albers and Jean-Yves Marion, editors, 26th International Symposium on Theoretical Aspects of Computer Science, STACS 2009, February 26-28, 2009, Freiburg, Germany, Proceedings, volume 3 of LIPIcs, pages 505-516. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany, 2009. Google Scholar
  40. Venkatesan Guruswami and Ali Kemal Sinop. Optimal column-based low-rank matrix reconstruction. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms, pages 1207-1214. SIAM, 2012. Google Scholar
  41. Jiawei Han, Jian Pei, Guozhu Dong, and Ke Wang. Efficient computation of iceberg cubes with complex measures. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data, Santa Barbara, CA, USA, May 21-24, 2001, pages 1-12, 2001. Google Scholar
  42. Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA., pages 1-12, 2000. Google Scholar
  43. Nicholas J. A. Harvey, Jelani Nelson, and Krzysztof Onak. Sketching and streaming entropy via approximation theory. In 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 489-498, 2008. Google Scholar
  44. Christian Hidber. Online association rule mining. In SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadelphia, Pennsylvania, USA., pages 145-156, 1999. Google Scholar
  45. Zengfeng Huang. Near optimal frequent directions for sketching dense and sparse matrices. In International Conference on Machine Learning, pages 2048-2057, 2018. Google Scholar
  46. Piotr Indyk and David P. Woodruff. Optimal approximations of the frequency moments of data streams. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pages 202-208, 2005. Google Scholar
  47. Rajesh Jayaram and David P Woodruff. Data streams with bounded deletions. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Prin ciples of Database Systems, pages 341-354. ACM, 2018. Google Scholar
  48. Thathachar S Jayram and David P Woodruff. The data stream space complexity of cascaded norms. In 2009 50th Annual IEEE Symposium on Foundations of Computer Science, pages 765-774. IEEE, 2009. Google Scholar
  49. TS Jayram. Hellinger strikes back: A note on the multi-party information complexity of and. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 562-573. Springer, 2009. Google Scholar
  50. Hossein Jowhari, Mert Saglam, and Gábor Tardos. Tight bounds for lp samplers, finding duplicates in streams, and related problems. In Maurizio Lenzerini and Thomas Schwentick, editors, Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, June 12-16, 2011, Athens, Greece, pages 49-58. ACM, 2011. Google Scholar
  51. John Kallaugher and Eric Price. Separations and equivalences between turnstile streaming and linear sketching. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 1223-1236, 2020. Google Scholar
  52. Ravi Kannan, Santosh Vempala, and David Woodruff. Principal component analysis and higher correlations for distributed data. In Conference on Learning Theory, pages 1040-1057, 2014. Google Scholar
  53. Richard M Karp, Scott Shenker, and Christos H Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems (TODS), 28(1):51-55, 2003. Google Scholar
  54. Abhishek Kumar and Jun (Jim) Xu. Sketch guided sampling - using on-line estimates of flow size for adaptive data collection. In INFOCOM 2006. 25th IEEE International Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies, 23-29 April 2006, Barcelona, Catalunya, Spain, 2006. Google Scholar
  55. Kasper Green Larsen, Jelani Nelson, Huy L. Nguyen, and Mikkel Thorup. Heavy hitters via cluster-preserving clustering. Commun. ACM, 62(8):95-100, 2019. Google Scholar
  56. Troy Lee and Adi Shraibman. Disjointness is hard in the multiparty number-on-the-forehead model. Computational Complexity, 18(2):309-336, 2009. Google Scholar
  57. Yi Li, Huy L Nguyen, and David P Woodruff. Turnstile streaming algorithms might as well be linear sketches. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 174-183, 2014. Google Scholar
  58. Yi Li, Huy L. Nguyen, and David P. Woodruff. Turnstile streaming algorithms might as well be linear sketches. In Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31 - June 03, 2014, pages 174-183, 2014. URL: https://doi.org/10.1145/2591796.2591812.
  59. Edo Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 581-588, 2013. Google Scholar
  60. Debmalya Mandal, Ariel D. Procaccia, Nisarg Shah, and David P. Woodruff. Efficient and thrifty voting by any means necessary. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 7178-7189, 2019. Google Scholar
  61. Debmalya Mandal, Nisarg Shah, and David P. Woodruff. Optimal communication-distortion tradeoff in voting. In EC '20: The 21st ACM Conference on Economics and Computation, Virtual Event, Hungary, July 13-17, 2020, pages 795-813, 2020. Google Scholar
  62. Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th international conference on Very Large Data Bases, pages 346-357. VLDB Endowment, 2002. Google Scholar
  63. Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the 10th International Conference on Database Theory, ICDT'05, pages 398-412, Berlin, Heidelberg, 2005. Springer-Verlag. URL: https://doi.org/10.1007/978-3-540-30570-5_27.
  64. Jayadev Misra and David Gries. Finding repeated elements. Sci. Comput. Program., 2(2):143-152, 1982. Google Scholar
  65. Morteza Monemizadeh and David P. Woodruff. 1-pass relative-error L_p-sampling with applications. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1143-1160, 2010. Google Scholar
  66. Shanmugavelayutham Muthukrishnan. Data streams: Algorithms and applications. Now Publishers Inc, 2005. Google Scholar
  67. Eric Price and David P Woodruff. Lower bounds for adaptive sparse recovery. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 652-663. SIAM, 2013. Google Scholar
  68. Ashok Savasere, Edward Omiecinski, and Shamkant B. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95, Proceedings of 21th International Conference on Very Large Data Bases, September 11-15, 1995, Zurich, Switzerland., pages 432-444, 1995. Google Scholar
  69. Alexander A Sherstov. The multiparty communication complexity of set disjointness. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 525-548, 2012. Google Scholar
  70. Alexander A Sherstov. Communication lower bounds using directional derivatives. Journal of the ACM (JACM), 61(6):1-71, 2014. Google Scholar
  71. Xiaoming Sun, David P. Woodruff, Guang Yang, and Jialin Zhang. Querying a matrix through matrix-vector products. In 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, July 9-12, 2019, Patras, Greece, pages 94:1-94:16, 2019. Google Scholar
  72. Pascal Tesson. Computational complexity questions related to finite monoids and semigroups, 2003. Google Scholar
  73. Hannu Toivonen. Sampling large databases for association rules. In VLDB'96, Proceedings of 22th International Conference on Very Large Data Bases, September 3-6, 1996, Mumbai (Bombay), India, pages 134-145, 1996. Google Scholar
  74. Wikipedia contributors. Jensen–Shannon divergence - Wikipedia, the free encyclopedia, 2020. [Online; accessed 06-November-2020]. URL: https://en.wikipedia.org/w/index.php?title=Jensen%E2%80%93Shannon_divergence&oldid=980081721.
  75. David Woodruff. Low rank approximation lower bounds in row-update streams. In Advances in Neural Information Processing Systems, pages 1781-1789, 2014. Google Scholar
  76. David P Woodruff. New algorithms for heavy hitters in data streams. arXiv preprint, 2016. URL: http://arxiv.org/abs/1603.01733.
  77. David P. Woodruff and Qin Zhang. Tight bounds for distributed functional monitoring. In Proceedings of the 44th Symposium on Theory of Computing Conference, STOC 2012, New York, NY, USA, May 19 - 22, 2012, pages 941-960, 2012. Google Scholar
  78. David P Woodruff and Qin Zhang. An optimal lower bound for distinct elements in the message passing model. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 718-733. SIAM, 2014. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail