A Simple Proof of a New Set Disjointness with Applications to Data Streams

Kamath, Akshay; Price, Eric; Woodruff, David P.

doi:10.4230/LIPIcs.CCC.2021.37

Abstract

The multiplayer promise set disjointness is one of the most widely used problems from communication complexity in applications. In this problem there are k players with subsets S¹, …, S^k, each drawn from {1, 2, …, n}, and we are promised that either the sets are (1) pairwise disjoint, or (2) there is a unique element j occurring in all the sets, which are otherwise pairwise disjoint. The total communication of solving this problem with constant probability in the blackboard model is Ω(n/k). 
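As an illustration only, the promise can be phrased as a small classifier over the k input sets. This is a centralized sketch that sees all the sets at once, not a communication protocol, and the names `classify_instance` and `min_fraction` are hypothetical. Setting `min_fraction=1.0` matches case (2) as stated above (the common element occurs in every set); a value of 0.5 models a relaxed promise in which the common element only has to occur in at least half of the sets.

```python
from collections import Counter
from typing import Iterable, Set


def classify_instance(sets: Iterable[Set[int]], min_fraction: float = 1.0) -> str:
    """Classify a promise instance as 'disjoint' or 'intersecting'.

    Hypothetical helper for illustration: it inspects all k sets centrally.
    Raises ValueError when the input satisfies neither case of the promise.
    """
    sets = [set(s) for s in sets]
    k = len(sets)
    # Count in how many of the k sets each element occurs.
    counts = Counter(x for s in sets for x in s)
    repeated = [x for x, c in counts.items() if c > 1]
    if not repeated:
        return "disjoint"  # case (1): the sets are pairwise disjoint
    # Case (2): exactly one element repeats, and it occurs in enough sets.
    if len(repeated) == 1 and counts[repeated[0]] >= min_fraction * k:
        return "intersecting"
    raise ValueError("input violates the promise")


print(classify_instance([{1, 2}, {3}, {4, 5}]))
print(classify_instance([{1, 7}, {2, 7}, {3, 7}]))
print(classify_instance([{1, 7}, {2, 7}, {3}, {4}], min_fraction=0.5))
```

The third call is accepted only because of the relaxed threshold: element 7 occurs in two of the four sets, which would violate the promise under `min_fraction=1.0`.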
We observe that for most applications, it instead suffices to look at what we call the "mostly" set disjointness problem, which changes case (2) to say there is a unique element j occurring in at least half of the sets, and the sets are otherwise disjoint. This change gives us a much simpler proof of an Ω(n/k) randomized total communication lower bound, avoiding Hellinger distance and Poincaré inequalities. Our proof also gives strong lower bounds for high probability protocols, which are much larger than what is possible for the set disjointness problem. Using this we show several new results for data streams:
1) for 𝓁₂-Heavy Hitters, any O(1)-pass streaming algorithm in the insertion-only model for detecting if an ε-𝓁₂-heavy hitter exists requires min(1/(ε²)log((ε²n)/δ), 1/(ε)n^{1/2}) bits of memory, which is optimal up to a log n factor. For deterministic algorithms and constant ε, this gives an Ω(n^{1/2}) lower bound, improving the prior Ω(log n) lower bound. We also obtain lower bounds for Zipfian distributions. 
2) for 𝓁_p-Estimation, p > 2, we show an O(1)-pass Ω(n^{1-2/p} log(1/δ)) bit lower bound for outputting an O(1)-approximation with probability 1-δ, in the insertion-only model. This is optimal, and the best previous lower bound was Ω(n^{1-2/p} + log(1/δ)).
3) for low rank approximation of a sparse matrix in ℝ^{d× n}, if we see the rows of a matrix one at a time in the row-order model, each row having O(1) non-zero entries, any deterministic algorithm requires Ω(√d) memory to output an O(1)-approximate rank-1 approximation.
Finally, we consider strict and general turnstile streaming models, and show separations between sketching lower bounds and non-sketching upper bounds for the heavy hitters problem.
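
To make the detection problem in item 1 concrete, the sketch below checks offline whether a stream contains an ε-𝓁₂-heavy hitter. It assumes the common definition, which the abstract does not spell out: item i with frequency f_i is an ε-𝓁₂-heavy hitter if f_i² ≥ ε² Σ_j f_j². The function name `has_l2_heavy_hitter` is hypothetical, and the code stores every exact count, so it uses far more memory than the small-space algorithms the lower bounds are about.

```python
from collections import Counter
from typing import Hashable, Iterable


def has_l2_heavy_hitter(stream: Iterable[Hashable], eps: float) -> bool:
    """Return True if some item i satisfies f_i^2 >= eps^2 * sum_j f_j^2.

    Exact, offline reference check (an assumption-labeled illustration):
    it keeps one counter per distinct item in the insertion-only stream.
    """
    freq = Counter(stream)  # exact frequency vector f
    second_moment = sum(c * c for c in freq.values())  # F_2 = sum_j f_j^2
    return any(c * c >= eps * eps * second_moment for c in freq.values())


# One item with frequency 10 plus 100 distinct items of frequency 1:
# F_2 = 10^2 + 100 * 1^2 = 200.
stream = [1] * 10 + list(range(2, 102))
print(has_l2_heavy_hitter(stream, 0.5))  # threshold 0.25 * 200 = 50
print(has_l2_heavy_hitter(stream, 0.9))  # threshold 0.81 * 200 = 162
```

In this example the frequent item has f² = 100, so it qualifies at ε = 0.5 but not at ε = 0.9.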

