Distributed Data Summarization in Well-Connected Networks

Authors Hsin-Hao Su, Hoa T. Vu

Thumbnail PDF


  • Filesize: 0.56 MB
  • 16 pages

Document Identifiers

Author Details

Hsin-Hao Su
  • Boston College, MA, USA
Hoa T. Vu
  • Boston College, MA, USA

Cite AsGet BibTex

Hsin-Hao Su and Hoa T. Vu. Distributed Data Summarization in Well-Connected Networks. In 33rd International Symposium on Distributed Computing (DISC 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 146, pp. 33:1-33:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


We study distributed algorithms for some fundamental problems in data summarization. Given a communication graph G of n nodes each of which may hold a value initially, we focus on computing sum_{i=1}^N g(f_i), where f_i is the number of occurrences of value i and g is some fixed function. This includes important statistics such as the number of distinct elements, frequency moments, and the empirical entropy of the data. In the CONGEST~ model, a simple adaptation from streaming lower bounds shows that it requires Omega~(D+ n) rounds, where D is the diameter of the graph, to compute some of these statistics exactly. However, these lower bounds do not hold for graphs that are well-connected. We give an algorithm that computes sum_{i=1}^{N} g(f_i) exactly in {tau_{G}} * 2^{O(sqrt{log n})} rounds where {tau_{G}} is the mixing time of G. This also has applications in computing the top k most frequent elements. We demonstrate that there is a high similarity between the GOSSIP~ model and the CONGEST~ model in well-connected graphs. In particular, we show that each round of the GOSSIP~ model can be simulated almost perfectly in O~({tau_{G}}) rounds of the CONGEST~ model. To this end, we develop a new algorithm for the GOSSIP~ model that 1 +/- epsilon approximates the p-th frequency moment F_p = sum_{i=1}^N f_i^p in O~(epsilon^{-2} n^{1-k/p}) rounds , for p >= 2, when the number of distinct elements F_0 is at most O(n^{1/(k-1)}). This result can be translated back to the CONGEST~ model with a factor O~({tau_{G}}) blow-up in the number of rounds.

Subject Classification

ACM Subject Classification
  • Theory of computation → Distributed algorithms
  • Networks → Network algorithms
  • Mathematics of computing → Graph algorithms
  • Distributed Algorithms
  • Network Algorithms
  • Data Summarization


  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    PDF Downloads


  1. Miklós Ajtai, János Komlós, and Endre Szemerédi. An O(n log n) Sorting Network. In STOC, pages 1-9. ACM, 1983. Google Scholar
  2. Noga Alon, Yossi Matias, and Mario Szegedy. The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci., 58(1):137-147, 1999. Google Scholar
  3. Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming Algorithms via Precision Sampling. In FOCS, pages 363-372. IEEE Computer Society, 2011. Google Scholar
  4. Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting Distinct Elements in a Data Stream. In RANDOM, volume 2483 of Lecture Notes in Computer Science, pages 1-10. Springer, 2002. Google Scholar
  5. Vladimir Braverman and Stephen R. Chestnut. Universal Sketches for the Frequency Negative Moments and Other Decreasing Streaming Sums. In APPROX-RANDOM, volume 40 of LIPIcs, pages 591-605. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2015. Google Scholar
  6. Vladimir Braverman, Stephen R. Chestnut, David P. Woodruff, and Lin F. Yang. Streaming Space Complexity of Nearly All Functions of One Variable on Frequency Vectors. In PODS, pages 261-276. ACM, 2016. Google Scholar
  7. Vladimir Braverman and Rafail Ostrovsky. Zero-one frequency laws. In STOC, pages 281-290. ACM, 2010. Google Scholar
  8. Joshua Brody and Amit Chakrabarti. A Multi-Round Communication Lower Bound for Gap Hamming and Some Consequences. In IEEE Conference on Computational Complexity, pages 358-368. IEEE Computer Society, 2009. Google Scholar
  9. Amit Chakrabarti, Khanh Do Ba, and S. Muthukrishnan. Estimating Entropy and Entropy Norm on Data Streams. In STACS, volume 3884 of Lecture Notes in Computer Science, pages 196-205. Springer, 2006. Google Scholar
  10. Amit Chakrabarti, Graham Cormode, and Andrew McGregor. A near-optimal algorithm for estimating the entropy of a stream. ACM Trans. Algorithms, 6(3):51:1-51:21, 2010. Google Scholar
  11. Jen-Yeu Chen and Gopal Pandurangan. Almost-Optimal Gossip-Based Aggregate Computation. SIAM Journal on Computing, 41(3):455-483, 2012. Google Scholar
  12. Alan Demers, Dan Greene, Carl Hauser, Wes Irish, John Larson, Scott Shenker, Howard Sturgis, Dan Swinehart, and Doug Terry. Epidemic Algorithms for Replicated Database Maintenance. In Proc. 6th ACM Symposium on Principles of Distributed Computing (PODC), pages 1-12, 1987. Google Scholar
  13. A.M. Frieze and G.R. Grimmett. The shortest-path problem for graphs with random arc-lengths. Discrete Applied Mathematics, 10(1):57-77, 1985. Google Scholar
  14. Sumit Ganguly. Taylor Polynomial Estimator for Estimating Frequency Moments. In ICALP (1), volume 9134 of Lecture Notes in Computer Science, pages 542-553. Springer, 2015. Google Scholar
  15. Mohsen Ghaffari, Fabian Kuhn, and Hsin-Hao Su. Distributed MST and Routing in Almost Mixing Time. In PODC, pages 131-140. ACM, 2017. Google Scholar
  16. Mohsen Ghaffari and Jason Li. New Distributed Algorithms in Almost Mixing Time via Transformations from Parallel Algorithms. In DISC, volume 121 of LIPIcs, pages 31:1-31:16. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2018. Google Scholar
  17. George Giakkoupis, Anne-Marie Kermarrec, and Philipp Woelfel. Gossip Protocols for Renaming and Sorting. In Yehuda Afek, editor, DISC, 2013. Google Scholar
  18. Phillip B. Gibbons and Srikanta Tirthapura. Estimating simple functions on the union of data streams. In SPAA, pages 281-291. ACM, 2001. Google Scholar
  19. Oded Goldreich. Basic Facts about Expander Graphs. In Studies in Complexity and Cryptography, volume 6650 of Lecture Notes in Computer Science, pages 451-464. Springer, 2011. Google Scholar
  20. Yu Gu, Andrew McCallum, and Donald F. Towsley. Detecting Anomalies in Network Traffic Using Maximum Entropy Estimation. In Internet Measurment Conference, pages 345-350. USENIX Association, 2005. Google Scholar
  21. Bernhard Haeupler, Jeet Mohapatra, and Hsin-Hao Su. Optimal Gossip Algorithms for Exact and Approximate Quantile Computations. In PODC, pages 179-188. ACM, 2018. Google Scholar
  22. Nicholas J. A. Harvey, Jelani Nelson, and Krzysztof Onak. Sketching and Streaming Entropy via Approximation Theory. In FOCS, pages 489-498. IEEE Computer Society, 2008. Google Scholar
  23. Shlomo Hoory, Nathan Linial, and Avi Wigderson. Expander graphs and their applications. Bulletin of the American Mathematical Society, 43(4):439-561, 2006. Google Scholar
  24. Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. J. ACM, 53(3):307-323, 2006. Google Scholar
  25. Piotr Indyk and David P. Woodruff. Optimal approximations of the frequency moments of data streams. In STOC, pages 202-208. ACM, 2005. Google Scholar
  26. Rajesh Jayaram and David P. Woodruff. Perfect Lp Sampling in a Data Stream. In FOCS, pages 544-555. IEEE Computer Society, 2018. Google Scholar
  27. Márk Jelasity, Spyros Voulgaris, Rachid Guerraoui, Anne-Marie Kermarrec, and Maarten van Steen. Gossip-based Peer Sampling. ACM Trans. Comput. Syst., 25(3), August 2007. URL: https://doi.org/10.1145/1275517.1275520.
  28. Hossein Jowhari, Mert Saglam, and Gábor Tardos. Tight bounds for Lp samplers, finding duplicates in streams, and related problems. In PODS, pages 49-58. ACM, 2011. Google Scholar
  29. Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In PODS, pages 41-52. ACM, 2010. Google Scholar
  30. R. Karp, C. Schindelhauer, S. Shenker, and B. Vocking. Randomized rumor spreading. In Proc. 41st IEEE Symposium on Foundations of Computer Science (FOCS), pages 565-574, 2000. Google Scholar
  31. Srinivas Kashyap, Supratim Deb, K. V. M. Naidu, Rajeev Rastogi, and Anand Srinivasan. Efficient Gossip-based Aggregate Computation. In Proc. of the 25th ACM Symposium on Principles of Database Systems (PODS), pages 308-317, 2006. Google Scholar
  32. David Kempe, Alin Dobra, and Johannes Gehrke. Gossip-based computation of aggregate information. In Proc. of the Symp. on Found. of Comp. Sci. (FOCS), pages 482-491, 2003. Google Scholar
  33. Fabian Kuhn, Thomas Locher, and Stefan Schmid. Distributed computation of the mode. In PODC, pages 15-24. ACM, 2008. Google Scholar
  34. Fabian Kuhn, Thomas Locher, and Rogert Wattenhofer. Tight Bounds for Distributed Selection. In Proc. of 19th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 145-153, 2007. Google Scholar
  35. Fabian Kuhn and Rotem Oshman. The Complexity of Data Aggregation in Directed Networks. In DISC, volume 6950 of Lecture Notes in Computer Science, pages 416-431. Springer, 2011. Google Scholar
  36. Morteza Monemizadeh and David P. Woodruff. 1-Pass Relative-Error L_p-Sampling with Applications. In SODA, pages 1143-1160. SIAM, 2010. Google Scholar
  37. David Peleg. Distributed Computing: A Locality-sensitive Approach. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000. Google Scholar
  38. Boris Pittel. On Spreading a Rumor. SIAM Journal on Applied Mathematics, 47(1):213-223, 1987. Google Scholar
  39. Hsin-Hao Su and Hoa T. Vu. Distributed Data Summarization in Well-Connected Networks. CoRR, abs/1908.00236, 2019. URL: http://arxiv.org/abs/1908.00236.
  40. Arno Wagner and Bernhard Plattner. Entropy Based Worm and Anomaly Detection in Fast IP Networks. In WETICE, pages 172-177. IEEE Computer Society, 2005. Google Scholar
  41. David P. Woodruff. Optimal space lower bounds for all frequency moments. In SODA, pages 167-175. SIAM, 2004. Google Scholar
  42. Kuai Xu, Zhi-Li Zhang, and Supratik Bhattacharyya. Profiling internet backbone traffic: behavior models and applications. In SIGCOMM, pages 169-180. ACM, 2005. Google Scholar