Worst-Case Optimal Algorithms for Parallel Query Processing

Authors Paraschos Koutris, Paul Beame, Dan Suciu



PDF
Thumbnail PDF

File

LIPIcs.ICDT.2016.8.pdf
  • Filesize: 0.59 MB
  • 18 pages

Document Identifiers

Author Details

Paraschos Koutris
Paul Beame
Dan Suciu

Cite As Get BibTex

Paraschos Koutris, Paul Beame, and Dan Suciu. Worst-Case Optimal Algorithms for Parallel Query Processing. In 19th International Conference on Database Theory (ICDT 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 48, pp. 8:1-8:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016) https://doi.org/10.4230/LIPIcs.ICDT.2016.8

Abstract

In this paper, we study the communication complexity for the problem of computing a conjunctive query on a large database in a parallel setting with p servers. In contrast to previous work, where upper and lower bounds on the communication were specified for particular structures of data (either data without skew, or data with specific types of skew), in this work we focus on worst-case analysis of the communication cost. The goal is to find worst-case optimal parallel algorithms, similar to the work of (Ngo et al. 2012) for sequential algorithms. 

We first show that for a single round we can obtain an optimal worst-case algorithm. The optimal load for a conjunctive query q when all relations have size equal to M is O(M/p^{1/psi^*}), where psi^* is a new query-related quantity called the edge quasi-packing number, which is different from both the edge packing number and edge cover number of the query hypergraph. For multiple rounds, we present algorithms that are optimal for several classes of queries. Finally, we show a surprising connection to the external memory model, which allows us to translate parallel algorithms to external memory algorithms. This technique allows us to recover (within a polylogarithmic factor) several recent results on the I/O complexity for computing join queries, and also obtain optimal algorithms for other classes of queries.

Subject Classification

Keywords
  • conjunctive query
  • parallel computation
  • worst-case bounds

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Foto N. Afrati, Anish Das Sarma, Semih Salihoglu, and Jeffrey D. Ullman. Upper and lower bounds on the cost of a map-reduce computation. CoRR, abs/1206.4377, 2012. Google Scholar
  2. Foto N. Afrati and Jeffrey D. Ullman. Optimizing joins in a map-reduce environment. In EDBT, pages 99-110, 2010. URL: http://dx.doi.org/10.1145/1739041.1739056.
  3. Albert Atserias, Martin Grohe, and Dániel Marx. Size bounds and query plans for relational joins. In FOCS, pages 739-748, 2008. URL: http://dx.doi.org/10.1109/FOCS.2008.43.
  4. Paul Beame, Paraschos Koutris, and Dan Suciu. Communication steps for parallel query processing. In PODS, pages 273-284, 2013. URL: http://dx.doi.org/10.1145/2463664.2465224.
  5. Paul Beame, Paraschos Koutris, and Dan Suciu. Skew in parallel query processing. In PODS, pages 212-223, 2014. URL: http://dx.doi.org/10.1145/2594538.2594558.
  6. Shumo Chu and James Cheng. Triangle listing in massive networks. TKDD, 6(4):17, 2012. URL: http://dx.doi.org/10.1145/2382577.2382581.
  7. Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137-150, 2004. Google Scholar
  8. Jon Feldman, S. Muthukrishnan, Anastasios Sidiropoulos, Clifford Stein, and Zoya Svitkina. On distributing symmetric streaming computations. ACM Transactions on Algorithms, 6(4), 2010. Google Scholar
  9. Gero Greiner and Riko Jacob. The efficiency of mapreduce in parallel external memory. In Proceedings of the 10th Latin American International Conference on Theoretical Informatics, LATIN'12, pages 433-445, Berlin, Heidelberg, 2012. Springer-Verlag. URL: http://dx.doi.org/10.1007/978-3-642-29344-3_37.
  10. Daniel Halperin, Victor Teixeira de Almeida, Lee Lee Choo, Shumo Chu, Paraschos Koutris, Dominik Moritz, Jennifer Ortiz, Vaspol Ruamviboonsuk, Jingjing Wang, Andrew Whitaker, Shengliang Xu, Magdalena Balazinska, Bill Howe, and Dan Suciu. Demonstration of the Myria big data management service. In Curtis E. Dyreson, Feifei Li, and M. Tamer Özsu, editors, International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pages 881-884. ACM, 2014. URL: http://dx.doi.org/10.1145/2588555.2594530.
  11. Xiaocheng Hu, Miao Qiao, and Yufei Tao. Join dependency testing, Loomis-Whitney join, and triangle enumeration. In Proceedings of the 34th ACM Symposium on Principles of Database Systems, PODS 2015, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 291-301, 2015. URL: http://dx.doi.org/10.1145/2745754.2745768.
  12. Xiaocheng Hu, Yufei Tao, and Chin-Wan Chung. Massive graph triangulation. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22-27, 2013, pages 325-336, 2013. URL: http://dx.doi.org/10.1145/2463676.2463704.
  13. Howard J. Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for mapreduce. In SODA, pages 938-948, 2010. Google Scholar
  14. Hartmut Klauck, Danupon Nanongkai, Gopal Pandurangan, and Peter Robinson. Distributed computation of large-scale graph problems. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA'15, pages 391-410. SIAM, 2015. Google Scholar
  15. Paraschos Koutris and Dan Suciu. Parallel evaluation of conjunctive queries. In PODS, pages 223-234, 2011. URL: http://dx.doi.org/10.1145/1989284.1989310.
  16. Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. Dremel: Interactive analysis of web-scale datasets. PVLDB, 3(1):330-339, 2010. Google Scholar
  17. Hung Q. Ngo, Ely Porat, Christopher Ré, and Atri Rudra. Worst-case optimal join algorithms: [extended abstract]. In PODS, pages 37-48, 2012. URL: http://dx.doi.org/10.1145/2213556.2213565.
  18. Rasmus Pagh and Francesco Silvestri. The input/output complexity of triangle enumeration. In PODS, pages 224-233, 2014. URL: http://dx.doi.org/10.1145/2594538.2594552.
  19. Francesco Silvestri. Subgraph enumeration in massive graphs. CoRR, abs/1402.3444, 2014. URL: http://arxiv.org/abs/1402.3444.
  20. DavidP. Woodruff and Qin Zhang. When distributed computation is communication expensive. In Yehuda Afek, editor, Distributed Computing, volume 8205 of Lecture Notes in Computer Science, pages 16-30. Springer Berlin Heidelberg, 2013. URL: http://dx.doi.org/10.1007/978-3-642-41527-2_2.
  21. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail