Worst-Case Optimal Algorithms for Parallel Query Processing

Koutris, Paraschos; Beame, Paul; Suciu, Dan

doi:10.4230/LIPIcs.ICDT.2016.8

File

Subject Classification

Keywords

conjunctive query
parallel computation
worst-case bounds

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

Document

0

Metadata

Abstract

In this paper, we study the communication complexity for the problem of computing a conjunctive query on a large database in a parallel setting with p servers. In contrast to previous work, where upper and lower bounds on the communication were specified for particular structures of data (either data without skew, or data with specific types of skew), in this work we focus on worst-case analysis of the communication cost. The goal is to find worst-case optimal parallel algorithms, similar to the work of (Ngo et al. 2012) for sequential algorithms. We first show that for a single round we can obtain an optimal worst-case algorithm. The optimal load for a conjunctive query q when all relations have size equal to M is O(M/p^{1/psi^*}), where psi^* is a new query-related quantity called the edge quasi-packing number, which is different from both the edge packing number and edge cover number of the query hypergraph. For multiple rounds, we present algorithms that are optimal for several classes of queries. Finally, we show a surprising connection to the external memory model, which allows us to translate parallel algorithms to external memory algorithms. This technique allows us to recover (within a polylogarithmic factor) several recent results on the I/O complexity for computing join queries, and also obtain optimal algorithms for other classes of queries.

Cite As Get BibTex

Paraschos Koutris, Paul Beame, and Dan Suciu. Worst-Case Optimal Algorithms for Parallel Query Processing. In 19th International Conference on Database Theory (ICDT 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 48, pp. 8:1-8:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016) https://doi.org/10.4230/LIPIcs.ICDT.2016.8

Author Details

Paraschos Koutris

Paul Beame

Dan Suciu

References

Foto N. Afrati, Anish Das Sarma, Semih Salihoglu, and Jeffrey D. Ullman. Upper and lower bounds on the cost of a map-reduce computation. CoRR, abs/1206.4377, 2012.
Foto N. Afrati and Jeffrey D. Ullman. Optimizing joins in a map-reduce environment. In EDBT, pages 99-110, 2010. URL: http://dx.doi.org/10.1145/1739041.1739056.
Albert Atserias, Martin Grohe, and Dániel Marx. Size bounds and query plans for relational joins. In FOCS, pages 739-748, 2008. URL: http://dx.doi.org/10.1109/FOCS.2008.43.
Paul Beame, Paraschos Koutris, and Dan Suciu. Communication steps for parallel query processing. In PODS, pages 273-284, 2013. URL: http://dx.doi.org/10.1145/2463664.2465224.
Paul Beame, Paraschos Koutris, and Dan Suciu. Skew in parallel query processing. In PODS, pages 212-223, 2014. URL: http://dx.doi.org/10.1145/2594538.2594558.
Shumo Chu and James Cheng. Triangle listing in massive networks. TKDD, 6(4):17, 2012. URL: http://dx.doi.org/10.1145/2382577.2382581.
Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137-150, 2004.
Jon Feldman, S. Muthukrishnan, Anastasios Sidiropoulos, Clifford Stein, and Zoya Svitkina. On distributing symmetric streaming computations. ACM Transactions on Algorithms, 6(4), 2010.
Gero Greiner and Riko Jacob. The efficiency of mapreduce in parallel external memory. In Proceedings of the 10th Latin American International Conference on Theoretical Informatics, LATIN'12, pages 433-445, Berlin, Heidelberg, 2012. Springer-Verlag. URL: http://dx.doi.org/10.1007/978-3-642-29344-3_37.
Daniel Halperin, Victor Teixeira de Almeida, Lee Lee Choo, Shumo Chu, Paraschos Koutris, Dominik Moritz, Jennifer Ortiz, Vaspol Ruamviboonsuk, Jingjing Wang, Andrew Whitaker, Shengliang Xu, Magdalena Balazinska, Bill Howe, and Dan Suciu. Demonstration of the Myria big data management service. In Curtis E. Dyreson, Feifei Li, and M. Tamer Özsu, editors, International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pages 881-884. ACM, 2014. URL: http://dx.doi.org/10.1145/2588555.2594530.
Xiaocheng Hu, Miao Qiao, and Yufei Tao. Join dependency testing, Loomis-Whitney join, and triangle enumeration. In Proceedings of the 34th ACM Symposium on Principles of Database Systems, PODS 2015, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 291-301, 2015. URL: http://dx.doi.org/10.1145/2745754.2745768.
Xiaocheng Hu, Yufei Tao, and Chin-Wan Chung. Massive graph triangulation. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22-27, 2013, pages 325-336, 2013. URL: http://dx.doi.org/10.1145/2463676.2463704.
Howard J. Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for mapreduce. In SODA, pages 938-948, 2010.
Hartmut Klauck, Danupon Nanongkai, Gopal Pandurangan, and Peter Robinson. Distributed computation of large-scale graph problems. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA'15, pages 391-410. SIAM, 2015.
Paraschos Koutris and Dan Suciu. Parallel evaluation of conjunctive queries. In PODS, pages 223-234, 2011. URL: http://dx.doi.org/10.1145/1989284.1989310.
Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. Dremel: Interactive analysis of web-scale datasets. PVLDB, 3(1):330-339, 2010.
Hung Q. Ngo, Ely Porat, Christopher Ré, and Atri Rudra. Worst-case optimal join algorithms: [extended abstract]. In PODS, pages 37-48, 2012. URL: http://dx.doi.org/10.1145/2213556.2213565.
Rasmus Pagh and Francesco Silvestri. The input/output complexity of triangle enumeration. In PODS, pages 224-233, 2014. URL: http://dx.doi.org/10.1145/2594538.2594552.
Francesco Silvestri. Subgraph enumeration in massive graphs. CoRR, abs/1402.3444, 2014. URL: http://arxiv.org/abs/1402.3444.
DavidP. Woodruff and Qin Zhang. When distributed computation is communication expensive. In Yehuda Afek, editor, Distributed Computing, volume 8205 of Lecture Notes in Computer Science, pages 16-30. Springer Berlin Heidelberg, 2013. URL: http://dx.doi.org/10.1007/978-3-642-41527-2_2.
M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.

Worst-Case Optimal Algorithms for Parallel Query Processing

Authors Paraschos Koutris, Paul Beame, Dan Suciu

File

Document Identifiers

Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message