A Simple Parallel Algorithm for Natural Joins on Binary Relations
In PODS'17, Ketsman and Suciu gave an algorithm in the MPC model for computing the result of any natural join where every input relation has two attributes. Achieving an optimal load O(m/p^{1/ρ}) - where m is the total size of the input relations, p the number of machines, and ρ the fractional edge covering number of the join - their algorithm requires 7 rounds to finish. This paper presents a simpler algorithm that ensures the same load with 3 rounds (in fact, the second round incurs only a load of O(p²) to transmit certain statistics to assist machine allocation in the last round). Our algorithm is made possible by a new theorem that provides fresh insight on the structure of the problem, and brings us closer to understanding the intrinsic reason why joins on binary relations can be settled with load O(m/p^{1/ρ}).
Natural Joins
Conjunctive Queries
MPC Algorithms
Parallel Computing
Theory of computation~Database query processing and optimization (theory)
25:1-25:18
Regular Paper
Video of the Presentation: https://doi.org/10.5446/46826
Yufei
Tao
Yufei Tao
Chinese University of Hong Kong, Hong Kong
10.4230/LIPIcs.ICDT.2020.25
Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison-Wesley, 1995.
Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Alexander Rasin, and Avi Silberschatz. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Proceedings of the VLDB Endowment (PVLDB), 2(1):922-933, 2009.
Foto N. Afrati and Jeffrey D. Ullman. Optimizing Multiway Joins in a Map-Reduce Environment. IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(9):1282-1298, 2011.
Albert Atserias, Martin Grohe, and Daniel Marx. Size Bounds and Query Plans for Relational Joins. SIAM J. Comput., 42(4):1737-1767, 2013.
Paul Beame, Paraschos Koutris, and Dan Suciu. Communication Steps for Parallel Query Processing. Journal of the ACM (JACM), 64(6):40:1-40:58, 2017.
Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 137-150, 2004.
Xiao Hu, Paraschos Koutris, and Ke Yi. An External-Memory Work-Depth Model and Its Applications to Massively Parallel Join Algorithms. Manuscript, 2018.
Xiaocheng Hu, Miao Qiao, and Yufei Tao. I/O-efficient join dependency testing, loomis-whitney join, and triangle enumeration. Journal of Computer and System Sciences (JCSS), 82(8):1300-1315, 2016.
Bas Ketsman and Dan Suciu. A Worst-Case Optimal Multi-Round Algorithm for Parallel Computation of Conjunctive Queries. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 417-428, 2017.
Paraschos Koutris, Paul Beame, and Dan Suciu. Worst-Case Optimal Algorithms for Parallel Query Processing. In Proceedings of International Conference on Database Theory (ICDT), pages 8:1-8:18, 2016.
Hung Q. Ngo, Ely Porat, Christopher Re, and Atri Rudra. Worst-case Optimal Join Algorithms. Journal of the ACM (JACM), 65(3):16:1-16:40, 2018.
Rasmus Pagh and Francesco Silvestri. The input/output complexity of triangle enumeration. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 224-233, 2014.
Edward R. Scheinerman and Daniel H. Ullman. Fractional Graph Theory: A Rational Approach to the Theory of Graphs. Wiley, New York, 1997.
Todd L. Veldhuizen. Triejoin: A Simple, Worst-Case Optimal Join Algorithm. In Proceedings of International Conference on Database Theory (ICDT), pages 96-106, 2014.
Mihalis Yannakakis. Algorithms for Acyclic Database Schemes. In Very Large Data Bases, 7th International Conference, September 9-11, 1981, Cannes, France, Proceedings, pages 82-94, 1981.
Yufei Tao
Creative Commons Attribution 3.0 Unported license
https://creativecommons.org/licenses/by/3.0/legalcode