Locality-Aware Distribution Schemes

Sundarmurthy, Bruhathi; Koutris, Paraschos; Naughton, Jeffrey

doi:10.4230/LIPIcs.ICDT.2021.22

Abstract

One of the bottlenecks in parallel query processing is the cost of shuffling data across nodes in a cluster. Ideally, given a distribution of the data across the nodes and a query, we want to execute the query by performing only local computation and no communication: in this case, the query is called parallel-correct with respect to the data distribution. Previous work studied this problem for Conjunctive Queries in the case where the distribution scheme is oblivious, i.e., the location of each tuple depends only on the tuple and is independent of the instance. In this work, we show that oblivious schemes have a fundamental theoretical limitation, and initiate the formal study of distribution schemes that are locality-aware. In particular, we focus on a class of distribution schemes called co-hash distribution schemes, which are widely used in parallel systems. In co-hash partitioning, some tables are initially hashed, and the remaining tables are co-located so that a join condition is always satisfied. Given a co-hash distribution scheme, we formally study the complexity of deciding various desirable properties, including obliviousness and redundancy. Then, for a given Conjunctive Query and co-hash scheme, we determine the computational complexity of deciding whether the query is parallel-correct. We also explore a stronger notion of correctness, called parallel disjoint correctness, which guarantees that the query result will be disjointly partitioned across nodes, i.e., there is no duplication of results.

Cite As Get BibTex

Bruhathi Sundarmurthy, Paraschos Koutris, and Jeffrey Naughton. Locality-Aware Distribution Schemes. In 24th International Conference on Database Theory (ICDT 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 186, pp. 22:1-22:25, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021) https://doi.org/10.4230/LIPIcs.ICDT.2021.22

Author Details

Bruhathi Sundarmurthy

University of Wisconsin-Madison, Madison, WI, USA

Paraschos Koutris

University of Wisconsin-Madison, Madison, WI, USA

Jeffrey Naughton

University of Wisconsin-Madison, Madison, WI, USA

Funding

Koutris, Paraschos: NSF CRII-1850348.

References

Foto N. Afrati, Paraschos Koutris, Dan Suciu, and Jeffrey D. Ullman. Parallel skyline queries. In ICDT, pages 274-284, 2012. URL: https://doi.org/10.1145/2274576.2274605.
Foto N. Afrati and Jeffrey D. Ullman. Optimizing joins in a map-reduce environment. In EDBT, pages 99-110, 2010. URL: https://doi.org/10.1145/1739041.1739056.
Tom J. Ameloot, Gaetano Geck, Bas Ketsman, Frank Neven, and Thomas Schwentick. Parallel-correctness and transferability for conjunctive queries. In PODS, 2015. URL: https://doi.org/10.1145/2745754.2745759.
Tom J. Ameloot, Gaetano Geck, Bas Ketsman, Frank Neven, and Thomas Schwentick. Data partitioning for single-round multi-join evaluation in massively parallel systems. SIGMOD Record, 45(1):33-40, 2016. URL: https://doi.org/10.1145/2949741.2949750.
Paul Beame, Paraschos Koutris, and Dan Suciu. Communication steps for parallel query processing. In PODS, pages 273-284, 2013. URL: https://doi.org/10.1145/2463664.2465224.
Paul Beame, Paraschos Koutris, and Dan Suciu. Communication steps for parallel query processing. J. ACM, 64(6):40:1-40:58, 2017. URL: https://doi.org/10.1145/3125644.
George Eadon, Eugene Inseok Chong, Shrikanth Shankar, Ananth Raghavan, Jagannathan Srinivasan, and Souripriya Das. Supporting table partitioning by reference in Oracle. In SIGMOD, pages 1111-1122, 2008. URL: https://doi.org/10.1145/1376616.1376727.
Sumit Ganguly, Avi Silberschatz, and Shalom Tsur. Parallel bottom-up processing of Datalog queries. J. Log. Program., 14(1-2):101-126, October 1992. URL: https://doi.org/10.1016/0743-1066(92)90048-8.
Gaetano Geck, Bas Ketsman, Frank Neven, and Thomas Schwentick. Parallel-correctness and containment for conjunctive queries with union and negation. In ICDT, pages 9:1-9:17, 2016. URL: https://doi.org/10.4230/LIPIcs.ICDT.2016.9.
Gaetano Geck, Bas Ketsman, Frank Neven, and Thomas Schwentick. Parallel-correctness and containment for conjunctive queries with union and negation. ACM Trans. Comput. Logic, 20(3):18:1-18:24, June 2019. URL: https://doi.org/10.1145/3329120.
Gaetano Geck, Frank Neven, and Thomas Schwentick. Distribution constraints: The chase for distributed data. In ICDT, pages 13:1-13:19, 2020. URL: https://doi.org/10.4230/LIPIcs.ICDT.2020.13.
Bas Ketsman, Aws Albarghouthi, and Paraschos Koutris. Distribution policies for Datalog. In ICDT, pages 17:1-17:22, 2018. URL: https://doi.org/10.4230/LIPIcs.ICDT.2018.17.
Bas Ketsman, Frank Neven, and Brecht Vandevoort. Parallel-correctness and transferability for conjunctive queries under bag semantics. In ICDT, pages 18:1-18:16, 2018. URL: https://doi.org/10.4230/LIPIcs.ICDT.2018.18.
Paraschos Koutris and Dan Suciu. Parallel evaluation of conjunctive queries. In PODS, pages 223-234, 2011. URL: https://doi.org/10.1145/1989284.1989310.
Paraschos Koutris and Dan Suciu. A guide to formal analysis of join processing in massively parallel systems. SIGMOD Record, 45(4):18-27, 2016. URL: https://doi.org/10.1145/3092931.3092934.
Yi Lu, Anil Shanbhag, Alekh Jindal, and Samuel Madden. AdaptDB: Adaptive partitioning for distributed joins. Proc. VLDB Endow., 10(5):589-600, January 2017. URL: https://doi.org/10.14778/3055540.3055551.
Rimma Nehme and Nicolas Bruno. Automated partitioning design in parallel database systems. In SIGMOD, pages 1137-1148, 2011. URL: https://doi.org/10.1145/1989323.1989444.
Frank Neven, Thomas Schwentick, Christopher Spinrath, and Brecht Vandevoort. Parallel-correctness and parallel-boundedness for Datalog programs. In ICDT, pages 14:1-14:19, 2019. URL: https://doi.org/10.4230/LIPIcs.ICDT.2019.14.
Jags Ramnarayan, Barzan Mozafari, Sumedh Wale, Sudhir Menon, Neeraj Kumar, Hemant Bhanawat, Soubhik Chakraborty, Yogesh Mahajan, Rishitesh Mishra, and Kishor Bachhav. SnappyData: A hybrid transactional analytical store built on Spark. In SIGMOD, pages 2153-2156, 2016. URL: https://doi.org/10.1145/2882903.2899408.
Jeff Shute, Radek Vingralek, Bart Samwel, Ben Handy, Chad Whipkey, Eric Rollins, Mircea Oancea, Kyle Littlefield, David Menestrina, Stephan Ellner, John Cieslewicz, Ian Rae, Traian Stancescu, and Himani Apte. F1: A distributed SQL database that scales. Proc. VLDB Endow., 6(11):1068-1079, August 2013. URL: https://doi.org/10.14778/2536222.2536232.
Young-Kyoon Suh, Ahmad Ghazal, Alain Crolotte, and Pekka Kostamaa. A new tool for multi-level partitioning in Teradata. In CIKM, pages 2214-2218, 2012. URL: https://doi.org/10.1145/2396761.2398604.
Khai Q. Tran, Jeffrey F. Naughton, Bruhathi Sundarmurthy, and Dimitris Tsirogiannis. JECB: A join-extension, code-based approach to OLTP data partitioning. In SIGMOD, pages 39-50, 2014. URL: https://doi.org/10.1145/2588555.2610532.
Erfan Zamanian, Carsten Binnig, and Abdallah Salama. Locality-aware partitioning in parallel database systems. In SIGMOD, pages 17-30, 2015. URL: https://doi.org/10.1145/2723372.2723718.

Locality-Aware Distribution Schemes

Authors Bruhathi Sundarmurthy, Paraschos Koutris, Jeffrey Naughton

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message