Distributed Streaming with Finite Memory
We introduce three formal models of distributed systems for query evaluation on massive databases: Distributed Streaming with Register Automata (DSAs), Distributed Streaming with Register Transducers (DSTs), and Distributed Streaming with Register Transducers and Joins (DSTJs). These models are based on the key-value paradigm where the input is transformed into a dataset of key-value pairs, and on each key a local computation is performed on the values associated with that key resulting in another set of key-value pairs. Computation proceeds in a constant number of rounds, where the result of the last round is the input to the next round, and transformation to key-value pairs is required to be generic. The difference between the three models is in the local computation part. In DSAs it is limited to making one pass over its input using a register automaton, while in DSTs it can make two passes: in the first pass it uses a finite-state automaton and in the second it uses a register transducer. The third model DSTJs is an extension of DSTs, where local computations are capable of constructing the Cartesian product of two sets. We obtain the following results: (1) DSAs can evaluate first-order queries over bounded degree databases; (2) DSTs can evaluate semijoin algebra queries over arbitrary databases; (3) DSTJs can evaluate the whole relational algebra over arbitrary databases; (4) DSTJs are strictly stronger than DSTs, which in turn, are strictly stronger than DSAs; (5) within DSAs, DSTs and DSTJs there is a strict hierarchy w.r.t. the number of rounds.
distributed systems
relational algebra
semijoin algebra
register automata
register transducers.
324-341
Regular Paper
Frank
Neven
Frank Neven
Nicole
Schweikardt
Nicole Schweikardt
Frédéric
Servais
Frédéric Servais
Tony
Tan
Tony Tan
10.4230/LIPIcs.ICDT.2015.324
F. Afrati, V. Borkar, M. Carey, N. Polyzotis, and J. Ullman. Map-reduce extensions and recursive queries. In ICDE, 2011.
F. Afrati, D. Fotakis, and J. Ullman. Enumerating subgraph instances using map-reduce. In ICDE, 2013.
F. Afrati, P. Koutris, D. Suciu, and J. Ullman. Parallel skyline queries. In ICDT, 2012.
F. Afrati, A. Dash Sarma, S. Salihoglu, and J. Ullman. Upper and lower bounds on the cost of a map-reduce computation. PVLDB, 6(4):277-288, 2013.
F. Afrati and J. Ullman. Optimizing joins in a map-reduce environment. In EDBT, 2010.
F. Afrati and J. Ullman. Transitive closure and recursive datalog implemented on clusters. In EDBT, 2012.
T. Ameloot, F. Neven, and J. Van den Bussche. Relational transducers for declarative networking. Journal of the ACM, 60(2):15, 2013.
Apache Bagel. Bagel. http://spark.apache.org/docs/0.7.3/bagel-programming-guide.html.
P. Beame, P. Koutris, and D. Suciu. Communication steps for parallel query processing. In PODS, 2013.
P. Beame, P. Koutris, and D. Suciu. Skew in parallel query processing. In PODS, 2014.
F. Chierichetti, R. Kumar, and A. Tomkins. Max-cover in map-reduce. In WWW, 2010.
E. Codd. A relational model of data for large shared data banks. Communication of the ACM, 13(6):377-387, 1970.
J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, 2004.
J. Dean and S. Ghemawat. Mapreduce: a flexible data processing tool. Communication of the ACM, 53(1):72-77, 2010.
A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a highlevel dataflow system on top of mapreduce: The pig experience. PVLDB, 2(2):1414-1425, 2009.
J. Hellerstein. The declarative imperative: experiences and conjectures in distributed logic. SIGMOD Record, 39(1):5-19, 2010.
M. Kaminski and N. Francez. Finite-memory automata. Theoretical Computer Science, 134(2):329-363, 1994.
H. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for mapreduce. In SODA, 2010.
P. Koutris and D. Suciu. Parallel evaluation of conjunctive queries. In PODS, 2011.
R. Kumar, B. Moseley, S. Vassilvitskii, and A. Vattani. Fast greedy algorithms in mapreduce and streaming. In SPAA, 2013.
S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii. Filtering: a method for solving graph problems in mapreduce. In SPAA, 2011.
F. Neven, T. Schwentick, and V. Vianu. Finite state machines for strings over infinite alphabets. ACM Transactions on Computational Logic, 5(3):403-435, 2004.
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD Conference, 2008.
Apache Pig. Pig. URL: http://pig.apache.org/.
http://pig.apache.org/
Apache Spark. Spark. URL: http://spark.apache.org.
http://spark.apache.org
Apache Spark. Spark programming guide. URL: http://spark.apache.org/docs/latest/programming-guide.html.
http://spark.apache.org/docs/latest/programming-guide.html
S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In WWW, 2011.
Y. Tao, W. Lin, and X. Xiao. Minimal mapreduce algorithms. In SIGMOD, 2013.
A. Thusoo, J. Sen Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using hadoop. In ICDE, 2010.
L. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, 1990.
T. White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale (3. ed., revised and updated). O'Reilly, 2012.
R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, and I. Stoica. Shark: Sql and rich analytics at scale. In SIGMOD, 2013.
M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.
Creative Commons Attribution 3.0 Unported license
https://creativecommons.org/licenses/by/3.0/legalcode