Distributed Streaming with Finite Memory

Authors Frank Neven, Nicole Schweikardt, Frédéric Servais, Tony Tan



PDF
Thumbnail PDF

File

LIPIcs.ICDT.2015.324.pdf
  • Filesize: 0.57 MB
  • 18 pages

Document Identifiers

Author Details

Frank Neven
Nicole Schweikardt
Frédéric Servais
Tony Tan

Cite As Get BibTex

Frank Neven, Nicole Schweikardt, Frédéric Servais, and Tony Tan. Distributed Streaming with Finite Memory. In 18th International Conference on Database Theory (ICDT 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 31, pp. 324-341, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015) https://doi.org/10.4230/LIPIcs.ICDT.2015.324

Abstract

We introduce three formal models of distributed systems for query evaluation on massive databases: Distributed Streaming with Register Automata (DSAs), Distributed Streaming with Register Transducers (DSTs), and Distributed Streaming with Register Transducers and Joins (DSTJs). These models are based on the key-value paradigm where the input is transformed into a dataset of key-value pairs, and on each key a local computation is performed on the values associated with that key resulting in another set of key-value pairs. Computation proceeds in a constant number of rounds, where the result of the last round is the input to the next round, and transformation to key-value pairs is required to be generic. The difference between the three models is in the local computation part. In DSAs it is limited to making one pass over its input using a register automaton, while in DSTs it can make two passes: in the first pass it uses a finite-state automaton and in the second it uses a register transducer. The third model DSTJs is an extension of DSTs, where local computations are capable of constructing the Cartesian product of two sets. We obtain the following results: (1) DSAs can evaluate first-order queries over bounded degree databases; (2) DSTs can evaluate semijoin algebra queries over arbitrary databases; (3) DSTJs can evaluate the whole relational algebra over arbitrary databases; (4) DSTJs are strictly stronger than DSTs, which in turn, are strictly stronger than DSAs; (5) within DSAs, DSTs and DSTJs there is a strict hierarchy w.r.t. the  number of rounds.

Subject Classification

Keywords
  • distributed systems
  • relational algebra
  • semijoin algebra
  • register automata
  • register transducers.

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. F. Afrati, V. Borkar, M. Carey, N. Polyzotis, and J. Ullman. Map-reduce extensions and recursive queries. In ICDE, 2011. Google Scholar
  2. F. Afrati, D. Fotakis, and J. Ullman. Enumerating subgraph instances using map-reduce. In ICDE, 2013. Google Scholar
  3. F. Afrati, P. Koutris, D. Suciu, and J. Ullman. Parallel skyline queries. In ICDT, 2012. Google Scholar
  4. F. Afrati, A. Dash Sarma, S. Salihoglu, and J. Ullman. Upper and lower bounds on the cost of a map-reduce computation. PVLDB, 6(4):277-288, 2013. Google Scholar
  5. F. Afrati and J. Ullman. Optimizing joins in a map-reduce environment. In EDBT, 2010. Google Scholar
  6. F. Afrati and J. Ullman. Transitive closure and recursive datalog implemented on clusters. In EDBT, 2012. Google Scholar
  7. T. Ameloot, F. Neven, and J. Van den Bussche. Relational transducers for declarative networking. Journal of the ACM, 60(2):15, 2013. Google Scholar
  8. Apache Bagel. Bagel. http://spark.apache.org/docs/0.7.3/bagel-programming-guide.html. Google Scholar
  9. P. Beame, P. Koutris, and D. Suciu. Communication steps for parallel query processing. In PODS, 2013. Google Scholar
  10. P. Beame, P. Koutris, and D. Suciu. Skew in parallel query processing. In PODS, 2014. Google Scholar
  11. F. Chierichetti, R. Kumar, and A. Tomkins. Max-cover in map-reduce. In WWW, 2010. Google Scholar
  12. E. Codd. A relational model of data for large shared data banks. Communication of the ACM, 13(6):377-387, 1970. Google Scholar
  13. J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, 2004. Google Scholar
  14. J. Dean and S. Ghemawat. Mapreduce: a flexible data processing tool. Communication of the ACM, 53(1):72-77, 2010. Google Scholar
  15. A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a highlevel dataflow system on top of mapreduce: The pig experience. PVLDB, 2(2):1414-1425, 2009. Google Scholar
  16. J. Hellerstein. The declarative imperative: experiences and conjectures in distributed logic. SIGMOD Record, 39(1):5-19, 2010. Google Scholar
  17. M. Kaminski and N. Francez. Finite-memory automata. Theoretical Computer Science, 134(2):329-363, 1994. Google Scholar
  18. H. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for mapreduce. In SODA, 2010. Google Scholar
  19. P. Koutris and D. Suciu. Parallel evaluation of conjunctive queries. In PODS, 2011. Google Scholar
  20. R. Kumar, B. Moseley, S. Vassilvitskii, and A. Vattani. Fast greedy algorithms in mapreduce and streaming. In SPAA, 2013. Google Scholar
  21. S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii. Filtering: a method for solving graph problems in mapreduce. In SPAA, 2011. Google Scholar
  22. F. Neven, T. Schwentick, and V. Vianu. Finite state machines for strings over infinite alphabets. ACM Transactions on Computational Logic, 5(3):403-435, 2004. Google Scholar
  23. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD Conference, 2008. Google Scholar
  24. Apache Pig. Pig. URL: http://pig.apache.org/.
  25. Apache Spark. Spark. URL: http://spark.apache.org.
  26. Apache Spark. Spark programming guide. URL: http://spark.apache.org/docs/latest/programming-guide.html.
  27. S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In WWW, 2011. Google Scholar
  28. Y. Tao, W. Lin, and X. Xiao. Minimal mapreduce algorithms. In SIGMOD, 2013. Google Scholar
  29. A. Thusoo, J. Sen Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using hadoop. In ICDE, 2010. Google Scholar
  30. L. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, 1990. Google Scholar
  31. T. White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale (3. ed., revised and updated). O'Reilly, 2012. Google Scholar
  32. R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, and I. Stoica. Shark: Sql and rich analytics at scale. In SIGMOD, 2013. Google Scholar
  33. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail