Scalable and Fault-tolerant Stateful Stream Processing

Authors Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, Peter Pietzuch



PDF
Thumbnail PDF

File

OASIcs.ICCSW.2013.11.pdf
  • Filesize: 1.03 MB
  • 8 pages

Document Identifiers

Author Details

Raul Castro Fernandez
Matteo Migliavacca
Evangelia Kalyvianaki
Peter Pietzuch

Cite As Get BibTex

Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, and Peter Pietzuch. Scalable and Fault-tolerant Stateful Stream Processing. In 2013 Imperial College Computing Student Workshop. Open Access Series in Informatics (OASIcs), Volume 35, pp. 11-18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2013) https://doi.org/10.4230/OASIcs.ICCSW.2013.11

Abstract

As users of "big data" applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the "pay-as-you-go" model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs—systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results.

Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the checkpointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures.

Subject Classification

Keywords
  • Stateful stream processing
  • scalability
  • fault tolerance

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail