Search Results

Documents authored by Pietzuch, Peter


Document
Resource-Efficient Machine Learning (Dagstuhl Seminar 24311)

Authors: Oana Balmau, Matthias Boehm, Ana Klimovic, Peter Pietzuch, and Pinar Tözün

Published in: Dagstuhl Reports, Volume 14, Issue 7 (2025)


Abstract
Machine learning (ML) enables forecasts, even in real-time, at ever lower cost and better accuracy. Today, data scientists are able to collect more data, access that data faster, and apply more complex data analysis than ever. As a result, ML impacts a variety of fields such as healthcare, finance, and entertainment. The advances in ML are mainly due to the exponential evolution of hardware, the availability of large datasets, and the emergence of machine learning frameworks, which hide the complexities of the underlying hardware and boost the productivity of data scientists. On the other hand, the computational demands of powerful ML models have increased by several orders of magnitude in the past decade. A state-of-the-art large language model can cost millions of dollars to train in the cloud [The AI Index Report, 2024], without accounting for the electricity cost and carbon footprint [Dodge et al, 2022][Wu et al, 2024]. This makes the current rate of increase in model parameters, datasets, and compute budget unsustainable. To achieve more sustainable progress in ML in the future, it is essential to invest in more resource-, energy-, and cost-efficient solutions. In this Dagstuhl Seminar, our main goal was to reason critically about how we build software and hardware for end-to-end machine learning. The participants were experts from academia and industry across the fields of data management, machine learning, compilers, systems, and computer architecture, covering expertise in algorithmic optimizations in machine learning, job scheduling and resource management in distributed computing, parallel computing, and data management and processing.
During the seminar, we explored how to improve ML resource efficiency through a holistic view of the ML landscape, which includes data preparation and loading, continual retraining of models in dynamic data environments, compiling ML for specialized hardware accelerators, hardware/software co-design for ML, and serving models for real-time applications with low-latency requirements and constrained resource environments. We hope that the discussions and the work planned during the seminar will lead to an increased awareness of the utilization of modern hardware and kickstart future developments that minimize hardware underutilization while still enabling emerging applications powered by ML.

Cite as

Oana Balmau, Matthias Boehm, Ana Klimovic, Peter Pietzuch, and Pinar Tözün. Resource-Efficient Machine Learning (Dagstuhl Seminar 24311). In Dagstuhl Reports, Volume 14, Issue 7, pp. 153-169, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


BibTeX

@Article{balmau_et_al:DagRep.14.7.153,
  author =	{Balmau, Oana and Boehm, Matthias and Klimovic, Ana and Pietzuch, Peter and T\"{o}z\"{u}n, Pinar},
  title =	{{Resource-Efficient Machine Learning (Dagstuhl Seminar 24311)}},
  pages =	{153--169},
  journal =	{Dagstuhl Reports},
  ISSN =	{2192-5283},
  year =	{2025},
  volume =	{14},
  number =	{7},
  editor =	{Balmau, Oana and Boehm, Matthias and Klimovic, Ana and Pietzuch, Peter and T\"{o}z\"{u}n, Pinar},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagRep.14.7.153},
  URN =		{urn:nbn:de:0030-drops-229283},
  doi =		{10.4230/DagRep.14.7.153},
  annote =	{Keywords: Machine Learning, Modern Hardware, Sustainability, Energy-Efficiency, Benchmarking, Hardware-Software Co-Design, Data Management, Compilation}
}
Document
Scalable and Fault-tolerant Stateful Stream Processing

Authors: Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, and Peter Pietzuch

Published in: OASIcs, Volume 35, 2013 Imperial College Computing Student Workshop


Abstract
As users of "big data" applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the "pay-as-you-go" model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs—systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results. Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the checkpointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures.

Cite as

Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, and Peter Pietzuch. Scalable and Fault-tolerant Stateful Stream Processing. In 2013 Imperial College Computing Student Workshop. Open Access Series in Informatics (OASIcs), Volume 35, pp. 11-18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2013)


BibTeX

@InProceedings{castrofernandez_et_al:OASIcs.ICCSW.2013.11,
  author =	{Castro Fernandez, Raul and Migliavacca, Matteo and Kalyvianaki, Evangelia and Pietzuch, Peter},
  title =	{{Scalable and Fault-tolerant Stateful Stream Processing}},
  booktitle =	{2013 Imperial College Computing Student Workshop},
  pages =	{11--18},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-939897-63-7},
  ISSN =	{2190-6807},
  year =	{2013},
  volume =	{35},
  editor =	{Jones, Andrew V. and Ng, Nicholas},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.ICCSW.2013.11},
  URN =		{urn:nbn:de:0030-drops-42669},
  doi =		{10.4230/OASIcs.ICCSW.2013.11},
  annote =	{Keywords: Stateful stream processing, scalability, fault tolerance}
}