DROPS

Document

Invited Paper

DOI: 10.4230/OASIcs.NG-RES.2024.1

HMB: Scheduling PREM-Like Real-Time Tasks at High Memory Bandwidth (Invited Paper)

Authors: Mohammadhassan Gholami Derouei, Paolo Valente, Marco Solieri, and Andrea Marongiu

Published in: OASIcs, Volume 117, Fifth Workshop on Next Generation Real-Time Embedded Systems (NG-RES 2024)

Abstract

Current homogeneous and heterogeneous computing systems reach high performance through parallelization. Yet, parallel execution of tasks entails non-trivial latency-vs-throughput issues when it comes to concurrent accesses to shared memory. In this respect, effective bandwidth regulation solutions do exist, and provide a basic mechanism to control the latency of memory accesses. Such solutions, though, are often cumbersome to deploy and to configure to guarantee both bounded latency and high utilization of the memory bandwidth. The problem is that memory latency varies non-linearly with the number and type of concurrent accesses, and the latter may in turn vary with time, often unpredictably. For this reason, previous attempts at memory regulation in scheduling solutions resulted either in poor real-time execution guarantees, or in severe underutilization of the memory bandwidth. In this paper, we outline High Memory Bandwidth (HMB), a scheduling solution that guarantees bounded response times to real-time task sets through memory regulation, while also reaching a high utilization of the memory bandwidth. Since the complete solution is complex, just like the problem it addresses, this preliminary work defines in full detail only the core mechanism. This mechanism builds on the notion of memory access slowdown experienced by any processor performing back-to-back memory operations; this slowdown is due to the interference generated by other processors also accessing the memory at the same time. The core mechanism assumes that each processor can tolerate a certain amount of slowdown before the timing behavior of the task(s) it is running is compromised. Each processor has a priority assigned: the higher the priority, the more stringent the timing requirements. The slowdown can be controlled by regulating with precision the maximum amount of system bandwidth each processor is allowed to use, based on its priority.
The proposed mechanism finds the maximum bandwidth each processor can use such that the highest number of processors simultaneously accessing memory is found (thus avoiding memory bandwidth underutilization) while guaranteeing that the slowdown of each processor is kept within the tolerated limits.

Cite as

Mohammadhassan Gholami Derouei, Paolo Valente, Marco Solieri, and Andrea Marongiu. HMB: Scheduling PREM-Like Real-Time Tasks at High Memory Bandwidth (Invited Paper). In Fifth Workshop on Next Generation Real-Time Embedded Systems (NG-RES 2024). Open Access Series in Informatics (OASIcs), Volume 117, pp. 1:1-1:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

@InProceedings{gholamiderouei_et_al:OASIcs.NG-RES.2024.1,
  author =	{Gholami Derouei, Mohammadhassan and Valente, Paolo and Solieri, Marco and Marongiu, Andrea},
  title =	{{HMB: Scheduling PREM-Like Real-Time Tasks at High Memory Bandwidth}},
  booktitle =	{Fifth Workshop on Next Generation Real-Time Embedded Systems (NG-RES 2024)},
  pages =	{1:1--1:18},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-313-3},
  ISSN =	{2190-6807},
  year =	{2024},
  volume =	{117},
  editor =	{Yomsi, Patrick Meumeu and Wildermann, Stefan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.NG-RES.2024.1},
  URN =		{urn:nbn:de:0030-drops-197049},
  doi =		{10.4230/OASIcs.NG-RES.2024.1},
  annote =	{Keywords: Heterogeneous systems, Parallel execution, Shared memory, Bandwidth regulation, Memory access, Real-time execution, Memory bandwidth utilization, High Memory Bandwidth (HMB), Memory access slowdown, Memory interference, Memory-centric scheduling}
}

Document

DOI: 10.4230/OASIcs.NG-RES.2020.3

Bao: A Lightweight Static Partitioning Hypervisor for Modern Multi-Core Embedded Systems

Authors: José Martins, Adriano Tavares, Marco Solieri, Marko Bertogna, and Sandro Pinto

Published in: OASIcs, Volume 77, Workshop on Next Generation Real-Time Embedded Systems (NG-RES 2020)

Abstract

Given the increasingly complex and mixed-criticality nature of modern embedded systems, virtualization emerges as a natural solution to achieve strong spatial and temporal isolation. Widely used hypervisors such as KVM and Xen were not designed with embedded constraints and requirements in mind. The static partitioning architecture pioneered by Jailhouse seems to address embedded concerns. However, Jailhouse still depends on Linux to boot and manage its VMs. In this paper, we present the Bao hypervisor, a minimal, standalone and clean-slate implementation of the static partitioning architecture for Armv8 and RISC-V platforms. Preliminary results regarding size, boot, performance, and interrupt latency show this approach incurs only minimal virtualization overhead. Bao will soon be publicly available, in hopes of engaging both industry and academia on improving Bao’s safety, security, and real-time guarantees.

Cite as

José Martins, Adriano Tavares, Marco Solieri, Marko Bertogna, and Sandro Pinto. Bao: A Lightweight Static Partitioning Hypervisor for Modern Multi-Core Embedded Systems. In Workshop on Next Generation Real-Time Embedded Systems (NG-RES 2020). Open Access Series in Informatics (OASIcs), Volume 77, pp. 3:1-3:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

@InProceedings{martins_et_al:OASIcs.NG-RES.2020.3,
  author =	{Martins, Jos\'{e} and Tavares, Adriano and Solieri, Marco and Bertogna, Marko and Pinto, Sandro},
  title =	{{Bao: A Lightweight Static Partitioning Hypervisor for Modern Multi-Core Embedded Systems}},
  booktitle =	{Workshop on Next Generation Real-Time Embedded Systems (NG-RES 2020)},
  pages =	{3:1--3:14},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-136-8},
  ISSN =	{2190-6807},
  year =	{2020},
  volume =	{77},
  editor =	{Bertogna, Marko and Terraneo, Federico},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.NG-RES.2020.3},
  URN =		{urn:nbn:de:0030-drops-117795},
  doi =		{10.4230/OASIcs.NG-RES.2020.3},
  annote =	{Keywords: Virtualization, hypervisor, static partitioning, safety, security, real-time, embedded systems, Arm, RISC-V}
}

Document

Artifact

DOI: 10.4230/DARTS.5.1.4

API Comparison of CPU-To-GPU Command Offloading Latency on Embedded Platforms (Artifact)

Authors: Roberto Cavicchioli, Nicola Capodieci, Marco Solieri, and Marko Bertogna

Published in: DARTS, Volume 5, Issue 1, Special Issue of the 31st Euromicro Conference on Real-Time Systems (ECRTS 2019)

Abstract

High-performance heterogeneous embedded platforms allow offloading of parallel workloads to an integrated accelerator, such as a General Purpose-Graphic Processing Unit (GP-GPU). A time-predictable characterization of task submission is a must in real-time applications. We provide a profiler of the time spent by the CPU for submitting a stereotypical GP-GPU workload shaped as a Deep Neural Network of parameterized complexity. The submission is performed using the latest APIs available: NVIDIA CUDA, including its various techniques, and Vulkan. Complete automation for the test on Jetson Xavier is also provided by scripts that install software dependencies, run the experiments, and collect results in a PDF report.

Cite as

Roberto Cavicchioli, Nicola Capodieci, Marco Solieri, and Marko Bertogna. API Comparison of CPU-To-GPU Command Offloading Latency on Embedded Platforms (Artifact). In Special Issue of the 31st Euromicro Conference on Real-Time Systems (ECRTS 2019). Dagstuhl Artifacts Series (DARTS), Volume 5, Issue 1, pp. 4:1-4:3, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

@Article{cavicchioli_et_al:DARTS.5.1.4,
  author =	{Cavicchioli, Roberto and Capodieci, Nicola and Solieri, Marco and Bertogna, Marko},
  title =	{{API Comparison of CPU-To-GPU Command Offloading Latency on Embedded Platforms}},
  pages =	{4:1--4:3},
  journal =	{Dagstuhl Artifacts Series},
  ISSN =	{2509-8195},
  year =	{2019},
  volume =	{5},
  number =	{1},
  editor =	{Cavicchioli, Roberto and Capodieci, Nicola and Solieri, Marco and Bertogna, Marko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/DARTS.5.1.4},
  URN =		{urn:nbn:de:0030-drops-107322},
  doi =		{10.4230/DARTS.5.1.4},
  annote =	{Keywords: GPU, Applications, Heterogeneous systems}
}

Document

DOI: 10.4230/LIPIcs.ECRTS.2019.22

Novel Methodologies for Predictable CPU-To-GPU Command Offloading

Authors: Roberto Cavicchioli, Nicola Capodieci, Marco Solieri, and Marko Bertogna

Published in: LIPIcs, Volume 133, 31st Euromicro Conference on Real-Time Systems (ECRTS 2019)

Abstract

There is an increasing industrial and academic interest towards a more predictable characterization of real-time tasks on high-performance heterogeneous embedded platforms, where a host system offloads parallel workloads to an integrated accelerator, such as General Purpose-Graphic Processing Units (GP-GPUs). In this paper, we analyze an important aspect that has not yet been considered in the real-time literature, and that may significantly affect real-time performance if not properly treated, i.e., the time spent by the CPU for submitting GP-GPU operations. We will show that the impact of CPU-to-GPU kernel submissions may be indeed relevant for typical real-time workloads, and that it should be properly factored in when deriving an integrated schedulability analysis for the considered platforms. This is the case when an application is composed of many small and consecutive GPU compute/copy operations. While existing techniques mitigate this issue by batching kernel calls into a reduced number of persistent kernel invocations, in this work we present and evaluate three other approaches that are made possible by recently released versions of the NVIDIA CUDA GP-GPU API, and by Vulkan, a novel open standard GPU API that allows an improved control of GPU command submissions. We will show that this added control may significantly improve the application performance and predictability due to a substantial reduction in CPU-to-GPU driver interactions, making Vulkan an interesting candidate for becoming the state-of-the-art API for heterogeneous Real-Time systems. Our findings are evaluated on a latest generation NVIDIA Jetson AGX Xavier embedded board, executing typical workloads involving Deep Neural Networks of parameterized complexity.

Cite as

Roberto Cavicchioli, Nicola Capodieci, Marco Solieri, and Marko Bertogna. Novel Methodologies for Predictable CPU-To-GPU Command Offloading. In 31st Euromicro Conference on Real-Time Systems (ECRTS 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 133, pp. 22:1-22:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

@InProceedings{cavicchioli_et_al:LIPIcs.ECRTS.2019.22,
  author =	{Cavicchioli, Roberto and Capodieci, Nicola and Solieri, Marco and Bertogna, Marko},
  title =	{{Novel Methodologies for Predictable CPU-To-GPU Command Offloading}},
  booktitle =	{31st Euromicro Conference on Real-Time Systems (ECRTS 2019)},
  pages =	{22:1--22:22},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-110-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{133},
  editor =	{Quinton, Sophie},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ECRTS.2019.22},
  URN =		{urn:nbn:de:0030-drops-107595},
  doi =		{10.4230/LIPIcs.ECRTS.2019.22},
  annote =	{Keywords: Heterogeneous systems, GPU, CUDA, Vulkan}
}

Document

DOI: 10.4230/LIPIcs.FSCD.2017.17

Is the Optimal Implementation Inefficient? Elementarily Not

Authors: Stefano Guerrini and Marco Solieri

Published in: LIPIcs, Volume 84, 2nd International Conference on Formal Structures for Computation and Deduction (FSCD 2017)

Abstract

Sharing graphs are a local and asynchronous implementation of lambda-calculus beta-reduction (or linear logic proof-net cut-elimination) that avoids useless duplications. Empirical benchmarks suggest that they are one of the most efficient machineries, when one wants to fully exploit the higher-order features of lambda-calculus. However, we still lack confirming grounds with theoretical solidity to dispel uncertainties about the adoption of sharing graphs. Aiming at analysing in detail the worst-case overhead cost of sharing operators, we restrict to the case of elementary and light linear logic, two subsystems with bounded computational complexity of multiplicative exponential linear logic. In these two cases, the bookkeeping component is unnecessary, and sharing graphs are simplified to the so-called "abstract algorithm". By a modular cost comparison over a syntactical simulation, we prove that the overhead of shared reductions is quadratically bounded by the cost of the naive implementation, i.e. proof-net reduction. This result generalises and strengthens a previous complexity result, and implies that the price of sharing is negligible, if compared to the obtainable benefits on reductions requiring a large amount of duplication.

Cite as

Stefano Guerrini and Marco Solieri. Is the Optimal Implementation Inefficient? Elementarily Not. In 2nd International Conference on Formal Structures for Computation and Deduction (FSCD 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 84, pp. 17:1-17:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

@InProceedings{guerrini_et_al:LIPIcs.FSCD.2017.17,
  author =	{Guerrini, Stefano and Solieri, Marco},
  title =	{{Is the Optimal Implementation Inefficient? Elementarily Not}},
  booktitle =	{2nd International Conference on Formal Structures for Computation and Deduction (FSCD 2017)},
  pages =	{17:1--17:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-047-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{84},
  editor =	{Miller, Dale},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.FSCD.2017.17},
  URN =		{urn:nbn:de:0030-drops-77337},
  doi =		{10.4230/LIPIcs.FSCD.2017.17},
  annote =	{Keywords: optimality, sharing graphs, lambda-calculus, complexity, linear logic, proof nets}
}

5 Search Results for "Solieri, Marco"
