OASIcs, Volume 141

17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026)




Editors

Davide Baroffio
  • Politecnico di Milano, Italy
Paola Busia
  • University of Cagliari, Italy
Lev Denisov
  • Politecnico di Milano, Italy
Nitin Shukla
  • CINECA, Casalecchio di Reno, Italy

Publication Details

  • published at: 2026-04-10
  • Publisher: Schloss Dagstuhl – Leibniz-Zentrum für Informatik
  • ISBN: 978-3-95977-416-1

Documents
Complete Volume
OASIcs, Volume 141, PARMA-DITAM 2026, Complete Volume

Authors: Davide Baroffio, Paola Busia, Lev Denisov, and Nitin Shukla


Abstract
OASIcs, Volume 141, PARMA-DITAM 2026, Complete Volume

Cite as

17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026). Open Access Series in Informatics (OASIcs), Volume 141, pp. 1-110, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)



@Proceedings{baroffio_et_al:OASIcs.PARMA-DITAM.2026,
  title =	{{OASIcs, Volume 141, PARMA-DITAM 2026, Complete Volume}},
  booktitle =	{17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026)},
  pages =	{1--110},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-416-1},
  ISSN =	{2190-6807},
  year =	{2026},
  volume =	{141},
  editor =	{Baroffio, Davide and Busia, Paola and Denisov, Lev and Shukla, Nitin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.PARMA-DITAM.2026},
  URN =		{urn:nbn:de:0030-drops-256940},
  doi =		{10.4230/OASIcs.PARMA-DITAM.2026},
  annote =	{Keywords: OASIcs, Volume 141, PARMA-DITAM 2026, Complete Volume}
}
Front Matter
Front Matter, Table of Contents, Preface, Conference Organization

Authors: Davide Baroffio, Paola Busia, Lev Denisov, and Nitin Shukla


Abstract
Front Matter, Table of Contents, Preface, Conference Organization

Cite as

Davide Baroffio, Paola Busia, Lev Denisov, and Nitin Shukla. Front Matter, Table of Contents, Preface, Conference Organization. In 17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026). Open Access Series in Informatics (OASIcs), Volume 141, pp. 0:i-0:x, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)



@InProceedings{baroffio_et_al:OASIcs.PARMA-DITAM.2026.0,
  author =	{Baroffio, Davide and Busia, Paola and Denisov, Lev and Shukla, Nitin},
  title =	{{Front Matter, Table of Contents, Preface, Conference Organization}},
  booktitle =	{17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026)},
  pages =	{0:i--0:x},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-416-1},
  ISSN =	{2190-6807},
  year =	{2026},
  volume =	{141},
  editor =	{Baroffio, Davide and Busia, Paola and Denisov, Lev and Shukla, Nitin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.PARMA-DITAM.2026.0},
  URN =		{urn:nbn:de:0030-drops-256934},
  doi =		{10.4230/OASIcs.PARMA-DITAM.2026.0},
  annote =	{Keywords: Front Matter, Table of Contents, Preface, Conference Organization}
}
Invited Talk
Distributed Task Execution: Opportunities, Challenges and Lessons Learnt from OmpSs-2@Cluster (Invited Talk)

Authors: Paul Carpenter, Omar Shaaban, Juliette Fournis d'Albiat, and Isabel Piedrahita


Abstract
This talk will present recent advances in extending OmpSs-2 to distributed-memory systems, highlighting three contributions and the associated challenges. First, OmpSs-2@Cluster employs a common address space and weak accesses to support concurrent task creation and dataflow execution across nodes; achieving good performance and scalability on 16 to 32 nodes requires detailed performance analysis together with a set of optimizations and runtime techniques, which I will outline in the talk. Second, I will describe how task offloading, in combination with BSC’s Dynamic Load Balancing (DLB), enables OmpSs-2@Cluster to mitigate load imbalance in MPI + OmpSs-2 programs with minimal application changes. Third, I will explain how the runtime can exploit the iterative structure of certain task dependency graphs to precompute communications and execute iterative regions efficiently, yielding performance and scalability comparable to state-of-the-art asynchronous MPI+X. Together, these results indicate that distributed tasking can combine productivity, adaptability, and high performance in modern HPC applications.

Cite as

Paul Carpenter, Omar Shaaban, Juliette Fournis d'Albiat, and Isabel Piedrahita. Distributed Task Execution: Opportunities, Challenges and Lessons Learnt from OmpSs-2@Cluster (Invited Talk). In 17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026). Open Access Series in Informatics (OASIcs), Volume 141, pp. 1:1-1:7, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)



@InProceedings{carpenter_et_al:OASIcs.PARMA-DITAM.2026.1,
  author =	{Carpenter, Paul and Shaaban, Omar and d'Albiat, Juliette Fournis and Piedrahita, Isabel},
  title =	{{Distributed Task Execution: Opportunities, Challenges and Lessons Learnt from OmpSs-2@Cluster}},
  booktitle =	{17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026)},
  pages =	{1:1--1:7},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-416-1},
  ISSN =	{2190-6807},
  year =	{2026},
  volume =	{141},
  editor =	{Baroffio, Davide and Busia, Paola and Denisov, Lev and Shukla, Nitin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.PARMA-DITAM.2026.1},
  URN =		{urn:nbn:de:0030-drops-256685},
  doi =		{10.4230/OASIcs.PARMA-DITAM.2026.1},
  annote =	{Keywords: Task-based programming, distributed-memory clusters, programming models, runtime systems, task scheduling, data dependency management, load balancing, asynchronous communication}
}
SLURM-Managed HyperParameter Optimization

Authors: Anusha Chattopadhyay, Hendrik Borras, Bernhard Klein, and Holger Fröning


Abstract
Hyperparameter optimization (HPO) is essential for achieving state-of-the-art performance in machine learning, yet it is computationally demanding, particularly on shared or resource-constrained clusters. We present a system that integrates the Asynchronous Successive Halving Algorithm (ASHA) with SEML, the SLURM Experiment Management Library - an experiment orchestration layer that provides declarative configuration, provenance, metric tracking, and robust SLURM job management. The resulting open-source tool enables scalable, fault-tolerant HPO on SLURM-managed infrastructure: SEML handles experiment specification, versioning, and scheduling, while ASHA performs asynchronous early stopping and resource reallocation to concentrate computation on promising configurations. Overall, the system streamlines experiment lifecycle management, enables distributed evaluations with minimal manual effort, and reduces the time required to reach high-quality configurations compared to conventional Grid and Random Search methods under similar compute budgets.
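As a rough illustration of the selection principle behind ASHA, here is a minimal synchronous successive-halving sketch in Python. ASHA itself is asynchronous and integrated with SEML and SLURM; the objective, parameters, and function names below are toy assumptions, not the authors' implementation:

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2, rounds=3):
    """Minimal synchronous sketch of successive halving: evaluate all
    configs at a small budget, keep the top 1/eta, multiply the budget
    by eta, and repeat. Lower evaluation scores are better."""
    budget = min_budget
    survivors = list(configs)
    for _ in range(rounds):
        scored = sorted(survivors, key=lambda c: evaluate(c, budget))
        keep = max(1, len(scored) // eta)
        survivors = scored[:keep]   # drop the least promising configs
        budget *= eta               # give survivors more resources
        if len(survivors) == 1:
            break
    return survivors[0]

# toy objective: distance to an "ideal" learning rate of 0.1,
# plus noise that shrinks as the budget grows
best = successive_halving(
    configs=[0.5, 0.1, 0.9, 0.3],
    evaluate=lambda lr, b: abs(lr - 0.1) + 1.0 / b,
)
```

Each round drops the less promising half of the configurations and doubles the budget for the survivors, which is the mechanism that concentrates compute on promising configurations under a fixed budget.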

Cite as

Anusha Chattopadhyay, Hendrik Borras, Bernhard Klein, and Holger Fröning. SLURM-Managed HyperParameter Optimization. In 17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026). Open Access Series in Informatics (OASIcs), Volume 141, pp. 2:1-2:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)



@InProceedings{chattopadhyay_et_al:OASIcs.PARMA-DITAM.2026.2,
  author =	{Chattopadhyay, Anusha and Borras, Hendrik and Klein, Bernhard and Fr\"{o}ning, Holger},
  title =	{{SLURM-Managed HyperParameter Optimization}},
  booktitle =	{17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026)},
  pages =	{2:1--2:13},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-416-1},
  ISSN =	{2190-6807},
  year =	{2026},
  volume =	{141},
  editor =	{Baroffio, Davide and Busia, Paola and Denisov, Lev and Shukla, Nitin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.PARMA-DITAM.2026.2},
  URN =		{urn:nbn:de:0030-drops-256697},
  doi =		{10.4230/OASIcs.PARMA-DITAM.2026.2},
  annote =	{Keywords: Hyperparameter optimization, Asynchronous Successive Halving Algorithm (ASHA), Experiment management, SLURM, SEML, Open Source}
}
High Performance Visualization with VisIVO Across Cloud and HPC

Authors: Umer Arshad, Eva Sciacca, Nicola Tuccari, Fabio Pitari, and Giuseppa Muscianisi


Abstract
The rapid growth of data in Astrophysics and Cosmology creates significant challenges that require scalable computing and advanced visualization solutions. Cineca is Italy’s largest supercomputing center and a leading global provider of high-performance computing (HPC) services. This paper presents the integration of the VisIVO scientific visualization framework with the cloud-based InterActive Computing (IAC) service at Cineca. This integration enables GPU-accelerated, real-time visualization on HPC resources via a browser-based Jupyter interface. A new dedicated Python wrapper and a custom Jupyter kernel enable VisIVO to run smoothly from interactive notebooks, avoid command-line operations, and visualize data directly on HPC compute nodes. Furthermore, we enabled cloud-oriented RESTful APIs, built with the Flask framework, to perform VisIVO operations remotely via simple web services. This setup hides the backend’s complexity and simplifies connections with other applications. Our framework increases system accessibility, ensures reproducibility of results, and supports rapid data exploration for large astrophysical simulations. The system was evaluated using real-world cases, including visual analysis of cosmological simulations generated using the OpenGadget3 code. Results indicate that the system is scalable and reliable, and that it facilitates interactive scientific discovery on HPC infrastructures.
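The paper's actual Python wrapper is not reproduced here, but the general pattern it describes, hiding a visualization CLI behind keyword arguments, can be sketched as follows. The `visivo` executable name and the flag scheme are hypothetical placeholders, not the real VisIVO interface:

```python
import shlex

class VisIVOWrapper:
    """Sketch of a Python wrapper that turns keyword arguments into a
    command line for a visualization CLI tool, so notebook users never
    touch the shell. Executable name and flags are illustrative."""

    def __init__(self, executable="visivo"):
        self.executable = executable

    def build_command(self, input_file, **options):
        # map keyword options to --key value pairs, sorted for determinism
        args = [self.executable, input_file]
        for key, value in sorted(options.items()):
            args += [f"--{key.replace('_', '-')}", str(value)]
        return args

    def command_string(self, input_file, **options):
        # shell-safe string form, e.g. for logging or job scripts
        return " ".join(shlex.quote(a)
                        for a in self.build_command(input_file, **options))

cmd = VisIVOWrapper().build_command("snapshot.bin", points=10000, out="img.png")
```

A REST layer like the Flask one described in the abstract would then expose `build_command` style calls as web endpoints, keeping the HPC backend invisible to the client.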

Cite as

Umer Arshad, Eva Sciacca, Nicola Tuccari, Fabio Pitari, and Giuseppa Muscianisi. High Performance Visualization with VisIVO Across Cloud and HPC. In 17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026). Open Access Series in Informatics (OASIcs), Volume 141, pp. 3:1-3:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)



@InProceedings{arshad_et_al:OASIcs.PARMA-DITAM.2026.3,
  author =	{Arshad, Umer and Sciacca, Eva and Tuccari, Nicola and Pitari, Fabio and Muscianisi, Giuseppa},
  title =	{{High Performance Visualization with VisIVO Across Cloud and HPC}},
  booktitle =	{17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026)},
  pages =	{3:1--3:11},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-416-1},
  ISSN =	{2190-6807},
  year =	{2026},
  volume =	{141},
  editor =	{Baroffio, Davide and Busia, Paola and Denisov, Lev and Shukla, Nitin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.PARMA-DITAM.2026.3},
  URN =		{urn:nbn:de:0030-drops-256700},
  doi =		{10.4230/OASIcs.PARMA-DITAM.2026.3},
  annote =	{Keywords: High-performance computing, HPC, VisIVO, Scientific visualization, Interactive visualization, Cloud computing, Jupyter, Flask, REST API}
}
Inter-Procedural Strength Reduction for Embedded Systems

Authors: Giovanni Agosta


Abstract
Embedded systems often rely on code generated from model-based design tools, which can result in inefficient implementations due to the loss of high-level semantic information during code generation. This paper explores an inter-procedural extension of the strength reduction transformation, traditionally applied within loops, to optimize repeated computations across function calls. The proposed technique identifies parameters acting as counters and replaces costly operations - such as exponentiation or multiplication - with incremental updates based on recurrence relations, using static variables to preserve state between calls. We formalize the transformation, discuss its applicability conditions, and analyse tradeoffs between computation and memory access costs. Experimental evaluation on ARMv8, AVR microcontrollers, and x86_64 platforms demonstrates significant speedups for power operations (up to 9×), while highlighting limitations for simpler operations due to memory overhead.
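The recurrence idea can be mimicked outside compiler IR with a minimal Python sketch, where an object attribute plays the role of the paper's static variable preserving state between calls. This is an illustration of the transformation's effect, not the paper's implementation, which rewrites C code:

```python
class StrengthReducedPow:
    """When successive calls pass an exponent that increments by one
    (a 'counter' parameter), replace the costly power computation with
    one incremental multiply, using cached state between calls; fall
    back to a full power otherwise."""

    def __init__(self, base):
        self.base = base
        self.last_exp = None   # analogue of a C static variable
        self.last_val = None

    def __call__(self, exp):
        if self.last_exp is not None and exp == self.last_exp + 1:
            self.last_val *= self.base        # recurrence: b^(k+1) = b^k * b
        else:
            self.last_val = self.base ** exp  # cold start / non-counter call
        self.last_exp = exp
        return self.last_val

p = StrengthReducedPow(3)
values = [p(k) for k in range(5)]  # counter-like call sequence: 3^0 .. 3^4
```

The trade-off the paper analyses shows up even here: the cached state costs memory and a load/store per call, which is only worthwhile when the replaced operation (e.g. exponentiation) is expensive enough.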

Cite as

Giovanni Agosta. Inter-Procedural Strength Reduction for Embedded Systems. In 17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026). Open Access Series in Informatics (OASIcs), Volume 141, pp. 4:1-4:10, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)



@InProceedings{agosta:OASIcs.PARMA-DITAM.2026.4,
  author =	{Agosta, Giovanni},
  title =	{{Inter-Procedural Strength Reduction for Embedded Systems}},
  booktitle =	{17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026)},
  pages =	{4:1--4:10},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-416-1},
  ISSN =	{2190-6807},
  year =	{2026},
  volume =	{141},
  editor =	{Baroffio, Davide and Busia, Paola and Denisov, Lev and Shukla, Nitin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.PARMA-DITAM.2026.4},
  URN =		{urn:nbn:de:0030-drops-256717},
  doi =		{10.4230/OASIcs.PARMA-DITAM.2026.4},
  annote =	{Keywords: Compiler Optimization, Strength Reduction}
}
LAMINA: An MLIR-Based Translation Library for Heterogeneous Quantum-Classical Compilation

Authors: Marco De Pascale, Mario Hernández Vera, Jorge Echavarria, Muhammad Nufail Farooqi, Martin Schulz, and Laura Schulz


Abstract
Quantum computing is increasingly integrated into High-Performance Computing (HPC) environments, where quantum processors act as specialized accelerators within hybrid workflows. The Munich Quantum Software Stack (MQSS) - a unified compilation and runtime framework for hybrid quantum–classical computing - provides the foundation for this integration. However, the growing heterogeneity of applications demands more flexible compilation tools. This work introduces a Multi-Level Intermediate Representation (MLIR)-based translation library that extends MQSS by enabling the conversion of CUDA-Quantum (CUDA-Q) (quake) dialects into machine learning–oriented MLIR representations compatible with modern compiler ecosystems. Leveraging MLIR’s dialect-driven design, the library enables hardware-agnostic transformations, device-specific optimizations, and seamless integration with MQSS components. The proposed approach bridges quantum compilation and contemporary machine learning frameworks, facilitating GPU-accelerated circuit simulation, hybrid quantum–classical workflows, and heterogeneous execution, thereby advancing a unified compiler infrastructure for quantum computing.
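The core mechanism of such a dialect translation, rewriting ops from a source dialect into a target dialect while passing already-legal ops through, can be sketched with a table-driven rewriter in Python. The op names below are invented placeholders; real quake/MLIR op names and MLIR's pattern-rewriter API differ substantially:

```python
# Hypothetical op-name mapping; real quake/MLIR dialects differ.
QUAKE_TO_TARGET = {
    "quake.h":  "target.hadamard",
    "quake.x":  "target.pauli_x",
    "quake.mz": "target.measure",
}

def translate(ops, table=QUAKE_TO_TARGET):
    """Table-driven rewrite of a flat op list from one dialect to
    another - the essence of a dialect-conversion pass. Ops absent
    from the table are treated as already legal and passed through."""
    return [(table.get(name, name), operands) for name, operands in ops]

lowered = translate([
    ("quake.h", ["q0"]),
    ("quake.mz", ["q0"]),
    ("func.return", []),   # legal op, passed through unchanged
])
```

In real MLIR, each table entry would be a rewrite pattern and legality would be declared on a `ConversionTarget`, but the translate-or-pass-through structure is the same.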

Cite as

Marco De Pascale, Mario Hernández Vera, Jorge Echavarria, Muhammad Nufail Farooqi, Martin Schulz, and Laura Schulz. LAMINA: An MLIR-Based Translation Library for Heterogeneous Quantum-Classical Compilation. In 17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026). Open Access Series in Informatics (OASIcs), Volume 141, pp. 5:1-5:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)



@InProceedings{depascale_et_al:OASIcs.PARMA-DITAM.2026.5,
  author =	{De Pascale, Marco and Hern\'{a}ndez Vera, Mario and Echavarria, Jorge and Nufail Farooqi, Muhammad and Schulz, Martin and Schulz, Laura},
  title =	{{LAMINA: An MLIR-Based Translation Library for Heterogeneous Quantum-Classical Compilation}},
  booktitle =	{17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026)},
  pages =	{5:1--5:13},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-416-1},
  ISSN =	{2190-6807},
  year =	{2026},
  volume =	{141},
  editor =	{Baroffio, Davide and Busia, Paola and Denisov, Lev and Shukla, Nitin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.PARMA-DITAM.2026.5},
  URN =		{urn:nbn:de:0030-drops-256720},
  doi =		{10.4230/OASIcs.PARMA-DITAM.2026.5},
  annote =	{Keywords: HPCQC, MLIR, Quantum Computing, Heterogeneous Computing}
}
Accelerating GPGPU Simulation by Strategically Parallelizing the Compute Bottleneck

Authors: Jakob Sachs, Tim Lühnen, and Sohan Lal


Abstract
Cycle-accurate GPGPU simulators like GPGPU-Sim provide invaluable insights for hardware architecture research but suffer from extremely long runtimes, hindering research productivity. This paper addresses this critical bottleneck by proposing a strategy to accelerate GPGPU-Sim. We first perform a holistic profiling analysis across diverse GPGPU benchmarks to identify the primary performance bottleneck, pinpointing the SIMT-Core cluster execution within the CORE-clock cycle. Based on this, we implement a parallelization scheme that strategically targets this hotspot, utilizing a thread pool to manage concurrent execution of SIMT-Core clusters. Our approach prioritizes minimal modifications to the existing GPGPU-Sim codebase to ensure long-term maintainability. Evaluation of a simulated NVIDIA H100 model demonstrates an average simulation wall-time speedup of 3.58× with 8 worker threads, and a maximum of 4.38×, while incurring a maximum cycle count error of 3.22%, with some benchmarks exhibiting no error at all.
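The parallelization scheme described above can be sketched as follows: within each simulated cycle, all core clusters are stepped concurrently by a thread pool, with an implicit barrier before the global clock advances. This is a simplified toy model to show the structure, not GPGPU-Sim code:

```python
from concurrent.futures import ThreadPoolExecutor

class ToyCluster:
    """Stand-in for a SIMT-Core cluster; real clusters carry pipeline,
    cache, and interconnect state."""
    def __init__(self):
        self.cycles_done = 0
    def step(self):
        self.cycles_done += 1   # advance this cluster by one cycle

def simulate_cycle(clusters, pool):
    """Step every cluster concurrently, then join before the clock
    advances. Consuming the map() iterator acts as the barrier; this
    is safe because clusters do not share mutable state within a cycle."""
    list(pool.map(lambda c: c.step(), clusters))

clusters = [ToyCluster() for _ in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    for _ in range(100):        # 100 global CORE-clock cycles
        simulate_cycle(clusters, pool)
```

The barrier at each cycle boundary is what keeps the parallel schedule close to the sequential one; any residual cycle-count error in the real simulator comes from state the clusters do share across that boundary.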

Cite as

Jakob Sachs, Tim Lühnen, and Sohan Lal. Accelerating GPGPU Simulation by Strategically Parallelizing the Compute Bottleneck. In 17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026). Open Access Series in Informatics (OASIcs), Volume 141, pp. 6:1-6:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)



@InProceedings{sachs_et_al:OASIcs.PARMA-DITAM.2026.6,
  author =	{Sachs, Jakob and L\"{u}hnen, Tim and Lal, Sohan},
  title =	{{Accelerating GPGPU Simulation by Strategically Parallelizing the Compute Bottleneck}},
  booktitle =	{17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026)},
  pages =	{6:1--6:13},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-416-1},
  ISSN =	{2190-6807},
  year =	{2026},
  volume =	{141},
  editor =	{Baroffio, Davide and Busia, Paola and Denisov, Lev and Shukla, Nitin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.PARMA-DITAM.2026.6},
  URN =		{urn:nbn:de:0030-drops-256736},
  doi =		{10.4230/OASIcs.PARMA-DITAM.2026.6},
  annote =	{Keywords: GPGPU, CUDA, Simulation, Computer Architecture, GPGPU-Sim, Parallel Simulation, Cycle-Accurate Simulation, Thread Pool}
}
Linking High-Level Synthesis with FPGA Runtime Orchestration

Authors: Despoina Tomkou, Aggelos Ferikoglou, Dimosthenis Masouros, Sotirios Xydis, and Dimitrios Soudris


Abstract
FPGAs are increasingly being adopted across the edge-to-cloud continuum due to their ability to provide both high performance and energy efficiency. However, the complexity of programming FPGAs often leads to deployed designs that underutilize available resources. FPGA multi-tenancy has been proposed to enhance resource utilization, yet monolithic designs and dynamic workload demands continue to challenge efficient FPGA usage and compliance with Quality of Service (QoS) requirements. To address these issues, we propose a novel framework for the optimal orchestration of FPGAs across the edge-to-cloud continuum while meeting user demands. The framework generates approximations of Pareto-optimal designs for each application, capturing trade-offs between performance and resource usage with minimal bitstream generation. This information allows the runtime orchestrator to select the most suitable design based on available partial reconfiguration (PR) regions and the QoS requirements of each user. Experimental results demonstrate that the proposed approach achieves an average reduction of QoS violations by a factor of 8.1× across diverse workloads and baseline configurations. Overall, the framework offers a practical and effective solution for realizing FPGA-as-a-Service across the edge-to-cloud continuum.
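The orchestrator's selection step can be illustrated with a small Python sketch: compute the Pareto front over latency/area trade-offs, then pick the smallest design variant that fits the free PR region and still meets the QoS latency bound. Field names and numbers are illustrative, not the paper's data model:

```python
def pareto_front(designs):
    """Keep designs not dominated on (latency, area): a design is
    dominated if another is no worse in both metrics and strictly
    better in at least one."""
    return [d for d in designs
            if not any(o["latency"] <= d["latency"] and o["area"] <= d["area"]
                       and (o["latency"] < d["latency"] or o["area"] < d["area"])
                       for o in designs)]

def pick_design(designs, max_area, max_latency):
    """Among Pareto-optimal variants, take the smallest one that fits
    the free PR region (max_area) and meets the QoS bound (max_latency);
    None means a QoS violation is unavoidable with current resources."""
    feasible = [d for d in pareto_front(designs)
                if d["area"] <= max_area and d["latency"] <= max_latency]
    return min(feasible, key=lambda d: d["area"]) if feasible else None

designs = [
    {"name": "small", "area": 10, "latency": 40},
    {"name": "big",   "area": 30, "latency": 15},
    {"name": "bad",   "area": 35, "latency": 50},  # dominated, never picked
]
choice = pick_design(designs, max_area=32, max_latency=45)
```

Generating only the Pareto-approximating variants is what keeps bitstream generation minimal: dominated designs like `bad` need never be synthesized at all.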

Cite as

Despoina Tomkou, Aggelos Ferikoglou, Dimosthenis Masouros, Sotirios Xydis, and Dimitrios Soudris. Linking High-Level Synthesis with FPGA Runtime Orchestration. In 17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026). Open Access Series in Informatics (OASIcs), Volume 141, pp. 7:1-7:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)



@InProceedings{tomkou_et_al:OASIcs.PARMA-DITAM.2026.7,
  author =	{Tomkou, Despoina and Ferikoglou, Aggelos and Masouros, Dimosthenis and Xydis, Sotirios and Soudris, Dimitrios},
  title =	{{Linking High-Level Synthesis with FPGA Runtime Orchestration}},
  booktitle =	{17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026)},
  pages =	{7:1--7:14},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-416-1},
  ISSN =	{2190-6807},
  year =	{2026},
  volume =	{141},
  editor =	{Baroffio, Davide and Busia, Paola and Denisov, Lev and Shukla, Nitin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.PARMA-DITAM.2026.7},
  URN =		{urn:nbn:de:0030-drops-256746},
  doi =		{10.4230/OASIcs.PARMA-DITAM.2026.7},
  annote =	{Keywords: FPGA, Orchestration, Partial Reconfiguration, FPGAaaS}
}
Performance Modeling & Mapping of LLM Inference on Heterogeneous Vectorized CGRAs

Authors: Dionysios Kefallinos, Georgios Alexandris, Alexis Maras, Panagiotis Chaidos, Manil Dev Gomony, Henk Corporaal, Dimitrios Soudris, and Sotirios Xydis


Abstract
Since the emergence of transformer-based models, the computational demands for Large Language Model (LLM) inference have been increasing exponentially, primarily due to their compounding parameter sizes, their structural complexity, and the use of non-linear functions. This tendency leads to the necessity of deploying them on low-power edge devices and DNN accelerators, to fuel next-generation agentic AI systems. Coarse-Grained Reconfigurable Architectures (CGRAs) have proven to be a compelling paradigm for edge acceleration, combining the programmability of general-purpose platforms with the high performance and energy efficiency associated with ASICs. In this work, we introduce an end-to-end performance modeling and mapping framework for LLM inference on heterogeneous CGRAs. Our methodology enables rapid exploration of the micro-architectural design space parameters, i.e., the number of processing elements, vector sizes, and memory configurations, by providing an accurate, explainable, and analytical CGRA performance modeling methodology, with an average cycle error of 0.9%. Architecturally, we build upon R-Blocks, a heterogeneous CGRA platform, and extend it to support floating-point arithmetic operations as well as a full-stack compilation and mapping flow for both full (FP32) and quantized (INT8) Llama2 models. The proposed methodology, evaluated on a 22nm technology node, achieves superior peak performance per Watt compared to related works such as REVAMP and CFEACT (1.8× and 2.8× respectively).
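In the spirit of the paper's analytical modeling, though far simpler than the actual methodology, a roofline-style cycle estimate over the design-space parameters mentioned above (number of PEs, vector width, memory configuration) might look like the following. All parameter names and numbers are illustrative assumptions:

```python
import math

def estimate_cycles(num_ops, num_pes, vector_width,
                    mem_words, mem_bw_words_per_cycle):
    """Toy analytical cycle model: compute cycles as ops spread over
    PEs * vector lanes, memory cycles from bandwidth, and take the
    max, assuming perfect compute/memory overlap (roofline bound)."""
    compute = math.ceil(num_ops / (num_pes * vector_width))
    memory = math.ceil(mem_words / mem_bw_words_per_cycle)
    return max(compute, memory)

# a GEMV-like LLM layer: 1M MACs on 64 PEs x 8 lanes,
# 0.5M operand words streamed at 256 words/cycle
cycles = estimate_cycles(1_000_000, 64, 8, 500_000, 256)
```

A model of this shape is cheap enough to sweep over thousands of (PE count, vector width, memory) points, which is what makes rapid design-space exploration feasible before committing to a configuration; the paper's model adds the detail needed to reach its reported 0.9% average cycle error.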

Cite as

Dionysios Kefallinos, Georgios Alexandris, Alexis Maras, Panagiotis Chaidos, Manil Dev Gomony, Henk Corporaal, Dimitrios Soudris, and Sotirios Xydis. Performance Modeling & Mapping of LLM Inference on Heterogeneous Vectorized CGRAs. In 17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026). Open Access Series in Informatics (OASIcs), Volume 141, pp. 8:1-8:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)



@InProceedings{kefallinos_et_al:OASIcs.PARMA-DITAM.2026.8,
  author =	{Kefallinos, Dionysios and Alexandris, Georgios and Maras, Alexis and Chaidos, Panagiotis and Gomony, Manil Dev and Corporaal, Henk and Soudris, Dimitrios and Xydis, Sotirios},
  title =	{{Performance Modeling \& Mapping of LLM Inference on Heterogeneous Vectorized CGRAs}},
  booktitle =	{17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026)},
  pages =	{8:1--8:14},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-416-1},
  ISSN =	{2190-6807},
  year =	{2026},
  volume =	{141},
  editor =	{Baroffio, Davide and Busia, Paola and Denisov, Lev and Shukla, Nitin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.PARMA-DITAM.2026.8},
  URN =		{urn:nbn:de:0030-drops-256752},
  doi =		{10.4230/OASIcs.PARMA-DITAM.2026.8},
  annote =	{Keywords: Edge AI, LLM, CGRA, Heterogeneous Architectures, Performance Modeling, Hardware Acceleration, Low Power Computing}
}
