Fifth Workshop on Next Generation Real-Time Embedded Systems

NG-RES 2024, January 17–19, 2024, Munich, Germany

Edited by
Patrick Meumeu Yomsi
Stefan Wildermann
OASIcs – OpenAccess Series in Informatics

OASIcs is a series of high-quality conference proceedings across all fields in informatics. OASIcs volumes are published according to the principle of Open Access, i.e., they are available online and free of charge.

Editorial Board

- Daniel Cremers (TU München, Germany)
- Barbara Hammer (Universität Bielefeld, Germany)
- Marc Langheinrich (Università della Svizzera Italiana – Lugano, Switzerland)
- Dorothea Wagner (Editor-in-Chief, Karlsruhe Institut für Technologie, Germany)

ISSN 1868-8969

https://www.dagstuhl.de/oasics
Contents

Preface
Patrick Meumeu Yomsi and Stefan Wildermann ......................... 0:vii

List of Authors
.................................................................................... 0:ix

Invited Papers

HMB: Scheduling PREM-Like Real-Time Tasks at High Memory Bandwidth
Mohammadhassan Gholami Derouei, Paolo Valente, Marco Solieri, and
Andrea Marongiu ................................................................. 1:1–1:18

A Multi-Modal Distributed Real-Time IoT System for Urban Traffic Control
Zeba Khanam, Vejey Pradeep Suresh Achari, Issam Boukhennoufa, Anish Jindal,
and Amit Kumar Singh ......................................................... 2:1–2:10

DynaVLC – Towards Dynamic GTS Allocation in VLC Networks
Harrison Kurunathan, Miguel Gutiérrez Gaitán, Ramiro Sámano-Robles, and
Eduardo Tovar ................................................................. 3:1–3:11

Regular Paper

History-Based Run-Time Requirement Enforcement of Non-Functional Properties
on MPSoCs
Khalil Esper and Jürgen Teich ................................................... 4:1–4:11
On behalf of the Technical Program Committee, we are pleased to welcome you to the Proceedings of the 5th Edition of the Workshop on Next Generation Real-Time Embedded Systems (NG-RES 2024), which was held on January 19, 2024 in Munich, Germany. The NG-RES workshop series focuses on real-time embedded systems, with an emphasis on distributed and parallel aspects. NG-RES serves as a platform for collaboration between the networking and multicore real-time communities and promotes cross-fertilization and multidisciplinary approaches to embedded systems design.

Key topics of interest for NG-RES 2024 included, but are not limited to:

- Application of formal methods to distributed and/or parallel real-time systems
- Programming models, paradigms and frameworks for real-time computation on parallel and heterogeneous architectures
- Applications of approximate computing in real-time systems
- Compiler-assisted solutions for distributed and/or parallel real-time systems
- Middlewares for distributed and/or parallel real-time systems
- Networking protocols and services (e.g., clock synchronization) for distributed real-time embedded systems
- Scheduling and schedulability analysis for distributed and/or parallel real-time systems
- System-level software and technologies (e.g. RTOSs, hypervisors, separation kernels, virtualization) for parallel and heterogeneous architectures

We express our gratitude to everyone involved in the organization of the workshop. Our special thanks go to the General Chair Federico Terraneo, Submission and Web Chair Daniele Cattaneo and sincere appreciation to the members of the Program Committee. Their dedicated support was crucial in putting together the workshop program and we thank you all.

Finally, a big thank you goes to all the authors who contributed to NG-RES 2024 with their work. Their valuable contributions make this workshop possible. We hope you will enjoy the event!

Patrick Meumeu Yomsi and Stefan Wildermann
List of Authors

Vejey Pradeep Suresh Achari (2)
Keele University, Keele, UK

Issam Boukhennoufa (2)
University of Essex, Colchester, UK

Khalil Esper (4)
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany

Miguel Gutiérrez Gaitán (3)
Department of Electrical Engineering, Pontificia Universidad Católica de Chile, Santiago, Chile;
Faculty of Engineering, Universidad Andres Bello, Santiago, Chile;
CISTER/ISEP, Polytechnic Institute of Porto, Portugal

Mohammadhassan Gholami Derouei (1)
Università degli Studi di Modena e Reggio Emilia, Italy;
Minerva Systems srl, Modena, Italy

Anish Jindal (2)
Durham University, Durham, UK

Zeiba Khanam (2)
BT Security Research, Adastral Park, UK

Harrison Kurunathan (3)
CISTER/ISEP, Polytechnic Institute of Porto, Portugal

Andrea Marongiu (1)
Università degli Studi di Modena e Reggio Emilia, Italy

Amit Kumar Singh (2)
University of Essex, Colchester, UK

Marco Solieri (1)
Minerva Systems srl, Modena, Italy

Ramiro Sámano-Robles (3)
CISTER/ISEP, Polytechnic Institute of Porto, Portugal

Jürgen Teich (4)
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany

Eduardo Tovar (3)
CISTER/ISEP, Polytechnic Institute of Porto, Portugal

Paolo Valente (1)
Università degli Studi di Modena e Reggio Emilia, Italy
HMB: Scheduling PREM-Like Real-Time Tasks at High Memory Bandwidth

Mohammadhassan Gholami Derouei
Università degli Studi di Modena e Reggio Emilia, Italy
Minerva Systems srl, Modena, Italy

Paolo Valente
Università degli Studi di Modena e Reggio Emilia, Italy

Marco Solieri
Minerva Systems srl, Modena, Italy

Andrea Marongiu
Università degli Studi di Modena e Reggio Emilia, Italy

Abstract

Current homogeneous and heterogeneous computing systems reach high performance through parallelization. Yet, parallel execution of tasks entails non-trivial latency-vs-throughput issues when it comes to concurrent accesses to shared memory. In this respect, effective bandwidth regulation solutions do exist, and provide a basic mechanism to control the latency of memory accesses. Such solutions, though, are often cumbersome to deploy and to configure to guarantee both bounded latency and high utilization of the memory bandwidth. The problem is that memory latency varies non-linearly with the number and type of concurrent accesses, and the latter may in turn vary with time, often unpredictably. For this reason, previous attempts at memory regulation in scheduling solutions resulted either in poor real-time execution guarantees, or in severe underutilization of the memory bandwidth. In this paper, we outline High Memory Bandwidth (HMB), a scheduling solution that guarantees bounded response times to real-time task sets through memory regulation, while also reaching a high utilization memory bandwidth. Since the complete solution is complex, just like the problem it addresses, this preliminary work defines in full detail only the core mechanism. This mechanism builds on the notion of memory access slowdown experienced by any processor performing back-to-back memory operations; this slowdown is due to the interference generated by other processors also accessing the memory at the same time. The core mechanism assumes that each processor can tolerate a certain amount of slowdown before the timing behavior of the task(s) it is running is compromised. Each processor has a priority assigned: the higher the priority, the more stringent the timing requirements. The slowdown can be controlled by regulating with precision the maximum amount of system bandwidth each processor is allowed to use, based on its priority. The proposed mechanism finds the maximum bandwidth each processor can use such that the highest number of processors simultaneously accessing memory is found (thus avoiding memory bandwidth underutilization) while guaranteeing that the slowdown of each processor is kept within the tolerated limits.

2012 ACM Subject Classification Computer systems organization → Real-time operating systems

Keywords and phrases Heterogenous systems, Parallel execution, Shared memory, Bandwidth regulation, Memory access, Real-time execution, Memory bandwidth utilization, High Memory Bandwidth (HMB), Memory access slowdown, Memory interference, Memory-centric scheduling

Digital Object Identifier 10.4230/OASIcs.NG-RES.2024.1

Category Invited Paper

Acknowledgements Authors are grateful to Tomasz Kloda at TUM, and Benjamin Rouxel at UNIMORE, with whom they enjoyed inspiring discussions in the conception of this work.
1 Introduction

Memory contention on parallel systems

Modern systems feature multiple execution units, including CPU cores and accelerator cores, running in parallel. These units may inherently perform memory accesses simultaneously. To support parallelism in memory accesses, these systems are equipped with cache hierarchies. Caches help to make memory accesses as local as possible, thereby reducing the likelihood of accessing shared memories and interconnects. However, in many cases, conflicting accesses to shared resources remain unavoidable. For instance, these conflicting accesses become unavoidable when there is insufficient space in local caches to accommodate the cumulative memory footprint of the processors (e.g., CPU cores, execution units) utilizing those caches. Additionally, such conflicts arise when distant processors need to communicate through shared memory. Contention arises across all shared resources along the path from the processors to the actual memory banks, including the interconnect, shared caches, memory bus, and others. This contention, in turn, results in a more or less significant and unpredictable inflation of memory latency, impacting the duration of the memory access performed by the competing processors. [10]. This is an evident problem in real-time applications. Increased memory latencies can slow down the execution of tasks that involve memory accesses. This may make it impossible for the tasks to meet their deadlines, thereby rendering the task set unfeasible. Several solutions have been proposed in the scientific literature to eliminate or control this slowdown, which can be broadly grouped into two classes: (i) exclusive memory accesses; and (ii) limited memory accesses.

Exclusive memory access: low bandwidth and high predictability

The first type of solution is based on allowing one processor (or, very few processors) at a time to access shared memory [20, 21]. This approach either eliminates or reduces interference to such a low level that no significant slowdown occurs. However, the available bandwidth from shared memories is typically sized to meet the cumulative average bandwidth demand of the set of processors connected to that memory. Consequently, the total bandwidth offered by the memory is often higher or much higher than the bandwidth that a single processor may request. In the end, if only one or relatively few processors access memory at the same time, the memory bandwidth may be underutilized, potentially to a severe extent. For example, the ratio between the memory bandwidth available to a single CPU core and the one available to the whole system (or the CPU complex, respectively) ranges between: 5.0% (or 11%) on an A57 core in the NXP i.MX 8QM platform, 13% (or 24%) on a Carmel core in the Nvidia Xavier AGX, 15% (or 35%) on a A53 core in the AMD Xilinx Zynq UltraScale+ [4].

Limited memory access: either high bandwidth, or high predictability

The other type of solutions follows a somewhat opposite approach, as it allows multiple processors to access memory in parallel [11, 18, 22]. In this case, slowdowns are controlled by imposing a limit on the maximum bandwidth at which each processor can access memory. This approach is effective as long as the sum of the per-processor bandwidths remains low compared to the total memory bandwidth. In this particular case, memory contention is negligible, and per-processor bandwidths add up linearly – the execution latencies of parallel tasks are nearly identical to when executed in isolation. Consequently, the only factor producing the slowdown is essentially the bandwidth limit itself. Therefore, in this case, the slowdown experienced by each processor can be conveniently controlled.
Conversely, operating at high utilization implies making memory work at or close to bandwidth saturation. However, in this regime, memory behavior becomes non-linear. The slowdown for a specific processor changes if other processor begin accessing memory and varies based on the types of memory accesses and bandwidth limits of the same processor and the competing ones. Several factors contribute to this behavior, such as varying conflicts on different memory banks and interconnect components, or contention affecting the cache hierarchy at the interconnect level. [4] Ultimately, slowdowns become complicated to predict and impossible to control through any static assignment of bandwidth limits. For these reasons, memory limitation mechanisms in real-time scenarios are typically employed conservatively, aiming to maintain memory bandwidth well below saturation. However, in such a configuration, memory bandwidth is once again underutilized, similar to the previous case.

Closing the gap with dynamic parallel access

How can we reconcile high predictability in system behavior with efficient memory bandwidth exploitation? In other words, how can we guarantee bounded slowdowns while fully utilizing memory bandwidth? In this paper, we propose a general solution to achieve such a goal. It builds on two main ingredients.

**PREM task model.** We adopt a PREM-like real-time task model, featuring a task set with per-task deadlines and a processor set where tasks are to be executed, and memory limits can be enforced. PREM-like tasks are identified by memory phases during which they execute contiguous memory accesses. In our model, we assume that memory phases can fall into two types: data prefetch, where only reads are executed, or data writeback, where only writes are performed. This is similar to other models such as the Logical Execution Time (LET) model of task execution, which distinguishes logical timing requirements from the actual physical platform execution. In the LET model, a task is sequential code with its own memory space and lacks internal synchronization points [8].

**Dynamic memory policy.** We assume to be given an execution policy that provides task allocation and scheduling — at any time, it assigns tasks to processors of interest. We do not introduce any further hypothesis on the execution policy, except for the fact it has to be memory-agnostic, i.e. allocation and scheduling choices must not depend on any memory concept like limits, contention or service times. We introduce a second policy, called the memory policy, which is responsible for the dynamic adjustment of the bandwidth limits of all active processors. This policy is a function of how many processors are accessing memory at the same time, and of which memory accesses they are performing between reads and writes. Limits are adjusted in such a way that the slowdowns experienced by each task are low enough to let the task still meet its deadlines, under the scheduling policy at hand.

The assumption regarding PREM-like tasks represents an important simplifying hypothesis as it enlarges the granularity of memory regulation decisions to a tractable size. Indeed, it costs time to detect a change in the memory access pattern of processors, compute new limits, and initiate the enforcement of these new limits. The highest time constant in play is the communication delay between the processors. For instance, in the case of a single CPU, which has lower delays compared to a heterogeneous system, we can estimate the delay to be in the order of 10 microseconds, assuming the communication is triggered by an inter-processor interrupt [15]. For this reason, our solution is feasible only for PREM-like tasks whose memory phases are longer than this base delay. For non-PREM tasks, it would
be challenging or even impossible to determine the type of each generic memory access right before it happens and to dynamically adjust bandwidth limits for each instantaneous change in the type of memory accesses. Nevertheless, other dynamic approaches could be considered, such as measuring per task/processor bandwidths online and adjusting limits on the fly. However, such extensions are beyond the scope of this initial proposal.

Given the PREM task set and a system where delays make our solution feasible, the core challenge lies in computing the bandwidth limits. All potential combinations of parallel access patterns must be considered, and their number grows exponentially with the number of processors and available limit values. However, only a limited set of configurations that drive memory bandwidth close to saturation are of interest. The number of these configurations is significantly lower than the total number of possible combinations. Therefore, a critical feasibility aspect is defining an algorithm that discards all useless configurations, eliminating the need to store them all and potentially avoiding their evaluation altogether.

We remark that, by construction, we structured the problem to maintain orthogonality between the execution and memory policies. This provides complete freedom in defining the execution policy. Specifically, one can establish either a global or a partitioned task scheduling policy with regulated parallel accesses to memory. A partitioned scheme is likely the preferred option for an initial solution, as it is typically characterized by simpler analysis and implementation.

2 System Model

This work focuses on a multi-processor system featuring a shared last-level cache and shared memory. The proposed idea in this work does not hinge on any specific task model or scheduling policy. Nevertheless, for illustrative purposes, the following system model is employed to present the idea. Each processor is assigned a partitioned subset of tasks. Tasks within the system follow a sequence: they prefetch data in the Read-memory phase, perform computations in a single computation phase, and write back the results in the Write-memory phase. On each processor, tasks are scheduled based on a dynamic-priority, non-preemptive policy.

2.1 Processors

The main memory is shared among \( m \) identical processors, each assigned a distinct static priority. These priorities govern the allocated bandwidth for parallel access to the main memory. A detailed explanation of how these priorities can be utilized to govern the parallel memory accesses will be provided in Section 4. Processors are indexed following their priorities, with \( P_1 \) possessing the highest priority and \( P_m \) the lowest.

2.2 Tasks

This work considers a partitioned system where each task is statically assigned to a single processor. The assignment of the \( i \)-th task to the \( k \)-th processor is denoted by the notation \( \tau_i \in P_k \). Each task, denoted as \( \tau_i \), exhibits a dynamic behavior by releasing an infinite sequence of jobs sporadically. Each individual job within this sequence is subject to a specified minimum inter-arrival time denoted as \( T_i \). Each job of \( \tau_i \) must be executed and completed within a fixed time limit from its release, specified by \( D_i \), the relative deadline. We employ a PREM-like task model to simplify governing the memory requests. This assumption is significant because it increases the granularity of memory policing decisions to
a manageable level. Detecting changes in memory access patterns, calculating new limits, and enforcing these limits all take time. Due to this constraint, our solution is practical only for tasks resembling PREM, where the duration of memory phases exceeds the base delay. For non-PREM tasks, accurately predicting the type of each generic memory access just before it occurs and dynamically adjusting bandwidth limits for every instantaneous change in memory access type would be challenging, if not impossible. In this PREM-like model, each task consists of three phases: a read-memory phase, a computation phase, and a write-memory phase. During the read-memory phase, the task prefetches data from the main memory. In the computation phase, the task performs computations exclusively on the prefetched data without making any requests to access the main memory. The result is then written back to the main memory during the write-memory phase. In both memory phases, the tasks execute only memory accesses.

Tasks are scheduled on each processor according to a dynamic-priority non-preemptive scheduling policy and are indexed by priority, with \( \tau_1 \) having the highest priority and \( \tau_n \) the lowest, where \( n \) is the number of tasks allocated to the processor under study.

2.3 Memory

Memory functions as a globally shared resource, accessible to all processors with identical memory access latencies. In this model, the processors are symmetric, and they have the potential to saturate the bandwidth, meaning their cumulative demand may exceed the total memory bandwidth. Consequently, due to memory interferences, the duration extension of memory phases for each processor becomes unpredictable without regulation. Additionally, we assume that the order in which the memory controller serves memory requests simultaneously issued by different processors is unknown.

3 The Scheduling Policy

In this section, we present a partitioned memory-centric scheduling policy and illustrate it with an example. We will delve into the basic idea and the principal rules of the proposed policy. Our applied scheduling policy consists of two main components: the execution policy to distribute the task set among the processors and to schedule the execution order of each subset of the task set on each processor, and the memory policy for regulating the bandwidth to control access to the main memory. These two parts are explained as follows.

Execution Policy. Each task in the task set is statically assigned to a single processor, and no migration is allowed. The task set is first sorted based on their relative deadlines. Subsequently, tasks are assigned one by one to the processors according to their priority. Assuming there are \( n_T \) tasks in the task set and \( m \) processors in the system, the task with the closest relative deadline is assigned to the highest priority processor, denoted as \( P_1 \). The task with the second closest relative deadline is assigned to \( P_2 \), and this assignment continues until each processor is assigned a task. Once every processor has a task, the \( m + 1 \)th closest relative deadline task is assigned to \( P_1 \) again. This procedure repeats until all tasks in the task set are assigned to processors. On each processor, tasks run based on a non-preemptive Earliest Deadline First (EDF) algorithm. Considering that all the required data for the execution of each task is prefetched during the read-memory phase, using a preemptive scheduling policy may increase the response time of each task by prefetching the same data several times. Therefore, choosing a non-preemptive policy can be more efficient. Moreover, by applying the EDF, the system can enjoy the benefits of dynamic task priorities.
**Memory Policy.** All the memory requests are governed globally through bandwidth regulation. When a new memory request arrives, or an existing memory access finishes, depending on the number of parallel memory accesses, and the workload on the targeted processor and the interfering ones, the supervisor should check the corresponding cell in the table of regulation factors ($RF$) to identify the amount of bandwidth allocated to each active processor.

## 4 Implementing The Policy

### 4.1 Memory bandwidth regulation

#### 4.1.1 Regulation mechanisms

Memory bandwidth limits are realized by *memory regulation mechanisms* that can be implemented in different ways – they can be either hardware assisted, or provided by software components, either external from, or internal to, the workload to be limited.

**HW regulation.** Hardware regulation is commercially available with Memory Bandwidth Allocation, part of the Resource Director Technology by Intel [9], that is mainly featured on higher end Xeon family processors. A similar technology is provided also by Arm with the Memory System Resource Partitioning and Monitoring (MPAM) set of IPs [2], but we are not aware of any commercially available chip including it. The common principle is that CPU cores can be assigned with a given quality of service for memory transactions, and that such limit is enforced by regulating the traffic originated from the last-level cache or the system-level cache.

**External SW regulation.** Software regulation can be conveniently offered by a component of the platform software. It can be the case of the operating system like in MemGuard [21], the hypervisor like in MinervaSys Jailhouse [6] or some system-resident firmware like in MemPol [22]. The underlying principle is the same – a processor is allotted a predefined maximum number of memory accesses (budget) to execute within a fixed period. Should the processor exhaust its budget before the period concludes, the regulation mechanism halts the processor until the period concludes, at which point the processor receives a new budget for the subsequent period.

**Internal SW regulation.** This method, also called Voluntary Throttling (VolT), entails augmenting the code of the memory phases of the PREM tasks so to introduce a number of NOPs (No Operation instructions) periodically during its memory phase(s) [4]. The number of NOPs is externally configurable and provides the mean to regulate the throttling length (or frequency) of throttling. The code augmentation can be inserted at compile time, which is especially convenient when code PREMization is already automated [7].

#### 4.1.2 Regulation factors

In order of us to abstract from the implementation details, we define the notion of regulation factor, that acts as a knob to adjust the bandwidth allocation for a process. A regulation factor of 0% indicates that no bandwidth is allocated for the service, while a factor of 100% implies that the entire available bandwidth is dedicated to the service.

**Definition 1** (*Regulation Factor* ($RF$)). For each processor configured with a memory limit, we define its regulation factor ($RF$) as the ratio between the limited memory bandwidth measured in isolation by performing back-to-back Reads (or Writes) operations, and the (unlimited) memory bandwidth measured in the same conditions while removing the memory limitation on the processor. For the sake of simplicity, we assume RFs to range among percentage integers between 0 (no bandwidth) to 100 (full bandwidth).
We remark that the target measurement is performed in isolation because the attainable bandwidth is, in general, influenced by the number of processors operating in parallel, and the workload on both the affected processor and any interfering processors. Consequently, let us also stress that the actual bandwidth experienced by a processor is influenced, not entirely determined, by its regulation factor.

Observe that the regulation primitives greatly differ among the various memory regulation mechanisms. A concrete manner is thus needed to compute the function that maps any configuration of the mechanisms-specific knobs to its corresponding regulation factors. In most cases, only an experimental method enables to obtain a precise definition, as the one provided in Subsubsection 4.1.3 for VOLT.

4.1.3 Example: VOLT regulation factors

In a VOLT system the only available knob is the number of NOPs injected in the code. To compute the RFs, we need to experimentally find, for each workload, for each kind of processor, and for each regulation factor, the number of NOPs that produce the limited bandwidth of our interest. For instance, when the regulation factor is 10%, it means that the number of NOPs is such that if the processor executes this specific workload (Read or Write) in isolation, it receives 10% of the unlimited bandwidth. This table can be constructed experimentally following the algorithm outlined in Algorithm 1.

For each memory access type, the algorithm initializes two counters: $N_{old}$ and $N_{new}$, representing the initial and current number of NOPs, respectively. Then, it sweeps through a range of RF values from 100 to 0 with a step of −10, representing increasing regulation levels. At each RF iteration, a Match flag is set to false, indicating that a precise match

\begin{algorithm}
\caption{Construct the translating table between regulation factors and limited bandwidth values.}
\begin{algorithmic}[1]
\State \textbf{Input} : Unlimited bandwidth values $UnB_R, UnB_W$ for Read and Write memory accesses
\State \textbf{Output} : Translating table $RF2BW$ between regulation factors and actual bandwidth values:
\For{$TMR \in \{R,W\}$}
\State $N_{old} = 1$
\State $N_{new} = 2$
\For{$RF = 100 \text{ to } 0 \text{ with step } -10$}
\State Match = False
\While{!Match}
\State $BW_{old} = \text{measure the bandwidth with } N_{old} \text{ NOPs}$
\State $BW_{new} = \text{measure the bandwidth with } N_{new} \text{ NOPs}$
\If{$|BW_{new} - \frac{RF}{100} \times UnB_{TMR}| < |BW_{old} - \frac{RF}{100} \times UnB_{TMR}|$}
\State $N_{old}++$
\State $N_{new}++$
\Else
\State $N(TMR, RF) = N_{old}$
\State Match = True
\EndIf
\EndWhile
\EndFor
\State \textbf{return} Result
\end{algorithmic}
\end{algorithm}
between \(RF\) and bandwidth has not yet been established. The algorithm then enters a loop that repeatedly measures bandwidth using two different NOP configurations. The first measurement \((BW_{\text{old}})\) reflects the current NOP count \((N_{\text{old}})\), while the second measurement \((BW_{\text{new}})\) employs \(N_{\text{old}}\) incremented by one \((N_{\text{new}} = N_{\text{old}} + 1)\).

The algorithm compares the absolute differences between the measured bandwidth values and the expected bandwidth value for each \(RF\), calculated by multiplying \(RF\) by the unlimited bandwidth for the corresponding memory access type \((UnB_{\text{TMR}})\). If the difference for the newer NOP setting \(|BW_{\text{new}} - RF \times UnB_{\text{TMR}}|\) is smaller than that for the older setting \(|BW_{\text{old}} - RF \times UnB_{\text{TMR}}|\), it suggests that the newer NOP configuration provides a more accurate bandwidth estimation for that particular \(RF\). In such a case, the algorithm increments both counters \((N_{\text{old}}++, N_{\text{new}}++)\), effectively refining the bandwidth estimation resolution. The algorithm continues iterating within this loop until the Match flag is set to true, indicating that the optimal match between \(RF\) and bandwidth has been identified. Upon reaching this point, the algorithm stores the corresponding NOP count \((N_{\text{old}})\) in the translating table \(RF\rightarrow BW\) for the specified memory access type \((TMR)\) and \(RF\) value. This process is repeated for both \(R\) and \(W\) memory access types to construct the complete translating table.

It’s crucial to recognize that since the number of NOPs is limited to integer values, the bandwidth values derived from regulation factors are not entirely precise – they represent the closest approximation of the true bandwidth achievable with a specific regulation level.

### 4.2 Construct the table of slowdown measurements

The table of regulation factors \((RF)\) comprises factors for regulating memory accesses corresponding to each workload pattern. To fill in the cells of this table, a series of slowdown measurements must be conducted for every conceivable set of regulation factors on active processors. This enables us to deduce the factors that more effectively align with our timing constraints.

As these slowdown measurements merely pertain to the specifications of the hardware in use, rather than the actual task set, we optimize system performance by employing a separate table, designated as the table of slowdown measurements \((SM)\), to store the recorded slowdown values for each scenario of memory accesses and a specific set of regulation factors on the active processors. The construction of the slowdown measurements table is a one-time necessity. Subsequently, based on the task set specifications, we can then select the set of suitable factors.

In the remainder of this section, we systematically introduce the algorithms for constructing the aforementioned tables, step by step. But first, we shall define some preliminary concepts.

#### 4.2.1 Formal definitions

The set of notations used is briefly explained in the Table 1.

To determine the appropriate set of throttling factors for effective bandwidth regulation, it is essential to undertake a series of slowdown measurements encompassing all possible combinations of throttling factors for active processors. In this section, we aim to outline a systematic approach for conducting these measurements while avoiding redundant cases. But before presenting the algorithm a few concepts should be clarified.
Table 1 The table of notations.

<table>
<thead>
<tr>
<th>Notation</th>
<th>Definition</th>
</tr>
</thead>
<tbody>
<tr>
<td>C(j, S_i, RF(j))</td>
<td>a generic configuration consists of a set of regulation factors, RF(j), its corresponding slowdown measurement SD(j, S_i, RF(j)), and the actual bandwidth of processors, or UB</td>
</tr>
<tr>
<td>j</td>
<td>number of processors accessing the memory in parallel</td>
</tr>
<tr>
<td>m</td>
<td>total number of processors in the system</td>
</tr>
<tr>
<td>N(TMHR, RF)</td>
<td>number of NOPs to achieve a specific regulation factor</td>
</tr>
<tr>
<td>RF_k</td>
<td>the regulation factor assigned to the k-th highest priority processor among those accessing the memory at the same time</td>
</tr>
<tr>
<td>RF</td>
<td>a generic regulation factor to regulate the bandwidth</td>
</tr>
<tr>
<td>RF(j)</td>
<td>a generic vector of regulation factors used to measure the slowdown values in case of j parallel memory accesses following S_i scenario corresponding to s measured slowdown value</td>
</tr>
<tr>
<td>RF2BW</td>
<td>the translating table between regulation factors and actual bandwidth values</td>
</tr>
<tr>
<td>S_i</td>
<td>each scenario of parallel memory requests</td>
</tr>
<tr>
<td>S(j)</td>
<td>set of all possible scenarios of parallel memory accesses for each value of j</td>
</tr>
<tr>
<td>SD(j, S_i, RF(j))</td>
<td>the vector of measured slowdown values corresponding to j parallel memory accesses following S_i scenario using RF(j)</td>
</tr>
<tr>
<td>SM</td>
<td>table of slowdown measurements</td>
</tr>
<tr>
<td>SM[S_i, j, C(j, S_i, RF(j))]</td>
<td>sub-table of slowdown measurements corresponding to j parallel memory accesses following S_i scenario</td>
</tr>
<tr>
<td>UB</td>
<td>the vector of actual bandwidth of the processors</td>
</tr>
<tr>
<td>ε</td>
<td>a very small positive value</td>
</tr>
</tbody>
</table>

**Definition 2 (Slowdown Measurement).** Processor slowdown measurement involves quantifying the reduction in the speed of a processor as it performs tasks. This measurement is typically expressed as a percentage decrease in processing speed compared to the processor’s original performance. In the context of this discussion, slowdown measurements refer to the decrease in speed of processors when accessing the memory as a consequence of throttling the bandwidth.

In practical scenarios, the memory phases of different processors may partially overlap. However, to consider the worst-case scenario, these measurements involve executing all conceivable combinations of memory phases in parallel continuously. In this scenario, the system encounters the most severe slowdown due to memory interference.

**Definition 3 (Set of workload combinations(S(j))).** For any given number j of parallel memory accesses, the set of all the scenarios of memory accesses, or in other words, the different patterns of Reads and Writes coming from different processors in parallel, is called the set of workload combinations and is denoted by S(j). Each scenario within this set is denoted by S_i.

For example, in the case of two parallel memory requests, there will be four distinct scenarios of memory accesses: R_h R_l, R_l W_l, W_h R_l, W_h W_l, where 'h' refers to the memory access of the higher-priority processor and 'l' to the lower-priority one. Therefore, S(2) will have, 2^2 elements. Following the same reasoning, in general S(j) includes 2^j elements.

**Definition 4 (Configuration of parallel memory accesses(C)).** For any given number j of parallel memory accesses, and any scenario of memory accesses S_i, a configuration of parallel memory accesses is a vector that concatenates a vector of regulation factors for the corresponding active processors RF, its corresponding measured slowdown SD(j, S_i, RF), and the vector of actual bandwidth of the active processors UB(j, S_i, RF). Or

C(j, S_i, RF(j)) = [RF(j); SD(j, S_i, RF(j)); UB(j, S_i, RF(j))]
Definition 5 (Slowdown measurements table \( SM(S_i, j, C(j, S_i), RF(j)) \)). Slowdown measurements table is a table of subtables, each denoted by \( SM(S_i, j, C(j, S_i), RF(j)) \), corresponding to each number of parallel memory accesses \( j \) and each scenario of memory access \( (S_i) \). Each subtable is represented as a two-dimensional array, where the rows represent different scenarios of memory accesses \( (S_i) \), and the columns represent different valid configurations of parallel memory accesses.

4.2.2 Slowdown interplay

When employing each set of regulation factors, the amount of slowdown experienced by each processor may vary depending on the specific combination of processors concurrently accessing memory, the overall number of active processors, and the type of memory access (read or write) on both the targeted processor and its competitors. Since this study operates under the assumption of a homogeneous system, the identity of the processors simultaneously requesting memory is irrelevant; the only relevant factors are their relative priorities, and their total number, represented by \( j \).

4.2.3 Algorithm core ideas

The algorithm to construct the slowdown measurement table receives the number of processors in the system as the input and outputs the table of slowdown measurements. For each number of parallel memory requests and each scenario, \( S_i \in S(j) \), the algorithm commences by measuring the slowdown values and the actual bandwidth of all the active processors for the full-throttling case. Subsequently, starting from the processor with the lowest priority and progressing to the higher-priority ones, the throttling factor of each processor is reduced by 10 percent, and the measurements are then repeated to track every possible configuration of parallel memory accesses.

The main objective is to maximize bandwidth utilization, therefore we need to maintain bandwidth as close to saturation as possible avoiding over-allocation.

Within the saturation zone, the relationship between utilized bandwidth and throttling factors is non-linear. Counter-intuitively, adjusting the throttling factor of one processor can lead to fluctuations in the actual bandwidth values of other processors. For instance, decreasing the throttling factor of one processor might increase the actual bandwidth of other processors. On the opposite end, in the underutilization zone, the bandwidth is out of saturation. Therefore, the behavior of the system becomes essentially linear. In this zone, decreasing the regulation factors of one processor will either decrease or maintain the bandwidth of that processor without affecting the bandwidth of other processors. Considering this behavior, we can develop a discarding technique to improve the efficiency of the algorithm for storing slowdown measurements. This technique involves retaining only configurations close to saturation.

4.2.4 Discarding configurations

There are two primary reasons for discarding a configuration: either when we have already identified a better configuration in terms of bandwidth utilization in the saturation zone, or when the current configuration results in under-utilization of the available bandwidth.

If reducing the regulation factor results in negligible changes to the measured slowdown, it implies that the previous configuration was saturating the bandwidth. Consequently, we can discard the previous configuration and retain the current one. Given the uniform decrease of
the regulation factors, this comparison can be conducted for every two consecutive measurements to eliminate configurations that result in over-allocation of bandwidth. Conversely, if a regulation requires less bandwidth than an already existing configuration for the same workload combination, it indicates that this configuration does not provide us with a better solution and there is no need to retain it. Considering the gradual reduction of regulation factors, this can be examined by comparing the actual bandwidth of all the processors in two consecutive measurements. If none of the processors gain a better bandwidth, it means this configuration under-utilizes the bandwidth. Therefore, we should discard it.

4.2.5 Termination

By the following termination rule, we can determine whether continuing the measurement will yield beneficial results or if we can terminate it. If modifying a single regulation factor solely affects the bandwidth of the corresponding processor, while the bandwidth of all other processors remains unchanged, it indicates that there is no interference between memory accesses, and the system is operating in the under-utilization zone. Therefore, we can terminate the current loop and proceed to the next outer loop. This nested table structure provides a concise and organized representation of the system’s slowdown values, enabling efficient retrieval and analysis of the impact of regulation factors.

4.3 Construct the table of regulation factors

Given that slowdown values are influenced by the selected regulation factors, we can control slowdown values by adjusting these factors. With a predefined maximum tolerable slowdown value, aligned with the timing constraints of the task set, we can choose regulation factors to facilitate parallel memory accesses. This concept can be implemented through a table, which provides the appropriate set of Regulation factors for each scenario of parallel memory accesses under predefined slowdown constraints.

Definition 6 (Table of Regulation factors ($RF$)). The table $RF$ is structured as a collection of sub-tables. For each number of parallel memory accesses, denoted as $j$, there exists a sub-table. Within each sub-table, for every possible scenario of memory accesses, it encapsulates the set of Regulation factors for active processors, tailored to meet the slowdown constraints imposed by the task set.

To present the algorithm to construct this table, we should clarify a few notations.

Definition 7 (Possible Configurations ($PC$)). The set of candidate configurations corresponding to $j$ parallel memory accesses with workload combination $S_i$ is shown by $PC(j, S_i)$.

Definition 8 (Maximum Bandwidth Utilization ($MBU$)). The set of regulation factors’ vectors with the same maximum bandwidth utilization by $MBU(j, S_i)$.

This algorithm receives as inputs: the table of slowdown measurements ($SM$), and the set of maximum tolerable values for slowdowns in line with the timing constraints of the task set, denoted by $MTS$. It provides as output the regulation factors table $RF$. Alongside this algorithm, for each number of parallel memory accesses, first, the set of all possible scenarios of parallel memory requests, or $S(j)$, is generated. Then for each scenario, all the configurations stored at sub-table $SM[S_i(j), C(j, S_i)]$ are compared to their corresponding maximum tolerable value in $MTS$. If the measured slowdown value is smaller or slightly larger than the maximum tolerable value, the algorithm appends this configuration to $PC(j, S_i)$.
Algorithm 2  Construct the slowdown measurements table.

Input: number $m$ of processors
Output: table $SM$ of slowdown measurements

for $j = 2 : m$ do
    Generate: $S(j)$
    foreach $S_i \in S(j)$ do
        $RF_{old} = [100, 100, \ldots , 100]$  
        Perform slowdown and used bandwidth measurements for unlimited bandwidth case $SD(j, S_i, RF_{old})$, $UB$
        $SD_{old} = SD(j, S_i, RF_{old})$
        $UB_{old} = UB$
        for $RF_1 = 100$ to 0 with step $-10$ do
            for $RF_2 = RF_1$ to 0 with step $-10$ do
                $\ldots$
                $RF_j = RF_{j-1}$
                while !UnderUtilization do
                    $RF_{new} = [RF_1, RF_2, \ldots, RF_j]$  
                    Perform measurements $SD(j, S_i, RF_{new})$, $UB$
                    $SD_{new} = SD(j, S_i, RF_{new})$
                    $UB_{new} = UB$
                    $C(j, S_i, RF_{old}) = C(j, S_i, RF_{new})$
                    $RF_{old} = RF_{new}$
                    $SD_{old} = SD_{new}$
                    $UB_{old} = UB_{new}$
                    // over-utilization check:
                    if $SD_{new} \approx SD_{old}$ then
                        $RF_{old} = RF_{new}$
                        $SD_{old} = SD_{new}$
                        $UB_{old} = UB_{new}$
                    // non-optimal solution check:
                    if $UB_{new} \leq UB_{old}$ then
                        Discard the new configuration
                    // termination condition:
                    if $(UB_{new}[1 : j - 1] \approx UB_{old}[1 : j - 1] \&\& UB_{new}[j] \leq UB_{old}[j]) \& RF_{j} == 0$ then
                        UnderUtilization = True
                    else
                        $RF_{j} = 10$
            return $Result$
Whenever $PC(j, S_i)$ is empty, the algorithm reports: “Not enough bandwidth”, which means we should suspend the memory access coming from the lowest priority processor to make sure this case will not happen. However, if $PC(j, S_i)$ is not empty, the algorithm looks for the configuration that yields the maximum used bandwidth (or $UB$). If this solution is not unique, among them, this algorithm picks the one with the maximum regulation factors.
Algorithm 3 Construct the table of regulation factors.

\begin{algorithm}
\textbf{Input} : Number $m$ of processors, set $MTS$ of maximum tolerable values for slowdowns in line with the timing constraints of the task set, table $SM$ of slowdown measurements

\textbf{Output} : Table $RF$ of regulation factors

\begin{algorithmic}
\For {$j = 2 : m$}
\State Generate: $S(j)$
\For {$\forall S_i \in S(j)$}
\State Move to sub-table $SM[S_i(j), C(j, S_i)]$ in $SM$
\If {$SD \leq (MTS(j, S_i) + \varepsilon)$}
\State Append $C$ to $PC(j, S_i)$
\EndIf
\EndFor
\If {$PC(j, S_i) \neq \emptyset$}
\State $MBU(j, S_i) = \max_{C \in PC(j, S_i)} UB$
\State $MBU(j, S_i) = \{C \in PC(j, S_i)|UB \approx MBU(j, S_i)\}$
\EndIf
\If {$MBU(j, S_i) \neq \emptyset$}
\State $RF(j, S_i) = \max_{C \in MBU(j, S_i)} RF$
\EndIf
\EndIf
\EndFor
\State Report: “Not enough bandwidth”
\EndFor
\State \textbf{return} Result
\end{algorithmic}
\end{algorithm}

\subsection{An illustrative example}

To illustrate the policy, the following example can provide clarity. Let’s consider a task set as described in the table below. All tasks in this set are activated synchronously. Our objective is to efficiently schedule this task set on a system with three processors. These tasks are prioritized based on their periods and scheduled accordingly. We also assume that no preemption is allowed. In table Table 2 the duration of the read memory phase, write memory phase, computation time, period(or minimum inter-arrival time), and the relative deadline of task $\tau_i$ are denoted by, $rm_i$, $wm_i$, $c_i$, $T_i$, and $D_i$, respectively. All the values are measured in microseconds:

\begin{table}[h]
\centering
\caption{Task Set Characteristics.}
\begin{tabular}{|c|c|c|c|c|}
\hline
$\tau_i$ & $rm_i (\times 10\mu s)$ & $c_i (\times 10\mu s)$ & $wm_i (\times 10\mu s)$ & $T_i (\times 10\mu s)$ & $D_i (\times 10\mu s)$ \\
\hline
$\tau_1$ & 3 & 4 & 2 & 20 & 10 \\
$\tau_2$ & 2 & 3 & 1 & 25 & 15 \\
$\tau_3$ & 2 & 2 & 1 & 30 & 20 \\
$\tau_4$ & 1 & 1 & 1 & 35 & 25 \\
$\tau_5$ & 1 & 4 & 1 & 40 & 30 \\
\hline
\end{tabular}
\end{table}

Following the partitioning policy, $\tau_1$, and $\tau_4$ are assigned to $P_1$, $\tau_2$, and $\tau_5$ to $P_2$, and $\tau_3$ is assigned to $P_3$. In Gantt charts, red, grey, and blue blocks represent the read-memory phase, the computation phase, and the write-memory phase, respectively. Initially, we assume no parallel memory access is allowed, limiting each processor to accessing memory one at a time according to their priorities. Tasks are partitioned using the same policy, and on each processor, tasks are executed using a non-preemptive Earliest Deadline First (EDF) scheduling algorithm.
Next, we aim to schedule the identical task set employing our policy. Broadly speaking, the asynchronous execution capability of the write memory phase renders it faster than the read memory phase. Therefore, we consider distinct values for unlimited bandwidth in read and write operations. As an illustrative numerical example, let’s assume the system has an unlimited bandwidth of 2.5 GB/s for read-memory phases and 8 GB/s for write-memory phases. The regulation factors for two and three parallel memory accesses are outlined in Table 3 and Table 4, respectively. In practice, these tables must be filled following the algorithms.

### Table 3 Table of regulation factors for two parallel memory accesses.

<table>
<thead>
<tr>
<th>Workload pattern</th>
<th>Regulation factors for $P_h$</th>
<th>Regulation factors for $P_l$</th>
</tr>
</thead>
<tbody>
<tr>
<td>RR</td>
<td>100</td>
<td>100</td>
</tr>
<tr>
<td>RW</td>
<td>90</td>
<td>70</td>
</tr>
<tr>
<td>WR</td>
<td>80</td>
<td>60</td>
</tr>
<tr>
<td>WW</td>
<td>60</td>
<td>40</td>
</tr>
</tbody>
</table>

### Table 4 Table of regulation factors for three parallel memory accesses.

<table>
<thead>
<tr>
<th>Workload pattern</th>
<th>Regulation factors for $P_1$</th>
<th>Regulation factors for $P_2$</th>
<th>Regulation factors for $P_3$</th>
</tr>
</thead>
<tbody>
<tr>
<td>RRR</td>
<td>100</td>
<td>100</td>
<td>100</td>
</tr>
<tr>
<td>RRW</td>
<td>90</td>
<td>70</td>
<td>50</td>
</tr>
<tr>
<td>RWR</td>
<td>80</td>
<td>60</td>
<td>40</td>
</tr>
<tr>
<td>WRR</td>
<td>70</td>
<td>50</td>
<td>40</td>
</tr>
<tr>
<td>RWW</td>
<td>80</td>
<td>40</td>
<td>30</td>
</tr>
<tr>
<td>WRW</td>
<td>60</td>
<td>30</td>
<td>30</td>
</tr>
<tr>
<td>WWR</td>
<td>60</td>
<td>30</td>
<td>30</td>
</tr>
<tr>
<td>WWW</td>
<td>60</td>
<td>30</td>
<td>10</td>
</tr>
</tbody>
</table>

According to the tables of regulation factors, the Gann chart will be:
As illustrated in the example, three read-memory phases can be executed in parallel with negligible extension of the memory phase duration. However, when two write-memory phases access the memory in parallel (as is the case with $w_4$ and $w_5$), according to the table, 60% of the unlimited bandwidth of the processor will be allocated to the higher priority processor, and 40% to the lower priority one. Consequently, the duration of these memory phases will be extended accordingly. Nevertheless, upon comparing the Gantt charts, there is a remarkable improvement in the overall time-span of the task set following our proposed policy.

5 Related Works

The impact of memory contention in contemporary systems has been extensively explored in prior scientific literature [10]. Previous studies have focused on investigating the decline in Worst-Case Execution Time (WCET) for applications contending for memory, particularly in multi-core embedded systems [13]. Proposals for memory-bandwidth partitioning schemes aimed at ensuring temporal isolation have been introduced [12]. In [21], the authors introduced a memory bandwidth reservation system named MemGuard. This system was proposed, designed, and implemented with the primary aim of providing bandwidth reservation to ensure temporal isolation, and maximizing the utilization of the reserved bandwidth. In [14], the memory utilization is periodically sampled, while using standard MemGuard’s interrupts – and associated overheads – to regulate cores and to trigger the sampling. While partitioning represents a straightforward and robust solution, it encounters challenges related to underutilizing the bandwidth. Moreover, it offers less refined control over task execution compared to the PREM approach. Similar challenges are observed in alternative hardware-level partitioning solutions and mechanisms for enforcing bandwidth allocation documented in existing literature [5, 18]. As an example in [22], known as MemPol, in introduced that operates a regulation mechanism from outside the cores, monitoring performance counters for the application core’s activity in main memory at a microsecond scale. In contrast, our work adopts an internal mechanism to regulate the bandwidth offering a more flexible scheme to maximize bandwidth utilization. A substantial body of literature addresses the application of PREM model to multi-core systems [1, 3, 17, 19]. However, a primary limitation of these studies is their restriction to permitting only one memory access at a time, leading to bandwidth underutilization. In [20], the authors extended PREM by accommodating more
than one task to access memory in parallel. Through experimentation, they demonstrated that the latency of main-memory accesses increases at a rate less than linear when multiple cores simultaneously access memory. Their model supports $k$ parallel memory accesses, where $k$ is a statically configurable number determined based on hardware specifications. The primary drawback of this model lies in its rigidity. As shown in [4], allowing $k$ cores to utilize bandwidth without constraint, whether needed or not, may lead to bandwidth overutilization while selecting $k - 1$ could result in bandwidth underutilization. Addressing this issue, the main advantage of our model lies in the dynamic allocation of bandwidth to processors based on workload patterns. This allows for the adaptive selection of the number of parallel memory accesses, optimizing resource utilization. Despite differences in scope, a comparable work to ours is [16], which introduces the Envelope-aWare Predictive model, abbreviated as E-WarP. It aims to provide both the technological foundations and theoretical bases for a workload-aware analysis of real-time systems.

6 Conclusion

6.1 Discussion

Contemporary homogeneous and heterogeneous computing systems attain enhanced performance levels through parallelization. However, the parallel execution of tasks introduces complex trade-offs between latency and throughput, particularly in the context of simultaneous accesses to shared memory. Numerous approaches aim to mitigate memory interference issues, with bandwidth regulation being popular. In the literature, various viable solutions for bandwidth provide basic mechanisms to address the latency of memory accesses. However, their primary drawbacks include the complexity of deployment and rigidity. It has been observed that existing solutions may lead to underutilization of the available bandwidth.

The challenge lies in the non-linear behavior of memory latency based on the number and type of concurrent accesses, which can fluctuate over time in an unpredictable manner. Past attempts to integrate memory regulation into scheduling solutions have, as a consequence, either failed to provide guarantees for real-time execution or led to significant underutilization of memory bandwidth.

In this paper, we introduce High Memory Bandwidth (HMB), a scheduling solution designed to guarantee bounded response times of real-time task sets through memory regulation while ensuring a high utilization of memory bandwidth. The intricate nature of both the problem and its potential solution necessitates a comprehensive approach, one that this preliminary work only begins to unfold. In this first step, we focus on the core mechanism of HMB, providing a detailed explanation of its inner workings and how it addresses the challenges posed by real-time task sets. Our goal is to lay a solid foundation for future research and development, paving the way toward a scheduling solution that seamlessly integrates the demands of real-time systems with the efficient utilization of memory resources.

The core concept of this work lies on the notion memory slowdown, a phenomenon that occurs when a processor’s memory access performance is hindered by the concurrent memory access patterns of other processors. This slowdown, particularly during bursts of back-to-back memory accesses, can significantly impact the execution time of real-time tasks, potentially violating their timing constraints. To address this challenge, HMB employs a novel mechanism that dynamically allocates memory bandwidth among processors based on their priority levels. By carefully balancing the needs of high-priority processors, which typically have stricter timing requirements, HMB ensures that their memory accesses are prioritized, minimizing slowdown and maintaining their responsiveness.
HMB’s efficiency stems from its ability to optimize memory bandwidth utilization while adhering to the priority-based allocation scheme. It continuously evaluates the system’s memory access patterns and dynamically adjusts bandwidth caps to accommodate the demands of high-priority processors without compromising overall efficiency. This intricate balance enables HMB to achieve both bounded response times for real-time tasks and high memory bandwidth utilization, a remarkable feat in the context of resource-constrained real-time systems.

6.2 Further works

In this short work-in-progress paper, our primary focus has been on expounding the core concept of HMB and its underlying mechanism for bandwidth regulation. The next crucial step involves conducting a comprehensive series of experiments to rigorously validate the feasibility and evaluate the efficiency of this proposed mechanism in real-world scenarios.

During the implementation phase, potential overheads can arise at multiple levels, demanding careful consideration and optimization. At the hardware level, we must carefully evaluate the size of tables required to effectively implement HMB and ensure that data transfer speeds are sufficient to support the proposed bandwidth allocation scheme. Additionally, we need to analyze the overhead introduced by the algorithm and the dispatcher at the execution level. To ensure feasibility, it is imperative that the overall overhead remains lower than the shortest memory phase.

To objectively assess the effectiveness of HMB, we can compare its performance to similar works in this domain, such as [20] and [16]. By comparing against these established solutions, we can gain valuable insights into the relative strengths and weaknesses of HMB, paving the way for further refinements and improvements.

References


A Multi-Modal Distributed Real-Time IoT System for Urban Traffic Control

Zeba Khanam  
BT Security Research, Adastral Park, UK

Vejey Pradeep Suresh Achari  
Keele University, Keele, UK

Issam Boukhennoufa  
University of Essex, Colchester, UK

Anish Jindal  
Durham University, Durham, UK

Amit Kumar Singh  
University of Essex, Colchester, UK

Abstract

Traffic congestion is one of the growing urban problem with associated problems like fuel wastage, loss of lives, and slow productivity. The existing traffic system uses programming logic control (PLC) with round-robin scheduling algorithm. Recent works have proposed IoT-based frameworks that use traffic density of each lane to control traffic movement, but they suffer from low accuracy due to lack of emergency vehicle image datasets for training deep neural networks. In this paper, we propose a novel distributed IoT framework that is based on two observations. The first observation is major structural changes to road are rare. This observation is exploited by proposing a novel two stage vehicle detector that is able to achieve 77% vehicle detection accuracy on UA-DETRAC dataset. The second observation is emergency vehicle have distinct siren sound that is detected using a novel acoustic detection algorithm on an edge device. The proposed system is able to detect emergency vehicles with an average accuracy of 99.4%.

2012 ACM Subject Classification  Computer systems organization → Real-time system architecture

Keywords and phrases  Vehicle Detection, Deep Neural Network, Traffic Control, Edge Computing, Emergency Vehicle Detection, Sliding Window

Digital Object Identifier  10.4230/OASIcs.NG-RES.2024.2

Category  Invited Paper

1 Introduction

One of the major problems plaguing urban cities is traffic congestion. This problem stems from lopsided growth in road infrastructure in comparison to traffic volume. The visible devastating effects on travelers and urban cities in general are increased global carbon footprint, low productivity, fuel wastage; resource depletion, loss of lives, and economic downfall. To address this pressing issue, governments have invested heavily in infrastructure upgradation with complex civil construction of bridges, roads, and new lanes addition. However, this solution has further complicated the issue. Big cities like London that have been implementing this solution are currently reeling under issues like urban sprawl, pollution, and stalling [8]. As the number of on-road vehicles is projected to increase manifold, this problem is going to intensify. With the global emphasis to cut the scope 1 emission due to road transportation for a sustainable future, an intelligent urban traffic system is the need of the hour.
Most of the cities deploy a traditional traffic control system that uses an array of programming logic controllers (PLC) and a round-robin scheduling algorithm to allow traffic for each lane to pass in a circular fashion [7]. However, few pilot studies have been conducted on the deployment of Internet of Things (IoT)-driven intelligent traffic control systems. Most of these studies have been part of the deployment of connected and automated vehicles and self-driving cars [16, 14]. Given the uncertainty around the future of self-driving cars and the urgency of decarbonization targets, there is a need for an IoT system for current on-road vehicles. Few recent IoT frameworks have used recent technologies like infrared cameras and GPS [3], RFID [6], Bluetooth and Zigbee [15]. However, these technologies require the deployment of an array of sensors in vehicles. These solutions require large investment and energy to revamp the existing vehicles and the accuracy of detection of traffic density estimation is not substantially high due to their susceptibility to noise from the environment [12].

To overcome the aforementioned problems, we proposed a novel framework, IoT based Intelligent Urban Traffic System (I²UTS) for traffic light control that leveraged the existing CCTV network for designing the IoT-based framework for traffic light control [1]. CCTV videos have proven to be efficient and economical in past. There are several works that have explored the potential of CCTV camera networks as input sensors for solving traffic problems like predicting accidents, monitoring spatio-temporal behavior of pedestrians, and the detection of firearms and knives [4]. I²UTS framework used CCTV video and state-of-the-art CNN to estimate the traffic density. Traffic density was a major input to control traffic lights. To address the privacy concern and computational resource requirements, Yolo v3 with a darknet backbone was deployed over edge device, Raspberry Pi, alongside our novel scheduling algorithm. Even though, I²UTS was able to achieve 68.10% vehicle detection accuracy, it inherited problems associated with end-to-end convolutional neural networks (CNNs) that rendered practical difficulties in real-time traffic network analysis.

The first major problem associated with I²UTS was the inability to find datasets with emergency vehicles. One of the novel contributions of our previous research was to account for the different traffic signal times incase emergency vehicles like police cars, firefighter trucks, and ambulances are part of the traffic scheduling algorithm. Since, the proposed CNNs Yolo v3-Efficient Net is dependent on labeled data, the availability of labeled dataset of the emergency vehicle was a challenge. Finding such a dataset is cumbersome given the rare occurrence of emergency vehicles in traffic conditions. The proposal of emergency vehicle datasets has been an area of research. Researchers have turned to various resources like google search alongside manual annotation of 1500 images, manual filtering of Kaggle dataset images, and youtube streams [13]. Given the amalgamation of various sources of image acquisition, these datasets suffer from large viewpoint variations, which renders them unsuitable for our IoT framework where the input sensor, CCTV camera, has a fixed viewpoint. Apart from viewpoint variation, another problem in the detection of emergency vehicles is weather variation. Inclement weather conditions present in the images often decrease vehicle detection accuracy significantly [5]. Since, the speed of non-emergency vehicles is comparatively slow, their occurrence on CCTV images for a fixed time frame is higher. This increases their detection accuracy alongside the availability of a large set of labeled images in varying weather as a training input for YOLO V3. These problems clearly illustrate the inhibition of using RGB cameras for emergency vehicles. This serves as a major motivation to re-investigate our previous work I²UTS for emergency vehicle detection. Emergency vehicles all over the world use loud-noise making noise, sirens, to alert the traffic of passage. In this work, we propose a multi-modal distributed IoT framework that detects the distinguished sound property of emergency vehicles alongside the image-based traffic density estimation.
The second major problem associated with \( I^2 UTS \) is the high variance in mean average precision (mAP) across different classes of vehicle detectors. Although, the major outlier was the class “van” where mAP was 37.49%. Further, False Discovery Rate (FDR) was 63.23%, which clearly indicates that the proposed YOLOV3-Efficient net has high false positive (FP) in comparison to true positive (TP) for “van”. YOLO is a preferred single-stage object detector in comparison to its counterparts two-stage detectors like RCNN due to its faster performance on edge devices. However, YOLO struggles to localize smaller objects/vehicles and identify the optimum number of clusters. This is the main reason that \( I^2 UTS \) has a high misclassified bus stops as vans due to similar features. Single-stage detectors classify and localize objects in a single shot using dense sampling whereas two-stage detector consists of an additional preliminary stage of region proposal. The region proposal stage indeed increases the performance of object detector but are computationally expensive. However, the CCTV camera on road is fixed and viewpoint variation in the images captured is hardly possible. Changes in road infrastructure are also minimal. Considering, this advantage, we propose a novel two-stage detector road-based YOLO (“R-YOLO”) that confines the search of vehicles to the road with an increased performance accuracy comparable to two-stage object detector (like RCNN) and low computational resource usage like single stage detector (like YOLO).

The remainder of the paper is organised as follows. Section II describes the proposed IoT framework. The experimental evaluation and results of the proposed distributed system is detailed in Section III. Finally, the conclusion is presented in Section IV.

## 2 Proposed Distributed IoT framework

The proposed distributed IoT based urban traffic management system contains two edge parts: 1) \( S_1 \): Vehicle Detector that uses Deep Neural Network, R-CNN. 2) \( S_2 \): Emergency Vehicle Detector that uses acoustic sliding window approach to detect siren’s of emergency vehicle.
2.1 Edge Devices and Sensors

The previous works deployed NVIDIA Jetson TX2 that is a computationally powerful edge device with the dual support of CPU and GPU [2]. However, the cost of Jetson Nano is 24 times more than average cost of other edge devices. Given the economic feasibility constraint of the system, we use Raspberry PI 4 as the edge device $S_1$ for vision algorithms with moderate memory of 4 GB and powerful Quad core cortex-A72 (Arm-8) 1.5GHz processor. The acoustics emergency vehicle detector uses Arduino Nano as Edge device $S_2$. Arduino Nano is an open source micro-controller based on ATmega328P architecture. With a flash memory of 32 KB, 16 Analog Pins and 22 I/O pins, SEN0232 noise meter is used. SEN0232 uses an instrument circuit and a low noise microphone, with a measuring decibel value ranges from 30dBA to 130dBA, accurately measuring noise level of the surrounding environment.

2.2 System Overview

A higher level overview of the distributed system is as follows.

1. **Cloud Layer: Road Detector**
   a. Train Faster-RCNN network on cloud to detect the vehicles.
   b. Detect the object road and Estimate the road mask.
   c. Train the YOLO v3 network with Efficient net as a backbone on cloud to detect the road.
   d. Transfer the trained weights to the Faster RCNN deployed on the edge device $S_1$ for testing.
   e. Transfer road mask to the edge device $S_2$ for road extraction.

2. **Edge $S_2$: Acoustic Emergency Vehicle Detector**
   a. Detect Siren Sound from continuous noise level monitoring.
   b. Send the emergency vehicle trigger to Edge Device $S_1$.

2. **Edge $S_1$: Vehicle Detector**
   a. Estimate the traffic density of each road lane by detecting vehicles using road mask and YOLOv3 vehicle detector.
   b. Use the parameters generated in previous steps as an input to the proposed traffic control algorithm.

2.2.1 Dataset

Many prior studies have utilized the KITTI and COCO datasets to train the YOLOv3 network. However, a more recent dataset, UA-DETRAC [18], developed by the University of Albany for object detection and tracking, is more profound as a benchmark dataset for real-world multi-object tracking. This dataset comprises traffic camera videos, recorded at 25 fps over a span of 10 hours, with a resolution of 960 x 540 pixels. The footage originates from 24 distinct locations in Beijing and Tianjin, China, offering a diverse and challenging environment. UA-DETRAC includes approximately 1.21 million labeled bounding boxes representing 8250 vehicles across four classes: car, van, bus, road and others. Figure 2 illustrates some CCTV images from the dataset. An additional strength of utilizing UA-DETRAC is its representation of multi-class weather conditions and variations in day and night illumination. This dataset is used for both the purposes-road and vehicle detection. For both the tasks, the dataset is divided into training, validation, and testing sets following a 70:15:15 ratio, respectively.
2.2.2 DNN for Road Detection

For road detection and classification, we use YOLO v3 CNN model [9] trained on the backbone of Efficient Net [17]. Unlike the traditional YOLO v3 model [9] which has used DarkNet-53 network as backbone, our network has a high object detection accuracy at low inference latency. Changes in road infrastructures are minimal, so we can assume CCTV images have advantage of viewpoint invariance. Since vehicles run on road, the bounding boxes of vehicles are embedded within the bounding box of road, pooling of region of interests, we select the road bounding box. Once the bounding box is selected, we calculate road mask. Road Mask is a binary image that is applied to CCTV images to mask out remaining image (converts the pixels to black) except road object. Road Mask image is helpful in limiting regions of interest. Fig 3b, 3d, and 3f show the CCTV image obtain after applying road mask.

2.2.3 DNN for Vehicle detection

For vehicle detection and classification into five annotation classes: bus, car, vans, road, and others, we use Faster R-CNN model [10]. Faster R-CNN in comparison to traditional regional based CNN [11] is manifold times faster due to region proposal network. However, its pooling characteristic is also extremely beneficial. Faster R-CNN has a better accuracy than single shot detectors like YOLO. However, the inference speed is extremely slow in comparison to YOLO, rendering it unsuitable for real-time object detection on edge devices. To overcome this, we limit the scope by inputing the image with road mask. Therefore, Faster R-CNN will now only detect object on road, significantly reducing the image area, number of objects, in turn reducing the processing time. Few instances of vehicle detection are illustrated in Figure 3.

2.2.4 Acoustic Emergency Vehicle Detection Framework

Most sirens are rated at around 124 dB when measured 10 feet in front of the sound source. As the distance from the siren doubles, the sound pressure of the siren will drop by approximately 6dB. This concept is known as the “inverse square law.” In our system, Data is collected with a frequency of 50Hz and a sliding window technique is employed to detect the emergency vehicle. A sliding window computes the area of the sound noise level over a certain period of time as shown in Figure 4. When the area exceeds a predefined value the detection algorithm dispatches the emergency light sequence. When it does not, normal sequence is carried out. The area value is computed using the formula:

\[
\text{Area} = \text{Frequency} \times \text{sound level}
\]
The system is continuously computing the area over the window length, noise values are summed together over a defined window. The highest sound level is set 100dB, this value has been chosen to take into account different distances of the ambulance from the sound sensor. It has been chosen as it is the average noise level when taking into account 80 feet (25m) as the noise level varies between 124dB and 76dB.

**Figure 3** Examples of vehicle detection by a), c), e) by I2UTS and in b), d), f) by our proposed system on same CCTV images from UA-DETRAC [18]. Proposed system performs masking while leaving road.

**Figure 4** Sample Sliding Window for Emergency vehicle’s siren.
Once the siren is detected, a trigger is generated in Edge device $S_2$. This is sent to edge device $S_1$ as shown in Figure 1. Edge device $S_1$ estimates the vehicle density of each lane using vision algorithm explained in Section 2.2.3. Traffic Control Algorithm proposed by us in our earlier work in $I^2UTS$ uses both these parameters to calculate traffic light sequence again on edge device $S_1$. The multi-modal nature of our framework that encapsulates both acoustic and vision sensor and processing to build a resilient system.

3 Experimentation and Results

In this section, we evaluate the performance of our proposed distributed IoT framework. The weights of the Faster RCNN and YOLO v3-EfficientNet are trained on a cloud server with Intel i7-9th generation as main processor alongside Nvidia 1660 Ti GPU on Linux 18.04 operating system. CUDA 10.1 with CuDNN 7 libraries were used for parallel computation on GPU. The edge device $S_1$, Raspberry Pi 4, has Quad core cortex-A72 (Arm-8) 1.5GHz processor, 4GB RAM and OpenGL ES 3.0 graphics with Raspbian Buster as the operating system.

The experimental setup for edge system for edge comprises of a SEN0232 sound meter that is connected through a SPI serial connection to an Arduino Nano. This edge device is further connected to Raspberry Pi either wirelessly or through USB. Raspberry Pi is edge device that is responsible for managing the traffic light sequence.

3.1 Emergency Vehicle Detection

Experiments were conducted to find the optimal window lengths. To do so multiple window sizes were chosen starting from 0.5s to 10s. After each trial the window length is incremented by 0.5s. For each window size, ten different sounds are played, of which three correspond to sirens sound of emergency vehicles (police, ambulance, fire-truck) while the rest are different urban noises. The detection time is measured, as well as detection accuracy. The overall operation is repeated 100 times. The detection time $D_t$ is the difference between the time at which the siren is first detected $T_d$ and the time at which the sound is played $T_p$.

$$D_t = T_d - T_p$$

The accuracy of detection is defined as the number of the truly predicted siren sounds $P_s$ and the truly predicted urban noises $P_u$ divided by the number all the tests $N$.

$$\text{Accuracy} = \frac{P_s + P_u}{N}$$

Table 1 Accuracy and detection time per window size.

<table>
<thead>
<tr>
<th>Window Length(s)</th>
<th>0.5</th>
<th>2</th>
<th>3.5</th>
<th>5</th>
<th>6.5</th>
<th>8</th>
<th>9.5</th>
</tr>
</thead>
<tbody>
<tr>
<td>Detection Accuracy (%)</td>
<td>63.6</td>
<td>88.4</td>
<td>94.5</td>
<td>99.5</td>
<td>97.4</td>
<td>99.2</td>
<td>99.4</td>
</tr>
<tr>
<td>Detection Time (s)</td>
<td>0.54</td>
<td>2.032</td>
<td>3.524</td>
<td>5.031</td>
<td>6.521</td>
<td>8.041</td>
<td>9.523</td>
</tr>
</tbody>
</table>

Results are shown in Table 1, accuracy is very low for short window sizes. The main reason behind this is that it detects numerous urban noises as being siren sounds. Short window sizes make the detection algorithm act as a threshold detection. Accuracy increases with the window length to attain a maximum of 99.5% at 5s and it stays at the same level relatively. From 1000 segments, only 5 sound segments were miss-classified, and these
segments belong to the urban noises’ sounds. The detection time for the different experiments takes an average time of 0.03s ± 0.007s. For 5s the maximum accuracy is reached, and it is selected to be utilised. Another reason is that it is more effective to have a quicker detection.

### 3.2 Vehicle Detection

While training and fine-tuning the hyper-parameters of both the DNNs for vehicle and road detection, we ensure model reaches optima by neither overfitting nor underfitting. The most important hyperparameter that we need to fine-tune will be number of epochs. To determine this, we plot number of epochs with respect to validation and training loss difference. Figure 5 shows that the loss value’s difference is lowest around 100th epoch.

The other important metric to measure efficiency of vehicle detection system is inference time. The inference time of our DNN on the edge device $S_1$, Raspberry Pi varied between 1.55 – 2.3 sec per frame. This is comparable to state-of-the-art $I^2UTS$ framework which had inference time of 1.45 – 1.57 sec per frame. The power consumption of edge device (Raspberry Pi 4) per second on different loads is presented in Table 2. The input voltage and current to edge device was DC 5.1V and 3A.

<table>
<thead>
<tr>
<th>Parameters</th>
<th>Current (Amps)</th>
<th>Voltage (Volts)</th>
<th>Power (Watts)</th>
</tr>
</thead>
<tbody>
<tr>
<td>IoT device not connected to monitor</td>
<td>0.76</td>
<td>5.8</td>
<td>3.12</td>
</tr>
<tr>
<td>IoT device connected to monitor</td>
<td>0.78</td>
<td>5.8</td>
<td>3.45</td>
</tr>
<tr>
<td>IoT device running only detector</td>
<td>1.34</td>
<td>5.23</td>
<td>6.87</td>
</tr>
<tr>
<td>IoT device running detector with connected monitor</td>
<td>1.4</td>
<td>5.23</td>
<td>7.182</td>
</tr>
</tbody>
</table>

The highest power consumption observed was 7.18 W when the detector ran alongside a monitor, constituting only half of the input power supplied. When connected to a traffic camera, the power consumption reduced to 6.87 W. The mean average precision (mAP) for vehicle detection DNN is 79.5% in comparison to 65.10% achieved by state-of-the-art framework $I^2UTS$. If both the metrics inference time and accuracy are looked together, we can easily say that our proposed novel two-stage detector R-YOLO is able to achieve better accuracy in similar inference time.
4 Conclusion

This paper proposes a distributed IoT framework for urban traffic management system using the CCTV camera and sound sensor. The framework uses the two important observations in urban traffic control: 1) Structural Changes to road are minimum. 2) Emergency Vehicles have distinct sound. The novel two stage detector exploits the first observation by detecting road in first stage and vehicles in second stage. The detector achieves 79.5% accuracy that can further be enhanced by training the network on multiple datasets with CCTV footage with viewpoint and illumination variation. The second observation is implemented using acoustic sliding window detection algorithm achieving 99.4% accuracy.

References


DynaVLC – Towards Dynamic GTS Allocation in VLC Networks

Harrison Kurunathan CISTER/ISEP, Polytechnic Institute of Porto, Portugal

Miguel Gutiérrez Gaitán Department of Electrical Engineering, Pontificia Universidad Católica de Chile, Santiago, Chile

Faculty of Engineering, Universidad Andres Bello, Santiago, Chile

CISTER/ISEP, Polytechnic Institute of Porto, Portugal

Ramiro Sámano-Robles CISTER/ISEP, Polytechnic Institute of Porto, Portugal

Eduardo Tovar CISTER/ISEP, Polytechnic Institute of Porto, Portugal

Abstract

Envisioned to deliver superior Quality of Service (QoS) by offering faster data rates and reduced latency in 6G communication scenarios, pioneering communication protocols like the IEEE 802.15.7 are poised to facilitate emerging application trends (e.g. metaverse). The IEEE 802.15.7 standard that supports visible light communication (VLC) provides determinism for time-critical reliable communication through its guaranteed time-slots mechanism of the contention-free period (CFP) while supporting non-time-critical communication through contention-access period (CAP). Nevertheless, the IEEE 802.15.7 MAC structure is fixed and statically defined at the beginning of the network creation. This rigid definition of the network can be detrimental when the traffic characteristics evolve dynamically, for example, due to environmental or user-driven workload conditions. To this purpose, this paper proposes a resource-aware dynamic architecture for IEEE 802.15.7 networks that efficiently adapts the superframe structure to traffic dynamics. Notably, this technique was shown to reduce the overall delay and throughput by up to 45% and 30%, respectively, when compared to the traditional IEEE 802.15.7 protocol performance under the same network conditions.

1 Introduction

With 6G expected on the horizon by 2025, new technologies must be adopted to ensure the flawless usage of 6G [5]. Likewise, with the ever-growing network traffic demand and the need to support high bandwidth applications, researchers are venturing into new communication possibilities, including Visible Light Communication (VLC) and the Terahertz (THz) band. VLC, particularly, is deemed to be well-suited to meet the criteria of emerging applications toward 6G, including Virtual Reality (VR) and augmented Reality (AR), among others.
IEEE 802.15.7 [13] is a communication protocol poised to realize communication in VLC networks. The architecture of this standard supports high data rates of up to 96 Mbps with almost 300 THz of unlicensed spectrum. This makes it ideal to support bandwidth-hungry applications, potentially using existing illumination infrastructure. The standard also offers predictable protocol features enabling the support of critical and demand strict timeliness requirements through its Guaranteed Time Slot (GTS) mechanism, which operates in a periodically synchronized superframe structure (Fig. 1).

Among the key parameters of the protocol is the superframe order (SO) which defines the duration of the active period of the superframe, a.k.a. the superframe duration (SD). Within this scheme, beacons are transmitted between subsequent superframes enabling time synchronization and MAC management. The time interval between beacons is known as the beacon interval (BI). All these parameters can be properly set statically at the beginning of the network to govern overall communication performance. This approach, although suitable in the case of highly stationary network scenarios, prevents achieving adequate Quality of Service (QoS) when traffic characteristics evolve dynamically, for example, due to environmental or user-driven workload conditions.

In fact, in several potential VLC scenarios for 6G, such as healthcare monitoring [9], underwater networks [1] or vehicular communication, to name a few, the data traffic and/or the number of nodes that connect (or disconnect) to a central coordinator can vary frequently, e.g., based on local environmental circumstances [2], mobility of the nodes from one area to another [9], and/or due to multiple nodes reaching the same area and creating a bottleneck, which implies more traffic to be accommodated. The aforementioned static settings are just examples of how a dynamic traffic behavior can lead to inevitable compromises on QoS on metrics such as (worst-case) delay or throughput. This raises a need for a novel VLC network architecture that can adapt protocol features on the fly to varying conditions.

In this paper, we propose an adaptive MAC architecture called the DynaVLC that will dynamically toggle the network parameters and make them suitable to the underlying traffic behavior. This method can adapt efficiently to scenarios where the data traffic demand grow either higher or lower while satisfying QoS requirements such as latency or throughput. This tuning technique can be facilitated by managing entities such as the network coordinators.
typically set to be aware of the network demand requirements to be served by the GTS. More concretely, making the network coordinators demand-aware can be done, for example, by integrating an RPL (Routing Protocol for Lossy Networks) layer over the VLC MAC layer.

We summarize the main contributions presented in this paper as follows:
- We provide a novel dynamic MAC structure tuning architecture called DynaVLC for IEEE 802.15.7 networks that yields better QoS performance.
- We introduce the so-called CAP reduction and modeling of the GTS under the DynaVLC architecture for several scenarios involving varying network demand.
- We derive the worst-case bounds and perform an in-depth performance analysis of the proposed structure covering both throughput and delay analysis.

The rest of the paper is organized as follows: Section 2 provides related works on some of the adaptive techniques devised for VLC networks and general communication protocols. Section 3 presents the CAP reduction technique as one of the key elements of the DynaVLC architecture to increase the number of GTS in the superframe. Section 4 introduces the system model and discusses the topologies and scenarios taken to demonstrate this architecture. Section 5 presents our novel DynaVLC architecture, and Section 6 analyzes its performance. Conclusions and a wrap-up with some discussion of future scope are presented in Section 7.

## 2 Related Works

The research work in [3] proposes a flexible superframe structure that enables sleep modes for priority data handling in IEEE 802.15.7-based real-time sensor networks. This method enables a hybrid mode in the contention access period and contention-free period (CFP) adaptively. The method works by shifting periods and sending priority data with lower bandwidth and delay. While the method holds promise in static/stationary conditions, the improvements do not show to be suitable for evolving traffic conditions.

A different approach is proposed by researchers in [17] who propose a priority MAC based on a multi-parameter for IEEE 802.15.7 VLC networks. They make use of common parameters such as the backoff times (NB), backoff exponent (BE), and contention window (CW) to enable priority-driven multilevel differentiated service. Moreover, using a discrete-time Markov chain model, the authors analyzed the impact of their multi-parameter traffic differentiation on throughput. More recently, a comparison between the traditional IEEE 802.15.7 frame and a novel energy-efficient superframe was done in [4]. This work also considered different inputs such as the biosensors’ battery life as well as adaptive data requirements to vary MAC parameters accordingly. However, in both of these works variations in traffic data were not considered. The data traffic was set as a constant and only the impact of the variation of the MAC parameters such as the BO and SO were considered.

In one of our previous works, we presented the worst-case bounds delay of IEEE 802.15.7 using network calculus [13]. In this work, we explored the possibility of a technique called CAP reduction functionality from IEEE 802.15.4 and carried out a detailed performance analysis. This technique increases the number of GTS slots in the traditional IEEE 802.15.7 MAC frame, thus increasing the scalability for critical communication nodes in the network. Based on these results, we recently presented in [14] the possibility of having a multichannel structure to enhance the allocation for GTS in IEEE 802.15.7 frames. While both of these works focused on increasing the number of GTS timeslots, both the methods are statically defined, and thus cannot adapt to traffic dynamics in evolving network scenarios.

Adaptive superframe is a concept that has been researched for several network protocols like the IEEE 802.15.4 and IEEE 802.15.6. The underlying idea is that superframes are flexible to support GTS requirements [8, 11] where the active period or the CFP is adapted.
as per the requested data. They also can be adapted to support priority data [15, 6] and support specific QoS like the energy efficiency [16]. Along this line of thought, in this work, we propose a novel technique called DynaVLC for IEEE 802.15.7 VLC networks where the superframe structure can be adjusted based on the oncoming traffic needs. The end goal is to significantly improve network throughput and reduce the overall worst-case delay toward deterministic 6G application scenarios.

3 Background to the CAP reduction architecture

CAP reduction is a technique where two or more superframes can be joined together as a multi-superframe and the CAP period between them can be removed and replaced with a CFP period. This technique was first introduced for the IEEE 802.15.4e - DSME network protocol [10] and then extended to the IEEE 802.15.7 protocol in [13].

\[ BI = a \text{Base}SD \times 2^{BO} \text{ optical clocks} \quad \text{for } 0 \leq BO \leq 14 \quad (1) \]

\[ SD = a \text{Base}SD \times 2^{SO} \text{ optical clocks} \quad \text{for } 0 \leq SO \leq BO \leq 14 \quad (2) \]

\[ MD = a \text{Base}SD \times 2^{MO} \text{ optical clocks} \quad \text{for } 0 \leq SO \leq MO \leq BO \leq 14. \quad (3) \]
The number of multi-superframes within a beacon interval is given by $2^{(BO-\text{MO})}$, and the number of superframes within the multi-superframe by $2^{(\text{MO}-\text{SO})}$. To illustrate this scheme we can take the configuration presented in Figure 2, which is a network infrastructure representation where $BO=3$, $MO=3$ and $SO=2$. This is a case where two superframes are stacked within a single multi-superframe. Note that after the network is initiated with these parameters the infrastructure remains unchanged and the setup repeats periodically. For clarity, we briefly describe the most relevant parameters as follows:

$a_{\text{BaseSD}}$ is defined as the minimum duration of a superframe and is set to 60 optical cycles at the initial order of the superframe (i.e., $SO=0$). Formally, this value is defined as:

$$a_{\text{BaseSD}} = \text{Slot Duration} \times T_s$$  \hspace{1cm} (4)

where $T_s$ is the size of the timeslot in the superframe. Note that $T_s$ in a superframe is made up of the data frames and idle frames. Data frames encompass the data transmissions and the idle frames encompass acknowledgments, long interframe spacing (LIFS), short interframe spacing (SIFS), and reduced interframe spacing (RIFS). Then, to develop the worst-case bounds analysis, we must include the GTS transmission, its respective acknowledgments and the CAP region within the multi-superframe. As every VLC superframe comprises 16 timeslots, the size of a single timeslot is denoted as:

$$T_s = \frac{SD}{16} = a_{\text{BaseSD}} \times 2^{SO-4}.$$  \hspace{1cm} (5)

For the parameters that define the multi-superframe architecture of Figure 2, the size of every single timeslot will be 15 optical cycles, the superframe duration will be 240 optical cycles and the entire multi-superframe will be 480 optical cycles. Among these optical cycles 210 optical cycles that correspond to 14 GTS timeslots across the multi-superframe support critical deterministic communication. Now for the same structure when we employ CAP reduction, the CAP region of the second superframe is replaced with a CFP and the inactive period is removed, thus drastically increasing the number of GTS timeslots in the network. In such a case, there will be 315 optical cycles corresponding to 21 GTS timeslots to support critical deterministic communication. Now when multichannel communication can exist over this architecture, i.e., over three multi-channels, there will be a total of 63 GTS timeslots over 315 optical cycles.

The setting of the network parameters and the CAP reduction can also be made application-specific. For instance, in a delay-sensitive network that carries priority traffic, we need an architecture with minimal SD size so that the next packet can be sent with minimal latency. With CAP enabled, the delay due to waiting for the inactive period and the adjacent CAP region can be avoided. In the case of a large-scale network, more nodes must be accommodated within a short period. In such cases, there is a need for a short SD duration but a larger number of superframes within a single multi-superframe. As an illustrative example, different network configurations and their respective application scenarios are shown in Table 1.

<table>
<thead>
<tr>
<th>Application</th>
<th>BO</th>
<th>SO</th>
<th>MO</th>
<th>CAP reduction</th>
<th>SD size</th>
</tr>
</thead>
<tbody>
<tr>
<td>Delay sensitive</td>
<td>6</td>
<td>0</td>
<td>2</td>
<td>enabled</td>
<td>60</td>
</tr>
<tr>
<td>Large scale</td>
<td>10</td>
<td>1</td>
<td>8</td>
<td>enabled</td>
<td>128</td>
</tr>
<tr>
<td>Energy critical</td>
<td>6</td>
<td>1</td>
<td>1</td>
<td>disabled</td>
<td>128</td>
</tr>
<tr>
<td>Reliability</td>
<td>8</td>
<td>2</td>
<td>2</td>
<td>disabled</td>
<td>240</td>
</tr>
</tbody>
</table>

Note: The number of multi-superframes within a beacon interval is given by $2^{(BO-\text{MO})}$, and the number of superframes within the multi-superframe by $2^{(\text{MO}-\text{SO})}$. To illustrate this scheme we can take the configuration presented in Figure 2, which is a network infrastructure representation where $BO=3$, $MO=3$ and $SO=2$. This is a case where two superframes are stacked within a single multi-superframe. Note that after the network is initiated with these parameters the infrastructure remains unchanged and the setup repeats periodically. For clarity, we briefly describe the most relevant parameters as follows:

$a_{\text{BaseSD}}$ is defined as the minimum duration of a superframe and is set to 60 optical cycles at the initial order of the superframe (i.e., $SO=0$). Formally, this value is defined as:

$$a_{\text{BaseSD}} = \text{Slot Duration} \times T_s$$  \hspace{1cm} (4)

where $T_s$ is the size of the timeslot in the superframe. Note that $T_s$ in a superframe is made up of the data frames and idle frames. Data frames encompass the data transmissions and the idle frames encompass acknowledgments, long interframe spacing (LIFS), short interframe spacing (SIFS), and reduced interframe spacing (RIFS). Then, to develop the worst-case bounds analysis, we must include the GTS transmission, its respective acknowledgments and the CAP region within the multi-superframe. As every VLC superframe comprises 16 timeslots, the size of a single timeslot is denoted as:

$$T_s = \frac{SD}{16} = a_{\text{BaseSD}} \times 2^{SO-4}.$$  \hspace{1cm} (5)

For the parameters that define the multi-superframe architecture of Figure 2, the size of every single timeslot will be 15 optical cycles, the superframe duration will be 240 optical cycles and the entire multi-superframe will be 480 optical cycles. Among these optical cycles 210 optical cycles that correspond to 14 GTS timeslots across the multi-superframe support critical deterministic communication. Now for the same structure when we employ CAP reduction, the CAP region of the second superframe is replaced with a CFP and the inactive period is removed, thus drastically increasing the number of GTS timeslots in the network. In such a case, there will be 315 optical cycles corresponding to 21 GTS timeslots to support critical deterministic communication. Now when multichannel communication can exist over this architecture, i.e., over three multi-channels, there will be a total of 63 GTS timeslots over 315 optical cycles.

The setting of the network parameters and the CAP reduction can also be made application-specific. For instance, in a delay-sensitive network that carries priority traffic, we need an architecture with minimal SD size so that the next packet can be sent with minimal latency. With CAP enabled, the delay due to waiting for the inactive period and the adjacent CAP region can be avoided. In the case of a large-scale network, more nodes must be accommodated within a short period. In such cases, there is a need for a short SD duration but a larger number of superframes within a single multi-superframe. As an illustrative example, different network configurations and their respective application scenarios are shown in Table 1.
4 System Model

With the possibility of having multiple channels, enabling multi-channel mesh networks would be feasible in VLC scenarios. Having this in mind, we assume a mesh network as shown in Figure 3. To emulate real networking scenarios, we consider dynamic nodes that join and leave the network. A mesh network consists of a PAN coordinator (node PAN-C in Figure 3), which can transceive messages and beacons. Then there will be Fully Functional Nodes (FFN) that facilitate routing and send beacons for association and timing synchronization. Finally, the Reduced Functional Nodes (RFN) are capable of only receiving messages. Such a network is facilitated with the aid of routing using protocols like the RPL by which a point-to-many-points (P2MP) tree-like network can be devised.

Node association is done through the PAN coordinator. At the inception of the network formation, new FFNs advertise their respective superframe through periodic beacons. Association to an FFN, RFN or a PAN-C is done through an association request.

The nodes in the association process are assumed to be RPL-enabled routing nodes. The PAN-C acts as the sink in the Destination-Oriented Directed Acyclic Graph (DODAG). PAN-C is responsible for transmitting DODAG messages. In the RPL overlay network, all routers (FFNs) continuously broadcast DAG Information Object (DIO) messages to announce the DODAG. A node listens to the DIO messages when it joins the network through the association process. Upon receiving a DIO message from the FFN, the joining node adds the sender’s DIO address to its parent list and calculates its rank based on the specified Objective Function (OF). The Objective Function for the DODAG can be QoS-defining factors such as Link Quality Indicator, Packet Delivery Ratio, or Power Consumption. Finally, the DIO message is updated with the newly computed ranks. The client node then selects its preferred parent from the list of FFNs as the default node through which inbound traffic is directed.

Figure 4 presents the mesh connection network for the network defined in the system model (Figure 3). An optimal schedule that utilizes the minimal number of time slots and channels can be defined using optimization methods like the Symphony [12] (adapted to IEEE 802.15.7 structure). Still, it must follow the mandate that the transmitting nodes do not overlap in time amongst themselves. By using Symphony, we provide a (near) optimal solution that uses 12 GTSs spanning over four channels and three timeslots. A transmission bitmap (Figure 4) will be created based on the transmissions of the mesh network and will be passed on to the underlying link layers using the RPL backbone periodically at every beacon interval.
The proposed solution in this work aids in tackling two major network issues caused by the static assignment of the network parameters at the inception of the network formation. The first problem is a requisite when there is a need for a more guaranteed bandwidth than what is available. More bandwidth will be provided if a smaller SO is defined at the beginning of the network definition. By setting a smaller SO more superframes can be affixed within the multi-superframe duration, further with CAP reduction the total number of guaranteed bandwidth to be serviced can also be drastically increased. However, in the case of a small SO with a large amount of bandwidth available, it could be a negative factor when there is a need for less bandwidth compared to what is available. The more suitable solution for these aforementioned problems is a tunable network architecture that can adjust its network parameters when the network demand changes.

5 DynaVLC architecture

The PAN-C establishes the multi-channel GTS allocation based on the number of channels, the number of GTS time slots and the total available GTS resources $N_{CFP}$. When the CAP reduction primitive is enabled the total number of GTS timeslots $N_{TS}$ augments to $7 + N_{CR}$, where $N_{CR}$ is the number of timeslots added through CAP reduction. In a system with $C$ channels, the total resources available can be computed as $C \times N_{TS}$.

The duration of timeslot in the multi-superframe $T_{MS}$ with $N_\eta$ symbols that encompasses the size of the CAP ($T_{CAP}$) and the CFP created through CAP reduction $T_{CFP}$ can be calculated as:

$$T_{MS} = \frac{N_\eta}{T_{CAP} + T_{CFP}}.$$  (6)

Let $GTS_{min}$ be the minimum number of superframe slots a single GTS can extend over. We present this constraint such that there is a limit for the GTS not to span over multi-superframe duration for a maximum forward delay of $D_{max}$.

$$GTS_{min} = \left\lceil \frac{D_{max}}{T_{MS}} \right\rceil.$$  (7)
For $n$ timeslots with a burst rate $b$ and a data rate $D$, the maximum forward delay $D_{\text{max}}$ can be obtained as:

$$D_{\text{max}} = \frac{b \times BI}{D \times T_{\text{data}}} + (BI - n(T_{MS})). \quad (8)$$

Then, since the maximum number of the GTS varies based on the CAP reduction technique, with $C$ number of channels spanning across the CFP timeslots, the max GTS can be defined as:

$$GTS_{\text{max}} = \min \left( \left( \frac{T_{\text{CAP}} + T_{\text{CFP}}}{GTS_{\text{min}}} \left( 1 - \frac{T_{\text{CAP}}}{T_{MS}} \right) \right), C \times N_{\text{CFP}} \right). \quad (9)$$

Following the availability of the transmission bitmap from the optimal schedule through the RPL backbone, the amount of the required resources $R$ is known to the PAN-C. Based on the requirement of resources, if needed more, the PAN-C adds/removes CAP reduction primitive and increments/decrements the value of MO in the subsequent beacon intervals as shown in Algorithm 1.

**Algorithm 1** DynaVLC algorithm to dynamically tune the superframe to the network demand.

**Input:** BO, SO, MO
optimal schedule from the RPL backbone
Number of channels ($C$) and Number of GTS available ($N_{\text{CFP}}$)

**Initialization**
repeat
  Schedule $R =$ Required number of resources to accommodate the network
  Resource test: check $N_{\text{CFP}} \geq R$ in a multi-superframe

**Problem 1: Minimal resources and high demand**
while $N_{\text{CFP}} \leq R$ do
  CAP Reduction = ON;
  if Resource test true then
    Print: DynaVLC is successful,
  else
    MO = MO + 1;
  end if
end while

**Problem 2: abundant resources and less demand**
while $N_{\text{CFP}} \geq R$ do
  CAP Reduction = OFF;
  if Resource test true then
    Print: DynaVLC is successful,
  else
    MO = MO - 1;
  end if
end while
until Every multi-superframe duration
In a multi-superframe duration \((MD)\) of \(N\) superframes, and for a data rate of \(D\) over \(C\) channels, the maximum throughput for a single multi-superframe duration can be given as:
\[
TH_{\text{max}} = \left( \frac{(N \times T_{\text{MS}}) - T_{\text{idle}}}{GT_{\text{max}}} \right) \times D \times C.
\] (10)

6 Numerical Analysis

To analyze the impact of DynaVLC, we consider an evolving network with the number of GTS transmissions increasing over time and analyze the delay. Let us consider a multi-superframe architecture with \(BO = 6\), \(MO = 1\), \(SO = 1\), such that there will be two superframes within a multi-superframe that repeats for every beacon interval. For this test let us consider three channels spanning over the 7 GTS in the classic IEEE 802.15.7 structure. In the classic VLC structure with 3 channels, there will be a total of 21 GTS slots, which will not be capable of accommodating more than 21 pairs of transmissions. However, when CAP reduction is added to the multi-superframe the number of available GTS increases to 63 individual GTS slots. However, in the case of static CAP reduction, after the 63 timeslots are filled, it waits for the entire CAP duration until the subsequent superframe starts allocating the GTSs.

In the case of DynaVLC, when the number of resources is minimal, initially the CAP reduction kicks in and we get almost the same performance as that of the static CAP reduction. When the entire CFP is full, the resource test \(R\) fails and the value of \(MO\) is increased adding another superframe to the multi-superframe. Now with CAP reduction on all of the superframes, we will have a total of 102 GTS slots for deterministic communication resulting in a decrease in delay by up to 45%.

Figure 5 Impact of DynaVLC on the overall delay of the network - with the increment of MO as the number of GTS transmissions increase more superframes are added into the multi-superframe duration to accommodate the GTSs.

In the second part of our numerical analysis, we study the throughput of the network, comparing the static settings against the DynaVLC. Under static CAP reduction, we switch it “ON” at the beginning of the network, hence it has enough amount of GTS to accommodate the traffic. Yer, as the number of GTS increases, the non-allocated slots will have to wait for an entire CAP period to get served in the subsequent superframe. This results in a decrease in the network throughput, but still, it is higher than the standard VLC by 20-30\%.
In the case of DynaVLC, we initially have the CAP reduction setting “ON” to support the network demand, hence, it provides an identical throughput as the example with CAP reduction. However, as the number of GTS increases, the value of MO is incremented resulting in the addition of more superframes into the multisuperframe. Around 125 GTS requirement with the addition of more superframes into the MD, we witness an increase of throughput and it slowly reduces with the increase of the GTS slots. The throughput will eventually converge when the values of BO and MO become equal and all the GTS slots are occupied.

Figure 6 Impact of DynaVLC on the overall throughput of the network - with the increment of MO as the number of GTS transmissions increase more superframes are added into the multisuperframe duration to accommodate the GTSs.

7 Conclusion

In current VLC network deployments QoS defining MAC parameters such as MO, SO, BI are statically defined. This is an impediment to constantly evolving networks with varying workload conditions. To address the compromises of these static networks, in this research work, we propose a dynamic tuning mechanism called DynaVLC that can adjust the value of MO and CAP reduction to increase the resources available based on the changes in network demand. With DynaVLC, we were able to witness a decrease of 15-45% in delay when compared to the network in static settings, as well as an improvement of 20-30% in terms of the overall throughput. As a future work, we intend to create an open-source implementation of the IEEE 802.15.7 protocol adaptations here introduced with further enhancements towards the existing VLC architecture.

References


Uma Shankar Pandey, Gulshan Soni, and Saroj Kumar Chandra. The impact of alteration of superframe duration on the consumption of energy in the IEEE 802.15.4 MAC. In 2023 5th International Conference on Smart Systems and Inventive Technology (ICSSIT), pages 254–260. IEEE, 2023.

History-Based Run-Time Requirement Enforcement of Non-Functional Properties on MPSoCs

Khalil Esper
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany

Jürgen Teich
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany

Abstract

Embedded system applications usually have requirements regarding non-functional properties of their execution like latency or power consumption. Enforcement of such requirements can be implemented by a reactive control loop, where an enforcer determines based on a system response (feedback) how to control the system, e.g., by selecting the number of active cores allocated to a program or by scaling their voltage/frequency mode. It is of a particular interest to design enforcement strategies for which it is possible to provide formal guarantees with respect to requirement violations, especially under a largely varying environmental input (workload) per execution. In this paper, we consider enforcement strategies that are modeled by a finite state machine (FSM) and the environment by a discrete-time Markov chain. Such a formalization enables the formal verification of temporal properties (verification goals) regarding the satisfaction of requirements of a given enforcement strategy.

In this paper, we propose history-based enforcement FSMs which compute a reaction not just on the current, but on a fixed history of $K$ previously observed system responses. We then analyze the quality of such enforcement FSMs in terms of the probability of satisfying a given set of verification goals and compare them to enforcement FSMs that react solely on the current system response. As experimental results, we present three use cases while considering requirements on latency and power consumption. The results show that history-based enforcement FSMs outperform enforcement FSMs that only consider the current system response regarding the probability of satisfying a given set of verification goals.

2012 ACM Subject Classification
Computer systems organization → Multicore architectures; Theory of computation → Linear logic; Theory of computation → Modal and temporal logics; Hardware → Finite state machines; Computer systems organization → Self-organizing autonomic computing; Theory of computation → Verification by model checking; Mathematics of computing → Probabilistic representations

Keywords and phrases Verification, Runtime Requirement Enforcement, History, Latency

Digital Object Identifier 10.4230/OASIcs.NG-RES.2024.4

Funding This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research-Foundation) – Project Number 146371743 – TRR 89 Invasive Computing.

1 Introduction

Embedded applications usually come with constraints on non-functional properties such as latency, power consumption, temperature, security, etc. A major uncertainty source that affects such properties is the varying workload of the input data\(^1\). Different run-time

\(^1\) Other uncertainties such as caused by resource sharing can be handled systematically by techniques for isolating application programs dynamically at run-time such as invasive computing [1, 32] and therefore not considered here.
management methods exist for dynamic control of program executions. However, most of them have as disadvantages that they cannot provide formal guarantees regarding their capability to fulfill the given requirements.

Run-time requirement enforcement (RRE) techniques [33] have been proposed to enforce a set of non-functional properties of execution of a given application program within defined bounds. Such techniques dynamically adapt system configurations including, e.g., the voltage/frequency settings and/or the number of active cores in reaction to observed system responses. Based on that, FSM-based RREs [10–13, 30] have been proposed for formally specifying and verifying control strategies. Such approaches consider execution properties that can be modeled by requirements [31], i.e. expressions on non-functional properties such as permitted corridors on latency, power consumption, etc. Different verification goals can be specified and formally verified, e.g., the probability with which program executions satisfy a given set of requirements.

The FSM-based RRE approaches in [10–13, 30] define and use a binary requirement response vector that specifies for each given requirement whether it has been satisfied (1) or not (0) in the current execution. Based on such a system response, then determines the next state, respectively configuration to be applied during the next execution. However, it is a challenge to design enforcement FSMs that satisfy a set of verification goals with maximized probabilities, especially if the considered requirements are conflicting with each other like latency and power consumption. A potential for improvements is to let the enforcer consider not only the current, but also system responses from earlier execution iterations when deciding for the next configuration. In this regard, this paper proposes history-based enforcement FSMs that not only consider the current system response, but also a history of previous system responses for reaction.

This paper is organized as follows. Section 2 discusses the related work. In Section 3, we introduce the system model, formally specify history-based enforcement FSMs, and propose three examples of history-based enforcement FSMs that use a history of previous system responses for determining a reaction. Section 4 describes the evaluation of history-based enforcement FSMs for three different use case applications and compares between the proposed history-based enforcement FSMs and enforcement FSMs that do not consider previous responses for reaction. Finally, Section 5 concludes this work.

2 Related Work

Approaches based on heuristics [35], online learning [4, 23–25], or statistical regression [8, 15] are generally not able to provide any formal guarantees regarding the satisfaction or violation of non-functional properties of program executions. Finite state machines (FSMs) have been proposed to formally specify functional system properties [5, 14, 29]. Based on the concept of Run-time Requirement Enforcement (RRE) [33], FSMs have been proposed in [10–13, 30] for feedback-based enforcement of non-functional properties on MPSoCs.

Such FSM-based RREs utilize a requirement response vector that abstracts the system as a function that specifies for each requirement whether it has been fulfilled or violated in the current execution. Based on such a system response, the enforcer reacts by determining the configuration to be applied in the next execution iteration. However, all of the previous approaches only consider the current system response for reaction. In our work, we take into account a time window of $K$ previous responses when deciding for the configuration for the next execution.
In control theory, the principle of *time-delayed feedback* [18, 36] has been proposed to increase the stability of a system. However, this concept has never been applied for controlling software systems. In addition, similar to [6, 16, 26, 27], approaches based on control theory can only give guarantees regarding the control stability or convergence, but not the satisfaction or violation of non-functional properties of program executions.

### 3 Method

In this section, we first present the considered system model and then formulate and propose three different history-based enforcement FSM strategies that take multiple previous system responses into account for taking a reaction.

#### 3.1 System Model

Embedded systems, especially MPSoCs, often consider the execution of periodic applications, e.g., image, video, or periodic control applications. For each individual program execution, a set of non-functional requirements shall be respected, even under environmental changes, e.g., varying input. This input can be represented for each discrete execution $k$ of an application by an *environment feature vector* $i(k) \in I$, where $I$ is called the *environment space* [11]. The program utilizes a number $n$ of cores that can be dynamically changed as well as the their voltage/frequency setting $m$. Such a setting $(n, m)$ is called a configuration $c$ and the set of available configurations a *configuration space* $C$. FSM-based RREs [11] react based on a feedback from the system-under-control by adapting the configuration $c(k+1)$ for the $(k+1)$th execution accordingly. Figure 1 illustrates the considered system model which is described more closely in the following.

![Figure 1](image.png)

**Figure 1** Illustration of a feedback-based RRE. A requirement response vector $r$ is mapped to a binary requirement response vector $\phi$ that will be used by an enforcement FSM $F$ to decide for the next configuration $c(k+1) \in C$. Reprinted from [11].

Depending on an input $i(k) \in I$ and a system configuration $c(k) \in C$, let the $k$-th execution result into a set of $H$ observable non-functional properties, e.g., latency and power consumption in case of $H = 2$ observable properties. The system-under-control can then be described by a *system response function* $r : I \times C \rightarrow \mathbb{R}^H$ [11]. Thus, the *system response* $r(i(k), c(k)) = (o_1(k), \ldots, o_H(k))$ at execution $k$ is a vector of $H$ execution properties of interest, see Figure 1. Now, requirements on these properties, e.g., deadlines, must be fulfilled for each execution, where each property $o_h$, $h = 1, \ldots, H$ can be formulated using corridors from which the following two propositions $\varphi_h^B$ and $\varphi_h^U$ can be derived.
4.4 History-Based Run-Time Requirement Enforcement

\[ \phi_{h}^{LB}(o_{h}(k)) = (LB_{h} \leq o_{h}(k)) \]  
\[ \phi_{h}^{UB}(o_{h}(k)) = (o_{h}(k) \leq UB_{h}) \]

where \( LB_{h} \) and \( UB_{h} \) refer to the lower and the upper bound, respectively, on the execution property \( o_{h} \). The information regarding which proposition is satisfied and which is violated at the \( k \)-th execution is represented by a binary vector \( \beta \) named requirement response. It is obtained from the system response \( r \) using the requirement response function \( \phi \) [11]

\[ \beta(k) := \phi(o_{1}(k), \ldots, o_{H}(k)) = \left( \phi^{LB}(o_{1}(k)), \phi^{UB}(o_{1}(k)), \ldots, \phi^{LB}(o_{H}(k)), \phi^{UB}(o_{H}(k)) \right) \in \{0,1\}^{2H}. \]

This binary requirement response vector \( \beta(k) \) constitutes the input to the enforcement FSM \( F \) that determines the next configuration \( c(k+1) \in C \) to enforce the given non-functional properties for the next execution.

An enforcement FSM (\( F \)) can be formally modeled by a deterministic finite state machine (Moore machine) which is described by a 6-tuple \((Z, z_{0}, B, \delta, C, \gamma)\) [11]:

- \( Z \) is a finite set of states.
- \( z_{0} \in Z \) is the initial state.
- \( B \) is the input alphabet.
- \( \delta \) is the transition relation: \( \delta \subseteq B \times Z \times Z \) with \((\beta, z, z')\) representing a transition from \( z \) to \( z' \) under input \( \beta \).
- \( C \) is the output alphabet, also called configuration space.
- \( \gamma \) is the output function that maps each state to an output (i.e., a configuration): \( \gamma : Z \rightarrow C \).

Finally, in order to quantitatively compare different enforcement strategies, verification goals can be formulated in temporal logic [7]. The two types of temporal logic are linear temporal logic and branching time logic. Linear temporal logic (LTL) describes events over a single time path in the FSM. Branching time logic such as computation tree logic (CTL) quantifies the possible paths from a given state in the FSM. Different levels of strictness of requirement enforcement can be differentiated, see [34]. Accordingly, different verification goals can be defined. For example, the CTL formula \( AF(\phi_{h}) \) for strict enforcement indicates that \( \phi \) holds for every path and at every state on the path. For loose enforcement, \( AF(\phi_{h}) \) specifies that for every possible path there exists a state at which \( \phi \) holds, see [11].

Probabilistic verification goals, based on PCTL [2], can also be formulated to specify stochastic verification goals for loose enforcement. We utilize probabilistic verification goals that are based on steady-state probabilities in Markov chains as they are helpful for obtaining requirement satisfaction probabilities of long execution runs of an application regardless of the initial state. The operator \( S \) is used in PRISM [20] to reason about the steady-state probability of a model [3]. We define the verification goal \( S_{\omega}[\psi] \) as the steady-state probability of being in a satisfying state for the requirement \( \psi \). Finally, we refer to the set of all considered verification goals by \( VG \).

3.2 History-based Enforcement FSMs

In this work, we propose enforcement strategies that not only react based on the current response vector \( \beta(k) \), but additionally on a history of system responses \((\beta(k-1), \ldots, \beta(k-K))\) belonging to the previous \( K \) execution iterations. Note that the case of \( K = 0 \) represents the case of enforcement FSMs that are not history-based, i.e., they only react on \( \beta(k) \) and do not consider previous system responses for reaction.
In this section, we introduce exemplarily three multi-requirement history-based enforcement FSMs $F_1, F_2, F_3$ that consider current response $\beta(k)$ and the previous response $\beta(k-1)$ ($K = 1$) to calculate a proper reaction. These FSMs execute on an MPSoC given with $n = 4$ available cores that can operate in $m = 20$ different power modes (voltage/frequency states). Thus, the size of the configuration space $C$ available for enforcement is $|C| = 4 \cdot 20 = 80$. We also assume these configurations to be power-ascending so that the configuration $c_j$ associated with $(n_j, m_j)$ has a higher power consumption than that of configuration $c_{j-1}$ where $0 \leq j < |C|$.

Let us consider the latency $\alpha_L$ and the power consumption $\alpha_P$ as properties of execution to be enforced, thus $H = 2$. For simplicity, we only utilize one-sided requirements so that the lower bounds are $LB_{\alpha_L} = 0$ and $LB_{\alpha_P} = 0$. Additionally, let the set of states $Z$ be $|Z| = |C|$ such that the output function is a bijection $\gamma : Z \leftrightarrow C$ of sets $Z$ and $C$. Thus, there is a one-to-one relation between enforcer states and configurations, such that each enforcer state $z \in Z$ uniquely outputs one configuration $c \in C$. Based on that, each enforcement FSM has as many states as $|Z| = |C| = 80$, thus, $Z = \{z_0, \ldots, z_{79}\}$, the input $\beta \in B = \{0, 1\}^H = \{0, 1\}^2$ with $\beta = \phi(r'(s,c)) = \phi(\alpha_L, \alpha_P) = (\alpha_L \leq UB_{\alpha_L}, \alpha_P \leq UB_{\alpha_P})$, an assumed initial state $z_0 = 0$. Finally, we assume both the latency requirement $\varphi_L(k)$ and the power requirement $\varphi_P(k)$ are satisfied for executions $k < 0$.

### 3.2.1 Latency Violation-Oriented History-Based Enforcement FSM

This enforcement FSM decreases the current state by exactly one step in case of a violation of a power requirement in both the current execution ($\varphi_P(k)$) and the previous one ($\varphi_P(k-1)$) only when the latency requirement in both the current execution ($\varphi_L(k)$) and the previous one ($\varphi_L(k-1)$) is satisfied. It stays in the same configuration for the other cases when the latency requirement is satisfied in both executions ($\varphi_L(k)$) and ($\varphi_L(k-1)$). It increases by one step in case of a violation of an latency requirement in either the current execution ($\varphi_L(k)$) or the previous one ($\varphi_L(k-1)$). Finally, it increases by two steps in case of a violation of an latency requirement in both the current execution ($\varphi_L(k)$) and the previous one ($\varphi_L(k-1)$). A corresponding enforcement FSM $F_1 = (Z, z_0, I, \gamma, C, \delta_1)$ has the transition relation $\delta_1$ shown in Table 1.

### 3.2.2 Power Violation-Oriented History-Based Enforcement FSM

This enforcement FSM increases the current state by exactly one step in case of a violation of a latency requirement in both the current execution ($\varphi_L(k)$) and the previous one ($\varphi_L(k-1)$) only when the power requirement in both the current execution ($\varphi_P(k)$) and the previous one ($\varphi_P(k-1)$) is satisfied. It stays in the same configuration for the other cases when the power requirement is satisfied in both executions ($\varphi_P(k)$) and ($\varphi_P(k-1)$). It decreases by one step in case of a violation of a power requirement in either the current execution ($\varphi_P(k)$) or the previous one ($\varphi_P(k-1)$). Finally, it decreases by two steps in case of a violation of a power requirement in both the current execution ($\varphi_P(k)$) and the previous one ($\varphi_P(k-1)$). A corresponding enforcement FSM $F_2 = (Z, z_0, I, \gamma, C, \delta_2)$ has the transition relation $\delta_2$ shown in Table 2.

### 3.2.3 Multi-Requirement History-Based Enforcement FSM

This enforcement FSM does not favor any requirement when transitioning between enforcer states. A corresponding enforcement FSM $F_3 = (Z, z_0, I, \gamma, C, \delta_3)$ has the transition relation $\delta_3$ shown in Table 3.
4:6 History-Based Run-Time Requirement Enforcement

Table 1 The transition relation $\delta_1$ of the latency-oriented history-based enforcement FSM $F_1$.

<table>
<thead>
<tr>
<th>$z(k)$</th>
<th>$\beta(k - 1)$</th>
<th>$\beta(k)$</th>
<th>$z(k + 1)$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$z_j$</td>
<td>true</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>true</td>
<td>false</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>true</td>
<td>true</td>
<td>false</td>
</tr>
<tr>
<td>$z_j$</td>
<td>true</td>
<td>false</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>false</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>true</td>
<td>false</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>false</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>false</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>false</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>false</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>false</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>false</td>
<td>true</td>
</tr>
</tbody>
</table>

Table 2 The transition relation $\delta_2$ of the power-oriented history-based enforcement FSM $F_2$.

<table>
<thead>
<tr>
<th>$z(k)$</th>
<th>$\beta(k - 1)$</th>
<th>$\beta(k)$</th>
<th>$z(k + 1)$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$z_j$</td>
<td>true</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>true</td>
<td>false</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>true</td>
<td>true</td>
<td>false</td>
</tr>
<tr>
<td>$z_j$</td>
<td>true</td>
<td>false</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>true</td>
<td>true</td>
<td>false</td>
</tr>
<tr>
<td>$z_j$</td>
<td>true</td>
<td>false</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>true</td>
<td>false</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>true</td>
<td>true</td>
</tr>
</tbody>
</table>

4 Experimental Results

In this section, we introduce three applications for evaluating the proposed enforcement FSMs in Section 3.2. For comparison, we also perform a design space exploration (DSE) method from [13] to generate optimized enforcement FSMs for a given set of verification goals $VG$ for each application, where these enforcement FSMs do not consider previous system responses for reaction. To perform the DSE, the NSGA-II [9] multi-objective evolutionary
Table 3 The transition relation $\delta_3$ of the multi-requirement history-based enforcement FSM $F_3$. 

<table>
<thead>
<tr>
<th>$z(k)$</th>
<th>$\beta(k-1)$</th>
<th>$\beta(k)$</th>
<th>$z(k+1)$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$z_j$</td>
<td>true</td>
<td>true</td>
<td>$z_j$</td>
</tr>
<tr>
<td>$z_j$</td>
<td>true</td>
<td>false</td>
<td>$z_{j-1}$</td>
</tr>
<tr>
<td>$z_j$</td>
<td>true</td>
<td>false</td>
<td>$z_{j-1}$</td>
</tr>
<tr>
<td>$z_j$</td>
<td>true</td>
<td>false</td>
<td>$z_{j-2}$</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>true</td>
<td>$z_{j+1}$</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>true</td>
<td>$z_{j}$</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>false</td>
<td>$z_{j-1}$</td>
</tr>
<tr>
<td>$z_j$</td>
<td>true</td>
<td>false</td>
<td>$z_{j+1}$</td>
</tr>
<tr>
<td>$z_j$</td>
<td>true</td>
<td>false</td>
<td>$z_{j}$</td>
</tr>
<tr>
<td>$z_j$</td>
<td>true</td>
<td>false</td>
<td>$z_{j-1}$</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>true</td>
<td>$z_{j+2}$</td>
</tr>
<tr>
<td>$z_j$</td>
<td>false</td>
<td>false</td>
<td>$z_{j+1}$</td>
</tr>
</tbody>
</table>

algorithm provided by the optimization framework Opt4J [22] is used. Each run of the DSE features 100 iterations with a population size of 20 enforcement FSMs with a crossover probability of 0.9 and a mutation probability of 0.01. Each experiment was repeated three times to compensate for the randomness of the exploration.

4.1 Applications

We consider three applications for evaluation. Each application is modeled by a graph of actors, where each actor processes an input $i(k)$ in each iteration $k$. The applications execute on a tiled many-core system that consists of a set of processing cores, peripherals like memories, and a network adapter, which are interconnected via a tile-local bus system. For this matter, a simulation framework called InvadeSIM [28], a many-core simulator for parallel applications is used.

4.1.1 Object Detection Application

An image processing application that detects a given object in each image frame by applying a scale-invariant feature transform (SIFT) matching algorithm. We use a driving car image sequence $R$ of the KITTI-360 dataset [21] with $|R| = 100$ frames, a latency lower bound $LB_{oL} = 0$ ms and an upper bound (deadline) $UB_{oL} = 65$ W, an power lower bound $LB_{oP} = 0$ mJ and an upper bound $UB_{oP} = 5$ W.

4.1.2 String Search Application

This application stems from the ParMiBench benchmark suite [17] that searches in a given input text $k$ with $i(k)$ lines for a given pattern. For this use case, we create a trace of $|R| = 100$ randomly generated texts, each having $i(k)$ lines. We use the bounds $LB_{oL} = 0$ ms, $UB_{oL} = 15$ ms, $LB_{oP} = 0$ W and $UB_{oP} = 1.5$ W.
History-Based Run-Time Requirement Enforcement

4.1.3 Secure Hash Application

Another application from the ParMiBench benchmark suite [17]. This security application computes the hash for the input \( k\) that consists of \( i(k) \) messages. For this use case, we create a trace of \( |R| = 100 \) randomly generated inputs \( k\), with \( LB_L = 0\) ms, \( UB_L = 9\) ms, \( LB_P = 0\) W, and \( UB_P = 3\) W.

4.2 Results

Figure 2 shows the verification results of the proposed history-based enforcers \( F_1, F_2, F_3 \) that consider the previous response \( \beta(k-1) \) together with enforcement FSMs that are obtained from the DSE method in [13] and do not consider any previous response for reaction, as well as race-to-idle (i.e., running with the highest configuration \( c_{79} \)) and pace-to-idle (i.e., running with the slowest configuration \( c_0 \)) [19].

We notice in Figure 2 that the history-based enforcement FSMs \( F_1 \) and \( F_2 \) are not dominated by any other enforcement FSM in all of the three applications. Also, the history-based enforcement FSM \( F_3 \) is not dominated in the case of string search application. This shows that reacting based on a history of previous system responses can enhance the probability of satisfying the considered verification goals. The reason is the larger design space of transition possibilities in the enforcement FSM.

Table 4 shows the average verification time, number of states, and transitions for 10 randomly-generated enforcement FSMs with different history options. Enforcement FSMs with \( K = 0 \) indicates that they only react on the current system response \( \beta(k) \). History-based enforcement FSMs with \( K = 1 \) implies that they include the previous system response \( \beta(k-1) \) for reaction as well as \( \beta(k) \). Finally, history-based enforcement FSMs with \( K = 2 \) consider the system responses \( \beta(k-2), \beta(k-1), \) and \( \beta(k) \) for reaction. We notice that reacting based on previous system responses leads to a substantial increase in verification times. This is explained by the increase of number of states and transitions of the resulting enforcement FSM. We also notice that this increase is proportional to the length of the time window \( K \) of previously considered responses.

5 Conclusion

In this paper, we proposed to integrate a history of previous system responses into the design of enforcement strategies. The evaluation shows that such history-based enforcement FSMs have the potential to have higher probabilities of satisfying a given set of verification goals.
Table 4 Average verification time, number of states, and transitions for 10 randomly-generated enforcement FSMs with different history options.

<table>
<thead>
<tr>
<th>Application</th>
<th>$K = 0$</th>
<th>$K = 1$</th>
<th>$K = 2$</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>time (ms)</td>
<td>states</td>
<td>transitions</td>
</tr>
<tr>
<td>Object detection</td>
<td>133.0</td>
<td>262.7</td>
<td>710.7</td>
</tr>
<tr>
<td>String Search</td>
<td>416.1</td>
<td>1,259.4</td>
<td>5,166.7</td>
</tr>
<tr>
<td>SHA</td>
<td>179.3</td>
<td>453.9</td>
<td>1,513.5</td>
</tr>
</tbody>
</table>

than enforcement FSMs that do not consider any system response history. This offers system designers with a trade-off between complexity and performance. In the future, we aim to automatically optimize history-based FSMs for a given set of verification goals.

References


4:10  History-Based Run-Time Requirement Enforcement


