STR2RTS: Refactored StreamIT benchmarks into statically analyzable parallel benchmarks for WCET estimation and real-time scheduling

Finding non-proprietary, architecture-independent, exploitable parallel benchmarks for Worst-Case Execution Time (WCET) estimation and real-time scheduling has proven difficult for the community. In contrast with the single-core era and its Mälardalen benchmark suite [12], there is no consensus on a parallel benchmark suite. This document bridges part of this gap by presenting a collection of benchmarks with the following properties: (i) easily analyzable by static WCET estimation tools (written in structured C, with neither goto statements nor dynamic memory allocation, and containing flow information such as loop bounds); (ii) independent from any particular run-time system (MPI, OpenMP) or real-time operating system. Each benchmark is composed of the C source code of its tasks and an XML description of the structure of the application (tasks and, when applicable, the amount of data exchanged between them). Each benchmark can be integrated in a full end-to-end empirical validation protocol on multi-core architectures. This collection of benchmarks is derived from the well-known StreamIT benchmark suite [21] and will be integrated in the TACleBench suite [11] in the near future. All these benchmarks are available at https://gitlab.inria.fr/brouxel/STR2RTS.

1998 ACM Subject Classification: C.3 Real-time and embedded systems


1 Introduction

Several non-proprietary parallel benchmarks already exist for the experimental validation of real-time systems. However, they have some limitations. Some consist only of sets of independent periodic tasks [6, 8], with no synchronization/communication between tasks. Others lack the information needed to perform WCET estimation or scheduling, e.g. source code [9, 10, 20] or a representation of dependencies [2, 14, 4, 19, 13], or are dependent on particular hardware or run-time systems. Other studies prefer task-set generators for validation, but these cannot be used for a full end-to-end experimental validation as they lack source code.
This document aims at providing a collection of parallel benchmarks for the experimental evaluation of real-time systems on multi-/many-core architectures. The targeted audience is the real-time systems research community at large, including researchers on WCET estimation and real-time scheduling. The document can benefit experts in multi-core scheduling, who can experiment with their task mapping and scheduling techniques. It can also benefit researchers on worst-case execution time estimation, both on single-core architectures, by analyzing each task of the parallel application in isolation, and on multi-core architectures, through an analysis of the entire parallel application, including for instance analyses of contentions on shared resources such as buses and caches.
To ease the creation of a collection of benchmarks with all the required information for WCET estimation and scheduling, we started from the StreamIT benchmark suite [21], which consists of a set of Digital Signal Processing (DSP) applications. Such applications consume incoming data and produce outgoing data at a specific rate, which is representative of many real-time applications.
The information provided for each application consists of an XML file and a C source file. The XML file describes the structure of the application as a directed acyclic graph (identification of tasks and of the dependencies between them, volume of data transmitted between tasks, and the WCET of tasks on a particular architecture if the benchmark is to be used for real-time scheduling only). The C file contains the source code of each task. The source code is statically analyzable and self-contained, to allow static WCET estimation on any specific architecture (other estimation techniques, such as probabilistic or measurement-based ones, are obviously not excluded). In particular, the C code contains pragmas expressing loop bounds in the format used in the TACleBench benchmark suite [11]. We plan to integrate these benchmarks in TACleBench, as it aims to be the reference benchmark suite for WCET estimation at code level for both single-core and multi-/many-core architectures.
The rest of this document is organized as follows. Section 2 compares our work with existing benchmark suites. Section 3 presents background knowledge about the StreamIT benchmark suite, which is used as the basis for STR2RTS. Section 4 provides an overview of the provided material, and Section 5 gives some qualitative and quantitative information on the provided benchmarks, before Section 6 concludes.

2 Related Work
The usefulness of benchmarks for the validation of systems no longer has to be demonstrated. They have been used extensively in the past to experiment with new algorithms, new software, or new pieces of hardware. In computer science there exist hundreds of benchmark suites, with different purposes and different sizes: SPEC CPU 2006, PolyBench [17], ParMiBench [13], UTDSP [14], Parsec [4], JemBench [19], ParaSuite [2], and many more. However, very few of them have been engineered for multi-core real-time systems. This kind of system requires more information in the benchmark suite than just the code, typical input data and a description, which is the material generally provided. Indeed, to be largely accepted by the real-time community, a benchmark suite must include source code that is statically analyzable, to allow experiments with both static and non-static WCET estimation methods. A non-exhaustive list of the requirements for benchmarks targeting real-time embedded systems would include: (i) structured, self-contained source code, i.e. no goto statements, no dynamic memory allocation, no calls to external libraries; (ii) statically computable loop bounds, or flow facts for loop bounds; (iii) deadlines and periods of tasks. Adding multi-core constraints introduces further requirements, such as the amount of data exchanged between communicating tasks and, if applicable, a representation of the dependencies between tasks. The benchmark suite should also remain independent from any specific run-time environment (e.g. OpenMP, MPI) to be usable as easily as possible.
From the single-core era, the Mälardalen benchmark suite [12] has been widely accepted by the WCET estimation community. It consists of small pieces of key code representing well-known code structures found in embedded real-time software. Although representative of embedded software, this benchmark suite contains only sequential code, and the large majority of the provided programs are very small.
A common practice to evaluate scheduling strategies is to use task graph generators. They have the benefit of being architecture-independent and of generating a vast number of different topologies. Task Graph For Free (TGFF) [10] and Synchronous Dataflow 3 (SDF3) [20] can generate task graphs with dependent tasks in a deterministic way, allowing anybody to replay an experiment as long as the configuration parameters are known. UUniFast [6] is an algorithm generating task sets with a uniform distribution in a given space. Task graph generators are very useful when the goal is to empirically validate a method on a large variety of task graph topologies. However, concrete representative applications with source code remain necessary for full end-to-end empirical validation, which is what we aim at providing here.
Three real parallel applications targeting real-time systems are often used as benchmarks: Debie1 [1], Papabench [15] and Rosace [16]. All are control applications, respectively for a satellite, a drone and a plane. But three concrete applications are not enough. Our objective with the benchmark suite we provide is to enrich the set of applications that can be used to validate multi-core real-time systems, and to enlarge the scope of applications to include signal processing applications with dependencies between tasks.
De Bock et al. [8] proposed a benchmark generator targeting multi-core platforms. The generator input is the sequential code of each task; all tasks are independent. The generator output is a task set fitting some requirements. In comparison, the benchmarks we provide include dependent tasks, a representation of these dependencies, and the amount of data exchanged between dependent tasks.
To the best of our knowledge, the benchmark suite closest to this work is the StreamIT benchmark suite [21], which we use as a baseline. In the original version of StreamIT, the authors provide a representation of task dependencies, a task graph, with communications (exchanged tokens) and source code. However, the provided C source code is only sequential, and the generated code is not WCET-friendly: some benchmarks are impossible to analyze statically, i.e. statically extracting loop bounds might not be possible with available tools. In addition, there is nearly no cache reuse, since tasks performing the same function are systematically duplicated in the generated code. Moreover, dynamic memory allocation is used for the messages exchanged in inter-task communications. The C version we provide respects the task graph extracted from the original StreamIT tool, with the benefit of allowing static analyses of each function in isolation.
Finally, the new TACleBench suite [11] aims at becoming the de facto standard benchmark suite for timing analysis. This work will be integrated in TACleBench in order to strengthen the multi-/many-core dimension of this suite.

3 Background on StreamIT
StreamIT [21] is a high-level language for developing streaming applications (applications acting on flows of data) modeled as Synchronous Dataflow (SDF) graphs. The StreamIT language has a portable run-time environment and is architecture-independent. The main difference between StreamIT and other streaming languages lies in the well-defined structure it imposes on streams, which cannot form an arbitrary network of nodes. One of the major properties of the StreamIT benchmarks is that data rates are fixed, and thus known at compile time.
All graphs in the StreamIT language consist of a hierarchical composition of nodes structured with pipeline, split-join and feedbackloop constructs. Streaming applications can then be represented as Cyclo-Static DataFlow (CSDF) graphs [5]. Their execution consists of two phases: the initialization and the steady state. The latter is considered to repeat indefinitely, whereas the former is performed only once and aims at registering the tasks of the steady state in the StreamIT scheduler. A streaming application can be seen as a flow of computational units producing and consuming data, the data stream. The basic computational unit of StreamIT applications is the filter. Each filter is a task that produces and consumes tokens. Communicating filters are organized in a stream to create a pipeline (chain) of filters. More complex stream structures can be realized with the split-join and feedbackloop constructs. The former splits the data stream into parallel streams before joining them again, whereas the latter re-injects upstream data produced downstream. Conditional control flow is not allowed at the application level (there is no concept of conditional execution of filters); in contrast, there may be control flow inside filter code. The data stream is propagated through the filters of the graph at a constant rate known at compile time, which makes it possible to statically determine the amount of data exchanged between filters. These data are transmitted through dataflow channels implemented as FIFO (First In, First Out) queues.
The StreamIT language is illustrated on one of the smallest applications from the StreamIT benchmark suite: the radix-2 case of a Fast Fourier Transform (FFT4.str). The application source code in the StreamIT language is presented in Listing 1. Lines 1-21 specify the structure of the streaming application, while lines 22-31 give the source code of the filters (for conciseness, only the StreamIT code of the filter Add is given).
The first element (line 1: FFT4) is the top-level envelope (equivalent to the main function in C code); it registers three other elements which are added to the global structure (a pipeline, i.e. a chain of elements, for FFT4). Elements are added directly or explored recursively depending on their type. For instance, OneSource (line 22) is added directly because it is a simple filter, whereas Butterfly (line 10) is explored because it is a composition of elements (here, a pipeline). The code of a very simple filter (Add) is given in lines 24-29. It is decomposed into two functions: the initialization part (line 25) and the work function for the steady state (lines 26-28). Due to the simplistic nature of this example the initialization part is empty, but one could easily imagine some constant initialization for the steady state. The work function corresponds to the C-like code that will be executed at each iteration of the steady state. This function calls, at line 27, the two functions pop and push, which respectively fetch data from and store data to the FIFO channels connected to the previous/next dependent tasks.
The program structure extracted from the StreamIT application of Listing 1 is presented in Figure 1; it illustrates the steady state of the application.
The StreamIT benchmark suite comes with an end-to-end compilation tool chain, illustrated in Figure 2. It first parses the StreamIT language and generates a Java version of the streaming application. This Java version is then converted to an intermediate representation used by internal tools to analyze the application. In short, these tools perform: Partitioning: determining the number of fissions and fusions, used to decide where to insert/remove split-join nodes in the generated code; Mapping: determining on which core each job implementing a filter will run; Scheduling: determining in which order jobs will be executed; Code generation: generating code for the targeted architecture (generally C/C++) through the provision of several back-ends. The last step generates code which can be compiled in order to run the application on the targeted architecture (RAW processor, Tilera, RStream and so on). The Java version can also be executed using a simulation library included in the StreamIT project. This simulator runs a sequential version of the streaming application.
Despite the work done on the StreamIT tool chain, none of the provided back-ends generate code ready to be analyzed in the context of real-time systems. The simpleC back-end generates only one big main function containing all the code, leading to a sequential version not suitable for multi-core analysis/execution. The newSimple back-end generates source code that is hard to read and analyze, in which tasks can no longer be identified. The cluster back-end generates code that may not be analyzable by loop bound extractors, and includes libraries provided by StreamIT with C++ classes and dynamic memory allocation, which makes it unsuitable for static WCET analysis. In addition, when the same filter is used several times, its code is duplicated, thus degrading the WCET of tasks on architectures with caches.
As no back-end fulfills all the requirements implied by real-time systems and the corresponding analyses, we modified the code of the StreamIT benchmarks, as detailed in Section 4.2, to fit the needs of the real-time system community. Among the tools coming with the StreamIT tool chain, we only used the simulation library, to ensure that our modifications to the StreamIT codes are functionally equivalent to the original code.

4.1 Provided information
Each benchmark is divided into four files: an XML file, a DOT file, a C source file and its corresponding header file. The DOT file is a graphical representation of the tasks and their dependencies using the graphviz software; neither it nor the header file is presented here. Following is an example of an XML description with its corresponding C source code.
An XML file summarizing all the provided information is presented in Listing 2 and corresponds to our previous example from Figure 1. This file describes the structure of the application as a Directed Acyclic Graph (DAG), with tasks as nodes and channels as edges. It can be used as input by mapping/scheduling tools to experimentally evaluate new mapping/scheduling strategies involving either a single application or multiple applications, both modeled as DAGs. Another usage, once tasks have been assigned to the cores of a multi-core platform, is to use the XML file together with the code of the tasks to perform WCET estimation on the application, in particular integrating contentions for access to shared resources in the WCET of the application. For each task, the XML file contains the set of predecessors of the task, the amount of data received from each of them, and the task's WCET. The XML tag prev represents a task's dependencies as precedences, e.g. line 15, where Split2DUPLICATE is a predecessor of Subtract. The associated attribute data-sent specifies the amount of data needed for one execution of the task, and the attribute data-type specifies the type of the data (e.g. int, double, float).
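To fix ideas, a fragment of such a description might look as follows. This is a hypothetical sketch based only on the tags and attributes described above (prev, data-sent, data-type, WCET); the exact schema shipped with the suite may differ.

```xml
<!-- Hypothetical fragment; the actual schema in the suite may differ. -->
<task name="Subtract" WCET="118">
  <!-- Subtract receives 2 float tokens from Split2DUPLICATE
       for each of its executions. -->
  <prev name="Split2DUPLICATE" data-sent="2" data-type="float"/>
</task>
```

A scheduler can read such entries to build the DAG of tasks and annotate each edge with its communication volume.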
The attribute WCET is provided as information for people who aim at performing experiments on mapping/scheduling techniques and do not wish to perform an initial WCET analysis step. The provided WCETs were estimated by our static WCET analysis tool Heptane [7] for the MIPS instruction set, on an architecture without caches or pipelines (roughly, the provided WCET corresponds to the worst-case number of instructions executed by each task).
Listing 3 introduces the structure of the provided C source code. Each task from the aforementioned graph appears as a C function in the source file. The code of the filter/task Add is given as an example in lines 8-14. Depending on the value of GLOBAL_N (a constant evaluated by the C pre-processor), this filter reads two float items from the input channel (pop_float), then sums them before writing the result into the output channel (push_float). The loop is annotated with a pragma specifying the loop bound, according to the TACleBench syntax for flow-fact annotations. The value of GLOBAL_N has an impact on the number of added tasks (the number of Butterfly instances added from Listing 1). In the C source code, we fix this parameter in the header file so that the C source code is consistent with the XML description. In this example the value of GLOBAL_N is set to 2.

Listing 3 C version of the FFT4 stream program
Lines 22-41 show the sequential_main function, which corresponds to an execution of all tasks on a single-core architecture. Function sequential_main first calls the initialization functions of all tasks having a non-empty initialization phase (line 23). This initialization step sets up all C structures, buffers and pre-computed data required by the filters for the steady-state run. sequential_main then calls each filter function in a loop of MAX_ITERATION iterations (lines 28-39) for the steady-state execution. Functions are called in an order that respects the dependencies between tasks. This function is provided for users interested in single-core WCET estimation. It was also used to check the correctness of the code modifications applied to the StreamIT benchmarks, by comparing the results to those produced by the StreamIT Java simulator.
Regarding communications between tasks, a C file implementing the push/pop communication functions has to be provided and linked with the code of each application. Since the implementation of communications is architecture- and system-dependent, this file has to be provided for every (architecture, system) pair. As a starting point, we provide a simple implementation of the push/pop operations that communicates through shared memory, using statically allocated FIFO buffers. This simple implementation can be used on single-core architectures and on multi-core architectures with shared memory.

4.2 Benchmark construction process
To extract the above information for each benchmark, we relied on the StreamIT compilation tools as much as possible, and then adapted their output to fit our needs. As shown by the dashed line in Figure 2, we modified the Java pretty-printer to generate a preliminary C version of the streaming application, which then needs to be modified manually to match the analysis requirements. When finalizing the C source code through these hand-made modifications, we stayed independent from any specific run-time library and inter-core communication mechanism. Despite the error-proneness of this method, this hand-made step is necessary to guarantee code that is easy to read, analyze and understand, with all the required annotations. To validate the functional correctness of the final C version, we performed non-regression tests, taking the Java simulator output as the baseline.
To create the XML description, we needed the WCET of each task, the amount of data exchanged between tasks and the topology of the application's graph. For the WCETs, we relied on our tool Heptane [7], which gives the WCET of each task in isolation. The amount of data exchanged and the topology of the graph are extracted manually from files generated by the Java simulator.

5 Provided benchmarks

Table 1 summarizes the benchmarks that are ready to use at the time of writing. The first column gives the name of the benchmark (identical to the name in the original StreamIT benchmark suite), followed by the number of tasks, the number of split-join nodes and a quick description (also extracted from the original StreamIT benchmark suite). For the 802.11a application, which comes in multiple versions (explained later), we provide the minimum and maximum of the reported values among all versions.
Table 2 shows the complexity of each benchmark. After the name of the benchmark, the second column gives the width of the graph (the maximum number of tasks at the same topological rank), which gives an idea of the amount of concurrency in the application. The following columns give information about the tasks' WCETs and the amount of data exchanged between tasks; both are reported as an average and a standard deviation. Table 3 indicates which benchmarks need a mathematics library to compile, and which use an input and/or output file. To ensure self-containment, we nonetheless provide a dummy implementation (an empty shell) for the needed functions.
We found some benchmarks with multiple uses of the same task, with different input parameters at different points in the application. We thus generated two versions of each such benchmark: one with shared code to allow cache reuse, and one with duplicated code. The difference between the two versions lies in the ability to exploit cache reuse, and in the accuracy of the flow-fact annotations (which are more precise with duplicated code). The last column of Table 3 indicates whether we created multiple versions of the benchmark, with and without code reuse.
Finally, some benchmarks are customizable by modifying the value of some parameters inside the StreamIT source code, e.g. the data rate of the 802.11a application. As modifying such values has an impact on the application's structure, we generated multiple versions of the same benchmark for the different configurations. We successfully compiled the list of benchmarks presented in Table 1 for the x86_64 architecture and validated their behavior by comparing their results with those of the Java simulator provided by StreamIT. All these benchmarks are available at https://gitlab.inria.fr/brouxel/STR2RTS.

6 Conclusion
This document has presented a collection of benchmarks written in analyzable C and based on the StreamIT benchmark suite [21]. The purpose of the refactoring of the StreamIT applications we have performed is to create self-contained, analyzable, architecture-independent parallel C applications, allowing any kind of experiment on WCET analysis and real-time scheduling on multi-core architectures. To spread our work widely, we will integrate this collection of benchmarks into the TACleBench project. Due to the required hand-made refactoring, we will be continuously adding new test cases over time, as some StreamIT applications remain to be refactored. We foresee the end of the refactoring of the remaining 60% of the benchmarks within a couple of years.

Figure 2 The StreamIT tool-chain.

Table 1 Description of the provided benchmarks.

Table 2 Statistics for the provided benchmarks.

Table 3 Properties of the provided benchmarks.