Lazy Spilling for a Time-Predictable Stack Cache: Implementation and Analysis

Sahar Abbaspour¹, Alexander Jordan¹, and Florian Brandner²

¹ Department of Applied Mathematics and Computer Science
  Technical University of Denmark {sabb, alejo}@dtu.dk
² Computer Science and System Engineering Department
  ENSTA ParisTech florian.brandner@ensta-paristech.fr

Abstract
The growing complexity of modern computer architectures increasingly complicates the prediction of the run-time behavior of software. For real-time systems, where a safe estimation of the program’s worst-case execution time is needed, time-predictable computer architectures promise to resolve this problem. A stack cache, for instance, allows the compiler to efficiently cache a program’s stack, while static analysis of its behavior remains easy. Likewise, its implementation requires little hardware overhead.

This work introduces an optimization of the standard stack cache to avoid redundant spilling of the cache content to main memory, if the content was not modified in the meantime. At first sight, this appears to be an average-case optimization. Indeed, measurements show that the number of cache blocks spilled is reduced to about 17% and 30% in the mean, depending on the stack cache size. Furthermore, we show that lazy spilling can be analyzed with little extra effort, which benefits the worst-case spilling behavior that is relevant for a real-time system.

Introduction
In order to meet the timing constraints in systems with hard deadlines, the worst-case execution time (WCET) of real-time software needs to be bounded. This WCET bound should never underestimate the execution time and should be as tight as possible. Many features of modern processor architectures, such as pipelining, caches, and branch predictors, improve the average performance, but have an adverse effect on WCET analysis. Time-predictable computer architectures propose alternative designs that are easier to analyze.

Memory accesses are crucial for performance. This also applies to time-predictable alternatives. Analyzable memory and cache designs as well as their analysis thus recently gained considerable attention [13, 8, 9]. One such alternative cache design is the stack cache [1, 5], i.e., a specialized cache dedicated to stack data, which is intended as a complement to a regular data cache. This design has several advantages. Firstly, the number of accesses going through the regular data cache is greatly reduced, promising improved analysis precision. For instance, imprecise information on access addresses can no longer interfere with the analysis of stack accesses (and vice versa). Secondly, the stack cache design is simple and thus easy to analyze [5]. The WCET analysis of traditional caches requires precise knowledge about the addresses of accesses [13] and has to take the (from an analysis point of view
complex) replacement policy into account. The analysis of the stack cache on the other hand is much easier and amounts to a simple analysis of the cache’s fill level (occupancy) [5].

In this paper we propose an optimization to improve the performance of the stack cache’s reserve operation based on the following observation: in many cases, the actual data spilled by a reserve has exactly the same value as the data already stored in main memory. This may happen in situations where data is repeatedly spilled, e.g., due to calls in a loop, but not modified in the meantime. Thus, the main idea is to track the amount of data that is coherent between the main memory and the cache. The cache can then avoid to needlessly spill coherent data. We show that this tracking can be realized efficiently using a single pointer, the so-called lazy pointer. We furthermore show that the existing analysis [5] can be easily adapted to this optimization, by simply refining the notion of occupancy, to account for the coherent space defined by the lazy pointer.

Section 2 introduces the stack cache, followed by a discussion of related work. Section 4 presents the motivation for our work. In Section 5, we explain lazily spilling and its static analysis. We finally present the results from our experiments in Section 6.

2 Background

The original stack cache [1] is implemented as a kind of ring buffer with two pointers: stack top (ST) and memory top (MT). The former points to the top of the logical stack, which consists of all the stack data that is either stored in the cache or main memory. The latter points to the top element present in main memory only. For simplicity, we herein assume a hardware implementation with a stack growing towards lower addresses.

The difference MT − ST represents the amount of occupied space in the stack cache. Clearly, this value cannot exceed the total size of the stack cache’s memory |SC|, thus 0 ≤ MT − ST ≤ |SC|. The stack control instructions hence manipulate the two stack pointers, while preserving this equation, and initiate the corresponding memory transfers as needed. A brief summary is given below (details are available in [1]):

- **sres x**: Subtract x from ST. If this violates the equation from above, i.e., the stack cache size is exceeded, a spill is initiated, which lowers MT until the invariant is satisfied again.
- **sfree x**: Add x to ST. If this results in a violation of the invariant, MT is incremented accordingly. Memory is not accessed.
- **sens x**: Ensure that the occupancy is larger than x. If this is not the case, a fill is initiated, which increments MT accordingly so that MT − ST ≥ x holds.

Using these three instructions, the compiler generates code to manage the stack frames of functions, quite similar to other architectures with exception of the ensure instruction. For brevity, we assume a simplified placement of these instructions. Stack frames are allocated upon entering a function (sres) and freed immediately before returning from a function (sfree). The content of a function’s stack frame might be evicted from the cache upon function calls. The compiler thus ensures a valid stack cache state, immediately after each call site (sens). This simplified placement can be relaxed as discussed in Section 5.2. We furthermore restrict functions to only operate on their locally allocated stack frames. Any stack data shared between functions, or exceeding the stack cache’s size, is allocated on the so-called shadow stack outside the stack cache.

---

1 Note that the stack control instructions could also be implemented by means of software.
3 Related Work

Static analysis [12, 3] of caches typically proceeds in two phases: (1) potential addresses of memory accesses are determined, (2) the potential cache content for every program point is computed. Through its simpler analysis model, the stack cache does not require the precise knowledge of addresses, thus eliminating a source of complexity and imprecision. It has been previously shown that the stack cache serves up to 75% of the dynamic memory accesses [1]. Our approach to compute the worst-case behavior of the stack cache has some similarity to techniques used to statically analyze the maximum stack depth [2]. Also related to the concept of the stack cache, is the register-window mechanism of the SPARC architecture, for which limited WCET analysis support exists in Tidorum Ltd.’s Bound-T tool [11, Section 2.2].

Alternative caching mechanisms for program data exist with the Stack Value File [6] and several solutions based on Scratchpad Memory (SPM) (e.g. [7]), which manage the stack in either hardware or software.

4 Motivating Example

Figure 1 shows a function bar using the stack cache allocating a stack frame of two words (l. 2) on a stack cache of size of 8 words. The stack frame is freed before returning from the function (l. 16). The loop (l. 5–14), repeatedly calls function foo. We assume function foo reserves 8 words, thus evicts the entire stack cache content and displaces 8 words.

Assuming no stack data has been allocated so far, the stack cache is entirely empty. The MT and ST pointers are pointing to the top of the stack at address 256. The sres instruction of function bar (l. 2) decrements ST by two, without any spilling. At this point, MT points to address 256 and ST to 248 resulting in an occupancy of 2 words.

The stack store and load instructions (l. 4, 7) do not modify the MT and ST pointers. After the function call to foo, an sres execution reserves 8 words. Decrementing ST by 32 with an unmodified MT would result in an occupancy of 10 words (256 – 216), exceeding the stack cache size. Two words are thus spilled and bar’s stack frame is transferred to the main memory (addresses [248, 255]). Before returning from foo, its stack frame is freed, by incrementing ST by 32 (ST = MT = 248), meaning that the stack cache is empty. The sens instruction (l. 11) executed next, is required to ensure that bar’s stack frame is present in the cache for the load of the next loop iteration. Therefore, a cache fill operation increments the MT pointer to 256. The current cache state is identical to the state before the function call. For subsequent loop iterations, the stack cache states change in exactly the same way. An interesting observation at this point is that the values of the respective stack slots are loop-invariant and do not change. Consequently, during every iteration the exact same values

Figure 1 foo evicts the entire stack cache, bar’s stack data is thus spilled on each iteration.
are written to the main memory. Even more, the respective memory cells already hold these values for all loop iterations except for the first. A standard stack cache obviously lacks the means to avoid this redundant spilling of stack data that is known to be coherent.

5 Lazy Spilling

We propose to introduce another pointer, which we call lazy pointer \( \text{LP} \), that keeps track of the stack elements in the cache that are known to be coherent with main memory. The lazy pointer only refers to the data in the stack cache, thus the invariant \( \text{ST} \leq \text{LP} \leq \text{MT} \) holds.

The LP divides the reserved space in the stack cache into two regions: (a) a region between the \( \text{ST} \) and \( \text{LP} \) and (b) a region between the \( \text{LP} \) and \( \text{MT} \). The former region defines the effective occupancy of the stack cache, i.e., it contains potentially modified data. The data of the second region is coherent between the main memory and the stack cache. Whenever the \( \text{ST} \) moves up past the \( \text{LP} \) or the \( \text{MT} \) moves down below the \( \text{LP} \), the \( \text{LP} \) needs to be adjusted accordingly. When a stack store instruction writes to an address above the \( \text{LP} \), some data potentially becomes incoherent. Hence, the \( \text{LP} \) should move up along with the effective address of the store.

5.1 Implementation

Following the above observations, we need to adapt the stack control instructions and the stack store instructions to support lazy spilling. Note, however, that lazy spilling does not impact the \( \text{ST} \) and the \( \text{MT} \), i.e., their respective values are identical to a standard stack cache.

\text{sres} \text{x}: The stack cache’s \text{sres} instruction decrements the \( \text{ST} \). Moreover, the reserve potentially spills some stack cache data to the main memory and thus may decrement the \( \text{MT} \). To respect the invariant from above, this may require an adjustment of the \( \text{LP} \) to stay below the \( \text{MT} \). We can, in addition, exploit the fact that the newly reserved cache content is uninitialized – and thus can be treated as coherent with respect to the main memory. For instance, when \( \text{ST} = \text{LP} \) before the reserve, all the stack cache content is known to be coherent, which allows us to retain the equality \( \text{ST} = \text{LP} \) even after the \text{sres} instruction. Moreover, when the space allocated by the current \text{sres} covers the entire stack cache, all the data in the stack cache is uninitialized. In this case it is again safe to assume \( \text{ST} = \text{LP} \) after the reserve. Apart from updating the \( \text{LP} \), the spilling mechanism itself requires modifications. Originally, the \( \text{MT} \) pointer is used to compute the amount of data to spill. However, when lazy spilling is used, the amount of data to spill does not depend on the \( \text{MT} \) anymore. Instead, the \( \text{LP} \) has to be used to account for the coherent data present in the stack cache. With respect to spilling, the \( \text{LP} \) effectively replaces the \( \text{MT} \) – hence the term effective occupancy for the region between the \( \text{ST} \) and the \( \text{LP} \).

\text{sfree} \text{x}: The stack cache’s \text{sfree} instruction increments the \( \text{ST} \). The \( \text{MT} \) may potentially increase as well. To satisfy the invariant from above, the \( \text{LP} \) is incremented in case it is below the \( \text{ST} \). No further action is required.

\text{sens} \text{x}: The stack cache’s \text{sens} instruction does not modify the \( \text{ST} \) and may only increase the \( \text{MT} \). The invariant is thus trivially respected. Moreover, coherency of the data loaded into the cache is guaranteed. The \text{sens} instruction thus requires no modification.

\text{store}: The stack store instruction writes to the stack cache and may thus change the value of previously coherent data. Therefore, the \( \text{LP} \) needs an adjustment whenever the effective address of the store is larger than the \( \text{LP} \), i.e., the \( \text{LP} \) needs to be greater than the effective address of the store instruction.
Algorithm 1: Construct Spill Cost Analysis Graph

\begin{algorithm}
\hspace{10pt}\textbf{Result:} Construct SCA Graph $G$
\begin{algorithmic}
\State initialize worklist $W$ with entry context $(s,0)$;
\While {unhandled context $c$ in $W$}
\State $f \leftarrow$ function of $c$;
\State $e \leftarrow$ effective occupancy of $c$;
\ForEach {call instruction $i$ in $f$ calling $f'$}
\State $e' \leftarrow \min(e + r, o_{\text{eff}}[i])$;
\State $c' \leftarrow (f', e')$;
\State add edge $c \rightarrow c'$ to $G$;
\If {$c'$ is a new context}
\State add new node to $G$;
\State add context $c'$ to $W$
\EndIf
\EndFor
\EndWhile
\end{algorithmic}
\caption{Construct Spill Cost Analysis Graph}
\end{algorithm}

Example 1. Consider again the program from Figure 1. The stack cache is initially empty and the ST, the MT, and the LP point to address 256. As before, the \texttt{sres} instruction moves the ST down to address 248. The LP moves along with the ST, since ST = LP and we assume that the newly reserved and uninitialized space is coherent with the main memory. Next, the store instruction (l. 4) modifies some data present in the stack cache. The LP consequently has to move upwards and now points to address 252. The first call to function \texttt{foo} causes two words to be spilled (see Section 4) and the MT is thus decremented to 248. As the LP is larger than the MT at this point, the LP would normally require an adjustment. However, since we assume that \texttt{foo} reserves the entire stack cache, it is safe to set the LP to the value of ST (216). Any stores within \texttt{foo} then automatically adjust the LP as needed. The stack cache is again empty after returning from function \texttt{foo}. All three pointers of the stack cache then point to address 248. The \texttt{sens} reloads the stack frame of function \texttt{bar} (l. 11), leaving both ST and LP unmodified, but incrementing MT to 256. The occupancy of the normal stack cache at this point is 8 (2 words), while the effective occupancy of the stack cache with lazy spilling is 0, i.e., the entire cache content is known to be coherent. The effective occupancy for the next call to \texttt{foo} as well as for subsequent iterations of the loop, stays at 0. The reserve within \texttt{foo} thus finds the entire stack cache content coherent and avoids spilling completely. Finally, when the program exits the loop, the \texttt{sfree} instruction (l. 16) increments ST to 256. In order to satisfy the invariant, LP also has to be incremented.

5.2 Static Analysis

With the restrictions from Section 2 in place, the analysis algorithms assume that reserve and free instructions appear at a function’s entry and exit; this may be relaxed, given the program remains well-formed regarding its stack space. Due to space restrictions, we refer to [5], which describes the relaxation and the analysis of the original stack cache design.

The filling behavior of ensure instructions is not affected by lazy spilling, therefore, the context-insensitive \texttt{ensure analysis} remains unchanged compared to its original. The spilling behavior of a reserve instruction depends on the state of the stack cache at function entry, which in turn depends on a nesting of stack cache states of function calls reaching the current function. To fully capture this context-sensitive information, we can represent spill cost in the \texttt{spill cost analysis graph} (SCA graph). An example SCA graph is depicted in Figure 2b. The effective spill cost for an \texttt{sres} instruction in a specific calling context is computed from the occupancy of the context and the locally reserved space $x$ (in \texttt{sres} $x$). This value is then multiplied with the timing overhead of the data transfer to external memory.
Without lazy spilling, the construction of the SCA graph depends on the notion of stack cache occupancy, i.e., the utilized stack cache region of size $MT - ST$. In order to benefit from the reduced overhead of lazy spilling, we need to consider the possibly smaller region $LP - ST$. Bounds for this effective occupancy of the stack cache are propagated through the call graph during reserve analysis, which constructs the SCA graph as shown in Algorithm 1. The $o_{eff}$-bound used by the algorithm improves the overestimation still present in context-sensitive reserve analysis. It can be computed by an intra-procedural data-flow analysis, which considers minimum displacement and the destinations of stack cache stores. For every point within a function, but most interestingly before calls, $o_{eff}$ represents a local upper bound on the effective occupancy.

As a pre-requisite for the intra-procedural analysis, minimum displacements for all functions in the program are computed. This is achieved by several shortest-path searches in the program’s call graph. The resulting values are bounded by the size of the stack cache $disp(i) = \min(|SC|, d_{min}(i))$ and used in the data-flow transfer functions of the analysis:

\[
OUT(i) = \begin{cases} 
\min(IN(i), |SC| - disp(i)) & \text{if } i = \text{call} \\
\max(IN(i), \text{blocks}(n)) & \text{if } i = \text{sws n} \\
IN(i) & \text{otherwise}
\end{cases}
\]

After returning from a call instruction $i$, the $LP$ cannot have moved upwards with respect to its location before the call. In fact, a function call can only leave the $LP$ unchanged or move it downwards. I.e., if stack contents of the current function are spilled, this potentially increases the coherent region of the stack cache and thus decreases the effective occupancy (1a). The amount of spilling is represented by the previously computed (worst-case) minimum displacement $disp(i)$.

Opposite to calls, the $LP$ moves up when $i$ is a store instruction (1b). Since the number of data words per stack cache block is fixed, the number of blocks that become incoherent due to a store, only depends on its destination $n$ (in $\text{blocks}(n) = \lceil n/\text{words-per-block} \rceil$).

Note that while an $\text{sens}$ instruction can impact the original analysis of occupancy bounds, it does not influence the effective occupancy and thus plays no role during reserve analysis in the presence of lazy spilling.

In order to safely initialize the $o_{eff}$ bound and propagate it between instructions (i.e., from the $OUT$-values of the predecessors of an instruction to its $IN$-value (2b)), we define the following transfer functions:

\[
IN(i) = \begin{cases} 
|SC| & \text{if } i \text{ is entry} \\
\max_{p \in \text{Preds}(i)}(OUT(p)) & \text{otherwise}
\end{cases}
\]

It is further worth noting that in practice not all stack store instructions need to be analyzed. As long as the effective occupancy value exceeds the stack cache size reserved by the current function (i.e., the coherent region does not overlap with the current stack frame), a particular store has no effect on the analysis state.

▶ Example 2. Consider the program from Figure 1, where two additional function calls to function $\text{baz}$ are performed. The first call appears right before entering the loop, while the other appears right after exiting the loop.

The call graph for this program is shown in Figure 2a. This graph also shows the size of the stack frame reserved by each function on the outgoing call edges. For functions without calls ($\text{baz}$ and $\text{foo}$), call edges to an artificial sink node $t$ are introduced. Using these annotations, the minimum and maximum displacement of each function can be determined...
using a shortest and longest path search respectively. The minimum displacements for the three functions are 8, 6, and 4; for foo, bar, and baz respectively.

The analysis of this program results in the SCA graph shown by Figure 2b. The analysis starts with an empty stack cache and finds that bar allocates 2 words (8 bytes), initializes the respective stack slots and then calls baz. Since bar’s stack content was not spilled before, the effective occupancy when entering baz evaluates to 2. Assuming a stack cache size of 8 words, the function can be executed without spilling. The analysis thus can deduce that the LP was not modified with regard to bar’s stack frame. Propagating this information to the function call of foo within the loop results in a context for foo with the same occupancy of 2. Once the program exits the loop, another call to baz is executed. Due to foo displacing the full stack cache and the lack of relevant stores, the data-flow analysis is able to propagate the effective occupancy of 0 down to this call. This is reflected in the SCA graph by containing a node for baz annotated with its effective 0-occupancy.

## 6 Experiments

We use the Patmos processor [10] to evaluate the impact of lazy spilling on hardware costs, average program performance, and worst-case spilling bounds. The hardware model of the processor was extended and statistics were collected on the speed and resource requirements after synthesis (Altera Quartus II 13.1, for Altera DE2-115). The average performance measurements were performed using the MiBench [4] benchmarks suite. The programs were compiled using the Patmos LLVM compiler (version 3.4) with full optimizations (-O3) and executed on a cycle-accurate simulator. We compare five configurations utilizing (a) a standard data cache combined with a lazily spilling stack cache having a size of 128 or 256 bytes (LP\textsubscript{128}, LP\textsubscript{256}), (b) a standard data cache combined with a standard stack cache (SC\textsubscript{128}, SC\textsubscript{256}), and (c) a standard data cache alone (DC). The stack caches perform spilling and filling using 4 byte blocks. The data cache is configured to have a size of 2 KB, a 4-way set-associative organization, with 32 byte cache lines, LRU replacement, and a write-through strategy (recommended for real-time systems [13]). In addition to data caches, the simulator is configured to use a 16 KB method cache for code. The main memory is accessed in 16 byte bursts and 7 cycles latency.

### 6.1 Implementation Overhead

The standard stack cache of Patmos is implemented by a controller, which executes spill and fill requests, and control instructions (sres, sfree, sens) sending requests to that controller. The controller is independent from the extensions presented here. It thus suffices to extend the sres, sfree, and stack store instruction as indicated in Section 5.1. The Patmos processor
has a five-stage pipeline (fetch, decode, execute, memory access, write back). The stack cache reads the ST and the MT in the decode stage and immediately computes the amount to spill/fill. The LP is read in the decode stage as well, which allows us to perform all LP-related updates in the decode and execute stages. That way additional logic on the critical path in the memory stage, where the cache is actually accessed, is avoided. This applies in particular to the store instruction, whose effective address only becomes available in the execute stage. For lazy spilling, a single additional register is needed (LP). Updating the LP in the sres instruction adds two multiplexers to the original implementation. The afree and stack store instructions each need an additional multiplexer. The area overhead is, therefore, very low. Moreover, these changes do not affect the processing frequency.

### 6.2 Average Performance

Table 1 (columns Spill) shows the reduction in the number of blocks spilled in comparison to the standard stack cache. Note that results for rawaudio and rawaudio are not shown, as they never spill due to their shallow call nesting depth. The best result for LP128 is achieved for bitcnts, where lazy spilling practically avoids spilling. In the mean, spilling is reduced to just 17%. The worst result is attained for qsort-small, where 62% of the blocks are spilled. For LP256 spilling is reduced to 30% in the mean. The best result is observed for search-large, where essentially all spilling is avoided. For crc-32, drijndael, erijndael, and sha only marginal improvements are possible, since these benchmarks already spill little in comparison to the other benchmarks. The worst result of those benchmarks with a relevant amount of spilling is obtained for qsort-small, where 76% of the blocks are spilled.

Miss rates are not suitable for comparison against standard data caches. We thus compare the number of bytes accessed by the processor through a cache in relation to the number of stall cycles it caused, i.e., \( \#RD + \#WR \). A high value in Table 1 (columns SC and DC) means that the cache is efficient, as data is frequently accessed without stalling. The data cache alone gives values up to 3.3 only. Ignoring benchmarks with little spilling, the best result for SC128 is achieved by dbf (477.4). For SC256, bitcnts gives the best result (17054.7). Lazy

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>SC128</th>
<th>LP128</th>
<th>SC256</th>
<th>LP256</th>
</tr>
</thead>
<tbody>
<tr>
<td>basicmath-tiny</td>
<td>2.3</td>
<td>1.1</td>
<td>0.17</td>
<td>4.0</td>
</tr>
<tr>
<td>bitcnts</td>
<td>4.6</td>
<td>19.16</td>
<td>0.00</td>
<td>12.2</td>
</tr>
<tr>
<td>cjpeg-small</td>
<td>116.9</td>
<td>1.0</td>
<td>0.51</td>
<td>148.4</td>
</tr>
<tr>
<td>crc-32</td>
<td>9.0</td>
<td>0.9</td>
<td>0.03</td>
<td>21.3</td>
</tr>
<tr>
<td>csusan-small</td>
<td>11.3</td>
<td>2.2</td>
<td>0.16</td>
<td>18.6</td>
</tr>
<tr>
<td>dijkstra-small</td>
<td>19.5</td>
<td>1.4</td>
<td>0.20</td>
<td>32.8</td>
</tr>
<tr>
<td>dipeg-small</td>
<td>9.0</td>
<td>0.8</td>
<td>0.34</td>
<td>13.5</td>
</tr>
<tr>
<td>drijndael</td>
<td>15.8</td>
<td>0.9</td>
<td>0.20</td>
<td>28.7</td>
</tr>
<tr>
<td>ebf</td>
<td>172.5</td>
<td>1.0</td>
<td>0.44</td>
<td>224.6</td>
</tr>
<tr>
<td>erijndael</td>
<td>32.6</td>
<td>0.9</td>
<td>0.57</td>
<td>43.3</td>
</tr>
<tr>
<td>essusan-small</td>
<td>15.9</td>
<td>3.4</td>
<td>0.25</td>
<td>25.3</td>
</tr>
<tr>
<td>fft-tiny</td>
<td>3.1</td>
<td>1.1</td>
<td>0.08</td>
<td>5.8</td>
</tr>
<tr>
<td>iff-tiny</td>
<td>3.1</td>
<td>1.2</td>
<td>0.08</td>
<td>5.9</td>
</tr>
<tr>
<td>lucas</td>
<td>2.5</td>
<td>1.0</td>
<td>0.27</td>
<td>4.2</td>
</tr>
<tr>
<td>qsort-small</td>
<td>3.1</td>
<td>1.0</td>
<td>0.62</td>
<td>3.7</td>
</tr>
<tr>
<td>rsynth-tiny</td>
<td>16.0</td>
<td>1.9</td>
<td>0.08</td>
<td>29.9</td>
</tr>
<tr>
<td>search-large</td>
<td>2.9</td>
<td>0.8</td>
<td>0.48</td>
<td>3.9</td>
</tr>
<tr>
<td>sha</td>
<td>8.3</td>
<td>1.6</td>
<td>0.20</td>
<td>14.1</td>
</tr>
<tr>
<td>ssusan-small</td>
<td>29.2</td>
<td>17.1</td>
<td>0.20</td>
<td>43.9</td>
</tr>
</tbody>
</table>

Table 1 Bytes accessed per stall cycle and reduction in spilling for the various configurations.
spilling leads to consistent improvements over all benchmarks. An interesting observation is that for most benchmarks the presence of a stack cache improves the performance of the data cache. The best example for this is bitcnts, but also csusan and ssusan profit considerably, where the data cache alone delivers 1.2 bytes per stall cycle. When a stack cache is added to the system, this value jumps up to 192.4 and 196.1 respectively.

Our measurements show that lazy spilling eliminates most spilling in comparison to a standard stack cache. Also the efficiency of the standard data cache is improved in many cases. Due to the low memory latency assumed here, this translates to run-time gains of up to 21.8% in comparison to a system with a standard stack cache. In the mean, the speedup amounts to 8% and 9.2% in comparison to a system only equipped with a data cache.

### 6.3 Static Analysis

We evaluate the impact of lazy spilling on the static analysis by comparing its bounds on the worst-case spilling with the actual spilling behavior observed during program execution. In order to be considered safe, the analysis’ bounds always need to be larger or equal to the observed spilling of an execution run. Apart from a safe result, the analysis should also provide tight bounds. Note that the observed spilling relates to the average-case behavior of the programs, which does not necessarily trigger the worst-case stack cache behavior (the same inputs as in Section 6.2 have been used). However, we still report the estimation gap between worst case and average case, to show the effect of lazy spilling in the analysis.

We extended the Patmos simulator and first verified that the actual spilling of all sres instructions executed by a benchmark program is safely bounded by the analysis. We then measured the maximum difference between statically predicted and observed spilling (Max-Spilling-∆) for each sres in every of its contexts (i.e., considering all nodes of the SCA graph reachable by execution). Table 2 shows the results for the 128-byte stack cache configuration and allows for a comparison between statically predicted and dynamically observed spill costs. A first look reveals that static spill cost is reduced for all programs in our benchmark set (down to 12% for rsynth-tiny). Furthermore, when the estimation gap is initially low, lazy spilling tends to widen the gap between static and dynamic spill cost.

![Table 2](attachment://image.png)
slightly. But for those benchmarks that exhibit large estimation gaps with a standard stack cache, lazy spilling can even improve analysis precision.

7 Conclusion

From our experiments, we conclude that the benefits of lazy spilling extend from the average case performance to worst-case behavior, where it can even benefit analysis precision. In future work, we plan to introduce loop contexts and a limited notion of path-sensitivity to the analysis, to better capture occupancy states and thus spilling behavior.

Acknowledgment. This work was partially funded under the EU’s 7th Framework Programme, grant agreement no. 288008: Time-predictable Multi-Core Architecture for Embedded Systems (T-CREST).

References