Detecting Low-Density Mixtures in High-Quantile Tails for pWCET Estimation
Abstract
The variability arising from sophisticated hardware and software solutions in cutting-edge embedded products causes software to exhibit complex execution time distributions. Mixture distributions can arise, with components of different density (weight), as a result of inherently different features in the execution platform and multiple operational scenarios. In the context of probabilistic WCET (pWCET) analysis based on Extreme Value Theory (EVT), where identical distribution is a prerequisite, mixtures are typically intercepted by applying stationarity tests on the full sample. Those tests, however, are tuned to detect only mixtures with sufficiently high probability (weight) and disregard low-density mixtures (assumed unlikely to be preserved in the high-quantile tail of the sample), since stricter detection criteria would prevent any form of stationarity from being accepted. Nonetheless, low-density mixture distributions can persist and even exacerbate in the tail and, when not considered, they can impair pWCET estimation in EVT-based approaches, leading to overly pessimistic or optimistic bounds. In this work, we propose TailID, an iterative point-wise approach that builds on the asymptotic convergence of the Maximum Likelihood Estimator (MLE) of the Extreme Value Index (EVI) parameter to detect low-density mixture distributions in high-quantile tails, and uses this information to steer EVT tail selection. The benefits of the proposed method are assessed on synthetic mixture distributions and real data collected on an industrially representative embedded platform.
Keywords and phrases: WCET, EVT

2012 ACM Subject Classification: Computer systems organization → Real-time system architecture; Mathematics of computing → Probability and statistics

Funding: The research leading to these results has received funding from the European Union's Horizon Europe Programme under the SAFEXPLAIN Project (www.safexplain.eu), grant agreement num. 101069595 and project PLEC2023-010240 funded by MICIU/AEI/10.13039/501100011033 (CAPSUL-IA). Authors also appreciate the support given to the Research Group SSAS (Code: 2021 SGR 00637) by the Research and University Department of the Generalitat de Catalunya. This work is partially supported by the Spanish MICIU with project PID2022-137414OB-I00.

Editors: Renato Mancuso

Series and Publisher: Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
The technology push for novel artificial intelligence (AI)-powered software functionalities is fostering the adoption of increasingly complex hardware and software solutions even in the traditionally conservative embedded critical systems domain. Despite the additional and significant challenges brought by AI software and complex MPSoCs (Multi-Processor Systems-on-Chip) to software timing verification, market-driven strategies relentlessly push embedded system providers towards the adoption of these technologies [38].
To handle the high complexity and limited information about hardware and software internals, both intrinsic traits of this scenario, pWCET techniques are gaining attention [19, 16] for their effectiveness in capturing execution time distributions of software programs even under such complex conditions. A common method for pWCET estimation is EVT [17], which is a branch of statistics that focuses on the behavior of the extreme values in a process. Instead of modeling the entire distribution, EVT specifically studies the extreme deviations of a distribution, i.e. the low-probability outcomes.
As a measurement-based technique, EVT-based pWCET estimation relies on the ability to perform adequate testing to capture, during the test campaign, those hardware and software execution conditions that can arise during system operation [16, 1]. The timing variability arising from intricate interactions between hardware and software elements results in programs with increasingly complex execution time distributions [23, 47, 2]: hardware-related variability patterns (due to uncountable stateful resources like multi-level caches, interconnects, on-core resources, …) and coexistence of multiple operational scenarios can lead to mixture distributions [13, 24, 29, 47], where a set of different probability distributions are compounded.
Mixture distributions are more challenging to address as they may break the Identical Distribution (ID) hypothesis on the entire data, which is a prerequisite for the applicability of EVT-based methods, together with statistical independence [31, 41, 24, 19, 16]. Mixture distributions are normally intercepted by the independence and identical distribution (IID) tests [26, 48, 30, 40, 19, 16] considered in state-of-the-art EVT-based timing analysis approaches. Those tests have been proved sound and effective at detecting dependence or non-identical distribution in the entire data [40]. In particular, stationarity [30] and weak-stationarity [31] tests detect the presence of mixture distributions in the sample. However, these statistical checks are configured to detect only those mixtures where the density (i.e. relevance with respect to the overall distribution) of the compounded distributions is not negligible. In fact, detecting mixtures that exhibit a low density is cumbersome (in this work, we use the term low-density mixtures to refer to distribution components with low probability weight and weak presence in the sample): the rejection criteria for those tests are already strict [40], and any more stringent criteria for detecting low-density distributions would necessarily lead to systematically rejecting the identically distributed null hypothesis [40, 29]. In principle, low-density mixtures could be intercepted by choosing an appropriate threshold in the EVT tail selection process. In practice, however, the selection of the threshold is hampered by the presence of mixtures, especially low-density ones, thus increasing the chances that the threshold is incorrectly chosen, ultimately causing low-density mixtures to remain undetected and even gain more relevance within the tail, leading to a misestimation of the Extreme Value Index (EVI). The presence of uncaptured low-density mixtures in the tail can negatively affect EVT application and can lead to overly pessimistic or even optimistic results, depending on the considered EVT distribution (Generalized Pareto Distribution (GPD) or Exponential Distribution (EXP)).
In this work, we provide evidence on how the presence of undetected low-density mixtures in the tail jeopardizes the application of EVT, and propose TailID, an automatic, iterative approach that extends state-of-the-art practice for IID testing by addressing the presence of mixture distributions in high quantiles. In particular, TailID's contributions are two-fold:
-
First, TailID analyses the point-wise impact of tail data on the extremal distribution by checking whether each point in the tail falls within the theoretical confidence interval of the EVT EVI parameter and returns a set of points (if any) in the tail that are responsible for breaking the ID assumption.
-
Second, when mixtures are detected, the proposed algorithm uses those same break points to define a trustworthy criterion for selecting the threshold. TailID builds this selection on exact knowledge of the precise point where the tail behavior changes and the last low-density mixture component begins.
We contend that TailID complements and strengthens existing state-of-the-art IID tests for EVT applicability by specifically capturing (low-density) mixtures in the tail, whereas existing tests apply only to the full sample and therefore do not cover the case where low-density mixtures are preserved and appear in the tail.
TailID is evaluated both on a set of parametric mixture distributions and on real data collected from AI-based applications running on an NVIDIA AGX Orin [27], a reference embedded heterogeneous platform for AI applications in the automotive domain. Note that, while we evaluate TailID on representative AI kernels owing to the increasing incidence of AI software in real-time systems, TailID is not specific to AI applications. Our results confirm the effectiveness of the approach in capturing low-density mixtures of different types and complexity in the tail, as well as in avoiding the inconsistency they bring to the threshold selection process, by determining a trustworthy threshold. All in all, TailID enables a more robust application of EVT in the timing analysis domain by extending the scope of state-of-the-art IID testing and ensuring more precise timing bounds.
The rest of this paper is organized as follows: Section 2 recalls EVT basic concepts and reviews existing works on IID testing. Section 3 motivates our work by providing empirical evidence on how low-density mixtures in the tail negatively affect EVT application, by jeopardizing the estimation of the threshold and of the EVI, and ultimately impairing its results. Section 4 develops the theoretical uncertainty analysis for deriving the EVI, on which TailID builds. The proposed TailID approach is presented in Section 5 and thoroughly evaluated in Section 6. Finally, Section 7 presents the main conclusions of this work. To ease the reading, we refer to the acronyms in Table 1 and the notation in Table 2.
2 State of the Art on Identical Distribution Testing
| Acro. | Definition | Acro. | Definition |
| --- | --- | --- | --- |
| AI | Artificial Intelligence | KPSS | Kwiatkowski, Phillips, Schmidt and Shin |
| BDS | Brock, Dechert, Scheinkman and LeBaron | MLE | Maximum Likelihood Estimation |
| CDF | Cumulative Distribution Function | MoS | Minimum amount of Samples |
| EVI | Extreme Value Index | PDF | Probability Density Function |
| EVT | Extreme Value Theory | PoT | Peaks Over Threshold |
| GPD | Generalised Pareto Distribution | PPI | Probabilistic Predictability Index |
| ID | Identical Distribution/ed | (p)WCET | (probabilistic) Worst-Case Execution Time |
| IID | Independence and Identical Distribution/ed | | |
EVT is a powerful theoretical approach in statistics that studies extreme values. Just as the Central Limit Theorem helps explain the behavior of averages in large samples, EVT focuses on the extreme deviations of the distribution, the rare data points far from the average that represent unusual or extreme events. One of the main approaches within EVT is the Peaks Over Threshold (PoT) method, which considers the distribution of all values exceeding a certain high threshold $u$, i.e. the tail of the distribution.
Definition 1 (Excess Random Variable).
Let $X$ be an IID random variable that follows a cumulative distribution function (CDF) $F$, and consider a threshold $u$. Then,

$X_u = \left( X - u \mid X > u \right)$  (1)

is the excess random variable at threshold $u$, which follows a CDF denoted by $F_u$.
The Pickands-Balkema-de Haan theorem states that, for a wide class of distributions $F$, the distribution of the excesses $X_u$ converges to a Generalized Pareto Distribution (GPD) as the threshold $u$ becomes large. Specifically, $F_u(x) \to H_{\xi,\sigma}(x)$ as $u$ grows, where:
Theorem 2 (Pickands–Balkema–de Haan theorem [7]).
$H_{\xi,\sigma}(x) = \begin{cases} 1 - \left(1 + \dfrac{\xi x}{\sigma}\right)^{-1/\xi}, & \xi \neq 0 \\ 1 - e^{-x/\sigma}, & \xi = 0 \end{cases}$  (2)

with $x \geq 0$ when $\xi \geq 0$, and $0 \leq x \leq -\sigma/\xi$ when $\xi < 0$, where $\sigma > 0$ is the scale parameter and $\xi$ is the shape parameter, or Extreme Value Index (EVI). When $\xi = 0$, the tail of the distribution decays at an exponential rate. A positive EVI, $\xi > 0$, indicates that the tail decays at a slower rate than the exponential. Finally, a negative EVI, $\xi < 0$, indicates a tail decay rate faster than the exponential.
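To make the role of the EVI concrete, the following minimal sketch (not part of the paper; scipy's genpareto is used as the GPD implementation, and the scale and evaluation points are arbitrary) prints the exceedance probability of the excess distribution for a light, an exponential, and a heavy tail.

```python
# Illustrative sketch: how the EVI (shape) parameter xi controls GPD tail decay.
import numpy as np
from scipy.stats import genpareto

sigma = 1.0                      # scale parameter (assumed value)
x = np.array([1.0, 5.0, 10.0])   # excesses over the threshold u

for xi in (-0.2, 0.0, 0.3):      # light, exponential, and heavy tail
    # Survival function P(X_u > x): reaches 0 at the finite endpoint for xi < 0,
    # decays exactly exponentially for xi = 0, and slower (heavy tail) for xi > 0.
    sf = genpareto(c=xi, scale=sigma).sf(x)
    print(f"xi = {xi:+.1f}  ->  P(X_u > x) = {np.round(sf, 6)}")
```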
In the last decades, significant research efforts [16, 19] have been devoted to formalizing and consolidating the application of EVT for bounding the execution time of time-critical applications via pWCET estimates. An intuitive definition of pWCET for a program is given in [3] as a tuple pairing an execution time value with an exceedance probability, so that the probability of exceeding that execution time is lower than the associated probability. For a more thorough definition of pWCET, we refer to [11].
The state of the art has thoroughly addressed the challenge of verifying whether the observed execution time samples (or traces) are amenable to EVT application. Initially, EVT was developed under the assumption that samples were IID. Subsequently, in later works, the IID assumption has been softened and replaced with stationarity (which covers ID) and weaker independence conditions, better aligned with the peculiarities of concrete observations [31, 8, 3]. Even under these relaxations, it can still be trustworthy to apply EVT. The research community has extensively worked on the definition of accurate and reliable testing approaches for EVT assumptions [16, 19, 40, 33]. In this work we rely on state-of-the-art methods for testing EVT hypotheses on the bulk of the data, like those proposed in [40, 33]. In the following, we provide some background on ID and stationarity testing only, as it matches the scope of this contribution.
In the scientific literature, statistical tests for stationarity are performed over the entire sample and not particularly on the extremes. In this direction, the main approach found in the literature [40, 24] involves checking stationarity of the whole sample using the Kwiatkowski, Phillips, Schmidt and Shin (KPSS) test, under the assumption that if the process is stationary then the joint probability distribution of execution times does not change over time. Furthermore, Reghenzani et al. [40] present a method to systematically test all three hypotheses, stationarity, short-range and long-range dependence, unified under a single index, the Probabilistic Predictability Index (PPI), thereby ensuring a more comprehensive and strict evaluation of EVT applicability in execution time analysis.
Another approach to assessing ID on a full sample is the BoostIID method [50], which tracks how ID tests respond to changes in task execution times as new data is incorporated. By analyzing small, recent data batches, BoostIID detects shifts in the statistical distribution caused by faults. However, this approach does not focus on extreme deviations specifically.
In the state of the art, concerns have been recently raised on the impact of software “mode changes” in the distribution of execution time profile. Mode changes, in real-time literature [13], capture the transition between predefined system states and are commonly considered to be partially responsible for low-density mixtures in the distribution [23, 29]. For instance, [29] observes that statistical tests, when applied globally, consider the entire dataset as a single entity, possibly causing the impact of mode changes (low-density mixtures) to be too small relative to the entire dataset to be detected, ultimately leading to a misclassification of the sample as identically distributed. This work suggests a method named Walking Kolmogorov-Smirnov (WKS), which compares two small “windows” around each data point to better capture local distribution changes. More generally, [13] discusses the necessity of protocols for managing system “mode changes” (low-density mixtures), which play a vital role in timing analysis and system reliability.
While some works recognize the challenges posed by low-density mixtures in EVT estimation [29, 13], to the best of our knowledge, no specific approach to cope with them has been proposed in the literature. In particular, we identify the need for ID or stationarity tests specifically designed for intercepting low-density mixtures in the tail to prevent them from impairing EVT results.
3 Motivation and Problem Statement
| Term | Definition | Term | Definition |
| --- | --- | --- | --- |
| $X$ | Random variable | $x_i$ | $i$-th value of variable $X$ |
| $n$ | Number of elements in $X$ | $X_u$ | Excess variable from threshold $u$ |
| $u$ | Tail threshold | $w_i$ | weight of the $i$-th mixture component |
| $\xi$, $\hat{\xi}$ | Extreme Value Index, and its MLE | $\mu$, $\sigma$ | mean and standard deviation of the Gaussian |
| $p_u$, $p_c$ | percentile (tail, candidates) | $C$ | Set of candidates for ID testing |
| $\mathcal{N}(\mu,\sigma)$ | Gaussian Distribution with $\mu$, $\sigma$ | $X_u^{-C}$ | Excess of $X$ from threshold $u$ without values in $C$ |
| $X^{-C}$ | $X$ without values in $C$ | $\hat{\rho}$ | Estimation of the extremogram |
| $\hat{\xi}_j$ | EVI with the candidate $c_j$ and threshold $u$ | $CI_j$ | CI with the candidate $c_j$ and threshold $u$ |
| $CI$ | Confidence Interval (CI) | $1-\alpha$ | CI's confidence level |
| $S$ | Set of ID-sensitive points | | |
In practice, the data used to fit EVT and GPD models are often taken from a time series.
Definition 3 (Time Series [12]).
Let $\{X_t\}_{t \in T}$ be a collection of random variables indexed by time, referred to as a process. A time series refers to a single observed realization of this process, denoted by $\{x_t\}_{t \in T}$, where each $x_t$ is a specific value taken by $X_t$.
Remark 4.
In line with the state of the art [12], we use the term time series to refer both to the data and to the process of which the data are a realization.
When dealing with EVT and GPD models, a primary concern is assessing whether the underlying random variables generating the observed data are ID.
Definition 5 (Identical Distribution [10]).
Let $\{x_t\}$ be a time series and $F_t$ the CDF of the random variable $X_t$. The time series is said to be identically distributed if, for any pair of time indices $t_1, t_2$, the random variables $X_{t_1}$ and $X_{t_2}$ have the same probability distribution, meaning their CDFs are equal: $F_{t_1}(x) = F_{t_2}(x)$ for all $x$.
Definition 6 (Stationarity [37]).
Let $\{x_t\}$ be a time series and $F_{t_1,\ldots,t_k}$ the joint CDF of the random variables $X_{t_1},\ldots,X_{t_k}$. The time series is said to be stationary if, for any integer $k$, any set of time indices $t_1,\ldots,t_k$, and any time shift $h$, the probability distribution of $(X_{t_1},\ldots,X_{t_k})$ does not change when all time indices are shifted by $h$, meaning that their joint CDFs are equal:

$F_{t_1,\ldots,t_k}(x_1,\ldots,x_k) = F_{t_1+h,\ldots,t_k+h}(x_1,\ldots,x_k)$  (3)
Stationarity tests are used in conjunction with specific tests to analyze short-range and long-range independence where the former are used to confirm minimal correlation between closely spaced data points, and the latter to check that correlations diminish over time. The work of Reghenzani et al. [40] introduces a robust and strict solution to test IID hypothesis altogether on the whole distribution using: (i) the KPSS test for ID, (ii) the Brock, Dechert, Scheinkman and LeBaron (BDS) test [48] for short-range dependence, and (iii) the Hurst Index [26] for long-range dependence.
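As an illustration of how such full-sample checks can be run in practice, the sketch below applies the KPSS and BDS tests from statsmodels to a synthetic sample; it is not the PPI implementation of [40], and the Hurst index for long-range dependence would require a separate estimator.

```python
# Hedged sketch of full-sample checks in the spirit of [40]: KPSS for
# stationarity and BDS for short-range dependence, via statsmodels.
import numpy as np
from statsmodels.tsa.stattools import kpss, bds

rng = np.random.default_rng(0)
sample = rng.exponential(scale=100.0, size=2000)   # stand-in execution times

kpss_stat, kpss_pvalue, _, _ = kpss(sample, regression="c", nlags="auto")
bds_stat, bds_pvalue = bds(sample, max_dim=2)

# Null hypotheses: KPSS -> (level-)stationarity, BDS -> IID data.
print("KPSS p-value:", kpss_pvalue,
      "-> stationarity not rejected" if kpss_pvalue > 0.05 else "-> rejected")
print("BDS  p-value:", float(np.atleast_1d(bds_pvalue)[0]))
# A long-range dependence check (Hurst index) would complete the PPI-style analysis.
```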
Remark 7.
All samples considered in this work, both in the motivating example and in the experimental evaluation, have passed (i.e. not been rejected by) the three tests combined in the so-called PPI test [40], namely KPSS, BDS, and the Hurst index.
It has also been argued in the literature that the stationarity condition on the whole data can be relaxed in favor of weak-stationarity, as this does not affect the application of EVT [17]. Weak-stationarity means that, instead of requiring the joint distribution function to be invariant at all times, as seen in Definition 6, only the mean, variance, and covariance are required to be constant in time. In that regard, requiring stationarity is a much stricter condition to satisfy.
However, as we show later in this section, even when the stationarity test does not reject the ID hypothesis, low-density mixtures can appear in the tail and impair the WCET estimation. The limitation of the KPSS test and traditional ID tests lies in the fact that they do not fully reflect ID on the tail of the sample, often failing to capture variations there. This is expected, given that the tail carries a small weight in the distribution. If these mixtures are present but undetected, further analyses or models that rely on the data being ID may produce misleading results, for instance when applying EVT.
3.1 EVT with Low-Density Mixtures in the Tail
Although the IID properties are typically studied on the whole distribution, some studies, in addition to applying stationarity tests on the entire sample, measure extremal dependency with the extremogram [18] in stationary time series to ensure the proper applicability of EVT [8, 41]. The extremogram is the most commonly used technique; it quantifies the probability that extreme values observed at one time point are followed by extreme values at a later time point. However, this approach is exclusively concerned with dependence and does not directly assess whether the tail satisfies the ID condition. In other words, the extremogram detects the presence of clustered extremes, which can hint that the ID hypothesis is not fulfilled. However, in general the extremogram is not an ID test for the tail, as it does not detect the presence of unclustered low-density mixtures in the tail, which break the ID hypothesis. When focusing on ID on the tail, instead, one could be tempted to apply stationarity tests (e.g. KPSS) directly on the tail, but with poor results, as we will show in Section 6.
Remark 8.
For all samples considered in this work, both in the motivating example and in the experimental evaluation, we apply the extremogram to detect dependencies in the tail and straightforwardly extend the application of the KPSS test to the tail in order to test for ID.
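For reference, the sample extremogram used in Remark 8 can be estimated with a few lines of code. The sketch below implements the standard empirical estimator of $P(X_{t+h} > q \mid X_t > q)$, with an illustrative quantile level rather than the exact settings of [18, 41].

```python
# Minimal sketch of the sample extremogram: the empirical probability that an
# exceedance of a high quantile q at time t is followed by another exceedance
# h steps later. Quantile level and lag range are illustrative.
import numpy as np

def extremogram(x, quantile=0.95, max_lag=20):
    x = np.asarray(x)
    q = np.quantile(x, quantile)
    exceed = x > q                       # indicator of extreme observations
    rho = np.empty(max_lag)
    for h in range(1, max_lag + 1):
        base = exceed[:-h]               # X_t > q
        rho[h - 1] = exceed[h:][base].mean() if base.any() else np.nan
    return rho                           # rho[h-1] ~ P(X_{t+h} > q | X_t > q)

rng = np.random.default_rng(1)
print(np.round(extremogram(rng.normal(size=5000)), 3))  # ~0.05 for independent data
```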
3.2 Illustrative Example
We illustrate with a parametric distribution example how, even when the IID hypotheses on the whole distribution and on the tail (see Remarks 7 and 8) have not been rejected, the WCET estimation can be misleading, as it does not account for low-density mixtures in the tail. In particular, we generate a two-component Gaussian mixture model in which the first component carries most of the weight and the second component has a low weight and a larger mean. The use of a Gaussian distribution is convenient as an example because it has exponential tails, so even in the presence of mixtures the tail is theoretically localized in the second component, with an EVI $\xi = 0$.
The parametric distribution data, which can be seen in Figure 1(a), was used to compute the pWCET estimate considering two models: a GPD model that can take either light or heavy tails, and an exponential model (recall that the exponential model is the theoretically appropriate one given the Gaussian mixture). The estimation of the threshold $u$ is performed throughout the paper with a threshold selection algorithm based on QQ plots [6], although other methodologies can be used to obtain similar results [42]. In Figure 1(b), we show the extremogram for the parametric distribution. The quantile chosen for the extremogram is based on the original work [18] and is also employed in [41]. Furthermore, the cutoff value for the extremogram, used to determine whether there is no dependency, is set following [41]. Figure 1(b) shows extremogram values below this cutoff, which indicates no dependency among extreme values.
The pWCET estimates are shown in Figure 1(c). Note that the black points represent the CCDF of the observed values in Figure 1(a). In the case of the GPD, it produces heavy tails, given that the estimated EVI is bigger than 0. This is due to the threshold selection detecting that the tail begins before the second component of the mixture, which produces a seemingly heavy tail because the remaining extreme points are further apart. This results in pessimistic upper-bounds, more than 3 times bigger than the actual tail for low probabilities. In the case of the EXP, the threshold has also been selected before the second component, and thus the pWCET estimate is optimistic. Overall, this example highlights the connection between a miscalculation of the threshold due to mixture distributions, the resulting misestimation of the EVI, and the subsequent impact on the pWCET estimation.
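To reproduce the flavour of this example, the sketch below generates a two-component Gaussian mixture; the weights, means, and standard deviations are hypothetical placeholders, not the exact values used for Figure 1.

```python
# Sketch of a Gaussian mixture with a low-density second component, in the
# spirit of Section 3.2 (parameters below are assumptions, not the paper's).
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
w2 = 0.01                                      # low-density second component (assumed)
comp = rng.random(n) < w2
sample = np.where(comp,
                  rng.normal(loc=120.0, scale=4.0, size=n),   # second component
                  rng.normal(loc=100.0, scale=3.0, size=n))   # dominant component

# Both components have Gaussian (exponential-type) tails, so the theoretical
# EVI of the mixture tail is 0; the upper tail is dominated by the second component.
print(np.quantile(sample, [0.5, 0.99, 0.999]))
```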
4 Uncertainty Analysis for the GPD
Most EVT methods to estimate the tail build around the stability of the EVI from a certain threshold onward [25, 15]. However, in the presence of low-density mixtures in the tail, the EVI can be unstable for higher thresholds due to the low amount of information, and therefore miscalculated. We showed in Section 3 how this misestimation can compromise the pWCET estimation, given that the selected thresholds are lower due to the presence of low-density mixtures. In order to detect low-density mixtures in the tail and avoid such scenarios, this section develops the principles of the estimation of the EVI and shows how the estimates can change as we approach low-density mixtures in the tail. Detecting and quantifying this change in a robust manner is the goal of this work and the core principle of TailID.
4.1 Maximum Likelihood Estimation and Uncertainty Analysis
Given observations from a random variable $X$, and considering the excess random variable over a threshold $u$ defined as in Equation 1, the parameter $\xi$ of the GPD can be estimated. Apart from the model uncertainty associated with the GPD expression, the estimation of $\xi$ will also contain sampling uncertainty due to the variability of the input sample. To assess the uncertainty around the estimated EVI, $\hat{\xi}$, of the GPD, it is necessary to compute a confidence interval, which gives a range of values within which the true parameter is likely to lie with a specified level of confidence.
The classical method of calculating the confidence intervals for an estimator is to derive the asymptotic distribution of the errors. That is, the difference between the true value of the parameter and the estimator has a distribution which can be found. If the estimator is obtained with the Maximum Likelihood Estimation (MLE) method, this difference converges in distribution to a Gaussian:
Theorem 9 (Asymptotic Normality of the Estimator [49]).
Let be the parameter’s true value and the Maximum Likelihood Estimation of , and the sample size, then:
| (4) |
the difference between the true value and the parameter’s estimations converges in distribution to a Gaussian distribution .
This theorem combines key probabilistic results, specifically the Law of Large Numbers and the Central Limit Theorem. Equation 4 is a classical result found in many foundational probability and statistics books [49, 32]. In order to unfold Equation 4, we need to find the MLE of the Extreme Value Index, $\hat{\xi}$, and its variance, $\mathrm{Var}(\hat{\xi})$. Then, Equation 4 allows us to compute confidence intervals on the estimation of $\xi$.
First, we introduce the Maximum Likelihood Estimation (MLE) [35] method for parameter estimation, as it is the approach utilized in Equation 4. One of the most common methods for estimating the parameters of a distribution, in particular the EVI $\xi$ of the GPD, is MLE [35]. MLE is a statistical technique employed to find the parameter value that maximizes the likelihood function, given a set of observed data. The likelihood function for a sample from a distribution with parameter $\theta$ is given by:
Definition 10 (Likelihood).
Let $x_1, \ldots, x_n$ be a sample of size $n$ from a random variable $X$ with probability density function $f(\cdot \mid \theta)$; then the likelihood function is defined as

$L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta)$  (5)

where $f(x_i \mid \theta)$ is the probability density function (PDF) of the distribution at $x_i$ given parameter $\theta$.
Now, the log-likelihood is defined as:
Definition 11 (Log-likelihood).
Let $f(x_i \mid \theta)$ be the PDF at $x_i$ given parameter $\theta$; then

$\ell(\theta) = \log L(\theta \mid x_1, \ldots, x_n) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$  (6)

is the log-likelihood function given parameter $\theta$.
The MLE for $\theta$, denoted $\hat{\theta}$, is the value that satisfies the following equation:
Definition 12 (Maximum Likelihood Estimation).
Given a log-likelihood function $\ell(\theta)$, $\hat{\theta}$ is the value of the parameter which maximizes the log-likelihood, that is:

$\hat{\theta} = \arg\max_{\theta} \ell(\theta)$  (7)
We now have the tools to derive the MLE of the EVI of the GPD.
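As a minimal illustration of Definitions 10-12 (not yet tied to the GPD), the sketch below maximizes a log-likelihood numerically for an exponential sample, whose MLE has a known closed form that serves as a sanity check.

```python
# Illustration of MLE: minimize the negative log-likelihood numerically and
# compare against the closed-form exponential MLE (1 / sample mean).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
data = rng.exponential(scale=2.0, size=1000)   # true rate lambda = 0.5

def neg_log_likelihood(lam):
    # l(lambda) = n*log(lambda) - lambda * sum(x); we minimize its negative.
    return -(len(data) * np.log(lam) - lam * data.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded")
print("numerical MLE:", res.x, " closed form:", 1.0 / data.mean())
```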
4.2 Computing Confidence Intervals on the Extreme Value Index
Now we employ the definitions above for the particular case of the GPD. In the case of the GPD, given observations $x_1, \ldots, x_n$ and considering the excesses $y_i = x_i - u$ for a given threshold $u$ (with $n_u$ excesses), the likelihood function of the GPD is:

$L(\xi, \sigma \mid y_1, \ldots, y_{n_u}) = \prod_{i=1}^{n_u} \frac{1}{\sigma} \left(1 + \frac{\xi y_i}{\sigma}\right)^{-1/\xi - 1}$  (8)

then the log-likelihood function takes the form:

$\ell(\xi, \sigma) = -n_u \log \sigma - \left(1 + \frac{1}{\xi}\right) \sum_{i=1}^{n_u} \log\left(1 + \frac{\xi y_i}{\sigma}\right)$  (9)

Taking the change of variables $\tau = \xi/\sigma$,

$\ell(\xi, \tau) = -n_u \log\left(\frac{\xi}{\tau}\right) - \left(1 + \frac{1}{\xi}\right) \sum_{i=1}^{n_u} \log\left(1 + \tau y_i\right)$  (10)

Performing MLE to obtain $\hat{\xi}$:

$\frac{\partial \ell(\xi, \tau)}{\partial \xi} = 0$  (11)

This leads to:

$\hat{\xi} = \frac{1}{n_u} \sum_{i=1}^{n_u} \log\left(1 + \hat{\tau}\, y_i\right)$  (12)
Given that the Extreme Value Index $\xi$ cannot be totally isolated, this expression needs to be solved numerically. To achieve this, the gpd.fit() function from the ismev package [21] is used to fit a GPD model to threshold exceedances using MLE.
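For illustration, the following sketch fits a GPD to threshold excesses via MLE on synthetic data; the paper relies on gpd.fit() from the R package ismev, whereas this sketch uses scipy.stats.genpareto.fit as a stand-in.

```python
# Hedged sketch of the GPD fit to threshold excesses (scipy used as a stand-in
# for ismev's gpd.fit); the location is fixed at 0 so only (xi, sigma) are estimated.
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(3)
x = rng.exponential(scale=50.0, size=20_000)       # stand-in execution times

u = np.quantile(x, 0.95)                           # illustrative PoT threshold
excesses = x[x > u] - u                            # y_i = x_i - u for x_i > u

xi_hat, _, sigma_hat = genpareto.fit(excesses, floc=0)   # MLE of (xi, sigma)
print(f"threshold u = {u:.1f}, n_u = {excesses.size}, "
      f"xi_hat = {xi_hat:.3f}, sigma_hat = {sigma_hat:.1f}")
```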
In order to compute the variance of a parameter estimator, $\mathrm{Var}(\hat{\theta})$, we can use the Cramér-Rao bound, which bounds from below the variance of any unbiased estimator based on observations of a random variable $X$, thus quantifying the minimum achievable uncertainty [39]. The Cramér-Rao bound is stated as:
Theorem 13 (Cramér-Rao bound).
Let $\theta$ be a parameter with an unbiased estimator $\hat{\theta}$; then the variance of the estimator is bounded by

$\mathrm{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$  (13)

where $I(\theta)$ is the Fisher information.
The Fisher information is defined as:
Definition 14 (Fisher Information).
Let $L(\theta \mid X)$ be a likelihood function given a parameter $\theta$; then

$I(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log L(\theta \mid X)\right)^{2}\right]$  (14)

which uses the expected value of the squared derivative of the log-likelihood. With this mathematical framework, we can compute the uncertainty around the MLE of the Extreme Value Index that we introduced in Equation 4.
Following the procedure indicated in Equation 13, the Cramér-Rao bound for the EVI $\xi$ is

$\mathrm{Var}(\hat{\xi}) \geq \frac{1}{n_u\, I(\xi)}$  (15)

Then, the Fisher information, for $\sigma$ known, is computed as:

$I(\xi) = \frac{2}{(1 + \xi)(1 + 2\xi)}$  (16)

Plugging this result into Equation 13:

$\mathrm{Var}(\hat{\xi}) \geq \frac{(1 + \xi)(1 + 2\xi)}{2\, n_u}$  (17)

Finally, we complete Equation 4 for the case of the GPD:

$\sqrt{n_u}\,(\hat{\xi}_{n_u} - \xi) \xrightarrow{d} \mathcal{N}\left(0,\ \frac{(1 + \xi)(1 + 2\xi)}{2}\right)$  (18)
Equation 18 allows us to compute confidence intervals for the estimation of the EVI.
Example 15.
Let us consider a set of $n_u$ excess observations with an estimated parameter $\hat{\xi}$ obtained using Equation 12. We want to compute the confidence interval for $\xi$ at a confidence level of $1-\alpha$. According to Equation 18, we have:

$\hat{\xi} \approx \mathcal{N}\left(\xi,\ \frac{(1 + \hat{\xi})(1 + 2\hat{\xi})}{2\, n_u}\right)$  (19)

To construct a confidence interval, we define $z_{1-\alpha/2}$ and recall that a standard normal variable $Z$ satisfies:

$P\left(-z_{1-\alpha/2} \leq Z \leq z_{1-\alpha/2}\right) = 1 - \alpha$  (20)

where $z_{1-\alpha/2}$ represents the $(1-\alpha/2)$-quantile of the standard normal distribution (e.g. $z_{0.975} = 1.96$ for $\alpha = 0.05$).

Using the standardization from Equation 19, we obtain:

$P\left(-z_{1-\alpha/2} \leq \frac{\hat{\xi} - \xi}{\sqrt{\mathrm{Var}(\hat{\xi})}} \leq z_{1-\alpha/2}\right) = 1 - \alpha$  (21)

Rearranging the inequality, we get:

$P\left(\hat{\xi} - z_{1-\alpha/2}\sqrt{\mathrm{Var}(\hat{\xi})} \leq \xi \leq \hat{\xi} + z_{1-\alpha/2}\sqrt{\mathrm{Var}(\hat{\xi})}\right) = 1 - \alpha$  (22)

Consequently, the confidence interval for $\xi$ is given by:

$CI_{1-\alpha}(\hat{\xi}) = \left[\hat{\xi} - z_{1-\alpha/2}\sqrt{\mathrm{Var}(\hat{\xi})},\ \hat{\xi} + z_{1-\alpha/2}\sqrt{\mathrm{Var}(\hat{\xi})}\right]$  (23)

recalling that $\hat{\xi}$ is obtained from Equation 12 and $\mathrm{Var}(\hat{\xi})$ from Equation 17.
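As a complement to Example 15, a minimal Python sketch of the interval computation follows; the variance expression mirrors the $\sigma$-known form assumed in Equations 16-18 above, so treat it as an assumption rather than the paper's exact implementation.

```python
# Sketch of the EVI confidence interval computation (assumed variance form).
import numpy as np
from scipy.stats import norm

def evi_confidence_interval(xi_hat, n_u, alpha=0.05):
    # Var(xi_hat) ~ (1 + xi)(1 + 2*xi) / (2 * n_u)   (assumed, sigma known)
    var = (1.0 + xi_hat) * (1.0 + 2.0 * xi_hat) / (2.0 * n_u)
    z = norm.ppf(1.0 - alpha / 2.0)                  # e.g. 1.96 for alpha = 0.05
    half = z * np.sqrt(var)
    return xi_hat - half, xi_hat + half

print(evi_confidence_interval(xi_hat=0.05, n_u=500, alpha=0.05))
```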
Given that we are using the MLE, we are reducing the model uncertainty as much as possible, while being able to capture the variability of the estimator.
Finally, the expression in Equation 18 is the core of the TailID algorithm, as it will be used to assess whether a change in the tail is present.
5 TailID methodology
The proposed approach, TailID, relies on the fact that low-density mixtures in the tail, by causing changes in tail behaviour, manifest as a change of the EVI $\xi$. TailID builds on the theoretical development presented in Section 4. In particular, the core of the algorithm is the confidence interval of the MLE of the EVI of the GPD in Equation 18.
The TailID algorithm iteratively analyzes the extreme values that may be triggering a change in the EVI $\xi$, one by one, until a change of tail behaviour is detected or all points are processed. Note that an estimation of $\xi$ that falls outside the confidence interval does not imply that the estimation is inappropriate, but rather that the behaviour has changed due to the detected mixture component, and that we should therefore apply a more nuanced tail analysis by taking the presence of mixtures into account in the tail estimation.
TailID, detailed in Algorithm 1, works on a data sample $X$, a confidence interval level $1-\alpha$, an extreme value percentile $p_u$, and a percentile for the candidate points $p_c$, and returns the set $S$ of points that are most sensitive to the hypothesis of the sample data being ID. Sensitive points are those that, if considered in the sample, can cause the EVI to fall outside the confidence interval computed without them. Also, TailID is agnostic to the number of compounded distributions in the mixture, as it considers the last component (before the point breaking the ID assumption) for tail estimation by selecting a sufficiently high percentile of points $p_c$. In Section 5.1, we show how we set $p_c$ to detect the last component.
The algorithm first initializes the data structure $S$ (line 2) where sensitive points are stored. Given a time series $X$, the implemented algorithm applies the PoT method to select the tail of the distribution. The threshold $u$ for the PoT is computed based on the extreme value percentile $p_u$, such that $p_u$ percent of the data in $X$ falls below $u$, i.e., $u = \mathrm{Quantile}(X, p_u)$ (line 3). In this work, the threshold selection algorithm is drawn from [6]. From the definition of $u$, the excess of the time series is defined as $X_u = \{\, x_i - u : x_i \in X,\ x_i > u \,\}$.
Next, we determine an initial ordered set of candidate points $C$ that are considered the most sensitive to the assumption of the sample being ID. The definition of $C$ depends on the input parameter $p_c$, which is used as the criterion to define the first candidate as

$c_1 = \min\{\, x_i \in X : x_i \geq \mathrm{Quantile}(X, p_c) \,\}$  (24)

where $\mathrm{Quantile}(X, p_c)$ is the empirical $p_c$-quantile of $X$. Both choices of $p_u$ and $p_c$ will be discussed in Subsection 5.1. The ordered set of candidates is therefore initialized (line 4) as

$C = \{\, x_i \in X : x_i \geq c_1 \,\}$  (25)
For each of the candidates in $C$, the algorithm will check whether it has a significant impact on the ID hypothesis. For this purpose, it is necessary to consider the set $X^{-C}$ with the elements in $C$ removed, that is, $X^{-C} = X \setminus C$ (line 5). The excess sample is then computed on $X^{-C}$ (line 6) as

$X_u^{-C} = \{\, x_i - u : x_i \in X^{-C},\ x_i > u \,\}$  (26)
The subset $X_u^{-C}$ is then used (line 7) to estimate the reference EVI parameter of the sample, denoted by $\hat{\xi}^{-C}$. This step is crucial for understanding the core principle of TailID. The values in $X_u^{-C}$ are the tail of the distribution with part of the most extreme values removed and collected in the set $C$ for later use.
As the last step in the initialization phase (line 8), TailID computes the confidence interval for this EVI, $CI_{1-\alpha}(\hat{\xi}^{-C})$, at confidence level $1-\alpha$. TailID estimates the range within which the true parameter value of the distribution of the maximum extremes is expected to lie with a probability of $1-\alpha$, characterizing the tail behavior. The argument is that, if there are no mixtures in the tail, the re-estimation of the EVI obtained by adding back the stored extreme values in $C$ will still fall within the confidence interval. The confidence interval is recomputed at each iteration of the core block of the algorithm to account for the additional information contributed by each extreme value in $C$ that is not added to $S$. The implications of the choice of $1-\alpha$ and the suggested value will be discussed in Subsection 5.1.
The core of the algorithm applies to each element $c_j \in C$ (line 9), which is added back to the sample as $X^{-C} \cup \{c_j\}$ (line 11). As $C$ is an ordered set, the first candidate value $c_1$ will be the closest to the sample mean, being the first point of the ordered set $C$. Subsequently, the excess will be computed on the new extended sample set (line 12), again according to Equation 26.
The newly computed excess set is then used to fit the EVI $\hat{\xi}_j$ (line 13). Now, if $\hat{\xi}_j$ does not fall inside the confidence interval (line 14),

$\hat{\xi}_j \notin CI_{1-\alpha}(\hat{\xi}^{-C})$  (27)

there is sufficient evidence to conclude that $c_j$ is sensitive to the hypothesis of the tail of the time series being ID. So it will be added to the set of ID-Sensitive Points $S$ (line 15). Instead, if $\hat{\xi}_j$ falls inside the confidence interval, then $c_j$ will not be added to $S$ and will be used to update the confidence interval for the next candidate value (line 17).
The algorithm iteratively analyzes all elements in $C$, following the same procedure until the first ID-Sensitive Point among the candidates is found. At that point, since $C$ is an ordered set, all the following candidates are also sensitive and will be added to $S$ (line 20). This latter behavior is enforced under the assumption of $S$ being empty until the first sensitive point is found (line 10).
Finally, the algorithm returns $S$ as the list of sensitive points breaking the ID assumption (line 23).
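A compact Python sketch of the iteration described above follows. Helper choices (scipy's genpareto MLE, the $\sigma$-known variance form assumed in Section 4, and plain quantile thresholds) are assumptions of this illustration, not the authors' exact implementation of Algorithm 1.

```python
# Sketch of the TailID loop: remove candidates, fit a reference EVI and its CI,
# then add candidates back one by one and flag the first one breaking the CI.
import numpy as np
from scipy.stats import genpareto, norm

def fit_evi(excesses):
    xi_hat, _, _ = genpareto.fit(excesses, floc=0)        # MLE of the EVI
    return xi_hat

def evi_ci(xi_hat, n_u, alpha):
    var = (1 + xi_hat) * (1 + 2 * xi_hat) / (2 * n_u)     # assumed variance form
    half = norm.ppf(1 - alpha / 2) * np.sqrt(var)
    return xi_hat - half, xi_hat + half

def tail_id(x, alpha=0.01, p_u=0.95, p_c=0.999):
    x = np.asarray(x, dtype=float)
    u = np.quantile(x, p_u)                               # PoT threshold (line 3)
    c1 = np.quantile(x, p_c)                              # first candidate criterion (Eq. 24)
    candidates = np.sort(x[x >= c1])                      # ordered candidate set C (line 4)
    kept = x[x < c1]                                      # X without C (line 5)

    excesses = kept[kept > u] - u                         # excesses of X^{-C} (line 6)
    xi_ref = fit_evi(excesses)                            # reference EVI (line 7)
    ci = evi_ci(xi_ref, excesses.size, alpha)             # reference CI (line 8)

    sensitive = []                                        # S (line 2)
    for i, c in enumerate(candidates):                    # lines 9-21
        excesses = np.append(excesses, c - u)             # add candidate back (lines 11-12)
        xi_new = fit_evi(excesses)                        # re-estimate the EVI (line 13)
        if not ci[0] <= xi_new <= ci[1]:                  # outside the CI: ID-sensitive (line 14)
            sensitive = list(candidates[i:])              # this and all later candidates (line 20)
            break
        ci = evi_ci(xi_new, excesses.size, alpha)         # update CI for next candidate (line 17)
    return sensitive                                      # S (line 23)
```

In this sketch, an empty result corresponds to Scenario 1 in Section 5.2, while a non-empty result provides both the sensitive points and, through its first element, a candidate threshold from which to re-fit the tail.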
From a theoretical perspective, it is worth noting that TailID builds on an ordinary set of assumptions. First, it assumes that the tail of the data follows a GPD. Second, the validity of the confidence intervals in Equation 18 requires the asymptotic normality of the MLE of the EVI $\xi$. This, in turn, depends on the usual regularity conditions: the data are assumed to be IID with a PDF belonging to the assumed parametric family; the parameter space is compact and the true parameter lies in its interior; the log-likelihood is continuous in the parameter with probability one, twice continuously differentiable, and well behaved in a neighborhood of the true parameter; the standard integrability (dominance) conditions are satisfied; and the Fisher information matrix exists and is nonsingular [36].
5.1 Discussion of Algorithm Parameters
Algorithm 1 works on three main parameters ($p_u$, $p_c$, and $1-\alpha$) that call for careful selection to guarantee the effectiveness of the results. In the following, we discuss the implications of the parameter selection and the proposed selection criteria.
$p_u$. The choice of the parameter $p_u$ determines the points in the sample that are considered part of the sample tail. Given the lack of consensus in the literature regarding the choice of percentile that defines the sample tail, several studies also criticize the practice of arbitrarily selecting a fixed percentile without first analyzing the data [43]. To address this, we determine $p_u$ using the methodology developed in [6], which evaluates multiple candidate percentiles and selects the one that best aligns with the assumptions of EVT. Any other threshold selection algorithm in the literature is also suitable.
$p_c$. The choice of $p_c$ determines the points in the sample that are considered candidates for being sensitive points. This percentile is tied to the selection of $p_u$, such that the candidates for sensitive points must belong to the sample tail and indeed represent data with very low-probability outcomes. Additionally, as the goal of the TailID algorithm is to detect any inconsistencies in the ID of the tail, we avoid selecting a percentile that fails to detect sensitive points when another percentile does. A Grid Search algorithm [9], employed on the kernel data described in Section 6, has shown that the chosen $p_c$ results in a stable number of detected sensitive points across different values of $p_u$ and across multiple scenarios. Moreover, this percentile ensures that the algorithm consistently identifies a relevant proportion of sensitive points among the candidate points; for typical threshold percentiles, the candidate points selected correspond only to the very top of the sample. Thus, we are aiming at the last component in the mixture. This choice minimizes the risk of failing to detect sensitive points when compared to other percentiles. Also, to improve robustness, we perform a grid search on the data before applying the algorithm. This preliminary step helps to determine an appropriate percentile $p_c$.
$1-\alpha$. The confidence level is set to a high value to identify only those points that truly impact the tail distribution, thereby minimizing the number of false positives. For a point to be classified as sensitive, the EVI of the tail, recalculated to include this point, must fall outside the confidence interval of the EVI computed without it. With such a high confidence level, the risk that the true parameter of the distribution lies outside this range is small.
5.2 Decision Making on TailID outcomes
TailID improves the interpretability of the outcomes, allowing the user to draw conclusions on the extreme values of the data under analysis. In this line, we next define several outcomes (scenarios) of the execution of TailID and how to interpret them. In order to create these scenarios, we consider the minimum amount of samples (MoS) required for a proper estimation of the EVI. The topic of determining the necessary amount of extreme samples to obtain an accurate estimation of the tail is outside the scope of this work. However, we refer to works in other areas which argue that the minimum number of extreme samples varies depending on the tail profile [34, 28]. The authors in [14] develop a method to estimate the MoS for each sample, which in general falls within a limited range. In line with it, in this work we conservatively set a fixed MoS and rely on [14] for further tuning. For instance, for a given Extreme Value Index estimate and a sample size equal to the MoS, Equation 18 directly yields the width of the resulting confidence interval.
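The relation between the number of tail samples and the width of the EVI confidence interval illustrates why an MoS is needed before trusting the estimate; the sketch below, which reuses the variance form assumed in Section 4, is illustrative only.

```python
# Illustration of the MoS rationale: under the assumed CI form, the half-width
# of the EVI confidence interval shrinks roughly as 1/sqrt(n_u).
import numpy as np
from scipy.stats import norm

xi_hat, alpha = 0.0, 0.05
for n_u in (20, 50, 100, 200, 500):
    half = norm.ppf(1 - alpha / 2) * np.sqrt((1 + xi_hat) * (1 + 2 * xi_hat) / (2 * n_u))
    print(f"n_u = {n_u:4d}  ->  CI half-width = {half:.3f}")
```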
Overall, using the number of inconsistent points detected, that is, the cardinality of the set $S$ (i.e., $|S|$), we define three scenarios according to how $|S|$ compares to the MoS:
Scenario 1.
($|S| = 0$) No inconsistent points detected. If the list of inconsistent points, $S$, is empty, then no candidate point fell outside the confidence intervals, and the ID hypothesis on the identical distribution of the tail holds. The TailID outcome provides further evidence of the robustness of the tail estimations, given the stability of the tail. In this scenario, TailID, in conjunction with the KPSS test for stationarity on the whole distribution, provides a solid analysis of the sample.
Scenario 2.
($|S| \geq \mathrm{MoS}$) High number of inconsistent points detected. If the final list of inconsistent points detected by the TailID algorithm is larger than the MoS required for a proper estimation of the EVI, then we are in the presence of multiple tail behaviors, which should be taken into account when performing pWCET estimations. Hence, the sample needs to be analyzed taking into account the presence of another source of uncertainty in the tail of the distribution. In this scenario, TailID places the threshold of the tail at the first inconsistent value found, which lies on the last mixture component. Interestingly, in the case of multiple mixture components composing the distribution, the PoT model implies that the last component is employed for the tail estimation, while the rest are considered part of the bulk of the distribution.
Scenario 3.
($0 < |S| < \mathrm{MoS}$) Few inconsistent points detected. As in Scenario 2 there is a mixture, but in this case more samples are needed to obtain a better estimation of the parameters of the distribution, since with fewer than MoS points it would be imprecise. If, after performing more runs, $|S| \geq \mathrm{MoS}$, we apply TailID as in Scenario 2. However, if the user reaches the maximum number of runs he/she can perform – e.g. due to the time required to perform the runs – and still $|S| < \mathrm{MoS}$, tail prediction should not be performed, as the uncertainty in the estimation can be high.
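The decision procedure above reduces to a simple rule on $|S|$; the following sketch is a hypothetical helper illustrating it (the MoS value is a placeholder, not the paper's setting).

```python
# Sketch of the Section 5.2 decision rule on |S| versus the MoS.
def tail_id_scenario(num_sensitive: int, mos: int = 30) -> str:
    # mos = 30 is an illustrative placeholder for the minimum amount of samples.
    if num_sensitive == 0:
        return "Scenario 1: tail ID holds; proceed with standard EVT fitting"
    if num_sensitive >= mos:
        return "Scenario 2: re-fit the tail from the first sensitive point"
    return "Scenario 3: collect more runs before trusting the tail estimate"

print(tail_id_scenario(0), tail_id_scenario(45), tail_id_scenario(7), sep="\n")
```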
6 Experimental Evaluation
We make a two-fold evaluation of our proposal. On the one hand, we build on a set of parameterized mixture distributions that allow us to ascertain how our technique works for scenarios of different complexity. Since we know the actual distribution of the tail, we can assess with complete certainty the accuracy of TailID for the evaluated scenarios. On the other hand, we assess TailID on execution time samples collected on a state-of-the-art board for embedded automotive systems, the NVIDIA AGX Orin.
6.1 Parametric Distributions
We use Weibull, Beta and Exponential distributions as they have different types of tails [4, 47]: the Weibull has a sharper slope and a maximum value; the Exponential (Gumbel-type tail) also has a relatively sharp slope but has no maximum; and the Beta (light tail) distribution has a similar profile to the Weibull and is frequently used to model high quantiles of random variables within a finite specified interval, hence suiting WCET modeling.
From these uni-modal distributions we build the parameterized mixtures shown in Table 3, which specifies the parameters and weights of both components, as well as the theoretical EVI. The goal when building the parametric mixture distributions is to ensure variety in three main aspects: i) the tail profile, ii) the sharpness of the tail, and iii) the density of the last component. For the tail profile, we select three different parametric distributions: Weibull, Beta and Exponential. For the sharpness, we select a different shape parameter for the distributions (the shape for the Exponential is the same, given that its EVI is always 0). Finally, we select a different density of the last component by changing the weight of each parametric distribution. With these three aspects, we are able to cover a wide variety of cases that are representative of WCET analysis scenarios.
| Acronym | Type | Parameters | EVI | EVI (TailID) | EVI (no TailID) |
| --- | --- | --- | --- | --- | --- |
| MixW_1 | Weibull | | | | |
| MixW_2 | Weibull | | | | |
| MixB_1 | Beta | | | | |
| MixB_2 | Beta | | | | |
| MixE_1 | Expntl. | | | | |
MixW_1.
Figure 2(a) shows the sample where points targeted as candidates by the algorithm are marked with distinct shapes and colors, differentiating them from the rest. The red-shaded area highlights the points ultimately detected as inconsistent by the algorithm.
In Figure 2(b), the horizontal gray line represents the GPD EVI of the sample's extreme set without including the candidate points, while the dashed line indicates the upper bound of its confidence interval. Each point corresponds to the GPD EVI when the candidate point (represented with the same color in Figure 2(a)) is added to the sample's extreme set, with each dashed line representing the upper bound of the confidence interval for that configuration. For instance, in Figure 2(b) the first candidate, the purple solid circle, is below the grey dashed line, and therefore within the confidence interval. Then the confidence interval is recomputed considering the first candidate and is shown as a dashed purple line, just above the grey dashed line. TailID iterates this process until the yellow square is found to be above the confidence interval, and therefore inconsistent with the ID hypothesis. Finally, the yellow square can be used as the new threshold from which to estimate the pWCET.
The pWCET estimates with GPD and exponential tail (EXP) fitting are shown in Figure 2(c). In particular, we use the sample in Figure 2(a) to estimate low exceedance probabilities. We distinguish two scenarios: no TailID, in which pWCET estimates are obtained with the first threshold selected taking the sample as is (dark green and blue projections); and TailID, in which pWCET estimates are obtained with TailID, i.e., using the new threshold found, represented by the yellow square (light green and blue projections). The black dots represent the reference CDF of MixW_1 up to a low probability, which is used as ground truth.
For the no TailID scenario, the EXP model, which is usually deemed trustworthy, produces an optimistic bound, while the GPD model produces a pessimistic bound. In the TailID scenario, the estimates produced with GPD and EXP are tighter: the GPD results in a tighter tail, yet slightly underestimating, while the EXP model produces the expected trustworthy pWCET with a tight estimation. The actual estimated EVI parameters for each model and scenario can be found in Table 3, where we can see that in all cases the EVI in the no TailID scenario produces heavy tails, while the TailID scenario produces an EVI close to the real one, reflected in the EVI column, which reports the exact value for each distribution. The EVI for the EXP models is always exactly 0.
Overall, TailID brings a clearly positive impact on pWCET estimations. For the GPD models, the pessimism is reduced while occasionally incurring negligible underestimation when compared to that incurred by the EXP model in no TailID. Additionally, for the EXP models, TailID significantly reduces underestimation.
Rest of the distributions.
Figure 3 shows the candidates and inconsistent points detected for the rest of the parametric distributions in Table 3. There we show how for each distribution, the number of inconsistent points detected is different, with 52 for MixW_2, 24 for MixB_1, 28 for MixB_2 and 92 for MixE_1.
As we can see in Figure 4, the inconsistent points detected for all distributions are always found in the domain of the last mixture component, which can also be seen in Figure 3 where the red-shaded area is always found separately from the bulk of the distribution and it is different for each parametric distribution. For instance, in Figure 3(b), the first candidate points are not far from the bulk of the distribution, and therefore they could be considered as extreme values from the first component. Figure 4(b) confirms that suspicion, given that the first candidates are within confidence intervals, therefore, not belonging to the last mixture component. However, this assumption is broken at the magenta triangle point, which is much further from the bulk of the distribution, as seen in Figure 3(b) where the last mixture component has been detected. A similar scenario is captured in Figure 2(b) and 4(c). In Figures 4(a) and 4(d), we observe how the first candidates are already detected as inconsistent and, from Figures 3(a) and 3(d), it is not obvious that a mixture is present.
All of these findings are reflected in Figure 5 where the pWCET estimations for the rest of the parametric distributions are found. In more detail, the GPD with no TailID always produces a heavy tail, thus leading to a pessimistic bound (dark blue line). With TailID, the GPD model produces tighter bounds (light blue). For the EXP model, which is regarded as a trustworthy upper-bound, with no TailID there is risk of computing optimistic bounds as the threshold estimated is too low (dark green line). Instead, with TailID the EXP model consistently produces tight upper-bounds (light green line), given that the chosen threshold using TailID is on the last mixture component.
6.2 NVIDIA Orin AGX
Setup.
In this work we cover the execution time variability arising from the hardware and the system software. At the application software level, we build on representative kernels that, in general, present low execution time variability. In particular, we focus on the three A78 clusters of the Jetson 64GB AGX Orin, simply referred to in this work as Orin, a representative high-end MPSoC for the automotive domain [27]. An Orin CPU cluster implements four high-performance Arm® Cortex®-A78AE v8.2 cores, each with its private data and instruction caches (dL1 and iL1, respectively, in Figure 6) and a unified private L2 (L2). An L3 cache is shared at the cluster level. The L4 is mainly shared between the CPU and the GPU, with misses going to a system-level fabric and the memory controller. We use the default Linux-based setup provided by the manufacturer to run our experiments.
At the application level, we tested the performance of TailID for several kernels used in AI-based edge applications, see Figure 7. At the top level ①, applications like autonomous driving and radar [22, 46] include object detection, data fusion, and navigation functions that are AI-fuelled and build on different types of Neural Networks (NN) ②, such as recurrent and convolutional networks. As MPSoCs like the Orin allow consolidating several applications, there can be multiple NN instances running on the CPU and accelerators ③. The CPU is used when accelerators are busy, or when the working set is small and the overhead of offloading the kernels (along with their data) to the accelerator is too high [44].
Applications build on AI libraries that include matrix multiplication as a central operator (function) [45], which accounts for a large portion of the execution time (beyond 65%) [20]. Image-to-columns is used in convolutional layers for tensor lowering to enable matrix convolutions. Another common operator is the rectifier, which takes the positive part of its argument [5]. Other pervasive functions in those libraries are vector dot-product and matrix transpose [45]. Overall, we use the following kernels: matrix multiplication (basic and tiled), image-to-column, a sorting algorithm, matrix transpose, and the rectified linear unit.
Results.
With these kernels we build randomly generated workloads from 2 to 12 kernels each. From the execution times for each workload, we perform an initial filtering step. Specifically, we apply three statistical tests (KPSS for stationarity, BDS for short-range dependence, and the Hurst index for long-range dependence) following the methodology of Reghenzani et al. with the PPI test [40]. Additionally, we use the extremogram test to assess dependence in the extremes and apply the KPSS test to the tail as an initial approach for identifying identical distribution in extreme values. This analysis follows Remarks 7 and 8.
Out of the initial workloads, the cases not rejected by any test were kept for further analysis; the remainder do not meet the conditions for the application of EVT. As mentioned in Section 2, the PPI test is an already strict test that sets a high bar for the trustworthy application of EVT. We then apply TailID to the workloads which pass the IID tests. From these cases, a subset also passed the ID hypothesis for the tail, as no inconsistent points were detected (Scenario 1). Other workloads resulted in fewer than MoS inconsistent points detected (IPD) (Scenario 3). Finally, for the remaining workloads the TailID algorithm detected more than MoS IPD (Scenario 2). Therefore, TailID was able to automatically detect several workloads that do not fulfill the ID hypothesis in the tail. Next, we illustrate some workloads falling in the different scenarios.
Scenario 1.
Figure 8 presents an example workload in which TailID confirms that the execution time series satisfies both the ID hypothesis for the tail and the combination of the three tests (KPSS test, BDS test, and Hurst Index) discussed in [40]. In Figure 8(b) we can see how all the EVI estimations fall within each confidence interval, therefore providing evidence that the tail is ID. Figure 8(c) shows the pWCET estimations in the no TailID and TailID scenarios, in a similar fashion to the parametric distributions in Section 6.1. In this case, we employ the collected sample to estimate the pWCET in both scenarios for a low exceedance probability. Note that in Figure 8(c) the pWCET for the no TailID and TailID scenarios is the same, given that no inconsistent points were found.
Scenario 2.
In this case, the combination of the three tests (KPSS test, BDS test, and Hurst Index) [40] does not reject the stationarity of the execution time series, but TailID detects more than 40 inconsistent points. In Figure 9(a) we can see how the extreme values are spread out and not far from the bulk of the distribution, which can easily confuse the threshold selection. However, TailID is able to find inconsistent points among the first candidate points, similar to the parametric MixW_2 and MixE_1 scenarios. The pWCET estimates, see Figure 9(c), confirm that detecting a low-density mixture in the tail and, hence, choosing the appropriate threshold produces a tighter upper-bound with the GPD and EXP models. A quantitative assessment of the improvement achieved with TailID can be performed at each specific quantile level at which the pWCET is evaluated, considering the definition of pWCET tightness as the ratio between the estimated pWCET and the reference value at that quantile. For example, the GPD model yields a tightness of 1.0068 without mixture detection and 1.0004 with TailID. For the EXP model, the tightness goes from 0.9973 to 1.0008 when TailID is applied.
Scenario 3.
In this case, the KPSS test, BDS test, and Hurst Index [40] do not reject IID of the execution time series, and TailID detects few inconsistent points within the selected candidates, so more runs are required. We illustrate the two possible outcomes (sub-scenarios) identified in Section 5. First, in Figure 10(a,b) TailID detected fewer than MoS inconsistent points, and after performing more runs we see the impact in Figure 10(c,d), where the number of inconsistent points exceeds the MoS. Second, in Figure 11, instead, even after performing more runs and reaching the maximum number of runs, the number of inconsistent points remains below the MoS. In both sub-scenarios, Figures 10(b,d) and 11(b,d), we can see that TailID selects the inconsistent point where the evolution of the GPD EVI changes, which indicates a change in the tail distribution. For the first sub-scenario, Figure 10(e) shows how TailID makes the EXP and GPD estimates much more accurate.
7 Conclusions
The execution of increasingly rich software functionalities on top of cutting-edge MPSoCs can result in low-density mixture execution time distributions. As discussed in Section 3, the presence of low-density mixture distributions in the tail of the data challenges the trustworthiness of timing bounds computed by EVT-based approaches. Traditional tests for identical distribution, typically applied on the full data and not on the tail, are in any case unable to detect low-density mixtures. We leverage key principles of the EVT methodology and propose TailID, an algorithm building on the confidence interval of the extreme value index to test the ID hypothesis specifically on the tail of the data, and elaborate a practical strategy to follow for enabling trustworthy EVT application under all possible test outcomes. We assessed our approach on both parametric distributions and real execution data collected on a reference MPSoC in the embedded domain, confirming that TailID is able to detect low-density mixtures in the tail of data samples, even in samples that already passed state-of-the-art IID tests. For this reason, we contend that TailID has to be used in conjunction with state-of-the-art IID tests to guarantee a more robust data analysis framework that increases confidence in EVT estimations.
References
- [1] Jaume Abella, Carles Hernandez, Eduardo Quiñones, Francisco J. Cazorla, Philippa Ryan Conmy, Mikel Azkarate-askasua, Jon Perez, Enrico Mezzetti, and Tullio Vardanega. Wcet analysis methods: Pitfalls and challenges on their trustworthiness. In 10th IEEE International Symposium on Industrial Embedded Systems (SIES), pages 1–10, 2015. doi:10.1109/SIES.2015.7185039.
- [2] Jaume Abella, Maria Padilla, Joan Del Castillo, and Francisco J. Cazorla. Measurement-based worst-case execution time estimation using the coefficient of variation. ACM Trans. Des. Autom. Electron. Syst., 22(4), June 2017. doi:10.1145/3065924.
- [3] Tadeu Nogueira C. Andrade, George Lima, and Verônica Maria Cadena Lima. Drawing lines for measurement-based probabilistic timing analysis. In 2024 IEEE Real-Time Systems Symposium (RTSS), pages 243–255, 2024. doi:10.1109/RTSS62706.2024.00029.
- [4] Charalampos Antoniadis, Dimitrios Garyfallou, Nestor Evmorfopoulos, and Georgios Stamoulis. Evt-based worst case delay estimation under process variation. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1333–1338, 2018. doi:10.23919/DATE.2018.8342220.
- [5] ApolloAuto. Perception, 2018. URL: https://github.com/ApolloAuto/apollo/blob/r3.0.0/docs/specs/perception_apollo_3.0.md.
- [6] Luís Fernando Arcaro, Karila Palma Silva, and Rômulo Silva De Oliveira. On the reliability and tightness of gp and exponential models for probabilistic wcet estimation. ACM Trans. Des. Autom. Electron. Syst., 23(3), March 2018. doi:10.1145/3185154.
- [7] August Aimé Balkema and Laurens de Haan. Residual Life Time at Great Age. The Annals of Probability, 2(5):792–804, 1974. doi:10.1214/aop/1176996548.
- [8] K. Berezovskyi, L. Santinelli, K. Bletsas, and E. Tovar. Wcet measurement-based and extreme value theory characterisation of cuda kernels. In Proceedings of the International Conference on Real-Time Networks and Systems (RTNS), pages 279–288, 2014. doi:10.1145/2659787.2659827.
- [9] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011. URL: https://proceedings.neurips.cc/paper_files/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf.
- [10] Patrick Billingsley. Probability and Measure. Wiley, New York, 3 edition, 1995.
- [11] Sergey Bozhko, Filip Marković, Georg von der Brüggen, and Björn B. Brandenburg. What really is pwcet? a rigorous axiomatic proposal. In 2023 IEEE Real-Time Systems Symposium (RTSS), pages 13–26, 2023. doi:10.1109/RTSS59052.2023.00012.
- [12] Peter J. Brockwell and Richard A. Davis. Introduction to Time Series and Forecasting. Springer Texts in Statistics. Springer Cham, 3 edition, 2016. doi:10.1007/978-3-319-29854-2.
- [13] Alan Burns. System mode changes - general and criticality-based. In L. Cucu-Grosjean and R. Davis, editors, Proc. 2nd Workshop on Mixed Criticality Systems (WMC), RTSS, pages 3–8, 2014.
- [14] Yuzhi Cai and Dominic Hames. Minimum sample size determination for generalized extreme value distribution. Communications in Statistics - Simulation and Computation, 40(1):87–98, 2010. doi:10.1080/03610918.2010.530368.
- [15] Joan Del Castillo, Jalila Daoudi, and Richard Lockhart. Methods to distinguish between polynomial and exponential tails. Scandinavian Journal of Statistics, 41(2):382–393, 2014. doi:10.1111/sjos.12037.
- [16] Francisco J. Cazorla, Leonidas Kosmidis, Enrico Mezzetti, Carles Hernández, Jaume Abella, and Tullio Vardanega. Probabilistic worst-case timing analysis: Taxonomy and comprehensive survey. ACM Comput. Surv., 52(1):14:1–14:35, 2019. doi:10.1145/3301283.
- [17] S. Coles. An Introduction to Statistical Modeling of Extreme Values. Springer, 2001. doi:10.1007/978-1-4471-3675-0.
- [18] Richard A. Davis and Thomas Mikosch. The extremogram: A correlogram for extreme events. Bernoulli, 15(4):977–1009, 2009. doi:10.3150/09-BEJ213.
- [19] Robert I. Davis and Liliana Cucu-Grosjean. A survey of probabilistic timing analysis techniques for real-time systems. Leibniz Trans. Embed. Syst., 6(1):03:1–03:60, 2019. doi:10.4230/LITES-V006-I001-A003.
- [20] Fernando Fernandes dos Santos, Lucas Draghetti, Lucas Weigel, Luigi Carro, Philippe Navaux, and Paolo Rech. Evaluation and Mitigation of Soft-Errors in Neural Network-Based Object Detection in Three GPU Architectures. In 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), pages 169–176, 2017. doi:10.1109/DSN-W.2017.47.
- [21] Original S functions written by Janet E. Heffernan with R port and R documentation provided by Alec G. Stephenson. ismev: An Introduction to Statistical Modeling of Extreme Values, 2018. R package version 1.42. URL: https://CRAN.R-project.org/package=ismev.
- [22] Jonah Gamba. Automotive radar applications. In Radar Signal Processing for Autonomous Driving, pages 123–142. Springer Singapore, Singapore, 2020. doi:10.1007/978-981-13-9193-4_9.
- [23] Fabrice Guet, Luca Santinelli, and Jérôme Morio. On the Reliability of the Probabilistic Worst-Case Execution Time Estimates. In Proceedings of the 8th European Congress on Embedded Real Time Software and Systems, TOULOUSE, France, January 2016. URL: https://hal.science/hal-01289477.
- [24] Fabrice Guet, Luca Santinelli, and Jerome Morio. On the Representativity of Execution Time Measurements: Studying Dependence and Multi-Mode Tasks. In Jan Reineke, editor, 17th International Workshop on Worst-Case Execution Time Analysis (WCET 2017), volume 57 of Open Access Series in Informatics (OASIcs), pages 3:1–3:13, Dagstuhl, Germany, 2017. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/OASIcs.WCET.2017.3.
- [25] Bruce M. Hill. A Simple General Approach to Inference About the Tail of a Distribution. The Annals of Statistics, 3(5):1163–1174, 1975. doi:10.1214/aos/1176343247.
- [26] H. E. Hurst. Long-term storage capacity of reservoirs. Transactions of the American Society of Civil Engineers, 116(1):770–799, 1951. doi:10.1061/TACEAT.0006518.
- [27] Leela S. Karumbunathan. NVIDIA Jetson AGX Orin Series. A Giant Leap Forward for Robotics and Edge AI Applications. Technical Brief, 2022. URL: https://www.nvidia.com/content/dam/en-zz/Solutions/gtcf21/jetson-orin/nvidia-jetson-agx-orin-technical-brief.pdf.
- [28] Mohamed Khalifa, Faisal Khan, and Mahmoud Haddara. A methodology for calculating sample size to assess localized corrosion of process components. Journal of Loss Prevention in the Process Industries, 25(1):70–80, 2012. doi:10.1016/j.jlp.2011.07.004.
- [29] Marwan Wehaiba El Khazen, Liliana Cucu-Grosjean, Adriana Gogonel, Hadrien Clarke, and Yves Sorel. Work-in-progress abstract: Wks, a local unsupervised statistical algorithm for the detection of transitions in timing analysis. In 2021 IEEE 27th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), pages 201–203, 2021. doi:10.1109/RTCSA52859.2021.00032.
- [30] Denis Kwiatkowski, Peter C.B. Phillips, Peter Schmidt, and Yongcheol Shin. Testing the null hypothesis of stationarity against the alternative of a unit root: How sure are we that economic time series have a unit root? Journal of Econometrics, 54(1):159–178, 1992. doi:10.1016/0304-4076(92)90104-Y.
- [31] M. R. Leadbetter, G. Lindgren, and H. Rootzén. Conditions for the convergence in distribution of maxima of stationary normal processes. Stochastic Processes and their Applications, 8(2):213–226, 1978. doi:10.1016/0304-4149(78)90023-1.
- [32] E. L. Lehmann and George Casella. Theory of Point Estimation. Springer Texts in Statistics. Springer New York, NY, 2 edition, 1998. doi:10.1007/b98854.
- [33] Filip Marković, Georg von der Brüggen, Mario Günzel, Jian-Jia Chen, and Björn B. Brandenburg. A distribution-agnostic and correlation-aware analysis of periodic tasks. In 2024 IEEE Real-Time Systems Symposium (RTSS), pages 215–228, 2024. doi:10.1109/RTSS62706.2024.00027.
- [34] Connor J. McCluskey, Manton J. Guers, and Stephen C. Conlon. Minimum sample size for extreme value statistics of flow-induced response. Marine Structures, 79:103048, 2021. doi:10.1016/j.marstruc.2021.103048.
- [35] Russell B. Millar. Maximum Likelihood Estimation and Inference: With Examples in R, SAS and ADMB. John Wiley & Sons, Ltd, Chichester, UK, 2011. doi:10.1002/9780470094846.
- [36] Whitney K. Newey and Daniel McFadden. Large sample estimation and hypothesis testing. In Zvi Griliches and Michael D. Intriligator, editors, Handbook of Econometrics, volume 4, chapter 36, pages 2111–2245. Elsevier, 1986.
- [37] Kun Il Park. Fundamentals of Probability and Stochastic Processes with Applications to Communications. Springer Publishing Company, Incorporated, 1st edition, 2017.
- [38] Jon Pérez-Cerrolaza, Jaume Abella, Markus Borg, Carlo Donzella, Jesús Cerquides, Francisco J. Cazorla, Cristofer Englund, Markus Tauber, George Nikolakopoulos, and Jose Luis Flores. Artificial intelligence for safety-critical systems in industrial and transportation domains: A survey. ACM Comput. Surv., 56(7):176:1–176:40, 2024. doi:10.1145/3626314.
- [39] C. Radhakrishna Rao. Information and the Accuracy Attainable in the Estimation of Statistical Parameters, pages 235–247. Springer New York, New York, NY, 1992. doi:10.1007/978-1-4612-0919-5_16.
- [40] Federico Reghenzani, Giuseppe Massari, and William Fornaciari. Probabilistic-WCET reliability: Statistical testing of EVT hypotheses. Microprocessors and Microsystems, 77:103135, 2020. doi:10.1016/j.micpro.2020.103135.
- [41] L. Santinelli, J. Morio, G. Dufour, and D. Jacquemart. On the sustainability of the extreme value theory for wcet estimation. In Proceedings of the Workshop on Worst-Case Execution Time Analysis (WCET), pages 21–30, 2014. doi:10.4230/OASIcs.WCET.2014.21.
- [42] C. Scarrott and A. MacDonald. A review of extreme value threshold estimation and uncertainty quantification. REVSTAT–Statistical Journal, 10(1):33–60, 2012.
- [43] C. Scarrott and A. MacDonald. A review of extreme value threshold estimation and uncertainty quantification. REVSTAT–Statistical Journal, 10(1):33–60, 2012.
- [44] Alejandro Serrano-Cases et al. Leveraging hardware QoS to control contention in the Xilinx Zynq UltraScale+ MPSoC. In Euromicro Conference on Real-Time Systems (ECRTS), 2021.
- [45] Hamid Tabani, Roger Pujol, Jaume Abella, and Francisco J. Cazorla. A Cross-Layer Review of Deep Learning Frameworks to Ease Their Optimization and Reuse. In IEEE 23rd International Symposium on Real-Time Distributed Computing (ISORC), 2020. doi:10.1109/ISORC49007.2020.00030.
- [46] Lee Teschler. The basics of automotive radar, 2019. URL: https://www.designworldonline.com/the-basics-of-automotive-radar/.
- [47] Sergi Vilardell, Isabel Serra, Enrico Mezzetti, Jaume Abella, Francisco J. Cazorla, and Joan del Castillo. Using markov’s inequality with power-of-k function for probabilistic WCET estimation. In Martina Maggio, editor, 34th Euromicro Conference on Real-Time Systems, ECRTS 2022, July 5-8, 2022, Modena, Italy, volume 231 of LIPIcs, pages 20:1–20:24. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPICS.ECRTS.2022.20.
- [48] W. D. Dechert W. A. Broock, J. A. Scheinkman and B. LeBaron. A test for independence based on the correlation dimension. Econometric Reviews, 15(3):197–235, 1996. doi:10.1080/07474939608800353.
- [49] Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. Springer New York, NY, 1 edition, 2004. Springer Science+Business Media, LLC, part of Springer Nature. doi:10.1007/978-0-387-21736-9.
- [50] Saehanseul Yi and Nikil Dutt. Boostiid: Fault-agnostic online detection of wcet changes in autonomous driving. In Proceedings of the 29th Asia and South Pacific Design Automation Conference, ASPDAC ’24, pages 704–709. IEEE Press, 2024. doi:10.1109/ASP-DAC58780.2024.10473866.
