Impact of Transient Faults on Timing Behavior and Mitigation with Near-Zero WCET Overhead

Nikiema, Pegdwende Romaric; Kritikakou, Angeliki; Traiola, Marcello; Sentieys, Olivier

doi:10.4230/LIPIcs.ECRTS.2023.15

Abstract

Time-critical systems require timing guarantees, so Worst-Case Execution Times (WCETs) must be employed. However, WCET estimation methods usually assume fault-free hardware. Unless proper actions are taken, such fault-free WCET approaches become unsafe when faults affect the hardware during execution. Most approaches that deal with hardware faults address their impact on the functional behavior of an application, i.e., denial of service and binary correctness. Few approaches address the impact of faults on the application's timing behavior, i.e., the time needed to finish the application, and those that do target faults occurring in memories. However, as transistor sizes shrink significantly in modern technologies, faults in cores can no longer be considered negligible. This work shows that faults affect not only the functional behavior of applications but can also have a significant impact on their timing behavior. To expose the overall impact of faults, we enhance vulnerability analysis to include not only functional but also timing correctness, and show that faults impact WCET estimations. As common techniques for dealing with faults, such as watchdog timers and re-execution, incur a large timing overhead for error detection and correction, we propose a mechanism with near-zero and bounded timing overhead. A RISC-V core is used as a case study. The obtained results show that, without protection, faults can increase the maximum observed execution time by up to almost 700% compared with fault-free execution, which affects the WCET estimations. In contrast, the proposed mechanism is able to restore fault-free WCET estimations with a bounded overhead of 2 execution cycles.
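
To make the idea of a vulnerability analysis that covers both functional and timing correctness more concrete, the following minimal Python sketch labels fault-injection runs along both dimensions and computes the relative increase in maximum observed execution time. It is an illustration only and not the authors' tooling: the `Run` record, the category names, and the cycle counts are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    """One fault-injection run (hypothetical record layout)."""
    cycles: Optional[int]  # execution time in cycles; None if the run never finished
    output_ok: bool        # True if the output matches the fault-free output


def classify(run: Run, fault_free_max: int) -> str:
    """Label a run by functional and timing correctness (labels are assumptions)."""
    if run.cycles is None:
        return "no_termination"
    functional = "correct" if run.output_ok else "wrong_output"
    # A run is "late" if it exceeds the maximum fault-free execution time.
    timing = "on_time" if run.cycles <= fault_free_max else "late"
    return f"{functional}/{timing}"


def max_increase_percent(runs, fault_free_max: int) -> float:
    """Relative increase of the maximum observed execution time, in percent."""
    finished = [r.cycles for r in runs if r.cycles is not None]
    return 100.0 * (max(finished) - fault_free_max) / fault_free_max


# Made-up numbers, chosen only to exercise every category.
fault_free_max = 1_000
runs = [Run(1_000, True), Run(990, False), Run(4_200, True), Run(None, False)]

summary = Counter(classify(r, fault_free_max) for r in runs)
increase = max_increase_percent(runs, fault_free_max)
print(dict(summary))
print(f"max observed increase: {increase:.0f}%")
```

The point of the sketch is the `correct/late` category: such a run would count as harmless in a purely functional analysis, yet it violates a WCET estimate derived under fault-free assumptions.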

