Impact of Transient Faults on Timing Behavior and Mitigation with Near-Zero WCET Overhead

Authors Pegdwende Romaric Nikiema, Angeliki Kritikakou, Marcello Traiola, Olivier Sentieys




Author Details

Pegdwende Romaric Nikiema
  • Univ Rennes, Inria, IRISA, CNRS, France
Angeliki Kritikakou
  • Univ Rennes, Inria, IRISA, CNRS, France
Marcello Traiola
  • Univ Rennes, Inria, IRISA, CNRS, France
Olivier Sentieys
  • Univ Rennes, Inria, IRISA, CNRS, France

Cite As

Pegdwende Romaric Nikiema, Angeliki Kritikakou, Marcello Traiola, and Olivier Sentieys. Impact of Transient Faults on Timing Behavior and Mitigation with Near-Zero WCET Overhead. In 35th Euromicro Conference on Real-Time Systems (ECRTS 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 262, pp. 15:1-15:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ECRTS.2023.15

Abstract

Time-critical systems require timing guarantees, so Worst-Case Execution Times (WCET) have to be employed. However, WCET estimation methods usually assume fault-free hardware. If proper actions are not taken, such fault-free WCET approaches become unsafe when faults impact the hardware during execution. The majority of approaches dealing with hardware faults address the impact of faults on the functional behavior of an application, i.e., denial of service and binary correctness. Few approaches address the impact of faults on the application timing behavior, i.e., the time to finish the application, and those that do target faults occurring in memories. However, as transistor sizes in modern technologies are significantly reduced, faults in cores can no longer be considered negligible. This work shows that faults affect not only the functional behavior but can also have a significant impact on the timing behavior of applications. To expose the overall impact of faults, we enhance vulnerability analysis to include not only functional but also timing correctness, and show that faults impact WCET estimations. As common techniques to deal with faults, such as watchdog timers and re-execution, have large timing overhead for error detection and correction, we propose a mechanism with near-zero and bounded timing overhead. A RISC-V core is used as a case study. The obtained results show that, without protection, faults can lead to an increase of almost 700% in the maximum observed execution time between fault-free and faulty execution, affecting the WCET estimations. In contrast, the proposed mechanism is able to restore fault-free WCET estimations with a bounded overhead of 2 execution cycles.

Subject Classification

ACM Subject Classification
  • General and reference → Reliability
  • General and reference → Measurement
  • Hardware → Error detection and error correction
  • Hardware → Transient errors and upsets
  • Hardware → Safety critical systems
  • Computer systems organization → Real-time system architecture
Keywords
  • Transient faults
  • Timing impact
  • Near-zero WCET error detection and correction
  • Vulnerability analysis

