Quantifying the Resiliency of Fail-Operational Real-Time Networked Control Systems

Authors: Arpan Gujarati, Mitra Nasri, Björn B. Brandenburg

Author Details

Arpan Gujarati
  • Max Planck Institute for Software Systems (MPI-SWS), Kaiserslautern, Germany
Mitra Nasri
  • Max Planck Institute for Software Systems (MPI-SWS), Kaiserslautern, Germany
Björn B. Brandenburg
  • Max Planck Institute for Software Systems (MPI-SWS), Kaiserslautern, Germany

Cite As

Arpan Gujarati, Mitra Nasri, and Björn B. Brandenburg. Quantifying the Resiliency of Fail-Operational Real-Time Networked Control Systems. In 30th Euromicro Conference on Real-Time Systems (ECRTS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 106, pp. 16:1-16:24, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


In time-sensitive, safety-critical systems that must be fail-operational, active replication is commonly used to mitigate transient faults that arise due to electromagnetic interference (EMI). However, designing an effective and well-performing active replication scheme is challenging since replication conflicts with the size, weight, power, and cost constraints of embedded applications. To enable a systematic and rigorous exploration of the resulting tradeoffs, we present an analysis to quantify the resiliency of fail-operational networked control systems against EMI-induced memory corruption, host crashes, and retransmission delays. Since control systems are typically robust to a few failed iterations, e.g., one missed actuation does not crash an inverted pendulum, traditional solutions based on hard real-time assumptions are often too pessimistic. Our analysis reduces this pessimism by modeling a control system's inherent robustness as an (m,k)-firm specification. A case study with an active suspension workload indicates that the analytical bounds closely predict the failure rate estimates obtained through simulation, thereby enabling a meaningful design-space exploration, and also demonstrates the utility of the analysis in identifying non-trivial and non-obvious reliability tradeoffs.
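To make the (m,k)-firm idea concrete, the following minimal Python sketch checks a sequence of control-iteration outcomes against such a specification. It assumes the common convention that at least m of any k consecutive iterations must succeed. The function names (`satisfies_mk_firm`, `iterations_until_violation`) and the independent per-iteration failure probability `p_fail` are illustrative assumptions and are not taken from the paper's analysis.

```python
from collections import deque
import random


def satisfies_mk_firm(outcomes, m, k):
    """Return True if every window of k consecutive outcomes has >= m successes.

    Assumed convention: (m,k)-firm means at least m of any k consecutive
    iterations succeed. Sequences shorter than k contain no complete
    window and are treated as satisfying the constraint.
    """
    window = deque(maxlen=k)
    successes = 0
    for ok in outcomes:
        ok = bool(ok)
        if len(window) == k:
            successes -= window[0]  # oldest outcome is about to be evicted
        window.append(ok)
        successes += ok
        if len(window) == k and successes < m:
            return False
    return True


def iterations_until_violation(p_fail, m, k, rng, max_iters=10**6):
    """Simulate iterations that fail independently with probability p_fail.

    Hypothetical helper: returns the 1-based iteration index at which the
    (m,k)-firm constraint is first violated, or None if no violation
    occurs within max_iters iterations.
    """
    window = deque(maxlen=k)
    successes = 0
    for i in range(1, max_iters + 1):
        ok = rng.random() >= p_fail
        if len(window) == k:
            successes -= window[0]
        window.append(ok)
        successes += ok
        if len(window) == k and successes < m:
            return i
    return None


if __name__ == "__main__":
    # One missed iteration in every three still meets a (2,3)-firm constraint.
    print(satisfies_mk_firm([1, 1, 0, 1, 1, 0, 1, 1], m=2, k=3))
    # Two consecutive misses do not.
    print(satisfies_mk_firm([1, 0, 0, 1], m=2, k=3))
    rng = random.Random(42)
    print(iterations_until_violation(0.05, m=2, k=3, rng=rng))
```

Under a hard real-time assumption, the first failed iteration would already count as a system failure; the sliding-window check above counts a failure only when too many iterations fail within one window, which is the source of the reduced pessimism described in the abstract.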

Subject Classification

ACM Subject Classification
  • Computer systems organization → Embedded and cyber-physical systems

Keywords
  • probabilistic analysis
  • reliability analysis
  • networked control systems

