Quantifying the Resiliency of Fail-Operational Real-Time Networked Control Systems

Gujarati, Arpan; Nasri, Mitra; Brandenburg, Björn B.

doi:10.4230/LIPIcs.ECRTS.2018.16

Abstract

In time-sensitive, safety-critical systems that must be fail-operational, active replication is commonly used to mitigate transient faults that arise due to electromagnetic interference (EMI). However, designing an effective and well-performing active replication scheme is challenging since replication conflicts with the size, weight, power, and cost constraints of embedded applications. To enable a systematic and rigorous exploration of the resulting tradeoffs, we present an analysis to quantify the resiliency of fail-operational networked control systems against EMI-induced memory corruption, host crashes, and retransmission delays. Since control systems are typically robust to a few failed iterations, e.g., one missed actuation does not crash an inverted pendulum, traditional solutions based on hard real-time assumptions are often too pessimistic. Our analysis reduces this pessimism by modeling a control system's inherent robustness as an (m,k)-firm specification. A case study with an active suspension workload indicates that the analytical bounds closely predict the failure rate estimates obtained through simulation, thereby enabling a meaningful design-space exploration, and also demonstrates the utility of the analysis in identifying non-trivial and non-obvious reliability tradeoffs.
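The (m,k)-firm notion mentioned in the abstract can be illustrated with a small sketch. It assumes the usual reading of such a specification: in every window of k consecutive control-loop iterations, at least m must succeed. The function name `satisfies_mk` and the boolean trace encoding are hypothetical choices made for this illustration, not part of the paper's analysis.

```python
from typing import Sequence


def satisfies_mk(trace: Sequence[bool], m: int, k: int) -> bool:
    """Check a trace of iteration outcomes against an (m,k)-firm constraint.

    trace[i] is True if iteration i succeeded and False if it failed.
    Assumed semantics: every window of k consecutive iterations must
    contain at least m successes.
    """
    if k < 1 or not 0 <= m <= k:
        raise ValueError("require k >= 1 and 0 <= m <= k")
    if len(trace) < k:
        # No complete window exists yet, so nothing can be violated.
        return True

    # Count successes in the first window, then slide it one step at a time.
    successes = sum(trace[:k])
    if successes < m:
        return False
    for i in range(k, len(trace)):
        successes += int(trace[i]) - int(trace[i - k])
        if successes < m:
            return False
    return True


# Hypothetical trace: two consecutive failures near the end.
example = [True, True, False, True, True, False, False, True]
print(satisfies_mk(example, m=2, k=3))  # window (True, False, False) has 1 success
print(satisfies_mk(example, m=1, k=3))
```

Under a hard real-time reading (m = k), any single failed iteration counts as a system failure. A looser (m,k) pair tolerates isolated failures, which is the source of the reduced pessimism the abstract describes.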

