Consensual Resilient Control: Stateless Recovery of Stateful Controllers

Authors Aleksandar Matovic, Rafal Graczyk, Federico Lucchetti, Marcus Völp




File

LIPIcs.ECRTS.2023.14.pdf
  • Filesize: 2.54 MB
  • 27 pages

Author Details

Aleksandar Matovic
  • Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg, Luxembourg
Rafal Graczyk
  • Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg, Luxembourg
Federico Lucchetti
  • Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg, Luxembourg
Marcus Völp
  • Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg, Luxembourg

Acknowledgements

Thanks to the anonymous reviewers and shepherd for their fruitful comments and suggestions on how to improve this paper. Special thanks go to Martina Maggio and Filip Markovic for their helpful feedback and advice.

Cite As

Aleksandar Matovic, Rafal Graczyk, Federico Lucchetti, and Marcus Völp. Consensual Resilient Control: Stateless Recovery of Stateful Controllers. In 35th Euromicro Conference on Real-Time Systems (ECRTS 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 262, pp. 14:1-14:27, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ECRTS.2023.14

Abstract

Safety-critical systems have to absorb accidental and malicious faults to obtain a high mean time to failure (MTTF). Traditionally, this is achieved through re-execution or replication. However, both techniques come with significant overheads, in particular when cold-start effects are considered. Such effects occur after replicas resume from checkpoints or from their initial state. This work aims to improve the performance of control-task replication by leveraging the inherent stability of many plants, which lets them tolerate occasional control-task deadline misses, and suggests masking faults with just a detection quorum. To make this possible, we have to eliminate cold-start effects so that replicas can rejuvenate during each control cycle. We do so by systematically turning stateful controllers into instances that can be recovered in a stateless manner. We highlight the mechanisms behind this transformation, show how it achieves consensual resilient control, and demonstrate on the example of an inverted pendulum how accidental and maliciously induced faults can be absorbed, even if control tasks run in less predictable environments.

Subject Classification

ACM Subject Classification
  • Computer systems organization → Real-time systems
  • Computer systems organization → Embedded and cyber-physical systems
  • Computer systems organization → Dependable and fault-tolerant systems and networks
Keywords
  • resilience
  • control
  • replication
