Recovery Time Considerations in Real-Time Systems Employing Software Fault Tolerance

Authors Anand Bhat, Soheil Samii, Ragunathan (Raj) Rajkumar

Author Details

Anand Bhat
  • Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Soheil Samii
  • General Motors R&D, Warren, MI, USA and Linköping University, Sweden
Ragunathan (Raj) Rajkumar
  • Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Cite As

Anand Bhat, Soheil Samii, and Ragunathan (Raj) Rajkumar. Recovery Time Considerations in Real-Time Systems Employing Software Fault Tolerance. In 30th Euromicro Conference on Real-Time Systems (ECRTS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 106, pp. 23:1-23:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


Safety-critical real-time systems, such as modern automobiles with advanced driving-assist features, must employ redundancy for crucial software tasks to tolerate permanent crash faults. This redundancy can be achieved using techniques such as active replication or the primary-backup approach. In such systems, the recovery time, which is the amount of time it takes for a redundant task to take over execution after the failure of a primary task, becomes a very important design parameter. The recovery time for a given task depends on various factors, including task allocation, primary and redundant task priorities, system load, and the scheduling policy. Each task can also have a different recovery time requirement (RTR). For example, in automobiles with automated driving features, safety-critical tasks such as perception and steering control have strict RTRs, whereas such requirements are more relaxed for tasks such as heating control and mission planning. In this paper, we analyze the recovery time for software tasks in a real-time system employing Rate-Monotonic Scheduling (RMS). We derive bounds on the recovery times for different redundant-task options and propose techniques to determine the redundant-task type for a task so that its RTR is satisfied. We also address the fault-tolerant task allocation problem, with the additional constraint of satisfying the RTR of each task in the system. Given that assigning tasks to processors is a well-known NP-hard bin-packing problem, we propose computationally efficient heuristics to find a feasible allocation of tasks and their redundant copies. We also apply the simulated annealing method to the fault-tolerant task allocation problem with RTR constraints and compare the results against our heuristics.
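
The allocation problem described above can be illustrated with a minimal sketch. This is not the heuristics or the recovery-time analysis proposed in the paper: it is a generic first-fit-decreasing bin-packing pass that assumes one active replica per task, places each replica on a different processor from its primary, and uses the Liu and Layland utilization bound as a sufficient RMS schedulability test. RTR constraints are deliberately left out, and all names (`Task`, `rms_bound`, `fits`, `allocate`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    wcet: float    # worst-case execution time
    period: float  # period; deadline assumed equal to period

    @property
    def utilization(self) -> float:
        return self.wcet / self.period


def rms_bound(n: int) -> float:
    """Liu and Layland utilization bound for n tasks under RMS."""
    return n * (2 ** (1.0 / n) - 1)


def fits(assigned, task) -> bool:
    """Sufficient (not necessary) RMS test for adding `task` to a processor."""
    total = sum(t.utilization for t in assigned) + task.utilization
    return total <= rms_bound(len(assigned) + 1)


def allocate(tasks, num_processors):
    """First-fit decreasing placement of each task and one replica.

    Returns a list with one entry per processor, each a list of
    "name:copy" labels, or None if some copy cannot be placed.
    Assumption: no RTR check is performed here.
    """
    processors = [[] for _ in range(num_processors)]
    labels = [[] for _ in range(num_processors)]
    for task in sorted(tasks, key=lambda t: t.utilization, reverse=True):
        primary_cpu = None
        for copy in ("primary", "replica"):
            for cpu in range(num_processors):
                if cpu == primary_cpu:
                    continue  # replica must not share a processor with its primary
                if fits(processors[cpu], task):
                    processors[cpu].append(task)
                    labels[cpu].append(f"{task.name}:{copy}")
                    if copy == "primary":
                        primary_cpu = cpu
                    break
            else:
                return None  # no processor can host this copy
    return labels


# Illustrative task set; the numbers are made up for this example.
tasks = [Task("perception", 4, 10), Task("steering", 2, 10), Task("hvac", 1, 20)]
result_two = allocate(tasks, 2)
result_one = allocate(tasks, 1)
```

With two processors, all primaries land on the first processor and all replicas on the second; with a single processor the function returns `None`, because a replica has nowhere to go. An RTR-aware allocator would additionally have to check a recovery-time bound for each placement, which this sketch does not attempt.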

Subject Classification

ACM Subject Classification
  • Software and its engineering → Software fault tolerance
  • Software and its engineering → Real-time systems software
  • Computer systems organization → Real-time systems

Keywords
  • fault tolerance
  • real-time embedded systems
  • recovery time
  • real-time schedulability



