Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems

Authors Ming Yang, Nathan Otterness, Tanya Amert, Joshua Bakita, James H. Anderson, F. Donelson Smith


Author Details

Ming Yang
  • The University of North Carolina at Chapel Hill, USA
Nathan Otterness
  • The University of North Carolina at Chapel Hill, USA
Tanya Amert
  • The University of North Carolina at Chapel Hill, USA
Joshua Bakita
  • The University of North Carolina at Chapel Hill, USA
James H. Anderson
  • The University of North Carolina at Chapel Hill, USA
F. Donelson Smith
  • The University of North Carolina at Chapel Hill, USA

Cite As

Ming Yang, Nathan Otterness, Tanya Amert, Joshua Bakita, James H. Anderson, and F. Donelson Smith. Avoiding Pitfalls when Using NVIDIA GPUs for Real-Time Tasks in Autonomous Systems. In 30th Euromicro Conference on Real-Time Systems (ECRTS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 106, pp. 20:1-20:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


Abstract

NVIDIA's CUDA API has enabled GPUs to be used as computing accelerators across a wide range of applications. This has resulted in performance gains in many application domains, but the underlying GPU hardware and software are subject to many non-obvious pitfalls. This is particularly problematic for safety-critical systems, where worst-case behaviors must be taken into account. While such behaviors were not a key concern for earlier CUDA users, the use of GPUs in autonomous vehicles has taken CUDA programs out of the sole domain of computer-vision and machine-learning experts and into safety-critical processing pipelines. Certification is necessary in this new domain, which is problematic because GPU software may have been developed without any regard for worst-case behaviors. Pitfalls when using CUDA in real-time autonomous systems can result from a lack of specifics in official documentation and from developers of GPU software being unaware of the implications of their design choices with regard to real-time requirements. This paper focuses on the particular challenges facing the real-time community when utilizing CUDA-enabled GPUs for autonomous applications, and on best practices for applying real-time safety-critical principles.

Subject Classification

ACM Subject Classification
  • Computer systems organization → Heterogeneous (hybrid) systems
  • Computer systems organization → Embedded software
  • Computer systems organization → Real-time systems
  • Computer systems organization → Embedded and cyber-physical systems
  • Software and its engineering → Scheduling
  • Software and its engineering → Concurrency control
  • Software and its engineering → Process synchronization

Keywords
  • real-time systems
  • graphics processing units
  • scheduling algorithms
  • parallel computing
  • embedded software



