Novel Methodologies for Predictable CPU-To-GPU Command Offloading

Authors: Roberto Cavicchioli, Nicola Capodieci, Marco Solieri, Marko Bertogna


Author Details

Roberto Cavicchioli
  • Università di Modena e Reggio Emilia, Italy
Nicola Capodieci
  • Università di Modena e Reggio Emilia, Italy
Marco Solieri
  • Università di Modena e Reggio Emilia, Italy
Marko Bertogna
  • Università di Modena e Reggio Emilia, Italy

Cite As

Roberto Cavicchioli, Nicola Capodieci, Marco Solieri, and Marko Bertogna. Novel Methodologies for Predictable CPU-To-GPU Command Offloading. In 31st Euromicro Conference on Real-Time Systems (ECRTS 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 133, pp. 22:1-22:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


Abstract

There is increasing industrial and academic interest in a more predictable characterization of real-time tasks on high-performance heterogeneous embedded platforms, where a host system offloads parallel workloads to an integrated accelerator such as a General Purpose Graphics Processing Unit (GP-GPU). In this paper, we analyze an important aspect that has not yet been considered in the real-time literature and that may significantly affect real-time performance if not properly treated: the time spent by the CPU in submitting GP-GPU operations. We show that the impact of CPU-to-GPU kernel submissions can indeed be relevant for typical real-time workloads, in particular when an application is composed of many small, consecutive GPU compute/copy operations, and that it should be properly factored in when deriving an integrated schedulability analysis for the considered platforms. While existing techniques mitigate this issue by batching kernel calls into a reduced number of persistent kernel invocations, in this work we present and evaluate three other approaches made possible by recently released versions of the NVIDIA CUDA GP-GPU API and by Vulkan, a novel open-standard GPU API that allows improved control over GPU command submissions. We show that this added control may significantly improve application performance and predictability through a substantial reduction in CPU-to-GPU driver interactions, making Vulkan an interesting candidate for becoming the state-of-the-art API for heterogeneous real-time systems. Our findings are evaluated on a latest-generation NVIDIA Jetson AGX Xavier embedded board, executing typical workloads involving Deep Neural Networks of parameterized complexity.
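To make the abstract's argument about many small submissions concrete, the following is a minimal back-of-the-envelope sketch and not the paper's analysis. It assumes a simple linear model in which every CPU-to-GPU submission costs a fixed CPU time and does not overlap with GPU execution. The function names and all numeric values are hypothetical, chosen only for illustration; they are not measurements from the paper.

```python
def end_to_end_us(n_ops, submit_us, exec_us, submissions):
    """End-to-end time (microseconds) for n_ops GPU operations that are
    grouped into `submissions` CPU-to-GPU submissions.

    Assumed model (hypothetical): each submission costs a fixed submit_us
    on the CPU, each operation takes exec_us on the GPU, and submission
    time never overlaps with execution time.
    """
    return submissions * submit_us + n_ops * exec_us


def submission_share(n_ops, submit_us, exec_us):
    """Fraction of end-to-end time spent on submission when every
    operation is submitted individually (one submission per operation)."""
    total = end_to_end_us(n_ops, submit_us, exec_us, submissions=n_ops)
    return (n_ops * submit_us) / total


if __name__ == "__main__":
    # Hypothetical workload: 100 small operations, 10 us per submission,
    # 20 us of GPU work per operation.
    n_ops, submit_us, exec_us = 100, 10.0, 20.0

    per_op = end_to_end_us(n_ops, submit_us, exec_us, submissions=n_ops)
    batched = end_to_end_us(n_ops, submit_us, exec_us, submissions=1)

    print(f"one submission per operation: {per_op} us")
    print(f"single batched submission:    {batched} us")
    print(f"submission share (per-op):    {submission_share(n_ops, submit_us, exec_us):.2%}")
```

Under these assumed numbers, submission accounts for a third of the end-to-end time when each operation is submitted separately, and the share grows as operations get smaller. This is the regime in which reducing the number of driver interactions would be expected to matter most.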

Subject Classification

ACM Subject Classification
  • Computer systems organization → System on a chip
  • Computer systems organization → Real-time system architecture
Keywords
  • Heterogeneous systems
  • GPU
  • CUDA
  • Vulkan



