Novel Methodologies for Predictable CPU-To-GPU Command Offloading

Cavicchioli, Roberto; Capodieci, Nicola; Solieri, Marco; Bertogna, Marko

doi:10.4230/LIPIcs.ECRTS.2019.22

Abstract

There is increasing industrial and academic interest in a more predictable characterization of real-time tasks on high-performance heterogeneous embedded platforms, where a host system offloads parallel workloads to an integrated accelerator, such as a General-Purpose Graphics Processing Unit (GP-GPU). In this paper, we analyze an important aspect that has not yet been considered in the real-time literature and that may significantly affect real-time performance if not properly treated: the time spent by the CPU submitting GP-GPU operations. We show that the impact of CPU-to-GPU kernel submissions can indeed be relevant for typical real-time workloads, and that it should be properly factored in when deriving an integrated schedulability analysis for the considered platforms. This is the case when an application is composed of many small, consecutive GPU compute/copy operations. While existing techniques mitigate this issue by batching kernel calls into a reduced number of persistent kernel invocations, in this work we present and evaluate three other approaches that are made possible by recently released versions of the NVIDIA CUDA GP-GPU API and by Vulkan, a novel open-standard GPU API that allows improved control of GPU command submissions. We show that this added control may significantly improve application performance and predictability, thanks to a substantial reduction in CPU-to-GPU driver interactions, making Vulkan an interesting candidate for becoming the state-of-the-art API for heterogeneous real-time systems. Our findings are evaluated on a latest-generation NVIDIA Jetson AGX Xavier embedded board, executing typical workloads involving Deep Neural Networks of parameterized complexity.
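
The concern about many small, consecutive submissions can be illustrated with a toy cost model. This is only a minimal sketch, not the analysis developed in the paper: it assumes a fixed CPU cost per driver submission, and every name and number in it (`OffloadJob`, `submit_cost_us`, the 200 operations, the 10 µs cost, the 33 ms period) is hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffloadJob:
    """Hypothetical periodic task that offloads work to the GPU."""
    n_ops: int             # GPU compute/copy operations issued per job
    submit_cost_us: float  # assumed CPU time spent in one driver submission
    period_us: float       # task period


def cpu_submission_time_us(job: OffloadJob, submissions_per_job: int) -> float:
    """CPU time per job spent only on submitting GPU commands."""
    return submissions_per_job * job.submit_cost_us


def cpu_utilization(job: OffloadJob, submissions_per_job: int) -> float:
    """Fraction of one CPU core consumed by submissions alone."""
    return cpu_submission_time_us(job, submissions_per_job) / job.period_us


# Illustrative values only: 200 small operations, 10 us per submission,
# and a 33 ms period (roughly a 30 Hz workload).
job = OffloadJob(n_ops=200, submit_cost_us=10.0, period_us=33_000.0)

per_op = cpu_utilization(job, job.n_ops)  # one submission per operation
batched = cpu_utilization(job, 1)         # a single batched submission

print(f"per-operation submissions: {per_op:.4f} of one core")
print(f"single batched submission: {batched:.6f} of one core")
```

Under these assumptions the submission term grows linearly with the number of operations, which is why a schedulability analysis that ignores it can be optimistic for workloads made of many small kernels.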

