Generating and Exploiting Deep Learning Variants to Increase Heterogeneous Resource Utilization in the NVIDIA Xavier

Pujol, Roger; Tabani, Hamid; Kosmidis, Leonidas; Mezzetti, Enrico; Abella, Jaume; Cazorla, Francisco J.

doi:10.4230/LIPIcs.ECRTS.2019.23

Abstract

Deep learning-based solutions and, in particular, deep neural networks (DNNs) are at the heart of several functionalities in critical real-time embedded systems (CRTES), from vision-based perception (object detection and tracking) systems to trajectory planning. As a result, several DNN instances run simultaneously on the same computing platform. However, while modern GPU-based platforms offer a variety of computing elements (e.g. CPUs, GPUs, and specific accelerators) on which those DNN tasks can be executed depending on their computational requirements and temporal constraints, current DNNs are mainly programmed to exploit only one of them, namely the regular cores in the GPU. This creates resource imbalance and under-utilization of platform resources when executing several DNN instances, which increases the execution time requirements of DNN tasks. In this paper, (a) we develop different variants (implementations) of well-known DNN libraries used in the Apollo Autonomous Driving (AD) software for each of the computing elements of the latest NVIDIA Xavier SoC. Each variant can be configured to balance resource requirements and performance: the regular CPU core implementation, which can run on 2, 4, or 6 cores; the GPU regular and Tensor core variants, which can run on 4 or 8 of the GPU's Streaming Multiprocessors (SMs); and the NVIDIA Deep Learning Accelerator (NVDLA) variant, which can use 1 or 2 NVDLA instances; (b) we show that each particular variant/configuration offers a different resource utilization/performance point; finally, (c) we show how those heterogeneous computing elements can be exploited by a static scheduler to sustain the execution of multiple and diverse DNN variants on the same platform.
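
The configuration space described in the abstract can be illustrated with a minimal Python sketch. It enumerates the variant/configuration pairs listed above and checks which combinations could be co-located without oversubscribing a computing element. The names `Variant`, `CAPACITY`, `fits`, and `feasible_assignments` are hypothetical, and the per-element capacities are an assumption taken from the largest configuration the abstract mentions. This is not the static scheduler evaluated in the paper: it ignores execution times, periods, and deadlines.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Variant:
    element: str  # computing element: "cpu", "gpu" or "nvdla"
    kind: str     # implementation flavour on that element
    units: int    # CPU cores, GPU SMs or NVDLA instances used


# Variant/configuration pairs as listed in the abstract.
VARIANTS = (
    [Variant("cpu", "regular", n) for n in (2, 4, 6)]
    + [Variant("gpu", k, n) for k in ("regular", "tensor") for n in (4, 8)]
    + [Variant("nvdla", "nvdla", n) for n in (1, 2)]
)

# ASSUMPTION: capacity per element equals the largest configuration
# mentioned in the abstract (6 cores, 8 SMs, 2 NVDLA instances).
CAPACITY = {"cpu": 6, "gpu": 8, "nvdla": 2}


def fits(assignment):
    """Return True if the variants can run side by side within CAPACITY."""
    used = {element: 0 for element in CAPACITY}
    for variant in assignment:
        used[variant.element] += variant.units
    return all(used[e] <= CAPACITY[e] for e in CAPACITY)


def feasible_assignments(num_tasks):
    """Enumerate ordered variant choices, one per task, that fit together."""
    return [
        combo
        for combo in product(VARIANTS, repeat=num_tasks)
        if fits(combo)
    ]


if __name__ == "__main__":
    print(len(VARIANTS), "variant configurations")
    print(len(feasible_assignments(2)), "feasible two-task combinations")
```

Under these assumptions, a two-task workload can, for example, pair a 4-SM Tensor core variant with a 4-SM regular GPU variant, or move one task to the NVDLA to leave the GPU free, which is the kind of trade-off the abstract attributes to its variant/configuration points.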

