Generating and Exploiting Deep Learning Variants to Increase Heterogeneous Resource Utilization in the NVIDIA Xavier

Authors: Roger Pujol, Hamid Tabani, Leonidas Kosmidis, Enrico Mezzetti, Jaume Abella, Francisco J. Cazorla



Author Details

Roger Pujol
  • Universitat Politecnica de Catalunya (UPC), Spain
  • Barcelona Supercomputing Center (BSC), Spain
Hamid Tabani
  • Barcelona Supercomputing Center (BSC), Spain
Leonidas Kosmidis
  • Barcelona Supercomputing Center (BSC), Spain
Enrico Mezzetti
  • Barcelona Supercomputing Center (BSC), Spain
Jaume Abella
  • Barcelona Supercomputing Center (BSC), Spain
Francisco J. Cazorla
  • Barcelona Supercomputing Center (BSC), Spain

Cite As

Roger Pujol, Hamid Tabani, Leonidas Kosmidis, Enrico Mezzetti, Jaume Abella, and Francisco J. Cazorla. Generating and Exploiting Deep Learning Variants to Increase Heterogeneous Resource Utilization in the NVIDIA Xavier. In 31st Euromicro Conference on Real-Time Systems (ECRTS 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 133, pp. 23:1-23:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.ECRTS.2019.23

Abstract

Deep learning-based solutions and, in particular, deep neural networks (DNNs) are at the heart of several functionalities in critical real-time embedded systems (CRTES), from vision-based perception (object detection and tracking) to trajectory planning. As a result, several DNN instances run simultaneously on the same computing platform. However, while modern GPU-based platforms offer a variety of computing elements (e.g., CPUs, GPUs, and specific accelerators) on which those DNN tasks can be executed depending on their computational requirements and temporal constraints, current DNNs are mainly programmed to exploit only one of them, namely the regular cores in the GPU. This creates resource imbalance and under-utilization of platform resources when executing several DNN instances, which increases the execution time requirements of DNN tasks. In this paper, (a) we develop different variants (implementations) of well-known DNN libraries used in the Apollo Autonomous Driving (AD) software for each of the computing elements of the latest NVIDIA Xavier SoC. Each variant can be configured to balance resource requirements and performance: the regular CPU core implementation can run on 2, 4, or 6 cores; the GPU regular and Tensor core variants can run on 4 or 8 of the GPU's Streaming Multiprocessors (SMs); and the accelerator variant can run on 1 or 2 NVIDIA Deep Learning Accelerators (NVDLAs); (b) we show that each particular variant/configuration offers a different resource utilization/performance point; finally, (c) we show how those heterogeneous computing elements can be exploited by a static scheduler to sustain the execution of multiple and diverse DNN variants on the same platform.
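
To make the configuration space in the abstract concrete, the following Python sketch enumerates the listed variants and checks which combinations could run side by side without oversubscribing a computing element. It is an illustration only and not the paper's scheduler: the names `Variant`, `CAPACITY`, `fits`, and `co_schedulable` are hypothetical, the capacities are assumed to equal the largest configuration mentioned for each resource, and regular and Tensor core variants are assumed to compete for the same SMs.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Variant:
    element: str  # "cpu", "gpu", "tensor", or "nvdla"
    units: int    # CPU cores, SMs, or NVDLA instances used


# Configurations listed in the abstract.
VARIANTS = (
    [Variant("cpu", n) for n in (2, 4, 6)]
    + [Variant("gpu", n) for n in (4, 8)]
    + [Variant("tensor", n) for n in (4, 8)]
    + [Variant("nvdla", n) for n in (1, 2)]
)

# Assumption: capacity per resource is the largest configuration in the abstract.
CAPACITY = {"cpu": 6, "sm": 8, "nvdla": 2}

# Assumption: regular and Tensor core variants both occupy SMs.
RESOURCE = {"cpu": "cpu", "gpu": "sm", "tensor": "sm", "nvdla": "nvdla"}


def fits(assignment):
    """Return True if the variants can run concurrently within CAPACITY."""
    used = {resource: 0 for resource in CAPACITY}
    for variant in assignment:
        used[RESOURCE[variant.element]] += variant.units
    return all(used[r] <= CAPACITY[r] for r in CAPACITY)


def co_schedulable(num_tasks):
    """Enumerate every way to give each DNN instance one variant that fits."""
    return [a for a in product(VARIANTS, repeat=num_tasks) if fits(a)]


if __name__ == "__main__":
    print(len(VARIANTS), "variant configurations")
    print(len(co_schedulable(2)), "feasible ordered pairs of variants")
```

A static scheduler in the spirit of point (c) would additionally need a measured execution time for each variant/configuration and the timing constraints of each task; those values are deliberately left out here because they are not given in the abstract.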

Subject Classification

ACM Subject Classification
  • Computer systems organization → Neural networks
  • Computer systems organization → System on a chip
  • Computing methodologies → Graphics processors
Keywords
  • Deep Neural Network (DNN)
  • GPU
  • Heterogeneous Resources
