Resource Aware GPU Scheduling in Kubernetes Infrastructure

Authors: Aggelos Ferikoglou, Dimosthenis Masouros, Achilleas Tzenetopoulos, Sotirios Xydis, Dimitrios Soudris

  • 12 pages

Author Details

Aggelos Ferikoglou
  • Microprocessors and Digital Systems Laboratory, ECE, National Technical University of Athens, Greece
Dimosthenis Masouros
  • Microprocessors and Digital Systems Laboratory, ECE, National Technical University of Athens, Greece
Achilleas Tzenetopoulos
  • Microprocessors and Digital Systems Laboratory, ECE, National Technical University of Athens, Greece
Sotirios Xydis
  • Department of Informatics and Telematics, DIT, Harokopio University of Athens, Greece
Dimitrios Soudris
  • Microprocessors and Digital Systems Laboratory, ECE, National Technical University of Athens, Greece

Cite As

Aggelos Ferikoglou, Dimosthenis Masouros, Achilleas Tzenetopoulos, Sotirios Xydis, and Dimitrios Soudris. Resource Aware GPU Scheduling in Kubernetes Infrastructure. In 12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2021). Open Access Series in Informatics (OASIcs), Volume 88, pp. 4:1-4:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Abstract

Nowadays, an ever-increasing number of artificial intelligence inference workloads are pushed to and executed on the cloud. To serve and manage these computational demands effectively, data center operators have provisioned their infrastructures with accelerators. For GPUs specifically, support for efficient management is lacking, as state-of-the-art schedulers and orchestrators treat GPUs as typical compute resources, ignoring their unique characteristics and application properties. This phenomenon, combined with the GPU over-provisioning problem, leads to severe resource under-utilization. Even though prior work has addressed this problem by colocating applications on a single accelerator device, its resource-agnostic nature fails to prevent resource under-utilization and quality-of-service violations, especially for latency-critical applications. In this paper, we design a resource-aware GPU scheduling framework that efficiently colocates applications on the same GPU accelerator card. We integrate our solution with Kubernetes, one of the most widely used cloud orchestration frameworks. We show that our scheduler can achieve a 58.8% lower 99th-percentile end-to-end job execution time, while delivering 52.5% higher GPU memory usage, 105.9% higher GPU utilization on average, and 44.4% lower energy consumption on average, compared to state-of-the-art schedulers, for a variety of representative ML workloads.

Subject Classification

ACM Subject Classification
  • Computing methodologies
  • Computer systems organization → Cloud computing
  • Computer systems organization → Heterogeneous (hybrid) systems
  • Hardware → Emerging architectures

Keywords
  • cloud computing
  • GPU scheduling
  • Kubernetes
  • heterogeneity



