Towards Studying the Effect of Compiler Optimizations and Software Randomization on GPU Reliability

Authors Pau López Castillón, Xavier Caricchio Hernández, Leonidas Kosmidis



File

OASIcs.PARMA-DITAM.2025.4.pdf
  • Filesize: 0.5 MB
  • 10 pages

Author Details

Pau López Castillón
  • Universitat Politècnica de Catalunya (UPC), Spain
  • Barcelona Supercomputing Center (BSC), Spain
Xavier Caricchio Hernández
  • Universitat Politècnica de Catalunya (UPC), Spain
  • Barcelona Supercomputing Center (BSC), Spain
Leonidas Kosmidis
  • Barcelona Supercomputing Center (BSC), Spain
  • Universitat Politècnica de Catalunya (UPC), Spain

Cite As

Pau López Castillón, Xavier Caricchio Hernández, and Leonidas Kosmidis. Towards Studying the Effect of Compiler Optimizations and Software Randomization on GPU Reliability. In 16th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 14th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2025). Open Access Series in Informatics (OASIcs), Volume 127, pp. 4:1-4:10, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025) https://doi.org/10.4230/OASIcs.PARMA-DITAM.2025.4

Abstract

The evolution of Graphics Processing Unit (GPU) compilers has facilitated the support for general-purpose programming languages across various architectures. The NVIDIA CUDA Compiler (NVCC) employs multiple compilation levels prior to generating machine code, implementing intricate optimizations to enhance performance. These optimizations influence the manner in which software is mapped to the underlying hardware, which can also impact GPU reliability.
TASA is a source-to-source code randomization tool designed to alter the mapping of software onto the underlying hardware. It achieves this by generating random permutations of variable and function declarations, thereby introducing random padding between declarations of different types and modifying the program memory layout. Because this changes where the declared objects reside in memory, it also changes their cache placement, which affects both their execution time (different conflicts between them result in a different number of cache misses in every execution) and their lifetime in the cache.
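The link between declaration order, padding and cache placement can be illustrated with a small model. The following is a minimal sketch and not TASA itself: the declaration list, the 64-byte line size, the 64 cache sets and the helper names (`layout`, `cache_set`, `randomized_layout`) are all assumptions chosen for illustration.

```python
import random

LINE_SIZE = 64  # assumed cache line size in bytes
NUM_SETS = 64   # assumed number of cache sets

# Hypothetical declarations: (name, size in bytes, alignment in bytes)
DECLS = [("a", 4, 4), ("flag", 1, 1), ("buf", 256, 8), ("x", 8, 8), ("c", 2, 2)]

def layout(decls, base=0):
    """Assign addresses in declaration order, inserting alignment padding."""
    addr = base
    placed = {}
    for name, size, align in decls:
        addr = (addr + align - 1) // align * align  # round up to the alignment
        placed[name] = addr
        addr += size
    return placed

def cache_set(addr):
    """Map an address to a cache set index under the assumed geometry."""
    return (addr // LINE_SIZE) % NUM_SETS

def randomized_layout(decls, seed):
    """Shuffle the declaration order with a seeded generator, then lay out."""
    rng = random.Random(seed)
    shuffled = list(decls)
    rng.shuffle(shuffled)
    return layout(shuffled)

if __name__ == "__main__":
    # Each seed stands for one randomized variant of the same program.
    for seed in range(3):
        placed = randomized_layout(DECLS, seed)
        print(seed, {name: cache_set(addr) for name, addr in placed.items()})
```

Each seed corresponds to one permutation of the declarations. Because objects of different sizes and alignments end up next to different neighbours, the padding and therefore the address and cache set of each object can change from one variant to the next.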
In this work, which is part of the HiPEAC Student Challenge 2025, we first examine the reproducibility of a subset of the data presented in the ACM TACO paper "Assessing the Impact of Compiler Optimizations on GPUs Reliability" [Santos et al., 2024], and second we extend it by combining it with our proposal of software randomization. The paper indicates that the -O3 optimization flag allows the application to complete a larger workload before a failure occurs. By employing TASA, we investigate the impact of software randomization on GPU reliability and performance metrics.
By reproducing the results of the paper on a different GPU platform, we observe the same trend as reported in the original publication. Moreover, our preliminary results with the application of software randomization show in several cases an improved Mean Workload Between Failures (MWBF) compared to the original source code.
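To make the MWBF metric concrete, here is a minimal sketch that assumes a common formulation: the workload completed per run, divided by the expected number of failures per run (execution time multiplied by failure rate). The function name `mwbf`, its parameters and the numbers in the usage lines are illustrative assumptions and are not taken from the paper.

```python
def mwbf(workload_per_run, exec_time_s, failure_rate_per_s):
    """Mean Workload Between Failures under the assumed formulation.

    workload_per_run   -- work units completed by one execution
    exec_time_s        -- execution time of one run, in seconds
    failure_rate_per_s -- observed failures per second of execution
    """
    if exec_time_s <= 0 or failure_rate_per_s <= 0:
        raise ValueError("execution time and failure rate must be positive")
    expected_failures_per_run = exec_time_s * failure_rate_per_s
    return workload_per_run / expected_failures_per_run

if __name__ == "__main__":
    # Illustrative values only: same failure rate, different execution times.
    print(mwbf(1.0, 2.0, 0.5))  # slower variant
    print(mwbf(1.0, 1.0, 0.5))  # faster variant
```

Under this formulation, a variant that runs faster at an unchanged failure rate completes more work between failures, which is why an optimization or a randomized layout can raise MWBF even if it does not lower the error rate.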

Subject Classification

ACM Subject Classification
  • General and reference → Reliability
  • Software and its engineering → Compilers
  • Computing methodologies → Graphics processors
Keywords
  • Graphics processing units
  • reliability
  • software randomization
  • error rate

References

  1. Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In 2009 IEEE International Symposium on Workload Characterization (IISWC), pages 44-54, 2009. URL: https://doi.org/10.1109/IISWC.2009.5306797.
  2. Liliana Cucu-Grosjean, Luca Santinelli, Michael Houston, Code Lo, Tullio Vardanega, Leonidas Kosmidis, Jaume Abella, Enrico Mezzetti, Eduardo Quiñones, and Francisco J. Cazorla. Measurement-Based Probabilistic Timing Analysis for Multi-path Programs. In Robert Davis, editor, 24th Euromicro Conference on Real-Time Systems, ECRTS 2012, Pisa, Italy, July 11-13, 2012, pages 91-101. IEEE Computer Society, 2012. URL: https://doi.org/10.1109/ECRTS.2012.31.
  3. David Steenari et al. On-Board Processing Benchmarks, 2021. URL: http://obpmark.github.io/.
  4. Leonidas Kosmidis, Matina Maria Trompouki, Pau Lopez Castillon, Eric Rufart Blasco, Javier Fernandez Salgado, and Andreas Jung. Open Source Software Randomisation Framework for Probabilistic WCET Prediction on Multicore CPUs, GPUs and Accelerators. In Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), Lecture Notes in Computer Science. Springer, 2024.
  5. Leonidas Kosmidis, Roberto Vargas, David Morales, Eduardo Quiñones, Jaume Abella, and Francisco J. Cazorla. TASA: Toolchain-Agnostic Static Software Randomisation for Critical Real-time Systems. In Frank Liu, editor, Proceedings of the 35th International Conference on Computer-Aided Design, ICCAD 2016, Austin, TX, USA, November 7-10, 2016, page 59. ACM, 2016. URL: https://doi.org/10.1145/2966986.2967078.
  6. Ivan Rodriguez, Leonidas Kosmidis, Jerome Lachaize, Olivier Notebaert, and David Steenari. GPU4S Bench: Design and Implementation of an Open GPU Benchmarking Suite for Space On-board Processing. Technical Report UPC-DAC-RR-CAP-2019-1, Universitat Politècnica de Catalunya, 2019. URL: https://www.ac.upc.edu/app/research-reports/public/html/research_center_index-CAP-2019,en.html.
  7. Ivan Rodriguez Ferrandez, Alvaro Jover Alvarez, Matina Maria Trompouki, Leonidas Kosmidis, and Francisco J. Cazorla. Worst Case Execution Time and Power Estimation of Multicore and GPU Software: A Pedestrian Detection Use Case. Ada Lett., 43(1):111-117, October 2023. URL: https://doi.org/10.1145/3631483.3631502.
  8. Ivan Rodriguez-Ferrandez, Leonidas Kosmidis, Maris Tali, David Steenari, Alex Hands, and Camille Bélanger-Champagne. Proton Evaluation of Single Event Effects in the NVIDIA GPU Orin SoM: Understanding Radiation Vulnerabilities Beyond the SoC. In 30th IEEE International Symposium on On-Line Testing and Robust System Design, IOLTS 2024, Rennes, France, July 3-5, 2024, pages 1-7. IEEE, 2024. URL: https://doi.org/10.1109/IOLTS60994.2024.10616076.
  9. Ivan Rodriguez-Ferrandez, Maris Tali, Leonidas Kosmidis, Marta Rovituso, and David Steenari. Sources of Single Event Effects in the NVIDIA Xavier SoC Family under Proton Irradiation. In Alessandro Savino, Paolo Rech, Stefano Di Carlo, and Dimitris Gizopoulos, editors, 28th IEEE International Symposium on On-Line Testing and Robust System Design, IOLTS 2022, Torino, Italy, September 12-14, 2022, pages 1-7. IEEE, 2022. URL: https://doi.org/10.1109/IOLTS56730.2022.9897236.
  10. Fernando Fernandes Dos Santos, Luigi Carro, Flavio Vella, and Paolo Rech. Assessing the Impact of Compiler Optimizations on GPUs Reliability. ACM Trans. Archit. Code Optim., 21(2), February 2024. URL: https://doi.org/10.1145/3638249.
  11. David Steenari, Leonidas Kosmidis, Ivan Rodríguez-Ferrández, Álvaro Jover-Álvarez, and Kyra Förster. OBPMark (On-Board Processing Benchmarks) - Open Source Computational Performance Benchmarks for Space Applications. In 2nd European Workshop on On-Board Data Processing (OBDP), 2021. URL: https://doi.org/10.5281/zenodo.5638577.
  12. Timothy Tsai, Siva Kumar Sastry Hari, Michael Sullivan, Oreste Villa, and Stephen W. Keckler. NVBitFI: Dynamic Fault Injection for GPUs. In 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 284-291, 2021. URL: https://doi.org/10.1109/DSN48987.2021.00041.
  13. Reinhard Wilhelm, Jakob Engblom, Andreas Ermedahl, Niklas Holsti, Stephan Thesing, David B. Whalley, Guillem Bernat, Christian Ferdinand, Reinhold Heckmann, Tulika Mitra, Frank Mueller, Isabelle Puaut, Peter P. Puschner, Jan Staschulat, and Per Stenström. The Worst-case Execution-time Problem - Overview of Methods and Survey of Tools. ACM Trans. Embed. Comput. Syst., 7(3):36:1-36:53, 2008. URL: https://doi.org/10.1145/1347375.1347389.