BifurKTM: Approximately Consistent Distributed Transactional Memory for GPUs

Irving, Samuel; Peng, Lu; Busch, Costas; Peir, Jih-Kwon

doi:10.4230/OASIcs.PARMA-DITAM.2021.2

Document Identifiers

DOI: 10.4230/OASIcs.PARMA-DITAM.2021.2
URN: urn:nbn:de:0030-drops-136386

Author Details

Samuel Irving

Louisiana State University, Baton Rouge, LA, USA

Lu Peng

Louisiana State University, Baton Rouge, LA, USA

Costas Busch

Augusta University, GA, USA

Jih-Kwon Peir

University of Florida, Gainesville, FL, USA

Cite As

Samuel Irving, Lu Peng, Costas Busch, and Jih-Kwon Peir. BifurKTM: Approximately Consistent Distributed Transactional Memory for GPUs. In 12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2021). Open Access Series in Informatics (OASIcs), Volume 88, pp. 2:1-2:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/OASIcs.PARMA-DITAM.2021.2

Abstract

We present BifurKTM (BkTM), the first read-optimized Distributed Transactional Memory system for GPU clusters. The BifurKTM design includes GPU KoSTM, a new software transactional memory conflict detection scheme that exploits relaxed consistency to increase throughput, and KoDTM, a Distributed Transactional Memory model that combines the Data-flow and Control-flow models to greatly reduce communication overheads. Despite the allure of huge speedups, GPUs see limited use because they are difficult to program and extremely sensitive to workload characteristics. These become daunting concerns when considering a distributed GPU cluster, wherein a programmer must design algorithms to hide communication latency by exploiting data regularity, high compute intensity, etc. The BifurKTM design allows GPU programmers to exploit a new workload characteristic: the percentage of the workload that is Read-Only (i.e., reads but does not modify shared memory), even when this percentage is not known in advance. Programmers designate transactions that are suitable for Approximate Consistency, in which transactions "appear" to execute at the most convenient time for preventing conflicts. By leveraging Approximate Consistency for Read-Only transactions, the BifurKTM runtime system offers improved performance, application flexibility, and programmability without introducing any errors into shared memory. Our experiments show that Approximate Consistency can improve BkTM performance by up to 34x in applications with moderate network communication utilization and a read-intensive workload. Using Approximate Consistency, BkTM can reduce GPU-to-GPU network communication by 99%, reduce the number of aborts by up to 100%, and achieve an average speedup of 18x over a similarly sized CPU cluster while requiring minimal effort from the programmer.
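
The idea that a Read-Only transaction can "appear" to execute at a convenient earlier time can be illustrated with a small sketch. The code below is not the BifurKTM implementation or its API. It is a toy, single-threaded, multi-version store written in Python, and every name in it (`VersionedStore`, `read_at`, `commit`, `read_only_sum`) is hypothetical. It assumes one common way to realize the idea: read-only work is served from an older consistent snapshot and so never aborts, while updating transactions are still validated strictly, so no incorrect value is ever written to shared state.

```python
class VersionedStore:
    """Toy multi-version store (hypothetical; not BifurKTM's API)."""

    def __init__(self, initial):
        self.clock = 0  # global commit counter
        # Each key maps to a list of (commit_version, value), oldest first.
        self.history = {k: [(0, v)] for k, v in initial.items()}

    def read_at(self, key, ts):
        """Return the newest value of `key` committed at or before `ts`."""
        for version, value in reversed(self.history[key]):
            if version <= ts:
                return value
        raise KeyError(key)

    def latest_version(self, key):
        return self.history[key][-1][0]

    def commit(self, start_ts, read_set, writes):
        """Strictly validated update: abort (return False) if any key that
        was read has been overwritten since the transaction started."""
        for key in read_set:
            if self.latest_version(key) > start_ts:
                return False
        self.clock += 1
        for key, value in writes.items():
            self.history[key].append((self.clock, value))
        return True


def read_only_sum(store, keys, snapshot_ts):
    """Read-only transaction: reads one consistent (possibly stale)
    snapshot, so it never conflicts with writers and never aborts."""
    return sum(store.read_at(k, snapshot_ts) for k in keys)


store = VersionedStore({"a": 10, "b": 20})
snapshot = store.clock  # a read-only transaction begins here

# An update moves 5 units from "a" to "b" and commits first.
ok_first = store.commit(0, {"a", "b"}, {"a": 5, "b": 25})

# The read-only transaction still sees a consistent total from its snapshot.
total_old = read_only_sum(store, ["a", "b"], snapshot)
total_new = read_only_sum(store, ["a", "b"], store.clock)

# An updating transaction that started before the transfer must abort.
ok_stale = store.commit(0, {"a"}, {"a": 0})
```

In this sketch the read-only sum is the same under either snapshot because the transfer preserves the total, and the stale updater is rejected, so shared state is never corrupted. How BifurKTM chooses the point at which a Read-Only transaction appears to execute is described in the paper itself, not here.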

Subject Classification

ACM Subject Classification

Computer systems organization → Heterogeneous (hybrid) systems

Keywords

GPU
Distributed Transactional Memory
Approximate Consistency
