DynaSOAr: A Parallel Memory Allocator for Object-Oriented Programming on GPUs with Efficient Memory Access

Springer, Matthias; Masuhara, Hidehiko

doi:10.4230/LIPIcs.ECOOP.2019.17

Abstract

Object-oriented programming has long been regarded as too inefficient for SIMD high-performance computing, despite the fact that many important HPC applications have an inherent object structure. On SIMD accelerators, including GPUs, this is mainly due to performance problems with memory allocation and memory access: a few libraries support parallel memory allocation directly on accelerator devices, but all of them suffer from uncoalesced memory accesses. We discovered a broad class of object-oriented programs with many important real-world applications that can be implemented efficiently on massively parallel SIMD accelerators. We call this class Single-Method Multiple-Objects (SMMO), because parallelism is expressed by running a method on all objects of a type. To make fast GPU programming available to domain experts who are less experienced in GPU programming, we developed DynaSOAr, a CUDA framework for SMMO applications. DynaSOAr consists of (1) a fully parallel, lock-free, dynamic memory allocator, (2) a data layout DSL and (3) an efficient, parallel do-all operation. DynaSOAr achieves performance superior to state-of-the-art GPU memory allocators by controlling both memory allocation and memory access. DynaSOAr improves the usage of allocated memory with a Structure of Arrays (SOA) data layout and achieves low memory fragmentation through efficient management of free and allocated memory blocks with lock-free, hierarchical bitmaps. Unlike other allocators, our design is heavily based on atomic operations, trading raw (de)allocation performance for better overall application performance. In our benchmarks, DynaSOAr achieves a speedup of application code of up to 3x over state-of-the-art allocators. Moreover, DynaSOAr manages heap memory more efficiently than other allocators, allowing programmers to run up to 2x larger problem sizes with the same amount of memory.
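The three ingredients named in the abstract (SOA layout, bitmap-managed slots updated with atomic operations, and a do-all over all objects of a type) can be illustrated with a minimal sketch. It is a host-side C++ stand-in, not CUDA and not DynaSOAr's API: `BodyBlock`, `allocate`, `deallocate`, `do_all`, `demo` and the fields `pos` and `vel` are hypothetical names chosen for illustration, and a single 64-bit bitmap is assumed in place of the hierarchical bitmaps the abstract describes.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// All names here are hypothetical; this is not DynaSOAr's interface.
constexpr int kSlots = 64;  // objects per block; one bitmap bit per slot

// One block of "Body" objects in SOA form: every field is its own array,
// so neighbouring objects' copies of one field are adjacent in memory.
struct BodyBlock {
    std::atomic<std::uint64_t> used{0};  // bit i set => slot i is allocated
    float pos[kSlots] = {};
    float vel[kSlots] = {};
};

// Claim a free slot with an atomic OR instead of a lock.
// Returns the slot index, or -1 if the block is full.
int allocate(BodyBlock& b) {
    std::uint64_t seen = b.used.load();
    while (~seen != 0) {  // at least one bit is still clear
        int slot = 0;
        while (seen & (std::uint64_t{1} << slot)) ++slot;  // lowest clear bit
        const std::uint64_t bit = std::uint64_t{1} << slot;
        const std::uint64_t before = b.used.fetch_or(bit);
        if ((before & bit) == 0) return slot;  // this caller set the bit
        seen = before | bit;                   // lost the race; try another
    }
    return -1;
}

// Release a slot by clearing its bit.
void deallocate(BodyBlock& b, int slot) {
    assert(slot >= 0 && slot < kSlots);
    b.used.fetch_and(~(std::uint64_t{1} << slot));
}

// Sequential stand-in for a parallel do-all: apply f to every allocated
// slot. On a GPU, each slot would be handled by its own thread.
template <typename F>
void do_all(BodyBlock& b, F f) {
    const std::uint64_t live = b.used.load();
    for (int i = 0; i < kSlots; ++i)
        if (live & (std::uint64_t{1} << i)) f(b, i);
}

// Usage: allocate three objects, free one, then advance the survivors.
float demo() {
    BodyBlock block;
    for (int i = 0; i < 3; ++i) {
        const int s = allocate(block);
        block.pos[s] = 1.0f;
        block.vel[s] = 0.5f;
    }
    deallocate(block, 1);
    do_all(block, [](BodyBlock& b, int i) { b.pos[i] += b.vel[i]; });
    return block.pos[0] + block.pos[1] + block.pos[2];
}
```

In this sketch, consecutive slots read consecutive elements of `pos` and `vel`, which is the access pattern that coalescing favours when one thread handles each slot. The freed slot is skipped by `do_all` because its bit is clear, and a later `allocate` can reuse it.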

