Coupling Memory and Computation for Locality Management

Acar, Umut A.; Blelloch, Guy; Fluet, Matthew; Muller, Stefan K.; Raghunathan, Ram

doi:10.4230/LIPIcs.SNAPL.2015.1

Abstract

We articulate the need for managing (data) locality automatically rather than leaving it to the programmer, especially in parallel programming systems. To this end, we propose techniques for coupling tightly the computation (including the thread scheduler) and the memory manager so that data and computation can be positioned closely in hardware. Such tight coupling of computation and memory management is in sharp contrast with the prevailing practice of considering each in isolation. For example, memory-management techniques usually abstract the computation as an unknown "mutator", which is treated as a "black box". As an example of the approach, in this paper we consider a specific class of parallel computations, nested-parallel computations. Such computations dynamically create a nesting of parallel tasks. We propose a method for organizing memory as a tree of heaps reflecting the structure of the nesting. More specifically, our approach creates a heap for a task if it is separately scheduled on a processor. This allows us to couple garbage collection with the structure of the computation and the way in which it is dynamically scheduled on the processors. This coupling enables taking advantage of locality in the program by mapping it to the locality of the hardware. For example for improved locality a heap can be garbage collected immediately after its task finishes when the heap contents is likely in cache.

Umut A. Acar and Guy Blelloch. 15210: Algorithms: Parallel and seqential, 2015. URL: http://www.cs.cmu.edu/~15210/.
Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3):321-347, 2002.
Todd A. Anderson. Optimizations in a private nursery-based garbage collector. In ISMM, pages 21-30, 2010.
Lars Arge, Michael T. Goodrich, Michael Nelson, and Nodari Sitchinava. Fundamental parallel algorithms for private-cache chip multiprocessors. In SPAA, 2008.
Sven Auhagen, Lars Bergstrom, Matthew Fluet, and John Reppy. Garbage collection for multicore NUMA machines. In MSPC'11: Proceedings of the 2011 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, pages 51-57. ACM Press, June 2011.
Sven Auhagen, Lars Bergstrom, Matthew Fluet, and John H. Reppy. Garbage collection for multicore NUMA machines. In Proceedings of the 2011 ACM SIGPLAN workshop on Memory Systems Performance and Correctness: held in conjunction with PLDI'11, San Jose, CA, USA, June 5, 2011, pages 51-57, 2011.
Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Graph expansion and communication costs of fast matrix multiplication: regular submission. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures, pages 1-12, 2011.
Emery Berger, Kathryn McKinley, Robert Blumofe, and Paul Wilson. Hoard: A scalable memory allocator for multithreaded applications. In ASPLOS, pages 117-128, 2000.
Ganesh Bikshandi, Jia Guo, Daniel Hoeflinger, Gheorghe Almasi, Basilio B. Fraguela, María J. Garzarán, David Padua, and Christoph von Praun. Programming for parallelism and locality with hierarchically tiled arrays. In Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 48-57, 2006.
Guy E. Blelloch, Rezaul A. Chowdhury, Phillip B. Gibbons, Vijaya Ramachandran, Shimin Chen, and Michael Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In In the Proceedings of the 19th ACM-SIAM Symposium on Discrete Algorithms, pages 501-510, 2008.
Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Harsha Vardhan Simhadri. Scheduling irregular parallel computations on hierarchical caches. In Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA'11, pages 355-366, 2011.
Guy E. Blelloch and Phillip B. Gibbons. Effectively sharing a cache among threads. In SPAA, 2004.
Guy E. Blelloch, Phillip B. Gibbons, and Harsha Vardhan Simhadri. Low-depth cache oblivious algorithms. In SPAA, 2010.
Guy E. Blelloch and John Greiner. A provable time and space efficient implementation of NESL. In Proceedings of the 1st ACM SIGPLAN International Conference on Functional Programming, pages 213-225. ACM, 1996.
S. Borkar, P. Dubey, K. Kahn, D. Kuck, H. Mulder, S. Pawlowski, and J. Rattner. Platform 2015: Intel processor and platform evolution for the next decade., 2005.
Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, OOPSLA'05, pages 519-538. ACM, 2005.
Siddhartha Chatterjee. Locality, communication, and code generation for array-parallel languages. In PPSC, pages 656-661, 1995.
Shimin Chen, Phillip B. Gibbons, Michael Kozuch, Vasileios Liaskovitis, Anastassia Ailamaki, Guy E. Blelloch, Babak Falsafi, Limor Fix, Nikos Hardavellas, Todd C. Mowry, and Chris Wilkerson. Scheduling threads for constructive cache sharing on CMPs. In ACM Symposium on Parallel Algorithms and Architectures, SPAA'07, pages 105-115, 2007.
Perry Cheng and Guy Blelloch. A parallel, real-time garbage collector. In PLDI, pages 125-136, 2001.
R.A. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. In International Symposium on Parallel Distributed Processing (IPDPS), pages 1-12, April 2010.
Rezaul Alam Chowdhury and Vijaya Ramachandran. The cache-oblivious gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation. In SPAA, 2007.
Rezaul Alam Chowdhury and Vijaya Ramachandran. Cache-efficient dynamic programming algorithms for multicores. In SPAA, 2008.
Richard Cole and Vijaya Ramachandran. Resource oblivious sorting on multicores. In ICALP, 2010.
Peter J. Denning. The locality principle. Commun. ACM, 48(7):19-24, July 2005.
Damien Doligez and Georges Gonthier. Portable, unobtrusive garbage collection for multiprocessor systems. In POPL, pages 70-83, 1994.
Damien Doligez and Georges Gonthier. Portable, unobtrusive garbage collection for multiprocessor systems. In Conference Record of POPL'94: 21st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Portland, Oregon, USA, January 17-21, 1994, pages 70-83, 1994.
Tamar Domani, Elliot K. Kolodner, Ethan Lewis, Erez Petrank, and Dafna Sheinwald. Thread-local heaps for Java. In ISMM, pages 76-87, 2002.
Kayvon Fatahalian, Timothy Knight, Mike Houston, Mattan Erez, Daniel Horn, Larkhoon Leem, Ji Park, Manman Ren, Alex Aiken, William Dally, and et al. Sequoia: Programming the memory hierarchy. ACMIEEE SC 2006 Conference SC06, 0(November):4-4, 2006.
Matthew Fluet, Mike Rainey, John Reppy, and Adam Shaw. Implicitly-threaded parallelism in Manticore. In ICFP'08: Proceedings of the Thirteenth ACM SIGPLAN International Conference on Functional Programming, pages 119-130. ACM Press, September 2008.
Matthew Fluet, Mike Rainey, John Reppy, and Adam Shaw. Implicitly-threaded parallelism in Manticore. The Journal of Functional Programming, 20(5-6):537-576, November 2010.
Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI, pages 212-223, 1998.
Matteo Frigo and Volker Strumpen. The cache complexity of multithreaded cache oblivious algorithms. In SPAA, 2006.
Sanjay Ghemwat and Paul Menage. TCMalloc : Thread-caching malloc, 2010.
Robert H. Halstead. Multilisp: A language for concurrent symbolic computation. ACM Transactions on Programming Languages and Systems, 7(4):501-538, October 1985.
Maurice Herlihy and J. Eliot B Moss. Lock-free garbage collection for multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 3(3):304-311, May 1992.
Shams Mahmood Imam and Vivek Sarkar. Habanero-java library: a java 8 framework for multicore programming. In 2014 International Conference on Principles and Practices of Programming on the Java Platform Virtual Machines, Languages and Tools, PPPJ'14, Cracow, Poland, September 23-26, 2014, pages 75-86, 2014.
Suresh Jagannathan, Armand Navabi, KC Sivaramakrishnan, and Lukasz Ziarek. The design rationale for Multi-MLton. In ML'10: Proceedings of the ACM SIGPLAN Workshop on ML. ACM, 2010.
Richard Jones, Antony Hosking, and Eliot Moss. The Garbage Collection Handbook : The Art of Automatic Memory Management. CRC Press, 2012.
Gabriele Keller, Manuel M.T. Chakravarty, Roman Leshchinskiy, Simon Peyton Jones, and Ben Lippmeier. Regular, shape-polymorphic, parallel arrays in haskell. In Proceedings of the 15th ACM SIGPLAN international conference on Functional programming, ICFP'10, pages 261-272, 2010.
Milind Kulkarni, Patrick Carribault, Keshav Pingali, Ganesh Ramanarayanan, Bruce Walter, Kavita Bala, and L. Paul Chew. Scheduling strategies for optimistic parallel execution of irregular programs. In Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures, pages 217-228, 2008.
Doug Lea. A java fork/join framework. In Proceedings of the ACM 2000 conference on Java Grande, JAVA'00, pages 36-43, 2000.
Daan Leijen, Wolfram Schulte, and Sebastian Burckhardt. The design of a task parallel library. In Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications, OOPSLA'09, pages 227-242, 2009.
Ran Liu and Haibo Chen. SSMalloc: a low-latency, locality-conscious memory allocator with stable performance scalability. In Proceedings of the Third ACM SIGOPS Asia-Pacific conference on Systems, APSys'12, pages 15-15, Berkeley, CA, USA, 2012. USENIX Association.
Simon Marlow and Simon L. Peyton Jones. Multicore garbage collection with local heaps. In Proceedings of the 10th International Symposium on Memory Management, ISMM 2011, San Jose, CA, USA, June 04 - 05, 2011, pages 21-32, 2011.
Simon Marlow and Simon L. Peyton Jones. Multicore garbage collection with local heaps. In ISMM, pages 21-32, 2011.
Apurva Mehta and Cuong Tran. Optimizing linux memory management for low-latency / high-throughput databases. http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases, 2013.
Robin Milner, Mads Tofte, Robert Harper, and David MacQueen. The Definition of Standard ML (revised). The MIT Press, 1997.
MLton web site. URL: http://www.mlton.org.
Takeshi Ogasawara. NUMA-aware memory manager with dominant-thread-based copying GC. In OOPSLA, pages 377-390, 2009.
Jean-Noël Quintin and Frédéric Wagner. Hierarchical work-stealing. In Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I, EuroPar'10, pages 217-229, Berlin, Heidelberg, 2010. Springer-Verlag.
Patrick Sansom. Dual-mode garbage collection. In Proceedings of the Workshop on the Parallel Implementation of Functional Languages, 1991.
Julian Shun, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Aapo Kyrola, Harsha Vardhan Simhadri, and Kanat Tangwongsan. Brief announcement: the problem based benchmark suite. In Proceedinbgs of the 24th ACM symposium on Parallelism in algorithms and architectures, SPAA'12, pages 68-70, New York, NY, USA, 2012. ACM.
K. C. Sivaramakrishnan, Lukasz Ziarek, and Suresh Jagannathan. Multimlton: A multicore-aware runtime for standard ML. J. Funct. Program., 24(6):613-674, 2014.
Daniel Spoonhower. Scheduling Deterministic Parallel Programs. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 2009.
Daniel Spoonhower, Guy E. Blelloch, Robert Harper, and Phillip B. Gibbons. Space profiling for parallel functional programs. In International Conference on Functional Programming, 2008.
Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM, 33:103-111, August 1990.
Leslie G. Valiant. A bridging model for multicore computing. In Proc. 16th European Symposium on Algorithms, 2008.
Stephen Weeks. Whole-program compilation in MLton. In ML'06: Proceedings of the 2006 workshop on ML, pages 1-1. ACM, 2006.
Wm. A. Wulf and Sally A. McKee. Hitting the memory wall: implications of the obvious. SIGARCH Comput. Archit. News, 23(1):20-24, March 1995.
Jin Zhou and Brian Demsky. Memory management for many-core processors with software configurable locality policies. In ISMM, pages 3-14, 2012.

Coupling Memory and Computation for Locality Management

Authors Umut A. Acar, Guy Blelloch, Matthew Fluet, Stefan K. Muller, Ram Raghunathan

File

Document Identifiers

Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message