Coupling Memory and Computation for Locality Management

Authors Umut A. Acar, Guy Blelloch, Matthew Fluet, Stefan K. Muller, Ram Raghunathan



PDF
Thumbnail PDF

File

LIPIcs.SNAPL.2015.1.pdf
  • Filesize: 494 kB
  • 14 pages

Document Identifiers

Author Details

Umut A. Acar
Guy Blelloch
Matthew Fluet
Stefan K. Muller
Ram Raghunathan

Cite As Get BibTex

Umut A. Acar, Guy Blelloch, Matthew Fluet, Stefan K. Muller, and Ram Raghunathan. Coupling Memory and Computation for Locality Management. In 1st Summit on Advances in Programming Languages (SNAPL 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 32, pp. 1-14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015) https://doi.org/10.4230/LIPIcs.SNAPL.2015.1

Abstract

We articulate the need for managing (data) locality automatically rather than leaving it to the programmer, especially in parallel programming systems. To this end, we propose techniques for coupling tightly the computation (including the thread scheduler) and the memory manager so that data and computation can be positioned closely in hardware. Such tight coupling of computation and memory management is in sharp contrast with the prevailing practice of considering each in isolation. For example, memory-management techniques usually abstract the computation as an unknown "mutator", which is treated as a "black box". As an example of the approach, in this paper we consider a specific class of parallel computations, nested-parallel computations. Such computations dynamically create a nesting of parallel tasks. We propose a method for organizing memory as a tree of heaps reflecting the structure of the nesting. More specifically, our approach creates a heap for a task if it is separately scheduled on a processor. This allows us to couple garbage collection with the structure of the computation and the way in which it is dynamically scheduled on the processors. This coupling enables taking advantage of locality in the program by mapping it to the locality of the hardware. For example for improved locality a heap can be garbage collected immediately after its task finishes when the heap contents is likely in cache.

Subject Classification

Keywords
  • Parallel computing
  • locality
  • memory management
  • parallel garbage collection
  • functional programming
  • nested parallelism
  • thread scheduling

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Umut A. Acar and Guy Blelloch. 15210: Algorithms: Parallel and seqential, 2015. URL: http://www.cs.cmu.edu/~15210/.
  2. Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3):321-347, 2002. Google Scholar
  3. Todd A. Anderson. Optimizations in a private nursery-based garbage collector. In ISMM, pages 21-30, 2010. Google Scholar
  4. Lars Arge, Michael T. Goodrich, Michael Nelson, and Nodari Sitchinava. Fundamental parallel algorithms for private-cache chip multiprocessors. In SPAA, 2008. Google Scholar
  5. Sven Auhagen, Lars Bergstrom, Matthew Fluet, and John Reppy. Garbage collection for multicore NUMA machines. In MSPC'11: Proceedings of the 2011 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, pages 51-57. ACM Press, June 2011. Google Scholar
  6. Sven Auhagen, Lars Bergstrom, Matthew Fluet, and John H. Reppy. Garbage collection for multicore NUMA machines. In Proceedings of the 2011 ACM SIGPLAN workshop on Memory Systems Performance and Correctness: held in conjunction with PLDI'11, San Jose, CA, USA, June 5, 2011, pages 51-57, 2011. Google Scholar
  7. Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Graph expansion and communication costs of fast matrix multiplication: regular submission. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures, pages 1-12, 2011. Google Scholar
  8. Emery Berger, Kathryn McKinley, Robert Blumofe, and Paul Wilson. Hoard: A scalable memory allocator for multithreaded applications. In ASPLOS, pages 117-128, 2000. Google Scholar
  9. Ganesh Bikshandi, Jia Guo, Daniel Hoeflinger, Gheorghe Almasi, Basilio B. Fraguela, María J. Garzarán, David Padua, and Christoph von Praun. Programming for parallelism and locality with hierarchically tiled arrays. In Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 48-57, 2006. Google Scholar
  10. Guy E. Blelloch, Rezaul A. Chowdhury, Phillip B. Gibbons, Vijaya Ramachandran, Shimin Chen, and Michael Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In In the Proceedings of the 19th ACM-SIAM Symposium on Discrete Algorithms, pages 501-510, 2008. Google Scholar
  11. Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Harsha Vardhan Simhadri. Scheduling irregular parallel computations on hierarchical caches. In Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA'11, pages 355-366, 2011. Google Scholar
  12. Guy E. Blelloch and Phillip B. Gibbons. Effectively sharing a cache among threads. In SPAA, 2004. Google Scholar
  13. Guy E. Blelloch, Phillip B. Gibbons, and Harsha Vardhan Simhadri. Low-depth cache oblivious algorithms. In SPAA, 2010. Google Scholar
  14. Guy E. Blelloch and John Greiner. A provable time and space efficient implementation of NESL. In Proceedings of the 1st ACM SIGPLAN International Conference on Functional Programming, pages 213-225. ACM, 1996. Google Scholar
  15. S. Borkar, P. Dubey, K. Kahn, D. Kuck, H. Mulder, S. Pawlowski, and J. Rattner. Platform 2015: Intel processor and platform evolution for the next decade., 2005. Google Scholar
  16. Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, OOPSLA'05, pages 519-538. ACM, 2005. Google Scholar
  17. Siddhartha Chatterjee. Locality, communication, and code generation for array-parallel languages. In PPSC, pages 656-661, 1995. Google Scholar
  18. Shimin Chen, Phillip B. Gibbons, Michael Kozuch, Vasileios Liaskovitis, Anastassia Ailamaki, Guy E. Blelloch, Babak Falsafi, Limor Fix, Nikos Hardavellas, Todd C. Mowry, and Chris Wilkerson. Scheduling threads for constructive cache sharing on CMPs. In ACM Symposium on Parallel Algorithms and Architectures, SPAA'07, pages 105-115, 2007. Google Scholar
  19. Perry Cheng and Guy Blelloch. A parallel, real-time garbage collector. In PLDI, pages 125-136, 2001. Google Scholar
  20. R.A. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. In International Symposium on Parallel Distributed Processing (IPDPS), pages 1-12, April 2010. Google Scholar
  21. Rezaul Alam Chowdhury and Vijaya Ramachandran. The cache-oblivious gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation. In SPAA, 2007. Google Scholar
  22. Rezaul Alam Chowdhury and Vijaya Ramachandran. Cache-efficient dynamic programming algorithms for multicores. In SPAA, 2008. Google Scholar
  23. Richard Cole and Vijaya Ramachandran. Resource oblivious sorting on multicores. In ICALP, 2010. Google Scholar
  24. Peter J. Denning. The locality principle. Commun. ACM, 48(7):19-24, July 2005. Google Scholar
  25. Damien Doligez and Georges Gonthier. Portable, unobtrusive garbage collection for multiprocessor systems. In POPL, pages 70-83, 1994. Google Scholar
  26. Damien Doligez and Georges Gonthier. Portable, unobtrusive garbage collection for multiprocessor systems. In Conference Record of POPL'94: 21st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Portland, Oregon, USA, January 17-21, 1994, pages 70-83, 1994. Google Scholar
  27. Tamar Domani, Elliot K. Kolodner, Ethan Lewis, Erez Petrank, and Dafna Sheinwald. Thread-local heaps for Java. In ISMM, pages 76-87, 2002. Google Scholar
  28. Kayvon Fatahalian, Timothy Knight, Mike Houston, Mattan Erez, Daniel Horn, Larkhoon Leem, Ji Park, Manman Ren, Alex Aiken, William Dally, and et al. Sequoia: Programming the memory hierarchy. ACMIEEE SC 2006 Conference SC06, 0(November):4-4, 2006. Google Scholar
  29. Matthew Fluet, Mike Rainey, John Reppy, and Adam Shaw. Implicitly-threaded parallelism in Manticore. In ICFP'08: Proceedings of the Thirteenth ACM SIGPLAN International Conference on Functional Programming, pages 119-130. ACM Press, September 2008. Google Scholar
  30. Matthew Fluet, Mike Rainey, John Reppy, and Adam Shaw. Implicitly-threaded parallelism in Manticore. The Journal of Functional Programming, 20(5-6):537-576, November 2010. Google Scholar
  31. Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI, pages 212-223, 1998. Google Scholar
  32. Matteo Frigo and Volker Strumpen. The cache complexity of multithreaded cache oblivious algorithms. In SPAA, 2006. Google Scholar
  33. Sanjay Ghemwat and Paul Menage. TCMalloc : Thread-caching malloc, 2010. Google Scholar
  34. Robert H. Halstead. Multilisp: A language for concurrent symbolic computation. ACM Transactions on Programming Languages and Systems, 7(4):501-538, October 1985. Google Scholar
  35. Maurice Herlihy and J. Eliot B Moss. Lock-free garbage collection for multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 3(3):304-311, May 1992. Google Scholar
  36. Shams Mahmood Imam and Vivek Sarkar. Habanero-java library: a java 8 framework for multicore programming. In 2014 International Conference on Principles and Practices of Programming on the Java Platform Virtual Machines, Languages and Tools, PPPJ'14, Cracow, Poland, September 23-26, 2014, pages 75-86, 2014. Google Scholar
  37. Suresh Jagannathan, Armand Navabi, KC Sivaramakrishnan, and Lukasz Ziarek. The design rationale for Multi-MLton. In ML'10: Proceedings of the ACM SIGPLAN Workshop on ML. ACM, 2010. Google Scholar
  38. Richard Jones, Antony Hosking, and Eliot Moss. The Garbage Collection Handbook : The Art of Automatic Memory Management. CRC Press, 2012. Google Scholar
  39. Gabriele Keller, Manuel M.T. Chakravarty, Roman Leshchinskiy, Simon Peyton Jones, and Ben Lippmeier. Regular, shape-polymorphic, parallel arrays in haskell. In Proceedings of the 15th ACM SIGPLAN international conference on Functional programming, ICFP'10, pages 261-272, 2010. Google Scholar
  40. Milind Kulkarni, Patrick Carribault, Keshav Pingali, Ganesh Ramanarayanan, Bruce Walter, Kavita Bala, and L. Paul Chew. Scheduling strategies for optimistic parallel execution of irregular programs. In Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures, pages 217-228, 2008. Google Scholar
  41. Doug Lea. A java fork/join framework. In Proceedings of the ACM 2000 conference on Java Grande, JAVA'00, pages 36-43, 2000. Google Scholar
  42. Daan Leijen, Wolfram Schulte, and Sebastian Burckhardt. The design of a task parallel library. In Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications, OOPSLA'09, pages 227-242, 2009. Google Scholar
  43. Ran Liu and Haibo Chen. SSMalloc: a low-latency, locality-conscious memory allocator with stable performance scalability. In Proceedings of the Third ACM SIGOPS Asia-Pacific conference on Systems, APSys'12, pages 15-15, Berkeley, CA, USA, 2012. USENIX Association. Google Scholar
  44. Simon Marlow and Simon L. Peyton Jones. Multicore garbage collection with local heaps. In Proceedings of the 10th International Symposium on Memory Management, ISMM 2011, San Jose, CA, USA, June 04 - 05, 2011, pages 21-32, 2011. Google Scholar
  45. Simon Marlow and Simon L. Peyton Jones. Multicore garbage collection with local heaps. In ISMM, pages 21-32, 2011. Google Scholar
  46. Apurva Mehta and Cuong Tran. Optimizing linux memory management for low-latency / high-throughput databases. http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases, 2013.
  47. Robin Milner, Mads Tofte, Robert Harper, and David MacQueen. The Definition of Standard ML (revised). The MIT Press, 1997. Google Scholar
  48. MLton web site. URL: http://www.mlton.org.
  49. Takeshi Ogasawara. NUMA-aware memory manager with dominant-thread-based copying GC. In OOPSLA, pages 377-390, 2009. Google Scholar
  50. Jean-Noël Quintin and Frédéric Wagner. Hierarchical work-stealing. In Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I, EuroPar'10, pages 217-229, Berlin, Heidelberg, 2010. Springer-Verlag. Google Scholar
  51. Patrick Sansom. Dual-mode garbage collection. In Proceedings of the Workshop on the Parallel Implementation of Functional Languages, 1991. Google Scholar
  52. Julian Shun, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Aapo Kyrola, Harsha Vardhan Simhadri, and Kanat Tangwongsan. Brief announcement: the problem based benchmark suite. In Proceedinbgs of the 24th ACM symposium on Parallelism in algorithms and architectures, SPAA'12, pages 68-70, New York, NY, USA, 2012. ACM. Google Scholar
  53. K. C. Sivaramakrishnan, Lukasz Ziarek, and Suresh Jagannathan. Multimlton: A multicore-aware runtime for standard ML. J. Funct. Program., 24(6):613-674, 2014. Google Scholar
  54. Daniel Spoonhower. Scheduling Deterministic Parallel Programs. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 2009. Google Scholar
  55. Daniel Spoonhower, Guy E. Blelloch, Robert Harper, and Phillip B. Gibbons. Space profiling for parallel functional programs. In International Conference on Functional Programming, 2008. Google Scholar
  56. Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM, 33:103-111, August 1990. Google Scholar
  57. Leslie G. Valiant. A bridging model for multicore computing. In Proc. 16th European Symposium on Algorithms, 2008. Google Scholar
  58. Stephen Weeks. Whole-program compilation in MLton. In ML'06: Proceedings of the 2006 workshop on ML, pages 1-1. ACM, 2006. Google Scholar
  59. Wm. A. Wulf and Sally A. McKee. Hitting the memory wall: implications of the obvious. SIGARCH Comput. Archit. News, 23(1):20-24, March 1995. Google Scholar
  60. Jin Zhou and Brian Demsky. Memory management for many-core processors with software configurable locality policies. In ISMM, pages 3-14, 2012. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail