Accelerating Dynamically-Typed Languages on Heterogeneous Platforms Using Guards Optimization

Authors Mohaned Qunaibit , Stefan Brunthaler, Yeoul Na, Stijn Volckaert, Michael Franz



PDF
Thumbnail PDF

File

LIPIcs.ECOOP.2018.16.pdf
  • Filesize: 1.08 MB
  • 29 pages

Document Identifiers

Author Details

Mohaned Qunaibit
  • University of California, Irvine
Stefan Brunthaler
  • National Cyber Defense Research Institute CODE, Munich, and SBA Research
Yeoul Na
  • University of California, Irvine
Stijn Volckaert
  • University of California, Irvine
Michael Franz
  • University of California, Irvine

Cite As Get BibTex

Mohaned Qunaibit, Stefan Brunthaler, Yeoul Na, Stijn Volckaert, and Michael Franz. Accelerating Dynamically-Typed Languages on Heterogeneous Platforms Using Guards Optimization. In 32nd European Conference on Object-Oriented Programming (ECOOP 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 109, pp. 16:1-16:29, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018) https://doi.org/10.4230/LIPIcs.ECOOP.2018.16

Abstract

Scientific applications are ideal candidates for the "heterogeneous computing" paradigm, in which parts of a computation are "offloaded" to available accelerator hardware such as GPUs. However, when such applications are written in dynamic languages such as Python or R, as they increasingly are, things become less straightforward. The same flexibility that makes these languages so appealing to programmers also significantly complicates the problem of automatically and transparently partitioning a program's execution between a CPU and available accelerator hardware without having to rely on programmer annotations.
A common way of handling the features of dynamic languages is by introducing speculation in conjunction with guards to ascertain the validity of assumptions made in the speculative computation. Unfortunately, a single guard violation during the execution of "offloaded" code may result in a huge performance penalty and necessitate the complete re-execution of the offloaded computation. In the case of dynamic languages, this problem is compounded by the fact that a full compiler analysis is not always possible ahead of time.
This paper presents MegaGuards, a new approach for speculatively executing dynamic languages on heterogeneous platforms in a fully automatic and transparent manner. Our method translates each target loop into a single static region devoid of any dynamic type features. The dynamic parts are instead handled by a construct that we call a mega guard which checks all the speculative assumptions ahead of its corresponding static region. Notably, the advantage of MegaGuards is not limited to heterogeneous computing; because it removes guards from compute-intensive loops, the approach also improves sequential performance.
We have implemented MegaGuards along with an automatic loop parallelization backend in ZipPy, a Python Virtual Machine. The results of a careful and detailed evaluation reveal very significant speedups of an order of magnitude on average with a maximum speedup of up to two orders of magnitudes when compared to the original ZipPy performance as a baseline. These results demonstrate the potential for applying heterogeneous computing to dynamic languages.

Subject Classification

ACM Subject Classification
  • Software and its engineering → Interpreters
  • Software and its engineering → Just-in-time compilers
  • Software and its engineering → Dynamic compilers
Keywords
  • Type Specialization
  • Guards Optimization
  • Automatic Heterogeneous Computing
  • Automatic Parallelism

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Keith Adams, Jason Evans, Bertrand Maher, Guilherme Ottoni, Andrew Paroski, Brett Simmers, Edwin Smith, and Owen Yamauchi. The hiphop virtual machine. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages &Applications, OOPSLA '14, pages 777-790, New York, NY, USA, 2014. ACM. Google Scholar
  2. Joshua Auerbach, David F. Bacon, Ioana Burcea, Perry Cheng, Stephen J. Fink, Rodric Rabbah, and Sunil Shukla. A compiler and runtime for heterogeneous computing. In Proceedings of the 49th Annual Design Automation Conference, DAC '12, pages 271-276. ACM, 2012. Google Scholar
  3. Muthu Manikandan Baskaran, J. Ramanujam, and P. Sadayappan. Automatic c-to-cuda code generation for affine programs. In Proceedings of the International Conference on Compiler Construction, CC'10, pages 244-263. Springer-Verlag, 2010. Google Scholar
  4. Michael Bebenita, Florian Brandner, Manuel Fahndrich, Francesco Logozzo, Wolfram Schulte, Nikolai Tillmann, and Herman Venter. Spur: A trace-based jit compiler for cil. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '10, pages 708-725, New York, NY, USA, 2010. ACM. Google Scholar
  5. James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010. Google Scholar
  6. Carl Friedrich Bolz, Antonio Cuni, Maciej Fijalkowski, and Armin Rigo. Pypy, 2017. URL: http://pypy.org/.
  7. Kevin J. Brown, HyoukJoong Lee, Tiark Rompf, Arvind K. Sujeeth, Christopher De Sa, Christopher Aberger, and Kunle Olukotun. Have abstraction and eat performance, too: Optimized heterogeneous computing with parallel patterns. In Proceedings of the 2016 International Symposium on Code Generation and Optimization, CGO 2016, pages 194-205. ACM, 2016. Google Scholar
  8. Stefan Brunthaler. Efficient interpretation using quickening. In Proceedings of the 6th Symposium on Dynamic Languages, DLS '10, pages 1-14. ACM, 2010. Google Scholar
  9. Stefan Brunthaler. Inline caching meets quickening. In Proceedings of the 24th European Conference on Object-oriented Programming, ECOOP'10, pages 429-451. Springer-Verlag, 2010. Google Scholar
  10. Michael Burke, Ron Cytron, Jeanne Ferrante, and Wilson Hsieh. Automatic generation of nested, fork-join parallelism. The Journal of Supercomputing, 3(2):71-88, 1989. Google Scholar
  11. Bryan Catanzaro, Michael Garland, and Kurt Keutzer. Copperhead: Compiling an embedded data parallel language. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP '11, pages 47-56. ACM, 2011. Google Scholar
  12. Manuel M.T. Chakravarty, Gabriele Keller, Sean Lee, Trevor L. McDonell, and Vinod Grover. Accelerating haskell array codes with multicore gpus. In Proceedings of the Sixth Workshop on Declarative Aspects of Multicore Programming, DAMP '11, pages 3-14. ACM, 2011. Google Scholar
  13. Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization, IISWC '09, pages 44-54. IEEE Computer Society, 2009. Google Scholar
  14. Shuai Che, Jeremy W. Sheaffer, Michael Boyer, Lukasz G. Szafaryn, Liang Wang, and Kevin Skadron. A characterization of the rodinia benchmark suite with comparison to contemporary cmp workloads. In Proceedings of the IEEE International Symposium on Workload Characterization, IISWC '10, pages 1-11. IEEE Computer Society, 2010. Google Scholar
  15. Continuum Analytics. Numba benchmark suite, 2017. URL: https://github.com/numba/numba-benchmark.
  16. L Peter Deutsch and Allan M Schiffman. Efficient implementation of the smalltalk-80 system. In Proceedings of the 11th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, pages 297-302. ACM, 1984. Google Scholar
  17. G. Dot, A. Martinez, and A. Gonzalez. Removing checks in dynamically typed languages through efficient profiling. In 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 257-268, Feb 2017. Google Scholar
  18. Christophe Dubach, Perry Cheng, Rodric Rabbah, David F. Bacon, and Stephen J. Fink. Compiling a high-level language for gpus: (via language support for architectures and compilers). In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '12, pages 1-12. ACM, 2012. Google Scholar
  19. Gilles Duboscq, Thomas Würthinger, Lukas Stadler, Christian Wimmer, Doug Simon, and Hanspeter Mössenböck. An intermediate representation for speculative optimizations in a dynamic compiler. In Proceedings of the 7th ACM Workshop on Virtual Machines and Intermediate Languages, VMIL '13, pages 1-10, New York, NY, USA, 2013. ACM. URL: http://dx.doi.org/10.1145/2542142.2542143.
  20. Jeanne Ferrante, Karl J Ottenstein, and Joe D Warren. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems (TOPLAS), 9(3):319-349, 1987. Google Scholar
  21. Juan Fumero, Michel Steuwer, Lukas Stadler, and Christophe Dubach. Just-in-time gpu compilation for interpreted languages with partial evaluation. In Proceedings of the 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE '17, pages 60-73, New York, NY, USA, 2017. ACM. Google Scholar
  22. Juan José Fumero, Toomas Remmelg, Michel Steuwer, and Christophe Dubach. Runtime code generation and data management for heterogeneous computing in java. In Proceedings of the Principles and Practices of Programming on The Java Platform, PPPJ '15, pages 16-26. ACM, 2015. Google Scholar
  23. Rahul Garg and José Nelson Amaral. Compiling python to a hybrid execution environment. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU-3, pages 19-30. ACM, 2010. Google Scholar
  24. Tobias Grosser, Albert Cohen, Justin Holewinski, P. Sadayappan, and Sven Verdoolaege. Hybrid hexagonal/classical tiling for gpus. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO '14, pages 66:66-66:75. ACM, 2014. Google Scholar
  25. Tobias Grosser, Armin Groesslinger, and Christian Lengauer. Polly - performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters, 22(04):1250010, 2012. Google Scholar
  26. Gregory D. Hager, Mark D. Hill, and Katherine Yelick. Opportunities and challenges for next generation computing, 2015. Google Scholar
  27. Stephan Herhut, Richard L. Hudson, Tatiana Shpeisman, and Jaswanth Sreeram. River trail: A path to parallelism in javascript. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '13, pages 729-744, New York, NY, USA, 2013. ACM. Google Scholar
  28. Kazuaki Ishizaki, Akihiro Hayashi, Gita Koblents, and Vivek Sarkar. Compiling and optimizing java 8 programs for gpu execution. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation, PACT '15, pages 419-431. IEEE Computer Society, 2015. Google Scholar
  29. Ivan Jibaja, Peter Jensen, Ningxin Hu, Mohammad R. Haghighat, John McCutchan, Dan Gohman, Stephen M. Blackburn, and Kathryn S. McKinley. Vector parallelism in javascript: Language and compiler support for simd. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation, PACT '15, pages 407-418, Washington, DC, USA, 2015. IEEE Computer Society. Google Scholar
  30. Onur Kayıran, Adwait Jog, Mahmut Taylan Kandemir, and Chita Ranjan Das. Neither more nor less: optimizing thread-level parallelism for gpgpus. In Proceedings of the 22nd international conference on Parallel architectures and compilation techniques, pages 157-166. IEEE Press, 2013. Google Scholar
  31. Onur Kayiran, Nachiappan Chidambaram Nachiappan, Adwait Jog, Rachata Ausavarungnirun, Mahmut T Kandemir, Gabriel H Loh, Onur Mutlu, and Chita R Das. Managing gpu concurrency in heterogeneous architectures. In Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pages 114-126. IEEE, 2014. Google Scholar
  32. Madhukar N. Kedlaya, Jared Roesch, Behnam Robatmili, Mehrdad Reshadi, and Ben Hardekopf. Improved type specialization for dynamic scripting languages. In Proceedings of the 9th Symposium on Dynamic Languages, DLS '13, pages 37-48, New York, NY, USA, 2013. ACM. Google Scholar
  33. Andreas Klöckner, Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul Ivanov, and Ahmed Fasih. Pycuda and pyopencl: A scripting-based approach to gpu run-time code generation. Parallel Computing, 38(3):157-174, 2012. Google Scholar
  34. Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. Numba: A llvm-based python jit compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, LLVM '15, pages 7:1-7:6, New York, NY, USA, 2015. ACM. Google Scholar
  35. Alan Leung, Ondřej Lhoták, and Ghulam Lashari. Automatic parallelization for graphics processing units. In Proceedings of the 7th International Conference on Principles and Practice of Programming in Java, PPPJ '09, pages 91-100. ACM, 2009. Google Scholar
  36. Thibaut Lutz and Vinod Grover. Lambdajit: A dynamic compiler for heterogeneous optimizations of stl algorithms. In Proceedings of the 3rd ACM SIGPLAN Workshop on Functional High-performance Computing, FHPC '14, pages 99-108. ACM, 2014. Google Scholar
  37. X. Ma, J. Li, and N. F. Samatova. Automatic parallelization of scripting languages: Toward transparent desktop parallel computing. In 21th International Parallel and Distributed Processing Symposium (IPDPS 2007), Proceedings, 26-30 March 2007, Long Beach, California, USA, pages 1-6. IEEE, March 2007. URL: http://dx.doi.org/10.1109/IPDPS.2007.370488.
  38. T. J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, SE-2(4):308-320, Dec 1976. Google Scholar
  39. Mojtaba Mehrara, Po-Chun Hsu, Mehrzad Samadi, and Scott Mahlke. Dynamic parallelization of javascript applications using an ultra-lightweight speculation mechanism. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture, HPCA '11, pages 87-98. IEEE Computer Society, 2011. Google Scholar
  40. Yeoul Na, Seon Wook Kim, and Youngsun Han. Javascript parallelizing compiler for exploiting parallelism from data-parallel html5 applications. ACM Trans. Archit. Code Optim., 12(4):64:1-64:25, 2016. Google Scholar
  41. Radford M. Neal. pqr, 2016. URL: http://www.pqr-project.org/.
  42. NVIDIA Corporation. Cuda, 2017. URL: https://developer.nvidia.com/cuda-zone.
  43. NVIDIA Corporation. Nvidia opencl sdk code samples, 2017. URL: https://developer.nvidia.com/opencl.
  44. Michael F. P. O'Boyle, Zheng Wang, and Dominik Grewe. Portable mapping of data parallel programs to opencl for heterogeneous systems. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), CGO '13, pages 1-10. IEEE Computer Society, 2013. Google Scholar
  45. Richard Plangger and Andreas Krall. Vectorization in pypy’s tracing just-in-time compiler. In Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems, SCOPES '16, pages 67-76, New York, NY, USA, 2016. ACM. URL: http://dx.doi.org/10.1145/2906363.2906384.
  46. Philip C. Pratt-Szeliga, James W. Fawcett, and Roy D. Welch. Rootbeer: Seamlessly using gpus from java. In Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems, HPCC '12, pages 375-380. IEEE Computer Society, 2012. Google Scholar
  47. Michel Steuwer, Christian Fensch, Sam Lindley, and Christophe Dubach. Generating performance portable code using rewrite rules: From high-level functional expressions to high-performance opencl code. In Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming, ICFP 2015, pages 205-217. ACM, 2015. Google Scholar
  48. John E. Stone, David Gohara, and Guochun Shi. Opencl: A parallel programming standard for heterogeneous computing systems. IEEE Des. Test, 12(3):66-73, 2010. Google Scholar
  49. Secure Systems and Software Laboratory. Zippy, 2015. URL: https://github.com/securesystemslab/zippy.
  50. Justin Talbot, Zachary DeVito, and Pat Hanrahan. Riposte: A trace-driven compiler and parallel vm for vector code in r. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 43-52. ACM, 2012. Google Scholar
  51. Peng Tu and David Padua. Automatic array privatization. In Compiler optimizations for scalable parallel systems, pages 247-281. Springer, 2001. Google Scholar
  52. Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. Polyhedral parallel code generation for cuda. ACM Trans. Archit. Code Optim., pages 54:1-54:23, 2013. Google Scholar
  53. Sven Verdoolaege and Tobias Grosser. Polyhedral extraction tool. In In Second International Workshop on Polyhedral Compilation Techniques (IMPACT '12), 2012. Google Scholar
  54. Haichuan Wang, David Padua, and Peng Wu. Vectorization of apply to reduce interpretation overhead of r. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, pages 400-415. ACM, 2015. Google Scholar
  55. Haichuan Wang, Peng Wu, and David Padua. Optimizing r vm: Allocation removal and path length reduction via interpreter-level specialization. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO '14, pages 295:295-295:305. ACM, 2014. Google Scholar
  56. Zheng Wang, Daniel Powell, Björn Franke, and Michael O'Boyle. Exploitation of GPUs for the Parallelisation of Probably Parallel Legacy Code, pages 154-173. Springer Berlin Heidelberg, Berlin, Heidelberg, 2014. Google Scholar
  57. Thomas Wurthinger, Christian Wimmer, Christian Humer, Andreas Woss, Lukas Stadler, Chris Seaton, Gilles Duboscq, Doug Simon, and Matthias Grimmer. Practical partial evaluation for high-performance dynamic language runtimes. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, pages 662-676. ACM, 2017. Google Scholar
  58. Thomas Würthinger, Christian Wimmer, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Christian Humer, Gregor Richards, Doug Simon, and Mario Wolczko. One vm to rule them all. In Proceedings of the 2013 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming &Software, Onward! 2013, pages 187-204. ACM, 2013. Google Scholar
  59. Thomas Würthinger, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Doug Simon, and Christian Wimmer. Self-optimizing AST interpreters. In Proceedings of the 8th symposium on Dynamic languages - DLS '12, page 73. ACM Press, 2012. Google Scholar
  60. Thomas Würthinger, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Doug Simon, and Christian Wimmer. Self-optimizing ast interpreters. In Proceedings of the 8th Symposium on Dynamic Languages, DLS '12, pages 73-82, New York, NY, USA, 2012. ACM. URL: http://dx.doi.org/10.1145/2384577.2384587.
  61. Wei Zhang. Efficient Hosted Interpreter for Dynamic Languages. PhD thesis, University of California Irvine, 2015. Google Scholar
  62. Wei Zhang, Per Larsen, Stefan Brunthaler, and Michael Franz. Accelerating iterators in optimizing ast interpreters. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages &Applications, OOPSLA '14, pages 727-743. ACM, 2014. Google Scholar
  63. Haiping Zhao, Iain Proctor, Minghui Yang, Xin Qi, Mark Williams, Qi Gao, Guilherme Ottoni, Andrew Paroski, Scott MacVicar, Jason Evans, and Stephen Tu. The hiphop compiler for php. SIGPLAN Not., pages 575-586, 2012. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail