Accelerating Dynamically-Typed Languages on Heterogeneous Platforms Using Guards Optimization

Authors: Mohaned Qunaibit, Stefan Brunthaler, Yeoul Na, Stijn Volckaert, Michael Franz




File

LIPIcs.ECOOP.2018.16.pdf
  • Filesize: 1.08 MB
  • 29 pages

Author Details

Mohaned Qunaibit
  • University of California, Irvine
Stefan Brunthaler
  • National Cyber Defense Research Institute CODE, Munich, and SBA Research
Yeoul Na
  • University of California, Irvine
Stijn Volckaert
  • University of California, Irvine
Michael Franz
  • University of California, Irvine

Cite As

Mohaned Qunaibit, Stefan Brunthaler, Yeoul Na, Stijn Volckaert, and Michael Franz. Accelerating Dynamically-Typed Languages on Heterogeneous Platforms Using Guards Optimization. In 32nd European Conference on Object-Oriented Programming (ECOOP 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 109, pp. 16:1-16:29, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.ECOOP.2018.16

Abstract

Scientific applications are ideal candidates for the "heterogeneous computing" paradigm, in which parts of a computation are "offloaded" to available accelerator hardware such as GPUs. However, when such applications are written in dynamic languages such as Python or R, as they increasingly are, things become less straightforward. The same flexibility that makes these languages so appealing to programmers also significantly complicates the problem of automatically and transparently partitioning a program's execution between a CPU and available accelerator hardware without relying on programmer annotations.

A common way of handling the features of dynamic languages is to introduce speculation in conjunction with guards that ascertain the validity of the assumptions made in the speculative computation. Unfortunately, a single guard violation during the execution of "offloaded" code may incur a huge performance penalty and necessitate the complete re-execution of the offloaded computation. In the case of dynamic languages, this problem is compounded by the fact that a full compiler analysis is not always possible ahead of time.

This paper presents MegaGuards, a new approach for speculatively executing dynamic languages on heterogeneous platforms in a fully automatic and transparent manner. Our method translates each target loop into a single static region devoid of any dynamic type features. The dynamic parts are instead handled by a construct that we call a mega guard, which checks all speculative assumptions ahead of its corresponding static region. Notably, the advantages of MegaGuards are not limited to heterogeneous computing: because it removes guards from compute-intensive loops, the approach also improves sequential performance.

We have implemented MegaGuards along with an automatic loop-parallelization backend in ZipPy, a Python virtual machine. The results of a careful and detailed evaluation reveal significant speedups of an order of magnitude on average, with a maximum speedup of up to two orders of magnitude, compared to the original ZipPy performance as a baseline. These results demonstrate the potential for applying heterogeneous computing to dynamic languages.
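To make the core idea concrete, here is a minimal, self-contained Python sketch of hoisting guards out of a hot loop. This is an illustration only, not the paper's ZipPy/Truffle implementation; the names `sum_scale`, `mega_guard`, and the per-element type check are all hypothetical simplifications (a real system would consult type tags on homogeneous arrays rather than scan every element).

```python
def generic_sum_scale(xs, factor):
    # Generic path: every iteration re-checks types, as a naive
    # dynamic-language interpreter would via per-operation guards.
    total = 0.0
    for x in xs:
        if not isinstance(x, (int, float)):
            raise TypeError("unsupported element type")
        total += x * factor
    return total

def mega_guard(xs, factor):
    # One up-front check of all speculative assumptions for the loop.
    return isinstance(factor, (int, float)) and \
           all(type(x) is float for x in xs)

def specialized_sum_scale(xs, factor):
    # "Static region": assumes the mega guard held, so the loop body
    # contains no dynamic checks and could be offloaded as a whole.
    total = 0.0
    for x in xs:
        total += x * factor
    return total

def sum_scale(xs, factor):
    if mega_guard(xs, factor):
        return specialized_sum_scale(xs, factor)  # guard-free fast path
    return generic_sum_scale(xs, factor)          # generic fallback

print(sum_scale([1.0, 2.0, 3.0], 2))  # fast path -> 12.0
print(sum_scale([1, 2, 3], 2))        # guard fails, fallback -> 12.0
```

The key property the sketch shows is that a guard failure is detected *before* the specialized region runs, so the fallback never has to undo partially executed offloaded work.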

ACM Subject Classification
  • Software and its engineering → Interpreters
  • Software and its engineering → Just-in-time compilers
  • Software and its engineering → Dynamic compilers
Keywords
  • Type Specialization
  • Guards Optimization
  • Automatic Heterogeneous Computing
  • Automatic Parallelism

