Cross Module Quickening - The Curious Case of C Extensions

Berlakovich, Felix; Brunthaler, Stefan

doi:10.4230/LIPIcs.ECOOP.2024.6

Abstract

Dynamic programming languages such as Python offer expressive power and programmer productivity at the expense of performance. Although the topic of optimizing Python has received considerable attention over the years, a key obstacle remains elusive: C extensions. Time and again, optimized run-time environments, such as JIT compilers and optimizing interpreters, fall short of optimizing across C extensions, as they cannot reason about the native code hiding underneath.
To bridge this gap, we present an analysis of C extensions for Python. The analysis data indicates that C extensions come in different varieties. One variety merely speeds up a single task, such as reading a file and processing it directly in C. Another variety offers broad access through an API, resulting in a domain-specific language realized by function calls.
While the former variety of C extensions offers little optimization potential for optimizing run-times, we find that the latter variety does offer considerable optimization potential. This optimization potential rests on dynamic locality that C extensions cannot readily tap. We introduce a new, interpreter-based optimization, called Cross-Module Quickening, that leverages this untapped potential. The key idea is that C extensions can use an optimization interface to register highly optimized operations on C extension-specific datatypes. A quickening interpreter uses this information to continuously specialize programs with C extensions.
To quantify the attainable performance potential of going beyond C extensions, we demonstrate a concrete instantiation of Cross-Module Quickening for the CPython interpreter and the popular NumPy C extension. We evaluate our implementation with the NPBench benchmark suite and report performance improvements by a factor of up to 2.84.
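
The key idea of the abstract can be illustrated with a toy model. The sketch below is not the paper's implementation and does not use CPython's or NumPy's real interfaces. The registry `SPECIALIZATIONS`, the function `register_specialization`, the `Vec` type, and the `Instr` record are all hypothetical names chosen for illustration. It only shows the general shape of the technique: an extension registers a fast operation for its own datatype, and the interpreter rewrites ("quickens") a generic instruction to that operation once it observes matching operand types, falling back to the generic path when the type guard fails.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Hypothetical optimization interface: maps (opcode, left type, right type)
# to a specialized handler supplied by an extension.
SPECIALIZATIONS: Dict[Tuple[str, type, type], Callable] = {}


def register_specialization(opcode: str, left: type, right: type,
                            handler: Callable) -> None:
    """Called by an extension to publish a fast path (assumed interface)."""
    SPECIALIZATIONS[(opcode, left, right)] = handler


class Vec:
    """Stand-in for an extension-defined datatype."""

    def __init__(self, data):
        self.data = list(data)

    def __add__(self, other):
        # Generic path: re-checks the operand type on every call.
        if not isinstance(other, Vec):
            return NotImplemented
        return Vec(a + b for a, b in zip(self.data, other.data))


def vec_add_fast(a: Vec, b: Vec) -> Vec:
    # Specialized path: operand types are guaranteed by the interpreter guard.
    return Vec([x + y for x, y in zip(a.data, b.data)])


register_specialization("BINARY_ADD", Vec, Vec, vec_add_fast)


@dataclass
class Instr:
    opcode: str
    handler: Optional[Callable] = None            # quickened handler, if any
    guard: Optional[Tuple[type, type]] = None     # operand types it is valid for


def execute(instr: Instr, a, b):
    """Execute one BINARY_ADD instruction, quickening it when possible."""
    key = (type(a), type(b))
    if instr.handler is not None and instr.guard == key:
        return instr.handler(a, b)                # quickened fast path
    # Guard failed or not yet quickened: look up a registered specialization
    # for the observed types, then run the generic operation this time.
    instr.handler = SPECIALIZATIONS.get((instr.opcode, *key))
    instr.guard = key if instr.handler is not None else None
    return a + b
```

In this sketch the first execution with two `Vec` operands takes the generic path and installs `vec_add_fast`. Later executions with the same operand types dispatch directly to it, and an execution with other types (for example two integers) resets the instruction to its generic form.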

