Dionysios Kefallinos, Georgios Alexandris, Alexis Maras, Panagiotis Chaidos, Manil Dev Gomony, Henk Corporaal, Dimitrios Soudris, Sotirios Xydis
Licensed under the Creative Commons Attribution 4.0 International license (CC BY 4.0)
Since the emergence of transformer-based models, the computational demands of Large Language Model (LLM) inference have grown exponentially, driven by compounding parameter sizes, structural complexity, and the use of non-linear functions. This trend creates a pressing need to deploy LLMs on low-power edge devices and DNN accelerators in order to fuel next-generation agentic AI systems. Coarse-Grained Reconfigurable Architectures (CGRAs) have proven to be a compelling paradigm for edge acceleration, combining the programmability of general-purpose platforms with the high performance and energy efficiency associated with ASICs. In this work, we introduce an end-to-end performance modeling and mapping framework for LLM inference on heterogeneous CGRAs. Our methodology enables rapid exploration of the micro-architectural design space, i.e., the number of processing elements, vector sizes, and memory configurations, through an accurate, explainable, and analytical CGRA performance modeling methodology with an average cycle error of 0.9%. Architecturally, we build upon R-Blocks, a heterogeneous CGRA platform, and extend it with support for floating-point arithmetic as well as a full-stack compilation and mapping flow for both full-precision (FP32) and quantized (INT8) Llama2 models. Evaluated on a 22nm technology node, the proposed methodology achieves superior peak performance per Watt compared to related works such as REVAMP and CFEACT (1.8× and 2.8× higher, respectively).
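To make the design-space exploration idea concrete, the minimal sketch below shows a first-order, roofline-style analytical cycle estimate for a matrix-vector kernel (the dominant operation in LLM inference), parameterized by PE count, vector width, and memory bandwidth. The parameter names, the max()-based formulation, and the swept values are illustrative assumptions for this sketch, not the performance model published in the paper.

```python
# Illustrative sketch only: a first-order analytical cycle model for a
# matrix-vector product on a vectorized CGRA. Parameters and the
# roofline-style max() formulation are assumptions, not the paper's model.

from dataclasses import dataclass
from math import ceil


@dataclass
class CgraConfig:
    num_pes: int              # number of processing elements performing MACs
    vector_width: int         # SIMD lanes per processing element
    mem_words_per_cycle: int  # words the local memory can supply per cycle


def gemv_cycles(rows: int, cols: int, cfg: CgraConfig) -> int:
    """Lower-bound cycle estimate for a (rows x cols) matrix-vector product."""
    macs = rows * cols
    # Compute-bound term: MACs spread across all lanes of all PEs.
    compute = ceil(macs / (cfg.num_pes * cfg.vector_width))
    # Memory-bound term: every matrix element is streamed once from memory.
    memory = ceil(macs / cfg.mem_words_per_cycle)
    return max(compute, memory)


if __name__ == "__main__":
    # Sweep a toy design space for a hypothetical 4096 x 4096 projection layer.
    for pes in (4, 8, 16):
        for vw in (4, 8):
            cfg = CgraConfig(num_pes=pes, vector_width=vw, mem_words_per_cycle=16)
            print(f"PEs={pes:2d} lanes={vw} -> {gemv_cycles(4096, 4096, cfg)} cycles")
```

Such a model makes each estimate explainable (each term maps to a named hardware resource) and cheap enough to sweep over many PE, vector-size, and memory configurations before committing to a design point.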
@InProceedings{kefallinos_et_al:OASIcs.PARMA-DITAM.2026.8,
author = {Kefallinos, Dionysios and Alexandris, Georgios and Maras, Alexis and Chaidos, Panagiotis and Gomony, Manil Dev and Corporaal, Henk and Soudris, Dimitrios and Xydis, Sotirios},
title = {{Performance Modeling \& Mapping of LLM Inference on Heterogeneous Vectorized CGRAs}},
booktitle = {17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 15th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2026)},
pages = {8:1--8:14},
series = {Open Access Series in Informatics (OASIcs)},
ISBN = {978-3-95977-416-1},
ISSN = {2190-6807},
year = {2026},
volume = {141},
editor = {Baroffio, Davide and Busia, Paola and Denisov, Lev and Shukla, Nitin},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.PARMA-DITAM.2026.8},
URN = {urn:nbn:de:0030-drops-256752},
doi = {10.4230/OASIcs.PARMA-DITAM.2026.8},
annote = {Keywords: Edge AI, LLM, CGRA, Heterogeneous Architectures, Performance Modeling, Hardware Acceleration, Low Power Computing}
}