Accelerator-Driven Data Arrangement to Minimize Transformers Run-Time on Multi-Core Architectures

Authors: Alireza Amirshahi, Giovanni Ansaloni, David Atienza

Author Details

Alireza Amirshahi
  • École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Giovanni Ansaloni
  • École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
David Atienza
  • École Polytechnique Fédérale de Lausanne (EPFL), Switzerland

Cite As

Alireza Amirshahi, Giovanni Ansaloni, and David Atienza. Accelerator-Driven Data Arrangement to Minimize Transformers Run-Time on Multi-Core Architectures. In 15th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 13th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2024). Open Access Series in Informatics (OASIcs), Volume 116, pp. 2:1-2:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/OASIcs.PARMA-DITAM.2024.2

Abstract

The increasing complexity of transformer models in artificial intelligence raises their computational cost, memory usage, and energy consumption. Hardware acceleration tackles the ensuing challenges through processors and accelerators tailored to transformer models, which support their computation hotspots with high efficiency. However, memory bandwidth can limit the gains that hardware accelerators deliver. Against this backdrop, we propose a novel memory arrangement strategy, governed by the hardware accelerator’s kernel size, which effectively minimizes off-chip data access. This arrangement is particularly beneficial for end-to-end transformer model inference, where most of the computation consists of general matrix multiplication (GEMM) operations. Additionally, we address the overhead of non-GEMM operations in transformer models within the scope of this memory data arrangement. Our study explores the implementation and effectiveness of the proposed accelerator-driven data arrangement in both single- and multi-core systems. Our evaluation demonstrates that the approach achieves a speedup of up to 2.7x when running inference on state-of-the-art transformers.
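To make the idea of a kernel-size-governed arrangement concrete, the sketch below reorders a row-major matrix so that every k x k tile occupies one contiguous run of memory, which is a common way to let an accelerator fetch a whole operand tile with a single sequential access. This is an illustrative assumption, not the layout defined in the paper: the abstract does not specify the arrangement, and the names `to_tile_major` and `tile_span` as well as the requirement that dimensions be multiples of `k` are hypothetical.

```python
def to_tile_major(flat, rows, cols, k):
    """Reorder a row-major matrix so each k x k tile is contiguous.

    `flat` is a list of rows * cols values in row-major order.
    `k` stands in for the accelerator kernel size (an assumption of this sketch).
    """
    if len(flat) != rows * cols:
        raise ValueError("flat must hold rows * cols elements")
    if rows % k or cols % k:
        raise ValueError("this sketch requires rows and cols to be multiples of k")
    out = []
    for tile_row in range(0, rows, k):        # top row of the current tile
        for tile_col in range(0, cols, k):    # left column of the current tile
            for r in range(tile_row, tile_row + k):
                start = r * cols + tile_col   # row-major offset of this tile row
                out.extend(flat[start:start + k])
    return out


def tile_span(tile_row_idx, tile_col_idx, cols, k):
    """Return the (start, end) offsets of one tile in the tile-major array."""
    tiles_per_row = cols // k
    start = (tile_row_idx * tiles_per_row + tile_col_idx) * k * k
    return start, start + k * k


# Usage: a 4 x 4 matrix holding 0..15, rearranged for a 2 x 2 kernel.
matrix = list(range(16))
tiled = to_tile_major(matrix, rows=4, cols=4, k=2)
start, end = tile_span(1, 0, cols=4, k=2)
bottom_left_tile = tiled[start:end]
```

In the row-major original, the bottom-left 2 x 2 tile is split across two distant offsets (8-9 and 12-13); after the rearrangement it sits in one contiguous span, so a single sequential read suffices under this sketch's assumptions.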

Subject Classification

ACM Subject Classification
  • Hardware → Hardware-software codesign
  • Computing methodologies → Neural networks
Keywords
  • Memory arrangement
  • Data layout
  • Hardware accelerators
  • Transformer models
  • Multi-core
  • System simulation
