Parallelism-Aware High-Performance Cache Coherence with Tight Latency Bounds

Authors Reza Mirosanlou, Mohamed Hassan, Rodolfo Pellizzoni




File

LIPIcs.ECRTS.2022.16.pdf
  • Filesize: 2.37 MB
  • 27 pages

Author Details

Reza Mirosanlou
  • University of Waterloo, Canada
Mohamed Hassan
  • McMaster University, Hamilton, Canada
Rodolfo Pellizzoni
  • University of Waterloo, Canada

Acknowledgements

We would like to thank the anonymous reviewers for their valuable feedback, and our shepherd for helping to significantly improve this paper. This work has been supported in part by NSERC, CMC Microsystems, and TII. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the sponsors.

Cite As

Reza Mirosanlou, Mohamed Hassan, and Rodolfo Pellizzoni. Parallelism-Aware High-Performance Cache Coherence with Tight Latency Bounds. In 34th Euromicro Conference on Real-Time Systems (ECRTS 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 231, pp. 16:1-16:27, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.ECRTS.2022.16

Abstract

In Commercial-Off-The-Shelf (COTS) systems-on-chip, processing elements communicate data through a shared memory hierarchy and a coherent high-performance interconnect, where the de facto standard for handling shared data is a coherence protocol. Driven by the demands of modern real-time embedded applications to generate, process, and communicate massive amounts of data, recent efforts aim to ensure timing predictability while integrating cache coherence in multi-core real-time systems. However, we observe that most of these efforts sacrifice average system performance in exchange for predictability guarantees. Motivated by this observation, this work proposes an arbiter that provides a predictable, coherent shared cache hierarchy with negligible performance degradation compared to COTS solutions. We achieve this goal by adopting a high-performance architecture that includes a split-transaction bus and a bankized shared cache. In addition, all accesses are arbitrated through a global ordering mechanism. Our proposed arbiter operates alongside conventional coherence protocols without requiring any protocol modifications. Furthermore, we leverage the Duetto reference model by pairing the proposed arbiter with a high-performance arbiter. We evaluate our solution on both synthetic and SPLASH-3 benchmarks, showing that it significantly outperforms the state of the art in predictable cache coherence while offering COTS-level performance.
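
The idea of pairing a predictable arbiter with a high-performance one can be pictured with a small toy model. The sketch below is illustrative only and is not the arbiter from the paper: the `Request` fields, the "prefer hits" performance policy, and the switching rule based on request age, a fixed `bound`, and a fixed `slot` length are all assumptions made for the example. It shows a selector that uses the throughput-oriented policy by default and falls back to oldest-first global ordering when some pending request risks exceeding its latency bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    core: int      # issuing core (hypothetical field)
    bank: int      # target shared-cache bank (hypothetical field)
    arrival: int   # cycle at which the request was issued
    is_hit: bool   # hint used only by the toy performance policy


def predictable_pick(pending):
    """Global ordering: serve the oldest request, ties broken by core id."""
    return min(pending, key=lambda r: (r.arrival, r.core))


def performance_pick(pending):
    """Toy throughput-oriented policy: prefer hits, then the oldest request."""
    return min(pending, key=lambda r: (not r.is_hit, r.arrival, r.core))


def paired_pick(pending, now, bound, slot):
    """Use the performance policy unless some request is at risk.

    Assumption: a request is 'at risk' when waiting one more slot would
    push its age past the bound. A real design would use a worst-case
    latency estimate here instead of this simplistic check.
    """
    at_risk = any(now - r.arrival + slot > bound for r in pending)
    if at_risk:
        return predictable_pick(pending)
    return performance_pick(pending)


pending = [
    Request(core=0, bank=0, arrival=0, is_hit=False),
    Request(core=1, bank=1, arrival=5, is_hit=True),
]
early = paired_pick(pending, now=6, bound=20, slot=4)   # no request at risk
late = paired_pick(pending, now=18, bound=20, slot=4)   # core 0 is at risk
```

In this toy run, the early decision favors the hit from core 1, while the late decision reverts to the oldest request from core 0 so that its bound is respected.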

Subject Classification

ACM Subject Classification
  • Computer systems organization → Real-time system architecture
  • Computer systems organization → Embedded hardware
Keywords
  • Predictability
  • Cache
  • COTS
  • Arbitration
  • Real-time system
