Evaluating the Ability of Large Language Models to Reason About Cardinal Directions (Short Paper)

Authors: Anthony G Cohn, Robert E Blackwell




File

LIPIcs.COSIT.2024.28.pdf
  • Filesize: 0.68 MB
  • 9 pages

Author Details

Anthony G Cohn
  • School of Computing, University of Leeds, UK
Robert E Blackwell
  • Alan Turing Institute, London, UK

Acknowledgements

We thank the anonymous referees for their helpful comments. We also thank Microsoft Research - Accelerating Foundation Models Research program, for the provision of Azure resources to access GPT which were used in the early stages of the work.

Cite As

Anthony G Cohn and Robert E Blackwell. Evaluating the Ability of Large Language Models to Reason About Cardinal Directions (Short Paper). In 16th International Conference on Spatial Information Theory (COSIT 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 315, pp. 28:1-28:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.COSIT.2024.28

Abstract

We investigate the abilities of a representative set of Large Language Models (LLMs) to reason about cardinal directions (CDs). To do so, we create two datasets: the first, co-created with ChatGPT, focuses largely on recall of world knowledge about CDs; the second is generated from a set of templates, comprehensively testing an LLM's ability to determine the correct CD given a particular scenario. The templates allow for a number of degrees of variation, such as the means of locomotion of the agent involved, and whether the scenario is set in the first, second or third person. Our experiments show that although LLMs perform well on the simpler dataset, on the second, more complex dataset no LLM is able to reliably determine the correct CD, even with a temperature setting of zero.
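The template-based generation described above can be sketched as follows. This is a minimal illustration, assuming a question wording similar in spirit to the paper's templates; the exact phrasing, variable sets, and helper names here are hypothetical and are not the authors' actual dataset code.

```python
from itertools import product

# Grammatical person forms: (statement subject, question inversion).
# These particular forms and the locomotion list are assumptions for
# illustration, not the paper's exact template variables.
PERSON_FORMS = {
    "first": ("I am", "am I"),
    "second": ("You are", "are you"),
    "third": ("She is", "is she"),
}
LOCOMOTION = ["walking", "cycling", "driving"]
DIRECTIONS = ["north", "east", "south", "west"]

def right_of(direction: str) -> str:
    """Return the cardinal direction 90 degrees clockwise."""
    return DIRECTIONS[(DIRECTIONS.index(direction) + 1) % 4]

def generate_questions():
    """Yield (question, gold answer) pairs over person x locomotion x facing."""
    for (subj, query), move, facing in product(
            PERSON_FORMS.values(), LOCOMOTION, DIRECTIONS):
        question = (f"{subj} {move} {facing} and then turn to the right. "
                    f"Which cardinal direction {query} now facing?")
        yield question, right_of(facing)
```

Crossing the variation axes with `itertools.product` yields 36 scenario/answer pairs here (3 persons x 3 means of locomotion x 4 headings), each with a mechanically derived gold answer against which an LLM's response can be scored.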

ACM Subject Classification
  • Computing methodologies → Spatial and physical reasoning
Keywords
  • Large Language Models
  • Spatial Reasoning
  • Cardinal Directions

