Assessing the (In)Ability of LLMs to Reason in Interval Temporal Logic

Bellodi, Pietro; Casavecchia, Pietro; Paparella, Alberto; Sciavicco, Guido; Stan, Ionel Eduard

doi:10.4230/LIPIcs.TIME.2025.4

Abstract

The logical reasoning skills of Large Language Models (LLMs) are poorly understood and often overstated. Current evaluation suites rely on algebraic or commonsense puzzles that mix reasoning with symbolic manipulation and/or provide static datasets that quickly saturate or leak into pretraining corpora. In purely logical terms, the most relevant reasoning skill is the meta-mathematical task of valid formula recognition, which is at the foundation of higher-level reasoning tasks (including deduction and minimization of assertions, to name just a few). In the current landscape of LLM benchmarking, puzzles are most often stated in propositional or first-order logic, with a few exceptions for point-based temporal logic, such as LTL; yet, in the real world, event-based temporal statements are prevalent, and they are more naturally expressed in interval-based temporal logic. Interval temporal logic offers a much richer variety of problems than, for example, point-based temporal logic: not only do different languages have different expressive power, but the computational complexity of the validity problem can also vary widely. In this paper, we tackle the problem of assessing the ability of LLMs to reason about interval-based statements in the form of validity recognition. We explore whether their accuracy is sensitive to the underlying language, the computational complexity of the associated validity problem, and the intrinsic hardness of the problem in terms of formula length and modal depth. We benchmark several frontier LLMs (Gemma 3 27b It, Llama 4 Maverick, DeepSeek Chat V3 release 0324, Qwen 3 32b, and Qwen 3 235b) and show that, despite apparently impressive performance on algebraic or commonsense benchmarks, they falter on logically rigorous tasks.
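To make the two hardness measures named in the abstract concrete, the following is a minimal sketch of how formula length and modal depth could be computed for an interval temporal logic formula. The class names (`Atom`, `Not`, `And`, `Diamond`), the one-letter relation labels, and the choice to count length as the number of syntax-tree nodes are all assumptions made for illustration; the paper may define these measures differently.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical syntax tree for a modal interval logic formula.
@dataclass(frozen=True)
class Atom:
    name: str


@dataclass(frozen=True)
class Not:
    arg: "Formula"


@dataclass(frozen=True)
class And:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Diamond:
    relation: str  # assumed label for an interval relation, e.g. "A" or "B"
    arg: "Formula"


Formula = Union[Atom, Not, And, Diamond]


def length(phi: Formula) -> int:
    """Number of nodes in the syntax tree (one assumed notion of length)."""
    if isinstance(phi, Atom):
        return 1
    if isinstance(phi, (Not, Diamond)):
        return 1 + length(phi.arg)
    return 1 + length(phi.left) + length(phi.right)


def modal_depth(phi: Formula) -> int:
    """Maximum nesting of modal operators along any branch."""
    if isinstance(phi, Atom):
        return 0
    if isinstance(phi, Not):
        return modal_depth(phi.arg)
    if isinstance(phi, Diamond):
        return 1 + modal_depth(phi.arg)
    return max(modal_depth(phi.left), modal_depth(phi.right))


# Example: <A>p AND NOT <A><B>q
phi = And(Diamond("A", Atom("p")), Not(Diamond("A", Diamond("B", Atom("q")))))
print(length(phi), modal_depth(phi))
```

Under these assumptions, a benchmark could bucket formulas by the pair `(length(phi), modal_depth(phi))` to check whether accuracy degrades as either measure grows.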

