Towards Explainable Temporal Reasoning in Large Language Models: A Structure-Aware Generative Framework

About

While large language models (LLMs) show great potential in temporal reasoning, most existing work focuses on improving performance and often neglects the explainable reasoning process behind the results. To address this gap, we introduce a comprehensive benchmark covering a wide range of temporal granularities, designed to systematically evaluate LLMs' capabilities in explainable temporal reasoning. Our findings reveal that LLMs struggle to deliver convincing explanations when relying solely on textual information. To address this challenge, we propose GETER, a novel structure-aware generative framework that integrates Graph structures with text for Explainable TEmporal Reasoning. Specifically, we first leverage temporal knowledge graphs to develop a temporal encoder that captures structural information for the query. Subsequently, we introduce a structure-text prefix adapter to map graph structure features into the text embedding space. Finally, LLMs generate explanation text by integrating the soft graph token with instruction-tuning prompt tokens. Experimental results indicate that GETER achieves state-of-the-art performance and demonstrates strong generalization capabilities. Our dataset and code are available at https://github.com/carryTatum/GETER.

Zihao Jiang, Ben Liu, Miao Peng, Wenjie Xu, Yao Xiao, Zhenyan Shan, Min Peng • 2025
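
The prefix-adapter step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the adapter is a single linear projection, that the temporal encoder has already produced one feature vector per query, and that prompt tokens are already embedded. All names (make_adapter, project, build_inputs) and all dimensions are hypothetical, and the weights are random rather than trained.

```python
import random


def make_adapter(graph_dim, text_dim, seed=0):
    """Hypothetical linear adapter: a (text_dim x graph_dim) weight matrix and a bias."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(graph_dim)]
              for _ in range(text_dim)]
    bias = [0.0] * text_dim
    return weight, bias


def project(adapter, graph_feature):
    """Map a graph feature vector into the text embedding space (W x + b)."""
    weight, bias = adapter
    return [sum(w * x for w, x in zip(row, graph_feature)) + b
            for row, b in zip(weight, bias)]


def build_inputs(adapter, graph_feature, prompt_embeddings):
    """Prepend the projected 'soft graph token' to the prompt token embeddings."""
    soft_token = project(adapter, graph_feature)
    return [soft_token] + list(prompt_embeddings)


# Toy sizes, chosen only for illustration.
graph_dim, text_dim = 4, 6
adapter = make_adapter(graph_dim, text_dim)

# Stand-in for the temporal encoder's output for one query (assumed values).
graph_feature = [0.5, -1.0, 0.25, 2.0]

# Stand-in for three embedded instruction-prompt tokens.
prompt_embeddings = [[0.0] * text_dim for _ in range(3)]

inputs = build_inputs(adapter, graph_feature, prompt_embeddings)
print(len(inputs), len(inputs[0]))  # 4 6
```

In this sketch the sequence handed to the language model has one extra position at the front, with the same width as the text embeddings, so the model can attend to the graph-derived vector like any other token embedding.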

Related benchmarks

Task	Dataset	Result
Explanation Generation	ICEWS14 (test)	BLEU-4 40.54	17
Explanation Generation	GDELT (test)	BLEU-4 34.46	17
Explanation Generation	ICEWS05-15 (test)	BLEU-4 45.98	17
Semantic Similarity	ICEWS18 (test)	BLEU-4 40.39	17
Semantic Similarity	Wiki (test)	BLEU-4 55.52	17
Temporal Reasoning	ICEWS14 (test)	Positive Score 77.45	17
Temporal Reasoning	GDELT (test)	Positive Accuracy 63.77	17
Temporal Reasoning	ICEWS05-15 (test)	Positive Score 78.94	17
Temporal Reasoning Prediction	ICEWS18 (test)	Positive Accuracy 75.78	17
Temporal Reasoning Prediction	Wiki (test)	Positive Performance 99.28	17
