TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning

About

While Large Language Models (LLMs) have demonstrated impressive capabilities, their output quality remains inconsistent across various application scenarios, making it difficult to identify trustworthy responses, especially in complex tasks requiring multi-step reasoning. In this paper, we propose a Token-level Uncertainty estimation framework for Reasoning (TokUR) that enables LLMs to self-assess and self-improve their responses in mathematical reasoning. Specifically, we introduce low-rank random weight perturbation during LLM decoding to generate predictive distributions for token-level uncertainty estimation, and we aggregate these uncertainty quantities to capture the semantic uncertainty of generated responses. Experiments on mathematical reasoning datasets of varying difficulty demonstrate that TokUR exhibits a strong correlation with answer correctness and model robustness, and the uncertainty signals produced by TokUR can be leveraged to enhance the model's reasoning performance at test time. These results highlight the effectiveness of TokUR as a principled and scalable approach for improving the reliability and interpretability of LLMs in challenging reasoning tasks.

Tunyu Zhang, Haizhou Shi, Yibin Wang, Hengyi Wang, Xiaoxiao He, Zhuowei Li, Haoxian Chen, Ligong Han, Kai Xu, Huan Zhang, Dimitris Metaxas, Hao Wang• 2025

Related benchmarks

Task	Dataset	Result
Incorrect Reasoning Path Detection	MATH500	Accuracy94	46
Incorrect Reasoning Path Detection	DeepScaleR	Accuracy64.24	46
Incorrect Reasoning Path Detection	GSM8K	Accuracy97.79	46
Long-form Factuality Verification	FactScore	Precision@161.53	15
Code Generation	HumanEval	ACC*80.9	8
Logical reasoning	Reasoning Gym Zebra Puzzles	Accuracy (*)39.33	8
Logical reasoning	Reasoning Gym Leg Counting	Accuracy50.67	8
Logical reasoning	Reasoning Gym Color Cube	Accuracy (*)32	8

Showing 8 of 8 rows

Other info

Follow for update

@wizwand_team Discord