
CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought

About

Large language models (LLMs) excel in many tasks but struggle to accurately quantify uncertainty in their generated responses. This limitation makes it challenging to detect misinformation and ensure reliable decision-making. Existing uncertainty quantification (UQ) methods for LLMs are primarily prompt-wise rather than response-wise, often requiring multiple response samples, which incurs high computational costs. Moreover, LLMs have been shown to be overconfident, particularly when using reasoning steps to derive their answers. In this work, we propose CoT-UQ, a response-wise UQ framework that integrates LLMs' inherent reasoning capabilities through Chain-of-Thought (CoT) into the UQ process. CoT-UQ captures critical information during inference by extracting keywords from each reasoning step and assessing their importance to the final answer. This key reasoning information is then aggregated to produce a final uncertainty estimate. We conduct extensive experiments on Llama-family models ranging from 8B to 13B parameters across logical and mathematical reasoning tasks. Experimental results demonstrate that CoT-UQ significantly outperforms existing UQ methods, achieving an average AUROC improvement of 5.9% over current approaches. The code is available at: https://github.com/ZBox1005/CoT-UQ.
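The pipeline the abstract describes (extract keywords from each reasoning step, score their importance to the final answer, aggregate into one uncertainty estimate) can be sketched as follows. All helper names and the toy confidence/importance values below are hypothetical stand-ins for illustration, not the authors' implementation; see the linked repository for the actual code.

```python
# Hypothetical sketch of the CoT-UQ aggregation step: an
# importance-weighted average of per-keyword confidences, inverted
# into an uncertainty score. Names and heuristics are illustrative.

def extract_keywords(step: str) -> list[str]:
    # Placeholder heuristic: in the paper the LLM itself is prompted
    # to extract keywords; here we just keep capitalized or numeric
    # tokens as a crude stand-in.
    return [t for t in step.split() if t[0].isupper() or t.isdigit()]

def aggregate_uncertainty(steps, keyword_conf, importance):
    """Importance-weighted average of keyword confidences.

    keyword_conf: keyword -> model confidence (e.g. mean token prob)
    importance:   keyword -> importance to the final answer in [0, 1]
    Returns an uncertainty score in [0, 1] (1 = most uncertain).
    """
    num = den = 0.0
    for step in steps:
        for kw in extract_keywords(step):
            w = importance.get(kw, 0.0)
            num += w * keyword_conf.get(kw, 0.0)
            den += w
    confidence = num / den if den else 0.0
    return 1.0 - confidence

# Toy chain-of-thought with made-up confidence/importance values.
steps = ["Paris is the capital of France", "So the answer is Paris"]
conf = {"Paris": 0.9, "France": 0.8}
imp = {"Paris": 1.0, "France": 0.5}
u = aggregate_uncertainty(steps, conf, imp)
```

The weighting means a low-confidence keyword only raises the final uncertainty in proportion to how much it matters for the answer, which is the core idea behind attributing response-wise uncertainty to the reasoning chain.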

Boxuan Zhang, Ruqi Zhang · 2025

Related benchmarks

Task                        | Dataset   | Metric    | Result | Rank
Multi-hop Reasoning         | HotpotQA  | AUROC (%) | 67.19  | 26
Multi-hop Reasoning         | 2WikiMHQA | AUROC (%) | 70.02  | 26
Mathematical Reasoning      | GSM8K     | AUROC (%) | 65.10  | 20
Mathematical Reasoning      | SVAMP     | AUROC (%) | 62.11  | 20
Mathematical Reasoning      | ASDIV     | AUROC (%) | 66.91  | 20
Math Word Problem Reasoning | SVAMP     | AUROC (%) | 65.83  | 6
Math Word Problem Reasoning | ASDIV     | AUROC (%) | 68.18  | 6
Question Answering          | HotpotQA  | AUROC (%) | 63.10  | 6
Question Answering          | ASDIV     | AUROC (%) | 66.91  | 6
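The benchmark metric above, AUROC, is the probability that the method's confidence score ranks a correct answer above an incorrect one (50 = random, 100 = perfect separation). A minimal, illustrative implementation on toy data, not the paper's evaluation code:

```python
# Rank-based AUROC: for every (correct, incorrect) pair, count a win
# when the correct answer received the higher confidence score, and
# half a win on ties. Toy labels/scores below are made up.

def auroc(labels, scores):
    """labels: 1 = answer was correct, 0 = incorrect.
    scores: the UQ method's confidence for each answer."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.7, 0.6]
score = auroc(labels, scores)
```

Because AUROC depends only on the ranking of scores, it compares UQ methods without requiring their confidence values to be calibrated to a common scale.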

Other info

Code
