Uncertainty Quantification for Retrieval-Augmented Reasoning

About

Retrieval-augmented reasoning (RAR) is a recent evolution of retrieval-augmented generation (RAG) that interleaves multiple reasoning steps with retrieval and generation. While effective for some complex queries, RAR remains vulnerable to errors and misleading outputs. Uncertainty quantification (UQ) offers methods to estimate the confidence of a system's outputs. Existing methods, however, mostly target simple queries with no retrieval or single-step retrieval and do not properly handle the RAR setup. Accurate uncertainty estimation for RAR requires accounting for all sources of uncertainty, including those arising from retrieval and generation. In this paper, we account for all these sources and introduce Retrieval-Augmented Reasoning Consistency (R2C), a novel UQ method for RAR. The core idea of R2C is to perturb the multi-step reasoning process by applying various actions to reasoning steps. These perturbations alter the retriever's input, which shifts its output and consequently modifies the generator's input at the next step. Through this iterative feedback loop, the retriever and generator continuously reshape one another's inputs, enabling us to capture uncertainty arising from both components. Experiments on five popular RAR systems across diverse QA datasets show that R2C improves AUROC by over 5% on average compared to state-of-the-art UQ baselines. Extrinsic evaluations using R2C as an external signal further confirm its effectiveness on two downstream tasks: in Abstention, it achieves ~5% gains in both F1Abstain and AccAbstain; in Model Selection, it improves exact match by ~7% over single models and ~3% over selection methods.
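The perturb-and-compare idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say how R2C turns perturbed runs into a score, so the sketch simply uses the fraction of perturbed runs whose final answer disagrees with the unperturbed answer. The action names, the `Pipeline` signature, and `toy_pipeline` are hypothetical stand-ins, not the paper's components.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical signature for a RAR system: given a question, an optional
# perturbation action, and a random generator, return the final answer.
Pipeline = Callable[[str, Optional[str], random.Random], str]

# Hypothetical action names; the abstract only says "various actions".
ACTIONS = ("paraphrase_query", "rethink_step")


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(answer.lower().split())


def consistency_uncertainty(
    pipeline: Pipeline,
    question: str,
    actions: Sequence[str] = ACTIONS,
    n_samples: int = 10,
    seed: int = 0,
) -> float:
    """Return 1 - (share of perturbed runs that agree with the unperturbed run)."""
    rng = random.Random(seed)
    reference = normalize(pipeline(question, None, rng))
    agree = 0
    for _ in range(n_samples):
        action = rng.choice(list(actions))  # pick one perturbation per run
        if normalize(pipeline(question, action, rng)) == reference:
            agree += 1
    return 1.0 - agree / n_samples


def toy_pipeline(question: str, action: Optional[str], rng: random.Random) -> str:
    """Stand-in system: 'easy' questions ignore perturbations, others may flip."""
    if action is None or "easy" in question:
        return "Paris"
    return rng.choice(["Paris", "Lyon"])
```

In this sketch a score of 0.0 means every perturbed run reproduced the original answer, and values closer to 1.0 mean the answer is fragile under perturbation. A real system would apply the action inside the reasoning trace, so that the changed step alters the retriever's input and, through the retrieved documents, the generator's next input.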

Heydar Soudani, Hamed Zamani, Faegheh Hasibi • 2025

Related benchmarks

Task	Dataset	Result
Question Answering	PopQA (test)	Accuracy 46.8	111
Uncertainty Quantification	PopQA 500 randomly sampled queries (test)	AUROC 0.8709	70
Uncertainty Quantification	HotpotQA 500 randomly sampled queries (test)	AUROC 83.25	70
Uncertainty Quantification	Musique 500 randomly sampled queries (test)	AUROC 0.8322	70
Question Answering	HotpotQA (test)	EM 50.2	39
Abstention	PopQA	Abstain Accuracy 81.6	25
Abstention	HotpotQA	Abstain Accuracy 76.8	25
Abstention	MusiQ	Abstain Accuracy 90.4	25
Abstention	PopQA (test)	AUARC 66.06	25
Abstention	Hotpot (test)	AUARC 60.9	25
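
For readers unfamiliar with the AUROC metric reported above, the following is a minimal sketch of how it can be computed for a UQ method. It assumes each query has a confidence score and a binary correctness label (1 = correct answer). The function name `auroc` and the sample values are illustrative, not taken from the paper.

```python
from typing import Sequence


def auroc(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Probability that a correct answer gets higher confidence than an
    incorrect one, counting ties as half (pairwise rank formulation)."""
    pos = [c for c, y in zip(confidences, correct) if y == 1]
    neg = [c for c, y in zip(confidences, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative values only: two correct and two incorrect answers.
example = auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

The function returns a value on a 0 to 1 scale, where 0.5 corresponds to confidence scores that carry no information about correctness; multiply by 100 to express it as a percentage.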

