Uncertainty Quantification for Retrieval-Augmented Reasoning

About

Retrieval-augmented reasoning (RAR) is a recent evolution of retrieval-augmented generation (RAG) that employs multiple reasoning steps for retrieval and generation. While effective for some complex queries, RAR remains vulnerable to errors and misleading outputs. Uncertainty quantification (UQ) offers methods to estimate the confidence of a system's outputs. Existing UQ methods, however, mostly target simple queries with no retrieval or single-step retrieval and do not properly handle the RAR setup. Accurate uncertainty estimation for RAR requires accounting for all sources of uncertainty, including those arising from retrieval and generation. In this paper, we account for all these sources and introduce Retrieval-Augmented Reasoning Consistency (R2C), a novel UQ method for RAR. The core idea of R2C is to perturb the multi-step reasoning process by applying various actions to reasoning steps. These perturbations alter the retriever's input, which shifts its output and consequently modifies the generator's input at the next step. Through this iterative feedback loop, the retriever and generator continuously reshape one another's inputs, enabling us to capture uncertainty arising from both components. Experiments on five popular RAR systems across diverse QA datasets show that R2C improves AUROC by over 5% on average compared to state-of-the-art UQ baselines. Extrinsic evaluations using R2C as an external signal further confirm its effectiveness on two downstream tasks: in Abstention, it achieves ~5% gains in both F1_Abstain and Acc_Abstain; in Model Selection, it improves exact match by ~7% over single models and ~3% over selection methods.
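The perturb-and-compare idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the corpus, the `retrieve` and `generate` stand-ins, the three perturbation actions, and the majority-vote score are all assumptions made for illustration. It only shows how perturbing a reasoning step changes what the retriever sees, how that changes the generator's next input, and how disagreement among the resulting final answers can be read as uncertainty.

```python
import random
from collections import Counter
from typing import Callable, Sequence, Tuple

# Hypothetical toy corpus standing in for a real document index.
DOCS = [
    "Paris is the capital of France",
    "Lyon is a large city in France",
    "Berlin is the capital of Germany",
]

def retrieve(query: str) -> str:
    """Toy retriever: return the document sharing the most words with the query."""
    query_words = set(query.lower().split())
    return max(DOCS, key=lambda doc: len(query_words & set(doc.lower().split())))

def generate(doc: str) -> str:
    """Toy generator: answer with the first word of the retrieved document."""
    return doc.split()[0]

# Hypothetical perturbation actions applied to a reasoning step (here, the search query).
Action = Callable[[str, random.Random], str]

def keep(query: str, rng: random.Random) -> str:
    return query

def drop_word(query: str, rng: random.Random) -> str:
    words = query.split()
    if len(words) > 1:
        words.pop(rng.randrange(len(words)))
    return " ".join(words)

def add_distractor(query: str, rng: random.Random) -> str:
    return query + " " + rng.choice(["Germany", "city", "large"])

def run_perturbed(question: str, actions: Sequence[Action],
                  rng: random.Random, n_steps: int = 2) -> str:
    """One perturbed multi-step run: perturb the query, retrieve, generate, repeat."""
    query, answer = question, ""
    for _ in range(n_steps):
        query = rng.choice(actions)(query, rng)   # perturb the retriever's input
        answer = generate(retrieve(query))        # retrieval shifts the generator's input
        query = f"{question} {answer}"            # generator output feeds the next step
    return answer

def consistency_uncertainty(question: str,
                            actions: Sequence[Action] = (keep, drop_word, add_distractor),
                            n_samples: int = 20, seed: int = 0) -> Tuple[str, float]:
    """Return the majority answer and 1 - (share of runs that agree with it)."""
    rng = random.Random(seed)
    answers = [run_perturbed(question, actions, rng) for _ in range(n_samples)]
    majority, count = Counter(answers).most_common(1)[0]
    return majority, 1.0 - count / n_samples
```

Calling `consistency_uncertainty("capital of France")` returns the most frequent final answer together with a score in [0, 1]; a score near 0 means the perturbed runs mostly agree. In a real setting the toy components would be replaced by an actual retriever and LLM, and the agreement measure would likely need to compare answers semantically rather than by exact string match.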

Heydar Soudani, Hamed Zamani, Faegheh Hasibi • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
--- | --- | --- | --- | ---
Uncertainty Quantification | PopQA, 500 randomly sampled queries (test) | AUROC | 0.8709 | 70
Uncertainty Quantification | HotpotQA, 500 randomly sampled queries (test) | AUROC | 83.25 | 70
Uncertainty Quantification | MuSiQue, 500 randomly sampled queries (test) | AUROC | 0.8322 | 70
Question Answering | HotpotQA (test) | EM | 50.2 | 39
Question Answering | PopQA (test) | Accuracy | 46.8 | 39
Abstention | PopQA | Abstain Accuracy | 81.6 | 25
Abstention | HotpotQA | Abstain Accuracy | 76.8 | 25
Abstention | MuSiQue | Abstain Accuracy | 90.4 | 25
Abstention | PopQA (test) | AUARC | 66.06 | 25
Abstention | HotpotQA (test) | AUARC | 60.9 | 25
Showing 10 of 11 rows
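For readers unfamiliar with how AUROC is used to score an uncertainty estimator, the sketch below shows one common formulation: the probability that a randomly chosen incorrect answer receives a higher uncertainty score than a randomly chosen correct one, with ties counted as half. The function name and the example values are illustrative assumptions; they are not taken from the paper or from the results listed above.

```python
from typing import Sequence

def auroc(uncertainty: Sequence[float], is_correct: Sequence[bool]) -> float:
    """Pairwise AUROC: how well uncertainty separates incorrect from correct answers."""
    wrong = [u for u, c in zip(uncertainty, is_correct) if not c]
    right = [u for u, c in zip(uncertainty, is_correct) if c]
    if not wrong or not right:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = 0.0
    for u_wrong in wrong:
        for u_right in right:
            if u_wrong > u_right:
                wins += 1.0      # incorrect answer ranked as more uncertain: good
            elif u_wrong == u_right:
                wins += 0.5      # ties count as half
    return wins / (len(wrong) * len(right))

# Made-up scores for four answers, two correct and two incorrect.
score = auroc([0.1, 0.4, 0.35, 0.8], [True, True, False, False])
print(score)  # 0.75
```

A value of 0.5 corresponds to a score that is no better than chance at separating correct from incorrect answers, and 1.0 to perfect separation. The pairwise loop is quadratic in the number of answers, which is acceptable for evaluation sets of a few hundred queries; a rank-based computation would be preferable for larger sets.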
