
Decomposed Prompting Does Not Fix Knowledge Gaps, But Helps Models Say "I Don't Know"

About

Large language models often fail to recognize their knowledge limits in closed-book question answering, leading to confident hallucinations. While decomposed prompting is typically used to improve accuracy, we investigate its impact on reliability. We evaluate three task-equivalent prompting regimes (Direct, Assistive, and Incremental) across different model scales and multi-hop QA benchmarks. We find that although accuracy gains from decomposition diminish in frontier models, disagreements between prompting regimes remain highly indicative of potential errors. Because factual knowledge is stable while hallucinations are stochastic, cross-regime agreement provides a precise signal of internal uncertainty. We leverage this signal to implement a training-free abstention policy that requires no retrieval or fine-tuning. Our results show that disagreement-based abstention outperforms standard uncertainty baselines as an error detector, improving both F1 and AUROC across settings. This demonstrates that decomposition-based prompting can serve as a practical diagnostic probe for model reliability in closed-book QA.
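The abstention policy described above can be sketched in a few lines: answer only when all prompting regimes agree on a (normalized) answer, and abstain otherwise. The function names, the normalization rules, and the unanimity threshold below are illustrative assumptions, not the paper's exact implementation.

```python
import re
from collections import Counter


def normalize(ans: str) -> str:
    """Loose string normalization: lowercase, drop articles and punctuation."""
    ans = ans.lower().strip()
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)   # drop English articles
    ans = re.sub(r"[^a-z0-9 ]", "", ans)        # drop punctuation
    return " ".join(ans.split())


def abstain_on_disagreement(direct: str, assistive: str, incremental: str,
                            min_agree: int = 3):
    """Return (answer, abstained).

    Emit an answer only if at least `min_agree` of the three prompting
    regimes (Direct, Assistive, Incremental) produce the same normalized
    answer; otherwise abstain, treating disagreement as an error signal.
    """
    votes = Counter(normalize(a) for a in (direct, assistive, incremental))
    top, count = votes.most_common(1)[0]
    if count >= min_agree:
        return top, False
    return None, True  # regimes disagree: likely hallucination, say "I don't know"
```

For example, `abstain_on_disagreement("Paris", "paris.", "The Paris")` answers, while `abstain_on_disagreement("Paris", "London", "Paris")` abstains under the default unanimity threshold.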

Dhruv Madhwal, Lyuxin David Zhang, Dan Roth, Tomer Wolfson, Vivek Gupta • 2026

Related benchmarks

Task             Dataset                        Metric    Result  Rank
Error detection  Bamboogle                      F1        0.94    36
Error detection  CRAG                           F1        0.91    36
Error detection  FRAMES                         F1        0.95    36
Error detection  HotpotQA                       F1        0.91    36
Error detection  Mintaka                        F1        0.88    36
Error detection  MuSiQue                        F1        0.93    36
Error detection  Mintaka (val)                  Precision 0.91    36
Error detection  MuSiQue (val)                  Precision 0.96    36
Error detection  Bamboogle Full                 Precision 0.97    36
Error detection  CRAG multi-hop subset (train)  Precision 0.91    36
Showing 10 of 12 rows
