Chain-of-Verification Reduces Hallucination in Large Language Models

About

Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models. We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (CoVe) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response. In experiments, we show CoVe decreases hallucinations across a variety of tasks, including list-based questions from Wikidata, closed-book MultiSpanQA, and longform text generation.

Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, Jason Weston • 2023
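
The four steps in the abstract can be sketched as a short pipeline. This is a minimal illustration and not the authors' implementation: the `LLM` callable interface, the function name `chain_of_verification`, and all prompt wording are assumptions made for this example. The detail it tries to capture is step (iii), where each verification question is sent on its own, so its answer cannot be influenced by the draft or by other answers.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def chain_of_verification(query: str, llm: LLM) -> str:
    """Illustrative CoVe loop; prompt texts are assumptions, not the paper's."""
    # (i) Draft an initial response.
    draft = llm(f"Answer the question.\nQuestion: {query}")

    # (ii) Plan verification questions that fact-check the draft.
    plan = llm(
        "List one verification question per line for the claims in this answer.\n"
        f"Question: {query}\nAnswer: {draft}"
    )
    questions: List[str] = [q.strip() for q in plan.splitlines() if q.strip()]

    # (iii) Answer each question independently: the prompt holds only the
    # verification question, never the draft or the other answers.
    checks: List[Tuple[str, str]] = [(q, llm(q)) for q in questions]

    # (iv) Generate the final response from the draft plus the checks.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in checks)
    return llm(
        "Revise the draft so it is consistent with the verification results.\n"
        f"Question: {query}\nDraft: {draft}\nVerification:\n{evidence}"
    )
```

To try it, pass any function that takes a prompt string and returns a completion string, for example a thin wrapper around a model client of your choice.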

Related benchmarks

Task	Dataset	Result	
Mathematical Reasoning	GSM8K	Accuracy 93.6	1424
Mathematical Reasoning	AIME 2025	Accuracy 86.67	378
Commonsense Reasoning	CSQA	Accuracy 86	366
Mathematical Reasoning	MathQA	Accuracy 84	354
Mathematical Reasoning	AIME	AIME Accuracy 45	288
Question Answering	GPQA	Accuracy 52	258
Mathematical Reasoning	AMC 23	Accuracy 72.5	198
Question Answering	TruthfulQA	Accuracy 60	164
Mathematical Reasoning	MATH L5	Accuracy 0.56	162
Scientific Reasoning	GPQA	Accuracy 65.4	75

Showing 10 of 53 rows
