Chain-of-Verification Reduces Hallucination in Large Language Models
About
Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models. We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (CoVe) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response. In experiments, we show CoVe decreases hallucinations across a variety of tasks, ranging from list-based questions from Wikidata to closed-book MultiSpanQA and longform text generation.
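The four CoVe steps form a simple prompting pipeline. Below is a minimal sketch of that loop in Python; the `generate` helper and the prompt wording are illustrative assumptions rather than the paper's exact prompts, and any LLM completion API can be plugged in.

```python
from typing import Callable, List

def chain_of_verification(question: str, generate: Callable[[str], str]) -> str:
    """Sketch of the CoVe loop; `generate` wraps a single LLM call."""
    # (i) Draft an initial baseline response.
    draft = generate(f"Answer the question:\n{question}")

    # (ii) Plan verification questions that fact-check the draft.
    plan = generate(
        "List short fact-checking questions (one per line) for this answer.\n"
        f"Question: {question}\nAnswer: {draft}"
    )
    verification_questions: List[str] = [
        q.strip() for q in plan.splitlines() if q.strip()
    ]

    # (iii) Answer each verification question independently, without showing
    # the draft, so the checks are not biased by the original response.
    verifications = [
        (q, generate(f"Answer concisely and factually:\n{q}"))
        for q in verification_questions
    ]

    # (iv) Generate the final verified response, conditioned on the draft
    # and the verification question/answer pairs.
    checks = "\n".join(f"Q: {q}\nA: {a}" for q, a in verifications)
    return generate(
        "Revise the draft answer so it is consistent with the verified facts.\n"
        f"Question: {question}\nDraft: {draft}\nVerified facts:\n{checks}"
    )
```

This corresponds to the "factored" style of verification, where each check is answered in its own context rather than alongside the original draft.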
Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, Jason Weston • 2023
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 93.6 | 1362 |
| Commonsense Reasoning | CSQA | Accuracy | 86 | 366 |
| Mathematical Reasoning | MathQA | Accuracy | 84 | 305 |
| Mathematical Reasoning | AIME | AIME Accuracy | 45 | 288 |
| Question Answering | GPQA | Accuracy | 52 | 258 |
| Mathematical Reasoning | AMC 23 | Accuracy | 72.5 | 198 |
| Mathematical Reasoning | MATH L5 | Accuracy | 0.56 | 90 |
| Scientific Reasoning | GPQA | Accuracy | 65.4 | 75 |
| Troop placement prediction | Risk | EMD | 0.56 | 66 |
| Question Answering | SQuAD (test) | GPT Judge Accuracy | 58 | 45 |