Self-Consistency Improves Chain of Thought Reasoning in Language Models
About
Chain-of-thought prompting combined with pre-trained large language models has achieved encouraging results on complex reasoning tasks. In this paper, we propose a new decoding strategy, self-consistency, to replace the naive greedy decoding used in chain-of-thought prompting. It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths. Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer. Our extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting with a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks, including GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%) and ARC-challenge (+3.9%).
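For a discrete answer space, marginalizing out the sampled reasoning paths reduces to a majority vote over the final answers. Below is a minimal sketch of that aggregation step; the sampled answers are hypothetical stand-ins for answers extracted from independently sampled chain-of-thought completions.

```python
from collections import Counter

def self_consistency_answer(sampled_answers):
    """Pick the most consistent final answer across sampled reasoning paths.

    Each element of `sampled_answers` is the final answer parsed from one
    sampled chain-of-thought completion; marginalizing out the paths
    amounts to taking the most frequent answer.
    """
    counts = Counter(sampled_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical answers parsed from five sampled completions:
paths = ["18", "26", "18", "18", "26"]
print(self_consistency_answer(paths))  # "18"
```

In practice the diverse paths come from sampling the model with a nonzero temperature rather than greedy decoding; only the vote over the parsed answers is shown here.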
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 51.23 | 1460 |
| Visual Question Answering | VQA v2 | Accuracy | 66.04 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 54.85 | 1117 |
| Mathematical Reasoning | GSM8K | Accuracy | 91.8 | 983 |
| Code Generation | HumanEval | Pass@1 | 87.58 | 850 |
| Multi-task Language Understanding | MMLU | Accuracy | 80.96 | 842 |
| Mathematical Reasoning | GSM8K (test) | Accuracy | 96 | 797 |
| Commonsense Reasoning | WinoGrande | Accuracy | 64.1 | 776 |
| Language Understanding | MMLU | Accuracy | 83.66 | 756 |
| Mathematical Reasoning | GSM8K (test) | Accuracy | 94.2 | 751 |