Large Language Models Can Self-Correct with Key Condition Verification
About
Intrinsic self-correction is a method that instructs large language models (LLMs) to verify and correct their responses without external feedback. Unfortunately, prior work concluded that LLMs cannot yet self-correct reasoning. We find that a simple yet effective verification method can unleash the inherent self-correction capabilities of LLMs: mask a key condition in the question, append the current response to construct a verification question, and ask the model to predict the masked condition in order to verify the response. The condition can be an entity in an open-domain question or a numeric value in a math question, and identifying it requires minimal effort (via prompting). We propose ProCo, an iterative verify-then-correct framework that progressively identifies and corrects (probably) false responses. We conduct experiments on three reasoning tasks. On average, ProCo with GPT-3.5-Turbo as the backend LLM yields $+6.8$ exact match on four open-domain question answering datasets, $+14.1$ accuracy on three arithmetic reasoning datasets, and $+9.6$ accuracy on a commonsense reasoning dataset, compared to Self-Correct. Our implementation is publicly available at https://wzy6642.github.io/proco.github.io/.
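The loop below is a minimal sketch of this verify-then-correct procedure, assuming a hypothetical `call_llm` function that wraps any chat-completion API. The prompt wording, the `MASK` token, and the exact-match comparison are illustrative choices, not the paper's exact implementation.

```python
# Sketch of the ProCo verify-then-correct loop described above.
# `call_llm` is a hypothetical stand-in for any LLM completion API;
# prompts here are illustrative, not the paper's exact templates.
from typing import Callable

MASK = "[MASK]"

def verify(question: str, key_condition: str, answer: str,
           call_llm: Callable[[str], str]) -> bool:
    """Mask the key condition, append the current answer, and ask the
    model to recover the masked condition. The answer passes verification
    if the prediction matches the original condition."""
    masked_question = question.replace(key_condition, MASK, 1)
    prompt = (
        f"{masked_question}\n"
        f"The answer to the question above is {answer}.\n"
        f"What value should {MASK} be? Reply with the value only."
    )
    return call_llm(prompt).strip() == key_condition

def proco(question: str, key_condition: str,
          call_llm: Callable[[str], str], max_iters: int = 3) -> str:
    """Iteratively generate an answer, keep it if it passes key-condition
    verification, and otherwise prompt for a corrected answer."""
    answer = call_llm(f"{question}\nAnswer:").strip()
    for _ in range(max_iters):
        if verify(question, key_condition, answer, call_llm):
            return answer
        answer = call_llm(
            f"{question}\nThe previous answer {answer} may be wrong. "
            f"Give a corrected answer:"
        ).strip()
    return answer
```

In practice the key condition itself (e.g., a numeric value in a math word problem) would also be extracted by prompting the model, as the abstract notes; here it is passed in directly to keep the sketch short.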
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reasoning | GSM8K | Accuracy | 0.834 | 106 |
| Symbolic Reasoning | Letter | Accuracy | 74.67 | 67 |
| Symbolic Reasoning | Last Letter Concatenation | Accuracy | 74 | 58 |
| Algorithmic Reasoning | MATH | Accuracy | 69.6 | 46 |
| Reasoning | Bamboogle | Accuracy | 50 | 46 |
| Mathematical Reasoning | GSM-Hard | Accuracy | 39.6 | 46 |
| Symbolic Reasoning | COIN | Accuracy | 75.25 | 45 |
| Reasoning | StrategyQA | Accuracy | 64.75 | 40 |
| Domain-specific Reasoning | LegalBench | Accuracy | 44.21 | 33 |
| Mathematical Reasoning | GSM-Hard | Accuracy | 48.6 | 28 |