Self-Refine: Iterative Refinement with Self-Feedback
About
Like humans, large language models (LLMs) do not always generate the best output on their first try. Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement. The main idea is to generate an initial output using an LLM; then, the same LLM provides feedback on its output and uses that feedback to refine the output, iteratively. Self-Refine does not require any supervised training data, additional training, or reinforcement learning; instead, it uses a single LLM as the generator, refiner, and feedback provider. We evaluate Self-Refine across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art LLMs (GPT-3.5, ChatGPT, and GPT-4). Across all evaluated tasks, outputs generated with Self-Refine are preferred by humans and automatic metrics over those generated with the same LLM using conventional one-step generation, improving task performance by ~20% absolute on average. Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test time using our simple, standalone approach.
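The generate–feedback–refine loop described above can be sketched as follows. This is a minimal illustration of the control flow only: `llm` stands in for a call to the same underlying model with different prompts, and all function names, prompt templates, and the stop phrase are illustrative assumptions, not the paper's actual code.

```python
def self_refine(task, llm, max_iters=4, stop_phrase="no further issues"):
    """Iteratively refine an LLM output using the model's own feedback.

    `llm` is any callable mapping a prompt string to a response string;
    the same model plays generator, feedback provider, and refiner.
    """
    # Step 1: generate an initial output.
    output = llm(f"Task: {task}\nProduce an initial answer.")
    for _ in range(max_iters):
        # Step 2: the same model critiques its own output.
        fb = llm(f"Task: {task}\nAnswer: {output}\nGive concise feedback.")
        # Stop when the model judges its output good enough.
        if stop_phrase in fb.lower():
            break
        # Step 3: refine the output using the feedback, then repeat.
        output = llm(
            f"Task: {task}\nAnswer: {output}\n"
            f"Feedback: {fb}\nRewrite the answer."
        )
    return output
```

Because the loop needs no training signal, plugging in any prompt-followable model (via an API client, for example) is enough; the iteration budget `max_iters` and the stopping criterion are the only knobs.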
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy: 39.2 | 1460 |
| Mathematical Reasoning | GSM8K | Accuracy: 94 | 983 |
| Code Generation | HumanEval | -- | 850 |
| Mathematical Reasoning | GSM8K (test) | Accuracy: 94.8 | 797 |
| Mathematical Reasoning | GSM8K (test) | Accuracy: 36.9 | 751 |
| Commonsense Reasoning | PIQA | Accuracy: 62.6 | 647 |
| Mathematical Reasoning | MATH | Accuracy: 58.5 | 643 |
| Reasoning | BBH | -- | 507 |
| Question Answering | OpenBookQA | Accuracy: 82.66 | 465 |
| Mathematical Reasoning | MATH (test) | Overall Accuracy: 72.2 | 433 |