Training Language Models to Self-Correct via Reinforcement Learning
About
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey either to a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and by using appropriate regularization to steer learning toward a self-correction behavior that is effective at test time, as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by the use of a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
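The reward bonus and the notion of self-correction improvement described above can be illustrated with a minimal sketch. The abstract does not state the bonus formula, so the form used here (a bonus proportional to the change in correctness between the first and second attempt) is an assumption, and the names `shaped_reward`, `self_correction_delta`, and `alpha` are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def shaped_reward(first_correct: bool, second_correct: bool, alpha: float = 1.0) -> float:
    """Hypothetical shaped reward for the second attempt.

    Assumed form (not given in the abstract): the second-attempt
    correctness plus a bonus of alpha times the change in correctness
    from the first attempt to the second.
    """
    r1 = 1.0 if first_correct else 0.0
    r2 = 1.0 if second_correct else 0.0
    # Fixing a wrong answer earns a positive bonus; breaking a correct
    # answer earns a negative one; no change earns no bonus.
    return r2 + alpha * (r2 - r1)


def self_correction_delta(first: Sequence[bool], second: Sequence[bool]) -> float:
    """Accuracy at the second attempt minus accuracy at the first attempt."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty sequences of equal length")
    acc1 = sum(first) / len(first)
    acc2 = sum(second) / len(second)
    return acc2 - acc1
```

Under this assumed shaping, a policy that simply repeats its first answer gains nothing from the bonus term, which is one way a bonus could discourage collapse into a non-correcting mode.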
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | HMMT25 | Accuracy (%) | 76.3 | 115 |
| Mathematical Reasoning | OLY | Accuracy | 25.3 | 105 |
| Code Reasoning | LiveCodeBench | Accuracy | 71.9 | 90 |
| General Reasoning | MMLU-R | -- | -- | 40 |
| General Reasoning | MMLU-P | -- | -- | 24 |
| General Reasoning | GPQA | Multi@5 Accuracy | 63.2 | 16 |
| Math Reasoning | MATH 500 | Multi@5 Accuracy | 55.8 | 16 |
| Math Reasoning | ThmQA | Multi@5 Accuracy | 31.8 | 16 |
| Mathematical Reasoning | MATH | Multi-step pass@5 Accuracy | 53 | 16 |
| Commonsense Reasoning | CSQA | Pass Rate | 78 | 14 |