Training Language Models to Self-Correct via Reinforcement Learning

About

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
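The reward bonus mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the exact form of the bonus, so the code below assumes binary correctness rewards and a bonus proportional to the improvement from the first attempt to the second. The names `TwoTurnEpisode`, `shaped_second_turn_reward`, `self_correction_delta`, and the weight `alpha` are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class TwoTurnEpisode:
    """Correctness of a first attempt and its revision (hypothetical container)."""
    first_correct: bool
    second_correct: bool


def shaped_second_turn_reward(ep: TwoTurnEpisode, alpha: float = 1.0) -> float:
    """Second-attempt reward plus a progress bonus.

    Assumption: the bonus is alpha * (r2 - r1) with binary rewards.
    The abstract specifies neither the form of the bonus nor alpha.
    """
    r1 = float(ep.first_correct)
    r2 = float(ep.second_correct)
    return r2 + alpha * (r2 - r1)


def self_correction_delta(episodes: list[TwoTurnEpisode]) -> float:
    """Second-attempt accuracy minus first-attempt accuracy, in percentage points."""
    n = len(episodes)
    acc1 = sum(ep.first_correct for ep in episodes) / n
    acc2 = sum(ep.second_correct for ep in episodes) / n
    return 100.0 * (acc2 - acc1)
```

Under this assumed shaping, turning a wrong first answer into a correct one earns more than simply repeating a correct answer, while breaking a correct answer is penalized. That is one way to discourage the collapse mode in which the model never changes its first attempt.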

Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust • 2024

Related benchmarks

Task	Dataset	Metric	Result
Mathematical Reasoning	HMMT25	Accuracy (%)	76.3	115
Mathematical Reasoning	OLY	Accuracy	25.3	105
Code Reasoning	LiveCodeBench	Accuracy	71.9	102
General Reasoning	MMLU-R	--	--	40
General Reasoning	MMLU-P	--	--	24
General Reasoning	GPQA	Multi@5 Accuracy	63.2	16
Math Reasoning	MATH 500	Multi@5 Accuracy	55.8	16
Math Reasoning	ThmQA	Multi@5 Accuracy	31.8	16
Mathematical Reasoning	MATH	Multi-step pass@5 Accuracy	53	16
Commonsense Reasoning	CSQA	Pass Rate	78	14

Showing 10 of 19 rows
