Stable Reinforcement Learning for Efficient Reasoning

About

The success of DeepSeek-R1 has drawn the LLM community's attention to reinforcement learning (RL) methods like GRPO. However, such rule-based 0/1 outcome reward methods cannot regulate the intermediate reasoning process during chain-of-thought (CoT) generation, which leads to severe overthinking. In response, recent studies have designed reward functions that reinforce shorter yet correct completions. Nevertheless, we observe that these length-penalty reward functions exacerbate RL training instability: as completion length decreases, model accuracy collapses abruptly, often early in training. To address this issue, we propose a simple yet effective solution, GRPO-$\lambda$, an efficient and stabilized variant of GRPO that dynamically adjusts the reward strategy by monitoring the correctness ratio among the completions within each query-sampled group. A low correctness ratio indicates that a length penalty would compromise CoT quality, so it triggers a switch to length-agnostic 0/1 rewards that prioritize reasoning capability. A high ratio keeps the length penalty in place to boost efficiency. Experimental results show that our approach avoids the training instability caused by the length penalty while maintaining the optimal accuracy-efficiency trade-off. On the GSM8K, GPQA, MATH-500, AMC 2023, and AIME 2024 benchmarks, it improves average accuracy by 1.48% while reducing CoT sequence length by 47.3%.
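The reward switch described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `group_rewards`, the threshold `ratio_threshold`, the penalty weight `alpha`, and the linear form of the length penalty are all assumptions, since the abstract does not state the exact formula.

```python
from typing import List


def group_rewards(
    correct: List[bool],
    lengths: List[int],
    ratio_threshold: float = 0.5,  # assumed switch point, not from the paper
    alpha: float = 0.5,            # assumed maximum length penalty
) -> List[float]:
    """Assign rewards to one group of completions sampled for a single query."""
    if not correct or len(correct) != len(lengths):
        raise ValueError("correct and lengths must be non-empty and equal in size")

    # Correctness ratio within the group decides which reward strategy applies.
    ratio = sum(correct) / len(correct)

    # Low ratio: fall back to length-agnostic 0/1 outcome rewards.
    if ratio < ratio_threshold:
        return [1.0 if c else 0.0 for c in correct]

    # High ratio: keep the length penalty. Here (an assumption) a correct
    # completion loses up to `alpha` reward, linearly in its length relative
    # to the shortest and longest correct completions in the group.
    correct_lengths = [n for c, n in zip(correct, lengths) if c]
    min_len, max_len = min(correct_lengths), max(correct_lengths)
    span = max_len - min_len

    rewards = []
    for c, n in zip(correct, lengths):
        if not c:
            rewards.append(0.0)   # incorrect completions never earn reward
        elif span == 0:
            rewards.append(1.0)   # all correct completions have equal length
        else:
            rewards.append(1.0 - alpha * (n - min_len) / span)
    return rewards
```

For example, a group with one correct completion out of four stays on 0/1 rewards, while a group with three correct out of four has its longer correct completions scaled down.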

Muzhi Dai, Shixuan Liu, Qingyi Si • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Mathematical Reasoning	AIME 2025	Accuracy	63.3	353
Mathematical Reasoning	Overall	Accuracy	79.9	81
Mathematical Reasoning	MATH 500	Accuracy	96.6	79
Mathematical Reasoning	AMC 2023	Accuracy	97.5	35
Mathematical Reasoning	AIME 2024	Accuracy	76.7	24
Scientific Reasoning	GPQA	Accuracy	60.1	24
Logical Reasoning	Big-Bench Hard (BBH)	Accuracy	86.3	7
