
Stable Reinforcement Learning for Efficient Reasoning

About

The success of DeepSeek-R1 has drawn the LLM community's attention to reinforcement learning (RL) methods like GRPO. However, such rule-based 0/1 outcome reward methods cannot regulate the intermediate reasoning process during chain-of-thought (CoT) generation, which leads to severe overthinking. In response, recent studies have designed reward functions that reinforce shorter yet correct completions. Nevertheless, we observe that these length-penalty reward functions exacerbate RL training instability: as completion length decreases, model accuracy abruptly collapses, often early in training. To address this issue, we propose a simple yet effective solution, GRPO-$\lambda$, an efficient and stabilized variant of GRPO that dynamically adjusts the reward strategy by monitoring the correctness ratio among the completions within each query-sampled group. A low correctness ratio indicates the need to avoid a length penalty that compromises CoT quality, triggering a switch to length-agnostic 0/1 rewards that prioritize reasoning capability. A high ratio maintains length penalties to boost efficiency. Experimental results show that our approach avoids the training instability caused by length penalties while maintaining the optimal accuracy-efficiency trade-off. On the GSM8K, GPQA, MATH-500, AMC 2023, and AIME 2024 benchmarks, it improves average accuracy by 1.48% while reducing CoT sequence length by 47.3%.
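The reward switch described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `group_rewards`, the `ratio_threshold` and `penalty_weight` parameters and their default values, and the linear form of the length discount are all assumptions chosen for illustration. Only the overall rule comes from the abstract: a low correctness ratio within a group falls back to 0/1 rewards, and a high ratio keeps a length penalty.

```python
from typing import List


def group_rewards(
    correct: List[bool],
    lengths: List[int],
    ratio_threshold: float = 0.5,  # hypothetical switch point, not from the paper
    penalty_weight: float = 0.5,   # hypothetical strength of the length penalty
) -> List[float]:
    """Return one reward per completion in a single query-sampled group."""
    if not correct or len(correct) != len(lengths):
        raise ValueError("correct and lengths must be non-empty and equal in length")

    # Fraction of completions in this group that reached the right answer.
    correctness_ratio = sum(correct) / len(correct)

    # Low ratio: use length-agnostic 0/1 outcome rewards so that
    # reasoning quality is not traded away for brevity.
    if correctness_ratio < ratio_threshold:
        return [1.0 if c else 0.0 for c in correct]

    # High ratio: keep the outcome reward but discount longer correct
    # completions. The linear discount is an illustrative assumption.
    max_len = max(max(lengths), 1)  # guard against division by zero
    rewards = []
    for is_correct, n_tokens in zip(correct, lengths):
        if is_correct:
            rewards.append(1.0 - penalty_weight * n_tokens / max_len)
        else:
            rewards.append(0.0)
    return rewards


# A mostly wrong group falls back to plain 0/1 rewards.
print(group_rewards([True, False, False, False], [100, 200, 300, 400]))
# A mostly correct group favors the shorter correct completions.
print(group_rewards([True, True, True, False], [100, 200, 400, 400]))
```

In this sketch the switch is made per group, so easy queries (high correctness ratio) are pushed toward brevity while hard queries are rewarded only for correctness.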

Muzhi Dai, Shixuan Liu, Qingyi Si • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | AIME 2025 | Accuracy 63.3 | 311 |
| Mathematical Reasoning | Overall | Accuracy 79.9 | 81 |
| Mathematical Reasoning | MATH 500 | Accuracy 96.6 | 79 |
| Mathematical Reasoning | AMC 2023 | Accuracy 97.5 | 35 |
| Mathematical Reasoning | AIME 2024 | Accuracy 76.7 | 24 |
| Scientific Reasoning | GPQA | Accuracy 60.1 | 24 |
| Logical Reasoning | Big-Bench Hard (BBH) | Accuracy 86.3 | 7 |
