Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning
About
Large reasoning models that use long chain-of-thought excel at problem-solving yet waste compute on redundant checks. Curbing this overthinking is hard: training-time length penalties can cripple ability, while inference-time early-exit adds system overhead. To bridge this gap, we propose Step-GRPO, a novel post-training framework that internalizes dynamic early-exit capabilities directly into the model. Step-GRPO shifts the optimization objective from raw tokens to semantic steps by utilizing linguistic markers to structure reasoning. We introduce a Dynamic Truncated Rollout mechanism that exposes the model to concise high-confidence trajectories during exploration, synergized with a Step-Aware Relative Reward that dynamically penalizes redundancy based on group-level baselines. Extensive experiments across three model sizes on diverse benchmarks demonstrate that Step-GRPO achieves a superior accuracy-efficiency trade-off. On Qwen3-8B, our method reduces token consumption by 32.0% compared to the vanilla model while avoiding the accuracy degradation observed in traditional length-penalty methods.
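The abstract does not give formulas, so the following is only a minimal sketch of how step-level segmentation and a group-relative redundancy penalty could fit together. Everything in it is an assumption made for illustration: the marker list `STEP_MARKERS`, the function names, the `penalty` weight, the use of the mean step count of correct rollouts as the group baseline, and the cap on the penalty. Only the final normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO recipe; none of this should be read as the paper's actual reward.

```python
import re
from statistics import mean, pstdev

# Hypothetical marker list; the abstract does not enumerate the markers used.
STEP_MARKERS = ("Wait", "Alternatively", "Hmm", "So", "Therefore")
_MARKER_RE = re.compile(r"(?=\b(?:" + "|".join(STEP_MARKERS) + r")\b)")


def split_into_steps(reasoning: str) -> list[str]:
    """Split a reasoning trace into steps, starting a new step at each marker."""
    return [part.strip() for part in _MARKER_RE.split(reasoning) if part.strip()]


def step_aware_rewards(
    correct: list[bool], step_counts: list[int], penalty: float = 0.5
) -> list[float]:
    """Correctness reward, reduced for rollouts that use more steps than the group.

    Assumed form: the baseline is the mean step count of the correct rollouts
    in the group, and the penalty grows with the relative excess, capped at
    `penalty`.
    """
    correct_counts = [n for ok, n in zip(correct, step_counts) if ok]
    if not correct_counts:
        return [0.0] * len(correct)
    baseline = mean(correct_counts)
    rewards = []
    for ok, n in zip(correct, step_counts):
        if not ok:
            rewards.append(0.0)  # incorrect rollouts earn nothing
            continue
        excess = max(0.0, (n - baseline) / baseline)
        rewards.append(1.0 - penalty * min(excess, 1.0))
    return rewards


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standard GRPO-style normalization of rewards within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    trace = "First compute 2+2. Wait, check again. So the answer is 4."
    print(split_into_steps(trace))
    rewards = step_aware_rewards([True, True, False], [2, 6, 4])
    print(rewards, group_advantages(rewards))
```

In this sketch a correct rollout at or below the group's step baseline keeps the full reward, while a correct but longer rollout is discounted relative to its peers, so the pressure toward brevity comes from the group rather than from a fixed length budget.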
Related benchmarks
| Task | Dataset | Accuracy | Rank |
|---|---|---|---|
| Mathematical Reasoning | AIME 2025 | 73.3 | 311 |
| Mathematical Reasoning | Overall | 82.1 | 81 |
| Mathematical Reasoning | MATH 500 | 96.8 | 79 |
| Mathematical Reasoning | AMC 2023 | 95 | 35 |
| Mathematical Reasoning | AIME 2024 | 76.7 | 24 |
| Scientific Reasoning | GPQA | 56.1 | 24 |
| Logical Reasoning | Big-Bench Hard (BBH) | 86.51 | 7 |