
Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning

About

Large reasoning models that use long chain-of-thought excel at problem-solving yet waste compute on redundant checks. Curbing this overthinking is hard: training-time length penalties can cripple reasoning ability, while inference-time early exit adds system overhead. To bridge this gap, we propose Step-GRPO, a post-training framework that internalizes dynamic early-exit capabilities directly into the model. Step-GRPO shifts the optimization objective from raw tokens to semantic steps, using linguistic markers to structure the reasoning. We introduce a Dynamic Truncated Rollout mechanism that exposes the model to concise, high-confidence trajectories during exploration, paired with a Step-Aware Relative Reward that penalizes redundancy dynamically against group-level baselines. Extensive experiments across three model sizes on diverse benchmarks demonstrate that Step-GRPO achieves a superior accuracy-efficiency trade-off. On Qwen3-8B, our method reduces token consumption by 32.0% compared to the vanilla model while avoiding the accuracy degradation observed in traditional length-penalty methods.
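The abstract gives no formulas, so the following is only an illustrative sketch of the two ideas it names: splitting a reasoning trace into steps at linguistic markers, and penalizing a correct rollout for using more steps than its group's baseline. The marker list, the `penalty` weight, the choice of the mean step count of correct rollouts as the baseline, and the function names `split_steps` and `step_aware_rewards` are all assumptions made for illustration, not the authors' definitions.

```python
import re
from statistics import mean

# Hypothetical marker list; the abstract does not enumerate the markers used.
STEP_MARKERS = ("Wait", "Alternatively", "Hmm", "But", "So")
_MARKER_RE = re.compile(r"\b(?:" + "|".join(STEP_MARKERS) + r")\b")


def split_steps(text):
    """Split a reasoning trace into steps, starting a new step at each marker."""
    starts = [m.start() for m in _MARKER_RE.finditer(text)]
    bounds = [0] + starts + [len(text)]
    pieces = (text[a:b].strip() for a, b in zip(bounds, bounds[1:]))
    return [p for p in pieces if p]


def step_aware_rewards(traces, correct, penalty=0.1):
    """Reward each rollout in a group (assumed scheme, not the paper's).

    Incorrect rollouts get 0. Correct rollouts get 1 minus a penalty that
    grows with how far their step count exceeds the group baseline, taken
    here to be the mean step count of the correct rollouts.
    """
    step_counts = [len(split_steps(t)) for t in traces]
    correct_counts = [n for n, ok in zip(step_counts, correct) if ok]
    if not correct_counts:
        return [0.0] * len(traces)
    baseline = mean(correct_counts)
    rewards = []
    for n, ok in zip(step_counts, correct):
        if not ok:
            rewards.append(0.0)
        else:
            excess = max(0.0, (n - baseline) / baseline)
            rewards.append(1.0 - penalty * excess)
    return rewards


group = [
    "Compute 2+2=4.",
    "Compute 2+2=4. Wait, recheck: 2+2=4. Alternatively, count on fingers: 4.",
    "Guess 5.",
]
rewards = step_aware_rewards(group, correct=[True, True, False])
```

In this toy group the second rollout reaches the same answer as the first but spends two extra marker-delimited steps re-verifying it, so it receives a slightly lower reward than the concise rollout, while the incorrect rollout receives none. Because the baseline is computed per group, the penalty adapts to how many steps a given problem typically needs rather than applying a fixed length budget.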

Benteng Chen, Weida Wang, Shufei Zhang, Mingbao Lin, Min Zhang • 2026

Related benchmarks

Task                    Dataset               Metric    Result  Rank
Mathematical Reasoning  AIME 2025             Accuracy  73.3    311
Mathematical Reasoning  Overall               Accuracy  82.1    81
Mathematical Reasoning  MATH 500              Accuracy  96.8    79
Mathematical Reasoning  AMC 2023              Accuracy  95      35
Mathematical Reasoning  AIME 2024             Accuracy  76.7    24
Scientific Reasoning    GPQA                  Accuracy  56.1    24
Logical Reasoning       Big-Bench Hard (BBH)  Accuracy  86.51   7
