Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning

About

Large reasoning models that use long chain-of-thought excel at problem-solving yet waste compute on redundant checks. Curbing this overthinking is hard: training-time length penalties can cripple ability, while inference-time early-exit adds system overhead. To bridge this gap, we propose Step-GRPO, a novel post-training framework that internalizes dynamic early-exit capabilities directly into the model. Step-GRPO shifts the optimization objective from raw tokens to semantic steps by utilizing linguistic markers to structure reasoning. We introduce a Dynamic Truncated Rollout mechanism that exposes the model to concise high-confidence trajectories during exploration, synergized with a Step-Aware Relative Reward that dynamically penalizes redundancy based on group-level baselines. Extensive experiments across three model sizes on diverse benchmarks demonstrate that Step-GRPO achieves a superior accuracy-efficiency trade-off. On Qwen3-8B, our method reduces token consumption by 32.0% compared to the vanilla model while avoiding the accuracy degradation observed in traditional length-penalty methods.

Benteng Chen, Weida Wang, Shufei Zhang, Mingbao Lin, Min Zhang • 2026
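
The abstract describes two ideas that a small example may make more concrete: segmenting a reasoning trace into semantic steps at linguistic markers, and penalizing redundant steps relative to a group-level baseline. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The marker list, the function names (split_steps, step_aware_rewards, group_advantages), the penalty weight, and the choice of the mean step count of correct rollouts as the baseline are all hypothetical. Only the final mean/std normalization follows the standard GRPO group-relative advantage. The Dynamic Truncated Rollout mechanism is not modeled here.

```python
import re
from statistics import mean, pstdev

# Hypothetical marker list: the abstract does not enumerate the markers used.
STEP_MARKERS = ("First", "Next", "Then", "So", "Therefore", "Wait", "Alternatively")
_MARKER_RE = re.compile(r"\b(?:%s)\b" % "|".join(STEP_MARKERS))


def split_steps(text):
    """Split a reasoning trace into steps, starting a new step at each marker."""
    starts = [m.start() for m in _MARKER_RE.finditer(text)]
    bounds = sorted(set([0] + starts + [len(text)]))
    steps = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in steps if s]


def step_aware_rewards(rollouts, penalty=0.5):
    """Assumed reward shape: a correct rollout earns 1.0 minus a penalty that
    grows with how far its step count exceeds the group baseline (here, the
    mean step count of correct rollouts). Incorrect rollouts earn 0.0.

    rollouts: list of (text, is_correct) pairs sampled for one prompt.
    """
    counts = [len(split_steps(text)) for text, _ in rollouts]
    correct = [c for c, (_, ok) in zip(counts, rollouts) if ok]
    if not correct:
        return [0.0] * len(rollouts)
    baseline = mean(correct)
    rewards = []
    for c, (_, ok) in zip(counts, rollouts):
        if not ok:
            rewards.append(0.0)
            continue
        excess = max(0.0, (c - baseline) / baseline)  # relative redundancy
        rewards.append(1.0 - penalty * min(excess, 1.0))
    return rewards


def group_advantages(rewards, eps=1e-8):
    """Standard GRPO-style normalization: (r - group mean) / group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


group = [
    ("First compute 2+2=4. So the answer is 4.", True),
    ("First compute 2+2=4. Wait, check again: 2+2=4. "
     "Wait, once more. So the answer is 4.", True),
    ("First guess 5. So the answer is 5.", False),
]
rewards = step_aware_rewards(group)
advantages = group_advantages(rewards)
```

In this toy group the concise correct rollout keeps the full reward, the correct rollout with two redundant "Wait" checks is penalized relative to the group, and the incorrect rollout receives the lowest advantage. Because the baseline is computed per group, the penalty adapts to how many steps a given problem tends to need rather than applying a fixed length budget.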

Related benchmarks

Task	Dataset	Metric	Result
Mathematical Reasoning	AIME 2025	Accuracy	73.3	353
Mathematical Reasoning	Overall	Accuracy	82.1	81
Mathematical Reasoning	MATH 500	Accuracy	96.8	79
Mathematical Reasoning	AMC 2023	Accuracy	95	35
Mathematical Reasoning	AIME 2024	Accuracy	76.7	24
Scientific Reasoning	GPQA	Accuracy	56.1	24
Logical Reasoning	Big-Bench Hard (BBH)	Accuracy	86.51	7
