Promoting Efficient Reasoning with Verifiable Stepwise Reward

About

Large reasoning models (LRMs) have recently achieved significant progress in complex reasoning tasks, aided by reinforcement learning with verifiable rewards. However, LRMs often suffer from overthinking, expending excessive computation on simple problems and reducing efficiency. Existing efficient reasoning methods typically require accurate task assessment to preset token budgets or select reasoning modes, which limits their flexibility and reliability. In this work, we revisit the essence of overthinking and identify that encouraging effective steps while penalizing ineffective ones is key to its solution. To this end, we propose a novel rule-based verifiable stepwise reward mechanism (VSRM), which assigns rewards based on the performance of intermediate states in the reasoning trajectory. This approach is intuitive and naturally fits the step-by-step nature of reasoning tasks. We conduct extensive experiments on standard mathematical reasoning benchmarks, including AIME24 and AIME25, by integrating VSRM with PPO and Reinforce++. Results show that our method achieves substantial output length reduction while maintaining original reasoning performance, striking an optimal balance between efficiency and accuracy. Further analysis of overthinking frequency and pass@k score before and after training demonstrates that our approach indeed effectively suppresses ineffective steps and encourages effective reasoning, fundamentally alleviating the overthinking problem. All code will be released upon acceptance.

Chuhuai Yue, Chengqi Dong, Yinan Gao, Hang He, Jiajun Chai, Guojun Yin, Wei Lin • 2025
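
The abstract says VSRM assigns rewards based on the performance of intermediate states in the reasoning trajectory, but it does not give the exact rule. The following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes that each intermediate state has an estimated accuracy (for example, the fraction of continuations from that state that reach the correct answer) and that a step's reward is the change in that accuracy. The function name `stepwise_rewards`, the `state_accuracies` input, and the `tolerance` parameter are all hypothetical.

```python
from typing import List, Sequence


def stepwise_rewards(
    state_accuracies: Sequence[float], tolerance: float = 1e-9
) -> List[float]:
    """Hypothetical stepwise reward: accuracy change between consecutive states.

    state_accuracies[i] is an ASSUMED estimate of how often the correct final
    answer is reached when the trajectory is cut after step i (index 0 is the
    state before any reasoning step). This is an illustration, not the
    paper's actual rule.
    """
    rewards = []
    for prev, curr in zip(state_accuracies, state_accuracies[1:]):
        delta = curr - prev
        # A step that leaves accuracy unchanged earns no reward here;
        # a step that lowers accuracy earns a negative reward.
        rewards.append(0.0 if abs(delta) <= tolerance else delta)
    return rewards


if __name__ == "__main__":
    # Made-up accuracies for a five-state trajectory.
    print(stepwise_rewards([0.0, 0.5, 0.5, 1.0, 0.75]))
```

Under this assumption, steps that raise the chance of a correct answer are rewarded, steps that change nothing receive zero, and steps that hurt are penalized, which matches the abstract's stated goal of encouraging effective steps and penalizing ineffective ones.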

Related benchmarks

Task            Dataset        Metric    Result  Rank
Math Reasoning  MATH 500       Accuracy  89.8    38
Math Reasoning  AIME 2024      Accuracy  0.522   37
Math Reasoning  AIME 2025      Accuracy  36.4    33
Math Reasoning  AMC 2023       Accuracy  80.9    26
Math Reasoning  OlympiadBench  Accuracy  66.1    22