
Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling

About

While large reasoning models trained with critic-free reinforcement learning from verifiable rewards (RLVR) represent the state of the art, their practical utility is hampered by "overthinking": models generate excessively long reasoning paths with no performance benefit. Existing solutions that penalize length often fail, degrading performance due to a fundamental misalignment between trajectory-level rewards and token-level optimization. In this work, we introduce DECS, a framework built on our theoretical analysis of two previously unaddressed flaws in current length rewards: (1) the erroneous penalization of essential exploratory tokens and (2) the inadvertent rewarding of partial redundancy. The framework combines (i) a decoupled token-level reward mechanism that distinguishes and penalizes only redundant tokens, and (ii) a curriculum batch scheduling strategy that balances efficiency against efficacy. Across seven benchmarks, DECS reduces reasoning tokens by over 50% while maintaining or even improving performance, demonstrating that substantial gains in reasoning efficiency need not compromise a model's underlying reasoning power.
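The paper's exact reward formulation is not given in this abstract, but the core idea of a decoupled token-level reward can be sketched minimally: every token shares the trajectory-level outcome reward, while a length penalty is applied only to tokens flagged as redundant, leaving exploratory tokens untouched. The function name, the `redundancy_mask` input, and the `length_penalty` coefficient below are all hypothetical illustrations, not DECS's actual implementation.

```python
from typing import List

def decoupled_token_rewards(
    outcome_reward: float,
    redundancy_mask: List[bool],
    length_penalty: float = 0.01,
) -> List[float]:
    """Hypothetical sketch of a decoupled token-level reward.

    Every token inherits the trajectory-level outcome reward, but only
    tokens flagged as redundant receive the length penalty, so pressure
    to shorten never erases useful exploratory tokens.
    """
    return [
        outcome_reward - (length_penalty if redundant else 0.0)
        for redundant in redundancy_mask
    ]

# Example: a correct trajectory (outcome reward 1.0) whose last two
# tokens were judged redundant; only those two are penalized.
rewards = decoupled_token_rewards(1.0, [False, False, True, True])
```

Contrast this with a trajectory-level length penalty, which would subtract the same amount from every token's reward and thereby discourage the exploratory tokens the abstract argues are essential.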

Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang • 2025

Related benchmarks

Task            Dataset        Metric        Result  Rank
Math Reasoning  MATH 500       Accuracy (%)  93      38
Math Reasoning  AIME 2024      Accuracy (%)  51.3    37
Math Reasoning  AIME 2025      Accuracy (%)  36.4    33
Math Reasoning  AMC 2023       Accuracy (%)  89      26
Math Reasoning  OlympiadBench  Accuracy (%)  70.3    22
