
Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling

About

While large reasoning models trained with critic-free reinforcement learning from verifiable rewards (RLVR) represent the state of the art, their practical utility is hampered by "overthinking": models generate excessively long reasoning paths with no performance benefit. Existing solutions that penalize length often fail, degrading performance due to a fundamental misalignment between trajectory-level rewards and token-level optimization. In this work, we introduce DECS, a framework built on our theoretical analysis of two previously unaddressed flaws in current length rewards: (1) the erroneous penalization of essential exploratory tokens and (2) the inadvertent rewarding of partial redundancy. The framework combines (i) a decoupled token-level reward mechanism that distinguishes and penalizes only redundant tokens, and (ii) a curriculum batch scheduling strategy that balances efficiency against efficacy. Across seven benchmarks, DECS reduces reasoning tokens by over 50% while maintaining or even improving performance, demonstrating that substantial gains in reasoning efficiency need not compromise a model's underlying reasoning power.
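The paper's exact reward formulation is not given in this abstract, but the core idea of a decoupled token-level reward can be sketched minimally: every token shares the trajectory-level outcome reward, while a length penalty is applied only to tokens flagged as redundant, leaving exploratory tokens untouched. The function name, the `redundancy_mask` input, and the `length_penalty` coefficient below are all hypothetical illustrations, not DECS's actual implementation.

```python
from typing import List

def decoupled_token_rewards(
    outcome_reward: float,
    redundancy_mask: List[bool],
    length_penalty: float = 0.01,
) -> List[float]:
    """Hypothetical sketch of a decoupled token-level reward.

    Every token inherits the trajectory-level outcome reward, but only
    tokens flagged as redundant receive the length penalty, so pressure
    to shorten never erases useful exploratory tokens.
    """
    return [
        outcome_reward - (length_penalty if redundant else 0.0)
        for redundant in redundancy_mask
    ]

# Example: a correct trajectory (outcome reward 1.0) whose last two
# tokens were judged redundant; only those two are penalized.
rewards = decoupled_token_rewards(1.0, [False, False, True, True])
```

Contrast this with a trajectory-level length penalty, which would subtract the same amount from every token's reward and thereby discourage the exploratory tokens the abstract argues are essential.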

Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang • 2025

Related benchmarks

Task            Dataset        Metric        Result  Rank
Math Reasoning  MATH 500       Accuracy (%)  93      38
Math Reasoning  AIME 2024      Accuracy (%)  51.3    37
Math Reasoning  AIME 2025      Accuracy (%)  36.4    33
Math Reasoning  AMC 2023       Accuracy (%)  89      26
Math Reasoning  OlympiadBench  Accuracy (%)  70.3    22
