Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

About

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of Large Language Models (LLMs). However, existing methods mainly apply uniform optimization constraints across all tokens, ignoring their heterogeneous roles. Prior work shows that high-entropy tokens are closely tied to reasoning, while low-entropy tokens primarily encode factual knowledge, and recent approaches attempt to exploit this distinction by isolating token updates via masking or asynchronous training. We argue that such isolation breaks the sequential dependency structure of autoregressive generation, leading to suboptimal learning. To address this, we propose Archer, an entropy-aware RLVR framework with dual-token constraints that preserves joint optimization while modulating update strength across token types. Our method introduces response-level entropy normalization for stable token classification and applies differentiated clipping ranges and KL regularization to encourage exploration on reasoning tokens while preserving knowledge tokens. Experiments on mathematical reasoning and code generation benchmarks show that Archer consistently outperforms strong baselines across multiple model scales, improving both pass@1 and pass@K performance. These results highlight the importance of respecting sequence-level dependencies when designing fine-grained RL optimization strategies for LLMs.
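
The idea of dual-token constraints can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the per-response quantile rule used to classify tokens, the clipping ranges, the KL weights, and the choice of KL estimator are all assumptions made for illustration. It shows one response whose tokens are all optimized jointly, with a wider clip range and weaker KL penalty on high-entropy (reasoning) tokens and a tighter clip range and stronger KL penalty on low-entropy (knowledge) tokens.

```python
import math

def classify_tokens(entropies, quantile=0.8):
    """Flag tokens whose entropy reaches a quantile computed within this response.

    Assumption: "response-level" is read here as "threshold taken from the
    response's own entropies"; the 0.8 quantile is an illustrative value.
    Expects a non-empty list.
    """
    ranked = sorted(entropies)
    idx = min(len(ranked) - 1, int(quantile * len(ranked)))
    threshold = ranked[idx]
    return [h >= threshold for h in entropies]

def dual_clip_loss(logp_new, logp_old, logp_ref, advantages, entropies,
                   eps_reason=0.5, eps_know=0.2,
                   kl_reason=0.0, kl_know=0.001):
    """Mean per-token loss with type-dependent clip range and KL weight.

    All four hyperparameter defaults are hypothetical placeholders.
    """
    is_reasoning = classify_tokens(entropies)
    total = 0.0
    for ln, lo, lr, adv, reason in zip(logp_new, logp_old, logp_ref,
                                       advantages, is_reasoning):
        eps = eps_reason if reason else eps_know
        beta = kl_reason if reason else kl_know
        # PPO-style clipped surrogate on the importance ratio.
        ratio = math.exp(ln - lo)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        surrogate = min(ratio * adv, clipped * adv)
        # Non-negative per-token estimate of KL(new || ref); an assumed choice.
        log_r = lr - ln
        kl = math.exp(log_r) - log_r - 1.0
        total += -surrogate + beta * kl
    return total / len(entropies)
```

Every token contributes to the same loss, so no token is masked out of the update; only the strength of the constraint differs by token type.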

Jiakang Wang, Runze Liu, Fuzheng Zhang, Xiu Li, Guorui Zhou, Ling Pan • 2025

Related benchmarks

Task	Dataset	Metric	Value	
Mathematical Reasoning	MATH 500	Accuracy (Acc)	87.76	600
Mathematical Reasoning	AIME 2024	Accuracy	46.28	525
Mathematical Reasoning	AMC	Accuracy (%)	86.73	375
Mathematical Reasoning	AIME 2025	Accuracy	38.68	353
Mathematical Reasoning	Minerva 272	Accuracy (Minerva 272)	40.62	28
Mathematical Reasoning	OlympiadBench 675	Accuracy	53.19	28
Code Generation	LiveCodeBench v5 (2024.08.01-2025.02.01)	Average@8	42.1	16
Code Generation	LiveCodeBench 2025.02.01-2025.05.01 v6	Average Score (@16)	37.5	16
Code	LCB	AVG@8	31.8	16
Math	AIME 24	AVG@32	39.48	16
