
Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

About

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of Large Language Models (LLMs). However, existing methods mainly apply uniform optimization constraints across all tokens, ignoring their heterogeneous roles. Prior work shows that high-entropy tokens are closely tied to reasoning, while low-entropy tokens primarily encode factual knowledge, and recent approaches attempt to exploit this distinction by isolating token updates via masking or asynchronous training. We argue that such isolation breaks the sequential dependency structure of autoregressive generation, leading to suboptimal learning. To address this, we propose Archer, an entropy-aware RLVR framework with dual-token constraints that preserves joint optimization while modulating update strength across token types. Our method introduces response-level entropy normalization for stable token classification and applies differentiated clipping ranges and KL regularization to encourage exploration on reasoning tokens while preserving knowledge tokens. Experiments on mathematical reasoning and code generation benchmarks show that Archer consistently outperforms strong baselines across multiple model scales, improving both pass@1 and pass@K performance. These results highlight the importance of respecting sequence-level dependencies when designing fine-grained RL optimization strategies for LLMs.
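
The abstract gives no formulas, so the following is only a rough sketch of how dual-token constraints might look inside a PPO-style clipped objective. It is not the authors' implementation. The function names, the per-response quantile used to separate high-entropy from low-entropy tokens, and all numeric values (quantile, clip ranges, KL weights) are hypothetical choices made for illustration. The one thing taken from the abstract is the direction of the asymmetry: looser constraints on high-entropy (reasoning) tokens, tighter ones on low-entropy (knowledge) tokens, with every token still contributing to the same joint update.

```python
import numpy as np


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution at each position.

    logits: array of shape (T, V) for one response of T tokens over a vocabulary of size V.
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # shift for numerical stability
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(logp) * logp).sum(axis=-1)


def reasoning_mask(entropy, quantile=0.8):
    """Flag tokens whose entropy exceeds a threshold computed within this response.

    Assumption: "response-level" is read here as a per-response quantile threshold;
    the quantile value 0.8 is hypothetical.
    """
    return entropy > np.quantile(entropy, quantile)


def dual_clip_surrogate(ratio, advantage, is_reasoning, eps_reason=0.3, eps_know=0.1):
    """PPO-style clipped surrogate with a per-token clip range.

    All tokens stay in the objective (no masking); only the clip width differs.
    The eps values are hypothetical.
    """
    eps = np.where(is_reasoning, eps_reason, eps_know)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)


def dual_kl_penalty(kl, is_reasoning, beta_reason=0.0, beta_know=0.01):
    """Per-token KL penalty with a weaker weight on reasoning tokens.

    The beta values are hypothetical.
    """
    return np.where(is_reasoning, beta_reason, beta_know) * kl
```

In a training loop, the per-token loss under these assumptions would be `-(dual_clip_surrogate(...) - dual_kl_penalty(...))`, averaged over all tokens of all sampled responses, so that knowledge tokens are constrained more tightly but never dropped from the gradient.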

Jiakang Wang, Runze Liu, Fuzheng Zhang, Xiu Li, Guorui Zhou, Ling Pan • 2025

Related benchmarks

Task                    Dataset                                    Metric                   Result  Rank
Mathematical Reasoning  MATH 500                                   Accuracy (Acc)           87.76   543
Mathematical Reasoning  AIME 2024                                  Accuracy                 46.28   479
Mathematical Reasoning  AMC                                        Accuracy (%)             86.73   368
Mathematical Reasoning  AIME 2025                                  Accuracy                 38.68   311
Mathematical Reasoning  Minerva 272                                Accuracy (Minerva 272)   40.62   28
Mathematical Reasoning  OlympiadBench 675                          Accuracy                 53.19   28
Code Generation         LiveCodeBench v5 (2024.08.01-2025.02.01)   Average@8                42.1    16
Code Generation         LiveCodeBench 2025.02.01-2025.05.01 v6     Average Score (@16)      37.5    16
Code Generation         LiveCodeBench                              Average Score            29.75   5
Logical Reasoning       Logic-RL                                   Average Score            87.68   5