
Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward

About

While Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising post-training paradigm for Large Language Models (LLMs), its dependence on gold labels or domain-specific verifiers limits its scalability to new tasks and domains. In this work, we propose Verifier-free Intrinsic Gradient-Norm Reward (VIGOR), a simple reward that uses only the policy model itself. Given a prompt, VIGOR samples a group of completions and assigns higher within-group rewards to outputs that induce smaller $\ell_2$ norms of the teacher-forced negative log-likelihood gradients under the current parameters. Intuitively, a lower gradient norm suggests that the completion aligns better with the current policy, so it serves as an intrinsic preference signal for policy optimization. To make this intrinsic signal practical for RL, we correct the systematic length bias of averaged token-level gradients with a $\sqrt{T}$ scaling, and apply group-wise rank shaping to stabilize reward scales across prompts. Across mathematical reasoning benchmarks, VIGOR outperforms the state-of-the-art Reinforcement Learning from Internal Feedback (RLIF) baseline, and it also exhibits cross-domain transfer to code benchmarks when trained only on math data. For instance, on Qwen2.5-7B-Base post-trained on MATH, VIGOR improves the average math accuracy by +3.31% and the average code accuracy by +1.91% over this baseline, while exhibiting more stable training dynamics. The code is available at https://github.com/ZJUSCL/VIGOR.
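The reward shaping described in the abstract can be illustrated with a minimal sketch. It is not taken from the authors' repository: the function name `vigor_style_rewards`, the exact form of the length correction (multiplying the mean-token gradient norm by $\sqrt{T}$), and the linear mapping of ranks onto $[-1, 1]$ are all assumptions made for illustration. The gradient norms themselves are taken as given inputs here; in practice they would come from a backward pass through the policy model.

```python
import math


def vigor_style_rewards(grad_norms, lengths):
    """Hypothetical helper: map per-completion gradient norms to rank rewards.

    grad_norms: L2 norms of the token-averaged NLL gradient, one per completion.
    lengths:    completion lengths T in tokens, in the same order.
    Returns one reward per completion in [-1, 1]; smaller corrected norm -> higher reward.
    """
    if len(grad_norms) != len(lengths):
        raise ValueError("grad_norms and lengths must have the same length")

    # Assumed length correction: scale the averaged-gradient norm by sqrt(T)
    # so that long completions are not favored merely for being long.
    scores = [g * math.sqrt(t) for g, t in zip(grad_norms, lengths)]

    n = len(scores)
    if n == 1:
        return [0.0]  # a single sample carries no within-group preference

    # Group-wise rank shaping (assumed linear form): the lowest score gets +1,
    # the highest gets -1. Ties are broken by position, since sorted() is stable.
    order = sorted(range(n), key=lambda i: scores[i])
    rewards = [0.0] * n
    for rank, idx in enumerate(order):
        rewards[idx] = 1.0 - 2.0 * rank / (n - 1)
    return rewards


if __name__ == "__main__":
    # Four sampled completions for one prompt (illustrative numbers only).
    print(vigor_style_rewards([0.5, 0.2, 0.4, 0.1], [100, 400, 25, 900]))
```

Because only ranks within the group are used, the rewards have the same scale for every prompt regardless of the absolute size of the gradient norms, which is the stabilizing effect the abstract attributes to rank shaping.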

Xuexiang Wen, Hang Yu, Linchao Zhu, Gaoang Wang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Instruction Following | IFEval | -- | 836 |
| Mathematical Reasoning | MATH 500 | Accuracy (Acc): 62.8 | 543 |
| Mathematical Reasoning | AMC | Accuracy (%): 44.42 | 368 |
| Mathematical Reasoning | GSM8K | GSM8K Accuracy (%): 88.7 | 204 |
| Mathematical Reasoning | GSM8K | Accuracy: 77.1 | 93 |
| Code Generation | LiveCodeBench | -- | 84 |
| Code Reasoning | CRUX | Accuracy: 35.62 | 26 |
| Code Generation | CRUX | Score (%): 56.38 | 18 |
| Multi-Task Reasoning | MMLU-Pro | Pass@1: 43.09 | 18 |
| Mathematical Reasoning | MATH500 | MATH500 Score: 76.2 | 8 |
