
Not All Steps are Informative: On the Linearity of LLMs' RLVR Training

About

Reinforcement learning with verifiable rewards (RLVR) has become a central component of large language model (LLM) post-training. Unlike supervised fine-tuning (SFT), RLVR lets an LLM generate multiple candidate solutions and reinforces those that lead to a verifiably correct final answer. In practice, however, RLVR often requires thousands of training steps to reach strong performance, incurring substantial computation largely attributed to prolonged exploration. In this work, we make a surprising observation: during RLVR, LLMs evolve in a strongly linear manner. Specifically, both model weights and model output log-probabilities exhibit strong linear correlations with RL training steps. This suggests that RLVR predominantly amplifies trends that emerge early in training, rather than continuously discovering new behaviors throughout the entire optimization trajectory. Motivated by this linearity, we investigate whether future model states can be predicted from intermediate checkpoints via extrapolation, avoiding continued expensive training. We show that Weight Extrapolation produces models with performance comparable to standard RL training while requiring significantly less computation. Moreover, Logits Extrapolation consistently outperforms continued RL training on mathematics and code benchmarks by extrapolating beyond the step range where RL training remains stable. Our code is available at https://github.com/Miaow-Lab/RLVR-Linearity.
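The weight-extrapolation idea described above can be sketched in a few lines: fit a per-parameter linear trend across a handful of intermediate checkpoints, then evaluate that trend at a future step instead of continuing RL training. This is a minimal illustration under assumed inputs (a dict of flattened weight snapshots per parameter tensor), not the authors' implementation; the function name and data layout are hypothetical.

```python
import numpy as np

def extrapolate_weights(checkpoints, steps, target_step):
    """Linearly extrapolate model weights to a future training step.

    checkpoints: dict mapping parameter name -> list of 1-D weight
        arrays, one snapshot per saved checkpoint (hypothetical layout)
    steps: training-step indices at which the checkpoints were saved
    target_step: the future step whose weights we want to predict
    """
    steps = np.asarray(steps, dtype=float)
    predicted = {}
    for name, snaps in checkpoints.items():
        w = np.stack(snaps)  # shape (num_checkpoints, num_params)
        # Least-squares fit w ≈ slope * step + intercept, per parameter;
        # np.polyfit handles all parameters at once when y is 2-D.
        slope, intercept = np.polyfit(steps, w, deg=1)
        predicted[name] = slope * target_step + intercept
    return predicted
```

If the linearity observation holds, the same fit applied to output log-probabilities (rather than weights) would correspond to the paper's Logits Extrapolation variant.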

Tianle Wang, Zhongyuan Wu, Shenghao Jin, Hao Xu, Wei Chen, Ning Miao · 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | AIME 2024 | Accuracy | 14.6 | 104 |
| Mathematical Reasoning | Minerva | Accuracy | 28.6 | 62 |
| Multiple-choice Question Answering | MMLU-Pro | Biology Accuracy | 82.8 | 20 |
| Multi-task Language Understanding | MMLU Pro (test) | History Score | 62.8 | 20 |
| Mathematical Reasoning | Mathematical Reasoning Tasks (AMC23, Minerva) | AMC23 Score | 39.4 | 16 |
| Mathematical Reasoning | OlymMATH | Accuracy | 7.8 | 16 |
| Mathematical Reasoning | Aggregate Mathematical Tasks (AIME24/25, AMC23, Minerva, OlymMATH) | Average Score | 25 | 16 |
