
ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution

About

Standard training pipelines for large language models (LLMs) are typically unidirectional, progressing from pre-training to post-training. However, the potential for a bidirectional process, in which insights from post-training retroactively improve the pre-trained foundation, remains unexplored. We aim to establish a self-reinforcing flywheel: a cycle in which a reinforcement learning (RL)-tuned model strengthens the base model, which in turn enhances subsequent post-training performance, requiring no specially trained teacher or reference model. To realize this, we analyze training dynamics and identify the mid-training (annealing) phase as a critical turning point for model capabilities. This phase typically occurs at the end of pre-training and uses high-quality corpora under a rapidly decaying learning rate. Building on this insight, we introduce ReMiT (Reinforcement Learning-Guided Mid-Training). Specifically, ReMiT leverages the reasoning priors of RL-tuned models to dynamically reweight tokens during the mid-training phase, prioritizing those pivotal for reasoning. Empirically, ReMiT achieves an average improvement of 3% on 10 pre-training benchmarks spanning math, code, and general reasoning, and sustains gains of over 2% throughout the post-training pipeline. These results validate an iterative feedback loop, enabling continuous and self-reinforcing evolution of LLMs.
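The core mechanism, reweighting the mid-training loss by how much an RL-tuned model prefers each token, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the exponential-of-log-prob-gap weighting, and the mean-one normalization are hypothetical choices for clarity, not the paper's exact formulation.

```python
import math

def reweighted_nll(base_logprobs, rl_logprobs, alpha=1.0):
    """Hypothetical RL-guided token reweighting for mid-training.

    Tokens that the RL-tuned model assigns a higher log-probability than
    the base model does (a positive gap) are treated as pivotal for
    reasoning and up-weighted in the next-token loss. Weights are
    normalized to mean 1 so the overall loss scale is preserved.
    """
    # Per-token log-probability gap between RL-tuned and base model.
    gaps = [r - b for b, r in zip(base_logprobs, rl_logprobs)]
    # Exponential weighting (an assumed choice); alpha controls sharpness.
    raw = [math.exp(alpha * g) for g in gaps]
    mean = sum(raw) / len(raw)
    weights = [w / mean for w in raw]
    # Weighted negative log-likelihood of the base model's predictions.
    loss = -sum(w * lp for w, lp in zip(weights, base_logprobs)) / len(weights)
    return loss, weights

# Token 2 has the largest RL-vs-base gap, so it receives the largest weight.
loss, weights = reweighted_nll([-2.0, -1.0, -3.0], [-1.0, -1.5, -1.0])
```

Setting `alpha=0` recovers the uniform (standard) mid-training loss, so the sketch degrades gracefully to ordinary continued pre-training.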

Junjie Huang, Jiarui Qin, Di Yin, Weiwen Liu, Yong Yu, Xing Sun, Weinan Zhang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | – | – | 850 |
| Mathematical Reasoning | MATH | Accuracy | 31.68 | 535 |
| Instruction Following | IFEval | – | – | 292 |
| Science Question Answering | ARC Challenge | Accuracy | 54.69 | 234 |
| Graduate-level Question Answering | GPQA | Accuracy | 29.69 | 114 |
| Code Generation | MBPP | Accuracy | 49.6 | 90 |
| General Knowledge | MMLU-Pro | MMLU-Pro General Knowledge Score | 30.73 | 38 |
| Common Sense Reasoning | BBH | Accuracy | 58.27 | 27 |
| Aggregated Performance | Average 10 Tasks | Average Accuracy | 42.97 | 19 |
| Factuality | TruthfulQA | Accuracy | 31.95 | 18 |
