
Self-Improving Pretraining: using post-trained models to pretrain better models

About

Large language models are classically trained in stages: pretraining on raw text, followed by post-training for instruction following and reasoning. This separation creates a fundamental limitation: many desirable behaviors, such as safety, factuality, overall generation quality, and reasoning ability, are added only at a late stage, even though the patterns learned earlier strongly shape a model's capabilities. To address this, we introduce a new way to pretrain and mid-train models that incorporates these behaviors earlier. We use an existing strong, post-trained model both to rewrite pretraining data and to judge policy model rollouts, thereby bringing reinforcement into earlier stages of training. In our experiments, we show this approach can yield strong gains in generation quality, safety, factuality, and reasoning.

Ellen Xiaoqing Tan, Jack Lanchantin, Shehzaad Dhuliawala, Danwei Li, Thao Nguyen, Jing Xu, Ping Yu, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Xian Li, Olga Golovneva • 2026
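
The training loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under assumed interfaces, not the paper's implementation: `TeacherModel`, `PolicyModel`, `Rollout`, and the toy reward and update functions are hypothetical stand-ins for a real post-trained judge/rewriter and a real pretraining update.

```python
# Minimal sketch of self-improving pretraining: a strong post-trained
# "teacher" (1) rewrites raw pretraining documents and (2) judges
# rollouts from the policy being pretrained, so a reinforcement signal
# enters training earlier than classical post-training.
# All class and function names here are hypothetical stand-ins.

from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    prompt: str
    completion: str
    reward: float = 0.0


class TeacherModel:
    """Stand-in for an existing strong, post-trained model."""

    def rewrite(self, document: str) -> str:
        # Hypothetical: a real system would prompt the teacher to improve
        # the quality, safety, and factuality of the raw document.
        return document.strip()

    def judge(self, rollout: Rollout) -> float:
        # Hypothetical scalar reward in [0, 1]; a real judge would score
        # quality, safety, factuality, and reasoning of the completion.
        return min(1.0, len(rollout.completion.split()) / 50.0)


class PolicyModel:
    """Stand-in for the model being pretrained / mid-trained."""

    def generate(self, prompt: str) -> str:
        return prompt + " ..."  # placeholder continuation

    def update(self, batch: List[Rollout]) -> None:
        # Placeholder for a reward-weighted (policy-gradient-style) update.
        mean_reward = sum(r.reward for r in batch) / max(len(batch), 1)
        print(f"update on {len(batch)} rollouts, mean reward {mean_reward:.3f}")


def self_improving_pretraining(raw_docs: List[str],
                               teacher: TeacherModel,
                               policy: PolicyModel,
                               steps: int = 2) -> None:
    # 1) The teacher rewrites the raw pretraining data.
    clean_docs = [teacher.rewrite(d) for d in raw_docs]
    for _ in range(steps):
        # 2) The policy produces rollouts from prefixes of the clean data.
        rollouts = [Rollout(d[:40], policy.generate(d[:40])) for d in clean_docs]
        # 3) The teacher judges each rollout; the rewards drive the update.
        for r in rollouts:
            r.reward = teacher.judge(r)
        policy.update(rollouts)


if __name__ == "__main__":
    docs = ["raw web text about birds ...", "noisy forum post about math ..."]
    self_improving_pretraining(docs, TeacherModel(), PolicyModel())
```

The key design point the sketch tries to capture is that the same post-trained model plays two roles, data rewriter and reward model, so both the token stream and the reinforcement signal reflect post-training behaviors during pretraining itself.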

Related benchmarks

Task                    Dataset    Metric        Result  Rank
Common Sense Reasoning  HellaSwag  Accuracy      51.7    213
Common Sense Reasoning  BoolQ      Accuracy      70.3    212
Reasoning               ARC Easy   Accuracy      69.4    187
Reasoning               PIQA       Accuracy      75.8    145
Reasoning               ARC-C      Accuracy      35.7    80
Safety Evaluation       Toxigen    Safety        93.1    77
Reasoning               SIQA       Accuracy      46.8    44
Reasoning               MMLU       Accuracy      28.3    35
Reasoning               OBQA       Accuracy      30      26
Safety Evaluation       XSTest     Safety Score  88.4    23

Showing 10 of 21 rows.
