
Self-Improving Pretraining: using post-trained models to pretrain better models

About

Large language models are classically trained in stages: pretraining on raw text, followed by post-training for instruction following and reasoning. This separation creates a fundamental limitation: many desirable behaviors, such as safety, factuality, overall generation quality, and reasoning ability, are added only at a late stage, even though the patterns learned earlier strongly shape a model's capabilities. To address this, we introduce a new way to pretrain and mid-train models that incorporates these behaviors earlier. We use an existing strong, post-trained model both to rewrite pretraining data and to judge policy model rollouts, thereby bringing reinforcement into earlier stages of training. In our experiments, we show this approach can yield strong gains in generation quality, safety, factuality, and reasoning.

Ellen Xiaoqing Tan, Jack Lanchantin, Shehzaad Dhuliawala, Danwei Li, Thao Nguyen, Jing Xu, Ping Yu, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Xian Li, Olga Golovneva • 2026
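
The training loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under assumed interfaces, not the paper's implementation: `TeacherModel`, `PolicyModel`, `Rollout`, and the toy reward and update functions are hypothetical stand-ins for a real post-trained judge/rewriter and a real pretraining update.

```python
# Minimal sketch of self-improving pretraining: a strong post-trained
# "teacher" (1) rewrites raw pretraining documents and (2) judges
# rollouts from the policy being pretrained, so a reinforcement signal
# enters training earlier than classical post-training.
# All class and function names here are hypothetical stand-ins.

from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    prompt: str
    completion: str
    reward: float = 0.0


class TeacherModel:
    """Stand-in for an existing strong, post-trained model."""

    def rewrite(self, document: str) -> str:
        # Hypothetical: a real system would prompt the teacher to improve
        # the quality, safety, and factuality of the raw document.
        return document.strip()

    def judge(self, rollout: Rollout) -> float:
        # Hypothetical scalar reward in [0, 1]; a real judge would score
        # quality, safety, factuality, and reasoning of the completion.
        return min(1.0, len(rollout.completion.split()) / 50.0)


class PolicyModel:
    """Stand-in for the model being pretrained / mid-trained."""

    def generate(self, prompt: str) -> str:
        return prompt + " ..."  # placeholder continuation

    def update(self, batch: List[Rollout]) -> None:
        # Placeholder for a reward-weighted (policy-gradient-style) update.
        mean_reward = sum(r.reward for r in batch) / max(len(batch), 1)
        print(f"update on {len(batch)} rollouts, mean reward {mean_reward:.3f}")


def self_improving_pretraining(raw_docs: List[str],
                               teacher: TeacherModel,
                               policy: PolicyModel,
                               steps: int = 2) -> None:
    # 1) The teacher rewrites the raw pretraining data.
    clean_docs = [teacher.rewrite(d) for d in raw_docs]
    for _ in range(steps):
        # 2) The policy produces rollouts from prefixes of the clean data.
        rollouts = [Rollout(d[:40], policy.generate(d[:40])) for d in clean_docs]
        # 3) The teacher judges each rollout; the rewards drive the update.
        for r in rollouts:
            r.reward = teacher.judge(r)
        policy.update(rollouts)


if __name__ == "__main__":
    docs = ["raw web text about birds ...", "noisy forum post about math ..."]
    self_improving_pretraining(docs, TeacherModel(), PolicyModel())
```

The key design point the sketch tries to capture is that the same post-trained model plays two roles, data rewriter and reward model, so both the token stream and the reinforcement signal reflect post-training behaviors during pretraining itself.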

Related benchmarks

Task                    Dataset    Metric        Result  Rank
Common Sense Reasoning  HellaSwag  Accuracy      51.7    213
Common Sense Reasoning  BoolQ      Accuracy      70.3    212
Reasoning               ARC Easy   Accuracy      69.4    187
Reasoning               PIQA       Accuracy      75.8    145
Reasoning               ARC-C      Accuracy      35.7    80
Safety Evaluation       Toxigen    Safety        93.1    77
Reasoning               SIQA       Accuracy      46.8    44
Reasoning               MMLU       Accuracy      28.3    35
Reasoning               OBQA       Accuracy      30      26
Safety Evaluation       XSTest     Safety Score  88.4    23

Showing 10 of 21 rows.
