Optimal Splitting of Language Models from Mixtures to Specialized Domains

About

Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of available pretraining data. The standard training recipe is a two-stage paradigm: pretraining on the full corpus of data, followed by specialization on a high-quality, specialized subset of that corpus. In the multi-domain setting, this involves continued pretraining of multiple models, one for each specialized domain, referred to as split model training. We propose a method for pretraining multiple models independently over a general pretraining corpus and determining the optimal compute allocation between pretraining and continued pretraining using scaling laws. Our approach accurately predicts the loss of a model of size N trained on D pretraining and D' specialization tokens, and extrapolates to larger model sizes and token counts. Applied to language model training, our approach consistently improves performance on common-sense knowledge and reasoning benchmarks across different model sizes and compute budgets.

Skyler Seto, Pierre Ablin, Anastasiia Filippova, Jiayuan Ye, Louis Bethune, Angelos Katharopoulos, David Grangier • 2026
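The abstract states the inputs to the scaling law (model size N, pretraining tokens D, specialization tokens D') but not its functional form. As a rough, hypothetical sketch only, the Python snippet below fits an assumed Chinchilla-style parametric loss with an added specialization-token term to synthetic data using scipy; the functional form, parameter names, and data are illustrative assumptions, not the authors' formulation.

import numpy as np
from scipy.optimize import curve_fit

def specialized_loss(x, E, A, alpha, B, beta, C, gamma):
    # Assumed (hypothetical) parametric form: L = E + A/N^alpha + B/D^beta + C/D'^gamma
    N, D, Dp = x
    return E + A / N**alpha + B / D**beta + C / Dp**gamma

# Synthetic observations generated from the assumed form, standing in for
# measured (N, D, D', loss) points from real training runs.
rng = np.random.default_rng(0)
N = rng.uniform(1e8, 1e10, size=32)    # model sizes (parameters)
D = rng.uniform(1e9, 1e11, size=32)    # pretraining tokens
Dp = rng.uniform(1e8, 1e10, size=32)   # specialization tokens
true_params = (1.7, 400.0, 0.34, 410.0, 0.28, 80.0, 0.25)
loss = specialized_loss((N, D, Dp), *true_params) + rng.normal(0.0, 0.01, size=32)

# Fit the free parameters of the scaling law; bounds keep them positive.
popt, _ = curve_fit(
    specialized_loss, (N, D, Dp), loss,
    p0=[2.0, 300.0, 0.3, 300.0, 0.3, 50.0, 0.3],
    bounds=(0.0, np.inf),
)

# Extrapolate: predicted loss for a larger model and token budget.
print(specialized_loss((3e9, 6e10, 3e9), *popt))

Once such a law is fitted, comparing predicted losses across candidate (D, D') splits for a fixed compute budget gives the kind of allocation decision the abstract describes.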

Related benchmarks

Task                                | Dataset                               | Metric                  | Result | Rank
Question Answering                  | ARC Challenge                         | Accuracy                | 45.82  | 906
Question Answering                  | ARC Easy                              | Accuracy                | 73.74  | 597
Question Answering                  | PIQA                                  | Accuracy                | 78.29  | 374
Question Answering                  | BoolQ                                 | --                      | --     | 317
Question Answering                  | SciQ                                  | --                      | --     | 283
Question Answering                  | WinoGrande (WG)                       | Accuracy                | 63.22  | 124
Multiple-choice Question Answering  | HellaSwag                             | Accuracy                | 69.37  | 93
Question Answering                  | MMLU                                  | Normalized Log Accuracy | 38.54  | 7
Language Modeling                   | Clustered Pre-training Data Cluster 0 | Perplexity (PPL)        | 19.819 | 2
Language Modeling                   | Clustered Pre-training Data Cluster 1 | Perplexity (PPL)        | 17.731 | 2

Showing 10 of 33 rows.
