Optimal Splitting of Language Models from Mixtures to Specialized Domains

About

Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of available pretraining data. The standard training recipe is a two-stage paradigm: pretraining on the full corpus of data, followed by specialization on a high-quality, specialized subset of that corpus. In the multi-domain setting, this involves continued pretraining of multiple models, one for each specialized domain, referred to as split model training. We propose a method for pretraining multiple models independently over a general pretraining corpus and determining the optimal compute allocation between pretraining and continued pretraining using scaling laws. Our approach accurately predicts the loss of a model of size N trained on D pretraining and D' specialization tokens, and extrapolates to larger model sizes and token counts. Applied to language model training, our approach consistently improves performance on common-sense knowledge and reasoning benchmarks across different model sizes and compute budgets.

Skyler Seto, Pierre Ablin, Anastasiia Filippova, Jiayuan Ye, Louis Bethune, Angelos Katharopoulos, David Grangier • 2026
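The abstract states the inputs to the scaling law (model size N, pretraining tokens D, specialization tokens D') but not its functional form. As a rough, hypothetical sketch only, the Python snippet below fits an assumed Chinchilla-style parametric loss with an added specialization-token term to synthetic data using scipy; the functional form, parameter names, and data are illustrative assumptions, not the authors' formulation.

import numpy as np
from scipy.optimize import curve_fit

def specialized_loss(x, E, A, alpha, B, beta, C, gamma):
    # Assumed (hypothetical) parametric form: L = E + A/N^alpha + B/D^beta + C/D'^gamma
    N, D, Dp = x
    return E + A / N**alpha + B / D**beta + C / Dp**gamma

# Synthetic observations generated from the assumed form, standing in for
# measured (N, D, D', loss) points from real training runs.
rng = np.random.default_rng(0)
N = rng.uniform(1e8, 1e10, size=32)    # model sizes (parameters)
D = rng.uniform(1e9, 1e11, size=32)    # pretraining tokens
Dp = rng.uniform(1e8, 1e10, size=32)   # specialization tokens
true_params = (1.7, 400.0, 0.34, 410.0, 0.28, 80.0, 0.25)
loss = specialized_loss((N, D, Dp), *true_params) + rng.normal(0.0, 0.01, size=32)

# Fit the free parameters of the scaling law; bounds keep them positive.
popt, _ = curve_fit(
    specialized_loss, (N, D, Dp), loss,
    p0=[2.0, 300.0, 0.3, 300.0, 0.3, 50.0, 0.3],
    bounds=(0.0, np.inf),
)

# Extrapolate: predicted loss for a larger model and token budget.
print(specialized_loss((3e9, 6e10, 3e9), *popt))

Once such a law is fitted, comparing predicted losses across candidate (D, D') splits for a fixed compute budget gives the kind of allocation decision the abstract describes.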

Related benchmarks

Task                                | Dataset                               | Metric                  | Result | Rank
Question Answering                  | ARC Challenge                         | Accuracy                | 45.82  | 906
Question Answering                  | ARC Easy                              | Accuracy                | 73.74  | 597
Question Answering                  | PIQA                                  | Accuracy                | 78.29  | 374
Question Answering                  | BoolQ                                 | --                      | --     | 317
Question Answering                  | SciQ                                  | --                      | --     | 283
Question Answering                  | WinoGrande (WG)                       | Accuracy                | 63.22  | 124
Multiple-choice Question Answering  | HellaSwag                             | Accuracy                | 69.37  | 93
Question Answering                  | MMLU                                  | Normalized Log Accuracy | 38.54  | 7
Language Modeling                   | Clustered Pre-training Data Cluster 0 | Perplexity (PPL)        | 19.819 | 2
Language Modeling                   | Clustered Pre-training Data Cluster 1 | Perplexity (PPL)        | 17.731 | 2

Showing 10 of 33 rows.
