
Scaling Laws for Optimal Data Mixtures

About

Large foundation models are typically trained on data from multiple domains, with the data mixture (the proportion of each domain used) playing a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach accurately predicts the loss of a model of size $N$ trained with $D$ tokens and a specific domain weight vector $h$. We validate the universality of these scaling laws by demonstrating their predictive power in three distinct, large-scale settings: large language model (LLM), native multimodal model (NMM), and large vision model (LVM) pretraining. We further show that these scaling laws can extrapolate to new data mixtures and across scales: their parameters can be accurately estimated using a few small-scale training runs, and then used to estimate the performance at larger scales and unseen domain weights. The scaling laws also allow us to derive the optimal domain weights for any target domain under a given training budget ($N$, $D$), providing a principled alternative to costly trial-and-error methods.
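The workflow the abstract describes, fitting a loss predictor $L(N, D, h)$ and then minimizing it over the domain-weight simplex, can be sketched as follows. The functional form and all coefficient values below are illustrative assumptions, not the paper's actual parameterization: we take a Chinchilla-style additive law whose coefficients depend linearly on the weight vector $h$, and search the 3-domain simplex by grid.

```python
import math

# Hypothetical mixture-aware scaling law (illustrative form only; the paper's
# exact parameterization is not reproduced here):
#   L(N, D, h) = E + A(h) / N**alpha + B(h) / D**beta
# where A(h) and B(h) depend linearly on the domain-weight vector h.
# All coefficient values are made up for the sketch.

def predicted_loss(N, D, h, E=1.7, alpha=0.34, beta=0.28,
                   a=(20.0, 35.0, 25.0), b=(410.0, 300.0, 520.0)):
    """Predicted loss for model size N, token count D, domain weights h."""
    A = sum(ai * hi for ai, hi in zip(a, h))  # mixture-dependent model term
    B = sum(bi * hi for bi, hi in zip(b, h))  # mixture-dependent data term
    return E + A / N**alpha + B / D**beta

def optimal_mixture(N, D, step=0.05):
    """Grid-search the 3-domain probability simplex for the domain weights
    that minimize the predicted loss at a fixed training budget (N, D)."""
    best_h, best_loss = None, math.inf
    steps = int(round(1 / step))
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            h = (i * step, j * step, 1.0 - (i + j) * step)
            loss = predicted_loss(N, D, h)
            if loss < best_loss:
                best_h, best_loss = h, loss
    return best_h, best_loss

# Derive the optimal mixture for a hypothetical 1B-parameter, 20B-token budget.
h_star, loss_star = optimal_mixture(N=1e9, D=2e10)
```

In practice the coefficients would be fit to a few small-scale runs (as the abstract notes), and a constrained optimizer on the simplex would replace the grid search; the grid keeps the sketch dependency-free.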

Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, Pierre Ablin • 2025

Related benchmarks

Task | Dataset | Result | Rank
Code Generation | HumanEval | -- | 1036
Language Understanding | MMLU | Accuracy: 56.6 | 825
Science Question Answering | ARC Challenge | Accuracy: 54.1 | 342
General Reasoning | BBH | Accuracy: 45.3 | 98
Mathematics | MATH | Accuracy: 39.9 | 85
Chinese Language Understanding | C-Eval | Accuracy: 62.8 | 56
Aggregated LLM Evaluation | Balanced Objective Aggregate Suite | Weighted Average Score: 52.6 | 5
Large Language Model Evaluation | Math Specialized Target (test) | Weighted Average Score: 49.4 | 4
Large Language Model Evaluation | Code Specialized Target (test) | Weighted Average Score: 52 | 4
Large Language Model Evaluation | Knowledge Specialized Target (test) | Weighted Average Score: 55.9 | 4
