Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

About

The foundational capabilities of large language models are acquired during pretraining on internet-scale, highly heterogeneous data mixtures. In this work, we investigate an interesting geometric question regarding the converged state of pretraining: Does the model converge to a common minimizer across all data sources (e.g., \cref{fig:cwa_illustration:close}), or merely a minimizer of the summed loss (e.g., \cref{fig:cwa_illustration:distant})? We hypothesize that the geometric "closeness" of task-specific minima is intrinsically linked to downstream generalization. We reveal that standard optimizers (e.g., AdamW) often converge to points where task-specific minima are distant from each other. To address this, we propose the Nexus optimizer, which encourages the closeness of these minima by maximizing gradient similarity during optimization. Experiments across models ranging from 130M to 3B parameters, various data mixtures and hyperparameter schedules, show that Nexus \textit{significantly boosts downstream performance}, despite \textit{achieving the same pretraining loss} (see \cref{fig:demo:benchmark}). Notably, on the 3B model, Nexus reduces the out-of-distribution loss by 0.012 and yields up to a 15.0\% accuracy improvement on complex reasoning tasks (e.g., GSM8k). This finding challenges the reliance on pretraining loss as the sole proxy for model evaluation and demonstrates the importance of implicit biases in unlocking downstream generalization.

Huanran Chen, Huaqing Zhang, Xiao Li, Yinpeng Dong, Ke Shen, Jun Zhu• 2026

Related benchmarks

Task	Dataset	Result
Multiple-choice Question Answering	MMLU	Accuracy48.9	210
Graduate-level Science Question Answering	GPQA	Accuracy (GPQA)30.4	72
Language Modeling	Pre-training (val)	Validation Loss1.602	55
Mathematical Reasoning	GSM8K	Accuracy59	38
Reasoning	GPQA	Accuracy29.6	21
Reasoning	GPQA D	Accuracy23.4	15
Language Modeling	Pre-training corpus	Loss1.602	9
Language Modeling	OOD	Loss1.29	7
Logical reasoning	BBH	--	6
Aggregated Performance	Downstream Average All	Accuracy40.3	4

Showing 10 of 25 rows

Other info

Follow for update

@wizwand_team Discord