
Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning

About

We investigate the role of learning rate scheduling in large-scale language model pre-training, focusing on its influence on downstream performance after supervised fine-tuning (SFT). Decay-based learning rate schedulers are widely used to minimize pre-training loss. However, despite their widespread use, how these schedulers affect performance after SFT remains underexplored. In this paper, we examine Warmup-Stable-Only (WSO), which maintains a constant learning rate after warmup without any decay. Through experiments with 1B- and 8B-parameter models, we show that WSO consistently outperforms decay-based schedulers in post-SFT performance, even though decay-based schedulers may achieve better performance after pre-training. This result also holds across mid-training and over-training regimes. Loss landscape analysis further reveals that decay-based schedulers lead models into sharper minima, whereas WSO preserves flatter minima that support adaptability. These findings indicate that applying LR decay to improve pre-training metrics may compromise downstream adaptability. Our work also provides practical guidance for training and model release strategies, highlighting that pre-training models with WSO enhances their adaptability for downstream tasks.

Kazuki Yano, Shun Kiyono, Sosuke Kobayashi, Sho Takase, Jun Suzuki • 2026
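The two scheduler families contrasted in the abstract can be sketched as plain learning-rate functions of the training step: WSO warms up and then holds the peak rate, while a typical decay-based schedule (cosine shown here) anneals it toward zero. This is an illustrative sketch, not the paper's implementation; the peak rate, warmup length, and total-step values below are placeholder assumptions.

```python
import math

def wso_lr(step, peak_lr=3e-4, warmup_steps=2000):
    """Warmup-Stable-Only: linear warmup to peak_lr, then held constant (no decay)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear warmup
    return peak_lr  # stable phase for the remainder of pre-training

def cosine_lr(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=0.0):
    """Typical decay baseline: linear warmup followed by cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The difference the paper studies is visible at the end of training: `cosine_lr(100_000)` has decayed to `min_lr`, while `wso_lr` still returns the peak rate, which the authors link to flatter minima and better post-SFT adaptability.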

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multitask Language Understanding | MMLU | Accuracy | 42.9 | 413 |
| Instruction Following | AlpacaEval | Win Rate | 79.4 | 227 |
| Logical Reasoning | BBH | Accuracy | 31.2 | 201 |
| Mathematical Reasoning | GSM8K | Math Score | 54.7 | 197 |
| Reading Comprehension | DROP | DROP Accuracy | 19.4 | 111 |
| Multitask Knowledge | MMLU | Accuracy | 36.6 | 53 |
| General Intelligence | AGI-Eval | AGI Eval Score | 40.2 | 24 |
| General Language Modeling Performance | Aggregate (AlpacaEval, TruthfulQA, GSM8K, DROP, AGI Eval, BBH, MMLU) | Average Score | 44.7 | 16 |
| Reading Comprehension | DROP | DROP Score | 36.4 | 16 |
| Truthfulness | TruthfulQA | TruthfulQA | 38.1 | 8 |

Showing 10 of 11 rows.
