
Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning

About

We investigate the role of learning rate scheduling in large-scale language model pre-training, focusing on its influence on downstream performance after supervised fine-tuning (SFT). Decay-based learning rate schedulers are widely used to minimize pre-training loss. However, despite their widespread use, how these schedulers affect performance after SFT remains underexplored. In this paper, we examine Warmup-Stable-Only (WSO), which maintains a constant learning rate after warmup without any decay. Through experiments with 1B- and 8B-parameter models, we show that WSO consistently outperforms decay-based schedulers in post-SFT performance, even though decay-based schedulers may achieve better performance after pre-training. This result also holds across mid-training and over-training regimes. Loss landscape analysis further reveals that decay-based schedulers lead models into sharper minima, whereas WSO preserves flatter minima that support adaptability. These findings indicate that applying LR decay to improve pre-training metrics may compromise downstream adaptability. Our work also provides practical guidance for training and model release strategies, highlighting that pre-training models with WSO enhances their adaptability for downstream tasks.

Kazuki Yano, Shun Kiyono, Sosuke Kobayashi, Sho Takase, Jun Suzuki • 2026
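The two scheduler families contrasted in the abstract can be sketched as plain learning-rate functions of the training step: WSO warms up and then holds the peak rate, while a typical decay-based schedule (cosine shown here) anneals it toward zero. This is an illustrative sketch, not the paper's implementation; the peak rate, warmup length, and total-step values below are placeholder assumptions.

```python
import math

def wso_lr(step, peak_lr=3e-4, warmup_steps=2000):
    """Warmup-Stable-Only: linear warmup to peak_lr, then held constant (no decay)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear warmup
    return peak_lr  # stable phase for the remainder of pre-training

def cosine_lr(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=0.0):
    """Typical decay baseline: linear warmup followed by cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The difference the paper studies is visible at the end of training: `cosine_lr(100_000)` has decayed to `min_lr`, while `wso_lr` still returns the peak rate, which the authors link to flatter minima and better post-SFT adaptability.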

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multitask Language Understanding | MMLU | Accuracy | 42.9 | 413 |
| Instruction Following | AlpacaEval | Win Rate | 79.4 | 227 |
| Logical Reasoning | BBH | Accuracy | 31.2 | 201 |
| Mathematical Reasoning | GSM8K | Math Score | 54.7 | 197 |
| Reading Comprehension | DROP | DROP Accuracy | 19.4 | 111 |
| Multitask Knowledge | MMLU | Accuracy | 36.6 | 53 |
| General Intelligence | AGI-Eval | AGI Eval Score | 40.2 | 24 |
| General Language Modeling Performance | Aggregate (AlpacaEval, TruthfulQA, GSM8K, DROP, AGI Eval, BBH, MMLU) | Average Score | 44.7 | 16 |
| Reading Comprehension | DROP | DROP Score | 36.4 | 16 |
| Truthfulness | TruthfulQA | TruthfulQA | 38.1 | 8 |

Showing 10 of 11 rows.
