
Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs

About

LLMs are commonly trained with a learning rate (LR) warmup, followed by cosine decay to 10% of the maximum (10x decay). In a large-scale empirical study, we show that under an optimal peak LR, a simple linear decay-to-zero (D2Z) schedule consistently outperforms other schedules when training at compute-optimal dataset sizes. D2Z is superior across a range of model sizes, batch sizes, datasets, and vocabularies. Benefits increase as dataset size increases. Leveraging a novel interpretation of AdamW as an exponential moving average of weight updates, we show how linear D2Z optimally balances the demands of early training (moving away from initial conditions) and late training (averaging over more updates in order to mitigate gradient noise). In experiments, a 610M-parameter model trained for 80 tokens-per-parameter (TPP) using D2Z achieves lower loss than when trained for 200 TPP using 10x decay, corresponding to an astonishing 60% compute savings. Models such as Llama2-7B, trained for 286 TPP with 10x decay, could likely have saved a majority of compute by training with D2Z.
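For concreteness, here is a minimal sketch of the two schedules being compared: linear warmup to a peak LR, followed by either cosine decay to 10% of the peak (the common "10x decay") or linear decay straight to zero (D2Z). The function name, step counts, and peak LR below are illustrative placeholders, not values or code from the paper.

```python
# Sketch (not the authors' code): warmup + decay schedules compared in the abstract.
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_steps: int, schedule: str = "d2z") -> float:
    """Learning rate at a given training step."""
    if step < warmup_steps:
        # Linear warmup from ~0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps

    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)

    if schedule == "d2z":
        # Linear decay-to-zero: reaches 0 exactly at the final step.
        return peak_lr * (1.0 - progress)
    elif schedule == "cosine_10x":
        # Cosine decay ending at 10% of the peak LR (a 10x decay).
        min_lr = 0.1 * peak_lr
        return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
    else:
        raise ValueError(f"unknown schedule: {schedule}")


if __name__ == "__main__":
    # Example: compare the two schedules at a few points in training
    # (hypothetical step counts and peak LR, chosen only for illustration).
    total, warmup, peak = 10_000, 500, 3e-4
    for s in (0, 499, 2_500, 5_000, 9_999):
        d2z = lr_at_step(s, total, peak, warmup, "d2z")
        cos = lr_at_step(s, total, peak, warmup, "cosine_10x")
        print(f"step {s:>5}: D2Z lr={d2z:.2e}  cosine-10x lr={cos:.2e}")
```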

Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, Joel Hestness • 2025

Related benchmarks

Task | Dataset | Result | Rank
Multitask Language Understanding | MMLU | Accuracy 46.8 | 413
Instruction Following | AlpacaEval | Win Rate 76.3 | 227
Logical Reasoning | BBH | Accuracy 31 | 201
Mathematical Reasoning | GSM8K | Math Score 43.5 | 197
Reading Comprehension | DROP | DROP Accuracy 19.6 | 111
Multitask Knowledge | MMLU | Accuracy 35 | 53
General Intelligence | AGI-Eval | AGI Eval Score 33.6 | 24
Reading Comprehension | DROP | DROP Score 30.5 | 16
General Language Modeling Performance | Aggregate (AlpacaEval, TruthfulQA, GSM8K, DROP, AGI Eval, BBH, MMLU) | Average Score 41.4 | 16
Truthfulness | TruthfulQA | TruthfulQA 39.4 | 8
