The Road Less Scheduled

About

Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly out-performed by learning rate schedules that depend on T. We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely, while exhibiting state-of-the-art performance compared to schedules across a wide family of problems ranging from convex problems to large-scale deep learning problems. Our Schedule-Free approach introduces no additional hyper-parameters over standard optimizers with momentum. Our method is a direct consequence of a new theory we develop that unifies scheduling and iterate averaging. An open source implementation of our method is available at https://github.com/facebookresearch/schedule_free. Schedule-Free AdamW is the core algorithm behind our winning entry to the MLCommons 2024 AlgoPerf Algorithmic Efficiency Challenge Self-Tuning track.

Aaron Defazio, Xingyu Alice Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, Ashok Cutkosky• 2024

Related benchmarks

Task	Dataset	Result
Language Modeling	FineWeb 100B (val)	--	28
Language Model Pre-training	C4 Llama-160M scratch (val)	Validation Loss3.1051	20
Language Model Pre-training	C4 Llama-1B scratch (val)	Validation Loss2.638	16
Image Classification	CIFAR-10 (test)	Test Accuracy (η_init=10^-3)86.5	9
Pretraining	Llama pretraining 124M	Iteration Time (s)0.684	4

Showing 5 of 5 rows

Other info

Follow for update

@wizwand_team Discord