Cautious Optimizers: Improving Training with One Line of Code
About
AdamW has been the default optimizer for transformer pretraining. For many years, our community has searched for faster and more stable optimizers, with only limited success. In this work, we propose a **one-line modification in PyTorch** to any momentum-based optimizer, which we rename the cautious optimizer (e.g., C-AdamW and C-Lion). Our theoretical result shows that this modification preserves Adam's Hamiltonian function and does not break the convergence guarantee under Lyapunov analysis. In addition, our theoretical insight reveals a whole new family of optimizers. Among them, we pick the simplest one for empirical experiments, showing consistent speed-ups not only on LLM pretraining but also on image classification, with minimal extra hyperparameter tuning. Code is available at https://github.com/kyleliang919/C-Optim.
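The abstract does not reproduce the modification itself. As a rough illustration, the sketch below shows one reading of the "cautious" idea, assuming the modification masks out update coordinates whose sign disagrees with the current gradient. The function name `cautious_mask`, the `eps` constant, and the mean-based rescaling are our own illustrative choices; see the linked repository for the authors' exact one-liner.

```python
import torch

def cautious_mask(update: torch.Tensor, grad: torch.Tensor,
                  eps: float = 1e-8) -> torch.Tensor:
    """Sketch of a cautious update (assumed form, not the authors' exact code).

    `update` is the optimizer's proposed step (e.g., Adam's preconditioned
    momentum) and `grad` is the current gradient. Coordinates where the two
    disagree in sign are zeroed, and the rest are rescaled so the average
    update magnitude stays roughly unchanged.
    """
    mask = (update * grad > 0).to(update.dtype)  # 1 where signs agree, else 0
    return update * mask / (mask.mean() + eps)
```

In an AdamW-style step, this would wrap the proposed update just before it is applied, e.g. `p.data.add_(cautious_mask(update, p.grad), alpha=-lr)` instead of `p.data.add_(update, alpha=-lr)`.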
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | ARC Easy | Accuracy | 60.9 | 386 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 67.68 | 329 |
| Common Sense Reasoning | HellaSwag | Accuracy | 41.93 | 164 |
| Multi-task Language Understanding | MMLU | Accuracy | 25.35 | 87 |
| Language Modeling | Lambada OpenAI | Accuracy | 32.29 | 61 |
| Question Answering | ARC Challenge | Normalized Accuracy | 29.78 | 48 |
| Language Model Pre-training | C4 Llama 2 pre-training (val) | Perplexity | 15.92 | 47 |
| Image Classification | Mini-ImageNet | Top-1 Accuracy | 74.91 | 6 |