Adam-mini: Use Fewer Learning Rates To Gain More

About

We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with a 50% smaller memory footprint. Adam-mini reduces memory by cutting down the learning-rate resources in Adam (i.e., $1/\sqrt{v}$). By investigating the Hessian structure of neural nets, we find that Adam's $v$ might not function at its full potential. We find that $\geq$ 99.9% of the learning rates in $v$ could be harmlessly removed if we (1) carefully partition the parameters into blocks following our proposed principle based on the Hessian structure, and (2) assign a single, well-chosen learning rate to each parameter block. We then provide one simple way to find good learning rates, which yields Adam-mini. Empirically, we verify that Adam-mini performs on par with or better than AdamW on various language models ranging from 39M to 13B parameters for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overhead among GPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama 2-7B on $2\times$ A800-80GB GPUs, saving 33% of the wall-clock time for pre-training.
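To make the block-wise idea concrete, below is a minimal PyTorch-style sketch, not the authors' released implementation. It assumes the parameters have already been partitioned into blocks along the lines of the Hessian-based principle; the class name `AdamMiniSketch` and its arguments are illustrative. The first moment $m$ is kept per parameter as in Adam, while the second moment $v$ is reduced to a single scalar per block, which is what removes the vast majority of the learning-rate state.

```python
import torch


class AdamMiniSketch:
    """Illustrative sketch of the Adam-mini idea (not the official optimizer):
    keep one scalar second-moment estimate per parameter block instead of
    one per parameter."""

    def __init__(self, param_blocks, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        # `param_blocks`: list of tensors, each tensor being one block chosen
        # by the (assumed) Hessian-based partitioning rule.
        self.blocks = list(param_blocks)
        self.lr = lr
        self.b1, self.b2 = betas
        self.eps = eps
        # Per-parameter first moment, exactly as in Adam.
        self.m = [torch.zeros_like(p) for p in self.blocks]
        # A single scalar second moment per block (the memory saving).
        self.v = [torch.zeros((), device=p.device, dtype=p.dtype) for p in self.blocks]
        self.t = 0

    @torch.no_grad()
    def step(self):
        self.t += 1
        for p, m, v in zip(self.blocks, self.m, self.v):
            if p.grad is None:
                continue
            g = p.grad
            # First moment: standard Adam EMA of gradients.
            m.mul_(self.b1).add_(g, alpha=1 - self.b1)
            # Second moment: EMA of the MEAN squared gradient over the block,
            # so the whole block shares one learning-rate scalar.
            v.mul_(self.b2).add_((g * g).mean(), alpha=1 - self.b2)
            # Bias correction and update, mirroring Adam's form.
            m_hat = m / (1 - self.b1 ** self.t)
            v_hat = v / (1 - self.b2 ** self.t)
            p.add_(m_hat / (v_hat.sqrt() + self.eps), alpha=-self.lr)
```

Averaging $g \odot g$ over a block is one simple way (as the abstract puts it) to pick the shared learning rate for that block; the scalar $v$ plays the role that Adam's per-coordinate $v$ would otherwise play for every parameter in the block.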

Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P. Kingma, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun • 2024

Related benchmarks

| Task              | Dataset                | Metric           | Result | Rank |
|-------------------|------------------------|------------------|--------|------|
| Language Modeling | C4 Qwen2.5 (val)       | Perplexity (PPL) | 17.55  | 27   |
| Language Modeling | C4 LLaMA-130M (val)    | Perplexity (PPL) | 23.73  | 27   |
| Language Modeling | C4 LLaMA-60M (val)     | Perplexity (PPL) | 29.63  | 12   |
| Language Modeling | C4 LLaMA-350M (val)    | Perplexity (PPL) | 17.83  | 12   |
| Language Modeling | C4 LLaMA-1.3B (val)    | Perplexity (PPL) | 15.1   | 12   |
