ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and Uncertainty in Zeroth-order Optimization

About

Lowering the memory requirement in full-parameter training on large models has become a hot research area. MeZO fine-tunes the large language models (LLMs) by just forward passes in a zeroth-order SGD optimizer (ZO-SGD), demonstrating excellent performance with the same GPU memory usage as inference. However, the simulated perturbation stochastic approximation for gradient estimate in MeZO leads to severe oscillations and incurs a substantial time overhead. Moreover, without momentum regularization, MeZO shows severe over-fitting problems. Lastly, the perturbation-irrelevant momentum on ZO-SGD does not improve the convergence rate. This study proposes ZO-AdaMU to resolve the above problems by adapting the simulated perturbation with momentum in its stochastic approximation. Unlike existing adaptive momentum methods, we relocate momentum on simulated perturbation in stochastic gradient approximation. Our convergence analysis and experiments prove this is a better way to improve convergence stability and rate in ZO-SGD. Extensive experiments demonstrate that ZO-AdaMU yields better generalization for LLMs fine-tuning across various NLP tasks than MeZO and its momentum variants.

Shuoran Jiang, Qingcai Chen, Youchen Pan, Yang Xiang, Yukang Lin, Xiangping Wu, Chuanyi Liu, Xiaobao Song• 2023

Related benchmarks

Task	Dataset	Result
Common Sense Reasoning	COPA	Accuracy89	256
Text Classification	BoolQ	Accuracy74.9	118
Text Classification	RTE	Accuracy72.9	104
Classification	SST2	Accuracy92.1	102
Classification	CB	Accuracy72.3	70
Classification	WSC	Accuracy61.5	59
Generation	SQuAD	F1 Score85.2	52
Word-in-Context Classification	WiC	Accuracy60.7	52
Sentence Completion	COPA	Accuracy89	48
Generation	DROP	F1 Score32.4	43

Showing 10 of 19 rows

Other info

Follow for update

@wizwand_team Discord