Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures

About

Parameter-efficient fine-tuning (PEFT) significantly reduces memory costs when adapting large language models (LLMs) for downstream applications. However, traditional first-order (FO) fine-tuning algorithms incur substantial memory overhead due to the need to store activation values for back-propagation during gradient computation, particularly in long-context fine-tuning tasks. Zeroth-order (ZO) algorithms offer a promising alternative by approximating gradients using finite differences of function values, thus eliminating the need for activation storage. Nevertheless, existing ZO methods struggle to capture the low-rank gradient structure common in LLM fine-tuning, leading to suboptimal performance. This paper proposes a low-rank ZO gradient estimator and introduces a novel low-rank ZO algorithm (LOZO) that effectively captures this structure in LLMs. We provide convergence guarantees for LOZO by framing it as a subspace optimization method. Additionally, its low-rank nature enables LOZO to integrate with momentum techniques while incurring negligible extra memory costs. Extensive experiments across various model sizes and downstream tasks demonstrate that LOZO and its momentum-based variant outperform existing ZO methods and closely approach the performance of FO algorithms.
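As a rough illustration of the idea in the abstract, the sketch below contrasts a standard two-point finite-difference ZO gradient estimate with a low-rank variant. It is a minimal sketch under stated assumptions, not the paper's LOZO algorithm: the function names, the Gaussian factors `U` and `V`, the `1/rank` scaling, and the step size `eps` are all hypothetical choices made for this example, and the abstract does not specify the exact estimator.

```python
import numpy as np


def zo_grad_dense(f, W, eps=1e-3, rng=None):
    """Two-point ZO estimate using a dense Gaussian perturbation Z.

    Only function values of f are needed, so no activations have to be
    stored for back-propagation.
    """
    rng = np.random.default_rng() if rng is None else rng
    Z = rng.standard_normal(W.shape)
    # Central finite difference along the random direction Z.
    slope = (f(W + eps * Z) - f(W - eps * Z)) / (2 * eps)
    return slope * Z


def zo_grad_low_rank(f, W, rank=2, eps=1e-3, rng=None):
    """Two-point ZO estimate whose perturbation is Z = U @ V.T.

    Assumption for this sketch: U and V have i.i.d. standard normal
    entries. Then E[<G, Z> Z] = rank * G, hence the division by rank.
    The returned matrix has rank at most `rank`.
    """
    rng = np.random.default_rng() if rng is None else rng
    m, n = W.shape
    U = rng.standard_normal((m, rank))
    V = rng.standard_normal((n, rank))
    Z = U @ V.T
    slope = (f(W + eps * Z) - f(W - eps * Z)) / (2 * eps)
    return slope * Z / rank
```

In this sketch each low-rank estimate is fully described by the scalar `slope` and the factors `U` and `V`, which is `(m + n) * rank` numbers rather than `m * n`. That is one plausible reading of why a low-rank estimator can be combined with momentum at little extra memory cost.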

Yiming Chen, Yuan Zhang, Liyuan Cao, Kun Yuan, Zaiwen Wen • 2024

Related benchmarks

Task	Dataset	Result
Natural Language Inference	RTE	Accuracy 78.7	590
Image Classification	CIFAR-100	Accuracy 61.8	302
Question Classification	TREC	Accuracy 89.8	262
Common Sense Reasoning	COPA	Accuracy 91	256
Natural Language Inference	SNLI	Accuracy 82.5	196
Sentiment Analysis	SST-5	Accuracy 50.4	123
Text Classification	BoolQ	Accuracy 68.1	118
Natural Language Understanding	SuperGLUE	--	84
Classification	CB	Accuracy 69.6	70
Natural Language Understanding	GLUE and SuperGLUE (test val)	SST-2 86.6	37

Showing 10 of 37 rows
