Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures
About
Parameter-efficient fine-tuning (PEFT) significantly reduces memory costs when adapting large language models (LLMs) for downstream applications. However, traditional first-order (FO) fine-tuning algorithms incur substantial memory overhead due to the need to store activation values for back-propagation during gradient computation, particularly in long-context fine-tuning tasks. Zeroth-order (ZO) algorithms offer a promising alternative by approximating gradients using finite differences of function values, thus eliminating the need for activation storage. Nevertheless, existing ZO methods struggle to capture the low-rank gradient structure common in LLM fine-tuning, leading to suboptimal performance. This paper proposes a low-rank ZO gradient estimator and introduces a novel low-rank ZO algorithm (LOZO) that effectively captures this structure in LLMs. We provide convergence guarantees for LOZO by framing it as a subspace optimization method. Additionally, its low-rank nature enables LOZO to integrate with momentum techniques while incurring negligible extra memory costs. Extensive experiments across various model sizes and downstream tasks demonstrate that LOZO and its momentum-based variant outperform existing ZO methods and closely approach the performance of FO algorithms.
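The finite-difference idea described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names (`lowrank_zo_grad`, `zo_sgd`), the Gaussian factors `U` and `V`, the `1/rank` scaling, and all hyperparameter values are assumptions made for illustration. The estimator perturbs a weight matrix along a random low-rank direction `U @ V.T`, evaluates the loss twice, and scales that direction by the measured slope, so no activations need to be stored for back-propagation.

```python
import numpy as np


def lowrank_zo_grad(loss_fn, W, rank, eps, rng):
    """Two-point zeroth-order gradient estimate along a random low-rank direction."""
    m, n = W.shape
    # Assumed choice: i.i.d. Gaussian factors U (m x rank) and V (n x rank).
    U = rng.standard_normal((m, rank))
    V = rng.standard_normal((n, rank))
    Z = U @ V.T  # perturbation direction with rank at most `rank`
    # Central finite difference: two loss evaluations, no backward pass.
    slope = (loss_fn(W + eps * Z) - loss_fn(W - eps * Z)) / (2.0 * eps)
    # Assumed 1/rank scaling: for these factors E[<G, Z> Z] = rank * G.
    return slope * Z / rank


def zo_sgd(loss_fn, W0, rank=2, eps=1e-3, lr=0.01, steps=2000, seed=0):
    """Plain SGD driven by the low-rank zeroth-order estimate (illustrative only)."""
    rng = np.random.default_rng(seed)
    W = W0.copy()
    for _ in range(steps):
        W -= lr * lowrank_zo_grad(loss_fn, W, rank, eps, rng)
    return W
```

The sketch materializes `Z` for readability. A memory-conscious variant would presumably keep only the factors `U` and `V`, which is also what would make a momentum buffer cheap to store in factored form.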
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | CIFAR-100 | Accuracy: 61.8 | 302 |
| Natural Language Understanding | SuperGLUE | -- | 84 |
| Natural Language Understanding | GLUE and SuperGLUE (test val) | SST-2: 86.6 | 37 |
| Natural Language Understanding | SuperGLUE | SST-2 Accuracy: 92.5 | 18 |
| Natural Language Understanding | GLUE & SuperGLUE (test) | RTE Accuracy: 69.7 | 17 |
| Question Answering | SQuAD v1.1 v2.0 (test dev) | F1 Score: 77.3 | 8 |
| Sentiment Analysis | SST-2 GLUE (test dev) | Accuracy: 92.2 | 8 |
| Natural Language Inference | RTE GLUE (test dev) | Accuracy: 56.3 | 8 |
| Natural Language Inference | CB SuperGLUE (test dev) | Accuracy: 57.1 | 8 |
| Question Answering | BoolQ SuperGLUE (test dev) | Accuracy: 65 | 8 |