Fine-Tuning Language Models with Just Forward Passes
About
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction and up to 2x GPU-hour reduction in our implementation; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
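To make the "two forward passes, in-place" idea concrete, here is a minimal sketch of a ZO-SGD step with a two-point gradient estimate on a toy problem. It is not the authors' implementation: the function name `zo_sgd_step`, the hyperparameter values, the quadratic loss, and the choice to regenerate the Gaussian perturbation from a stored seed (rather than keeping it in memory) are all assumptions made for illustration, since the abstract does not spell out these details.

```python
import random


def zo_sgd_step(params, loss_fn, lr=0.01, eps=1e-3, seed=0):
    """One in-place ZO-SGD step (illustrative sketch, hypothetical helper).

    params  : flat list of floats, modified in place
    loss_fn : callable taking params and returning a float (one forward pass)
    """

    def perturb(scale):
        # Re-create the same Gaussian direction z from the seed each time,
        # so z never has to be stored alongside the parameters.
        rng = random.Random(seed)
        for i in range(len(params)):
            params[i] += scale * eps * rng.gauss(0.0, 1.0)

    perturb(+1.0)                   # params + eps * z
    loss_plus = loss_fn(params)     # forward pass 1
    perturb(-2.0)                   # params - eps * z
    loss_minus = loss_fn(params)    # forward pass 2
    perturb(+1.0)                   # restore the original params

    # Scalar estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)

    # SGD update along z, again regenerating z from the seed.
    rng = random.Random(seed)
    for i in range(len(params)):
        params[i] -= lr * projected_grad * rng.gauss(0.0, 1.0)
    return projected_grad


# Toy usage: minimize a 3-dimensional quadratic (assumed example problem).
target = [1.0, -2.0, 3.0]


def quadratic_loss(ps):
    return sum((p - t) ** 2 for p, t in zip(ps, target))


params = [0.0, 0.0, 0.0]
initial_loss = quadratic_loss(params)
for step in range(2000):
    zo_sgd_step(params, quadratic_loss, seed=step)  # fresh direction per step
final_loss = quadratic_loss(params)
```

The only state carried between the two forward passes is the seed and two scalar losses, which is why a step of this kind needs no activation or gradient buffers. A real LM would replace `quadratic_loss` with a forward pass over a minibatch and the flat list with the model's parameter tensors.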
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Modeling | WikiText-2 | -- | 841 |
| Natural Language Inference | SNLI (test) | Accuracy 50.2 | 681 |
| Physical Commonsense Reasoning | PIQA | Accuracy 84.3 | 329 |
| Image Classification | CIFAR-100 | Accuracy 64.5 | 302 |
| Question Answering | BoolQ | Accuracy 76.6 | 240 |
| Sentiment Classification | SST2 (test) | Accuracy 79 | 214 |
| Sentiment Analysis | SST-5 (test) | Accuracy 35.5 | 173 |
| Mathematical Reasoning | AQUA | Accuracy 24 | 132 |
| Question Classification | TREC (test) | Accuracy 32 | 124 |
| Text Classification | BoolQ | Accuracy 83.4 | 84 |