Make Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning

About

Parameter-efficient fine-tuning (PEFT) of pre-trained language models (PLMs) has emerged as a highly successful approach: it trains only a small number of parameters without sacrificing performance, and it has become the de facto learning paradigm as PLMs grow in size. However, existing PEFT methods are not memory-efficient, because they still require caching most of the intermediate activations for the gradient calculation, akin to full fine-tuning. One effective way to reduce the activation memory is to apply a reversible model, so that the intermediate activations need not be cached and can instead be recomputed. Nevertheless, modifying a PLM into its reversible variant is not straightforward, since the reversible model has a distinct architecture from currently released PLMs. In this paper, we first investigate which factor is key to the success of existing PEFT methods, and find that it is essential to preserve the PLM's starting point when initializing a PEFT method. With this finding, we propose memory-efficient fine-tuning (MEFT), which inserts adapters into a PLM, preserving the PLM's starting point and making it reversible without additional pre-training. We evaluate MEFT on the GLUE benchmark and five question-answering tasks with various backbones: BERT, RoBERTa, BART and OPT. MEFT reduces the activation memory by up to 84% relative to full fine-tuning, with a negligible number of trainable parameters. Moreover, MEFT achieves the same score on GLUE and a comparable score on the question-answering tasks as full fine-tuning. A similar finding is also observed for the image classification task.
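
To illustrate why a reversible model does not need cached activations, here is a minimal sketch of a generic additive coupling block (in the style of reversible residual networks). It is not MEFT's actual architecture, which the abstract does not spell out. The sub-layers `f` and `g` and the helper names `forward` and `inverse` are hypothetical placeholders chosen for illustration.

```python
import math

def f(x):
    # Hypothetical sub-layer (stand-in for e.g. an attention block).
    return [math.tanh(v) for v in x]

def g(x):
    # Hypothetical sub-layer (stand-in for e.g. a feed-forward block).
    return [0.5 * v * v for v in x]

def forward(x1, x2):
    # Additive coupling: each output half depends on one input half
    # plus a function of the other half.
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def inverse(y1, y2):
    # Recover the inputs from the outputs alone, so the inputs
    # do not have to be stored during the forward pass.
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

x1, x2 = [0.1, -0.2], [0.3, 0.4]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
# r1 and r2 match x1 and x2 up to floating-point rounding.
```

During backpropagation, a layer built this way can call `inverse` to rebuild its inputs on the fly instead of keeping them in memory, trading extra computation for lower activation memory.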

Baohao Liao, Shaomu Tan, Christof Monz • 2023

Related benchmarks

Task	Dataset	Result
Question Answering	ARC Challenge	Accuracy 34.1	906
Question Answering	OpenBookQA	Accuracy 37	465
Natural Language Understanding	GLUE (test)	SST-2 Accuracy 96.8	416
Physical Interaction Question Answering	PIQA	Accuracy 77.4	415
Question Answering	ARC Easy	Normalized Acc 65.7	391
Question Answering	SciQ	Accuracy 94.4	283
Natural Language Understanding	GLUE excluding STS-B	Average Score 88.4	4
