
LoRA Learns Less and Forgets Less

About

Low-Rank Adaptation (LoRA) is a widely used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low-rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning (approximately 100K prompt-response pairs) and continued pretraining (20B unstructured tokens) data regimes. Our results show that, in the standard low-rank settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA better maintains the base model's performance on tasks outside the target domain. We show that LoRA mitigates forgetting more than common regularization techniques such as weight decay and dropout; it also helps maintain more diverse generations. Finally, we show that full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.

Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, John P. Cunningham• 2024
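The abstract's core idea, training only a low-rank perturbation to a frozen weight matrix, can be illustrated with a minimal numpy sketch. The dimensions, rank, and scaling factor below are illustrative assumptions, not the paper's exact configuration; the parameterization W + (alpha/r) * B @ A follows the standard LoRA formulation.

```python
import numpy as np

# Full finetuning updates the entire weight matrix W (d x k).
# LoRA instead trains a rank-r perturbation B @ A with r << min(d, k),
# so the effective weight is W + (alpha / r) * B @ A.
# Dimensions and alpha here are illustrative assumptions.
d, k, r, alpha = 4096, 4096, 16, 32
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable (r x k)
B = np.zeros((d, r))                     # trainable (d x r), zero-initialized

W_eff = W + (alpha / r) * (B @ A)        # equals W at init, since B = 0

full_params = d * k                      # parameters full finetuning updates
lora_params = r * (d + k)                # parameters LoRA trains
print(full_params, lora_params, full_params / lora_params)
# → 16777216 131072 128.0
```

Because the perturbation B @ A has rank at most r, LoRA can only express low-rank changes to W; the paper's finding that full finetuning learns perturbations of 10-100X higher rank suggests one reason for the performance gap at standard ranks.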

Related benchmarks

Task                           | Dataset       | Metric                     | Result | Rank
Commonsense Reasoning          | HellaSwag     | -                          | -      | 1891
Commonsense Reasoning          | WinoGrande    | -                          | -      | 1085
Physical Commonsense Reasoning | PIQA          | -                          | -      | 572
Sentence Completion            | HellaSwag     | -                          | -      | 276
Mathematical Reasoning         | GSM8K         | Math Score                 | 50.9   | 197
Language Modeling              | PG-19         | -                          | -      | 160
Question Answering             | OpenBookQA    | Normalized Accuracy        | -2.2   | 102
Question Answering             | ARC-C         | -                          | -      | 87
Code Generation                | MBPP          | MBPP Score                 | 47.7   | 35
Language Modeling              | Medical (Med) | PPL Change (%) vs Baseline | 0.1    | 30
Showing 10 of 27 rows
