BitDelta: Your Fine-Tune May Only Be Worth One Bit
About
Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it is intuitive to assume that fine-tuning adds less new information to the model and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This finding not only highlights the potential redundancy of information added during fine-tuning, but also has significant implications for the multi-tenant serving and storage of fine-tuned models. By enabling a single high-precision base model to be accompanied by multiple 1-bit deltas, BitDelta reduces GPU memory requirements by more than 10x, which also translates to improved generation latency in multi-tenant settings. We validate BitDelta through experiments across the Llama-2 and Mistral model families, on models of up to 70B parameters, showing minimal performance degradation across all tested settings.
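A minimal PyTorch sketch of the core idea follows, assuming per-matrix sign-plus-scale binarization of the weight delta. The helper names are hypothetical, and the paper additionally calibrates the scale factors via distillation, which is omitted here:

```python
import torch

def binarize_delta(w_base: torch.Tensor, w_ft: torch.Tensor):
    """Compress a fine-tuned weight matrix into a 1-bit delta (sketch).

    The delta W_ft - W_base is approximated as alpha * sign(delta), where the
    per-matrix scale alpha = mean(|delta|) minimizes the L2 error for a fixed
    sign pattern.
    """
    delta = w_ft - w_base
    alpha = delta.abs().mean()
    # Explicit +-1 mask so every entry fits in exactly one bit.
    sign = torch.where(delta >= 0, 1.0, -1.0)
    return alpha, sign.to(torch.int8)  # int8 stand-in; real code packs 8 signs per byte

def apply_delta(w_base: torch.Tensor, alpha: torch.Tensor, sign: torch.Tensor):
    # Serve-time reconstruction: one shared base matrix plus a scaled 1-bit delta.
    return w_base + alpha * sign.to(w_base.dtype)

# Example: two fine-tunes reconstructed from one shared base matrix.
w_base = torch.randn(4096, 4096)
w_ft = w_base + 0.01 * torch.randn(4096, 4096)
alpha, sign = binarize_delta(w_base, w_ft)
w_hat = apply_delta(w_base, alpha, sign)
print((w_hat - w_ft).abs().mean())  # small reconstruction error
```

In a real serving setup, the sign tensors would be bit-packed and the reconstruction fused into the matmul kernel, so that many fine-tuned deltas can share a single high-precision base model in GPU memory.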
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 59.88 | 1460 |
| Code Generation | HumanEval | Pass@1 | 83.5 | 850 |
| Mathematical Reasoning | GSM8K (test) | Accuracy | 61.11 | 797 |
| Code Generation | HumanEval (test) | Pass@1 | 51.83 | 444 |
| Mathematical Reasoning | MATH (test) | Overall Accuracy | 12.12 | 433 |
| Visual Question Answering | GQA | Accuracy | 61.2 | 374 |
| Multi-turn Dialogue Evaluation | MT-Bench | Overall Score | 7.2 | 331 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 81.22 | 329 |
| Code Generation | MBPP (test) | Pass@1 | 58.5 | 276 |
| Mathematical Reasoning | AIME 2024 | Accuracy | 23.3 | 251 |