BitDelta: Your Fine-Tune May Only Be Worth One Bit
About
Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it is intuitive to assume that fine-tuning adds less new information to the model and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This finding not only highlights the potential redundancy of the information added during fine-tuning, but also has significant implications for multi-tenant serving and storage of fine-tuned models. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta reduces GPU memory requirements by more than 10x, which also translates to improved generation latency in multi-tenant settings. We validate BitDelta through experiments across the Llama-2 and Mistral model families, on models up to 70B parameters, showing minimal performance degradation across all tested settings.
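The core idea of representing a fine-tune as a base model plus a 1-bit delta can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the per-matrix scale (the mean absolute delta) is one natural choice of scaling factor.

```python
import numpy as np

def compress_delta(w_base, w_fine):
    """Quantize the fine-tuning delta to a sign matrix (1 bit per weight)
    plus a single high-precision scale per weight matrix."""
    delta = w_fine - w_base
    sign = np.sign(delta)          # +/-1 per parameter: 1 bit each
    sign[sign == 0] = 1.0          # break ties arbitrarily
    scale = np.abs(delta).mean()   # assumed per-matrix scaling factor
    return sign, scale

def reconstruct(w_base, sign, scale):
    """Approximate the fine-tuned weights from base + 1-bit delta."""
    return w_base + scale * sign

# Toy example: a small fine-tuning perturbation on a random base matrix.
rng = np.random.default_rng(0)
w_base = rng.standard_normal((4, 4)).astype(np.float32)
w_fine = w_base + 0.01 * rng.standard_normal((4, 4)).astype(np.float32)

sign, scale = compress_delta(w_base, w_fine)
w_hat = reconstruct(w_base, sign, scale)
```

In a multi-tenant setting, one copy of `w_base` is kept in high precision while each tenant stores only its `(sign, scale)` pair, which is where the >10x memory reduction comes from.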
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 59.88 | 1891 |
| Code Generation | HumanEval | Pass@1 | 83.5 | 1036 |
| Mathematical Reasoning | GSM8K (test) | Accuracy | 61.11 | 900 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 81.22 | 572 |
| Code Generation | HumanEval (test) | Pass@1 | 51.83 | 506 |
| Visual Question Answering | GQA | Accuracy | 61.2 | 505 |
| Multi-turn Dialogue Evaluation | MT-Bench | Overall Score | 7.2 | 447 |
| Mathematical Reasoning | MATH (test) | Overall Accuracy | 12.12 | 433 |
| Question Answering | ARC-E | Accuracy | 84.13 | 416 |
| Commonsense Reasoning | WinoGrande | Accuracy | 76.09 | 372 |