BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation
About
The upscaling of Large Language Models (LLMs) has yielded impressive advances in natural language processing, yet it also poses significant deployment challenges. Weight quantization has emerged as a widely embraced solution to reduce memory and computational demands. This paper introduces BitDistiller, a framework that synergizes Quantization-Aware Training (QAT) with Knowledge Distillation (KD) to boost the performance of LLMs at ultra-low precisions (sub-4-bit). Specifically, BitDistiller first incorporates a tailored asymmetric quantization and clipping technique to maximally preserve the fidelity of quantized weights, and then proposes a novel Confidence-Aware Kullback-Leibler Divergence (CAKLD) objective, which is employed in a self-distillation manner to enable faster convergence and superior model performance. Empirical evaluations demonstrate that BitDistiller significantly surpasses existing methods in both 3-bit and 2-bit configurations on general language understanding and complex reasoning benchmarks. Notably, BitDistiller is more cost-effective, demanding less data and fewer training resources. The code is available at https://github.com/DD-DuDa/BitDistiller.
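To make the two core ingredients concrete, the following is a minimal sketch of (a) asymmetric min-max quantization with clipping and (b) a confidence-weighted blend of forward and reverse KL in the spirit of CAKLD. The function names, the fixed `clip_ratio` parameter, and the scalar `gamma` coefficient are illustrative assumptions for this sketch; the paper searches/learns the clipping range and derives the blending coefficient from teacher confidence rather than taking them as fixed inputs.

```python
import numpy as np

def asym_quantize(w, n_bits=3, clip_ratio=1.0):
    """Asymmetric min-max quantize-dequantize with a simple clipping ratio.

    Sketch only: BitDistiller determines the clipping range adaptively,
    whereas here `clip_ratio` just shrinks the min/max range uniformly.
    """
    lo = w.min() * clip_ratio
    hi = w.max() * clip_ratio
    n_levels = 2 ** n_bits - 1
    scale = (hi - lo) / n_levels          # step size between quantized levels
    zero = np.round(-lo / scale)          # asymmetric zero-point offset
    q = np.clip(np.round(w / scale) + zero, 0, n_levels)
    return (q - zero) * scale             # dequantized (fake-quantized) weights

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cakld_loss(student_logits, teacher_logits, gamma):
    """Confidence-aware KL sketch: gamma blends forward and reverse KL.

    In the paper, the coefficient is estimated from the teacher's average
    token confidence; here it is passed in directly for illustration.
    """
    p = softmax(teacher_logits)           # teacher distribution
    q = softmax(student_logits)           # student distribution
    forward_kl = (p * (np.log(p) - np.log(q))).sum(-1).mean()
    reverse_kl = (q * (np.log(q) - np.log(p))).sum(-1).mean()
    return gamma * forward_kl + (1 - gamma) * reverse_kl
```

In self-distillation, the full-precision model serves as the teacher and its quantized counterpart as the student, so `cakld_loss` would be applied between their logits on the same inputs while `asym_quantize` fake-quantizes the student's weights during QAT.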
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText-2 | Perplexity | 5.97 | 1875 |
| Mathematical Reasoning | GSM8K | Accuracy | 51.02 | 983 |
| Code Generation | HumanEval | Pass@1 | 36.58 | 850 |
| Multi-task Language Understanding | MMLU | -- | -- | 842 |
| Code Generation | HumanEval @WizardCoder (test) | Pass@1 | 69.51 | 45 |
| Mathematical Reasoning | GSM8K @MetaMath (test) | Accuracy | 69.69 | 31 |
| Language Modeling | WikiText-2, Llama 2 & 3 (test) | PPL (Llama 2, Config 7) | 5.97 | 16 |
| General Language Understanding | General Language Tasks Suite (WikiText-2, MMLU, PIQA, HellaSwag, WinoGrande, ARC-Challenge), standard (various) | PPL | 5.2 | 13 |
| Language Understanding and Reasoning | MMLU, PIQA, HellaSwag, WinoGrande, ARC-Challenge | MMLU (5-shot) | 43.65 | 13 |
| LLM Quantization | Llama-2-70B | GPU Hours (h) | 64 | 13 |