BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation
About
The upscaling of Large Language Models (LLMs) has yielded impressive advances in natural language processing, yet it also poses significant deployment challenges. Weight quantization has emerged as a widely embraced solution to reduce memory and computational demands. This paper introduces BitDistiller, a framework that synergizes Quantization-Aware Training (QAT) with Knowledge Distillation (KD) to boost the performance of LLMs at ultra-low precisions (sub-4-bit). Specifically, BitDistiller first incorporates a tailored asymmetric quantization and clipping technique to maximally preserve the fidelity of quantized weights, and then proposes a novel Confidence-Aware Kullback-Leibler Divergence (CAKLD) objective, which is employed in a self-distillation manner to enable faster convergence and superior model performance. Empirical evaluations demonstrate that BitDistiller significantly surpasses existing methods in both 3-bit and 2-bit configurations on general language understanding and complex reasoning benchmarks. Notably, BitDistiller is shown to be more cost-effective, demanding less data and fewer training resources. The code is available at https://github.com/DD-DuDa/BitDistiller.
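The abstract describes CAKLD only at a high level. A minimal sketch of one plausible formulation, assuming CAKLD blends reverse KL (mode-seeking) and forward KL (mode-covering) with a coefficient derived from the teacher's confidence, might look like the following; the function names, the blend form, and the `estimate_gamma` helper are illustrative assumptions, not the paper's actual implementation:

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cakld_loss(teacher_probs, student_probs, gamma):
    """Confidence-aware blend (illustrative): gamma weights reverse KL
    (mode-seeking), while (1 - gamma) weights forward KL (mode-covering)."""
    forward = kl(teacher_probs, student_probs)   # KL(teacher || student)
    reverse = kl(student_probs, teacher_probs)   # KL(student || teacher)
    return gamma * reverse + (1.0 - gamma) * forward

def estimate_gamma(teacher_token_confidences):
    """Hypothetical helper: estimate gamma as the teacher's average token
    confidence over a calibration set, so a confident teacher pushes the
    student toward mode-seeking behavior."""
    return sum(teacher_token_confidences) / len(teacher_token_confidences)
```

In this sketch, a teacher that is highly confident on the data drives `gamma` toward 1, favoring reverse KL so the student concentrates on the teacher's dominant modes; a less confident teacher favors forward KL, spreading student mass across the teacher's full distribution.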
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText2 | Perplexity | 5.97 | 2839 |
| Mathematical Reasoning | GSM8K | Accuracy | 51.02 | 1362 |
| Language Modeling | C4 | Perplexity | 10.01 | 1071 |
| Code Generation | HumanEval | Pass@1 | 36.58 | 1036 |
| Multi-task Language Understanding | MMLU | -- | -- | 876 |
| Code Generation | HumanEval @WizardCoder (test) | Pass@1 | 69.51 | 45 |
| Mathematical Reasoning | GSM8K @MetaMath (test) | Accuracy | 69.69 | 31 |
| Language Modeling | LLaMA-2-7B | Perplexity | 8.08 | 18 |
| Language Modeling | WikiText-2, Llama 2 & 3 (test) | PPL (Llama 2, Config 7) | 5.97 | 16 |
| General Language Understanding | General Language Tasks Suite (WikiText-2, MMLU, PIQA, HellaSwag, WinoGrande, ARC-Challenge), standard (various) | PPL | 5.2 | 13 |