On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs
About
As increasingly large pre-trained models are released, deploying them on edge devices for privacy-preserving applications requires effective compression. Recent works combine quantization with the fine-tuning of high-precision LoRA adapters, which can substantially reduce model size while mitigating the accuracy loss from quantization. However, edge devices are inherently heterogeneous in their capabilities, and performing per-configuration fine-tuning for every quantization setting is computationally prohibitive. In this paper, we propose CoA-LoRA, a method that dynamically adapts the LoRA adapter to an arbitrary quantization configuration (i.e., the per-layer bit-width choices of a pre-trained model) without requiring repeated fine-tuning. This is accomplished via a configuration-aware model that maps each configuration to its low-rank adjustments. The effectiveness of this model depends critically on the training configuration set, a collection of configurations chosen to cover different total bit-width budgets. Because constructing a high-quality configuration set is non-trivial, we design a Pareto-based configuration search that iteratively optimizes the training configuration set, yielding more precise low-rank adjustments. Our experiments demonstrate that, unlike state-of-the-art methods that fine-tune a separate LoRA adapter for each configuration, CoA-LoRA incurs no additional time cost while achieving comparable or even superior performance to those methods.
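To make the mechanism concrete, below is a minimal PyTorch sketch of a configuration-aware model in the spirit described above: a small hypernetwork that takes a per-layer bit-width vector and emits per-layer low-rank (LoRA-style) adjustments. The abstract does not specify the paper's actual architecture, so every name (`ConfigAwareHyperNet`, `bitwidths`) and design choice here is an illustrative assumption, not the authors' implementation.

```python
# Hedged sketch: a hypernetwork mapping a quantization configuration
# (per-layer bit-widths) to per-layer low-rank adjustments.
# All names and shapes are assumptions for illustration only.
import torch
import torch.nn as nn


class ConfigAwareHyperNet(nn.Module):
    """Maps a bit-width vector to one (A, B) low-rank pair per layer."""

    def __init__(self, num_layers: int, d_model: int, rank: int, hidden: int = 256):
        super().__init__()
        self.d_model = d_model
        self.rank = rank
        # Encode the whole configuration into a shared embedding.
        self.encoder = nn.Sequential(
            nn.Linear(num_layers, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        # One head per layer emits the flattened low-rank factors A and B.
        self.heads = nn.ModuleList(
            nn.Linear(hidden, 2 * d_model * rank) for _ in range(num_layers)
        )

    def forward(self, bitwidths: torch.Tensor):
        """bitwidths: (num_layers,) float tensor, e.g. [4., 4., 8., ...]."""
        h = self.encoder(bitwidths / 16.0)  # normalize bits to roughly [0, 1]
        adjustments = []
        for head in self.heads:
            a_flat, b_flat = head(h).chunk(2)
            adjustments.append((
                a_flat.view(self.rank, self.d_model),  # LoRA down-projection
                b_flat.view(self.d_model, self.rank),  # LoRA up-projection
            ))
        return adjustments  # one (A, B) pair per quantized layer


# Example: a 32-layer model with (hypothetical) mixed 4-/8-bit layers.
net = ConfigAwareHyperNet(num_layers=32, d_model=512, rank=8)
config = torch.tensor([4.0] * 16 + [8.0] * 16)
lora_deltas = net(config)  # low-rank adjustments for this configuration
```

Likewise, the Pareto-based configuration search can be pictured as scoring candidate configurations on two competing objectives, total bit-width versus task loss, and keeping only the non-dominated set. The abstract does not detail the full search loop, so the sketch below shows only that filtering step under assumed objectives.

```python
# Hedged sketch of the Pareto filtering step in a configuration search.
# A candidate is kept unless another candidate is at least as good on both
# objectives (bits and loss) and strictly better on one.
def pareto_front(candidates):
    """candidates: list of (config, total_bits, loss) tuples."""
    front = []
    for cfg, bits, loss in candidates:
        dominated = any(
            b <= bits and l <= loss and (b < bits or l < loss)
            for _, b, l in candidates
        )
        if not dominated:
            front.append((cfg, bits, loss))
    return front
```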
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Question Answering | ARC Easy | -- | -- | 597 |
| Natural Language Inference | RTE | Accuracy | 62.5 | 448 |
| Commonsense Reasoning | WinoGrande | Accuracy | 67.97 | 372 |
| Question Answering | ARC Challenge | Accuracy (ARC) | 41.41 | 142 |
| Natural Language Inference | aNLI | Accuracy | 38.28 | 65 |
| Reading Comprehension | BoolQ | Accuracy (BoolQ) | 78.91 | 55 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 76.56 | 45 |
| Language Modeling | LLaMA-2 13B | Perplexity (PPL) | 6.99 | 32 |
| Aggregated Downstream Evaluation | ANLI, BoolQ, Winogrande, RTE, PiQA, ARC-Easy, ARC-Challenge | Average Accuracy | 61.94 | 8 |
| Language Modeling | Qwen2.5-1.5B | HV | 47.9 | 5 |