GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs
About
Mixture-of-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead because of their large number of expert parameters. Mixed-precision quantization mitigates this cost by assigning each expert a bit-width according to its importance, which moves the model toward the accuracy-memory Pareto frontier and enables extreme low-bit quantization. However, existing methods rely on layer-wise importance estimation and overlook the router shifts that quantization induces, resulting in suboptimal bit allocation and routing. In this work, we propose Global Expert-level Mixed-precision Quantization (GEMQ), which overcomes these limitations via (1) a global linear-programming formulation that captures model-wide expert importance based on quantization error analysis, and (2) efficient router fine-tuning that adapts routing to the quantized experts. These components are integrated into a progressive quantization framework that iteratively refines importance estimation and bit allocation. Experiments demonstrate that GEMQ significantly reduces memory and accelerates inference with minimal accuracy degradation. Source code is available at https://github.com/jndeng/GEMQ.
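To make the idea of a *global* (model-wide rather than per-layer) bit allocation concrete, here is a minimal, hypothetical sketch. It is not the GEMQ implementation. It assumes each expert already has a scalar importance score and a parameter count, and it assumes a simple error model in which quantizing an expert to `b` bits costs `importance * 4**(-b)` (the usual uniform-quantizer noise-variance scaling). Instead of an LP solver, the integer version of the problem (pick one bit-width per expert, minimize total error, stay within a global bit budget) is solved exactly as a multiple-choice knapsack with dynamic programming. The function name `allocate_bits` and all parameters are illustrative.

```python
"""Hypothetical sketch of global expert-level bit-width allocation.

Assumptions (not taken from GEMQ): importance scores are given, and the
error of quantizing an expert to b bits is modeled as importance * 4**(-b).
The integer program is solved exactly as a multiple-choice knapsack via DP,
standing in for the paper's linear-programming formulation.
"""


def allocate_bits(importance, num_params, bit_choices, budget_bits):
    """Pick one bit-width per expert, minimizing the modeled total error.

    Experts from all layers are pooled into one problem, so the allocation
    is global rather than layer-wise. num_params must be integers (e.g.
    parameter counts in some fixed unit) so that the DP budget is integral.
    """
    # dp maps "total bits used so far" -> (lowest total error, bit choices)
    dp = {0: (0.0, [])}
    for w, n in zip(importance, num_params):
        nxt = {}
        for used, (err, choice) in dp.items():
            for b in bit_choices:
                cost = used + n * b
                if cost > budget_bits:
                    continue  # this choice would exceed the global budget
                cand = (err + w * 4.0 ** (-b), choice + [b])
                # Keep only the best partial solution for each budget level.
                if cost not in nxt or cand[0] < nxt[cost][0]:
                    nxt[cost] = cand
        if not nxt:
            raise ValueError("budget too small even for the lowest bit-width")
        dp = nxt
    best_err, best_choice = min(dp.values(), key=lambda t: t[0])
    return best_choice, best_err


if __name__ == "__main__":
    # Four equally sized experts, average budget of 3 bits per parameter.
    bits, err = allocate_bits([8.0, 1.0, 5.0, 0.5], [1, 1, 1, 1], (2, 3, 4), 12)
    print(bits, err)  # [4, 2, 4, 2] 0.14453125
```

With an average budget of 3 bits, the sketch gives 4 bits to the two most important experts and 2 bits to the two least important ones. This kind of trade-off is what a uniform 3-bit assignment (modeled error 0.2109375 here) cannot make, and it is only visible when experts across the whole model compete for the same budget.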
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText2 | Perplexity | 4.37 | 3785 |
| Language Modeling | WikiText-2 (test) | Perplexity (PPL) | 4.37 | 2333 |
| Language Modeling | C4 (test) | Perplexity | 8.06 | 464 |
| Multi-task Language Understanding | MMLU | MMLU Accuracy | 64.63 | 442 |
| Mathematical Reasoning | MathQA | Accuracy | 38.63 | 354 |
| Question Answering | BoolQ | Accuracy | 85.02 | 201 |
| Language Understanding | MMLU 5-shot | -- | -- | 153 |
| Zero-shot Reasoning | ZeroShot 7 | Accuracy | 65.69 | 56 |
| Commonsense Reasoning | Reasoning Suite Zero-shot Aggregate | Aggregate Score | 59.49 | 50 |
| Zero-shot Reasoning and General Knowledge | Evaluation Suite Zero-shot (PIQA, ARC-Easy, ARC-Challenge, HellaSwag, WinoGrande, MathQA, MMLU) | PIQA (PQ) Accuracy | 81.07 | 40 |