GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

About

Mixture-of-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed-precision quantization mitigates this cost by allocating expert-wise bit-widths based on their importance, approaching the accuracy-memory Pareto frontier and enabling extreme low-bit quantization. However, existing methods rely on layer-wise importance estimation and overlook router shifts induced by quantization, resulting in suboptimal allocation and routing. In this work, we propose Global Expert-level Mixed-precision Quantization (GEMQ) to overcome these limitations via (1) a global linear-programming formulation that captures model-wide expert importance based on quantization error analysis, and (2) efficient router fine-tuning to adapt routing to quantized experts. These components are integrated into a progressive quantization framework that iteratively refines importance estimation and allocation. Experiments demonstrate that GEMQ significantly reduces memory and accelerates inference with minimal accuracy degradation. Source code is available at https://github.com/jndeng/GEMQ .
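
The idea of allocating bit-widths globally across all experts, rather than layer by layer, can be illustrated with a small sketch. This is not the authors' formulation: the abstract describes a linear program, whereas the code below substitutes a simple greedy heuristic so the example stays dependency-free. The function name `allocate_bits`, the per-expert error table, the candidate bit-widths, and the average-bit budget are all hypothetical and chosen only for illustration.

```python
def allocate_bits(errors, sizes, bit_choices, avg_bits):
    """Greedy stand-in for a global bit-width allocation (not the paper's LP).

    errors[e][b] : assumed quantization error of expert e at bit-width b
    sizes[e]     : parameter count of expert e
    bit_choices  : candidate bit-widths
    avg_bits     : target average bits per parameter over all experts
    """
    choices = sorted(bit_choices)
    # Start every expert at the lowest precision.
    alloc = {e: choices[0] for e in errors}
    budget = avg_bits * sum(sizes.values())
    used = sum(alloc[e] * sizes[e] for e in alloc)

    while True:
        best = None
        for e, b in alloc.items():
            i = choices.index(b)
            if i + 1 == len(choices):
                continue  # already at the highest precision
            nb = choices[i + 1]
            cost = (nb - b) * sizes[e]
            if used + cost > budget:
                continue  # this upgrade would exceed the memory budget
            # Error reduction per extra bit of memory.
            gain = (errors[e][b] - errors[e][nb]) / cost
            if best is None or gain > best[0]:
                best = (gain, e, nb, cost)
        if best is None or best[0] <= 0:
            break
        _, e, nb, cost = best
        alloc[e] = nb
        used += cost
    return alloc


# Hypothetical experts keyed by (layer, expert index); errors are made up.
errors = {
    (0, 0): {2: 9.0, 3: 4.0, 4: 1.0},
    (0, 1): {2: 1.0, 3: 0.5, 4: 0.2},
    (1, 0): {2: 6.0, 3: 2.0, 4: 0.5},
    (1, 1): {2: 0.8, 3: 0.4, 4: 0.1},
}
sizes = {e: 1 for e in errors}
alloc = allocate_bits(errors, sizes, bit_choices=[2, 3, 4], avg_bits=3)
print(alloc)
```

Because all experts compete for one shared budget, the two error-sensitive experts in this toy input receive 4 bits while the other two stay at 2 bits, regardless of which layer they belong to. A layer-wise scheme with the same per-layer budget would reach the same split here only by coincidence.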

Jianing Deng, Song Wang, Dongwei Wang, Zijie Liu, Tianlong Chen, Huanrui Yang, Jingtong Hu • 2026

Related benchmarks

Task	Dataset	Result
Language Modeling	WikiText2	Perplexity 4.37	4085
Language Modeling	WikiText-2 (test)	PPL 4.37	2416
Language Modeling	C4 (test)	Perplexity 8.06	488
Multi-task Language Understanding	MMLU	MMLU Accuracy 64.63	456
Mathematical Reasoning	MathQA	Accuracy 38.63	354
Question Answering	BoolQ	Accuracy 85.02	233
Language Understanding	MMLU 5-shot	--	160
Zero-shot Reasoning	ZeroShot 7	Accuracy 65.69	56
Commonsense Reasoning	Reasoning Suite Zero-shot Aggregate	Aggregate Score 59.49	50
Zero-shot Reasoning and General Knowledge	Evaluation Suite Zero-shot (PIQA, ARC-Easy, ARC-Challenge, HellaSwag, WinoGrande, MathQA, MMLU)	PIQA (PQ) Accuracy 81.07	40

Showing 10 of 13 rows
