TileQ: Efficient Low-Rank Quantization of Mixture-of-Experts with 2D Tiling

About

Mixture-of-Experts (MoE) models achieve remarkable performance by sparsely activating specialized experts, yet the massive parameter count of those experts poses significant challenges for deployment. While low-rank quantization offers a promising route to compressing MoE models, existing methods still incur non-negligible memory overhead and inference latency. To address these limitations, we propose TileQ, a fine-tuning-free post-training quantization (PTQ) method that employs 2D-tiling structured low-rank quantization to share low-rank factors across both the input and output dimensions of MoE experts. Furthermore, we introduce an efficient inference technique for TileQ that fuses multiple low-rank expert computations into a single-pass operation, significantly improving hardware utilization. Experiments show that TileQ cuts additional memory usage by up to 10× and reduces inference latency to ~5% while preserving state-of-the-art accuracy.
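To make the memory argument concrete, the following is a rough parameter-count sketch, not the TileQ algorithm itself. It assumes one plausible reading of "sharing low-rank factors across both input and output dimensions": experts are arranged as tiles in a grid, and a single low-rank factor pair spans the whole grid, so tiles in the same row share a slice of the left factor and tiles in the same column share a slice of the right factor. The function names, the grid layout, and all sizes below are hypothetical and chosen only for illustration.

```python
def per_expert_overhead(num_experts: int, d_in: int, d_out: int, rank: int) -> int:
    """Extra parameters when every expert stores its own rank-`rank` factor pair.

    Each expert keeps a (d_out x rank) and a (rank x d_in) matrix.
    """
    return num_experts * rank * (d_in + d_out)


def tiled_overhead(grid_rows: int, grid_cols: int, d_in: int, d_out: int, rank: int) -> int:
    """Extra parameters under the ASSUMED 2D-tiled layout.

    Experts form a grid_rows x grid_cols grid of (d_out x d_in) tiles, and one
    factor pair covers the whole grid:
      left factor:  (grid_rows * d_out) x rank
      right factor: rank x (grid_cols * d_in)
    """
    return rank * (grid_rows * d_out + grid_cols * d_in)


if __name__ == "__main__":
    # Illustrative sizes only; not taken from the paper.
    experts, d_in, d_out, rank = 64, 4096, 14336, 32
    separate = per_expert_overhead(experts, d_in, d_out, rank)
    shared = tiled_overhead(8, 8, d_in, d_out, rank)  # 64 experts as an 8 x 8 grid
    print(f"per-expert factors: {separate:,} parameters")
    print(f"shared 2D-tiled factors: {shared:,} parameters")
    print(f"reduction: {separate / shared:.1f}x")
```

Under these assumptions the shared layout stores 8× fewer low-rank parameters than per-expert factors at the same rank. The comparison holds the rank fixed, so it says nothing about accuracy: a shared factor pair has less capacity per expert, and how TileQ balances that trade-off is not described in the abstract.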

Hongyaoxing Gu, Xinzhe Chen, Lijuan Hu, Fangfang Liu • 2026

Related benchmarks

Task	Dataset	Result
Language Modeling	WikiText-2	Perplexity (PPL) 4.12	2862
Language Modeling	WikiText-2 (test)	PPL 4.1	2416
Question Answering	ARC Challenge	Accuracy (ARC) 61.4	631
Commonsense Reasoning	HellaSwag	HellaSwag Score 86	53
Commonsense Reasoning	PIQA	PIQA PQ 83.8	4
Multi-task Language Understanding	MMLU	MMLU Accuracy 69.5	4
