
TileQ: Efficient Low-Rank Quantization of Mixture-of-Experts with 2D Tiling

About

Mixture-of-Experts (MoE) models achieve remarkable performance by sparsely activating specialized experts, yet the large number of parameters held in those experts poses significant challenges for deployment. While low-rank quantization offers a promising route to compressing MoE models, existing methods still incur non-negligible memory overhead and inference latency. To address these limitations, we propose TileQ, a fine-tuning-free post-training quantization (PTQ) method that employs 2D-tiling structured low-rank quantization to share low-rank factors across both the input and output dimensions of MoE experts. Furthermore, we introduce an efficient inference technique for TileQ that fuses multiple low-rank expert computations into a single-pass operation, significantly improving hardware utilization. Experiments show that TileQ reduces the additional memory usage by up to 10× and reduces inference latency to ~5% while preserving state-of-the-art accuracy.
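As background for the abstract above, the following is a minimal sketch of generic low-rank quantization on a single weight matrix: the weights are quantized, and a rank-r correction fitted to the quantization error is stored alongside them. It is not the TileQ algorithm; the 2D-tiling factor sharing and the fused inference kernel are not reproduced here. The function names, the round-to-nearest quantizer, the matrix shape, the bit width, and the rank are all assumptions chosen for illustration.

```python
import numpy as np


def quantize_rtn(w, bits=4):
    """Symmetric per-row round-to-nearest quantization (assumed quantizer).

    Returns the dequantized weights so the error can be measured directly.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def low_rank_residual(w, w_q, rank):
    """Best rank-`rank` approximation of the error w - w_q via truncated SVD."""
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # shape (out_features, rank)
    b = vt[:rank]               # shape (rank, in_features)
    return a, b


# Hypothetical stand-in for one expert's weight matrix.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128))

w_q = quantize_rtn(w, bits=3)
a, b = low_rank_residual(w, w_q, rank=8)

err_q = np.linalg.norm(w - w_q)             # quantization only
err_lr = np.linalg.norm(w - (w_q + a @ b))  # quantization + low-rank term

# Extra values stored for this one matrix: rank * (out_features + in_features).
extra_params = a.size + b.size
print(err_q, err_lr, extra_params)
```

In this sketch each matrix carries its own pair of factors, so the extra storage grows linearly with the number of experts. That per-expert cost is the kind of additional memory the abstract says is reduced by sharing factors across experts.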

Hongyaoxing Gu, Xinzhe Chen, Lijuan Hu, Fangfang Liu • 2026

Related benchmarks

Task                                 Dataset              Result                   Rank
Language Modeling                    WikiText-2 (test)    PPL 4.1                  2333
Language Modeling                    WikiText-2           Perplexity (PPL) 4.12    2320
Question Answering                   ARC Challenge        Accuracy (ARC) 61.4      598
Commonsense Reasoning                HellaSwag            HellaSwag Score 86       53
Commonsense Reasoning                PIQA                 PIQA PQ 83.8             4
Multi-task Language Understanding    MMLU                 MMLU Accuracy 69.5       4