CodeQuant: Unified Clustering and Quantization for Enhanced Outlier Smoothing in Low-Precision Mixture-of-Experts
About
Outliers have emerged as a fundamental bottleneck in preserving accuracy for low-precision large models, particularly within Mixture-of-Experts (MoE) architectures that are increasingly central to large-scale language modeling. Under post-training quantization (PTQ), these outliers induce substantial quantization errors, leading to severe accuracy degradation. While recent rotation-based smoothing techniques alleviate the problem by redistributing outlier magnitudes, residual errors remain and continue to impede reliable low-precision deployment. In this work, we tackle this challenge by introducing *CodeQuant*, a unified quantization-and-clustering scheme for MoE that smooths activation outliers via learnable rotations and absorbs weight outliers into fine-tuned cluster centroids. This design reduces the influence of extreme values by fitting them within cluster centroids, thereby lowering quantization error while maintaining expressive capacity. Coupled with dedicated GPU and CPU kernel designs, CodeQuant achieves up to 4.15× speedup while delivering significantly higher accuracy than state-of-the-art quantization approaches across diverse MoE models. Our results highlight CodeQuant as a promising direction for efficient and accurate deployment of MoE-based large language models under low-precision constraints. Our code is available at https://github.com/SAI-Lab-NYU/CodeQuant.
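The rotation-based smoothing idea mentioned above can be illustrated with a minimal NumPy sketch (this is not CodeQuant's implementation, just the general principle it builds on): multiplying an activation vector by an orthogonal matrix redistributes a single large outlier across many channels, shrinking the dynamic range and thus the per-tensor quantization error, while the rotation can be folded into the adjacent weight matrix so the layer's output is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(v, bits=4):
    # symmetric per-tensor integer quantization (round-to-nearest)
    scale = np.abs(v).max() / (2 ** (bits - 1) - 1)
    return np.round(v / scale) * scale

# activation vector with one extreme outlier channel
x = rng.normal(0.0, 1.0, 256)
x[7] = 50.0

# random orthogonal rotation (QR of a Gaussian matrix);
# in practice a learnable or Hadamard-based rotation is used
Q, _ = np.linalg.qr(rng.normal(size=(256, 256)))

# quantization error without and with rotation
# (Q.T undoes the rotation, as a folded weight matrix W @ Q.T would)
err_plain = np.linalg.norm(quantize(x) - x)
err_rot = np.linalg.norm(Q.T @ quantize(Q @ x) - x)
print(err_plain, err_rot)  # the rotated error is substantially smaller
```

Because `Q` is orthogonal, `Q @ x` preserves the vector's norm but spreads the outlier's magnitude over all 256 channels, so the max-abs value (and hence the quantization step size) drops sharply.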
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText-2 | -- | -- | 1624 |
| Commonsense Reasoning | WinoGrande | Accuracy | 78 | 1085 |
| Language Modeling | C4 | Perplexity | 8.06 | 1071 |
| Multi-task Language Understanding | MMLU | Accuracy | 73.5 | 876 |
| Question Answering | ARC Easy | -- | -- | 597 |
| Question Answering | PIQA | Accuracy | 82.7 | 374 |
| Commonsense Reasoning | HellaSwag | Accuracy | 77.5 | 350 |
| Language Modeling | Wiki2 | Perplexity | 4.65 | 149 |
| Question Answering | ARC Challenge | Accuracy | 57.9 | 142 |
| Mathematical Reasoning | GSM8K 8-shot | Accuracy | 86.7 | 26 |