MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models
About
Post-training quantization (PTQ) with computational invariance for Large Language Models (LLMs) has demonstrated remarkable advances; however, its application to Multimodal Large Language Models (MLLMs) presents substantial challenges. In this paper, we analyze SmoothQuant as a case study and identify two critical issues: Smoothing Misalignment and Cross-Modal Computational Invariance. To address these issues, we propose Modality-Aware Smoothing Quantization (MASQuant), a novel framework that introduces (1) Modality-Aware Smoothing (MAS), which learns separate, modality-specific smoothing factors to prevent Smoothing Misalignment, and (2) Cross-Modal Compensation (CMC), which addresses Cross-Modal Computational Invariance by using SVD whitening to transform multi-modal activation differences into low-rank forms, enabling unified quantization across modalities. MASQuant demonstrates stable quantization performance across both dual-modal and tri-modal MLLMs. Experimental results show that MASQuant is competitive with state-of-the-art PTQ algorithms. Source code: https://github.com/alibaba/EfficientAI.
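To make the two issues concrete, the following is a minimal NumPy sketch of SmoothQuant-style smoothing, not the MASQuant implementation. It assumes the standard SmoothQuant rule s_j = max|X_j|^α / max|W_j|^(1−α) and the identity X W = (X diag(s)⁻¹)(diag(s) W). The function names, tensor shapes, and the 20x scale applied to the "vision" activations are hypothetical choices for illustration only.

```python
import numpy as np

def smoothing_factors(X, W, alpha=0.5, eps=1e-8):
    """Per-input-channel SmoothQuant-style factors (illustrative helper)."""
    act_max = np.abs(X).max(axis=0)   # activation range per input channel
    w_max = np.abs(W).max(axis=1)     # weight range per input channel
    return np.maximum(act_max, eps) ** alpha / np.maximum(w_max, eps) ** (1 - alpha)

def smooth(X, W, s):
    """Move quantization difficulty from activations to weights."""
    return X / s, W * s[:, None]

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))                   # shared linear weight (C_in, C_out)
X_text = rng.normal(size=(16, 8))             # hypothetical text-token activations
X_vision = rng.normal(size=(16, 8)) * 20.0    # hypothetical vision tokens, larger range

# Modality-specific factors: each modality gets its own smoothing vector.
s_text = smoothing_factors(X_text, W)
s_vision = smoothing_factors(X_vision, W)

# Computational invariance holds within a single modality.
X_text_s, W_text_s = smooth(X_text, W, s_text)
assert np.allclose(X_text_s @ W_text_s, X_text @ W)

# Separate factors imply separate smoothed weights, so one quantized
# weight matrix cannot serve both modalities without some compensation.
_, W_vision_s = smooth(X_vision, W, s_vision)
weight_gap = W_vision_s - W_text_s            # equals diag(s_vision - s_text) @ W
assert np.allclose(weight_gap, (s_vision - s_text)[:, None] * W)
```

In this sketch, a single set of factors fitted to one modality would be misaligned with the other (the `s_text` and `s_vision` vectors differ), while modality-specific factors leave a nonzero `weight_gap`. The abstract describes CMC as the component that handles this cross-modal difference through SVD whitening and a low-rank form; that step is not reproduced here.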
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VizWiz | Accuracy | 71.5 | 1525 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 82.6 | 807 |
| Multimodal Optical Character Recognition | OCRBench | Recognition Score | 84.6 | 66 |
| Vision Understanding | MMMU | Accuracy | 49.9 | 65 |
| Scientific Question Answering | ScienceQA | Accuracy | 88.6 | 61 |
| Multimodal Understanding | MMMU | Accuracy | 46.7 | 38 |
| Voice recognition | LibriSpeech | WER | 2.7 | 34 |
| Vision-Audio-Text | OmniBench | Accuracy | 46.9 | 34 |
| Audio-Text | Wenetspeech | WER | 6.9 | 34 |