
Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models

About

Low-bit post-training quantization (PTQ) is a pivotal technique for deploying Vision-Language Models (VLMs) on resource-constrained devices. However, existing PTQ methods often degrade VLMs' accuracy due to the heterogeneous activation distributions of text and vision modalities during quantization. We find that this cross-modal heterogeneity is distributed unevenly across channels: a small subset of channels contains most modality-specific outliers, and these outliers typically reside in different channels for each modality. Motivated by this, we propose SplitQ, a channel-Splitting-driven post-training Quantization framework. At its core, SplitQ introduces a novel Modality-specific Outlier Channel Decoupling (MOCD) module that effectively isolates salient modality-specific outlier channels with minimal overhead. To further address the remaining cross-modal distribution discrepancies, we design an Adaptive Cross-Modal Calibration (ACC) module that employs dual lightweight learnable branches to dynamically mitigate modality-induced quantization errors. Extensive experiments on popular VLMs demonstrate that SplitQ significantly outperforms existing approaches across 6 popular multi-modal datasets under all evaluated quantization settings, including W4A8, W4A4, W3A3, and W3A2. Notably, SplitQ preserves 93.5% of FP16 performance under the challenging W3A3 setting (69.5 vs. 74.3), pushing the efficiency frontier for deploying advanced VLMs. Our code is available at https://github.com/EMVision-NK/SplitQ
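The channel-level observation above can be illustrated with a small toy sketch. This is not SplitQ's MOCD or ACC module, whose details are not given here. It assumes hand-written activation values, a simple per-channel absolute-maximum criterion for "outlier channel", and plain symmetric uniform quantization; all function names are hypothetical. It only shows why giving outlier channels their own quantization scale can reduce error for the remaining channels.

```python
from typing import List, Set

Matrix = List[List[float]]  # rows are tokens, columns are channels

# Toy activations (made-up values): text tokens spike in channel 0,
# vision tokens spike in channel 2.
TEXT: Matrix = [[40.0, 0.5, -0.3, 0.8], [-35.0, -0.7, 0.4, 0.2]]
VISION: Matrix = [[0.6, -0.4, 50.0, 0.3], [-0.2, 0.9, -45.0, -0.5]]


def channel_absmax(acts: Matrix) -> List[float]:
    """Largest absolute activation observed in each channel."""
    return [max(abs(row[c]) for row in acts) for c in range(len(acts[0]))]


def top_outlier_channels(acts: Matrix, k: int) -> Set[int]:
    """Indices of the k channels with the largest absolute activation."""
    peaks = channel_absmax(acts)
    ranked = sorted(range(len(peaks)), key=lambda c: peaks[c], reverse=True)
    return set(ranked[:k])


def fake_quantize(values: List[float], bits: int) -> List[float]:
    """Symmetric uniform quantize-dequantize using one shared scale."""
    if not values:
        return []
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def split_quant_error(acts: Matrix, bits: int, split: Set[int]) -> float:
    """Mean absolute error when channels in `split` get their own scale
    and all remaining channels share a second scale."""
    kept = [v for row in acts for c, v in enumerate(row) if c not in split]
    moved = [v for row in acts for c, v in enumerate(row) if c in split]
    total = 0.0
    for group in (kept, moved):
        dequantized = fake_quantize(group, bits)
        total += sum(abs(x - y) for x, y in zip(group, dequantized))
    return total / (len(kept) + len(moved))


# Outlier channels differ by modality in this toy data.
text_outliers = top_outlier_channels(TEXT, 1)
vision_outliers = top_outlier_channels(VISION, 1)

# Compare one shared scale against isolating both outlier channels.
mixed = TEXT + VISION
baseline_error = split_quant_error(mixed, 4, set())
split_error = split_quant_error(mixed, 4, text_outliers | vision_outliers)
```

In this toy setup the shared scale is dictated by the two spiking channels, so the small values in the other channels all round to zero; isolating the spiking channels lets the remaining channels use a much finer scale.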

Yi Zhong, Haotong Qin, Xindong Zhang, Lei Zhang, Guolei Sun • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| Visual Question Answering | VizWiz | Accuracy | 68.7 | 1820 |
| Visual Question Answering | TextVQA | Accuracy | 82.6 | 1453 |
| Science Question Answering | ScienceQA | Accuracy | 88.1 | 791 |
| Optical Character Recognition | OCRBench | Score | 83.5 | 433 |
| Multimodal Understanding | SEED | Accuracy | 73.2 | 216 |
| Multimodal Understanding | MMMU | Accuracy (MMMU) | 49.1 | 52 |
| Multimodal Understanding | SEED-I, VizWiz, ScienceQA | SEED-I Score | 68.3 | 22 |
| VLM Inference Efficiency | Qwen2.5-VL-7B | Prefill Latency (ms) | 742.2 | 4 |
