Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models

About

Low-bit post-training quantization (PTQ) is a pivotal technique for deploying Vision-Language Models (VLMs) on resource-constrained devices. However, existing PTQ methods often degrade VLMs' accuracy due to the heterogeneous activation distributions of text and vision modalities during quantization. We find that this cross-modal heterogeneity is distributed unevenly across channels: a small subset of channels contains most modality-specific outliers, and these outliers typically reside in different channels for each modality. Motivated by this, we propose SplitQ, a channel-Splitting-driven post-training Quantization framework. At its core, SplitQ introduces a novel Modality-specific Outlier Channel Decoupling (MOCD) module that effectively isolates salient modality-specific outlier channels with minimal overhead. To further address the remaining cross-modal distribution discrepancies, we design an Adaptive Cross-Modal Calibration (ACC) module that employs dual lightweight learnable branches to dynamically mitigate modality-induced quantization errors. Extensive experiments on popular VLMs demonstrate that SplitQ significantly outperforms existing approaches across 6 popular multi-modal datasets under all evaluated quantization settings, including W4A8, W4A4, W3A3, and W3A2. Notably, SplitQ preserves 93.5% of FP16 performance under the challenging W3A3 setting (69.5 vs. 74.3), pushing the efficiency frontier for deploying advanced VLMs. Our code is available at https://github.com/EMVision-NK/SplitQ
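The observation in the abstract, that each modality's outliers sit in a small and different set of channels, can be illustrated with a toy experiment. The sketch below is not the SplitQ implementation and does not reproduce MOCD or ACC. It assumes synthetic Gaussian activations, hypothetical outlier channels ({3, 17} for text, {8, 25} for vision), a 30x outlier magnitude, and a plain symmetric 4-bit quantizer with one scale per token. All function names are illustrative. It compares the quantization error when every channel shares one scale against the error when each modality's outlier channels get a scale of their own.

```python
import random


def fake_quantize(values, bits):
    """Symmetric uniform quantize-dequantize with one scale for all values."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def make_tokens(n_tokens, n_channels, hot_channels, rng):
    """Toy activations: unit Gaussian noise, amplified 30x in hot_channels."""
    return [
        [rng.gauss(0.0, 1.0) * (30.0 if c in hot_channels else 1.0)
         for c in range(n_channels)]
        for _ in range(n_tokens)
    ]


def find_outlier_channels(tokens, top_k):
    """Return the top_k channel indices ranked by peak absolute activation."""
    n_channels = len(tokens[0])
    peaks = [max(abs(t[c]) for t in tokens) for c in range(n_channels)]
    ranked = sorted(range(n_channels), key=lambda c: peaks[c], reverse=True)
    return set(ranked[:top_k])


def quantization_mse(tokens, bits, split=frozenset()):
    """Per-token quantization error; channels in `split` get their own scale."""
    total, count = 0.0, 0
    for t in tokens:
        rest = [c for c in range(len(t)) if c not in split]
        separate = [c for c in range(len(t)) if c in split]
        for group in (rest, separate):
            if not group:
                continue
            original = [t[c] for c in group]
            for x, q in zip(original, fake_quantize(original, bits)):
                total += (x - q) ** 2
                count += 1
    return total / count


rng = random.Random(0)  # fixed seed so the toy data is reproducible
text = make_tokens(64, 32, {3, 17}, rng)     # assumed text outlier channels
vision = make_tokens(64, 32, {8, 25}, rng)   # assumed vision outlier channels

# Detect outlier channels separately for each modality.
text_out = find_outlier_channels(text, top_k=2)
vision_out = find_outlier_channels(vision, top_k=2)

# Baseline: one scale per token. Split: outlier channels quantized separately.
mse_shared = (quantization_mse(text, 4) + quantization_mse(vision, 4)) / 2
mse_split = (quantization_mse(text, 4, text_out)
             + quantization_mse(vision, 4, vision_out)) / 2

print("text outlier channels:", sorted(text_out))
print("vision outlier channels:", sorted(vision_out))
print("4-bit MSE, shared scale:", mse_shared)
print("4-bit MSE, outlier channels split:", mse_split)
```

In this toy setting the shared scale is set by the outlier channels, so most ordinary channels round to zero. Giving the outlier channels their own scale lets the remaining channels use a much finer grid, which is the intuition behind separating modality-specific outlier channels before low-bit quantization.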

Yi Zhong, Haotong Qin, Xindong Zhang, Lei Zhang, Guolei Sun • 2026

Related benchmarks

Task	Dataset	Metric	Value	(unlabeled)
Visual Question Answering	VizWiz	Accuracy	68.7	1863
Visual Question Answering	TextVQA	Accuracy	82.6	1455
Science Question Answering	ScienceQA	Accuracy	88.1	916
Optical Character Recognition	OCRBench	Score	83.5	486
Multimodal Understanding	SEED	Accuracy	73.2	226
Multimodal Understanding	MMMU	Accuracy (MMMU)	49.1	73
Multimodal Understanding	SEED-I, VizWiz, ScienceQA	SEED-I Score	68.3	22
VLM Inference Efficiency	Qwen2.5-VL-7B	Prefill Latency (ms)	742.2	4

