
Q-VLM: Post-training Quantization for Large Vision-Language Models

About

In this paper, we propose a post-training quantization framework for large vision-language models (LVLMs) that enables efficient multi-modal inference. Conventional quantization methods sequentially search layer-wise rounding functions by minimizing activation discretization errors, which fails to find the optimal quantization strategy because it ignores cross-layer dependency. In contrast, we mine the cross-layer dependency that significantly influences the discretization errors of the entire vision-language model, and embed this dependency into the search for the optimal quantization strategy at low search cost. Specifically, we observe a strong correlation between activation entropy and the cross-layer dependency of output discretization errors. We therefore employ entropy as a proxy to partition layers into blocks optimally, achieving a satisfying trade-off between discretization error and search cost. Moreover, we optimize the visual encoder to disentangle the cross-layer dependency, enabling a fine-grained decomposition of the search space that further reduces search cost without harming quantization accuracy. Experimental results demonstrate that our method compresses memory by 2.78x and increases generation speed by 1.44x for the 13B LLaVA model without performance degradation on diverse multi-modal reasoning tasks. Code is available at https://github.com/ChangyuanWang17/QVLM.
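The entropy-as-proxy idea above can be sketched in a few lines: estimate each layer's activation entropy from a histogram, then greedily group consecutive layers into blocks, closing a block when entropy crosses a threshold. This is a minimal illustrative sketch, not the paper's implementation; `activation_entropy`, `partition_blocks`, and the `threshold` knob are all hypothetical names and values chosen for demonstration.

```python
import numpy as np

def activation_entropy(acts, bins=256):
    # Shannon entropy (in bits) of the activation histogram,
    # used here as a stand-in proxy for cross-layer dependency.
    hist, _ = np.histogram(acts, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def partition_blocks(layer_activations, threshold=6.0):
    # Greedily group consecutive layers into blocks: close the
    # current block whenever a layer's activation entropy exceeds
    # `threshold` (a hypothetical knob, not a value from the paper).
    blocks, current = [], []
    for idx, acts in enumerate(layer_activations):
        current.append(idx)
        if activation_entropy(acts) > threshold:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks
```

Each resulting block would then be quantized jointly, so that the rounding search accounts for dependency within a block while blocks remain independent, keeping the search cost low.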

Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu • 2024

Related benchmarks

Task | Dataset | Result | Rank
Visual Question Answering | VQA v2 | Accuracy 79.65 | 1165
Visual Question Answering | VizWiz | Accuracy 53.69 | 1043
Visual Question Answering | ScienceQA (test) | Accuracy 89.81 | 95
Visual Question Answering | SQA | Accuracy 72.27 | 23
Multimodal Question Answering | ScienceQA v1.3 (test) | NAT Score 0.8954 | 21
Hateful Meme Detection | Hateful Memes (val test) | Accuracy 51.05 | 15
Visual Question Answering | VizWiz (val test) | Accuracy 37.6 | 15
Multi-modal Reasoning | Science QA v1.3-13B (test) | Time 8.9 | 7

Other info

Code: https://github.com/ChangyuanWang17/QVLM
