QSVD: Efficient Low-rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models

About

Vision-Language Models (VLMs) are integral to tasks such as image captioning and visual question answering, but their high computational cost, driven by large memory footprints and processing time, limits their scalability and real-time applicability. In this work, we propose applying Singular-Value Decomposition (SVD) to the joint query (Q), key (K), and value (V) weight matrices to reduce KV-cache size and computational overhead. In addition, we introduce an efficient rank-allocation strategy that dynamically adjusts the SVD rank based on its impact on VLM accuracy, achieving a significant reduction in both memory usage and computational cost. Finally, we extend this approach by quantizing both VLM weights and activations, yielding a highly efficient VLM. Our method outperforms previous approaches that rely solely on quantization or SVD, achieving more than $10\%$ accuracy improvement at lower hardware cost and making it better suited for real-time deployment on resource-constrained devices. We open source our code at \href{https://github.com/SAI-Lab-NYU/QSVD}{\texttt{https://github.com/SAI-Lab-NYU/QSVD}}.
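The core idea of joint QKV low-rank factorization can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the hidden size, rank, and variable names below are illustrative assumptions, and the rank-allocation and quantization steps described in the abstract are omitted.

```python
import numpy as np

# Hypothetical shapes: hidden size d; the three d x d projection weights
# are stacked into one (3d x d) joint matrix before decomposition.
d, rank = 64, 16
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

# Truncated SVD of the stacked QKV weights, so Q, K, and V share one
# low-rank basis instead of being factorized independently.
W_qkv = np.concatenate([W_q, W_k, W_v], axis=0)      # (3d, d)
U, S, Vt = np.linalg.svd(W_qkv, full_matrices=False)
A = U[:, :rank] * S[:rank]                           # (3d, r) up-projection
B = Vt[:rank, :]                                     # (r, d) down-projection

# One rank-r projection replaces three full matmuls; caching the r-dim
# latent rather than full K/V rows shrinks the KV cache roughly by d/r.
x = rng.standard_normal(d)
latent = B @ x                                       # shared low-rank activation (r,)
q, k, v = np.split(A @ latent, 3)

# Relative reconstruction error of the rank-r approximation.
err = np.linalg.norm(W_qkv - A @ B) / np.linalg.norm(W_qkv)
```

The benefit of the joint factorization over three separate per-matrix SVDs is that a single shared basis `B` is applied to the input once, amortizing the down-projection across all three attention projections.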

Yutong Wang, Haiyu Wang, Sai Qian Zhang• 2025

Related benchmarks

Task | Dataset | Result | Rank
Multimodal Understanding | SEED-Bench | Accuracy 71.23 | 343
Science Question Answering | ScienceQA IMG | Accuracy 70.43 | 294
Optical Character Recognition | OCRBench | -- | 232
Multimodal Science Question Answering | ScienceQA IMG | Accuracy 95.54 | 131
High-Resolution Visual Perception | HR-Bench-4K | Accuracy 44.88 | 40
Vision-Language Evaluation | SEED-Bench | Accuracy 74.47 | 34
