
LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs

About

Large vision-language models (LVLMs) achieve strong multimodal understanding, but their inference cost grows rapidly with the number of visual tokens, especially for high-resolution images and long videos. Existing attention-based methods estimate token importance from attention scores, which may introduce positional bias, while representation-based methods reduce visual redundancy based on feature relations or reconstruction errors, overlooking the global structure of the visual token set. In this paper, we revisit visual token compression from the perspective of low-rank compressibility. Across models and datasets, we observe that visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed. Motivated by this finding, we propose LRCP, a training-free compression framework that first estimates the dominant low-rank subspace of visual tokens via PCA, and then scores each token by its projection residual onto this subspace, retaining tokens that are poorly explained by the low-rank background. Extensive experiments show that LRCP achieves superior results, preserving 94.7% of the original image-understanding performance with an 88.9% token reduction and 97.8% of the average video-understanding accuracy with an 87.5% token reduction.
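The scoring step described above (estimate a dominant subspace with PCA, then rank tokens by how poorly that subspace explains them) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `residual_prune`, the parameters `rank` and `keep`, the mean-centering step, and the use of the Euclidean norm of the residual as the score are all hypothetical choices made for this example.

```python
import numpy as np


def residual_prune(tokens, rank=8, keep=64):
    """Keep the tokens that a low-rank PCA subspace explains worst.

    tokens: (N, D) array of visual token features.
    rank:   assumed dimension of the dominant subspace (hypothetical default).
    keep:   assumed number of tokens to retain (hypothetical default).
    Returns (kept_indices_sorted, residual_scores).
    """
    X = np.asarray(tokens, dtype=np.float64)

    # Center the tokens; PCA directions are defined on centered data.
    Xc = X - X.mean(axis=0, keepdims=True)

    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[:rank]                      # shape (rank, D), orthonormal rows

    # Project each token onto the subspace and measure what is left over.
    projected = Xc @ basis.T @ basis
    residual = np.linalg.norm(Xc - projected, axis=1)

    # Retain the tokens with the largest residual, returned in original order.
    keep = min(keep, X.shape[0])
    kept = np.argsort(-residual)[:keep]
    return np.sort(kept), residual
```

Calling `residual_prune(features, rank=8, keep=64)` on an `(N, D)` feature matrix returns the indices of the retained tokens in their original order, so positional order is preserved when the pruned sequence is passed on to the language model. How the paper chooses the subspace rank and the token budget is not stated in the abstract, so both are left as plain arguments here.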

Hongyu Lu, Feng Zhang, Wenwei Jin, Huanling Hu, Tianjun Shi, Shikai Jiang, Yao Hu, Jiawei Li • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | Accuracy | 87 | 2019 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 57.4 | 962 |
| Multimodal Understanding | MMBench | Accuracy | 63.7 | 847 |
| Science Question Answering | ScienceQA | Accuracy | 68.8 | 791 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 31.8 | 517 |
| Multimodal Understanding | SEED-Bench | Accuracy | 57.2 | 516 |
| Multimodal Understanding | MMBench CN | Accuracy | 60.1 | 254 |
| Visual Question Answering | GQA | Accuracy | 60.3 | 155 |
| Visual Question Answering | GQA | GQA Score | 63.2 | 139 |
| Multimodal Understanding | LLaVA-Bench | Overall Score | 67.5 | 94 |
Showing 10 of 19 rows
