
Rethinking Visual Token Reduction in LVLMs Under Cross-Modal Misalignment

About

Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial computational overhead and limiting the scalability of LVLMs in practice. Previous efforts have explored visual token reduction either prior to or within the large language model (LLM). However, most in-LLM reduction approaches rely on text-conditioned interactions, implicitly assuming that textual tokens can reliably identify the important visual tokens. In this work, we revisit this assumption and reveal causal, semantic, and spatial forms of cross-modal misalignment, all of which undermine the effectiveness of text-guided visual token reduction. To address this, we introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention, without relying on textual signals. To further suppress redundancy throughout the model hierarchy, we treat the visual encoder and the LLM as a unified system and design a progressive pruning pipeline. Our method performs dominant token selection and lightweight contextual merging at multiple stages, enabling fine-grained visual information to be retained even under aggressive token budgets. Extensive experiments across diverse benchmarks show that VisionDrop achieves consistent improvements over existing approaches, despite requiring no additional training or complex modifications. Notably, when integrated with LLaVA-NeXT-7B, VisionDrop achieves a 2.7x reduction in inference latency and a 6x reduction in FLOPs, while retaining 95.71% of the original performance.
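The two operations named in the abstract, dominant token selection via visual-to-visual attention and lightweight contextual merging, can be illustrated with a minimal sketch. This is a hypothetical reconstruction for intuition only, not the paper's implementation; the scoring rule (mean received attention), the similarity-based assignment, and the function name are all assumptions.

```python
import numpy as np

def prune_visual_tokens(tokens, attn, keep_ratio=0.25):
    """Hypothetical sketch of visual-only token pruning.

    tokens: (N, D) visual token embeddings
    attn:   (N, N) visual-to-visual attention weights (rows sum to 1)
    Keeps the most-attended ("dominant") tokens and folds each dropped
    token into its most similar kept token (contextual merging).
    """
    n, _ = tokens.shape
    k = max(1, int(n * keep_ratio))
    # score each token by the attention it receives from all visual tokens
    importance = attn.mean(axis=0)                      # (N,)
    keep = np.sort(np.argsort(importance)[-k:])         # dominant token indices
    drop = np.setdiff1d(np.arange(n), keep)
    kept = tokens[keep].copy()
    if drop.size == 0:
        return kept
    # assign each dropped token to its nearest kept token by cosine similarity
    unit = lambda x: x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sim = unit(tokens[drop]) @ unit(kept).T             # (N-k, k)
    assign = sim.argmax(axis=1)
    # merge: average each kept token with the dropped tokens assigned to it
    for j in range(k):
        group = tokens[drop[assign == j]]
        if group.size:
            kept[j] = (kept[j] + group.sum(axis=0)) / (1 + len(group))
    return kept
```

Applied progressively at several stages of the encoder and LLM, as the abstract describes, a small `keep_ratio` at each stage compounds into an aggressive overall token budget while the merging step preserves some context from the discarded tokens.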

Rui Xu, Yunke Wang, Yong Luo, Bo Du • 2025

Related benchmarks

Task | Dataset | Result | Rank
Visual Question Answering | VizWiz | Accuracy 52.28 | 1525
Object Hallucination Evaluation | POPE | Accuracy 87.23 | 1455
Text-based Visual Question Answering | TextVQA | Accuracy 57.81 | 807
Multimodal Evaluation | MME | -- | 658
Visual Question Answering | GQA | Accuracy 59.99 | 505
Video Question Answering | MSVD | Accuracy 68.4 | 152
Video Question Answering | MSRVTT | Accuracy 54.6 | 100
Visual Question Answering | MMBench (MMB) | Accuracy 65.19 | 76
Visual Question Answering | MMBench CN | Accuracy 58.41 | 62
Scientific Question Answering | ScienceQA | Accuracy 69.41 | 61

(10 of 13 rows shown)
