
Rethinking Visual Token Reduction in LVLMs Under Cross-Modal Misalignment

About

Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial computational overhead and limiting the scalability of LVLMs in practice. Previous efforts have explored visual token reduction either prior to or within the large language model (LLM). However, most in-LLM reduction approaches rely on text-conditioned interactions, implicitly assuming that textual tokens can reliably identify the important visual tokens. In this work, we revisit this assumption and reveal causal, semantic, and spatial forms of cross-modal misalignment, all of which undermine the effectiveness of text-guided visual token reduction. To address this, we introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention, without relying on textual signals. To further suppress redundancy throughout the model hierarchy, we treat the visual encoder and the LLM as a unified system and design a progressive pruning pipeline. Our method performs dominant token selection and lightweight contextual merging at multiple stages, enabling fine-grained visual information to be retained even under aggressive token budgets. Extensive experiments across diverse benchmarks show that VisionDrop achieves consistent improvements over existing approaches, despite requiring no additional training or complex modifications. Notably, when integrated with LLaVA-NeXT-7B, VisionDrop achieves a 2.7x reduction in inference latency and a 6x reduction in FLOPs, while retaining 95.71% of the original performance.
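The two operations named in the abstract, dominant token selection via visual-to-visual attention and lightweight contextual merging, can be illustrated with a minimal sketch. This is a hypothetical reconstruction for intuition only, not the paper's implementation; the scoring rule (mean received attention), the similarity-based assignment, and the function name are all assumptions.

```python
import numpy as np

def prune_visual_tokens(tokens, attn, keep_ratio=0.25):
    """Hypothetical sketch of visual-only token pruning.

    tokens: (N, D) visual token embeddings
    attn:   (N, N) visual-to-visual attention weights (rows sum to 1)
    Keeps the most-attended ("dominant") tokens and folds each dropped
    token into its most similar kept token (contextual merging).
    """
    n, _ = tokens.shape
    k = max(1, int(n * keep_ratio))
    # score each token by the attention it receives from all visual tokens
    importance = attn.mean(axis=0)                      # (N,)
    keep = np.sort(np.argsort(importance)[-k:])         # dominant token indices
    drop = np.setdiff1d(np.arange(n), keep)
    kept = tokens[keep].copy()
    if drop.size == 0:
        return kept
    # assign each dropped token to its nearest kept token by cosine similarity
    unit = lambda x: x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sim = unit(tokens[drop]) @ unit(kept).T             # (N-k, k)
    assign = sim.argmax(axis=1)
    # merge: average each kept token with the dropped tokens assigned to it
    for j in range(k):
        group = tokens[drop[assign == j]]
        if group.size:
            kept[j] = (kept[j] + group.sum(axis=0)) / (1 + len(group))
    return kept
```

Applied progressively at several stages of the encoder and LLM, as the abstract describes, a small `keep_ratio` at each stage compounds into an aggressive overall token budget while the merging step preserves some context from the discarded tokens.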

Rui Xu, Yunke Wang, Yong Luo, Bo Du • 2025

Related benchmarks

Task | Dataset | Result | Rank
Visual Question Answering | VizWiz | Accuracy 52.28 | 1525
Object Hallucination Evaluation | POPE | Accuracy 87.23 | 1455
Text-based Visual Question Answering | TextVQA | Accuracy 57.81 | 807
Multimodal Evaluation | MME | -- | 658
Visual Question Answering | GQA | Accuracy 59.99 | 505
Video Question Answering | MSVD | Accuracy 68.4 | 152
Video Question Answering | MSRVTT | Accuracy 54.6 | 100
Visual Question Answering | MMBench (MMB) | Accuracy 65.19 | 76
Visual Question Answering | MMBench CN | Accuracy 58.41 | 62
Scientific Question Answering | ScienceQA | Accuracy 69.41 | 61

(10 of 13 rows shown)
