
IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models

About

Large Vision Language Models show impressive performance across image and video understanding tasks, yet their computational cost grows rapidly with the number of visual tokens. Existing token pruning methods mitigate this issue through empirical heuristics while overlooking the internal mechanism of attention. In this paper, we propose a novel training-free token pruning framework grounded in the dual-form perspective of attention. We reformulate attention as an implicit linear layer whose weight matrix is a sum of rank-1 outer products, each generated by a single token's key-value pair. Token pruning thus reduces to selecting an optimal subset of these rank-1 updates that best approximates the original dual weight matrix. Extending this perspective to standard softmax attention in LVLMs, we derive a novel metric that quantifies both a token's information magnitude and its information duplication. To efficiently select a subset under the proposed metric, we introduce Progressive Chunked Maximal Marginal Relevance. Extensive experiments demonstrate that our method achieves a better trade-off between performance and efficiency, while providing a new perspective on existing pruning approaches.
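The dual-form view in the abstract can be made concrete with a short NumPy sketch (illustrative only; the dimensions and variable names are assumptions, not the paper's code). For unnormalized linear attention, applying a query to the implicit weight matrix W = Σᵢ kᵢvᵢᵀ gives exactly the usual attention output, so pruning a token is the same as deleting one rank-1 term from W:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4                      # number of visual tokens, head dimension (assumed)
K = rng.normal(size=(n, d))      # keys, one row per token
V = rng.normal(size=(n, d))      # values, one row per token
q = rng.normal(size=(d,))        # a single query

# Primal form of (unnormalized) linear attention: sum_i (q . k_i) v_i
primal = (K @ q) @ V

# Dual form: the same computation, seen as q applied to an implicit
# weight matrix W = sum_i outer(k_i, v_i) -- one rank-1 update per token
W = sum(np.outer(K[i], V[i]) for i in range(n))
dual = q @ W

assert np.allclose(primal, dual)

# Pruning token j removes its rank-1 term from W; the induced error
# for this query is the contribution ||(q . k_j) v_j|| of that token
j = 3
W_pruned = W - np.outer(K[j], V[j])
```

Under this view, selecting which tokens to keep is a subset-selection problem over rank-1 updates, which motivates a metric combining each token's magnitude with its redundancy against other tokens.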

Dong-Jae Lee, Sunghyun Baek, Junmo Kim • 2026
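Maximal Marginal Relevance, which the proposed Progressive Chunked variant builds on, greedily balances each candidate's own score against its similarity to items already kept. A minimal sketch of standard greedy MMR follows (not the paper's progressive chunked algorithm; `scores`, `feats`, and the trade-off weight `lam` are assumed inputs):

```python
import numpy as np

def mmr_select(scores, feats, k, lam=0.7):
    """Greedy MMR: pick k items, trading off relevance (scores)
    against redundancy (cosine similarity to already-selected items)."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T                        # pairwise cosine similarity
    selected = [int(np.argmax(scores))]          # seed with the top-scoring item
    candidates = set(range(len(scores))) - set(selected)
    while len(selected) < k and candidates:
        best, best_val = None, -np.inf
        for i in candidates:
            redundancy = max(sim[i][j] for j in selected)
            val = lam * scores[i] - (1 - lam) * redundancy
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
        candidates.remove(best)
    return selected

rng = np.random.default_rng(0)
scores = rng.random(10)             # per-token relevance scores (assumed)
feats = rng.normal(size=(10, 4))    # per-token features (assumed)
kept = mmr_select(scores, feats, k=4)
```

The plain greedy loop above is O(k·n) in similarity lookups per selection round; chunking the candidates, as the paper's name suggests, is one way to keep selection cheap at LVLM token counts.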

Related benchmarks

Task | Dataset | Metric | Result | Rank
Text-based Visual Question Answering | TextVQA | Accuracy | 75.7 | 807
Multimodal Understanding | MMBench | Accuracy | 84.2 | 637
Multimodal Understanding | MMMU | Accuracy | 58.9 | 437
Multimodal Understanding | MMStar | Accuracy | 61.0 | 324
Document Visual Question Answering | DocVQA | ANLS | 82.1 | 263
Video Understanding | MLVU | Score | 60.9 | 221
Video Understanding | EgoSchema | EgoSchema Score | 62.2 | 158
Multimodal Understanding | LVLM Evaluation Suite (AI2D, DocVQA, InfoVQA, MMBench, MME, MMMU, SciQA, TextVQA, MMStar, POPE) | AI2D | 81.8 | 38
Scientific Question Answering | SciQA | Accuracy | 94.2 | 35
Multimodal Understanding | MME | Score | 2380 | 16
Showing 10 of 14 rows
