Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference

About

Multimodal large language models (MLLMs) demand considerable computations for inference due to the extensive parameters and the additional input tokens needed for visual information representation. Herein, we introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference. Our approach is inspired by two intriguing phenomena we have observed: (1) the attention sink phenomenon that is prevalent in LLMs also persists in MLLMs, suggesting that initial tokens and nearest tokens receive the majority of attention, while middle vision tokens garner minimal attention in deep layers; (2) the presence of information migration, which implies that visual information is transferred to subsequent text tokens within the first few layers of MLLMs. As per our findings, we conclude that vision tokens are unnecessary in the deep layers of MLLMs. Thus, we strategically withdraw them at a certain layer, enabling only text tokens to engage in subsequent layers. To pinpoint the ideal layer for VTW, we initially analyze a limited set of tiny datasets and choose the first layer that meets the Kullback-Leibler divergence criterion. Our VTW approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
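The two mechanical steps the abstract describes, dropping vision tokens at one layer and picking that layer with a Kullback-Leibler test, can be illustrated with a small Python sketch. This is not the authors' implementation. The function names, the boolean `visual_mask`, the direction of the divergence (reference output against output with tokens withdrawn), and the threshold values are all assumptions made for illustration, and the probability vectors are toy numbers rather than measured results.

```python
import math

def withdraw_visual_tokens(hidden_states, visual_mask):
    """Keep only the positions that are not flagged as visual tokens.

    hidden_states: one entry per token position (any per-token object).
    visual_mask:   booleans of the same length, True for vision tokens.
    """
    return [h for h, is_visual in zip(hidden_states, visual_mask) if not is_visual]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def select_withdrawal_layer(reference_probs, probs_by_layer, threshold):
    """Return the first (shallowest) layer whose output stays close to the reference.

    reference_probs: output distribution with vision tokens kept in all layers.
    probs_by_layer:  hypothetical mapping {layer index: output distribution
                     obtained when vision tokens are withdrawn at that layer}.
    threshold:       assumed KL tolerance; returns None if no layer meets it.
    """
    for layer in sorted(probs_by_layer):
        if kl_divergence(reference_probs, probs_by_layer[layer]) < threshold:
            return layer
    return None

# Toy sequence: one system token, three vision tokens, two text tokens.
tokens = ["sys", "img0", "img1", "img2", "what", "is"]
mask = [False, True, True, True, False, False]
print(withdraw_visual_tokens(tokens, mask))  # ['sys', 'what', 'is']

# Toy output distributions over a three-word vocabulary (made-up numbers).
reference = [0.7, 0.2, 0.1]
candidates = {
    4: [0.4, 0.4, 0.2],      # withdrawing early changes the output a lot
    8: [0.6, 0.25, 0.15],    # closer to the reference
    16: [0.69, 0.21, 0.10],  # nearly identical to the reference
}
print(select_withdrawal_layer(reference, candidates, threshold=0.01))  # 16
print(select_withdrawal_layer(reference, candidates, threshold=0.05))  # 8
```

A looser threshold selects an earlier layer, which removes vision tokens from more of the network and saves more computation, at the cost of a larger allowed shift in the output distribution.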

Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji • 2024

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 51.3	2056
Visual Question Answering	VizWiz	Accuracy 52.1	1863
Visual Question Answering	TextVQA	Accuracy 73.33	1455
Visual Question Answering	GQA	Accuracy 63.7	1445
Science Question Answering	ScienceQA	Accuracy 75.3	916
Multimodal Evaluation	MME	Score 1.53e+3	902
Video Understanding	MVBench	Accuracy 72.43	635
Visual Question Answering	ChartQA	Accuracy 82.24	620
Visual Question Answering	GQA	Accuracy 55.14	524
Multimodal Understanding	MMStar	Accuracy 55.7	511

Showing 10 of 90 rows
