VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration
About
Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration, integrating two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the superior performance of VisionTrim, advancing practical MLLM deployment in real-world applications. The code is available at: https://github.com/hanxunyu/VisionTrim.
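To illustrate the two-stage idea behind a select-then-merge pipeline like VisionTrim, here is a minimal NumPy sketch. This is an assumption-laden toy, not the paper's actual implementation: the attention-score selection criterion, the single softmax-weighted complement token, and all names and shapes (`dominant_token_selection`, `text_guided_complement`, `keep_ratio`) are illustrative choices for exposition only.

```python
import numpy as np

def dominant_token_selection(vision_tokens, global_attn, keep_ratio=0.25):
    """Keep the vision tokens with the highest global attention scores.

    vision_tokens: (N, D) visual token embeddings.
    global_attn:   (N,) attention weights from a global query (e.g. [CLS])
                   to each visual token.
    """
    n_keep = max(1, int(len(vision_tokens) * keep_ratio))
    keep_idx = np.argsort(global_attn)[::-1][:n_keep]  # top-k by score
    keep_idx.sort()  # restore original spatial order
    return vision_tokens[keep_idx], keep_idx

def text_guided_complement(vision_tokens, keep_idx, text_embed):
    """Merge the pruned tokens into one complement token, weighted by
    each pruned token's similarity to the text embedding."""
    pruned_mask = np.ones(len(vision_tokens), dtype=bool)
    pruned_mask[keep_idx] = False
    pruned = vision_tokens[pruned_mask]
    if len(pruned) == 0:
        return vision_tokens[keep_idx]
    sim = pruned @ text_embed                      # text-vision similarity
    w = np.exp(sim - sim.max())
    w /= w.sum()                                   # softmax weights
    complement = (w[:, None] * pruned).sum(axis=0, keepdims=True)
    return np.concatenate([vision_tokens[keep_idx], complement], axis=0)

# Toy data: 16 visual tokens of dimension 8, random attention and text query.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))
attn = rng.random(16)
text = rng.standard_normal(8)

kept, idx = dominant_token_selection(tokens, attn, keep_ratio=0.25)
out = text_guided_complement(tokens, idx, text)
print(kept.shape, out.shape)  # (4, 8) (5, 8): 4 kept tokens + 1 complement
```

With a 0.25 keep ratio, 16 tokens collapse to 5 (4 selected plus 1 text-weighted complement), a 3x reduction in the visual sequence fed to the LLM; the real framework applies this training-free, as a drop-in step between the vision encoder and the language model.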
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 86.3 | 935 |
| Multimodal Evaluation | MME | Score | 1830 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 61.0 | 496 |
| Video Question Answering | MSRVTT-QA | Accuracy | 54.9 | 481 |
| Multimodal Understanding | MMBench | Accuracy | 82.8 | 367 |
| Video Question Answering | MSVD-QA | Accuracy | 68.6 | 340 |
| Video Question Answering | ActivityNet-QA | Accuracy | 43.5 | 319 |
| Multimodal Understanding | MMMU | Accuracy | 53.8 | 275 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy | 37.9 | 266 |
| Science Question Answering | ScienceQA | Accuracy | 70.7 | 229 |