VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration
About
Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration, integrating two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the superior performance of VisionTrim, advancing practical MLLM deployment in real-world applications. The code is available at: https://github.com/hanxunyu/VisionTrim.
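To illustrate the two-stage idea behind a select-then-merge pipeline like VisionTrim, here is a minimal NumPy sketch. This is an assumption-laden toy, not the paper's actual implementation: the attention-score selection criterion, the single softmax-weighted complement token, and all names and shapes (`dominant_token_selection`, `text_guided_complement`, `keep_ratio`) are illustrative choices for exposition only.

```python
import numpy as np

def dominant_token_selection(vision_tokens, global_attn, keep_ratio=0.25):
    """Keep the vision tokens with the highest global attention scores.

    vision_tokens: (N, D) visual token embeddings.
    global_attn:   (N,) attention weights from a global query (e.g. [CLS])
                   to each visual token.
    """
    n_keep = max(1, int(len(vision_tokens) * keep_ratio))
    keep_idx = np.argsort(global_attn)[::-1][:n_keep]  # top-k by score
    keep_idx.sort()  # restore original spatial order
    return vision_tokens[keep_idx], keep_idx

def text_guided_complement(vision_tokens, keep_idx, text_embed):
    """Merge the pruned tokens into one complement token, weighted by
    each pruned token's similarity to the text embedding."""
    pruned_mask = np.ones(len(vision_tokens), dtype=bool)
    pruned_mask[keep_idx] = False
    pruned = vision_tokens[pruned_mask]
    if len(pruned) == 0:
        return vision_tokens[keep_idx]
    sim = pruned @ text_embed                      # text-vision similarity
    w = np.exp(sim - sim.max())
    w /= w.sum()                                   # softmax weights
    complement = (w[:, None] * pruned).sum(axis=0, keepdims=True)
    return np.concatenate([vision_tokens[keep_idx], complement], axis=0)

# Toy data: 16 visual tokens of dimension 8, random attention and text query.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))
attn = rng.random(16)
text = rng.standard_normal(8)

kept, idx = dominant_token_selection(tokens, attn, keep_ratio=0.25)
out = text_guided_complement(tokens, idx, text)
print(kept.shape, out.shape)  # (4, 8) (5, 8): 4 kept tokens + 1 complement
```

With a 0.25 keep ratio, 16 tokens collapse to 5 (4 selected plus 1 text-weighted complement), a 3x reduction in the visual sequence fed to the LLM; the real framework applies this training-free, as a drop-in step between the vision encoder and the language model.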
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 86.3 | 935 |
| Multimodal Evaluation | MME | Score | 1830 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 61.0 | 496 |
| Video Question Answering | MSRVTT-QA | Accuracy | 54.9 | 481 |
| Multimodal Understanding | MMBench | Accuracy | 82.8 | 367 |
| Video Question Answering | MSVD-QA | Accuracy | 68.6 | 340 |
| Video Question Answering | ActivityNet-QA | Accuracy | 43.5 | 319 |
| Multimodal Understanding | MMMU | Accuracy | 53.8 | 275 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy | 37.9 | 266 |
| Science Question Answering | ScienceQA | Accuracy | 70.7 | 229 |