
How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A

About

Vision-language models improve perception by feeding increasingly long visual token sequences into language backbones, but the resulting inference cost raises a basic scaling question: as multimodal models grow, how many visual tokens are actually needed, and how should they be allocated under a fixed visual token budget? Existing training-free pruning methods typically answer this with one-shot proxies such as decoder attention, visual similarity, or conditional diversity. We argue that visual token pruning is better viewed as task-conditioned evidence search, especially under aggressive compression and across model scales. We propose F^3A, a training-free router for visual token pruning that operates before the language model consumes image tokens. F^3A builds lightweight question-conditioned cues, matches them to visual-grid tokens through frozen sparse sensing heads, and allocates a fixed visual token budget via coarse evidence localization, local refinement, coverage-preserving competition, and recovery of under-covered regions. It requires no model training and no extra LLM forward pass, and it preserves the original multimodal prompting and decoding pipeline.

YiJie Huang, Yiqun Zhang, Zhuoyue Jia, Xiaocui Yang, Junzhao Huang, Zihan Wang, Shi Feng, Daling Wang, Yifei Zhang, Yongkang Liu • 2026
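
As a rough illustration of what allocating a fixed visual token budget over a grid can look like, the sketch below keeps the best token in each coarse cell first (a coverage floor) and then fills the remaining budget with the highest-scoring leftover tokens. This is not the F^3A algorithm: the abstract does not specify its scoring or selection rules, so the function name `select_tokens`, the `cell` size, and the assumption that a per-token relevance map `scores` is already available are all hypothetical.

```python
from typing import Dict, List, Tuple

Pos = Tuple[int, int]


def select_tokens(scores: List[List[float]], budget: int, cell: int = 2) -> List[Pos]:
    """Pick up to `budget` grid positions from a 2D relevance map.

    scores[r][c] is an assumed question-to-token relevance score;
    how such a score would be computed is outside this sketch.
    """
    rows, cols = len(scores), len(scores[0])
    budget = min(budget, rows * cols)

    def score(rc: Pos) -> float:
        return scores[rc[0]][rc[1]]

    # Group token positions into coarse cells of size cell x cell.
    cells: Dict[Pos, List[Pos]] = {}
    for r in range(rows):
        for c in range(cols):
            cells.setdefault((r // cell, c // cell), []).append((r, c))

    # Order the tokens inside each cell best-first.
    for members in cells.values():
        members.sort(key=score, reverse=True)

    # Coverage floor: visit cells by peak score and keep each cell's best token.
    ranked = sorted(cells, key=lambda k: score(cells[k][0]), reverse=True)
    chosen: List[Pos] = [cells[k][0] for k in ranked[:budget]]

    # Competition: spend what is left on the highest-scoring remaining tokens.
    taken = set(chosen)
    leftovers = [rc for members in cells.values() for rc in members if rc not in taken]
    leftovers.sort(key=score, reverse=True)
    chosen.extend(leftovers[: budget - len(chosen)])
    return sorted(chosen)


if __name__ == "__main__":
    demo = [
        [0.9, 0.8, 0.1, 0.0],
        [0.7, 0.6, 0.0, 0.2],
        [0.0, 0.1, 0.3, 0.1],
        [0.05, 0.0, 0.1, 0.0],
    ]
    print(select_tokens(demo, budget=6))
```

With a budget of 6 on the 4x4 demo grid, four tokens go to covering the four cells and the remaining two go to the strongest cell. A pure top-k selection would instead spend most of the budget in the top-left corner and leave several regions without any token, which is the failure mode that coverage-preserving allocation is meant to avoid.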

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy 89.1 | 2019 |
| Science Question Answering | ScienceQA | Accuracy 94.97 | 791 |
| Multimodal Evaluation | MME | Score 2.21e+3 | 727 |
| Diagram Question Answering | AI2D | AI2D Accuracy 86.43 | 387 |
| Science Question Answering | ScienceQA IMG | Accuracy 73.52 | 335 |
| Diagram Understanding | AI2D | Accuracy 79.45 | 317 |
| Multimodal Perception and Cognition | MME | Overall Score 2.49e+3 | 270 |
| Multimodal Understanding | MMBench CN | Accuracy 84.71 | 254 |
| Multimodal Model Evaluation | MMBench | Accuracy 81.16 | 204 |
| Real-world Visual Question Answering | RealworldQA | Accuracy 62.61 | 173 |
Showing 10 of 50 rows
