How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A

About

Vision-language models improve perception by feeding increasingly long visual token sequences into language backbones, but the resulting inference cost raises a basic scaling question: as multimodal models grow, how many visual tokens are actually needed, and how should they be allocated under a fixed visual token budget? Existing training-free pruning methods typically answer this with one-shot proxies such as decoder attention, visual similarity, or conditional diversity. We argue that visual token pruning is better viewed as task-conditioned evidence search, especially under aggressive compression and across model scales. We propose F^3A, a training-free router for visual token pruning that operates before the language model consumes image tokens. F^3A builds lightweight question-conditioned cues, matches them to visual-grid tokens through frozen sparse sensing heads, and allocates a fixed visual token budget via coarse evidence localization, local refinement, coverage-preserving competition, and recovery of under-covered regions. It requires no model training and no extra LLM forward pass, and it preserves the original multimodal prompting and decoding pipeline.
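To make the budget-allocation idea concrete, here is a minimal sketch in plain Python. It is not the F^3A implementation: the abstract does not specify the algorithm at this level, so the function name `select_visual_tokens`, the parameters `cell`, `top_cells`, and `cap`, and the one-quarter recovery reserve are all hypothetical choices made for illustration. The sketch assumes a question-conditioned relevance score per grid token has already been computed, then loosely mirrors the four stages named above: coarse localization over grid cells, local refinement inside the strongest cells, a per-cell cap so no single region takes the whole budget, and recovery of cells that would otherwise be left uncovered.

```python
def select_visual_tokens(scores, budget, cell=2, top_cells=2, cap=3):
    """Return `budget` (row, col) positions chosen from a 2-D relevance map.

    Illustrative only: all names and default values here are assumptions.
    """
    rows, cols = len(scores), len(scores[0])
    budget = min(budget, rows * cols)

    def score(rc):
        return scores[rc[0]][rc[1]]

    # Group grid positions into coarse cells of size cell x cell,
    # each sorted by descending relevance.
    cells = {}
    for r in range(rows):
        for c in range(cols):
            cells.setdefault((r // cell, c // cell), []).append((r, c))
    for members in cells.values():
        members.sort(key=score, reverse=True)

    # Coarse localization: rank cells by their single best token.
    ranked = sorted(cells, key=lambda k: score(cells[k][0]), reverse=True)
    focus, rest = ranked[:top_cells], ranked[top_cells:]

    # Local refinement with a coverage cap: at most `cap` tokens per focus cell.
    candidates = [rc for k in focus for rc in cells[k][:cap]]
    candidates.sort(key=score, reverse=True)

    # Hold back part of the budget (assumed: up to a quarter) for recovery.
    reserve = min(len(rest), budget // 4)
    chosen = candidates[: budget - reserve]

    # Recovery: give each uncovered cell its best token while budget remains.
    for k in rest:
        if len(chosen) >= budget:
            break
        chosen.append(cells[k][0])

    # If the cap left budget unused, fill with the best remaining tokens.
    if len(chosen) < budget:
        taken = set(chosen)
        leftovers = sorted(
            (rc for m in cells.values() for rc in m if rc not in taken),
            key=score,
            reverse=True,
        )
        chosen.extend(leftovers[: budget - len(chosen)])
    return sorted(chosen)


# Toy 4x4 relevance map: strong evidence top-left, weaker evidence bottom-right.
grid = [
    [0.90, 0.80, 0.10, 0.00],
    [0.70, 0.60, 0.00, 0.20],
    [0.00, 0.10, 0.30, 0.10],
    [0.05, 0.00, 0.10, 0.40],
]
kept = select_visual_tokens(grid, budget=6)
```

In this toy run the cap keeps the top-left cell from absorbing every slot, so the token at (1, 1) is dropped despite its relatively high score, and the reserved slot goes to the best token of an otherwise uncovered cell, (1, 3). A pure top-k selection over the same map would instead spend four of the six slots on the top-left cell.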

YiJie Huang, Yiqun Zhang, Zhuoyue Jia, Xiaocui Yang, Junzhao Huang, Zihan Wang, Shi Feng, Daling Wang, Yifei Zhang, Yongkang Liu • 2026

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 89.1	2056
Science Question Answering	ScienceQA	Accuracy 94.97	916
Multimodal Evaluation	MME	Score 2.21e+3	902
Diagram Question Answering	AI2D	AI2D Accuracy 86.43	509
Diagram Understanding	AI2D	Accuracy 79.45	377
Science Question Answering	ScienceQA IMG	Accuracy 73.52	357
Multimodal Perception and Cognition	MME	Overall Score 2.49e+3	344
Multimodal Understanding	MMBench CN	Accuracy 84.71	302
Multimodal Model Evaluation	MMBench	Accuracy 81.16	265
Real-world Visual Question Answering	RealworldQA	Accuracy 62.61	183

Showing 10 of 50 rows
