How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A
About
Vision-language models improve perception by feeding increasingly long visual token sequences into language backbones, but the resulting inference cost raises a basic scaling question: as multimodal models grow, how many visual tokens are actually needed, and how should they be allocated under a fixed visual token budget? Existing training-free pruning methods typically answer this with one-shot proxies such as decoder attention, visual similarity, or conditional diversity. We argue that visual token pruning is better viewed as task-conditioned evidence search, especially under aggressive compression and across model scales. We propose F^3A, a training-free router for visual token pruning that operates before the language model consumes image tokens. F^3A builds lightweight question-conditioned cues, matches them to visual-grid tokens through frozen sparse sensing heads, and allocates a fixed visual token budget via coarse evidence localization, local refinement, coverage-preserving competition, and recovery of under-covered regions. It requires no model training and no extra LLM forward pass, and it preserves the original multimodal prompting and decoding pipeline.
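As a rough illustration of question-conditioned pruning under a fixed token budget, the sketch below scores each visual-grid token against a single cue vector and keeps a fixed number of them. It is not the F^3A router: the function name `prune_visual_tokens`, the cosine-similarity scoring, the `cell` and `coverage_frac` parameters, and the one-token-per-coarse-cell coverage rule are all assumptions made for this example, and the abstract's sparse sensing heads, local refinement, and recovery stages are not modeled.

```python
import numpy as np

def prune_visual_tokens(visual, cue, grid_hw, budget, cell=2, coverage_frac=0.25):
    """Return sorted indices of `budget` tokens kept from an (H*W, D) token grid.

    Hypothetical sketch: relevance is cosine similarity to one cue vector, and
    part of the budget is reserved for one token per coarse grid cell.
    """
    h, w = grid_hw
    n = h * w
    assert visual.shape[0] == n and 0 < budget <= n

    # Relevance score: cosine similarity between each token and the cue.
    v = visual / (np.linalg.norm(visual, axis=1, keepdims=True) + 1e-8)
    q = cue / (np.linalg.norm(cue) + 1e-8)
    score = v @ q

    # Assign every token to a coarse cell of size `cell` x `cell`.
    rows, cols = np.divmod(np.arange(n), w)
    cells_per_row = (w + cell - 1) // cell
    cell_id = (rows // cell) * cells_per_row + (cols // cell)

    # Token indices sorted from most to least relevant.
    order = np.argsort(-score, kind="stable")

    # Best-scoring token of each cell: first occurrence of the cell in `order`.
    _, first = np.unique(cell_id[order], return_index=True)
    cell_best = order[first]

    # Coverage share: one token each from the highest-scoring cells, so the
    # selection cannot collapse onto a single region.
    n_cover = int(budget * coverage_frac)
    ranked_cells = cell_best[np.argsort(-score[cell_best], kind="stable")]
    keep = set(ranked_cells[:n_cover].tolist())

    # Remaining share: fill with the globally most relevant tokens.
    for i in order:
        if len(keep) >= budget:
            break
        keep.add(int(i))

    # Sorted indices preserve the original raster order of the kept tokens.
    return np.array(sorted(keep))
```

Calling `prune_visual_tokens(visual, cue, (24, 24), budget=64)` on a 576-token grid would return 64 indices in raster order, which could then be used to gather the tokens passed to the language model. Setting `coverage_frac` closer to 1 trades relevance for spatial spread.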
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 89.1 | 2019 |
| Science Question Answering | ScienceQA | Accuracy | 94.97 | 791 |
| Multimodal Evaluation | MME | Score | 2.21e+3 | 727 |
| Diagram Question Answering | AI2D | AI2D Accuracy | 86.43 | 387 |
| Science Question Answering | ScienceQA IMG | Accuracy | 73.52 | 335 |
| Diagram Understanding | AI2D | Accuracy | 79.45 | 317 |
| Multimodal Perception and Cognition | MME | Overall Score | 2.49e+3 | 270 |
| Multimodal Understanding | MMBench CN | Accuracy | 84.71 | 254 |
| Multimodal Model Evaluation | MMBench | Accuracy | 81.16 | 204 |
| Real-world Visual Question Answering | RealworldQA | Accuracy | 62.61 | 173 |