OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning

About

Vision-language models (VLMs) rely on long visual token sequences for visual understanding, making the prefill stage expensive in both computation and memory. Most existing pruning methods follow an absolute-ranking paradigm, assigning importance scores to visual tokens and retaining a fixed top-K subset. In this work, we argue that this paradigm is fundamentally brittle: attention sinks distort token importance rankings, while image redundancy and query-dependent visual evidence make fixed token budgets unreliable across inputs. We propose OccamToken, a training-free framework that replaces absolute token ranking with register-anchored relative evidence testing. Instead of asking which tokens are globally important, OccamToken evaluates whether a visual token provides information beyond a register-based reference. Our key insight is that register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence. Based on this principle, OccamToken performs both image-adaptive redundancy pruning and query-adaptive relevance pruning through dynamic thresholds derived from register attention. Across LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL, OccamToken consistently improves the accuracy-efficiency trade-off without additional training. Notably, on LLaVA-NeXT, it reduces 2,880 visual tokens to approximately 40 while preserving over 93% of the original accuracy, enabling stable visual token compression even in the extreme 1.4% retention regime.
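The contrast between fixed top-K selection and register-anchored thresholding can be illustrated with a minimal sketch. The abstract does not specify the actual decision rule, so everything below is an assumption made for illustration: the function names `register_anchored_keep` and `retention_ratio`, the `scale` parameter, and the choice of the mean register attention as the threshold are all hypothetical and are not taken from the OccamToken implementation.

```python
from statistics import mean


def register_anchored_keep(visual_attn, register_attn, scale=1.0):
    """Return indices of visual tokens whose attention exceeds a
    register-derived threshold.

    Assumption (not from the paper): the threshold is
    scale * mean(register_attn). The point is only that the cut-off is
    relative to the register tokens, so the number of kept tokens
    varies per input instead of being a fixed top-K.
    """
    if not register_attn:
        raise ValueError("need at least one register token")
    threshold = scale * mean(register_attn)
    return [i for i, a in enumerate(visual_attn) if a > threshold]


def retention_ratio(kept, total):
    """Fraction of visual tokens retained after pruning."""
    return kept / total


# Toy attention values (made up for illustration).
register_attn = [0.05, 0.07]

# An image with two clearly informative tokens keeps two tokens.
informative = [0.02, 0.30, 0.01, 0.25, 0.03, 0.04]
print(register_anchored_keep(informative, register_attn))  # [1, 3]

# A redundant image where no token beats the register reference keeps none.
redundant = [0.05, 0.05, 0.05, 0.05]
print(register_anchored_keep(redundant, register_attn))  # []

# Retention figure quoted in the abstract: about 40 of 2,880 tokens.
print(f"{retention_ratio(40, 2880):.1%}")  # 1.4%
```

Under this toy rule the token budget is an outcome of the comparison rather than an input, which is the budget-adaptive behavior the abstract describes; the real method derives separate image-adaptive and query-adaptive thresholds, which this sketch does not attempt to reproduce.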

Geng Li, Guohao Chen, Ting Chen, Shilin Shan, Kuangji Zuo, Bofan Lyu, Tuo An, Gen Li, Jianfei Yang • 2026

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 86.8	2056
Visual Question Answering	VQA v2	Accuracy 80	1429
Science Question Answering	ScienceQA	Accuracy 81.4	916
Multimodal Evaluation	MME	Score 2.16e+3	902
Multimodal Understanding	MMBench	Accuracy 77.1	887
Multi-discipline Multimodal Understanding	MMMU	Accuracy 47.7	422
Multimodal Understanding	MMBench (MMB)	Accuracy 66.2	166
Science Question Answering	ScienceQA SQA-I	Accuracy 68.7	149
Multimodal Evaluation	MMBench	--	146
Multimodal Understanding	SEED-Bench Image	Accuracy 74.1	143

Showing 10 of 25 rows
