
OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning

About

Vision-language models (VLMs) rely on long visual token sequences for visual understanding, making the prefill stage expensive in both computation and memory. Most existing pruning methods follow an absolute-ranking paradigm, assigning importance scores to visual tokens and retaining a fixed top-K subset. In this work, we argue that this paradigm is fundamentally brittle: attention sinks distort token importance rankings, while image redundancy and query-dependent visual evidence make fixed token budgets unreliable across inputs. We propose OccamToken, a training-free framework that replaces absolute token ranking with register-anchored relative evidence testing. Instead of asking which tokens are globally important, OccamToken evaluates whether a visual token provides information beyond a register-based reference. Our key insight is that register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence. Based on this principle, OccamToken performs both image-adaptive redundancy pruning and query-adaptive relevance pruning through dynamic thresholds derived from register attention. Across LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL, OccamToken consistently improves the accuracy-efficiency trade-off without additional training. Notably, on LLaVA-NeXT, it reduces 2,880 visual tokens to approximately 40 while preserving over 93% of the original accuracy, enabling stable visual token compression even in the extreme 1.4% retention regime.
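The contrast between fixed top-K selection and register-anchored relative testing can be illustrated with a small sketch. This is not the authors' implementation: the function name `register_anchored_prune`, the choice of the mean register attention as the threshold, and the `scale` parameter are assumptions made for illustration, and the attention values are made up. It only shows how a threshold derived from register attention lets the number of retained tokens vary per input.

```python
def register_anchored_prune(visual_attn, register_attn, scale=1.0):
    """Keep visual tokens whose attention exceeds a register-derived threshold.

    Hypothetical sketch: the threshold is assumed to be the mean attention
    received by the register tokens, multiplied by `scale`.
    """
    if not register_attn:
        raise ValueError("need at least one register token")
    threshold = scale * sum(register_attn) / len(register_attn)
    # A token is kept only if it carries more attention than the reference.
    keep = [i for i, a in enumerate(visual_attn) if a > threshold]
    return keep, threshold


# Toy attention values (illustrative only, not taken from the paper).
register_attn = [0.04, 0.06]
redundant_image = [0.01, 0.20, 0.02, 0.15, 0.03]  # few informative tokens
detailed_image = [0.10, 0.20, 0.09, 0.15, 0.03]   # more informative tokens

keep_redundant, threshold = register_anchored_prune(redundant_image, register_attn)
keep_detailed, _ = register_anchored_prune(detailed_image, register_attn)

# The budget adapts to the input instead of being a fixed top-K.
print(len(keep_redundant), len(keep_detailed))

# Retention ratio quoted in the abstract: about 40 of 2,880 tokens.
retention_percent = round(100 * 40 / 2880, 1)
print(retention_percent)
```

With the same register reference, the two toy inputs retain different numbers of tokens, whereas a fixed top-K rule would keep the same count for both regardless of how much informative content each image has.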

Geng Li, Guohao Chen, Ting Chen, Shilin Shan, Kuangji Zuo, Bofan Lyu, Tuo An, Gen Li, Jianfei Yang • 2026

Related benchmarks

Task                                       Dataset            Metric    Result   Rank
Object Hallucination Evaluation            POPE               Accuracy  86.8     2019
Visual Question Answering                  VQA v2             Accuracy  80       1429
Multimodal Understanding                   MMBench            Accuracy  77.1     847
Science Question Answering                 ScienceQA          Accuracy  81.4     791
Multimodal Evaluation                      MME                Score     2.16e+3  727
Multi-discipline Multimodal Understanding  MMMU               Accuracy  47.7     363
Multimodal Understanding                   MMBench (MMB)      Accuracy  66.2     166
Multimodal Understanding                   SEED-Bench Image   Accuracy  74.1     143
Science Question Answering                 ScienceQA SQA-I    Accuracy  68.7     122
Multimodal Evaluation                      MMBench            --        --       118
Showing 10 of 25 rows
