ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning

About

Recent advancements in Vision-Language Models (VLMs) enable large language models (LLMs) to process high-resolution images, significantly improving real-world multimodal understanding. However, this capability introduces a large number of vision tokens, resulting in substantial computational overhead. To mitigate this issue, various vision token pruning methods have been proposed. Nevertheless, existing approaches predominantly rely on learned semantic features within the model to capture visual redundancy. Moreover, they lack adaptive mechanisms to adjust pruning strategies according to the complexity of the input image. In this paper, we propose ERASE, a two-stage vision token pruning framework that identifies and retains salient tokens through pruning strategies adaptive to image complexity. Experimental results demonstrate that ERASE significantly reduces vision tokens while preserving accuracy. For Qwen2.5-VL-7B, at a token pruning ratio of 85%, ERASE retains 89.46% of the original model accuracy, whereas the best prior method retains only 78.1%. Our code is available at https://github.com/Tuna-Luna/ERASE.
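The two headline quantities in the abstract, the token pruning ratio and the share of original accuracy retained, can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the ERASE code: the function names `tokens_kept` and `relative_accuracy` are hypothetical, and the token count and scores in the usage lines are made-up numbers, not results from the paper.

```python
def tokens_kept(num_tokens: int, pruning_ratio: float) -> int:
    """Number of vision tokens left after pruning the given fraction.

    Assumption: the pruning ratio is the fraction of tokens removed,
    so a ratio of 0.85 keeps roughly 15% of the tokens.
    """
    if not 0.0 <= pruning_ratio <= 1.0:
        raise ValueError("pruning_ratio must be in [0, 1]")
    return num_tokens - round(num_tokens * pruning_ratio)


def relative_accuracy(pruned_score: float, original_score: float) -> float:
    """Pruned-model score expressed as a percentage of the unpruned score."""
    if original_score <= 0:
        raise ValueError("original_score must be positive")
    return 100.0 * pruned_score / original_score


# Hypothetical numbers, chosen only to show the arithmetic.
kept = tokens_kept(1000, 0.85)            # 1000 vision tokens, 85% pruned
retained = relative_accuracy(40.0, 50.0)  # pruned score 40 vs. original 50
print(kept, retained)
```

Under these assumptions, a retention figure such as the abstract's 89.46% would be read as the pruned model's score divided by the unpruned model's score, presumably aggregated over the evaluated benchmarks.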

Yuna Lee, Kyoungho Min, Yulhwa Kim • 2026

Related benchmarks

Task	Dataset	Result
Visual Question Answering	TextVQA	Accuracy 82.12	1455
Visual Question Answering	ChartQA	Accuracy 81.44	620
Optical Character Recognition	OCRBench	Score 800	486
Visual Question Answering	InfoVQA	Accuracy 73.34	264
Visual Question Answering	TextVQA	TextVQA Accuracy 78.62	210
Visual Question Answering	DocVQA	Accuracy 91.32	205
Information Visual Question Answering	InfoVQA	Accuracy 82.45	159
Mathematical Visual Question Answering	MathVista	Accuracy 56.4	87
Document Visual Question Answering	DocVQA	Accuracy 95.44	54
Visual Question Answering	OCRBench	Score 784	53

Showing 10 of 25 rows
