
ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning

About

Recent advancements in Vision-Language Models (VLMs) enable large language models (LLMs) to process high-resolution images, significantly improving real-world multimodal understanding. However, this capability introduces a large number of vision tokens, resulting in substantial computational overhead. To mitigate this issue, various vision token pruning methods have been proposed. Nevertheless, existing approaches predominantly rely on learned semantic features within the model to capture visual redundancy. Moreover, they lack adaptive mechanisms to adjust pruning strategies according to the complexity of the input image. In this paper, we propose ERASE, a two-stage vision token pruning framework that identifies and retains salient tokens through pruning strategies adaptive to image complexity. Experimental results demonstrate that ERASE significantly reduces vision tokens while preserving accuracy. For Qwen2.5-VL-7B, at a token pruning ratio of 85%, ERASE retains 89.46% of the original model accuracy, whereas the best prior method retains only 78.1%. Our code is available at https://github.com/Tuna-Luna/ERASE.
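
To make the two headline quantities concrete, here is a minimal sketch of the arithmetic behind a "token pruning ratio" and "retained accuracy". It is not taken from the ERASE code base: the function names, the token count, and the scores are hypothetical, and it assumes retained accuracy is simply the pruned model's score divided by the unpruned model's score, which the abstract does not spell out.

```python
def tokens_kept(num_tokens: int, pruning_ratio: float) -> int:
    """Number of vision tokens left after pruning a given fraction."""
    if not 0.0 <= pruning_ratio <= 1.0:
        raise ValueError("pruning_ratio must be in [0, 1]")
    return num_tokens - round(num_tokens * pruning_ratio)


def retained_accuracy(pruned_score: float, baseline_score: float) -> float:
    """Pruned-model score as a percentage of the unpruned score (assumed definition)."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return 100.0 * pruned_score / baseline_score


# Hypothetical image that produces 2,000 vision tokens, pruned at 85%.
print(tokens_kept(2000, 0.85))        # 300
# Hypothetical scores: 70.0 after pruning vs. 80.0 without pruning.
print(retained_accuracy(70.0, 80.0))  # 87.5
```

Under these assumptions, an 85% pruning ratio leaves 15% of the vision tokens for the language model to process.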

Yuna Lee, Kyoungho Min, Yulhwa Kim • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy | 82.12 | 1453 |
| Visual Question Answering | ChartQA | Accuracy | 81.44 | 519 |
| Optical Character Recognition | OCRBench | Score | 800 | 433 |
| Visual Question Answering | TextVQA | TextVQA Accuracy | 78.62 | 210 |
| Visual Question Answering | DocVQA | Accuracy | 91.32 | 205 |
| Visual Question Answering | InfoVQA | Accuracy | 73.34 | 195 |
| Information Visual Question Answering | InfoVQA | Accuracy | 82.45 | 110 |
| Mathematical Visual Question Answering | MathVista | Accuracy | 56.4 | 87 |
| Visual Question Answering | OCRBench | Score | 784 | 53 |
| Visual Grounding | HRBench8K | Accuracy | 75.63 | 51 |
Showing 10 of 25 rows
