Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models

About

Token pruning has emerged as an effective approach to reduce the substantial computational overhead of Large Vision-Language Models (LVLMs) by discarding less informative visual tokens while preserving performance. However, existing methods typically rely on a single attention source taken from one LVLM component, resulting in incomplete and suboptimal pruning decisions due to biased attention distributions. To address this problem, we propose DeSAP, a novel Decoupled Similarity-Aware Pruning method for precise, task-aware token pruning within the visual encoder. Specifically, DeSAP introduces a decoupled similarity to capture fine-grained cross-modal relevance between visual features and text tokens, providing explicit task-related guidance for pruning. By integrating decoupled similarity with visual saliency signals derived from visual attention, DeSAP prunes tokens under the guidance of both task-related and visual cues, remaining robust even under aggressive pruning ratios. Extensive experiments across diverse benchmarks and architectures show that DeSAP consistently outperforms state-of-the-art (SOTA) methods in both accuracy and efficiency. On LLaVA-1.5-7B, DeSAP achieves a 10× reduction in FLOPs and a 2.3× prefill speedup by retaining only 11.1% of visual tokens, while maintaining 98.1% of the original performance.

Kexin Ma, Jing Xiao, Chaofeng Chen, Geyong Min, Guibo Zhu, Jinqiao Wang, Liang Liao • 2026
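
To make the pruning idea in the abstract concrete, here is a minimal sketch of score-fused token pruning. It is not the DeSAP algorithm: the abstract does not define the decoupled similarity, so plain cosine similarity between visual and text features stands in for it, and the function names, the `alpha` fusion weight, and the min-max normalisation are all illustrative assumptions. For scale, assuming LLaVA-1.5's usual 576 visual tokens per image (a 24×24 patch grid), retaining 11.1% corresponds to about 64 tokens.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def min_max(xs):
    """Rescale scores to [0, 1] so the two cues are comparable before fusion."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def prune_tokens(visual, text, saliency, keep_ratio, alpha=0.5):
    """Return the sorted indices of the visual tokens to keep.

    visual:     list of visual token feature vectors
    text:       list of text token feature vectors (same dimension)
    saliency:   one visual-saliency score per visual token (hypothetical input,
                e.g. attention received from a class token)
    keep_ratio: fraction of visual tokens to retain
    alpha:      hypothetical weight between task relevance and saliency
    """
    # Task relevance (stand-in for the paper's decoupled similarity):
    # best cosine match of each visual token against any text token.
    relevance = [max(cosine(v, t) for t in text) for v in visual]

    # Fuse the task-related cue with the visual cue.
    fused = [
        alpha * r + (1 - alpha) * s
        for r, s in zip(min_max(relevance), min_max(saliency))
    ]

    # Keep the top-k tokens by fused score, returned in original order.
    k = max(1, round(keep_ratio * len(visual)))
    top = sorted(range(len(visual)), key=lambda i: fused[i], reverse=True)[:k]
    return sorted(top)


# Toy example: four 2-D visual tokens, one text token, keep half.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
text = [[1.0, 0.0]]
saliency = [0.1, 0.9, 0.5, 0.2]
print(prune_tokens(visual, text, saliency, keep_ratio=0.5))  # → [1, 2]
```

In the toy example, token 0 matches the text best but has the lowest saliency, so the fused score favours tokens 1 and 2. This illustrates why combining two cues can give different decisions than either cue alone.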

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VizWiz | Accuracy 57.3 | 1525 |
| Object Hallucination Evaluation | POPE | Accuracy 87.5 | 1455 |
| Visual Question Answering | TextVQA | Accuracy 57.2 | 1285 |
| Visual Question Answering | GQA | Accuracy 62.5 | 1249 |
| Text-based Visual Question Answering | TextVQA | Accuracy 79.5 | 807 |
| Science Question Answering | ScienceQA | Accuracy 80.6 | 502 |
| Video Question Answering | MSRVTT-QA | Accuracy 57 | 491 |
| Video Question Answering | MSVD-QA | Accuracy 71.2 | 360 |
| Scientific Question Answering | ScienceQA image | Accuracy 69.3 | 184 |
| Science Question Answering | ScienceQA SQA-IMG | Accuracy 69.9 | 139 |
Showing 10 of 17 rows
