
CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

About

In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less informative visual tokens for image understanding tasks. However, these methods struggle with pixel grounding tasks, where token importance is highly contingent on the input text. Through an in-depth analysis of CLIP, we observe that visual tokens within referent regions often exhibit low similarity to their textual representation. Motivated by this insight, we introduce LiteLVLM, a training-free, text-guided token pruning strategy for efficient pixel grounding inference. By reversing the ranking of CLIP's visual-text similarity, LiteLVLM effectively retains visual tokens covering the referent regions, while recovering context tokens to enable clear foreground-background separation. Extensive experiments demonstrate that LiteLVLM significantly outperforms existing methods by over 5% across diverse token budgets. Without any training or fine-tuning, LiteLVLM maintains 90% of the original performance with a 22% speedup and a 2.3X memory reduction. Our code is available at https://github.com/sejong-rcv/LiteLVLM.
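The selection rule described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes the visual tokens and the text embedding already live in a shared CLIP embedding space, and the function name `reverse_similarity_prune` and its parameters `keep` and `context` are hypothetical. The abstract does not say how context tokens are recovered, so the even sampling of the remaining tokens used here is purely an assumption.

```python
import numpy as np


def reverse_similarity_prune(visual_tokens, text_embedding, keep, context):
    """Return sorted indices of visual tokens to retain.

    visual_tokens:  (N, D) array of visual token embeddings (assumed CLIP space).
    text_embedding: (D,) array for the referring expression (assumed CLIP space).
    keep:           number of LOWEST-similarity tokens to retain (reversed ranking).
    context:        number of extra tokens recovered from the rest (assumed rule).
    """
    # Cosine similarity between every visual token and the text embedding.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    sim = v @ t

    # Reversed ranking: ascending order puts the least similar tokens first.
    order = np.argsort(sim, kind="stable")
    selected = order[:keep]

    # Assumption: recover context by sampling the remaining tokens evenly
    # in their original (spatial) order.
    rest = np.sort(order[keep:])
    if context > 0 and rest.size > 0:
        num = min(context, rest.size)
        picks = np.linspace(0, rest.size - 1, num=num).round().astype(int)
        ctx = rest[np.unique(picks)]
    else:
        ctx = np.empty(0, dtype=int)

    return np.sort(np.concatenate([selected, ctx]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(16, 8))  # 16 toy visual tokens, 8-dim
    text = rng.normal(size=8)          # toy text embedding
    print(reverse_similarity_prune(tokens, text, keep=4, context=2))
```

Returning indices rather than the pruned tokens keeps the sketch independent of any particular model: the caller can use them to gather the retained tokens before passing them to the language model.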

Sangin Lee, Yukyung Choi • 2026

Related benchmarks

Task | Dataset | Result | Rank
Referring Video Object Segmentation | Ref-DAVIS 2017 (val) | J&F 69.2 | 230
Referring Video Object Segmentation | MeViS (val) | J&F Score 0.4508 | 166
Vision-Language Understanding and Reasoning | LLaVA Multimodal Evaluation Suite (GQA, MMBench, MME, POPE, ScienceQA, VQAv2, TextVQA, SEED-Bench, MM-Vet, VizWiz) 1.5 (test/val) | GQA 58.4 | 41
Referring Video Object Segmentation | Refer-Youtube-VOS | J&F Score 66.5 | 23
