
Similarity-Aware Token Pruning: Your VLM but Faster

About

The computational demands of Vision Transformers (ViTs) and Vision-Language Models (VLMs) remain a significant challenge due to the quadratic complexity of self-attention. While token pruning offers a promising solution, existing methods often introduce training overhead or fail to adapt dynamically across layers. We present SAINT, a training-free token pruning framework that leverages token similarity and a graph-based formulation to dynamically optimize pruning rates and redundancy thresholds. Through systematic analysis, we identify a universal three-stage token evolution process (aligner-explorer-aggregator) in transformers, enabling aggressive pruning in early stages without sacrificing critical information. For ViTs, SAINT doubles the throughput of ViT-H/14 at 224px with only 0.6% accuracy loss on ImageNet-1K, surpassing the closest competitor by 0.8%. For VLMs, we apply SAINT in three modes: ViT-only, LLM-only, and hybrid. SAINT reduces LLaVA-13B's tokens by 75%, achieving latency comparable to LLaVA-7B with less than 1% performance loss across benchmarks. Our work establishes a unified, practical framework for efficient inference in ViTs and VLMs.
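To make the idea of similarity-based pruning concrete, here is a minimal sketch in NumPy. It is not SAINT's algorithm: it assumes a fixed cosine-similarity threshold and a simple greedy rule, whereas the abstract describes thresholds and pruning rates that are chosen dynamically per layer. The function name `prune_similar_tokens` and the parameters `threshold` and `max_prune_ratio` are hypothetical, introduced only for illustration.

```python
import numpy as np


def prune_similar_tokens(tokens, threshold=0.9, max_prune_ratio=0.75):
    """Greedy redundancy pruning on a token-similarity graph (illustrative only).

    tokens: array of shape (n_tokens, dim).
    Returns the sorted indices of the tokens that are kept.
    """
    n = tokens.shape[0]
    # Cosine similarity between every pair of tokens.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    # Graph view: an edge joins two tokens whose similarity exceeds the threshold.
    adj = sim > threshold
    np.fill_diagonal(adj, False)

    budget = int(max_prune_ratio * n)  # assumed upper bound on pruned tokens
    keep = np.ones(n, dtype=bool)
    for _ in range(budget):
        degree = adj.sum(axis=1)
        worst = int(np.argmax(degree))  # token with the most near-duplicates
        if degree[worst] == 0:
            break  # no redundant pairs remain
        keep[worst] = False
        # Remove the pruned token's edges so its neighbours are re-scored.
        adj[worst, :] = False
        adj[:, worst] = False
    return np.flatnonzero(keep)


# Toy input: tokens 0, 1 and 3 point in nearly the same direction; token 2 is distinct.
toy_tokens = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.99, 0.02]])
kept = prune_similar_tokens(toy_tokens)
print(kept.tolist())  # [2, 3]
```

Recomputing degrees after each removal means one representative of every group of near-duplicates survives, so the distinct token and one member of the redundant cluster are retained in the toy example.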

Ahmadreza Jeddi, Negin Baghbanzadeh, Elham Dolatabadi, Babak Taati • 2025

Related benchmarks

Task                                               Dataset    Result          Rank
Visual Question Answering                          TextVQA    Accuracy 60.6   1453
Massive Multi-discipline Multimodal Understanding  MMMU       Accuracy 42.28  216
Document Visual Question Answering                 DocVQA     Accuracy 60.44  203
Multimodal Evaluation                              MMStar     Accuracy 62.38  139
Mathematical Visual Question Answering             MathVista  Accuracy 60.5   87
Instruction Following                              ALFRED     Accuracy 16.15  57
Multimodal Conversational Question Answering       MMCoQA     ROUGE-L 32      21
Multimodal Perception                              BLINK      Accuracy 63.82  21
