Similarity-Aware Token Pruning: Your VLM but Faster

About

The computational demands of Vision Transformers (ViTs) and Vision-Language Models (VLMs) remain a significant challenge due to the quadratic complexity of self-attention. While token pruning offers a promising solution, existing methods often introduce training overhead or fail to adapt dynamically across layers. We present SAINT, a training-free token pruning framework that leverages token similarity and a graph-based formulation to dynamically optimize pruning rates and redundancy thresholds. Through systematic analysis, we identify a universal three-stage token evolution process (aligner-explorer-aggregator) in transformers, enabling aggressive pruning in early stages without sacrificing critical information. For ViTs, SAINT doubles the throughput of ViT-H/14 at 224px with only 0.6% accuracy loss on ImageNet-1K, surpassing the closest competitor by 0.8%. For VLMs, we apply SAINT in three modes: ViT-only, LLM-only, and hybrid. SAINT reduces LLaVA-13B's tokens by 75%, achieving latency comparable to LLaVA-7B with less than 1% performance loss across benchmarks. Our work establishes a unified, practical framework for efficient inference in ViTs and VLMs.
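The abstract does not spell out SAINT's algorithm, so the following is only a minimal sketch of the general idea behind training-free, similarity-based pruning, not the authors' method. It builds a graph whose edges join token pairs with cosine similarity above a threshold, then keeps one representative per group of mutually redundant tokens. The function name `prune_by_similarity`, the threshold `tau`, and the greedy selection rule are all assumptions introduced for illustration.

```python
import numpy as np

def prune_by_similarity(tokens: np.ndarray, tau: float = 0.9) -> np.ndarray:
    """Return the sorted indices of tokens to keep.

    tokens: (n, d) array of token embeddings.
    tau:    hypothetical cosine-similarity threshold; pairs above it
            are treated as redundant (an assumption, not from the paper).
    """
    # Normalize each token so that dot products are cosine similarities.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    # Similarity graph: an edge joins two distinct tokens that are near-duplicates.
    adj = sim > tau
    np.fill_diagonal(adj, False)

    # Greedy pass in index order: keep a token, then drop all of its neighbors.
    keep = np.ones(len(tokens), dtype=bool)
    for i in range(len(tokens)):
        if keep[i]:
            keep[adj[i]] = False
    return np.flatnonzero(keep)


if __name__ == "__main__":
    toks = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [0.01, 1.0]])
    print(prune_by_similarity(toks, tau=0.9))
```

A real pruner would pick `tau` and the pruning rate per layer rather than fixing them, which is the part the abstract attributes to SAINT's dynamic, graph-based formulation. Since the attention matrix has one entry per token pair, keeping a quarter of the tokens shrinks it to roughly one sixteenth of its original size.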

Ahmadreza Jeddi, Negin Baghbanzadeh, Elham Dolatabadi, Babak Taati • 2025

Related benchmarks

Task	Dataset	Metric	Result
Visual Question Answering	TextVQA	Accuracy	60.6	1455
Massive Multi-discipline Multimodal Understanding	MMMU	Accuracy	42.28	249
Document Visual Question Answering	DocVQA	Accuracy	60.44	203
Multimodal Evaluation	MMStar	Accuracy	62.38	177
Mathematical Visual Question Answering	MathVista	Accuracy	60.5	87
Instruction Following	ALFRED	Accuracy	16.15	57
Multimodal Conversational Question Answering	MMCoQA	ROUGE-L	32	21
Multimodal Perception	BLINK	Accuracy	63.82	21
