Similarity-Aware Token Pruning: Your VLM but Faster
About
The computational demands of Vision Transformers (ViTs) and Vision-Language Models (VLMs) remain a significant challenge due to the quadratic complexity of self-attention. While token pruning offers a promising solution, existing methods often introduce training overhead or fail to adapt dynamically across layers. We present SAINT, a training-free token pruning framework that leverages token similarity and a graph-based formulation to dynamically optimize pruning rates and redundancy thresholds. Through systematic analysis, we identify a universal three-stage token evolution process (aligner-explorer-aggregator) in transformers, enabling aggressive pruning in early stages without sacrificing critical information. For ViTs, SAINT doubles the throughput of ViT-H/14 at 224px with only 0.6% accuracy loss on ImageNet-1K, surpassing the closest competitor by 0.8%. For VLMs, we apply SAINT in three modes: ViT-only, LLM-only, and hybrid. SAINT reduces LLaVA-13B's tokens by 75%, achieving latency comparable to LLaVA-7B with less than 1% performance loss across benchmarks. Our work establishes a unified, practical framework for efficient inference in ViTs and VLMs.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy60.6 | 1453 | |
| Massive Multi-discipline Multimodal Understanding | MMMU | Accuracy42.28 | 216 | |
| Document Visual Question Answering | DocVQA | Accuracy60.44 | 203 | |
| Multimodal Evaluation | MMStar | Accuracy62.38 | 139 | |
| Mathematical Visual Question Answering | MathVista | Accuracy60.5 | 87 | |
| Instruction Following | ALFRED | Accuracy16.15 | 57 | |
| Multimodal Conversational Question Answering | MMCoQA | ROUGE-L32 | 21 | |
| Multimodal Perception | BLINK | Accuracy63.82 | 21 |