Tango: Taming Visual Signals for Efficient Video Large Language Models
About
Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances the two predominant token-pruning paradigms: attention-based selection and similarity-based clustering. Our study reveals two critical limitations in existing methods: (1) conventional top-k selection strategies fail to fully account for the attention distribution, which is often spatially multi-modal and long-tailed in magnitude; and (2) direct similarity-based clustering frequently generates fragmented clusters, resulting in distorted representations after pooling. To address these bottlenecks, we propose Tango, a novel framework designed to optimize the utilization of visual signals. Tango integrates a diversity-driven strategy to enhance attention-based token selection, and introduces Spatio-temporal Rotary Position Embedding (ST-RoPE) to preserve geometric structure via locality priors. Comprehensive experiments across various Video LLMs and video understanding benchmarks demonstrate the effectiveness and generalizability of our approach. Notably, when retaining only 10% of the video tokens, Tango preserves 98.9% of the original performance on LLaVA-OV while delivering a 1.88$\times$ inference speedup.
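The abstract contrasts plain top-k attention selection with a diversity-driven strategy. The paper's exact algorithm is not given here; as a hedged illustration, a common way to trade attention magnitude against redundancy is a greedy MMR-style selection, sketched below (function name, `lam` weight, and cosine-similarity redundancy term are all assumptions for illustration, not Tango's actual method):

```python
import numpy as np

def diversity_aware_select(features, attn_scores, k, lam=0.5):
    """Greedy MMR-style token selection (illustrative sketch, NOT the
    actual Tango algorithm): balance a token's attention score against
    its similarity to tokens already selected, so that a spatially
    multi-modal attention map is not collapsed onto one dominant mode.

    features:    (N, D) visual token embeddings
    attn_scores: (N,)   attention weights from the LLM / vision encoder
    k:           number of tokens to keep
    lam:         trade-off between score (1.0) and diversity (0.0)
    """
    # L2-normalize so dot products below are cosine similarities
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    selected = [int(np.argmax(attn_scores))]          # seed with the top token
    candidates = set(range(len(attn_scores))) - set(selected)
    while len(selected) < k and candidates:
        cand = np.array(sorted(candidates))
        # redundancy = max cosine similarity to any already-selected token
        redundancy = (f[cand] @ f[selected].T).max(axis=1)
        mmr = lam * attn_scores[cand] - (1 - lam) * redundancy
        best = int(cand[np.argmax(mmr)])
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` this degenerates to ordinary top-k on attention scores; lowering `lam` spreads the kept tokens across distinct regions of feature space, which is the failure mode of top-k that the abstract identifies.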
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | MVBench | Accuracy | 62.4 | 425 |
| Video Understanding | LongVideoBench | -- | -- | 92 |
| Video Understanding | Aggregate (MVBench, LongVideoBench, MLVU, Video-MME) | Average Score | 97.6 | 59 |
| Video Understanding | Video-MME | Performance (Short) | 71.3 | 20 |
| Video Understanding | MLVU | Accuracy | 64.1 | 20 |
| Video Understanding | Video-MME w/o sub | Score (Short) | 75.3 | 17 |
| Video Understanding | MLVU | Accuracy | 70.4 | 17 |