
FastVID: Dynamic Density Pruning for Fast Video Large Language Models

About

Video Large Language Models have demonstrated strong video understanding capabilities, yet their practical deployment is hindered by substantial inference costs caused by redundant video tokens. Existing pruning techniques fail to effectively exploit the spatiotemporal redundancy present in video data. To bridge this gap, we perform a systematic analysis of video redundancy from two perspectives: temporal context and visual context. Leveraging these insights, we propose Dynamic Density Pruning for Fast Video LLMs, termed FastVID. Specifically, FastVID dynamically partitions videos into temporally ordered segments to preserve temporal structure and applies a density-based token pruning strategy to maintain essential spatial and temporal information. Our method significantly reduces computational overhead while maintaining temporal and visual integrity. Extensive evaluations show that FastVID achieves state-of-the-art performance across various short- and long-video benchmarks on leading Video LLMs, including LLaVA-OneVision, LLaVA-Video, Qwen2-VL, and Qwen2.5-VL. Notably, on LLaVA-OneVision-7B, FastVID effectively prunes 90.3% of video tokens, reduces FLOPs to 8.3%, and accelerates the LLM prefill stage by 7.1×, while maintaining 98.0% of the original accuracy. The code is available at https://github.com/LunarShen/FastVID.
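To make the two ideas in the abstract concrete, here is a minimal NumPy sketch of (1) splitting a video into temporally ordered segments where adjacent frames stop looking alike and (2) keeping only the tokens with the highest density-based scores inside each segment. This is not the authors' implementation. The function names, the cosine-similarity threshold, the kNN density estimate, and the `keep_ratio` parameter are all assumptions chosen for illustration; the released code at the URL above is the reference for the actual method.

```python
import numpy as np


def segment_frames(frame_feats, threshold=0.9):
    """Split frames into ordered segments at low adjacent-frame similarity.

    frame_feats: (T, D) array with one global feature per frame (assumed input).
    Returns a list of (start, end) index pairs, end exclusive.
    """
    norms = np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8
    normed = frame_feats / norms
    # Cosine similarity between each frame and the next one.
    sims = np.sum(normed[:-1] * normed[1:], axis=1)
    # A new segment starts right after every low-similarity transition.
    boundaries = np.flatnonzero(sims < threshold) + 1
    starts = np.concatenate(([0], boundaries))
    ends = np.concatenate((boundaries, [len(frame_feats)]))
    return list(zip(starts.tolist(), ends.tolist()))


def density_scores(tokens, k=4):
    """Score tokens by local density times distance to a denser token.

    tokens: (N, D) array with N >= 2. This follows the generic
    density-peaks idea, used here as a hypothetical stand-in.
    """
    diffs = tokens[:, None, :] - tokens[None, :, :]
    dist2 = np.sum(diffs ** 2, axis=-1)
    k = min(k, len(tokens) - 1)
    # Mean squared distance to the k nearest neighbours (column 0 is self).
    knn = np.sort(dist2, axis=1)[:, 1:k + 1]
    density = np.exp(-knn.mean(axis=1))
    # Distance from each token to its nearest token of higher density.
    higher = density[None, :] > density[:, None]
    delta = np.where(higher, dist2, np.inf).min(axis=1)
    # The densest token(s) have no denser neighbour; use the max distance.
    delta[np.isinf(delta)] = dist2.max()
    return density * delta


def prune_segment(tokens, keep_ratio=0.1, k=4):
    """Return sorted indices of the tokens kept from one segment."""
    n = len(tokens)
    keep = max(1, int(round(n * keep_ratio)))
    if n <= keep:
        return np.arange(n)
    top = np.argsort(density_scores(tokens, k))[-keep:]
    return np.sort(top)


def prune_video(frame_feats, frame_tokens, threshold=0.9, keep_ratio=0.1):
    """Prune tokens segment by segment, preserving segment order.

    frame_tokens: (T, P, D) array with P tokens per frame (assumed layout).
    """
    kept = []
    for start, end in segment_frames(frame_feats, threshold):
        seg = frame_tokens[start:end].reshape(-1, frame_tokens.shape[-1])
        kept.append(seg[prune_segment(seg, keep_ratio)])
    return np.concatenate(kept, axis=0)
```

Pruning within each segment, rather than over the whole video at once, is what keeps every temporal segment represented in the output; a global top-k could discard an entire short scene.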

Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, Guiguang Ding • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Understanding | MVBench | -- | 247 |
| Video Understanding | VideoMME | Score (Short): 74.7 | 127 |
| Long Video Understanding | LongVideoBench | Score: 58.2 | 110 |
| Video Understanding | LongVideoBench | LongVideoBench Score: 57.8 | 79 |
| Video Understanding | EgoSchema | -- | 49 |
| Video Understanding | VideoMME, EgoSchema, LongVideoBench, MVBench | Avg. Score: 60 | 48 |
| Multi-modal Video Understanding | MVBench | Score: 65.5 | 39 |
| Egocentric Video Understanding | EgoSchema | Subset Accuracy: 61.2 | 39 |
| Video Understanding | LLaVA-Video Benchmark Suite Aggregate | Score: 59.2 | 9 |
