
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

About

Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal, or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which adaptively selects a subset of relevant video cubes for each query token to attend to, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources according to input complexity. Experiments demonstrate that AdaSpark reduces FLOPs by up to 57% while maintaining performance comparable to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.
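The Top-p selection described above can be sketched in a few lines. The idea, as commonly implemented for nucleus-style selection, is to normalize relevance scores with a softmax and keep the smallest set of items whose cumulative probability mass reaches a threshold p, so peaked (simple) inputs keep few cubes while flat (ambiguous) inputs keep many. The function below is a minimal illustration, not the paper's actual implementation; the name `top_p_select` and the use of plain Python lists are assumptions for clarity.

```python
import math

def top_p_select(scores, p=0.9):
    """Adaptive Top-p selection over cube relevance scores (illustrative sketch,
    not AdaSpark's released code). Softmax-normalizes the scores, then keeps the
    smallest set of items whose cumulative probability reaches p. Returns a
    boolean keep-mask aligned with the input scores."""
    # Numerically stable softmax over the raw relevance scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Visit items in descending probability, accumulating mass until >= p.
    order = sorted(range(len(scores)), key=lambda i: -probs[i])
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)          # always keep at least one item
        mass += probs[i]
        if mass >= p:
            break
    return [i in keep for i in range(len(scores))]
```

With one dominant score the mask keeps a single cube, while a flat score distribution forces nearly all cubes to be kept, which is exactly the input-dependent compute allocation the abstract describes.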

Handong Li, Zikang Liu, Longteng Guo, Tongtian Yue, Yepeng Tang, Xinxin Zhu, Chuanyang Zheng, Ziming Wang, Zhibin Wang, Jun Song, Cheng Yu, Bo Zheng, Jing Liu• 2026

Related benchmarks

Task                                      Dataset      Metric          Result  Rank
Long Video Understanding                  MLVU (dev)   Score           69.8    63
Long Video Understanding                  VideoMME     --              --      40
Long-form Video Understanding             LVBench      Overall Score   47.9    35
Short Video Understanding                 MVBench      Accuracy        70.3    28
Long-form Video-Language Understanding    LongVideo    Score           62.1    19
Video Grounding                           CharadesSTA  Accuracy        55.3    19
Extra Long Video Retrieval                VideoNIAH    Accuracy        97.5    17
Spatial Perception Video Understanding    VSIBench     Overall Score   39.8    14
