AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
About
Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Existing efficiency methods often compromise fine-grained perception by irreversibly discarding information, or inhibit long-range temporal modeling through rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which adaptively selects a subset of relevant video cubes for each query token to attend to, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates compute according to input complexity. Experiments on challenging hour-scale video benchmarks demonstrate that AdaSpark reduces FLOPs by up to 57% while matching the performance of dense models and preserving fine-grained, long-range dependencies.
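The paper does not specify how the entropy-based (Top-p) selection is implemented, but the idea (keep the smallest set of candidates whose normalized relevance mass reaches a threshold p, so peaked score distributions select few cubes and flat ones select many) can be sketched as follows. The function name, the use of softmax-normalized scores, and the default p=0.9 are illustrative assumptions, not details from the paper.

```python
import numpy as np

def top_p_select(scores: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Hypothetical sketch of Top-p adaptive selection.

    Keeps the smallest prefix of candidates (sorted by relevance)
    whose softmax probability mass reaches p. The number of kept
    items thus adapts to how concentrated the scores are.
    """
    # Numerically stable softmax over relevance scores.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Sort candidates by descending probability.
    order = np.argsort(-probs)
    csum = np.cumsum(probs[order])
    # Smallest prefix whose cumulative mass reaches p.
    k = int(np.searchsorted(csum, p)) + 1
    return np.sort(order[:k])

# A peaked distribution keeps one cube; a flat one keeps all four.
peaked = top_p_select(np.array([10.0, 0.0, 0.0, 0.0]), p=0.9)
flat = top_p_select(np.array([1.0, 1.0, 1.0, 1.0]), p=0.9)
```

In an AdaS-Attn-style layer, such a rule would be applied per query token over cube relevance scores, so the attended cube count grows with scene complexity rather than being fixed in advance.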
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Long Video Understanding | MLVU (dev) | Score: 69.8 | 63 |
| Long Video Understanding | VideoMME | -- | 40 |
| Long-form Video Understanding | LVBench | Overall Score: 47.9 | 35 |
| Short Video Understanding | MVBench | Accuracy: 70.3 | 28 |
| Long-form Video-Language Understanding | LongVideo | Score: 62.1 | 19 |
| Video Grounding | CharadesSTA | Accuracy: 55.3 | 19 |
| Extra Long Video Retrieval | VideoNIAH | Accuracy: 97.5 | 17 |
| Spatial Perception Video Understanding | VSIBench | Overall Score: 39.8 | 14 |