AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
About
Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Existing efficiency methods often compromise fine-grained perception by irreversibly discarding information, or inhibit long-range temporal modeling through rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which adaptively selects a subset of relevant video cubes for each query token to attend to, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates compute according to input complexity. Experiments on challenging hour-scale video benchmarks demonstrate that AdaSpark reduces FLOPs by up to 57% while matching the performance of dense models and preserving fine-grained, long-range dependencies.
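The paper does not specify how the entropy-based (Top-p) selection is implemented, but the idea (keep the smallest set of candidates whose normalized relevance mass reaches a threshold p, so peaked score distributions select few cubes and flat ones select many) can be sketched as follows. The function name, the use of softmax-normalized scores, and the default p=0.9 are illustrative assumptions, not details from the paper.

```python
import numpy as np

def top_p_select(scores: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Hypothetical sketch of Top-p adaptive selection.

    Keeps the smallest prefix of candidates (sorted by relevance)
    whose softmax probability mass reaches p. The number of kept
    items thus adapts to how concentrated the scores are.
    """
    # Numerically stable softmax over relevance scores.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Sort candidates by descending probability.
    order = np.argsort(-probs)
    csum = np.cumsum(probs[order])
    # Smallest prefix whose cumulative mass reaches p.
    k = int(np.searchsorted(csum, p)) + 1
    return np.sort(order[:k])

# A peaked distribution keeps one cube; a flat one keeps all four.
peaked = top_p_select(np.array([10.0, 0.0, 0.0, 0.0]), p=0.9)
flat = top_p_select(np.array([1.0, 1.0, 1.0, 1.0]), p=0.9)
```

In an AdaS-Attn-style layer, such a rule would be applied per query token over cube relevance scores, so the attended cube count grows with scene complexity rather than being fixed in advance.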
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Long Video Understanding | MLVU (dev) | Score: 69.8 | 63 |
| Long Video Understanding | VideoMME | -- | 40 |
| Long-form Video Understanding | LVBench | Overall Score: 47.9 | 35 |
| Short Video Understanding | MVBench | Accuracy: 70.3 | 28 |
| Long-form Video-Language Understanding | LongVideo | Score: 62.1 | 19 |
| Video Grounding | CharadesSTA | Accuracy: 55.3 | 19 |
| Extra Long Video Retrieval | VideoNIAH | Accuracy: 97.5 | 17 |
| Spatial Perception Video Understanding | VSIBench | Overall Score: 39.8 | 14 |