
VideoNSA: Native Sparse Attention Scales Video Understanding

About

Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We take a hardware-aware, hybrid approach to attention, preserving dense attention for text while applying NSA to video tokens. Compared with token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) learnable, combined sparse attention that helps induce dynamic attention sinks. Project Page: https://enxinsong.com/VideoNSA-web/, Code: https://github.com/Espere-1119-Song/VideoNSA
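
The hybrid routing idea, dense attention over text tokens and a sparse pattern over video tokens, can be illustrated with a small attention-mask sketch. The snippet below is not the paper's implementation: VideoNSA uses NSA's compressed, selected, and sliding-window branches with hardware-aware kernels, whereas this toy version emulates the sparse branch with a sliding window plus strided "global" keys applied through a dense mask. The function names (`hybrid_attention_mask`, `attend`) and the `window` and `stride` parameters are illustrative assumptions, not names from the code release.

```python
import torch
import torch.nn.functional as F

def hybrid_attention_mask(is_video: torch.Tensor, window: int = 8, stride: int = 4) -> torch.Tensor:
    """Build a [T, T] boolean mask: dense causal rows for text queries,
    sparse rows (local window + strided "global" keys) for video queries.

    is_video: [T] bool, True where the token comes from the video stream.
    This only emulates the routing; the real VideoNSA relies on NSA's
    compressed/selected/sliding branches with dedicated kernels.
    """
    T = is_video.shape[0]
    idx = torch.arange(T)
    # Dense (causal) pattern: every query may attend to all previous tokens.
    causal = idx[:, None] >= idx[None, :]
    # Sparse pattern for video queries: a local sliding window ...
    local = (idx[:, None] - idx[None, :]).abs() < window
    # ... plus a coarse set of strided keys standing in for a global branch.
    global_keys = (idx[None, :] % stride == 0).expand(T, T)
    sparse = (local | global_keys) & causal
    # Route: text queries keep dense attention, video queries use the sparse pattern.
    return torch.where(is_video[:, None], sparse, causal)

def attend(q, k, v, mask):
    # Standard scaled dot-product attention with the boolean mask applied.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 4 text tokens followed by 12 video tokens, single head.
T, d = 16, 32
x = torch.randn(T, d)
is_video = torch.tensor([False] * 4 + [True] * 12)
out = attend(x, x, x, hybrid_attention_mask(is_video))
print(out.shape)  # torch.Size([16, 32])
```

The point of the routing is that only the video rows of the mask become sparse, so text tokens keep full attention while the much longer video token stream provides the savings that allow scaling toward 128K-token contexts.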

Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, Zhuowen Tu • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Long Video Question Answering | LongVideoBench (val) | Accuracy | 60 | 36
Long Video Understanding | MLVU (3-120 min) | Accuracy | 51.8 | 23
Long Video Understanding | LongVideoBench (23 sec-60 min) | Accuracy | 60 | 19
Long Video Understanding | LongTimeScope (300+ min) | Accuracy | 44.4 | 8
