
VideoNSA: Native Sparse Attention Scales Video Understanding

About

Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We take a hardware-aware, hybrid approach to attention, preserving dense attention for text while applying NSA to video tokens. Compared with token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) learnable, combined sparse attention that helps induce dynamic attention sinks. Project Page: https://enxinsong.com/VideoNSA-web/, Code: https://github.com/Espere-1119-Song/VideoNSA
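
The hybrid routing idea, dense attention over text tokens and a sparse pattern over video tokens, can be illustrated with a small attention-mask sketch. The snippet below is not the paper's implementation: VideoNSA uses NSA's compressed, selected, and sliding-window branches with hardware-aware kernels, whereas this toy version emulates the sparse branch with a sliding window plus strided "global" keys applied through a dense mask. The function names (`hybrid_attention_mask`, `attend`) and the `window` and `stride` parameters are illustrative assumptions, not names from the code release.

```python
import torch
import torch.nn.functional as F

def hybrid_attention_mask(is_video: torch.Tensor, window: int = 8, stride: int = 4) -> torch.Tensor:
    """Build a [T, T] boolean mask: dense causal rows for text queries,
    sparse rows (local window + strided "global" keys) for video queries.

    is_video: [T] bool, True where the token comes from the video stream.
    This only emulates the routing; the real VideoNSA relies on NSA's
    compressed/selected/sliding branches with dedicated kernels.
    """
    T = is_video.shape[0]
    idx = torch.arange(T)
    # Dense (causal) pattern: every query may attend to all previous tokens.
    causal = idx[:, None] >= idx[None, :]
    # Sparse pattern for video queries: a local sliding window ...
    local = (idx[:, None] - idx[None, :]).abs() < window
    # ... plus a coarse set of strided keys standing in for a global branch.
    global_keys = (idx[None, :] % stride == 0).expand(T, T)
    sparse = (local | global_keys) & causal
    # Route: text queries keep dense attention, video queries use the sparse pattern.
    return torch.where(is_video[:, None], sparse, causal)

def attend(q, k, v, mask):
    # Standard scaled dot-product attention with the boolean mask applied.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 4 text tokens followed by 12 video tokens, single head.
T, d = 16, 32
x = torch.randn(T, d)
is_video = torch.tensor([False] * 4 + [True] * 12)
out = attend(x, x, x, hybrid_attention_mask(is_video))
print(out.shape)  # torch.Size([16, 32])
```

The point of the routing is that only the video rows of the mask become sparse, so text tokens keep full attention while the much longer video token stream provides the savings that allow scaling toward 128K-token contexts.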

Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, Zhuowen Tu • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Long Video Question Answering | LongVideoBench (val) | Accuracy | 60 | 36
Long Video Understanding | MLVU (3-120 min) | Accuracy | 51.8 | 23
Long Video Understanding | LongVideoBench (23 sec-60 min) | Accuracy | 60 | 19
Long Video Understanding | LongTimeScope (300+ min) | Accuracy | 44.4 | 8
