VSA: Faster Video Diffusion with Trainable Sparse Attention

About

Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight critical tokens; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout that ensures hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPs by 2.53× with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6× and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models. Code will be available at https://github.com/hao-ai-lab/FastVideo.
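
The coarse-to-fine idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' kernel: it assumes a single head, 1D tiles of consecutive tokens (rather than 3D video tiles), mean pooling for the coarse stage, and a fixed number of selected key tiles per query tile. The names `coarse_to_fine_attention`, `tile`, and `top_k` are hypothetical, and anything the abstract does not describe (such as how the two stages' outputs are combined, or the fused GPU implementation) is left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def coarse_to_fine_attention(q, k, v, tile=4, top_k=2):
    """Illustrative tile-sparse attention (assumed design, not the paper's kernel).

    q, k, v: arrays of shape (n, d), with n divisible by `tile`.
    Returns the output (n, d) and the selected key-tile indices (n_tiles, top_k).
    """
    n, d = q.shape
    n_tiles = n // tile
    scale = 1.0 / np.sqrt(d)

    # Coarse stage: mean-pool each tile of queries and keys, then score tiles.
    q_coarse = q.reshape(n_tiles, tile, d).mean(axis=1)
    k_coarse = k.reshape(n_tiles, tile, d).mean(axis=1)
    coarse_scores = softmax(q_coarse @ k_coarse.T * scale)  # (n_tiles, n_tiles)

    # Keep the top_k highest-weight key tiles for every query tile.
    selected = np.argsort(-coarse_scores, axis=-1)[:, :top_k]

    # Fine stage: token-level attention restricted to the selected tiles.
    out = np.zeros_like(v)
    for qi in range(n_tiles):
        rows = slice(qi * tile, (qi + 1) * tile)
        cols = np.concatenate(
            [np.arange(t * tile, (t + 1) * tile) for t in selected[qi]]
        )
        weights = softmax(q[rows] @ k[cols].T * scale)
        out[rows] = weights @ v[cols]
    return out, selected

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
out, selected = coarse_to_fine_attention(q, k, v, tile=4, top_k=2)
```

In this sketch each query tile attends to `top_k * tile` keys instead of all `n`, so the fine stage touches a `top_k / n_tiles` fraction of the full score matrix. Setting `top_k` equal to the number of tiles recovers dense attention, which is a convenient sanity check.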

Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, Hao Zhang • 2025

Related benchmarks

Task	Dataset	Result
Video Generation	VBench 2.0 (test)	Total Score 82.77	49
Image Generation	FFHQ 256x256 (test)	FID 40.69	38
Video Generation	VBench	Motion Smoothness 98.49	23
Text-to-Video Generation	VBench Wan 480p	Total Score 81.28	15
Text-to-Video Generation	Private Video Dataset Wan2.1-T2V-14B-720P (test)	IQ 64.03	10
Text-to-Video Generation	Private Video Dataset Wan2.1-T2V-1.3B-480P (test)	IQ 59.57	10
Video Generation	VBench Wan2.1 1.3B, 61x448x832 1.0 (test)	AQ 64.46	8
Video Generation	VBench Self-Forcing Wan 2.1-1.3B (test)	Quality Score 0.828	6
Video Generation	Wan 14B 720p resolution 2.1 (test)	IQ 64.03	6
Video Generation	Wan2.1-1.3B 480p resolution (test)	IQ 59.57	6

Showing 10 of 18 rows
