VSA: Faster Video Diffusion with Trainable Sparse Attention
About
Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight critical tokens; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout to ensure hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPs by 2.53× with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention by 6× and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models. Code will be available at https://github.com/hao-ai-lab/FastVideo.
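The coarse-to-fine idea described above can be illustrated with a small NumPy sketch. This is not the VSA kernel: it assumes a flat 1D token sequence rather than 3D video tiles, mean pooling for the coarse stage, and a fixed top-k tile selection per query tile. The function names `coarse_to_fine_attention` and `dense_attention` and the parameters `tile` and `top_k` are hypothetical, chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(q, k, v):
    # Reference full attention over all tokens.
    scale = 1.0 / np.sqrt(q.shape[1])
    return softmax(q @ k.T * scale) @ v

def coarse_to_fine_attention(q, k, v, tile=4, top_k=2):
    """Illustrative two-stage sparse attention (assumed design, not the paper's kernel).

    q, k, v: arrays of shape (n, d), with n divisible by `tile`.
    Returns the output of shape (n, d) and the selected key-tile
    indices of shape (n // tile, top_k).
    """
    n, d = q.shape
    n_tiles = n // tile
    scale = 1.0 / np.sqrt(d)

    # Coarse stage: mean-pool tokens into tiles and score tile pairs.
    q_c = q.reshape(n_tiles, tile, d).mean(axis=1)
    k_c = k.reshape(n_tiles, tile, d).mean(axis=1)
    coarse = softmax(q_c @ k_c.T * scale)            # (n_tiles, n_tiles)

    # Keep the top_k highest-weight key tiles for each query tile.
    selected = np.argsort(-coarse, axis=-1)[:, :top_k]

    # Fine stage: token-level attention restricted to the selected tiles.
    k_t = k.reshape(n_tiles, tile, d)
    v_t = v.reshape(n_tiles, tile, d)
    out = np.empty_like(q)
    for i in range(n_tiles):
        q_blk = q[i * tile:(i + 1) * tile]           # (tile, d)
        k_sel = k_t[selected[i]].reshape(-1, d)      # (top_k * tile, d)
        v_sel = v_t[selected[i]].reshape(-1, d)
        out[i * tile:(i + 1) * tile] = softmax(q_blk @ k_sel.T * scale) @ v_sel
    return out, selected

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
sparse_out, tiles = coarse_to_fine_attention(q, k, v, tile=4, top_k=2)
```

In this sketch, each query tile attends to `top_k * tile` keys instead of all `n`, so the fine-stage cost shrinks roughly in proportion to `top_k / n_tiles`. Setting `top_k` equal to the number of tiles recovers dense attention exactly, which is a convenient sanity check.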
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Generation | VBench 2.0 (test) | Total Score | 82.77 | 44 |
| Image Generation | FFHQ 256x256 (test) | FID | 40.69 | 30 |
| Text-to-Video Generation | Private Video Dataset Wan2.1-T2V-14B-720P (test) | IQ | 64.03 | 10 |
| Text-to-Video Generation | Private Video Dataset Wan2.1-T2V-1.3B-480P (test) | IQ | 59.57 | 10 |
| Video Generation | VBench Wan2.1 1.3B, 61x448x832 1.0 (test) | AQ | 64.46 | 8 |
| Video Generation | VBench Self-Forcing Wan 2.1-1.3B (test) | Quality Score | 0.828 | 6 |
| Video Generation | Wan 14B 720p resolution 2.1 (test) | IQ | 64.03 | 6 |
| Video Generation | Wan2.1-1.3B 480p resolution (test) | IQ | 59.57 | 6 |
| Video Generation | VBench Wan2.1 14B, 93x704x1280 1.0 (test) | AQ | 66.4 | 4 |
| Video Generation | VBench Wan 5B 93x704x1280 2.2 (test) | Aesthetic Quality (AQ) | 64.82 | 4 |