SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
About
In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck because sequences are long and attention cost grows quadratically with sequence length. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank, and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation (that is, 95% of the attention computation is removed) without degrading end-to-end generation quality, outperforming baseline methods. Our efficient GPU kernel for SLA yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B. The code is available at https://github.com/thu-ml/SLA.
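To make the three-way split concrete, here is a minimal NumPy sketch of the idea, not the released kernel. Several details are assumptions that the abstract does not state: the classification is done per block using mean-pooled queries and keys, the thresholds `top_frac` and `skip_frac` are hypothetical parameters, the linear branch uses a softmax feature map, and the two branches are combined by a plain sum. The function name `sla_sketch` is also hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def sla_sketch(q, k, v, block=4, top_frac=0.25, skip_frac=0.25, eps=1e-9):
    """Illustrative sparse + linear attention for one head (assumed design)."""
    n, d = q.shape
    dv = v.shape[1]
    assert n % block == 0, "sequence length must be a multiple of the block size"
    nb = n // block

    # 1) Cheap block-level attention map from mean-pooled queries and keys.
    qb = q.reshape(nb, block, d).mean(axis=1)
    kb = k.reshape(nb, block, d).mean(axis=1)
    pb = softmax(qb @ kb.T / np.sqrt(d))              # (nb, nb)

    # 2) Per query block: largest scores are critical (1), smallest are
    #    negligible (-1), everything in between is marginal (0).
    n_crit = max(1, int(round(top_frac * nb)))
    n_skip = int(round(skip_frac * nb))
    assert n_crit + n_skip <= nb
    order = np.argsort(-pb, axis=1)
    rows = np.arange(nb)[:, None]
    labels = np.zeros((nb, nb), dtype=np.int8)
    labels[rows, order[:, :n_crit]] = 1
    if n_skip > 0:
        labels[rows, order[:, nb - n_skip:]] = -1

    # 3) Critical blocks: exact softmax attention restricted to those blocks.
    #    (Computed densely and masked here; a real kernel would skip the rest.)
    crit = np.repeat(np.repeat(labels == 1, block, axis=0), block, axis=1)
    scores = np.where(crit, q @ k.T / np.sqrt(d), -np.inf)
    o_sparse = softmax(scores) @ v

    # 4) Marginal blocks: linear attention built from per-block summaries,
    #    so the cost per query does not grow with the number of keys.
    phi_q = softmax(q).reshape(nb, block, d)
    phi_k = softmax(k).reshape(nb, block, d)
    kv = np.einsum("bnd,bne->bde", phi_k, v.reshape(nb, block, dv))
    ksum = phi_k.sum(axis=1)                           # (nb, d)
    marg = (labels == 0).astype(q.dtype)               # (nb, nb)
    h = np.einsum("ib,bde->ide", marg, kv)             # summed K^T V per query block
    z = marg @ ksum                                    # summed normalizer per query block
    num = np.einsum("ind,ide->ine", phi_q, h)
    den = np.einsum("ind,id->in", phi_q, z)[..., None]
    o_linear = (num / (den + eps)).reshape(n, dv)

    # 5) Negligible blocks contribute nothing. Plain sum is an assumption.
    return o_sparse + o_linear, labels

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
out, labels = sla_sketch(q, k, v)
print(out.shape)
```

Setting `top_frac=1.0` and `skip_frac=0.0` marks every block as critical, so the sketch reduces to ordinary softmax attention. That is a convenient sanity check for the masking logic.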
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Generation | FFHQ 256x256 (test) | FID 39.98 | 30 |
| Video Generation | VBench v1 (test) | Latency (s) 7.71 | 13 |
| Text-to-Video Generation | Private Video Dataset Wan2.1-T2V-14B-720P (test) | IQ 67.58 | 10 |
| Text-to-Video Generation | Private Video Dataset Wan2.1-T2V-1.3B-480P (test) | IQ 63.14 | 10 |
| Video Generation | VBench and Vision Reward Mixkit 2000 videos | SC 83.77 | 9 |
| Video Generation | Wan 14B 720p resolution 2.1 (test) | IQ 64.43 | 6 |
| Video Generation | Wan2.1-1.3B 480p resolution (test) | IQ 63.14 | 6 |
| Image Generation | FFHQ 128x128 (test) | FID 25.73 | 4 |
| Image Generation | ImageNet 256 PixelFlow script eval (test) | FID 22.58 | 3 |