
SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration

About

Visual AutoRegressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction paradigm. However, mainstream VAR paradigms attend to all tokens across historical scales at each autoregressive step. As the next-scale resolution grows, the computational complexity of attention increases quartically with resolution (the token count grows quadratically with resolution, and attention cost is quadratic in the token count), causing substantial latency. Prior acceleration methods often skip high-resolution scales, which speeds up inference but discards high-frequency details and harms image quality. To address these problems, we present SparVAR, a training-free acceleration framework that exploits three properties of VAR attention: (i) strong attention sinks, (ii) cross-scale activation similarity, and (iii) pronounced locality. Specifically, we dynamically predict the sparse attention pattern of later high-resolution scales from a sparse decision scale, and construct scale-self-similar sparse attention via an efficient index-mapping mechanism, enabling high-efficiency sparse attention computation at large scales. Furthermore, we propose cross-scale local sparse attention and implement an efficient block-wise sparse kernel, which achieves >5× faster forward speed than FlashAttention. Extensive experiments demonstrate that SparVAR reduces the generation time of an 8B model producing 1024×1024 high-resolution images to about 1 s, without skipping the last scales. Compared with the VAR baseline accelerated by FlashAttention, our method achieves a 1.57× speed-up while preserving almost all high-frequency details. When combined with existing scale-skipping strategies, SparVAR attains up to a 2.28× acceleration while maintaining competitive visual generation quality. Code is available at https://github.com/CAS-CLab/SparVAR.
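The core idea above — score blocks at a small decision scale, map the surviving block indices to a larger scale, then run block-sparse attention — can be sketched in NumPy. This is a hypothetical reconstruction, not the authors' kernel: all function names are invented, the "index mapping" is approximated with a Kronecker upsampling of the block mask (assuming the sparse pattern is self-similar across scales, per property ii), diagonal blocks are always kept as a stand-in for sinks/locality (properties i and iii), and the masked attention is computed densely where a real kernel would skip masked blocks entirely.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_mask_from_decision_scale(q, k, block=4, keep_ratio=0.25):
    """Score block-level attention mass at the small decision scale and
    keep the highest-mass blocks (hypothetical reconstruction)."""
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    n = q.shape[0] // block
    # Sum attention mass inside each (block x block) tile.
    mass = attn.reshape(n, block, n, block).sum(axis=(1, 3))
    thresh = np.quantile(mass, 1.0 - keep_ratio)
    mask = mass >= thresh
    # Always keep diagonal blocks: a crude proxy for attention
    # sinks and locality, so no query row is fully masked.
    mask |= np.eye(n, dtype=bool)
    return mask  # (n, n) boolean block mask

def upscale_block_mask(mask, factor=2):
    """Map the decision-scale block mask to a higher scale, assuming
    the sparse pattern is self-similar across scales."""
    return np.kron(mask, np.ones((factor, factor), dtype=bool))

def block_sparse_attention(q, k, v, block_mask, block=4):
    """Dense reference of block-masked attention; a fast kernel would
    skip masked blocks instead of materializing the full score matrix."""
    full = np.kron(block_mask, np.ones((block, block), dtype=bool))
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(full, scores, -np.inf)
    return softmax(scores) @ v
```

A usage pattern under these assumptions: compute the block mask once at the decision scale, then reuse its upscaled copies for every later (larger) scale, so the expensive pattern search is paid only once per image.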

Zekun Li, Ning Wang, Tongxin Bai, Changwang Mei, Peisong Wang, Shuang Qiu, Jian Cheng• 2026

Related benchmarks

Task | Dataset | Result | Rank
---- | ------- | ------ | ----
Text-to-Image Generation | DPG-Bench | Overall Score: 75.625 | 265
Text-to-Image Generation | DPG-Bench (test) | Global Fidelity: 91.729 | 58
Image Generation | GenEval | Overall Score: 50.7 | 57
Text-to-Image Generation | ImageReward | ImageReward Score: 0.68 | 56
Human Preference Evaluation | ImageReward | Average Score: 1.0533 | 24
Human Preference Evaluation | HPS v2.1 | Photo Score: 29.47 | 24
Text-to-Image Generation | GenEval 1024x1024 | Overall Score (GenEval): 0.8 | 23
Image Generation | 1024x1024 | Latency (ms): 383.1 | 6
Image Generation | HPS v2.1 | Overall Score: 29.14 | 3
