
QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification

About

Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. Combining them promises compounded efficiency gains, but naive integration is ineffective: sparsity-induced information loss exacerbates quantization noise, leading to amplified attention shifts. To address this, we propose QuantSparse, a unified framework that integrates model quantization with attention sparsification. Specifically, we introduce Multi-Scale Salient Attention Distillation, which leverages both global structural guidance and local salient supervision to mitigate quantization-induced bias. In addition, we develop Second-Order Sparse Attention Reparameterization, which exploits the temporal stability of second-order residuals to efficiently recover information lost under sparsity. Experiments on HunyuanVideo-13B demonstrate that QuantSparse achieves 20.88 PSNR, substantially outperforming the state-of-the-art quantization baseline Q-VDiT (16.85 PSNR), while simultaneously delivering a 3.68× reduction in storage and a 1.88× acceleration in end-to-end inference. Our code will be released at https://github.com/wlfeng0509/QuantSparse.
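To make the two compression directions concrete, here is a minimal, hypothetical sketch in pure Python. It is not the QuantSparse method: it only shows naive integration, assuming 4-bit symmetric fake-quantization of the attention inputs as a stand-in for model quantization and top-k key selection as a stand-in for attention sparsification. All function names, sizes, and the random data are illustrative assumptions.

```python
import math
import random


def quantize_symmetric(values, bits=4):
    """Uniform symmetric fake-quantization: snap to a signed integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def softmax(scores):
    """Numerically stable softmax; entries set to -inf receive zero weight."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_row(q, keys, values, keep=None):
    """Single-query attention. If `keep` is set, only the top-`keep` scoring keys are attended."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    if keep is not None and keep < len(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        kept = set(order[:keep])
        scores = [s if i in kept else float("-inf") for i, s in enumerate(scores)]
    weights = softmax(scores)
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return out, weights


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Illustrative random data (assumption: 16 tokens, head dimension 8).
rng = random.Random(0)
dim, n_tokens = 8, 16
q = [rng.gauss(0, 1) for _ in range(dim)]
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]

# Quantized copies of the attention inputs.
q_q = quantize_symmetric(q)
keys_q = [quantize_symmetric(k) for k in keys]
values_q = [quantize_symmetric(v) for v in values]

dense_out, _ = attention_row(q, keys, values)
sparse_out, sparse_w = attention_row(q, keys, values, keep=4)
quant_out, _ = attention_row(q_q, keys_q, values_q)
both_out, _ = attention_row(q_q, keys_q, values_q, keep=4)

print("sparse only MSE:", mse(dense_out, sparse_out))
print("quant only MSE: ", mse(dense_out, quant_out))
print("combined MSE:   ", mse(dense_out, both_out))
```

Comparing the three printed errors against the dense output gives a rough feel for how the two error sources interact on toy data; the actual behavior on a video diffusion transformer depends on the model and on the calibration described in the paper.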

Weilun Feng, Chuanguang Yang, Haotong Qin, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu • 2025

Related benchmarks

Task | Dataset | Result | Rank
Text-to-Video Generation | VBench | -- | 155
Text-to-Video Generation | Wan 1.3B 2.1 | CLIPSIM 0.193 | 27
Video Generation | Wan2.1 14B (test) | CLIPSIM 0.183 | 11
Text-to-Video Generation | HunyuanVideo 13B CFG = 6.0, 720 × 1280p, frames = 60 (test) | CLIPSIM 0.183 | 11
Text-to-Video Generation | Wan2.1 14B CFG = 5.0, 720 × 1280p, frames = 80 (test) | CLIPSIM 0.182 | 11
Video Generation | HunyuanVideo 13B (test) | CLIPSIM 0.184 | 11
Image Generation | DrawBench | PSNR 20.34 | 3
