
PTQ4DiT: Post-training Quantization for Diffusion Transformers

About

The recent introduction of Diffusion Transformers (DiTs) has demonstrated exceptional capabilities in image generation by using a different backbone architecture, departing from traditional U-Nets and embracing the scalable nature of transformers. Despite their advanced capabilities, the wide deployment of DiTs, particularly for real-time applications, is currently hampered by considerable computational demands at the inference stage. Post-training Quantization (PTQ) has emerged as a fast and data-efficient solution that can significantly reduce computation and memory footprint by using low-bit weights and activations. However, its applicability to DiTs has not yet been explored and faces non-trivial difficulties due to the unique design of DiTs. In this paper, we propose PTQ4DiT, a PTQ method specifically designed for DiTs. We identify two primary quantization challenges inherent in DiTs, namely the presence of salient channels with extreme magnitudes and the temporal variability in the distributions of salient activations over multiple timesteps. To tackle these challenges, we propose Channel-wise Salience Balancing (CSB) and Spearman's $\rho$-guided Salience Calibration (SSC). CSB leverages the complementarity property of channel magnitudes to redistribute the extremes, alleviating quantization errors for both activations and weights. SSC extends this approach by dynamically adjusting the balanced salience to capture the temporal variations in activations. Additionally, to eliminate extra computational costs caused by PTQ4DiT during inference, we design an offline re-parameterization strategy for DiTs. Experiments demonstrate that our PTQ4DiT successfully quantizes DiTs to 8-bit precision (W8A8) while preserving comparable generation ability, and further enables effective quantization to 4-bit weight precision (W4A8) for the first time.
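To illustrate the general idea behind channel-wise salience balancing, here is a minimal NumPy sketch. It is not the paper's exact formulation: the function name and the scale rule (a SmoothQuant-style per-channel factor $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$, here with $\alpha = 0.5$) are assumptions for illustration. The key invariant it demonstrates is that rescaling activations down and weights up per channel leaves the matrix product unchanged while moving extreme channel magnitudes off the activations.

```python
import numpy as np

def channelwise_salience_balance(X, W, alpha=0.5, eps=1e-8):
    """Hypothetical sketch of channel-wise salience balancing.

    X: activations, shape [tokens, channels]
    W: weights, shape [channels, out_features]
    Returns (X / s, W * s) with a per-channel scale s chosen so that
    neither tensor carries the extreme-magnitude (salient) channels
    alone. The product is preserved: (X / s) @ (diag(s) W) == X @ W.
    """
    a_max = np.abs(X).max(axis=0)                      # per-channel activation salience
    w_max = np.abs(W).max(axis=1)                      # per-channel weight salience
    s = (a_max ** alpha) / (w_max ** (1.0 - alpha) + eps)
    s = np.maximum(s, eps)                             # guard against zero channels
    return X / s, W * s[:, None]

# A salient activation channel gets its magnitude shared with the weights,
# while the layer output is unchanged up to float error.
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))
X[:, 3] *= 50.0                                        # simulate an extreme channel
W = rng.standard_normal((8, 4))
Xb, Wb = channelwise_salience_balance(X, W)
assert np.allclose(X @ W, Xb @ Wb)
assert np.abs(Xb[:, 3]).max() < np.abs(X[:, 3]).max()
```

In the paper, this balancing is folded into adjacent layers offline (the re-parameterization strategy), so no per-channel division remains at inference time; the sketch above only shows the mathematical equivalence being exploited.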

Junyi Wu, Haoxuan Wang, Yuzhang Shang, Mubarak Shah, Yan Yan · 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Image Generation | ImageNet 256x256 (val) | FID | 4.63 | 340
Image Generation | ImageNet 512x512 (val) | FID-50K | 17.55 | 219
Text-to-Image Generation | MJHQ-30K | Overall FID | 19.15 | 153
Image Super-resolution | DRealSR | MANIQA | 0.4949 | 130
Text-to-Image Generation | COCO | FID | 28.15 | 61
Real-world Image Super-Resolution | RealLQ250 | MUSIQ | 0.587 | 45
Real-world Image Super-Resolution | DRealSR | LPIPS | 0.697 | 35
Real-world Image Super-Resolution | RealLR200 | MUSIQ | 57.63 | 34
Real-world Image Super-Resolution | RealSR | LPIPS | 0.6934 | 31
Text-to-Video Generation | VBench (test) | Motion Smoothness | 99.3 | 28

Showing 10 of 16 rows

Other info

Code
