
Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models

About

We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phase training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder's spatial compression ratio up to 128x while maintaining the reconstruction quality. Applying our DC-AE to latent diffusion models, we achieve significant speedup without an accuracy drop. For example, on ImageNet 512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup on an H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder. Our code is available at https://github.com/mit-han-lab/efficientvit.
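To make the Residual Autoencoding idea more concrete, here is a minimal NumPy sketch, not the authors' implementation. It shows a space-to-channel rearrangement and a downsampling step whose learned branch only has to model a residual on top of that rearranged input. The function names, the `learned_branch` callable, and the use of channel averaging to match the output channel count are assumptions made for illustration; the abstract only states that residuals are learned on space-to-channel transformed features. The `latent_grid` helper is plain arithmetic on the compression ratios quoted above.

```python
import numpy as np


def space_to_channel(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange a (C, H, W) array into (C*r*r, H//r, W//r).

    Every input value is kept; each r x r spatial patch is moved into channels.
    """
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0, "H and W must be divisible by r"
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)  # -> (C, r, r, H/r, W/r)
    return x.reshape(c * r * r, h // r, w // r)


def residual_downsample(x, learned_branch, r, out_channels):
    """Return learned_branch(x) + shortcut(x) with a parameter-free shortcut.

    `learned_branch` is a hypothetical callable standing in for trainable
    layers. Assumption: the shortcut averages groups of channels so that its
    channel count matches `out_channels`.
    """
    s = space_to_channel(x, r)
    assert s.shape[0] % out_channels == 0, "channels must split into equal groups"
    groups = s.shape[0] // out_channels
    shortcut = s.reshape(out_channels, groups, *s.shape[1:]).mean(axis=1)
    return learned_branch(x) + shortcut


def latent_grid(image_size: int, f: int) -> int:
    """Side length of the latent grid at spatial compression ratio f."""
    return image_size // f


if __name__ == "__main__":
    x = np.arange(16, dtype=np.float64).reshape(1, 4, 4)
    # A zero learned branch means the output is exactly the shortcut.
    zero_branch = lambda inp: np.zeros((2, 2, 2))
    y = residual_downsample(x, zero_branch, r=2, out_channels=2)
    print(y.shape)
    print(latent_grid(512, 8), latent_grid(512, 64))
```

With a zero learned branch the output is just the rearranged input, which illustrates why such a shortcut may ease optimization: the trainable layers start from an information-preserving mapping instead of having to discover one. The last line also shows why higher compression helps diffusion models: a 512x512 image maps to a 64x64 latent grid at 8x but only an 8x8 grid at 64x.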

Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, Song Han · 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Reconstruction | ImageNet 256x256 | rFID 0.69 | 93 |
| Class-conditional Image Generation | ImageNet 512x512 (val test) | FID 1.72 | 40 |
| Conditional Image Generation | ImageNet 512x512 (val) | gFID 2.25 | 30 |
| Image Generation | FFHQ unconditional 1024x1024 | Throughput (Training) 2.09e+3 | 9 |
| Image Generation | MJHQ class-conditional 1024x1024 | Throughput (Training) 2.09e+3 | 9 |
| Image Reconstruction | FFHQ 1024x1024 | PSNR 31.18 | 6 |
| Image Reconstruction | ImageNet 512x512 | rFID 0.22 | 4 |
| Image Reconstruction | Mapillary Vistas 2048x2048 | rFID 0.36 | 4 |
| Image Generation | Mapillary Vistas unconditional 2048x2048 | Throughput (Training) 459 | 2 |
| Text-to-Image Generation | MJHQ 512x512 | FID 6.1 | 2 |
