
All are Worth Words: A ViT Backbone for Diffusion Models

About

Vision transformers (ViT) have shown promise in various vision tasks while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. U-ViT is characterized by treating all inputs including the time, condition and noisy image patches as tokens and employing long skip connections between shallow and deep layers. We evaluate U-ViT in unconditional and class-conditional image generation, as well as text-to-image generation tasks, where U-ViT is comparable if not superior to a CNN-based U-Net of a similar size. In particular, latent diffusion models with U-ViT achieve record-breaking FID scores of 2.29 in class-conditional image generation on ImageNet 256x256, and 5.48 in text-to-image generation on MS-COCO, among methods without accessing large external datasets during the training of generative models. Our results suggest that, for diffusion-based image modeling, the long skip connection is crucial while the down-sampling and up-sampling operators in CNN-based U-Net are not always necessary. We believe that U-ViT can provide insights for future research on backbones in diffusion models and benefit generative modeling on large scale cross-modality datasets.
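The two design points in the abstract (everything is a token, and shallow layers connect to deep layers through long skips) can be illustrated with a small sketch. This is not the authors' implementation: the function names `patchify`, `build_token_sequence`, and `long_skip_pairs` are hypothetical, the tokens are raw placeholder lists rather than learned embeddings, and the symmetric pairing of block i with block depth-1-i is an assumption about how "long skip connections between shallow and deep layers" might be wired.

```python
from typing import List, Optional, Tuple

Token = List[float]


def patchify(image: List[List[float]], patch: int) -> List[Token]:
    """Split an H x W image (list of rows) into flattened patch tokens."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


def build_token_sequence(
    image: List[List[float]],
    patch: int,
    time_token: Token,
    cond_token: Optional[Token] = None,
) -> List[Token]:
    """Treat time, optional condition, and noisy image patches all as tokens."""
    tokens = [time_token]
    if cond_token is not None:
        tokens.append(cond_token)
    tokens.extend(patchify(image, patch))
    return tokens


def long_skip_pairs(depth: int) -> List[Tuple[int, int]]:
    """Pair shallow block i with deep block depth-1-i (assumed wiring).

    With an odd depth, the middle block is left unpaired.
    """
    assert depth % 2 == 1, "sketch assumes an odd number of blocks"
    half = depth // 2
    return [(i, depth - 1 - i) for i in range(half)]


if __name__ == "__main__":
    # 4 x 4 "noisy image" with 2 x 2 patches gives 4 patch tokens.
    image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    seq = build_token_sequence(image, 2, time_token=[0.5] * 4, cond_token=[1.0] * 4)
    print(len(seq))            # time + condition + 4 patches
    print(long_skip_pairs(5))  # shallow/deep block index pairs
```

In a real model each token would first be projected to a shared embedding width, and each paired deep block would combine its input with the saved shallow activation; the sketch only shows the sequence layout and the pairing schedule.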

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, Jun Zhu • 2022

Related benchmarks

Task                                Dataset                       Result                      Rank
Class-conditional Image Generation  ImageNet 256x256              Inception Score (IS) 263.9  441
Class-conditional Image Generation  ImageNet 256x256 (train)      IS 265.3                    305
Class-conditional Image Generation  ImageNet 256x256 (val)        FID 2.29                    293
Image Generation                    ImageNet 256x256              FID 2.29                    243
Image Generation                    ImageNet 512x512 (val)        FID-50K 3.23                184
Class-conditional Image Generation  ImageNet 256x256 (train val)  FID 3.4                     178
Unconditional Image Generation      CIFAR-10                      FID 3.11                    171
Class-conditional Image Generation  ImageNet 64x64                FID 4.26                    126
Text-to-Image Generation            MS-COCO (val)                 FID 5.45                    112
Unconditional Image Generation      CelebA unconditional 64 x 64  FID 2.87                    95
Showing 10 of 43 rows
