DiC: Rethinking Conv3x3 Designs in Diffusion Models

About

Diffusion models have shown exceptional performance in visual generation tasks. Recently, these models have shifted from traditional U-shaped CNN-Attention hybrid structures to fully transformer-based isotropic architectures. While these transformers exhibit strong scalability and performance, their reliance on the complicated self-attention operation results in slow inference speeds. In contrast to these works, we rethink one of the simplest yet fastest modules in deep learning, the 3x3 convolution, to construct a scaled-up, purely convolutional diffusion model. We first discover that an Encoder-Decoder Hourglass design outperforms scalable isotropic architectures for Conv3x3, but still underperforms our expectations. To further improve the architecture, we introduce sparse skip connections to reduce redundancy and improve scalability. Based on this architecture, we introduce conditioning improvements including stage-specific embeddings, mid-block condition injection, and conditional gating. These improvements lead to our proposed Diffusion CNN (DiC), which serves as a swift yet competitive diffusion architecture baseline. Experiments on various scales and settings show that DiC surpasses existing diffusion transformers by considerable margins in terms of performance while keeping a good speed advantage. Project page: https://github.com/YuchuanTian/DiC

Yuchuan Tian, Jing Han, Chengcheng Wang, Yuchen Liang, Chao Xu, Hanting Chen • 2024
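
The abstract names a 3x3 convolution and conditional gating but gives no formulas, so the following is only an illustrative sketch, not the authors' implementation. It assumes a single-channel feature map, zero padding, and a hypothetical scalar gate computed as a linear function of a conditioning value; the names `conv3x3`, `gated_residual_block`, `gate_weight`, and `gate_bias` are invented for this example.

```python
from typing import List

Grid = List[List[float]]


def conv3x3(x: Grid, kernel: Grid) -> Grid:
    """Single-channel 3x3 convolution (cross-correlation), stride 1, zero padding."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    # Positions outside the map contribute zero (zero padding).
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[di + 1][dj + 1] * x[ii][jj]
            out[i][j] = acc
    return out


def gated_residual_block(
    x: Grid,
    kernel: Grid,
    cond: float,
    gate_weight: float,
    gate_bias: float = 0.0,
) -> Grid:
    """Return x + gate * conv3x3(x), where gate depends on the condition.

    Assumption: the gate is a scalar linear function of `cond`. This is a
    stand-in for conditional gating, not the formulation used in DiC.
    """
    gate = gate_weight * cond + gate_bias
    y = conv3x3(x, kernel)
    return [[x[i][j] + gate * y[i][j] for j in range(len(x[0]))] for i in range(len(x))]


if __name__ == "__main__":
    feature_map = [[1.0, 2.0], [3.0, 4.0]]
    identity_kernel = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
    print(gated_residual_block(feature_map, identity_kernel, cond=1.0, gate_weight=1.0))
```

With `gate_weight` and `gate_bias` both set to zero, the block reduces to the identity mapping, which is one common reason gated residual branches are used: the conditioned branch can be switched in gradually.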

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Image Generation | ImageNet 256x256 | FID 2.25 | 243 |
| Image Generation | ImageNet 512x512 | FID 12.89 | 34 |
| Conditional Image Generation | ImageNet 256x256 2012 400K iterations | FID 3.89 | 9 |
| Image Generation | ImageNet 256x256 400K iterations (test) | FID 11.36 | 7 |
| Image Synthesis | ImageNet 512x512 (train) | FID 2.96 | 3 |
