MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer

About

We observe that diffusion probabilistic models (DPMs), despite their success in image synthesis, often lack the contextual reasoning ability to learn the relations among object parts in an image, which leads to a slow learning process. To solve this issue, we propose a Masked Diffusion Transformer (MDT) that introduces a mask latent modeling scheme to explicitly enhance the DPMs' ability to learn contextual relations among the semantic parts of objects in an image. During training, MDT operates in the latent space and masks certain tokens. An asymmetric diffusion transformer is then designed to predict the masked tokens from the unmasked ones while maintaining the diffusion generation process. MDT can reconstruct the full information of an image from its incomplete contextual input, which enables it to learn the associated relations among image tokens. We further improve MDT with a more efficient macro network structure and training strategy, and name the result MDTv2. Experimental results show that MDTv2 achieves superior image synthesis performance, e.g., a new SOTA FID score of 1.58 on the ImageNet dataset, and learns more than 10x faster than the previous SOTA, DiT. The source code is released at https://github.com/sail-sg/MDT.
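The masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 0.3 mask ratio, and the zero-vector placeholder are all assumptions chosen for illustration, and the asymmetric transformer that would predict the hidden tokens is not reproduced here. The sketch only shows how a latent token sequence might be split into visible and masked positions and then padded back to full length.

```python
import random

def mask_latent_tokens(tokens, mask_ratio, rng):
    """Randomly hide a fraction of latent tokens (hypothetical helper).

    Returns the visible tokens, their original positions, and a boolean
    mask where True marks a hidden position.
    """
    num_tokens = len(tokens)
    num_keep = max(1, round(num_tokens * (1.0 - mask_ratio)))
    keep_idx = sorted(rng.sample(range(num_tokens), num_keep))
    kept = set(keep_idx)
    mask = [i not in kept for i in range(num_tokens)]
    visible = [tokens[i] for i in keep_idx]
    return visible, keep_idx, mask

def restore_sequence(visible, keep_idx, num_tokens, mask_token):
    """Rebuild a full-length sequence, filling hidden positions with a
    placeholder token (assumed fixed here; a real model would likely
    use a learnable one)."""
    full = [list(mask_token) for _ in range(num_tokens)]
    for pos, tok in zip(keep_idx, visible):
        full[pos] = list(tok)
    return full

rng = random.Random(0)
# 16 latent tokens of dimension 4, standing in for encoded image patches.
latents = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(16)]
visible, keep_idx, mask = mask_latent_tokens(latents, mask_ratio=0.3, rng=rng)
full = restore_sequence(visible, keep_idx, len(latents), mask_token=[0.0] * 4)
print(len(visible), sum(mask))
```

In a setup like the one the abstract describes, only the visible tokens would be passed to the encoder side, and the model would be trained to predict the content at the masked positions.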

Shanghua Gao, Pan Zhou, Ming-Ming Cheng, Shuicheng Yan • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Class-conditional Image Generation | ImageNet 256x256 | Inception Score (IS) 314.7 | 441 |
| Image Generation | ImageNet 256x256 (val) | FID 1.58 | 307 |
| Class-conditional Image Generation | ImageNet 256x256 (train) | IS 314.7 | 305 |
| Class-conditional Image Generation | ImageNet 256x256 (val) | FID 1.58 | 293 |
| Image Generation | ImageNet 256x256 | FID 1.58 | 243 |
| Class-conditional Image Generation | ImageNet 256x256 (train val) | FID 1.58 | 178 |
| Class-conditional Image Generation | ImageNet 256x256 (test) | FID 1.79 | 167 |
| Image Reconstruction | ImageNet 256x256 | rFID 0.61 | 93 |
| Class-conditional Image Generation | ImageNet class-conditional 256x256 (test val) | FID 1.79 | 75 |
| Class-conditional Image Generation | ImageNet 512x512 (val) | FID (Val) 51.16 | 69 |
Showing 10 of 14 rows
