
Improving Diffusion-Based Image Synthesis with Context Prediction

About

Diffusion models are a powerful class of generative models that have dramatically advanced image generation, achieving unprecedented quality and diversity. Existing diffusion models mainly reconstruct the input image from a corrupted one under a pixel-wise or feature-wise constraint along the spatial axes. However, such point-based reconstruction may fail to make each predicted pixel or feature fully preserve its neighborhood context, impairing diffusion-based image synthesis. Context, as a powerful source of automatic supervisory signal, has been well studied for representation learning. Inspired by this, we propose ConPreDiff, the first method to improve diffusion-based image synthesis with context prediction. During training, we explicitly force each point to predict its neighborhood context (i.e., multi-stride features/tokens/pixels) with a context decoder attached to the end of the diffusion denoising blocks; the decoder is removed at inference. In this way, each point better reconstructs itself by preserving its semantic connections with its neighborhood. This new paradigm generalizes to arbitrary discrete and continuous diffusion backbones without introducing extra parameters into the sampling procedure. Extensive experiments on unconditional image generation, text-to-image generation, and image inpainting show that ConPreDiff consistently outperforms previous methods and sets a new SOTA for text-to-image generation on MS-COCO, with a zero-shot FID of 6.21.
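The training idea above, each point predicting its multi-stride neighborhood through an auxiliary decoder that is discarded at inference, can be sketched in a toy form. This is a minimal NumPy illustration under assumed simplifications (a linear per-stride "context decoder", 4 axis-aligned neighbors, and a plain MSE loss), not the paper's actual architecture or objective:

```python
import numpy as np

def neighborhood_targets(feat, stride):
    """Collect the 4 axis-aligned neighbors at a given stride for every
    spatial position of an (H, W, C) feature map, using edge padding.
    Returns an array of shape (H, W, 4, C)."""
    H, W, C = feat.shape
    p = np.pad(feat, ((stride, stride), (stride, stride), (0, 0)), mode="edge")
    up    = p[0:H,                stride:stride + W]
    down  = p[2 * stride:,        stride:stride + W]
    left  = p[stride:stride + H,  0:W]
    right = p[stride:stride + H,  2 * stride:]
    return np.stack([up, down, left, right], axis=2)

def context_prediction_loss(feat, decoder_weights, strides=(1, 2)):
    """Toy context-prediction objective: for each stride s, a linear
    'context decoder' (one (C, 4*C) weight matrix per stride) maps each
    point's feature to a prediction of its 4 stride-s neighbors; the loss
    is the mean squared error against the true neighbors, summed over
    strides. At inference this decoder would simply not be applied."""
    H, W, C = feat.shape
    loss = 0.0
    for s, w in zip(strides, decoder_weights):
        pred = feat.reshape(H * W, C) @ w                       # (H*W, 4*C)
        target = neighborhood_targets(feat, s).reshape(H * W, 4 * C)
        loss += np.mean((pred - target) ** 2)
    return loss
```

As a sanity check of the sketch: on a constant feature map, a decoder that copies each point's own feature into all four neighbor slots predicts the neighborhood exactly, giving zero loss.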

Ling Yang, Jingwei Liu, Shenda Hong, Zhilong Zhang, Zhilin Huang, Zheming Cai, Wentao Zhang, Bin Cui • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | MS-COCO (val) | FID | 6.21 | 112 |
| Image Inpainting | ImageNet | LPIPS | 0.057 | 54 |
| Image Inpainting | CelebA-HQ | LPIPS | 0.022 | 42 |
| Unconditional Image Synthesis | FFHQ 256x256 (test) | FID | 2.24 | 31 |
| Unconditional Image Synthesis | CelebA-HQ 256x256 (test) | FID | 3.22 | 22 |
| Text-to-Image Generation | T2I-CompBench | Color Fidelity | 0.7019 | 16 |
| Unconditional Image Synthesis | LSUN-Bedrooms 256x256 (test) | FID | 1.12 | 8 |
| Unconditional Image Synthesis | LSUN-Churches 256x256 (test) | FID | 1.78 | 8 |
