ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis

About

Autoregressive models and their sequential factorization of the data likelihood have recently demonstrated great potential for image representation and synthesis. Nevertheless, they incorporate image context in a linear 1D order by attending only to previously synthesized image patches above or to the left. Not only is this unidirectional, sequential bias of attention unnatural for images as it disregards large parts of a scene until synthesis is almost complete. It also processes the entire image on a single scale, thus ignoring more global contextual information up to the gist of the entire scene. As a remedy we incorporate a coarse-to-fine hierarchy of context by combining the autoregressive formulation with a multinomial diffusion process: Whereas a multistage diffusion process successively removes information to coarsen an image, we train a (short) Markov chain to invert this process. In each stage, the resulting autoregressive ImageBART model progressively incorporates context from previous stages in a coarse-to-fine manner. Experiments show greatly improved image modification capabilities over autoregressive models while also providing high-fidelity image generation, both of which are enabled through efficient training in a compressed latent space. Specifically, our approach can take unrestricted, user-provided masks into account to perform local image editing. Thus, in contrast to pure autoregressive models, it can solve free-form image inpainting and, in the case of conditional models, local, text-guided image modification without requiring mask-specific training.

Patrick Esser, Robin Rombach, Andreas Blattmann, Björn Ommer • 2021
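
The forward process described in the abstract (a multistage chain that successively removes information from a discrete image representation) can be illustrated with a minimal sketch. It assumes the common multinomial diffusion kernel, in which each token is kept with probability 1 - beta and otherwise resampled uniformly from the codebook. The function names, codebook size, and beta schedule are hypothetical choices for illustration and are not taken from the paper or its code. The learned reverse chain, which the abstract says is modeled autoregressively, is not shown.

```python
import random


def forward_step(tokens, beta, codebook_size, rng):
    """Apply one forward multinomial diffusion step.

    Each token is kept with probability 1 - beta; otherwise it is
    replaced by an index drawn uniformly from the codebook.
    (Assumed kernel, illustrative only.)
    """
    return [
        rng.randrange(codebook_size) if rng.random() < beta else tok
        for tok in tokens
    ]


def forward_chain(tokens, betas, codebook_size, seed=0):
    """Run a short forward chain and return every intermediate state.

    states[0] is the clean token sequence; states[t] is the sequence
    after t corruption steps. The schedule `betas` is hypothetical.
    """
    rng = random.Random(seed)
    states = [list(tokens)]
    for beta in betas:
        states.append(forward_step(states[-1], beta, codebook_size, rng))
    return states


if __name__ == "__main__":
    x0 = [3, 1, 4, 1, 5, 9, 2, 6]  # toy sequence of codebook indices
    for t, state in enumerate(forward_chain(x0, [0.25, 0.5, 1.0], 16)):
        print(t, state)
```

With a final beta of 1.0 every token is resampled, so the last state carries no information about the input. A reverse model trained on such a chain would start from that state and restore detail stage by stage, which is the coarse-to-fine behaviour the abstract describes.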

Related benchmarks

Task	Dataset	Result
Class-conditional Image Generation	ImageNet	FID 7.44	174
Class-conditional Image Generation	ImageNet (val)	IS 61.6	116
Unconditional Image Generation	LSUN Bedrooms unconditional	FID 5.51	96
Image Generation	LSUN Church 256x256 (test)	FID 7.32	61
Unconditional image synthesis	FFHQ 256x256 (test)	FID 9.57	31
Class-conditional Image Generation	ImageNet (train val)	FID 7.44	30
Unconditional Image Generation	FFHQ 256x256 (test)	FID 9.57	25
Image Generation	LSUN Churches 256x256	FID 7.32	23
Unconditional Image Generation	LSUN Church (test)	FID 7.32	17
Image Generation	LSUN Bedroom 256x256	FID 5.51	16

Showing 10 of 24 rows
