Taming Transformers for High-Resolution Image Synthesis

About

Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers and obtain the state of the art among autoregressive models on class-conditional ImageNet. Code and pretrained models can be found at https://github.com/CompVis/taming-transformers .
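The two-stage idea in the abstract can be illustrated with a toy sketch. The abstract does not say how the vocabulary is built, so the nearest-neighbour codebook lookup below is an assumption, as are the row-major token order and every name in the code (`quantize`, `flatten_row_major`, `next_token_pairs`). The sketch shows only the data flow: a small grid of feature vectors (standing in for CNN encoder outputs) becomes a grid of discrete indices, which is flattened into the token sequence an autoregressive transformer could model. It is not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: List[List[Vector]], codebook: List[Vector]) -> List[List[int]]:
    """Replace each spatial feature vector by the index of its nearest codebook entry.

    Assumption: nearest-neighbour lookup stands in for the learned vocabulary.
    """
    return [
        [min(range(len(codebook)), key=lambda k: squared_distance(vec, codebook[k]))
         for vec in row]
        for row in features
    ]


def flatten_row_major(grid: List[List[int]]) -> List[int]:
    """Turn the 2D index grid into a 1D token sequence (row-major order assumed)."""
    return [idx for row in grid for idx in row]


def next_token_pairs(tokens: List[int]) -> List[Tuple[List[int], int]]:
    """List the (prefix, target) pairs an autoregressive model would be trained on."""
    return [(tokens[:i], tokens[i]) for i in range(len(tokens))]


# Hypothetical toy data: a 3-entry codebook and a 2x2 grid of 2-D features.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
features = [
    [[0.1, -0.1], [0.9, 1.2]],
    [[0.2, 0.8], [0.6, 0.7]],
]

index_grid = quantize(features, codebook)   # [[0, 1], [2, 1]]
tokens = flatten_row_major(index_grid)      # [0, 1, 2, 1]
pairs = next_token_pairs(tokens)
print(tokens)
```

The sequence length equals the number of grid cells rather than the number of pixels, which is why working on a compressed grid of indices keeps sequences short enough for a transformer.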

Patrick Esser, Robin Rombach, Björn Ommer • 2020

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Class-conditional Image Generation | ImageNet 256x256 | Inception Score (IS) | 280.3 | 441 |
| Image Generation | ImageNet 256x256 (val) | FID | 15.78 | 307 |
| Class-conditional Image Generation | ImageNet 256x256 (train) | IS | 280.3 | 305 |
| Class-conditional Image Generation | ImageNet 256x256 (val) | FID | 5.2 | 293 |
| Image Generation | ImageNet 256x256 | FID | 15.78 | 243 |
| Image Generation | ImageNet 512x512 (val) | FID-50K | 26.52 | 184 |
| Class-conditional Image Generation | ImageNet 256x256 (train val) | FID | 15.78 | 178 |
| Class-conditional Image Generation | ImageNet 256x256 (test) | FID | 3.04 | 167 |
| Class-conditional Image Generation | ImageNet | FID | 5.88 | 132 |
| Unconditional Image Generation | LSUN Bedrooms unconditional | FID | 6.35 | 96 |
Showing 10 of 187 rows