StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis

About

Text-to-image synthesis has recently seen significant progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models. However, the best-performing models require iterative evaluation to generate a single sample. In contrast, generative adversarial networks (GANs) only need a single forward pass. They are thus much faster, but they currently remain far behind the state-of-the-art in large-scale text-to-image synthesis. This paper aims to identify the necessary steps to regain competitiveness. Our proposed model, StyleGAN-T, addresses the specific requirements of large-scale text-to-image synthesis, such as large capacity, stable training on diverse datasets, strong text alignment, and controllable variation vs. text alignment tradeoff. StyleGAN-T significantly improves over previous GANs and outperforms distilled diffusion models - the previous state-of-the-art in fast text-to-image synthesis - in terms of sample quality and speed.

Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, Timo Aila • 2023

Related benchmarks

Task                      Dataset                     Result                Rank
Text-to-Image Generation  MS-COCO 2014 (val)          --                    128
Text-to-Image Generation  MS-COCO                     FID 13.9              75
Text-to-Image Generation  COCO 30k subset 2014 (val)  FID 13.9              46
Text-to-Image Generation  MS COCO zero-shot           FID 13.9              42
Text-to-Image Generation  MS-COCO 512x512 zero-shot   FID 13.9              19
Text-to-Image Synthesis   MS COCO 64x64 zero-shot     Zero-shot FID30k 7.3  13
Text-to-Image Synthesis   MS COCO 256x256             FID 13.9              13
