CogView: Mastering Text-to-Image Generation via Transformers
About
Text-to-image generation in the general domain has long been an open problem, requiring both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, to advance this problem. We also demonstrate finetuning strategies for various downstream tasks (e.g., style learning, super-resolution, text-image ranking, and fashion design) and methods to stabilize pretraining (e.g., eliminating NaN losses). CogView achieves state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and DALL-E, a recent work with a similar approach.
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang• 2021
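As the abstract describes, CogView is a two-stage pipeline: a VQ-VAE first discretizes an image into a grid of codebook indices ("image tokens"), and the Transformer then autoregressively models the concatenation of text tokens and image tokens. The sketch below illustrates only these two ideas in plain Python; the function names, the `sep_id` separator, and the toy codebook are illustrative assumptions, not the paper's actual implementation.

```python
def vq_tokenize(features, codebook):
    """Map each encoder feature vector to the index of its nearest codebook
    entry -- the discretization step of a VQ-VAE image tokenizer (sketch)."""
    def nearest(f):
        return min(range(len(codebook)),
                   key=lambda k: sum((a - b) ** 2 for a, b in zip(f, codebook[k])))
    return [nearest(f) for f in features]

def build_sequence(text_ids, image_ids, sep_id):
    """Single-stream input of the form [text tokens][SEP][image tokens];
    the Transformer models this sequence left to right (sketch)."""
    return list(text_ids) + [sep_id] + list(image_ids)

# Toy example: a 2-entry codebook in a 2-D feature space.
codebook = [[0.0, 0.0], [1.0, 1.0]]
features = [[0.1, 0.1], [0.9, 0.8]]   # two "patch" features
image_ids = vq_tokenize(features, codebook)
sequence = build_sequence([5, 6], image_ids, sep_id=2)
```

At generation time, only the text tokens (and the separator) are given, and the model samples image tokens one at a time, which the VQ-VAE decoder then maps back to pixels.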
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Image Generation | MS-COCO 2014 (val) | FID 27.1 | 128 |
| Text-to-Image Generation | MS-COCO (val) | FID 27.1 | 112 |
| Text-to-Image Generation | MS-COCO | -- | 75 |
| Text-to-Image Generation | MS-COCO 256x256 (val) | FID 27.1 | 53 |
| Text-to-Image Generation | MSCOCO 30K | FID 13.9 | 42 |
| Text-to-Image Generation | MS COCO zero-shot | FID 24 | 42 |
| Text-to-Image Synthesis | COCO (test) | FID 27.1 | 38 |
| Text-to-Image Generation | COCO 256 x 256 2014 (val) | FID 27.1 | 37 |
| Text-to-Image Synthesis | MS-COCO (val) | FID 27.1 | 35 |
| Text-to-Image Synthesis | MSCOCO | FID 13.9 | 31 |
*Showing 10 of 18 benchmark rows.*