
CogView: Mastering Text-to-Image Generation via Transformers

About

Text-to-image generation in the general domain has long been an open problem, requiring both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, to advance this problem. We also demonstrate finetuning strategies for various downstream tasks (e.g., style learning, super-resolution, text-image ranking, and fashion design) and methods to stabilize pretraining (e.g., eliminating NaN losses). CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and the recent similar work DALL-E.
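The two-stage pipeline the abstract describes can be sketched as follows: a VQ-VAE tokenizer maps each image patch embedding to the index of its nearest codebook vector, and the Transformer then models the concatenated text-and-image token stream autoregressively. This is a minimal illustrative sketch, not the authors' code; the codebook size, embedding dimension, and vocabulary offset below are assumptions for illustration.

```python
import numpy as np

def vq_tokenize(patch_embeddings, codebook):
    """Map each patch embedding to its nearest codebook index (L2 distance)."""
    # distances: (num_patches, codebook_size) via broadcasting
    d = ((patch_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def build_sequence(text_token_ids, image_token_ids, image_offset):
    """Concatenate text and image tokens into one autoregressive stream.

    Image indices are shifted by `image_offset` so the text and image
    vocabularies do not collide (the offset value here is an assumption).
    """
    return np.concatenate([text_token_ids, image_token_ids + image_offset])

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8192, 64))   # hypothetical: 8192 codes, 64-dim
patches = rng.normal(size=(1024, 64))    # a 32x32 grid of patch embeddings
img_tokens = vq_tokenize(patches, codebook)
seq = build_sequence(np.array([5, 17, 42]), img_tokens, image_offset=50000)
print(seq.shape)  # (1027,) = 3 text tokens + 1024 image tokens
```

Once images are reduced to discrete tokens this way, text-to-image generation becomes ordinary next-token prediction over the combined sequence.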

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang • 2021

Related benchmarks

Task                     | Dataset                   | Metric | Result | Rank
Text-to-Image Generation | MS-COCO 2014 (val)        | FID    | 27.1   | 128
Text-to-Image Generation | MS-COCO (val)             | FID    | 27.1   | 112
Text-to-Image Generation | MS-COCO                   | --     | --     | 75
Text-to-Image Generation | MS-COCO 256x256 (val)     | FID    | 27.1   | 53
Text-to-Image Generation | MSCOCO 30K                | FID    | 13.9   | 42
Text-to-Image Generation | MS COCO zero-shot         | FID    | 24     | 42
Text-to-Image Synthesis  | COCO (test)               | FID    | 27.1   | 38
Text-to-Image Generation | COCO 256 x 256 2014 (val) | FID    | 27.1   | 37
Text-to-Image Synthesis  | MS-COCO (val)             | FID    | 27.1   | 35
Text-to-Image Synthesis  | MSCOCO                    | FID    | 13.9   | 31

(10 of 18 rows shown)
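Every result in the table uses FID (Fréchet Inception Distance), which compares Gaussian fits of real and generated image features: FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2(S1 S2)^(1/2)), where lower is better. The sketch below is illustrative only and assumes diagonal covariances, so the matrix square root reduces to an elementwise square root; real FID implementations use full covariances of Inception-network features.

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """FID between two Gaussians with diagonal covariances (a simplifying
    assumption for this sketch; production code uses full covariances)."""
    mean_term = np.sum((mu1 - mu2) ** 2)
    # Tr(S1 + S2 - 2*sqrt(S1 S2)) for diagonal S1, S2
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return mean_term + cov_term

# Identical distributions give FID = 0; any mismatch increases it.
mu = np.zeros(4)
var = np.ones(4)
print(fid_diagonal(mu, var, mu, var))      # 0.0
print(fid_diagonal(mu + 1.0, var, mu, var))  # 4.0 (mean shift of 1 in 4 dims)
```

A lower FID means the generated feature distribution sits closer to the real one, which is why rows with FID 13.9 rank above rows with FID 27.1.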

Other info

Code
