LAFITE: Towards Language-Free Training for Text-to-Image Generation

About

One of the major challenges in training text-to-image generation models is the need for a large number of high-quality image-text pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is particularly time-consuming and costly. In this paper, we propose the first work to train text-to-image generation models without any text data. Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the requirement of text conditioning is seamlessly alleviated by generating text features from image features. Extensive experiments are conducted to illustrate the effectiveness of the proposed method. We obtain state-of-the-art results on the standard text-to-image generation tasks. Importantly, the proposed language-free model outperforms most existing models trained with full image-text pairs. Furthermore, our method can be applied to fine-tuning pre-trained models, which saves both time and cost in training text-to-image generation models. Our pre-trained model obtains competitive results in zero-shot text-to-image generation on the MS-COCO dataset, yet with only around 1% of the model size and training data size relative to the recently proposed large DALL-E model.

Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, Tong Sun • 2021
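
The abstract says text features are generated from image features in CLIP's shared embedding space, but it does not spell out the mechanism. The following is a minimal sketch of one plausible way to do this, assuming the image feature is perturbed with scaled Gaussian noise and renormalized. The function names, the `noise_level` parameter, and the four-dimensional example vector are all hypothetical stand-ins, not the authors' code or a real CLIP embedding.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def pseudo_text_feature(image_feature, noise_level=0.1, rng=None):
    """Build a text-like feature from an image feature (illustrative assumption).

    The image feature is normalized, nudged by a unit-length Gaussian
    direction scaled by `noise_level`, and normalized again, so the result
    stays close to the image feature on the unit sphere.
    """
    if rng is None:
        rng = random.Random()
    unit_image = l2_normalize(image_feature)
    noise = l2_normalize([rng.gauss(0.0, 1.0) for _ in unit_image])
    perturbed = [f + noise_level * e for f, e in zip(unit_image, noise)]
    return l2_normalize(perturbed)


# Hypothetical stand-in for a CLIP image embedding (real ones are much larger).
image_feature = [0.2, -1.3, 0.7, 0.05]
text_like = pseudo_text_feature(image_feature, noise_level=0.1, rng=random.Random(0))
print(cosine(image_feature, text_like))
```

With `noise_level` below 1, the cosine similarity between the image feature and the generated feature cannot fall below `sqrt(1 - noise_level**2)`, so this parameter directly controls how far the pseudo text feature may drift from the image it came from.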

Related benchmarks

Task                               Dataset                        Result      Rank
Text-to-Image Generation           MS-COCO (val)                  FID 8.12    112
Text-to-Image Generation           MS-COCO                        FID 8.12    75
Text-to-Image Generation           MS-COCO 256x256 (val)          FID 8.12    53
Text-to-Image Generation           COCO 30k subset 2014 (val)     FID 8.12    46
Text-to-Image Generation           MS COCO zero-shot              FID 26.94   42
Text-to-Image Synthesis            COCO (test)                    FID 8.21    38
Text-to-Image Generation           COCO 256 x 256 2014 (val)      FID 8.12    37
Text-to-Image Synthesis            MSCOCO                         FID 8.12    31
Grounded Text-to-Image Generation  COCO 2014 (val)                FID 8.12    26
Text-to-Image Generation           MS-COCO Captions 30,000 (val)  FID-0 26.9  21
Showing 10 of 21 rows
