ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation

About

Conventional methods for image-text generation mainly tackle the naturally bidirectional generation tasks separately, focusing on designing task-specific frameworks to improve the quality and fidelity of the generated samples. Recently, Vision-Language Pre-training models have greatly improved the performance of image-to-text generation tasks, but large-scale pre-training models for the text-to-image synthesis task are still under-developed. In this paper, we propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation with a transformer model. Based on image quantization models, we formulate both image generation and text generation as autoregressive generative tasks conditioned on the text/image input. The bidirectional image-text generative modeling eases the semantic alignments across vision and language. For the text-to-image generation process, we further propose an end-to-end training method to jointly learn the visual sequence generator and the image reconstructor. To explore the landscape of large-scale pre-training for bidirectional text-image generation, we train a 10-billion parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs. The model achieves state-of-the-art performance on both text-to-image and image-to-text tasks, obtaining an FID of 7.9 on MS-COCO for text-to-image synthesis and the best results on COCO-CN and AIC-ICC for image captioning.
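The abstract frames both directions as autoregressive generation over discrete tokens, with images first quantized into codebook indices. The sketch below illustrates one way such training sequences could be laid out. It is not the authors' implementation: the vocabulary sizes, the `SEP` token, the shared-id layout, and the helper names `image_to_shared` and `build_example` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
TEXT_VOCAB_SIZE = 1000
IMAGE_CODEBOOK_SIZE = 512
# Hypothetical separator id, placed after both vocabularies in a shared id space.
SEP = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE


def image_to_shared(codes: List[int]) -> List[int]:
    """Shift image codebook indices so they do not collide with text token ids."""
    return [TEXT_VOCAB_SIZE + c for c in codes]


def build_example(
    text_ids: List[int], image_codes: List[int], direction: str
) -> Tuple[List[int], List[int]]:
    """Return (sequence, loss_mask) for one generation direction.

    loss_mask[i] is 1 where the token at position i is a prediction target,
    so the source (conditioning) tokens and the separator carry no loss.
    """
    image_ids = image_to_shared(image_codes)
    if direction == "text-to-image":
        source, target = text_ids, image_ids
    elif direction == "image-to-text":
        source, target = image_ids, text_ids
    else:
        raise ValueError(f"unknown direction: {direction}")
    sequence = source + [SEP] + target
    loss_mask = [0] * (len(source) + 1) + [1] * len(target)
    return sequence, loss_mask


if __name__ == "__main__":
    text = [5, 7]          # toy text token ids
    image = [0, 3, 511]    # toy image codebook indices from a quantizer
    print(build_example(text, image, "text-to-image"))
    print(build_example(text, image, "image-to-text"))
```

The same pair yields two training sequences, one per direction, which is what lets a single autoregressive model be trained on both tasks. In a real system the mask would select the positions that contribute to the next-token loss.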

Han Zhang, Weichong Yin, Yewei Fang, Lanxin Li, Boqiang Duan, Zhihua Wu, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang • 2021

Related benchmarks

Task	Dataset	Result
Text-to-Image Generation	MS-COCO 2014 (val)	FID 14.7	143
Text-to-Image Generation	MS-COCO 256x256 (val)	--	64
Text-to-Image Synthesis	MS-COCO (val)	FID 7.9	35
Text-to-Image Synthesis	MS COCO 256x256	FID 14.7	13
Image Captioning	AIC-ICC (val)	METEOR 41.7	4
Visual Question Answering	FMIQA (val)	Turing Test Passing Rate 78.5	4
Image Captioning	COCO-CN (test)	BLEU@4 50	2
Image Captioning	COCO-CN (Zh)	BLEU@4 50	2
Text-to-Image Synthesis	Human evaluation dataset 500 texts	Image Clarity 4.221	2

