Write and Paint: Generative Vision-Language Models are Unified Modal Learners
About
Recent advances in vision-language pre-training have pushed the state of the art on various vision-language tasks, making machines more capable of multi-modal writing (image-to-text generation) and painting (text-to-image generation). However, few studies investigate whether these two essential capabilities can be learned together and boost each other, yielding a versatile and powerful multi-modal foundation model. In this work, we reveal the potential of symmetric generative vision-language pre-training for learning to write and paint concurrently, and propose a new unified-modal model, named DaVinci, trained with prefix language modeling and prefix image modeling, a simple generative self-supervised objective on image-text pairs. Thanks to the proposed prefix multi-modal modeling framework, DaVinci is simple to train, scalable to huge datasets, adaptable to both writing and painting tasks, and also strong on other vision, text, and multi-modal understanding tasks. DaVinci achieves competitive performance on a wide range of 27 generation and understanding tasks, demonstrating the benefit of combining generative pre-training for vision and language. Furthermore, we carefully benchmark different vision-language pre-training objectives on pre-training datasets of different scales with heterogeneous and broad distribution coverage. Our results demonstrate the potential of exploiting self-supervision in both language and vision inputs, and establish new, stronger baselines for future comparisons at different data scales. The code and pre-trained models are available at https://github.com/shizhediao/DaVinci.
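The core idea of prefix language modeling is that the tokens in the prefix serve as visible context and contribute no loss, while the model is trained to predict only the suffix tokens. A minimal NumPy sketch of such a loss (an illustrative assumption, not the repository's actual implementation; the function name and shapes are hypothetical):

```python
import numpy as np

def prefix_lm_loss(logits, targets, prefix_len):
    """Cross-entropy loss for prefix language modeling (illustrative sketch).

    Tokens at positions < prefix_len are treated as visible context and
    are excluded from the loss; only suffix tokens are predicted.

    logits:  (seq_len, vocab_size) unnormalized next-token scores
    targets: (seq_len,) gold token ids
    prefix_len: number of context tokens to exclude from the loss
    """
    # numerically stable log-softmax over the vocabulary dimension
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # negative log-likelihood of each gold token
    nll = -log_probs[np.arange(len(targets)), targets]
    # mask out the prefix: average loss over suffix tokens only
    mask = np.arange(len(targets)) >= prefix_len
    return nll[mask].mean()
```

The prefix image modeling objective is symmetric: a prefix of image tokens is given as context and the remaining image tokens are generated, with the same style of loss masking.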
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy: 76.3 | 664 |
| Image Classification | ImageNet-1K | -- | 524 |
| Image Classification | Food-101 | Accuracy: 90.1 | 494 |
| Image Classification | DTD | Accuracy: 78.3 | 487 |
| Image Classification | Flowers102 | Accuracy: 96.9 | 478 |
| Image Classification | Stanford Cars | Accuracy: 74.6 | 477 |
| Natural Language Understanding | GLUE | SST-2: 91.4 | 452 |
| Image Classification | SUN397 | Accuracy: 78.0 | 425 |
| Image Classification | MNIST | Accuracy: 99.0 | 395 |
| Image Classification | CIFAR100 | Accuracy: 80.1 | 331 |