Write and Paint: Generative Vision-Language Models are Unified Modal Learners
About
Recent advances in vision-language pre-training have pushed the state of the art on various vision-language tasks, making machines more capable of multi-modal writing (image-to-text generation) and painting (text-to-image generation). However, few studies investigate whether these two essential capabilities can be learned together and boost each other, yielding a versatile and powerful multi-modal foundation model. In this work, we reveal the potential of symmetric generative vision-language pre-training for learning to write and paint concurrently, and propose a new unified-modal model, named DaVinci, trained with prefix language modeling and prefix image modeling, a simple generative self-supervised objective on image-text pairs. Thanks to the proposed prefix multi-modal modeling framework, DaVinci is simple to train, scalable to huge datasets, adaptable to both writing and painting tasks, and also strong on other vision, text, and multi-modal understanding tasks. DaVinci achieves competitive performance on a wide range of 27 generation and understanding tasks, demonstrating the benefit of combining generative pre-training for vision and language. Furthermore, we carefully benchmark different vision-language pre-training objectives on pre-training datasets of different scales with heterogeneous and broad distribution coverage. Our results demonstrate the potential of exploiting self-supervision in both language and vision inputs, and establish new, stronger baselines for future comparisons at different data scales. The code and pre-trained models are available at https://github.com/shizhediao/DaVinci.
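The core idea of prefix language modeling is that the tokens in the prefix serve as visible context and contribute no loss, while the model is trained to predict only the suffix tokens. A minimal NumPy sketch of such a loss (an illustrative assumption, not the repository's actual implementation; the function name and shapes are hypothetical):

```python
import numpy as np

def prefix_lm_loss(logits, targets, prefix_len):
    """Cross-entropy loss for prefix language modeling (illustrative sketch).

    Tokens at positions < prefix_len are treated as visible context and
    are excluded from the loss; only suffix tokens are predicted.

    logits:  (seq_len, vocab_size) unnormalized next-token scores
    targets: (seq_len,) gold token ids
    prefix_len: number of context tokens to exclude from the loss
    """
    # numerically stable log-softmax over the vocabulary dimension
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # negative log-likelihood of each gold token
    nll = -log_probs[np.arange(len(targets)), targets]
    # mask out the prefix: average loss over suffix tokens only
    mask = np.arange(len(targets)) >= prefix_len
    return nll[mask].mean()
```

The prefix image modeling objective is symmetric: a prefix of image tokens is given as context and the remaining image tokens are generated, with the same style of loss masking.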
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy: 76.3 | 664 |
| Image Classification | ImageNet-1K | -- | 524 |
| Image Classification | Food-101 | Accuracy: 90.1 | 494 |
| Image Classification | DTD | Accuracy: 78.3 | 487 |
| Image Classification | Flowers102 | Accuracy: 96.9 | 478 |
| Image Classification | Stanford Cars | Accuracy: 74.6 | 477 |
| Natural Language Understanding | GLUE | SST-2: 91.4 | 452 |
| Image Classification | SUN397 | Accuracy: 78.0 | 425 |
| Image Classification | MNIST | Accuracy: 99.0 | 395 |
| Image Classification | CIFAR100 | Accuracy: 80.1 | 331 |