X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
About
Vision-language pre-training aims to learn alignments between vision and language from a large amount of data. Most existing methods only learn image-text alignments. Some others utilize pre-trained object detectors to leverage vision-language alignments at the object level. In this paper, we propose to learn multi-grained vision-language alignments by a unified pre-training framework that learns multi-grained aligning and multi-grained localization simultaneously. Based on it, we present X$^2$-VLM, an all-in-one model with a flexible modular architecture, in which we further unify image-text pre-training and video-text pre-training in one model. X$^2$-VLM is able to learn unlimited visual concepts associated with diverse text descriptions. Experimental results show that X$^2$-VLM performs the best at base and large scale for both image-text and video-text tasks, making a good trade-off between performance and model scale. Moreover, we show that the modular design of X$^2$-VLM results in high transferability, allowing it to be utilized in any language or domain. For example, by simply replacing the text encoder with XLM-R, X$^2$-VLM outperforms state-of-the-art multilingual multi-modal pre-trained models without any multilingual pre-training. The code and pre-trained models are available at https://github.com/zengyan-97/X2-VLM.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 139.1 | 682 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 81.9 | 664 |
| Image Classification | Flowers102 | Accuracy | 96.4 | 478 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 81.8 | 466 |
| Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 | 99.1 | 439 |
| Text-to-Image Retrieval | Flickr30K 1K (test) | R@1 | 91.8 | 375 |
| Natural Language Visual Reasoning | NLVR2 (test-p) | Accuracy | 89.4 | 327 |
| Image Classification | ImageNet | Top-1 Accuracy | 82.2 | 324 |
| Natural Language Visual Reasoning | NLVR2 (dev) | Accuracy | 86.2 | 288 |
| Text-to-Image Retrieval | MSCOCO 5K (test) | R@1 | 67.7 | 286 |