mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
About
Large-scale pre-trained foundation models have become an emerging paradigm for building artificial intelligence (AI) systems that can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from low computational efficiency and information asymmetry, both caused by the long visual sequence in cross-modal alignment. To address these problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections, which create inter-layer shortcuts that skip a certain number of layers of time-consuming full self-attention on the vision side. mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability when directly transferred to multiple video-language tasks.
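To make the efficiency argument concrete, the following is a minimal cost sketch and not the authors' implementation. It counts attention-score pairs as a rough proxy for compute and compares running full self-attention over the joint vision and text sequence at every layer against running it only once per block of `skip + 1` layers. The function names, the `skip` parameter, and the assumption that the skipped layers update text tokens only (text self-attention plus text-to-vision cross-attention) are illustrative choices, not details stated in the abstract.

```python
def full_attention_cost(num_vision: int, num_text: int, num_layers: int) -> int:
    """Attention-score pairs when every layer self-attends over all vision+text tokens."""
    seq_len = num_vision + num_text
    return num_layers * seq_len * seq_len


def skip_connected_cost(num_vision: int, num_text: int, num_layers: int, skip: int) -> int:
    """Attention-score pairs when full self-attention runs once per (skip + 1) layers.

    Assumption (hypothetical, for illustration): each of the `skip` cheaper layers
    only updates text tokens, via text self-attention and text-to-vision
    cross-attention, so the long vision sequence is not self-attended there.
    """
    if skip < 0 or num_layers % (skip + 1) != 0:
        raise ValueError("num_layers must be a multiple of skip + 1, with skip >= 0")
    blocks = num_layers // (skip + 1)
    cheap_layer = num_text * num_text + num_text * num_vision
    full_layer = (num_vision + num_text) ** 2
    return blocks * (skip * cheap_layer + full_layer)


if __name__ == "__main__":
    # Toy sizes chosen only to show the trend: vision tokens dominate the sequence.
    baseline = full_attention_cost(num_vision=576, num_text=32, num_layers=6)
    skipped = skip_connected_cost(num_vision=576, num_text=32, num_layers=6, skip=2)
    print(f"full: {baseline}, skip-connected: {skipped}, ratio: {skipped / baseline:.2f}")
```

Under these assumptions the saving grows with the ratio of vision tokens to text tokens, since the skipped layers avoid the quadratic term in the vision sequence length.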
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 1.551 | 682 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 81.27 | 664 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 81.26 | 466 |
| Text-to-Image Retrieval | Flickr30K | R@1 | 86.4 | 460 |
| Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 | 97.6 | 439 |
| Text-to-Image Retrieval | Flickr30K 1K (test) | R@1 | 88.4 | 375 |
| Video Question Answering | MSRVTT-QA (test) | Accuracy | 21.1 | 371 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 81.27 | 337 |
| Natural Language Visual Reasoning | NLVR2 (test-p) | Accuracy | 84.95 | 327 |
| Image-to-Text Retrieval | MS-COCO 5K (test) | R@1 | 82.8 | 299 |