Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning
About
Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correct correspondences across different modalities. Inspired by the success of masked language modeling (MLM) in NLP pre-training, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of these tasks is to reconstruct the masked tokens from the visible context, which learns local-to-local alignment. However, most of them pay little attention to the global semantic features generated for the masked data, which limits the cross-modal alignment ability of global representations. In this paper, we therefore propose a novel Semantic Completion Learning (SCL) task, complementary to existing masked modeling tasks, to facilitate global-to-local alignment. Specifically, the SCL task completes the missing semantics of masked data by capturing the corresponding information from the other modality, which encourages the model to learn more representative global features; these features have a great impact on the performance of downstream tasks. Moreover, we present a flexible vision encoder that enables our model to perform image-text and video-text multimodal tasks simultaneously. Experimental results show that our proposed method obtains state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.
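The abstract does not spell out the SCL objective, so the following is only a minimal sketch under stated assumptions: that the global feature of a masked input (after it has attended to the other modality) is pulled toward the global feature of the same input without masking, using an InfoNCE-style contrastive loss over a batch. The function names, the temperature value, and the toy feature vectors are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss where keys[i] is the positive for queries[i].

    All other keys in the batch act as negatives.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i against every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def semantic_completion_loss(masked_global, complete_global, temperature=0.07):
    """Assumed SCL-style loss: align the global feature of each masked sample
    with the global feature of the same sample seen without masking."""
    return info_nce(masked_global, complete_global, temperature)


# Toy global features for a batch of four complete (unmasked) samples.
complete = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Masked samples whose semantics were mostly recovered from the other modality.
recovered = [
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.9, 0.1, 0.0],
    [0.0, 0.0, 0.9, 0.1],
    [0.1, 0.0, 0.0, 0.9],
]
# Masked samples whose global features point at the wrong sample.
mismatched = complete[1:] + complete[:1]

recovered_loss = semantic_completion_loss(recovered, complete)
mismatched_loss = semantic_completion_loss(mismatched, complete)
```

In this toy setup the loss is small when each masked sample's global feature lands close to its own complete counterpart and large when it lands on another sample's, which is the global-level behavior the abstract attributes to SCL. A real model would compute these features with the vision and text encoders rather than hand-written vectors.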
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA v2 (test-std) | Accuracy 78.78 | 466 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy 78.72 | 337 |
| Natural Language Visual Reasoning | NLVR2 (test-p) | Accuracy 84.27 | 327 |
| Natural Language Visual Reasoning | NLVR2 (dev) | Accuracy 83.63 | 288 |
| Image Retrieval | Flickr30k (test) | R@1 84.56 | 195 |
| Text-to-Video Retrieval | LSMDC | R@1 32.8 | 154 |
| Text Retrieval | Flickr30k (test) | R@1 95.9 | 89 |
| Text-to-Video Retrieval | MSRVTT | R@1 43.2 | 75 |
| Image-Text Retrieval | COCO (test) | -- | 37 |