Leveraging per Image-Token Consistency for Vision-Language Pre-training

About

Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeling (CMLM) to learn vision-language associations. However, we find that CMLM is insufficient for this purpose according to our observations: (1) Modality bias: a considerable number of masked tokens in CMLM can be recovered with only the language information, ignoring the visual inputs. (2) Under-utilization of the unmasked tokens: CMLM primarily focuses on the masked tokens and cannot simultaneously leverage the other tokens to learn vision-language associations. To address these limitations, we propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training). In EPIC, for each image-sentence pair, we mask tokens that are salient to the image (i.e., Saliency-based Masking Strategy) and replace them with alternatives sampled from a language model (i.e., Inconsistent Token Generation Procedure). The model is then required to determine, for each token in the sentence, whether it is consistent with the image (i.e., Image-Token Consistency Task). The proposed EPIC method can be easily combined with existing pre-training methods. Extensive experiments show that combining EPIC with state-of-the-art pre-training approaches, including ViLT, ALBEF, METER, and X-VLM, leads to significant improvements on downstream tasks. The code is released at https://github.com/gyhdog99/epic.
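
The three steps named in the abstract (masking salient tokens, replacing them with sampled alternatives, and labeling each token as consistent or inconsistent) can be illustrated with a toy sketch. This is not the authors' implementation: the saliency scores are assumed to be given, `propose` is a hypothetical stand-in for the language model, `mask_ratio` is an illustrative parameter, and treating a sampled token that happens to equal the original as consistent is an assumption made here.

```python
import random
from typing import Callable, List, Sequence, Tuple


def build_consistency_example(
    tokens: Sequence[str],
    saliency: Sequence[float],
    propose: Callable[[Sequence[str], int], str],
    mask_ratio: float = 0.3,
) -> Tuple[List[str], List[int]]:
    """Return (corrupted_tokens, labels).

    labels[i] is 1 if token i is still the original (consistent with the
    image) and 0 if it was replaced by a different token (inconsistent).
    The saliency scores are assumed to be computed elsewhere.
    """
    if len(tokens) != len(saliency):
        raise ValueError("tokens and saliency must have the same length")

    # Step 1 (saliency-based masking): pick the most salient positions.
    num_masked = max(1, round(mask_ratio * len(tokens)))
    order = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    masked_positions = order[:num_masked]

    corrupted = list(tokens)
    labels = [1] * len(tokens)
    for i in masked_positions:
        # Step 2 (inconsistent token generation): sample an alternative.
        candidate = propose(tokens, i)
        corrupted[i] = candidate
        # Step 3 (image-token consistency labels): a replacement identical
        # to the original is still labeled consistent (assumption).
        if candidate != tokens[i]:
            labels[i] = 0
    return corrupted, labels


def make_toy_proposer(vocabulary: Sequence[str], seed: int = 0):
    """Hypothetical stand-in for a language model: samples uniformly."""
    rng = random.Random(seed)

    def propose(tokens: Sequence[str], position: int) -> str:
        return rng.choice(list(vocabulary))

    return propose


if __name__ == "__main__":
    caption = ["a", "dog", "on", "a", "red", "sofa"]
    scores = [0.01, 0.9, 0.05, 0.01, 0.6, 0.8]  # made-up saliency values
    proposer = make_toy_proposer(["cat", "dog", "bench", "sofa", "blue", "red"])
    print(build_consistency_example(caption, scores, proposer, mask_ratio=0.5))
```

In a real pre-training setup the per-token labels would supervise a binary classifier over the fused image-text representation of every token, so unmasked tokens also contribute to the training signal.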

Yunhao Gou, Tom Ko, Hansi Yang, James Kwok, Yu Zhang, Mingxuan Wang • 2022

Related benchmarks

Task	Dataset	Result
Natural Language Visual Reasoning	NLVR2 (dev)	Accuracy 85.2	307
Image Retrieval	MS-COCO 5K (test)	R@1 64.1	217
Text Retrieval	MS-COCO 5K (test)	R@1 81	182
Text Retrieval	Flickr30K 1K (test)	R@1 95.8	82
Visual Entailment	SNLI-VE (dev)	Accuracy 82.1	71
Image Retrieval	Flickr30K 1K (test)	R@1 85.1	70
Visual Question Answering	VQA v2 (std)	Accuracy 78.7	31
Visual Question Answering	VQA v2 (dev)	Accuracy 78.6	30
Natural Language Visual Reasoning	NLVR2 (std)	Accuracy 85.5	14
Visual Entailment	SNLI-VE (std)	Accuracy 82.3	8
