UNITER: UNiversal Image-TExt Representation Learning
About
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Unlike previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves a new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR$^2$. Code is available at https://github.com/ChenRocks/UNITER.
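The conditional masking idea described above can be illustrated with a minimal sketch. It is not taken from the UNITER codebase: the function name `conditional_mask`, the 15% masking probability, the 50/50 choice between modalities, and the `[MASK]` / `<REGION_MASK>` placeholders are all assumptions made for illustration. The only property it aims to show is that each sample masks one modality while the other stays fully observed.

```python
import random

MASK = "[MASK]"                # hypothetical placeholder for a masked word
REGION_MASK = "<REGION_MASK>"  # hypothetical placeholder for a masked region


def conditional_mask(words, regions, mask_prob=0.15, rng=None):
    """Mask exactly one modality; the other is left fully observed.

    Assumes `words` and `regions` are non-empty sequences.
    Returns (task, words, regions, masked_indices).
    """
    rng = rng or random.Random()
    words, regions = list(words), list(regions)

    # Assumption: pick the modality to mask with equal probability.
    if rng.random() < 0.5:
        task, target, placeholder = "mlm", words, MASK
    else:
        task, target, placeholder = "mrm", regions, REGION_MASK

    # Mask each position of the chosen modality independently.
    masked = [i for i in range(len(target)) if rng.random() < mask_prob]
    if not masked:
        # Guarantee at least one masked position so the sample is useful.
        masked = [rng.randrange(len(target))]
    for i in masked:
        target[i] = placeholder

    return task, words, regions, masked


if __name__ == "__main__":
    task, w, r, idx = conditional_mask(
        ["a", "dog", "on", "grass"], ["region_0", "region_1", "region_2"]
    )
    print(task, w, r, idx)
```

Joint random masking, by contrast, would apply the masking loop to both lists in the same sample, so a masked word could coincide with a masked region that depicts it.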
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 73.82 | 664 |
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc) | 89.7 | 504 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 74.03 | 466 |
| Text-to-Image Retrieval | Flickr30K | R@1 | 75.6 | 460 |
| Natural Language Understanding | GLUE | SST-2 | 89.7 | 452 |
| Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 | 87.3 | 439 |
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1 | 75.6 | 423 |
| Image-to-Text Retrieval | Flickr30K | R@1 | 87.3 | 379 |
| Text-to-Image Retrieval | Flickr30K 1K (test) | R@1 | 75.6 | 375 |
| Image-to-Text Retrieval | Flickr30k (test) | R@1 | 87.3 | 370 |