Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
About
Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA), and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data, followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods trained on orders of magnitude more data. Code is available at https://github.com/microsoft/FIBER.
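To make the fusion-in-the-backbone idea concrete, the following is a minimal NumPy sketch of the core mechanism described above: a cross-attention branch inserted into a uni-modal backbone block, added to the residual stream through a scalar gate so that the pre-trained backbone is unperturbed when the gate is zero. All names (`fused_backbone_block`, `alpha`, the parameter dictionary) are illustrative assumptions, not the actual FIBER implementation; self-attention and feed-forward sublayers are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, w_q, w_k, w_v):
    # queries: (Lq, d) tokens of one modality; keys_values: (Lkv, d) of the other.
    q = queries @ w_q
    k = keys_values @ w_k
    v = keys_values @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product attention
    return softmax(scores, axis=-1) @ v

def fused_backbone_block(img_tokens, txt_tokens, params, alpha=0.0):
    # Gated cross-attention inserted inside the backbone blocks:
    # image tokens attend to text tokens and vice versa. With alpha = 0
    # the block reduces to the original uni-modal computation (identity here).
    img_out = img_tokens + alpha * cross_attention(img_tokens, txt_tokens, *params["i2t"])
    txt_out = txt_tokens + alpha * cross_attention(txt_tokens, img_tokens, *params["t2i"])
    return img_out, txt_out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8
    img = rng.standard_normal((4, d))   # 4 image tokens
    txt = rng.standard_normal((3, d))   # 3 text tokens
    params = {"i2t": [rng.standard_normal((d, d)) for _ in range(3)],
              "t2i": [rng.standard_normal((d, d)) for _ in range(3)]}
    # Gate at zero: the fused block leaves both streams unchanged.
    i_out, t_out = fused_backbone_block(img, txt, params, alpha=0.0)
    print(np.allclose(i_out, img), np.allclose(t_out, txt))
```

The zero-initialized gate is what lets fusion be "pushed deep" without disturbing the pre-trained image and text backbones at the start of training; the gate then learns how much cross-modal information each block admits.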
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 58.4 | 2454 |
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 144.4 | 682 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 78.55 | 664 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 78.46 | 466 |
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1 | 81.44 | 423 |
| Image-to-Text Retrieval | Flickr30k (test) | Recall@1 | 92.9 | 370 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 85.74 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 90.68 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 92.59 | 333 |
| Natural Language Visual Reasoning | NLVR2 (test-p) | Accuracy | 85.52 | 327 |