12-in-1: Multi-Task Vision and Language Representation Learning
About
Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training regime. Our approach culminates in a single model trained on 12 datasets from four broad categories of tasks: visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multi-task framework to perform an in-depth analysis of the effect of jointly training on diverse tasks. Further, we show that finetuning task-specific models from our single multi-task model can lead to further improvements, achieving performance at or above the state-of-the-art.
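The parameter savings come from sharing one vision-and-language trunk across all tasks and keeping only small task-specific output heads. Below is a minimal sketch of that idea, not the paper's actual code: the module names (`SharedTrunk`, `MultiTaskModel`), the head shapes, the 3129-way answer vocabulary, and the round-robin task schedule are illustrative assumptions.

```python
# Sketch of multi-task parameter sharing: one shared trunk, per-task heads.
# Names and shapes are illustrative, not the released 12-in-1 implementation.
import torch
import torch.nn as nn


class SharedTrunk(nn.Module):
    """Stand-in for a shared multi-modal encoder (e.g., a ViLBERT-style model)."""

    def __init__(self, dim=768):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, image_feats, text_feats):
        # Fuse pooled image and text features into one joint representation.
        return self.fuse(torch.cat([image_feats, text_feats], dim=-1))


class MultiTaskModel(nn.Module):
    def __init__(self, dim=768, num_answers=3129):
        super().__init__()
        self.trunk = SharedTrunk(dim)  # parameters shared by every task
        # Lightweight task-specific heads; only these differ across tasks.
        self.heads = nn.ModuleDict({
            "vqa": nn.Linear(dim, num_answers),   # answer classification
            "retrieval": nn.Linear(dim, 1),       # image-text matching score
            "refer": nn.Linear(dim, 4),           # referring-expression box
            "verification": nn.Linear(dim, 2),    # e.g., NLVR2 true/false
        })

    def forward(self, task, image_feats, text_feats):
        joint = self.trunk(image_feats, text_feats)
        return self.heads[task](joint)


if __name__ == "__main__":
    model = MultiTaskModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # Toy round-robin schedule; real multi-task training samples batches from
    # each dataset according to its size and a task-sampling policy.
    criterion = nn.CrossEntropyLoss()
    for task, num_classes in [("vqa", 3129), ("verification", 2)]:
        img = torch.randn(8, 768)
        txt = torch.randn(8, 768)
        target = torch.randint(0, num_classes, (8,))
        logits = model(task, img, txt)
        loss = criterion(logits, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Because every task's gradient flows through the same trunk, skills learned on one dataset (e.g., grounding referring expressions) can transfer to the others, which is the effect the multi-task analysis in the paper measures.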
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual Question Answering | GQA | Accuracy | 60 | 963 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 73.15 | 664 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 73.4 | 466 |
| Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 | 67.9 | 439 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 73.15 | 337 |
| Natural Language Visual Reasoning | NLVR2 (test-p) | Accuracy | 78.87 | 327 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 76.35 | 291 |
| Natural Language Visual Reasoning | NLVR2 (dev) | Accuracy | 77.14 | 288 |
| Visual Entailment | SNLI-VE (test) | Overall Accuracy | 76.95 | 197 |
| Image Retrieval | Flickr30k (test) | R@1 | 67.9 | 195 |