12-in-1: Multi-Task Vision and Language Representation Learning

About

Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training regime. Our approach culminates in a single model on 12 datasets from four broad categories of task including visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multi-task framework to perform in-depth analysis of the effect of joint training diverse tasks. Further, we show that finetuning task-specific models from our single multi-task model can lead to further improvements, achieving performance at or above the state-of-the-art.
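As a rough check on the parameter figures quoted above, the following minimal sketch works out the size of the reduction. It uses only the two approximate numbers from the abstract. The constant and function names are hypothetical and chosen for illustration; they do not come from the authors' code.

```python
# Figures quoted in the abstract (both approximate).
SINGLE_TASK_TOTAL_PARAMS = 3_000_000_000  # ~3 billion across the independently trained models
MULTI_TASK_PARAMS = 270_000_000           # ~270 million for the single multi-task model


def reduction_summary(before: int, after: int) -> tuple[float, float]:
    """Return (compression factor, fraction of parameters removed)."""
    factor = before / after
    fraction_removed = 1 - after / before
    return factor, fraction_removed


# Hypothetical helper applied to the abstract's numbers.
factor, fraction_removed = reduction_summary(SINGLE_TASK_TOTAL_PARAMS, MULTI_TASK_PARAMS)
print(f"{factor:.1f}x fewer parameters ({fraction_removed:.0%} reduction)")
```

With the abstract's rounded figures this prints `11.1x fewer parameters (91% reduction)`. Since both inputs are approximations, the result should be read as an order-of-magnitude estimate.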

Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee • 2019

Related benchmarks

Task                                Dataset               Result                  Rank
Visual Question Answering           GQA                   Accuracy 60             963
Visual Question Answering           VQA v2 (test-dev)     Overall Accuracy 73.15  664
Visual Question Answering           VQA v2 (test-std)     Accuracy 73.4           466
Image-to-Text Retrieval             Flickr30K 1K (test)   R@1 67.9                439
Visual Question Answering           VQA 2.0 (test-dev)    Accuracy 73.15          337
Natural Language Visual Reasoning   NLVR2 (test-p)        Accuracy 78.87          327
Referring Expression Comprehension  RefCOCOg (test)       Accuracy 76.35          291
Natural Language Visual Reasoning   NLVR2 (dev)           Accuracy 77.14          288
Visual Entailment                   SNLI-VE (test)        Overall Accuracy 76.95  197
Image Retrieval                     Flickr30k (test)      R@1 67.9                195
Showing 10 of 32 rows
