
Unifying Vision-and-Language Tasks via Text Generation

About

Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task: for example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning. To alleviate these hassles, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches performance comparable to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach generalizes better on questions that have rare answers. We also show that our framework allows multi-task learning in a single architecture with a single set of parameters, achieving performance similar to separately optimized single-task models. Our code is publicly available at: https://github.com/j-min/VL-T5
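
As a rough illustration of the idea of casting different tasks as conditional text generation, the Python sketch below formats examples from three tasks into a shared (source text, target text) shape. The task prefixes, the `<vis_N>` region token format, and all function and class names here are illustrative assumptions rather than details confirmed by this page, and the visual input is omitted entirely.

```python
from dataclasses import dataclass


@dataclass
class TextToTextExample:
    """One training pair: text fed to the model and the text it should generate."""
    source: str  # textual input (visual features would accompany this in practice)
    target: str  # label expressed as plain text


def vqa_example(question: str, answer: str) -> TextToTextExample:
    # Hypothetical prefix "vqa:"; the answer is generated as text, not classified.
    return TextToTextExample(source=f"vqa: {question}", target=answer)


def grounding_example(phrase: str, region_index: int) -> TextToTextExample:
    # Hypothetical prefix and region token; the referred region is named by a text token.
    return TextToTextExample(
        source=f"visual grounding: {phrase}",
        target=f"<vis_{region_index}>",
    )


def caption_example(caption: str) -> TextToTextExample:
    # Hypothetical prefix "caption:"; the target is the caption itself.
    return TextToTextExample(source="caption:", target=caption)


if __name__ == "__main__":
    examples = [
        vqa_example("what is the man holding?", "umbrella"),
        grounding_example("the yellow umbrella", 3),
        caption_example("a man holding a yellow umbrella"),
    ]
    for ex in examples:
        print(f"{ex.source!r} -> {ex.target!r}")
```

Because every task reduces to the same pair shape, a single sequence-to-sequence model with one language modeling loss could in principle be trained on all of them, which is the property the abstract emphasizes.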

Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal • 2021

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Visual Question Answering | VQA v2 | Accuracy 31.8 | 1165 |
| Visual Question Answering | GQA | Accuracy 19.6 | 963 |
| Image Captioning | MS COCO Karpathy (test) | CIDEr 1.166 | 682 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy 71.3 | 466 |
| Natural Language Visual Reasoning | NLVR2 (test-p) | Accuracy 74.6 | 327 |
| Visual Question Answering | OK-VQA (test) | Accuracy 5.8 | 296 |
| Referring Expression Comprehension | RefCOCOg (val) | -- | 291 |
| Referring Expression Comprehension | RefCOCOg (test) | -- | 291 |
| Natural Language Visual Reasoning | NLVR2 (dev) | Accuracy 73.6 | 288 |
| Chart Question Answering | ChartQA | -- | 229 |
Showing 10 of 56 rows
