Unifying Vision-and-Language Tasks via Text Generation
About
Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task: for example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning. To alleviate these hassles, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where the model learns to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which were previously modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches performance comparable to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization on questions with rare answers. We also show that our framework allows multi-task learning in a single architecture with a single set of parameters, achieving performance similar to separately optimized single-task models. Our code is publicly available at: https://github.com/j-min/VL-T5
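To make the unified formulation concrete, the sketch below shows how heterogeneous tasks can be cast into a single text-to-text format. This is an illustrative sketch, not the authors' code: the task prefixes (`vqa:`, `caption:`, `refer:`) and the `<vis_k>` region-id tokens are assumptions modeled on the paper's description, and the image features would be fed separately to a multimodal encoder.

```python
# Illustrative sketch (assumed format, not the authors' implementation):
# every task is reduced to generating target text from source text, so one
# architecture and one language modeling objective cover all of them.

def to_text2text(task: str, example: dict) -> tuple[str, str]:
    """Convert a task-specific example into a (source, target) text pair."""
    if task == "vqa":
        # Answers are generated as free-form text, not classified over a
        # fixed multi-label answer vocabulary.
        return (f"vqa: {example['question']}", example["answer"])
    if task == "caption":
        # Captioning is already generative; only a prefix is needed.
        return ("caption:", example["caption"])
    if task == "refer":
        # Referring expression comprehension: the target is the text token
        # naming the matching image region (hypothetical <vis_k> tokens).
        return (f"refer: {example['expression']}", f"<vis_{example['region_id']}>")
    raise ValueError(f"unknown task: {task}")


pairs = [
    to_text2text("vqa", {"question": "What is the cat doing?", "answer": "sleeping"}),
    to_text2text("caption", {"caption": "A cat sleeping on a couch."}),
    to_text2text("refer", {"expression": "the cat on the left", "region_id": 3}),
]
for src, tgt in pairs:
    print(f"{src!r} -> {tgt!r}")
```

Because every task shares this (source text, target text) interface, multi-task training reduces to mixing the converted examples into one stream for a single generative model.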
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 31.8 | 1165 |
| Visual Question Answering | GQA | Accuracy | 19.6 | 963 |
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 1.166 | 682 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 71.3 | 466 |
| Natural Language Visual Reasoning | NLVR2 (test-p) | Accuracy | 74.6 | 327 |
| Visual Question Answering | OK-VQA (test) | Accuracy | 5.8 | 296 |
| Referring Expression Comprehension | RefCOCOg (val) | -- | -- | 291 |
| Referring Expression Comprehension | RefCOCOg (test) | -- | -- | 291 |
| Natural Language Visual Reasoning | NLVR2 (dev) | Accuracy | 73.6 | 288 |
| Chart Question Answering | ChartQA | -- | -- | 229 |