# VQA: Visual Question Answering

## About
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
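The automatic evaluation mentioned above is typically done with a consensus-based accuracy: each question comes with ten human answers, and a predicted answer counts as fully correct if at least three annotators gave it. The sketch below is a simplified form of that metric (the official evaluation script additionally applies answer normalization and averages over subsets of nine annotators); the function name and the string normalization are illustrative assumptions, not the reference implementation.

```python
def vqa_accuracy(predicted_answer, human_answers):
    """Simplified consensus-based VQA accuracy.

    An answer is 100% correct if at least 3 of the human
    annotators provided it; partial credit otherwise.
    """
    # Light normalization (illustrative; the official script does more,
    # e.g. punctuation and article handling).
    normalized = predicted_answer.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == normalized)
    return min(matches / 3.0, 1.0)
```

For example, a prediction matching exactly two of the ten human answers would score 2/3 under this scheme, while any prediction matched by three or more annotators scores 1.0.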
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | NExT-QA (test) | Accuracy | 44.92 | 204 |
| Video Question Answering | NExT-QA (val) | Overall Acc | 44.24 | 176 |
| Visual Question Answering | VQA (test-dev) | Acc (All) | 58.97 | 147 |
| Image Captioning | MS-COCO (test) | CIDEr | 91 | 117 |
| Visual Question Answering | VQA (test-std) | -- | -- | 110 |
| Open-Ended Visual Question Answering | VQA 1.0 (test-dev) | Overall Accuracy | 57.8 | 100 |
| Audio-Visual Question Answering | MUSIC-AVQA 1.0 (test) | AV Localis Accuracy | 71.43 | 96 |
| Visual Question Answering (Multiple-choice) | VQA 1.0 (test-dev) | Accuracy (All) | 62.7 | 66 |
| Visual Question Answering | CLEVR (test) | Overall Accuracy | 52.3 | 61 |
| Audio-Visual Question Answering | MUSIC-AVQA (test) | Acc (Avg) | 65.18 | 59 |