Where To Look: Focus Regions for Visual Question Answering
About
We present a method that learns to answer visual questions by selecting image regions relevant to the text-based query. Our method shows significant improvements on questions such as "what color," where it must evaluate a specific location, and "what room," where it selectively identifies informative image regions. We evaluate on the VQA dataset, which is, to our knowledge, the largest human-annotated visual question answering dataset.
Kevin J. Shih, Saurabh Singh, Derek Hoiem • 2015
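The core mechanism the abstract describes is an attention step over candidate image regions: each region is scored against the question, the scores are softmaxed into weights, and the answer is predicted from the attention-weighted region features. The sketch below, in plain PyTorch, illustrates that general pattern; it is not the authors' implementation, and the layer sizes, the dot-product scoring rule, and all module and variable names (`RegionAttentionVQA`, `region_proj`, etc.) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAttentionVQA(nn.Module):
    """Minimal sketch: attend over image regions conditioned on the question."""

    def __init__(self, region_dim=4096, text_dim=300, hidden_dim=512, num_answers=1000):
        super().__init__()
        # Project region features and the question embedding into a shared space.
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Predict an answer from the attended image feature concatenated
        # with the question embedding.
        self.classifier = nn.Linear(hidden_dim * 2, num_answers)

    def forward(self, regions, question):
        # regions: (batch, num_regions, region_dim), e.g. CNN features per region
        # question: (batch, text_dim), e.g. pooled word embeddings
        r = self.region_proj(regions)                      # (B, N, H)
        q = self.text_proj(question)                       # (B, H)
        # Relevance of each region = inner product with the question embedding.
        scores = torch.bmm(r, q.unsqueeze(2)).squeeze(2)   # (B, N)
        weights = F.softmax(scores, dim=1)                 # attention over regions
        attended = (weights.unsqueeze(2) * r).sum(dim=1)   # (B, H)
        return self.classifier(torch.cat([attended, q], dim=1)), weights

# Toy usage with random features: 10 candidate regions per image.
model = RegionAttentionVQA()
regions = torch.randn(2, 10, 4096)
question = torch.randn(2, 300)
logits, weights = model(regions, question)
print(logits.shape, weights.shape)  # torch.Size([2, 1000]) torch.Size([2, 10])
```

The returned `weights` can also be inspected directly, which matches the paper's qualitative point: for "what color" questions the mass should concentrate on the queried object's region.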
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering (Multiple-choice) | VQA 1.0 (test-dev) | Accuracy (All) | 62.44 | 66 |
| Visual Question Answering (Multiple-choice) | VQA 1.0 (test-standard) | Accuracy (All) | 62.43 | 27 |
| Visual Question Answering (Multiple-choice) | VQA (test-dev) | Overall Accuracy | 62.4 | 17 |
| Visual Question Answering | VQA COCO 2015 v1.0 (test-dev) | Overall Accuracy | 60.96 | 16 |