Self-Critical Reasoning for Robust Visual Question Answering
About
Visual Question Answering (VQA) deep-learning systems tend to capture superficial statistical correlations in the training data because of strong language priors, and they fail to generalize to test data with a significantly different question-answer (QA) distribution. To address this issue, we introduce a self-critical training objective that ensures that the visual explanations of correct answers match the most influential image regions more than those of other competitive answer candidates. The influential regions are either determined from human visual/textual explanations or automatically from just the significant words in the question and answer. We evaluate our approach on the VQA generalization task using the VQA-CP dataset, achieving a new state of the art: 49.5% overall accuracy using textual explanations and 48.5% using automatically annotated regions.
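The core of the objective described above can be sketched as a margin-style penalty: for each influential region, a rival answer should not be more sensitive to that region than the ground-truth answer. The sketch below is a minimal, hedged illustration of that idea, not the paper's implementation; the function name `self_critical_loss` and the use of a precomputed sensitivity matrix (influence of each image region on each answer's score, e.g. obtained via gradients) are assumptions for illustration.

```python
import numpy as np

def self_critical_loss(sensitivities, influential, gt_idx, rival_idx):
    """Hedged sketch of a self-critical objective.

    sensitivities: (num_answers, num_regions) array; entry [a, k] is the
        influence of image region k on the score of answer a (assumed
        precomputed, e.g. from gradients of the answer score).
    influential: indices of the regions marked influential (from human
        explanations or QA keywords).
    Penalizes every influential region where the rival answer's
    sensitivity exceeds the ground-truth answer's sensitivity.
    """
    gap = sensitivities[rival_idx, influential] - sensitivities[gt_idx, influential]
    return float(np.maximum(gap, 0.0).sum())

# Toy example: 2 candidate answers, 3 image regions, region 0 is influential.
sens = np.array([[0.9, 0.1, 0.2],   # ground-truth answer attends to region 0
                 [0.3, 0.8, 0.1]])  # rival answer attends elsewhere
print(self_critical_loss(sens, [0], gt_idx=0, rival_idx=1))  # 0.0 (no penalty)
print(self_critical_loss(sens, [0], gt_idx=1, rival_idx=0))  # 0.6 (rival wins on region 0)
```

In training, this term would be added to the usual VQA classification loss, pushing the model's evidence for the correct answer toward the human-identified regions.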
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA 2.0 (val) | Accuracy (Overall) | 62.3 | 143 |
| Visual Question Answering | VQA-CP v2 (test) | Overall Accuracy | 49.45 | 109 |
| Visual Question Answering | VQA v2 (val) | Accuracy | 62.2 | 99 |
| Visual Question Answering | CLEVR-XAI OOD | Accuracy | 73.97 | 18 |
| Visual Question Answering | CLEVR-XAI ID | Accuracy | 86.73 | 18 |
| Visual Question Answering | GQA OOD (test) | Accuracy | 31.61 | 14 |
| Visual Question Answering | GQA 101k ID | Accuracy | 51.54 | 9 |
| Visual Question Answering | VQA-HAT ID | Accuracy | 37.24 | 9 |
| Visual Question Answering | VQA-HAT OOD | Accuracy | 0.2826 | 9 |