
Self-Critical Reasoning for Robust Visual Question Answering

About

Visual Question Answering (VQA) deep-learning systems tend to capture superficial statistical correlations in the training data because of strong language priors, and they fail to generalize to test data with a significantly different question-answer (QA) distribution. To address this issue, we introduce a self-critical training objective that ensures that visual explanations of correct answers match the most influential image regions more than other competitive answer candidates. The influential regions are either determined from human visual/textual explanations or automatically from just the significant words in the question and answer. We evaluate our approach on the VQA generalization task using the VQA-CP dataset, achieving a new state of the art: 49.5% accuracy using textual explanations and 48.5% using automatically annotated regions.

Jialin Wu, Raymond J. Mooney • 2019
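As a concrete illustration of the objective described above, here is a minimal PyTorch sketch of a self-critical ranking loss. It is not the authors' released implementation: the gradient-times-input influence proxy, the function names (answer_sensitivity, self_critical_loss), and the toy shapes are assumptions made for illustration, and the paper's actual influence measure and competitive-answer selection differ in detail.

```python
import torch

def answer_sensitivity(logits, features, answer_idx):
    # Sensitivity of one answer's score to each image region: clamped
    # gradient-times-input, a GradCAM-style proxy standing in for the
    # paper's influence measure (an assumption for this sketch).
    score = logits.gather(1, answer_idx.unsqueeze(1)).sum()
    grads = torch.autograd.grad(score, features, create_graph=True)[0]
    return (grads * features).sum(-1).clamp(min=0)  # (batch, num_regions)

def self_critical_loss(logits, features, gt_idx, rival_idx, region_mask):
    # Hinge penalty whenever a competitive rival answer is more sensitive
    # than the ground-truth answer to the annotated influential regions.
    s_gt = answer_sensitivity(logits, features, gt_idx)
    s_rival = answer_sensitivity(logits, features, rival_idx)
    return ((s_rival - s_gt) * region_mask).clamp(min=0).sum(-1).mean()

# Toy usage: 2 images, 5 region features, a 1000-answer classifier head.
features = torch.randn(2, 5, 256, requires_grad=True)
head = torch.randn(256, 1000)
logits = features.mean(dim=1) @ head          # hypothetical VQA answer scores
mask = torch.zeros(2, 5)
mask[:, 0] = 1.0                              # pretend region 0 is influential
loss = self_critical_loss(logits, features,
                          gt_idx=torch.tensor([3, 7]),
                          rival_idx=torch.tensor([9, 2]),
                          region_mask=mask)
loss.backward()                               # differentiable end to end
```

The hinge term only contributes when a rival answer is more sensitive than the ground truth to the annotated regions, mirroring the abstract's requirement that visual explanations of correct answers match the most influential regions more than competitive candidates.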

Related benchmarks

Task                       Dataset           Metric              Result  Rank
Visual Question Answering  VQA 2.0 (val)     Accuracy (Overall)  62.3    143
Visual Question Answering  VQA-CP v2 (test)  Overall Accuracy    49.45   109
Visual Question Answering  VQA v2 (val)      Accuracy            62.2    99
Visual Question Answering  CLEVR-XAI OOD     Accuracy            73.97   18
Visual Question Answering  CLEVR-XAI ID      Accuracy            86.73   18
Visual Question Answering  GQA OOD (test)    Accuracy            31.61   14
Visual Question Answering  GQA 101k ID       Accuracy            51.54   9
Visual Question Answering  VQA-HAT ID        Accuracy            37.24   9
Visual Question Answering  VQA-HAT OOD       Accuracy            0.2826  9

Other info

Code
