
Learning to Count Objects in Natural Images for Visual Question Answering

About

Visual Question Answering (VQA) models have struggled with counting objects in natural images so far. We identify a fundamental problem due to soft attention in these models as a cause. To circumvent this problem, we propose a neural network component that allows robust counting from object proposals. Experiments on a toy task show the effectiveness of this component and we obtain state-of-the-art accuracy on the number category of the VQA v2 dataset without negatively affecting other categories, even outperforming ensemble models with our single model. On a difficult balanced pair metric, the component gives a substantial improvement in counting over a strong baseline by 6.6%.
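The soft-attention problem mentioned above can be illustrated with a toy sketch. It assumes the common setup in which attention scores over object proposals are normalised with a softmax and then used to take a weighted average of the proposal features; the feature vectors, scores, and function names below are made up for illustration and are not taken from the authors' code. Under that assumption, an image with one cat and an image with two identical cats produce the same attended feature, so the count cannot be recovered from it.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(features, scores):
    # Weighted average of proposal features; the weights sum to 1.
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

# Hypothetical 2-d feature vector standing in for a "cat" proposal.
cat = [1.0, 0.0]

# One cat versus two identical cats, each proposal with the same score.
one_cat = soft_attention([cat], [5.0])
two_cats = soft_attention([cat, cat], [5.0, 5.0])

# Normalisation averages the two cats back into a single "cat" vector.
same = all(math.isclose(a, b) for a, b in zip(one_cat, two_cats))
print(one_cat, two_cats, same)
```

Because the weights are forced to sum to 1, adding a second identical object halves each weight rather than doubling the output, which is one way to see why a separate counting component operating on the proposals themselves could help.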

Yan Zhang, Jonathon Hare, Adam Prügel-Bennett • 2018

Related benchmarks

Task | Dataset | Result | Rank
Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy: 68.09 | 664
Visual Question Answering | VQA v2 (test-std) | -- | 466
Visual Question Answering | VQAv2 (test) | VQA Accuracy: 68.41 | 72
Open-ended counting | HowMany-QA 1.0 (test) | Accuracy: 54.7 | 10
Open-ended counting | TallyQA Simple 1.0 (test) | Accuracy: 70.5 | 9
Open-ended counting | TallyQA Complex 1.0 (test) | Accuracy (ACC): 50.9 | 9
Open-ended counting | HowMany-QA (test) | Accuracy: 0.561 | 6
