All You May Need for VQA are Image Captions

About

Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. In this paper, we propose a method that automatically derives VQA examples at volume, by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation. We show that the resulting data is of high quality. VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits and achieve a level of robustness that is lacking in the same model trained on human-annotated VQA data.

Soravit Changpinyo, Doron Kukliansky, Idan Szpektor, Xi Chen, Nan Ding, Radu Soricut • 2022

Related benchmarks

Task | Dataset | Result | Rank
Visual Question Answering | OK-VQA (test) | Accuracy: 19.8 | 296
Visual Question Answering | VQA 2.0 (val) | Accuracy (Overall): 61.1 | 143