
Leveraging Visual Question Answering for Image-Caption Ranking

About

Visual Question Answering (VQA) is the task of taking as input an image and a free-form natural language question about the image, and producing an accurate answer. In this work we view VQA as a "feature extraction" module to extract image and caption representations. We employ these representations for the task of image-caption ranking. Each feature dimension captures (imagines) whether a fact (question-answer pair) could plausibly be true for the image and caption. This allows the model to interpret images and captions from a wide variety of perspectives. We propose score-level and representation-level fusion models to incorporate VQA knowledge in an existing state-of-the-art VQA-agnostic image-caption ranking model. We find that incorporating and reasoning about consistency between images and captions significantly improves performance. Concretely, our model improves state-of-the-art on caption retrieval by 7.1% and on image retrieval by 4.4% on the MSCOCO dataset.
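The two fusion strategies mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a fixed set of question-answer "facts", and the hypothetical `img_facts` / `cap_facts` vectors stand in for the outputs of a trained VQA model (the plausibility of each fact given the image or the caption). The mixing weight `alpha` and the projection matrix are placeholders for parameters the paper would learn or tune on validation data.

```python
import numpy as np

def score_level_fusion(baseline_score, img_facts, cap_facts, alpha=0.5):
    """Score-level fusion (sketch): mix the VQA-agnostic ranking score
    with a consistency score between the image-side and caption-side
    fact vectors. `alpha` is a hypothetical mixing weight."""
    consistency = float(
        np.dot(img_facts, cap_facts)
        / (np.linalg.norm(img_facts) * np.linalg.norm(cap_facts))
    )
    return alpha * baseline_score + (1.0 - alpha) * consistency

def representation_level_fusion(base_embedding, fact_vector, projection):
    """Representation-level fusion (sketch): concatenate the VQA-agnostic
    embedding with the VQA fact vector and apply a projection (learned in
    the paper; supplied by the caller here) to get a fused representation."""
    fused = np.concatenate([base_embedding, fact_vector])
    return np.tanh(projection @ fused)
```

In this sketch, score-level fusion leaves the baseline ranker untouched and only re-weights its scores, while representation-level fusion changes the embedding itself so the downstream ranking is computed in the fused space.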

Xiao Lin, Devi Parikh · 2016

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1 | 24.9 | 423 |
| Image-to-Text Retrieval | Flickr30k (test) | R@1 | 33.9 | 370 |
| Image-to-Text Retrieval | MS-COCO 5K (test) | R@1 | 23.5 | 299 |
| Text-to-Image Retrieval | MSCOCO 5K (test) | R@1 | 23.5 | 286 |
| Image Retrieval | Flickr30k (test) | R@1 | 24.9 | 195 |
| Image Retrieval | Flickr30K | R@1 | 25 | 144 |
| Image Retrieval | MS-COCO 1K (test) | R@1 | 37 | 128 |
| Text-to-Image Retrieval | MSCOCO (1K test) | R@1 | 37 | 104 |
| Image-to-Text Retrieval | MSCOCO (1K test) | R@1 | 50.5 | 82 |
| Image Search | Flickr8K | R@1 | 17.2 | 74 |

Showing 10 of 20 rows.
