Leveraging Visual Question Answering for Image-Caption Ranking
About
Visual Question Answering (VQA) is the task of taking as input an image and a free-form natural language question about the image, and producing an accurate answer. In this work we view VQA as a "feature extraction" module to extract image and caption representations. We employ these representations for the task of image-caption ranking. Each feature dimension captures (imagines) whether a fact (question-answer pair) could plausibly be true for the image and caption. This allows the model to interpret images and captions from a wide variety of perspectives. We propose score-level and representation-level fusion models to incorporate VQA knowledge in an existing state-of-the-art VQA-agnostic image-caption ranking model. We find that incorporating and reasoning about consistency between images and captions significantly improves performance. Concretely, our model improves state-of-the-art on caption retrieval by 7.1% and on image retrieval by 4.4% on the MSCOCO dataset.
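The score-level fusion idea above can be sketched in a few lines: a VQA-agnostic ranking score is blended with a similarity computed over VQA-derived features, where each feature dimension holds a belief that one question-answer pair is true for the image (or caption). The function names, the cosine similarity choice, and the blending weight `alpha` below are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of score-level fusion; names and alpha are illustrative,
# not the paper's exact model.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def fused_score(s_vqa_agnostic, img_vqa_feat, cap_vqa_feat, alpha=0.3):
    """Blend a VQA-agnostic image-caption score with a similarity over
    VQA features.  img_vqa_feat[k] / cap_vqa_feat[k] are the model's
    beliefs that question-answer pair k holds for the image / caption."""
    s_vqa = cosine(img_vqa_feat, cap_vqa_feat)
    return (1 - alpha) * s_vqa_agnostic + alpha * s_vqa
```

A caption whose imagined question-answer beliefs agree with the image's (high cosine) gets its baseline score boosted; an inconsistent caption is pushed down. Representation-level fusion would instead concatenate or transform the VQA features before scoring.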
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1 | 24.9 | 423 |
| Image-to-Text Retrieval | Flickr30k (test) | R@1 | 33.9 | 370 |
| Image-to-Text Retrieval | MS-COCO 5K (test) | R@1 | 23.5 | 299 |
| Text-to-Image Retrieval | MSCOCO 5K (test) | R@1 | 23.5 | 286 |
| Image Retrieval | Flickr30k (test) | R@1 | 24.9 | 195 |
| Image Retrieval | Flickr30K | R@1 | 25 | 144 |
| Image Retrieval | MS-COCO 1K (test) | R@1 | 37 | 128 |
| Text-to-Image Retrieval | MSCOCO (1K test) | R@1 | 37 | 104 |
| Image-to-Text Retrieval | MSCOCO (1K test) | R@1 | 50.5 | 82 |
| Image Search | Flickr8K | R@1 | 17.2 | 74 |