VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
About
We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That, combined with fine-tuning and use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
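The core change VSE++ makes is to the ranking loss: instead of summing hinge costs over all negatives in a mini-batch, the loss keeps only the hardest (max-violating) negative per query. A minimal NumPy sketch of this max-of-hinges loss, assuming L2-normalized embeddings and an illustrative margin of 0.2 (function name and exact formulation are a paraphrase, not the authors' released code):

```python
import numpy as np

def max_hinge_loss(im, s, margin=0.2):
    """Max-of-hinges triplet loss in the spirit of VSE++.

    im, s: L2-normalized image and caption embeddings, shape (n, d),
    where row i of `im` matches row i of `s`.
    """
    scores = im @ s.T                  # pairwise cosine similarities
    diag = np.diag(scores)[:, None]    # matched-pair similarities

    # Hinge costs for caption retrieval (rows) and image retrieval (columns).
    cost_s = np.clip(margin + scores - diag, 0, None)
    cost_im = np.clip(margin + scores - diag.T, 0, None)

    # Positive pairs are not negatives of themselves.
    np.fill_diagonal(cost_s, 0)
    np.fill_diagonal(cost_im, 0)

    # VSE++ change: use only the hardest negative per query
    # rather than summing over all negatives.
    return cost_s.max(axis=1).sum() + cost_im.max(axis=0).sum()
```

With perfectly separated embeddings the loss is zero; when every pair is equally similar, each query contributes the full margin.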
Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler • 2017
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Retrieval | Flickr30K | R@1 | 39.6 | 460 |
| Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 | 52.9 | 439 |
| Text-to-Image Retrieval | Flickr30K (test) | R@1 | 39.6 | 423 |
| Text-to-Image Retrieval | Flickr30K 1K (test) | R@1 | 39.6 | 375 |
| Image-to-Text Retrieval | Flickr30K (test) | R@1 | 52.9 | 370 |
| Image-to-Text Retrieval | MS-COCO 5K (test) | R@1 | 41.3 | 299 |
| Text-to-Image Retrieval | MS-COCO 5K (test) | R@1 | 41.3 | 286 |
| Text-to-Image Retrieval | MS-COCO 5K (test) | R@1 | 30.3 | 223 |
| Image Retrieval | MS-COCO 5K (test) | R@1 | 30.3 | 217 |
| Text-to-Video Retrieval | MSVD (test) | R@1 | 15.4 | 204 |
Showing 10 of 73 benchmark results.