
VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

About

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That, combined with fine-tuning and use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
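The "simple change to common loss functions" is the paper's max-of-hinges (MH) triplet loss: instead of summing the hinge penalties over all negatives in a batch, it keeps only the hardest (maximum-violating) negative for each positive pair. A minimal NumPy sketch of this loss, assuming an `n x n` similarity matrix whose diagonal holds the matching image-caption pairs (the function name and default margin are illustrative; the paper uses a margin α on cosine similarities):

```python
import numpy as np

def vse_pp_loss(sim, margin=0.2):
    """Max-of-hinges triplet loss in the style of VSE++.

    sim: (n, n) similarity matrix for n images (rows) and n captions
    (columns), where sim[i, i] is the score of the matching pair.
    """
    pos = np.diag(sim)  # scores of the ground-truth pairs
    # Hinge cost of every negative caption for each image (rows) ...
    cost_c = np.maximum(0.0, margin + sim - pos[:, None])
    # ... and of every negative image for each caption (columns).
    cost_i = np.maximum(0.0, margin + sim - pos[None, :])
    # The positive pairs on the diagonal are not negatives; zero them out.
    np.fill_diagonal(cost_c, 0.0)
    np.fill_diagonal(cost_i, 0.0)
    # Keep only the hardest negative per image and per caption,
    # rather than summing over all negatives.
    return (cost_c.max(axis=1) + cost_i.max(axis=0)).mean()
```

With easy negatives (every negative more than a margin below the positive) the loss is zero; a single violating negative contributes its full hinge cost, which is what focuses training on hard negatives.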

Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler • 2017

Related benchmarks

Task                       Dataset               Metric  Result  Rank
Text-to-Image Retrieval    Flickr30K             R@1     39.6    460
Image-to-Text Retrieval    Flickr30K 1K (test)   R@1     52.9    439
Text-to-Image Retrieval    Flickr30k (test)      R@1     39.6    423
Text-to-Image Retrieval    Flickr30K 1K (test)   R@1     39.6    375
Image-to-Text Retrieval    Flickr30k (test)      R@1     52.9    370
Image-to-Text Retrieval    MS-COCO 5K (test)     R@1     41.3    299
Text-to-Image Retrieval    MSCOCO 5K (test)      R@1     41.3    286
Text-to-Image Retrieval    MS-COCO 5K (test)     R@1     30.3    223
Image Retrieval            MS-COCO 5K (test)     R@1     30.3    217
Text-to-Video Retrieval    MSVD (test)           R@1     15.4    204
Showing 10 of 73 rows
...
