
Finding beans in burgers: Deep semantic-visual embedding with localization

About

Several works have proposed to learn a two-path neural network that maps images and texts, respectively, to the same shared Euclidean space, where geometry captures useful semantic relationships. Such a multi-modal embedding can be trained and used for various tasks, notably image captioning. In the present work, we introduce a new architecture of this type, with a visual path that leverages recent space-aware pooling mechanisms. Combined with a textual path that is jointly trained from scratch, our semantic-visual embedding offers a versatile model. Once trained under the supervision of captioned images, it yields new state-of-the-art performance on cross-modal retrieval. It also allows the localization of new concepts from the embedding space into any input image, delivering state-of-the-art results on the visual grounding of phrases.
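
To make the cross-modal retrieval setting concrete, here is a minimal sketch of how Recall@K can be computed once texts and images live in one shared space. It is an illustration under stated assumptions, not the authors' evaluation code: the function names `normalize` and `recall_at_k`, the toy 2-D vectors, the use of cosine similarity, and the one-ground-truth-image-per-query setup are all hypothetical choices made for this example.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def recall_at_k(text_emb, image_emb, gt_image_idx, k=1):
    """Fraction of text queries whose ground-truth image is among the
    k most similar images (cosine similarity in the shared space)."""
    images = [normalize(v) for v in image_emb]
    hits = 0
    for query, gt in zip(text_emb, gt_image_idx):
        q = normalize(query)
        # Cosine similarity between this query and every image.
        sims = [sum(a * b for a, b in zip(q, img)) for img in images]
        # Image indices sorted from most to least similar.
        ranked = sorted(range(len(images)), key=lambda i: -sims[i])
        hits += gt in ranked[:k]
    return hits / len(text_emb)

# Toy data (assumed, for illustration only): 3 images, 3 captions.
image_emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
text_emb = [[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]
gt = [0, 1, 2]  # caption i describes image gt[i]

r1 = recall_at_k(text_emb, image_emb, gt, k=1)
r3 = recall_at_k(text_emb, image_emb, gt, k=3)
```

On unit-norm vectors, ranking by cosine similarity gives the same order as ranking by Euclidean distance, so the sketch is consistent with a shared Euclidean space as described in the abstract.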

Martin Engilberge, Louis Chevallier, Patrick Pérez, Matthieu Cord • 2018

Related benchmarks

Task                      Dataset                 Result           Rank
Text-to-Image Retrieval   Flickr30k (test)        Recall@1 34.9    423
Image Classification      RSNA                    AUC 57.8         42
Linear Classification     RSNA (test)             --               39
Linear Classification     CheXpert v1.0 (test)    AUC (1%) 50.1    12
