
Conditional Image-Text Embedding Networks

About

This paper presents an approach for grounding phrases in images that jointly learns multiple text-conditioned embeddings in a single end-to-end model. To differentiate text phrases into semantically distinct subspaces, we propose a concept weight branch that automatically assigns phrases to embeddings, whereas prior works predefine such assignments. Our proposed solution simplifies the representation requirements for individual embeddings and allows underrepresented concepts to take advantage of the shared representations before feeding them into concept-specific layers. Comprehensive experiments verify the effectiveness of our approach across three phrase grounding datasets (Flickr30K Entities, ReferIt Game, and Visual Genome), where we obtain improvements in grounding performance of 4%, 3%, and 4%, respectively, over a strong region-phrase embedding baseline.
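The idea in the abstract can be loosely illustrated with a minimal sketch: a shared layer feeds several concept-specific heads, and a concept weight branch turns the phrase features into a soft assignment over those heads. Everything concrete here is an assumption made for illustration and is not taken from the paper: the elementwise-product fusion, the ReLU shared layer, the softmax weighting, the layer sizes, and the hypothetical names `init_params` and `conditional_score`.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    # Multiply a row-major matrix (list of rows) by a vector.
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def init_params(dim, hidden, num_concepts, seed=0):
    # Hypothetical random initialisation; a real model would learn these.
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "shared": mat(hidden, dim),          # shared layer on the joint feature
        "heads": mat(num_concepts, hidden),  # one scoring head per concept
        "concept": mat(num_concepts, dim),   # concept weight branch
    }


def conditional_score(region_feat, phrase_feat, params):
    # Assumed fusion: elementwise product of region and phrase features.
    joint = [r * p for r, p in zip(region_feat, phrase_feat)]
    # Shared representation used by every concept-specific head.
    shared = [max(0.0, x) for x in matvec(params["shared"], joint)]
    # Each concept-specific head yields one region-phrase score.
    head_scores = matvec(params["heads"], shared)
    # Concept weight branch: soft assignment of the phrase to the heads.
    weights = softmax(matvec(params["concept"], phrase_feat))
    # Final score is the weighted combination of the head scores.
    score = sum(w * s for w, s in zip(weights, head_scores))
    return score, weights


params = init_params(dim=4, hidden=5, num_concepts=3, seed=0)
score, weights = conditional_score(
    [0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.2, 0.0], params
)
print(len(weights), round(sum(weights), 6))
```

Because the weights come from the phrase alone, the assignment of a phrase to an embedding is produced by the model instead of being fixed by hand, which is the contrast with prior work that the abstract draws.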

Bryan A. Plummer, Paige Kordas, M. Hadi Kiapour, Shuai Zheng, Robinson Piramuthu, Svetlana Lazebnik • 2017

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| Referring Expression Comprehension | ReferItGame (test) | Top-1 Acc | 35.07 | 47 |
| Visual Grounding | Flickr30K Entities (test) | Accuracy | 59.27 | 29 |
| Phrase grounding | Flickr30K Entities (test) | Recall@1 | 61.9 | 28 |
| Visual Grounding | ReferItGame (test) | Pr@0.5 | 0.3507 | 26 |
| Phrase grounding | Flickr30K | Accuracy | 59.27 | 20 |
| Referring Expression Comprehension | Flickr30K Entities (test) | Top-1 Accuracy | 61.33 | 17 |
| Phrase grounding | ReferIt | Accuracy | 34.15 | 14 |
| Referring Expression Grounding | RefCLEF (test) | Accuracy | 34.13 | 4 |
| Object Grounding | RefCLEF (test) | Accuracy | 34.13 | 3 |
