
Multimodal Convolutional Neural Networks for Matching Image and Sentence

About

In this paper, we propose multimodal convolutional neural networks (m-CNNs) for matching images and sentences. Our m-CNN provides an end-to-end framework with convolutional architectures to exploit image representation, word composition, and the matching relations between the two modalities. More specifically, it consists of one image CNN encoding the image content and one matching CNN learning the joint representation of image and sentence. The matching CNN composes words into semantic fragments and learns the inter-modal relations between the image and the composed fragments at different levels, thus fully exploiting the matching relations between image and sentence. Experimental results on benchmark databases for bidirectional image and sentence retrieval demonstrate that the proposed m-CNNs can effectively capture the information necessary for matching images and sentences. Specifically, our m-CNNs achieve state-of-the-art performance for bidirectional image and sentence retrieval on the Flickr30K and Microsoft COCO databases.
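The abstract's core idea — compose words into phrase-level fragments with a convolution, then score each fragment against the image in a joint space — can be illustrated with a minimal sketch. This is not the paper's architecture: the dimensions, random untrained weights, window size, and the max-over-fragments similarity are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen for illustration only.
WORD_DIM, IMG_DIM = 8, 16

def compose_words(word_vecs, window=3):
    """One simplified word-composition layer: a 1-D convolution
    that merges each window of adjacent word vectors into a
    phrase-level semantic fragment (ReLU-activated)."""
    W = rng.standard_normal((window * WORD_DIM, WORD_DIM)) * 0.1
    n = len(word_vecs) - window + 1
    frags = []
    for i in range(n):
        x = np.concatenate(word_vecs[i:i + window])  # local word window
        frags.append(np.maximum(W.T @ x, 0))         # composed fragment
    return frags

def match_score(img_feat, word_vecs):
    """Score an (image, sentence) pair: project the image feature
    into the fragments' space, then take the best fragment-image
    similarity as the matching score."""
    P = rng.standard_normal((IMG_DIM, WORD_DIM)) * 0.1
    img_joint = P.T @ img_feat
    frags = compose_words(word_vecs)
    return max(float(img_joint @ f) for f in frags)

# Toy example: a 5-word sentence and one image feature vector.
sentence = [rng.standard_normal(WORD_DIM) for _ in range(5)]
image = rng.standard_normal(IMG_DIM)
print(match_score(image, sentence))
```

In the paper this matching happens at multiple levels (word, phrase, sentence) with learned weights; the sketch shows only a single phrase-level pass with random weights.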

Lin Ma, Zhengdong Lu, Lifeng Shang, Hang Li• 2015

Related benchmarks

Task                       Dataset               R@1     Rank
Image-to-Text Retrieval    Flickr30K 1K (test)   33.6    491
Text-to-Image Retrieval    Flickr30k (test)      26.2    445
Image-to-Text Retrieval    Flickr30k (test)      33.6    392
Image Retrieval            Flickr30k (test)      26.2    210
Text-to-Image Retrieval    MS-COCO               32.6    151
Image Retrieval            Flickr30K             26.2    144
Image Retrieval            MS-COCO 1K (test)     32.6    128
Text-to-Image Retrieval    MSCOCO (1K test)      32.6    118
Text Retrieval             Flickr30k (test)      33.6    104
Image-to-Text Retrieval    MSCOCO (1K test)      42.8    96

(Showing 10 of 25 rows)
