
Multimodal Convolutional Neural Networks for Matching Image and Sentence

About

In this paper, we propose multimodal convolutional neural networks (m-CNNs) for matching images and sentences. Our m-CNN provides an end-to-end framework with convolutional architectures to exploit image representation, word composition, and the matching relations between the two modalities. More specifically, it consists of one image CNN encoding the image content and one matching CNN learning the joint representation of image and sentence. The matching CNN composes words into semantic fragments and learns the inter-modal relations between the image and the composed fragments at different levels, thus fully exploiting the matching relations between image and sentence. Experimental results on benchmark databases for bidirectional image and sentence retrieval demonstrate that the proposed m-CNNs can effectively capture the information necessary for matching images and sentences. Specifically, our m-CNNs achieve state-of-the-art performance for bidirectional image and sentence retrieval on the Flickr30K and Microsoft COCO databases.
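The abstract's core idea — compose words into phrase-level fragments with a convolution, then score each fragment against the image in a joint space — can be illustrated with a minimal sketch. This is not the paper's architecture: the dimensions, random untrained weights, window size, and the max-over-fragments similarity are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen for illustration only.
WORD_DIM, IMG_DIM = 8, 16

def compose_words(word_vecs, window=3):
    """One simplified word-composition layer: a 1-D convolution
    that merges each window of adjacent word vectors into a
    phrase-level semantic fragment (ReLU-activated)."""
    W = rng.standard_normal((window * WORD_DIM, WORD_DIM)) * 0.1
    n = len(word_vecs) - window + 1
    frags = []
    for i in range(n):
        x = np.concatenate(word_vecs[i:i + window])  # local word window
        frags.append(np.maximum(W.T @ x, 0))         # composed fragment
    return frags

def match_score(img_feat, word_vecs):
    """Score an (image, sentence) pair: project the image feature
    into the fragments' space, then take the best fragment-image
    similarity as the matching score."""
    P = rng.standard_normal((IMG_DIM, WORD_DIM)) * 0.1
    img_joint = P.T @ img_feat
    frags = compose_words(word_vecs)
    return max(float(img_joint @ f) for f in frags)

# Toy example: a 5-word sentence and one image feature vector.
sentence = [rng.standard_normal(WORD_DIM) for _ in range(5)]
image = rng.standard_normal(IMG_DIM)
print(match_score(image, sentence))
```

In the paper this matching happens at multiple levels (word, phrase, sentence) with learned weights; the sketch shows only a single phrase-level pass with random weights.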

Lin Ma, Zhengdong Lu, Lifeng Shang, Hang Li• 2015

Related benchmarks

Task                       Dataset               R@1     Rank
Image-to-Text Retrieval    Flickr30K 1K (test)   33.6    491
Text-to-Image Retrieval    Flickr30k (test)      26.2    445
Image-to-Text Retrieval    Flickr30k (test)      33.6    392
Image Retrieval            Flickr30k (test)      26.2    210
Text-to-Image Retrieval    MS-COCO               32.6    151
Image Retrieval            Flickr30K             26.2    144
Image Retrieval            MS-COCO 1K (test)     32.6    128
Text-to-Image Retrieval    MSCOCO (1K test)      32.6    118
Text Retrieval             Flickr30k (test)      33.6    104
Image-to-Text Retrieval    MSCOCO (1K test)      42.8    96

(Showing 10 of 25 rows)
