
Explain Images with Multimodal Recurrent Neural Networks

About

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images. It directly models the probability distribution of generating a word given the previous words and the image; image descriptions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on three benchmark datasets (IAPR TC-12, Flickr 8K, and Flickr 30K). Our model outperforms the state-of-the-art generative method. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves a significant performance improvement over the state-of-the-art methods that directly optimize the ranking objective function for retrieval.
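The architecture described above can be sketched in a few lines of NumPy. This is a simplified, hypothetical illustration only: the dimensions, the ReLU activations, and the randomly initialized weights are assumptions for clarity, and the CNN image feature is stubbed with a random vector standing in for a real convolutional network's output. The point is the structure: a word embedding feeds a recurrent layer, and the embedding, recurrent state, and image feature are fused in a multimodal layer that produces a distribution over the next word.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: vocab, embedding, recurrent, multimodal, image feature.
V, E, H, M, D_IMG = 1000, 64, 128, 256, 512

# Randomly initialized parameters (for illustration only).
W_embed = rng.normal(0, 0.1, (V, E))
W_rec_in = rng.normal(0, 0.1, (E, H))     # embedding -> recurrent layer
W_rec_h = rng.normal(0, 0.1, (H, H))      # recurrent state -> recurrent layer
W_m_w = rng.normal(0, 0.1, (E, M))        # word embedding -> multimodal layer
W_m_r = rng.normal(0, 0.1, (H, M))        # recurrent state -> multimodal layer
W_m_i = rng.normal(0, 0.1, (D_IMG, M))    # image feature -> multimodal layer
W_out = rng.normal(0, 0.1, (M, V))        # multimodal layer -> vocabulary logits

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def mrnn_step(word_id, h_prev, img_feat):
    """One generation step: P(next word | current word, history, image)."""
    w = W_embed[word_id]                                   # embed current word
    h = np.maximum(0.0, w @ W_rec_in + h_prev @ W_rec_h)   # recurrent sub-network
    m = np.maximum(0.0, w @ W_m_w + h @ W_m_r + img_feat @ W_m_i)  # multimodal fusion
    return softmax(m @ W_out), h                           # next-word distribution

img_feat = rng.normal(0, 1, D_IMG)   # stand-in for a CNN image feature
h = np.zeros(H)
probs, h = mrnn_step(word_id=1, h_prev=h, img_feat=img_feat)
print(probs.shape)   # distribution over the vocabulary, sums to 1
```

To generate a full description, one would repeat `mrnn_step`, feeding the sampled (or argmax) word back in until an end-of-sentence token is produced, which matches the paper's description of sampling sentences from the learned distribution.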

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Alan L. Yuille · 2014

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Image Retrieval | Flickr30k (test) | R@1: 12.6 | 423 |
| Image-to-Text Retrieval | Flickr30k (test) | R@1: 18.4 | 370 |
| Image Retrieval | Flickr30k (test) | R@1: 22.8 | 195 |
| Image Retrieval | Flickr30K | R@1: 1.26e+3 | 144 |
| Text-to-Image Retrieval | MSCOCO (1K test) | R@1: 29 | 104 |
| Image Search | Flickr8K | R@1: 1.45e+3 | 74 |
| Image Annotation | Flickr30k (test) | R@1: 35.4 | 39 |
| Sentence Retrieval | Flickr30K | R@1: 1.84e+3 | 32 |
| Image Annotation | Flickr8K | R@1: 14.5 | 18 |
| Image Annotation | COCO 1000 (test) | R@1: 41 | 18 |

(Showing 10 of 20 rows.)
