Explain Images with Multimodal Recurrent Neural Networks
About
In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions that explain the content of images. It directly models the probability distribution of generating a word given the previous words and the image, and image descriptions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on three benchmark datasets (IAPR TC-12, Flickr 8K, and Flickr 30K). Our model outperforms the state-of-the-art generative method. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvements over state-of-the-art methods that directly optimize a ranking objective function for retrieval.
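The step described above — predicting the next word from the previous words and the image via a multimodal fusion layer — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: all layer sizes, parameter names, and the use of random weights and `tanh` activations are assumptions for demonstration only (the actual m-RNN learns its parameters and uses a CNN to produce the image feature).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
vocab, d_embed, d_hidden, d_img, d_multi = 1000, 128, 256, 512, 512

# Random stand-ins for learned parameters.
W_e = rng.standard_normal((vocab, d_embed)) * 0.01                # word embedding
W_r = rng.standard_normal((d_embed + d_hidden, d_hidden)) * 0.01  # recurrent weights
V_w = rng.standard_normal((d_embed, d_multi)) * 0.01              # word -> multimodal
V_r = rng.standard_normal((d_hidden, d_multi)) * 0.01             # recurrent -> multimodal
V_i = rng.standard_normal((d_img, d_multi)) * 0.01                # image -> multimodal
W_o = rng.standard_normal((d_multi, vocab)) * 0.01                # multimodal -> vocab logits

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def mrnn_step(word_id, h_prev, img_feat):
    """One decoding step: P(next word | previous words, image)."""
    w = W_e[word_id]                                 # embed the current word
    h = np.tanh(np.concatenate([w, h_prev]) @ W_r)   # update the recurrent state
    m = np.tanh(w @ V_w + h @ V_r + img_feat @ V_i)  # multimodal fusion layer
    return softmax(m @ W_o), h                       # distribution over the vocabulary

h = np.zeros(d_hidden)
img = rng.standard_normal(d_img)   # stand-in for a CNN image feature
probs, h = mrnn_step(0, h, img)    # probs sums to 1 over the vocabulary
```

Sampling a whole caption would repeat `mrnn_step`, feeding each sampled word back in until an end-of-sentence token is drawn, with the same image feature conditioning every step.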
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1 | 12.6 | 423 |
| Image-to-Text Retrieval | Flickr30k (test) | R@1 | 18.4 | 370 |
| Image Retrieval | Flickr30k (test) | R@1 | 22.8 | 195 |
| Image Retrieval | Flickr30K | R@1 | 1.26e+3 | 144 |
| Text-to-Image Retrieval | MSCOCO (1K test) | R@1 | 29 | 104 |
| Image Search | Flickr8K | R@1 | 1.45e+3 | 74 |
| Image Annotation | Flickr30k (test) | R@1 | 35.4 | 39 |
| Sentence Retrieval | Flickr30K | R@1 | 1.84e+3 | 32 |
| Image Annotation | Flickr8K | R@1 | 14.5 | 18 |
| Image Annotation | COCO 1000 (test) | R@1 | 41 | 18 |