Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)
About
In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given the previous words and an image; image captions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K, and MS COCO), where it outperforms the state-of-the-art methods. In addition, we apply the m-RNN model to retrieval tasks for retrieving images or sentences, and achieve significant performance improvement over the state-of-the-art methods that directly optimize the ranking objective function for retrieval. The project page of this work is: www.stat.ucla.edu/~junhua.mao/m-RNN.html .
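The key architectural idea above is the multimodal layer, which fuses the word embedding, the recurrent state, and the CNN image feature into a shared space before predicting the next word. The following is a minimal NumPy sketch of one such step; all dimensions, the `tanh` activation, and the matrix names (`V_w`, `V_r`, `V_I`, `U`) are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes (illustrative, not the paper's settings)
D_w, D_r, D_img, D_m, V = 128, 256, 512, 512, 1000

# Projections into the shared multimodal space
V_w = rng.normal(0, 0.01, (D_m, D_w))    # word-embedding projection
V_r = rng.normal(0, 0.01, (D_m, D_r))    # recurrent-state projection
V_I = rng.normal(0, 0.01, (D_m, D_img))  # CNN image-feature projection
U   = rng.normal(0, 0.01, (V, D_m))      # multimodal space -> vocabulary logits

def multimodal_step(w_t, r_t, img):
    """One generation step: fuse word, recurrent state, and image
    features in the multimodal layer, then return a probability
    distribution over the next word (softmax over the vocabulary)."""
    m = np.tanh(V_w @ w_t + V_r @ r_t + V_I @ img)  # multimodal layer
    logits = U @ m
    p = np.exp(logits - logits.max())               # numerically stable softmax
    return p / p.sum()

# Toy inputs standing in for a word embedding, RNN state, and image feature
w_t = rng.normal(size=D_w)
r_t = rng.normal(size=D_r)
img = rng.normal(size=D_img)
p_next = multimodal_step(w_t, r_t, img)
print(p_next.shape)  # (1000,)
```

Captions are then generated by repeatedly sampling a word from `p_next`, feeding it back in, and stopping at an end token.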
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | -- | -- | 682 |
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1 | 22.8 | 423 |
| Image-to-Text Retrieval | Flickr30k (test) | R@1 | 35.4 | 370 |
| Image Retrieval | Flickr30k (test) | R@1 | 12.6 | 195 |
| Image Retrieval | Flickr30K | R@1 | 2.28e+3 | 144 |
| Image Captioning | MS-COCO (test) | CIDEr | 79 | 117 |
| Text-to-Image Retrieval | MSCOCO (1K test) | R@1 | 29 | 104 |
| Image-to-Text Retrieval | MSCOCO (1K test) | R@1 | 41 | 82 |
| Caption Retrieval | MS COCO Karpathy 1k (test) | R@1 | 41 | 62 |
| Image Search | COCO (test) | R@5 | 7.30e+3 | 53 |