
Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

About

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image; image captions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K, and MS COCO), where it outperforms the state-of-the-art methods. In addition, we apply the m-RNN model to retrieval tasks for retrieving images or sentences, and achieve significant performance improvements over the state-of-the-art methods that directly optimize a ranking objective function for retrieval. The project page of this work is: www.stat.ucla.edu/~junhua.mao/m-RNN.html .
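The abstract describes one generation step: embed the current word, update the recurrent state, and fuse the word embedding, recurrent state, and image features in a multimodal layer that produces a distribution over the next word. The following is a minimal NumPy sketch of that step; the layer sizes, random weights, and start-token id are illustrative assumptions, not the paper's actual configuration, and the CNN image features are stood in for by a random vector.

```python
import numpy as np

# Hypothetical dimensions (the paper's actual sizes differ):
# vocab, word embedding, recurrent, multimodal, image-feature dims.
V, E, H, M, I = 1000, 128, 256, 512, 4096

rng = np.random.default_rng(0)
W_e = rng.normal(0, 0.01, (V, E))       # word embedding table
W_r = rng.normal(0, 0.01, (E + H, H))   # recurrent weights (embedding + previous state)
W_w = rng.normal(0, 0.01, (E, M))       # word branch into the multimodal layer
W_h = rng.normal(0, 0.01, (H, M))       # recurrent branch into the multimodal layer
W_i = rng.normal(0, 0.01, (I, M))       # image branch into the multimodal layer
W_o = rng.normal(0, 0.01, (M, V))       # multimodal layer to vocabulary logits

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def mrnn_step(word_id, h_prev, img_feat):
    """One m-RNN-style step: embed the word, update the recurrent state,
    fuse word, state, and image features, and return P(next word)."""
    w = W_e[word_id]                                 # word embedding
    h = np.tanh(np.concatenate([w, h_prev]) @ W_r)   # recurrent update
    m = np.tanh(w @ W_w + h @ W_h + img_feat @ W_i)  # multimodal fusion
    return softmax(m @ W_o), h

img = rng.normal(size=I)          # stand-in for deep CNN image features
h = np.zeros(H)
p, h = mrnn_step(0, h, img)       # assumed start-token id 0
```

Sampling a caption would amount to drawing a word from `p`, feeding it back as the next `word_id`, and repeating until an end token is drawn; the same conditional distribution also scores existing sentences for the retrieval tasks mentioned above.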

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille• 2014

Related benchmarks

Task                      Dataset                       Metric     Result    Rank
Image Captioning          MS COCO Karpathy (test)       --         --        682
Text-to-Image Retrieval   Flickr30k (test)              Recall@1   22.8      423
Image-to-Text Retrieval   Flickr30k (test)              R@1        35.4      370
Image Retrieval           Flickr30k (test)              R@1        12.6      195
Image Retrieval           Flickr30K                     R@1        2.28e+3   144
Image Captioning          MS-COCO (test)                CIDEr      79        117
Text-to-Image Retrieval   MSCOCO (1K test)              R@1        29        104
Image-to-Text Retrieval   MSCOCO (1K test)              R@1        41        82
Caption Retrieval         MS COCO Karpathy 1k (test)    R@1        41        62
Image Search              COCO (test)                   R@5        7.30e+3   53
(10 of 17 benchmark rows shown.)
