
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

About

Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a) a multimodal joint embedding space over images and text, and (b) a novel language model for decoding distributed representations from that space. Our pipeline effectively unifies joint image-text embedding models with multimodal neural language models. We introduce the structure-content neural language model, which disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder. The encoder allows one to rank images and sentences, while the decoder can generate novel descriptions from scratch. Using LSTMs to encode sentences, we match state-of-the-art performance on Flickr8K and Flickr30K without using object detections. We also set new best results when using the 19-layer Oxford convolutional network. Furthermore, we show that with linear encoders the learned embedding space captures multimodal regularities in terms of vector space arithmetic, e.g. *image of a blue car* - "blue" + "red" is near images of red cars. Sample captions generated for 800 images are made available for comparison.
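The multimodal vector arithmetic described above can be sketched in a few lines. This is a toy illustration only: the hand-made 4-d vectors and the names `image_blue_car`, `blue`, `red`, etc. are hypothetical stand-ins for the paper's learned embeddings (CNN features mapped through a linear encoder for images, learned word vectors for text). The operation itself — subtract one word vector, add another, then rank images by cosine similarity — matches the analogy described in the abstract.

```python
import numpy as np

def normalize(v):
    # Embeddings are compared by cosine similarity, so work with unit vectors.
    return v / np.linalg.norm(v)

# Toy 4-d embeddings (illustrative values, NOT from the paper).
emb = {
    "image_blue_car": normalize(np.array([1.0, 1.0, 0.0, 0.0])),
    "blue":           normalize(np.array([0.0, 1.0, 0.0, 0.0])),
    "red":            normalize(np.array([0.0, 0.0, 1.0, 0.0])),
    "image_red_car":  normalize(np.array([1.0, 0.0, 1.0, 0.0])),
    "image_dog":      normalize(np.array([0.0, 0.0, 0.0, 1.0])),
}

# Multimodal analogy: image(blue car) - "blue" + "red"
query = normalize(emb["image_blue_car"] - emb["blue"] + emb["red"])

# Rank candidate images by cosine similarity to the query vector.
candidates = ["image_red_car", "image_dog"]
best = max(candidates, key=lambda k: float(query @ emb[k]))
print(best)  # the red-car image scores highest
```

In the actual model the same arithmetic is performed over embeddings learned with a pairwise ranking objective, so nearest neighbors of the query vector are retrieved from the full image set rather than a two-element candidate list.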

Ryan Kiros, Ruslan Salakhutdinov, Richard S. Zemel • 2014

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1 | 11.8 | 423 |
| Video Question Answering | MSRVTT-QA (test) | -- | -- | 371 |
| Image-to-Text Retrieval | Flickr30k (test) | R@1 | 14.8 | 370 |
| Text-to-Video Retrieval | MSR-VTT | Recall@1 | 3.8 | 313 |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1 | 3.8 | 234 |
| Text-to-Video Retrieval | MSVD | R@1 | 12.3 | 218 |
| Text-to-Video Retrieval | MSR-VTT (1k-A) | R@10 | 17.1 | 211 |
| Text-to-Video Retrieval | MSVD (test) | R@1 | 12.3 | 204 |
| Image Retrieval | Flickr30k (test) | R@1 | 16.8 | 195 |
| Image Retrieval | Flickr30K | R@1 | 16.8 | 144 |

Showing 10 of 34 rows.
