
Latent Variable Model for Multi-modal Translation

About

In this work, we propose to model the interaction between visual and textual features for multi-modal neural machine translation (MMT) through a latent variable model. This latent variable can be seen as a multi-modal stochastic embedding of an image and its description in a foreign language. It is used in a target-language decoder and also to predict image features. Importantly, our model formulation utilises visual and textual inputs during training but does not require that images be available at test time. We show that our latent variable MMT formulation improves considerably over strong baselines, including a multi-task learning approach (Elliott and Kádár, 2017) and a conditional variational auto-encoder approach (Toyama et al., 2016). Finally, we show improvements due to (i) predicting image features in addition to only conditioning on them, (ii) imposing a constraint on the minimum amount of information encoded in the latent variable, and (iii) training on additional target-language image descriptions (i.e. synthetic data).
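The constraint in point (ii) can be illustrated with a small sketch. One common way to enforce a minimum amount of information in a latent variable is a floor on the KL term of the training objective (often called "free bits"); whether this matches the authors' exact formulation is an assumption here. The function names, the three-term loss decomposition (translation loss, image-feature prediction loss, KL term), and the floor value of 2.0 nats are all hypothetical and chosen only for illustration.

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions (nats)."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total


def kl_with_floor(kl, min_nats):
    """Clamp the KL term from below (assumed 'free bits' style constraint).

    While kl < min_nats the term is constant, so it contributes no gradient
    and the optimiser has no incentive to squeeze the latent further.
    """
    return max(kl, min_nats)


def training_loss(text_nll, image_nll, kl, min_nats=2.0):
    """Hypothetical objective: translation NLL + image-feature NLL + floored KL."""
    return text_nll + image_nll + kl_with_floor(kl, min_nats)


# Posterior q shifted by one unit from a standard-normal prior p, in one dimension.
kl = gaussian_kl([1.0], [1.0], [0.0], [1.0])
loss = training_loss(text_nll=10.0, image_nll=4.0, kl=kl)
```

In this sketch the KL between the two Gaussians is below the floor, so the loss uses the floor value instead; a KL above the floor would pass through unchanged.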

Iacer Calixto, Miguel Rios, Wilker Aziz • 2018

Related benchmarks

Task                                                | Dataset                        | Result      | Rank
----------------------------------------------------|--------------------------------|-------------|-----
Multimodal Machine Translation                      | Multi30K (test)                | BLEU-4 38.4 | 139
Multimodal Machine Translation (English-German)     | Multi30K 2016 (test)           | BLEU 37.7   | 52
Multimodal Machine Translation                      | Multi30k En-De 2017 (test)     | METEOR 49.9 | 45
Multi-modal Machine Translation                     | Multi30k WMT17 (test)          | BLEU 30.1   | 16
Multimodal Machine Translation                      | MSCOCO Ambiguous EN-DE (test)  | BLEU 25.5   | 13
Machine Translation (En-De)                         | Multi30K MSCOCO                | BLEU 25.5   | 12
Multimodal Machine Translation (English to German)  | Ambiguous COCO WMT2017 (test)  | BLEU 25.5   | 11
Machine Translation                                 | Multi30k translated (test)     | BLEU-4 37.6 | 5
Multi-modal Machine Translation                     | MSCOCO (test)                  | BLEU 25.5   | 5
