A Visual Attention Grounding Neural Model for Multimodal Machine Translation

About

We introduce a novel multimodal machine translation model that utilizes parallel visual and textual information. Our model jointly optimizes the learning of a shared visual-language embedding and a translator. The model leverages a visual attention grounding mechanism that links the visual semantics with the corresponding textual semantics. Our approach achieves results competitive with the state of the art on the Multi30K and Ambiguous COCO datasets. We also collected a new multilingual multimodal product description dataset to simulate a real-world international online shopping scenario. On this dataset, our visual attention grounding model outperforms other methods by a large margin.
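The description above amounts to a two-part training objective: a standard sequence-to-sequence translation loss plus an image-sentence ranking loss over the shared visual-language embedding, with visual attention weights tying individual source words to the image. The PyTorch sketch below illustrates one plausible reading of that setup; the module names, dimensions, margin, and loss weighting are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a joint translation + visual grounding objective.
# Dimensions, the margin, and the loss weighting are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualGroundingSketch(nn.Module):
    """Shared visual-language embedding with word-level visual attention."""

    def __init__(self, word_dim=512, image_dim=2048, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(image_dim, embed_dim)  # project CNN features
        self.txt_proj = nn.Linear(word_dim, embed_dim)   # project encoder states

    def forward(self, enc_states, img_feat):
        # enc_states: (B, L, word_dim) text-encoder hidden states
        # img_feat:   (B, image_dim) global image feature (e.g. ResNet pooling)
        img = F.normalize(self.img_proj(img_feat), dim=-1)       # (B, E)
        words = F.normalize(self.txt_proj(enc_states), dim=-1)   # (B, L, E)
        # Visual attention: weight each source word by its similarity
        # to the image, then pool into a sentence embedding.
        scores = torch.bmm(words, img.unsqueeze(2)).squeeze(2)   # (B, L)
        alpha = F.softmax(scores, dim=1)
        sent = F.normalize((alpha.unsqueeze(2) * words).sum(1), dim=-1)
        return sent, img


def max_margin_loss(sent, img, margin=0.1):
    """Ranking loss pulling each sentence toward its own image and away
    from the other images in the batch, and vice versa."""
    sim = sent @ img.t()                             # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                    # matching pairs
    cost_s = (margin + sim - pos).clamp(min=0)       # contrast over images
    cost_i = (margin + sim - pos.t()).clamp(min=0)   # contrast over sentences
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_s.masked_fill(mask, 0).sum() + cost_i.masked_fill(mask, 0).sum()


def joint_loss(translation_logits, targets, sent, img, lambda_g=0.5):
    """Joint objective: translation cross-entropy plus grounding loss.
    lambda_g is a placeholder trade-off weight, not the paper's value."""
    ce = F.cross_entropy(
        translation_logits.reshape(-1, translation_logits.size(-1)),
        targets.reshape(-1),
        ignore_index=0,  # assume pad index 0
    )
    return lambda_g * ce + (1 - lambda_g) * max_margin_loss(sent, img)
```

The key design point the abstract emphasizes is that both losses share the same text encoder, so the grounding signal shapes the representations the translator attends over, rather than being bolted on as a separate feature.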

Mingyang Zhou, Runxiang Cheng, Yong Jae Lee, Zhou Yu • 2018

Related benchmarks

| Task | Dataset | Metric | Score | Rank |
| --- | --- | --- | --- | --- |
| Multimodal Machine Translation | Multi30K (test) | BLEU-4 | 53.8 | 139 |
| Multimodal Machine Translation | Multi30k En-De 2017 (test) | METEOR | 52.2 | 45 |
| Multimodal Machine Translation | Multi30k En-Fr 2017 (test) | METEOR | 70.3 | 31 |
| Multimodal Machine Translation | MSCOCO Ambiguous EN-DE (test) | BLEU | 28.3 | 13 |
| Multimodal Machine Translation (English to German) | Ambiguous COCO WMT2017 (test) | BLEU | 28.3 | 11 |
| Machine Translation | Ambiguous COCO | BLEU | 28.3 | 6 |
| Machine Translation | COCO Ambiguous | BLEU | 45 | 6 |
| Machine Translation (English → French) | IKEA (test) | BLEU | 65.8 | 3 |
| Machine Translation (English → German) | IKEA (test) | BLEU | 63.5 | 3 |
| Multimodal Machine Translation (English to French) | Ambiguous COCO WMT2017 (test) | BLEU | 45 | 3 |
