
On Vision Features in Multimodal Machine Translation

About

Previous work on multimodal machine translation (MMT) has focused on how to incorporate vision features into translation, but little attention has been paid to the quality of the vision models themselves. In this work, we investigate the impact of vision models on MMT. Given that Transformers are becoming popular in computer vision, we experiment with various strong models (such as the Vision Transformer) and enhanced features (such as object detection and image captioning). We develop a selective attention model to study the patch-level contribution of an image in MMT. On detailed probing tasks, we find that stronger vision models are helpful for learning translation from the visual modality. Our results also suggest the need to carefully examine MMT models, especially since current benchmarks are small-scale and biased. Our code can be found at https://github.com/libeineu/fairseq_mmt.

Bei Li, Chuanhao Lv, Zefan Zhou, Tao Zhou, Tong Xiao, Anxiang Ma, JingBo Zhu • 2022
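
The gated, patch-level selective attention described in the abstract is simple enough to sketch. Below is a minimal PyTorch illustration, assuming image patch features (e.g., from a ViT) have already been projected to the text encoder's dimension; the module and variable names are our own, not the authors' exact fairseq code:

```python
import torch
import torch.nn as nn

class SelectiveAttentionFusion(nn.Module):
    """Sketch of patch-level selective attention for MMT.

    Text encoder states attend over image patch features, and a
    per-token sigmoid gate controls how much visual context is
    mixed back in. Names and dimensions are illustrative only.
    """
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, h_text: torch.Tensor, h_img: torch.Tensor) -> torch.Tensor:
        # h_text: (batch, src_len, d_model)   encoder states of the source text
        # h_img:  (batch, n_patches, d_model) projected ViT patch features
        h_attn, _ = self.attn(query=h_text, key=h_img, value=h_img)
        lam = torch.sigmoid(self.gate(torch.cat([h_text, h_attn], dim=-1)))
        return h_text + lam * h_attn  # gated residual fusion

# Toy usage: 2 sentences of 10 tokens, 196 patches (14x14 ViT grid).
fusion = SelectiveAttentionFusion(d_model=512)
h_text = torch.randn(2, 10, 512)
h_img = torch.randn(2, 196, 512)
out = fusion(h_text, h_img)  # (2, 10, 512)
```

The gate is what makes the attention "selective": when an image contributes little to a token's translation, the gate can drive its visual contribution toward zero.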

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multimodal Machine Translation (English-German) | Multi30K 2016 (test) | BLEU | 42.5 | 52 |
| Multimodal Machine Translation | Multi30k En-De 2017 (test) | METEOR | 62.32 | 45 |
| Multimodal Machine Translation | Multi30k En-Fr 2017 (test) | METEOR | 76.61 | 31 |
| Multimodal Machine Translation | Multi30k En-Fr 2016 (test) | METEOR | 81.75 | 30 |
| Multimodal Machine Translation | MSCOCO En-Fr (test) | BLEU | 45.82 | 19 |
| Multimodal Machine Translation | EMMT | BLEU | 41.27 | 18 |
| Multimodal Machine Translation | EMMT (test) | BLEURT | 0.5619 | 18 |
| Multimodal Machine Translation | Multi30k WMT17 (test) | BLEU | 33.8 | 16 |
| Multimodal Machine Translation (English-German) | MSCOCO (test) | BLEU | 31.14 | 13 |
| Multimodal Machine Translation | Multi30K 2016 (test) | BLEU | 40.63 | 11 |
(10 of 11 rows shown)
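
For reference, BLEU scores like those in the table are typically computed with sacrebleu. The snippet below is a generic illustration, not part of this page or the paper's pipeline; the hypothesis/reference pair is made up:

```python
# Generic corpus-level BLEU scoring with sacrebleu.
import sacrebleu

hyps = ["Ein Mann fährt ein rotes Fahrrad ."]    # system output, one line per sentence
refs = [["Ein Mann fährt ein rotes Fahrrad ."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hyps, refs)
print(f"BLEU = {bleu.score:.2f}")                # 100.00 for an exact match
```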

Other info

Code: https://github.com/libeineu/fairseq_mmt
