MURAL: Multimodal, Multitask Retrieval Across Languages
About
Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al., PMLR'21), a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encoders, MURAL matches or exceeds ALIGN's cross-modal retrieval performance on well-resourced languages across several datasets. More importantly, it considerably improves performance on under-resourced languages, showing that text-text learning can overcome a paucity of image-caption examples for these languages. On the Wikipedia Image-Text dataset, for example, MURAL-base improves zero-shot mean recall by 8.1% on average for eight under-resourced languages and by 6.8% on average when fine-tuning. We additionally show that MURAL's text representations cluster not only with respect to genealogical connections but also based on areal linguistics, such as the Balkan Sprachbund.
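As a rough illustration of the two-task setup described above, the sketch below combines an in-batch bidirectional contrastive loss over two sources of positives: image-caption pairs and translation pairs, with the translation pairs sharing the multilingual text encoder. This is a minimal sketch under stated assumptions: the PyTorch framing, the encoder interfaces (`image_enc`, `text_enc`), the temperature, and the equal task weighting are illustrative stand-ins, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(a, b, temperature=0.07):
    """In-batch bidirectional softmax contrastive loss.

    a, b: L2-normalized embeddings of shape (batch, dim) for paired
    items; the i-th row of `a` matches the i-th row of `b`.
    """
    logits = a @ b.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetrize over both retrieval directions (a->b and b->a).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def mural_step(image_enc, text_enc, images, captions,
               src_texts, tgt_texts, w_i2t=1.0, w_t2t=1.0):
    """One multitask step: image-text matching + translation-pair matching.

    `image_enc` and `text_enc` are hypothetical stand-ins for the image
    tower and the shared multilingual text tower of a MURAL-style model.
    """
    img = F.normalize(image_enc(images), dim=-1)
    cap = F.normalize(text_enc(captions), dim=-1)
    src = F.normalize(text_enc(src_texts), dim=-1)
    tgt = F.normalize(text_enc(tgt_texts), dim=-1)
    # Weighted sum of the two task losses; the text tower receives
    # gradients from both tasks.
    return w_i2t * contrastive_loss(img, cap) + w_t2t * contrastive_loss(src, tgt)
```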
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Retrieval | Multi30K (test) | Recall (EN) | 93.8 | 35 |
| Image-Text Retrieval | MSCOCO (test) | EN Retrieval Score | 92.3 | 28 |
| Image-Text Retrieval | Flickr30k (test) | -- | -- | 21 |
| Cross-Modal Retrieval | MSCOCO (1K) | Mean Recall (ja) | 91.6 | 16 |
| Cross-Modal Retrieval | MSCOCO (5K) | Mean Recall (ja) | 81.3 | 12 |
| Image-to-Image Retrieval | Crisscrossed Captions (CxC) | R@1 | 50.3 | 10 |
| Semantic Similarity | Crisscrossed Captions (CxC) | Mean Average | 74.1 | 10 |
| Text-to-Text Retrieval | Crisscrossed Captions (CxC) | R@1 | 57.8 | 10 |
| Image-to-Text Retrieval | Crisscrossed Captions (CxC) | R@1 | 46.5 | 10 |
| Text-to-Image Retrieval | XTD (test) | -- | -- | 9 |
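The mean recall figures above, like the "zero-shot mean recall" gains quoted in the abstract, follow the common cross-modal retrieval convention: the average of Recall@1, Recall@5, and Recall@10 over both the image-to-text and text-to-image directions. A minimal sketch of that convention, assuming a square similarity matrix whose diagonal holds the correct pairs; this reflects the usual definition, not necessarily the leaderboard's exact computation:

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose true match (the diagonal entry)
    appears among the top-k ranked candidates.

    sim: (num_queries, num_candidates) similarity matrix where
    sim[i, i] is the score of the correct pair for query i.
    """
    ranks = (-sim).argsort(axis=1)  # candidate indices, best first
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean()

def mean_recall(sim):
    """Average R@{1,5,10} over image->text (sim) and text->image (sim.T)."""
    return np.mean([recall_at_k(s, k) for s in (sim, sim.T) for k in (1, 5, 10)])
```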