
Cross-Lingual Representation Alignment Through Contrastive Image-Caption Tuning

About

Multilingual alignment of sentence representations has mostly relied on bitexts to bridge the gap between languages. We investigate whether visual information can bridge this gap instead. Image-caption datasets can be created without multilingual expertise, so they offer a more efficient alternative for low-resource languages. We find that (1) multilingual image-caption alignment implicitly aligns text representations across languages, (2) languages unseen by the encoder during pretraining can be incorporated into this alignment post hoc, and (3) the resulting aligned representations are usable for cross-lingual Natural Language Understanding (NLU) and bitext retrieval.
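The contrastive image-caption objective described above is typically a symmetric InfoNCE loss over a batch of paired image and caption embeddings: matching pairs sit on the diagonal of a similarity matrix and are pushed above all mismatched pairs. A minimal NumPy sketch (the function name, temperature value, and batch layout are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def info_nce(image_emb, caption_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, caption_emb: arrays of shape (B, D); row i of each is a pair.
    """
    # L2-normalize so dot products are cosine similarities
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    cap = caption_emb / np.linalg.norm(caption_emb, axis=1, keepdims=True)
    logits = img @ cap.T / temperature      # (B, B); matching pairs on the diagonal
    labels = np.arange(len(logits))

    def cross_entropy(l):
        # stable log-softmax over each row, then pick the diagonal (true pair)
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image->caption and caption->image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Because captions in many languages are contrasted against the same images, minimizing this loss pulls the text encoders' representations of translations toward each other, which is the implicit cross-lingual alignment the abstract refers to.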

Nathaniel Krasner, Nicholas Lanuzo, Antonios Anastasopoulos • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Natural Language Inference | XNLI (test) | Average Accuracy: 61.8 | 167 |
| Bitext Retrieval | Flores-200, all 203 languages | Accuracy: 62.2 | 5 |
| Bitext Retrieval | Flores-200, in XLM-R (92 langs) | Accuracy: 92.6 | 5 |
| Bitext Retrieval | Flores-200, not in XLM-R (111 langs) | Accuracy: 37.1 | 5 |
| Bitext Retrieval | Flores-200, Quechua | Accuracy: 29.2 | 5 |
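The bitext-retrieval numbers above can be understood as nearest-neighbor search in the aligned embedding space: each source sentence retrieves the target sentence with the highest cosine similarity, and accuracy is the fraction retrieving the true translation. A minimal sketch under that assumption (plain cosine scoring; published evaluations often add margin-based scoring on top):

```python
import numpy as np

def retrieve_bitext(src_emb, tgt_emb):
    """For each source sentence, return the index of the most similar target."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return (src @ tgt.T).argmax(axis=1)

def retrieval_accuracy(src_emb, tgt_emb):
    """Fraction of sources whose nearest target is the true translation.

    Assumes row i of src_emb and row i of tgt_emb are translations of each other.
    """
    preds = retrieve_bitext(src_emb, tgt_emb)
    return (preds == np.arange(len(preds))).mean()
```

Under this protocol, the gap between languages seen by XLM-R in pretraining (92.6) and unseen ones (37.1) reflects how well the post-hoc alignment transfers to encoders that never saw the language.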

Other info

Code
