ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation
About
Recent cross-lingual cross-modal works attempt to extend Vision-Language Pre-training (VLP) models to non-English inputs and achieve impressive performance. However, these models focus only on understanding tasks and rely on encoder-only architectures. In this paper, we propose ERNIE-UniX2, a unified cross-lingual cross-modal pre-training framework for both generation and understanding tasks. ERNIE-UniX2 integrates multiple pre-training paradigms (e.g., contrastive learning and language modeling) on top of an encoder-decoder architecture, aiming to learn a better joint representation across languages and modalities. Furthermore, ERNIE-UniX2 can be seamlessly fine-tuned for a variety of downstream generation and understanding tasks. Pre-trained on both multilingual text-only and image-text datasets, ERNIE-UniX2 achieves SOTA results on various cross-lingual cross-modal generation and understanding tasks such as multimodal machine translation and multilingual visual question answering.
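The paper does not include reference code here, but the core idea of combining pre-training paradigms on one backbone can be illustrated with a minimal sketch: an image-text contrastive loss over pooled encoder outputs summed with a token-level language-modeling loss on the decoder. The function name, tensor shapes, and the `lm_weight` balancing factor below are assumptions for illustration, not the actual ERNIE-UniX2 implementation.

```python
import torch
import torch.nn.functional as F

def combined_pretraining_loss(image_emb, text_emb, decoder_logits, target_ids,
                              temperature=0.07, pad_id=0, lm_weight=1.0):
    """Hypothetical sketch: sum a contrastive alignment loss and a
    decoder language-modeling loss, as in encoder-decoder VLP frameworks
    that mix both paradigms. All names and weights are assumptions."""
    # Contrastive term: align pooled image and text encoder outputs
    # with symmetric InfoNCE over in-batch negatives.
    image_emb = F.normalize(image_emb, dim=-1)            # (B, D)
    text_emb = F.normalize(text_emb, dim=-1)              # (B, D)
    logits = image_emb @ text_emb.t() / temperature       # (B, B) similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels)) / 2

    # Language-modeling term: token-level cross-entropy on the decoder,
    # ignoring padding positions in the target sequence.
    lm = F.cross_entropy(decoder_logits.flatten(0, 1),    # (B*T, V)
                         target_ids.flatten(),            # (B*T,)
                         ignore_index=pad_id)
    return contrastive + lm_weight * lm
```

Because both terms are computed from the same encoder-decoder forward pass, one backbone can serve retrieval-style understanding tasks (via the contrastive head) and generation tasks (via the decoder) after fine-tuning.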
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Retrieval | Multi30K (test) | Recall (EN) | 88.8 | 35 |
| Multimodal Translation | Multi30K En-De | BLEU@4 | 49.3 | 6 |
| Cross-lingual Natural Language Inference | XVNLI | Accuracy (EN) | 87.73 | 5 |
| Cross-lingual Visual Question Answering | xGQA | Accuracy (EN) | 56.68 | 5 |
| Image Captioning | MSCOCO EN | BLEU@4 | 40.7 | 4 |
| Cross-lingual Text Retrieval | Tatoeba | Accuracy (36 Avg) | 93.82 | 3 |
| Cross-lingual Image-Text Retrieval | Multi30K zero-shot | Mean Recall (EN) | 80.06 | 3 |
| Image Captioning | COCO-CN (Zh) | BLEU@4 | 48.3 | 2 |