Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance
About
Multilingual Large Language Models (LLMs) struggle with cross-lingual tasks due to data imbalances between high-resource and low-resource languages, as well as monolingual bias in pre-training. Existing methods, such as bilingual fine-tuning and contrastive alignment, can improve cross-lingual performance, but they often require extensive parallel data or suffer from instability. To address these challenges, we introduce a Cross-Lingual Mapping Task during the pre-training phase, which enhances cross-lingual alignment without compromising monolingual fluency. Our approach bi-directionally maps languages within the LLM embedding space, improving both language generation and comprehension. We further propose a Language Alignment Coefficient to robustly quantify cross-lingual consistency, even in limited-data scenarios. Experimental results on machine translation (MT), cross-lingual natural language understanding (CLNLU), and cross-lingual question answering (CLQA) show that our model achieves gains of up to 11.9 BLEU points in MT, 6.72 points in CLQA BERTScore-Precision, and more than 5% in CLNLU accuracy over strong multilingual baselines. These findings highlight the potential of incorporating cross-lingual objectives into pre-training to improve multilingual LLMs.
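The paper does not give the exact formula for the Language Alignment Coefficient here, so the sketch below is only an illustration of the underlying idea: scoring cross-lingual consistency as the mean cosine similarity between embeddings of parallel sentences. The function name `alignment_coefficient` and the use of mean-pooled sentence embeddings are assumptions, not the paper's definition.

```python
import numpy as np

def alignment_coefficient(src_embs: np.ndarray, tgt_embs: np.ndarray) -> float:
    """Mean cosine similarity between row-aligned sentence embeddings.

    src_embs, tgt_embs: (n_sentences, dim) arrays, where row i of each
    array embeds the same sentence in the source and target language.
    Returns a score in [-1, 1]; higher means tighter cross-lingual alignment.
    """
    # L2-normalize each row so the dot product equals cosine similarity.
    src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    return float(np.mean(np.sum(src * tgt, axis=1)))

# Toy check: a language perfectly aligned with itself scores 1.0.
rng = np.random.default_rng(0)
e = rng.normal(size=(4, 8))
print(round(alignment_coefficient(e, e), 6))  # 1.0
```

Because each pair contributes one bounded similarity term, an average of this form stays stable even with few parallel sentences, which matches the paper's claim of robustness in limited-data scenarios.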
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | XQuAD | -- | -- | 21 |
| Machine Translation | FLORES EN-ZH | BLEU | 36.2 | 8 |
| Cross-Lingual Natural Language Understanding | Belebele EN-CS | Accuracy | 61.33 | 5 |
| Cross-Lingual Natural Language Understanding | Belebele CS-UK | Accuracy | 54.22 | 5 |
| Cross-Lingual Natural Language Understanding | Belebele ZH-JP | Accuracy | 46.56 | 5 |
| Cross-Lingual Summarization | CrossSum | ROUGE-1 | 16.8 | 5 |
| Language Modeling | CulturaX EN (test) | Perplexity | 25.5 | 5 |
| Language Modeling | CulturaX ZH (test) | Perplexity | 20.6 | 5 |
| Language Modeling | CulturaX CS (test) | Perplexity | 15.2 | 5 |
| Language Modeling | CulturaX UK (test) | Perplexity | 16.0 | 5 |