Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance

About

Multilingual Large Language Models (LLMs) struggle with cross-lingual tasks due to data imbalances between high-resource and low-resource languages, as well as monolingual bias in pre-training. Existing methods, such as bilingual fine-tuning and contrastive alignment, can improve cross-lingual performance, but they often require extensive parallel data or suffer from instability. To address these challenges, we introduce a Cross-Lingual Mapping Task during the pre-training phase, which enhances cross-lingual alignment without compromising monolingual fluency. Our approach bi-directionally maps languages within the LLM embedding space, improving both language generation and comprehension. We further propose a Language Alignment Coefficient to robustly quantify cross-lingual consistency, even in limited-data scenarios. Experimental results on machine translation (MT), cross-lingual natural language understanding (CLNLU), and cross-lingual question answering (CLQA) show that our model achieves gains of up to 11.9 BLEU points in MT, 6.72 points in CLQA BERTScore-Precision, and more than 5% in CLNLU accuracy over strong multilingual baselines. These findings highlight the potential of incorporating cross-lingual objectives into pre-training to improve multilingual LLMs.
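To make the idea of bi-directional mapping concrete, here is a minimal sketch of how one parallel sentence pair could be turned into two training examples, one per direction. The abstract does not specify the actual data format, so the `MappingExample` type, the `make_bidirectional_examples` helper, and the bracketed language-tag prompt layout are all hypothetical illustrations rather than the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingExample:
    """One direction of a cross-lingual mapping example (hypothetical format)."""
    source_lang: str
    target_lang: str
    prompt: str
    completion: str


def make_bidirectional_examples(lang_a, text_a, lang_b, text_b):
    """Turn one parallel pair into two examples: A -> B and B -> A.

    The "[lang] text [lang]" prompt layout is an assumption made for
    illustration; the source does not describe the real template.
    """
    def one_direction(src_lang, src_text, tgt_lang, tgt_text):
        prompt = f"[{src_lang}] {src_text} [{tgt_lang}]"
        return MappingExample(src_lang, tgt_lang, prompt, tgt_text)

    return [
        one_direction(lang_a, text_a, lang_b, text_b),
        one_direction(lang_b, text_b, lang_a, text_a),
    ]


examples = make_bidirectional_examples("en", "Hello", "zh", "你好")
for ex in examples:
    print(ex.source_lang, "->", ex.target_lang, "|", ex.prompt, "=>", ex.completion)
```

Emitting both directions from every pair means each language appears as source and as target equally often, which is one plausible reading of "bi-directionally maps languages" in the abstract.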

Weihua Zheng, Chang Liu, Zhengyuan Liu, Xin Huang, Kui Wu, Muhammad Huzaifah Md Shahrin, Aiti Aw, Roy Ka-Wei Lee • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Question Answering | XQuAD | -- | 21 |
| Machine Translation | FLORES EN-ZH | BLEU 36.2 | 8 |
| Cross-Lingual Natural Language Understanding | Belebele EN-CS | Accuracy 61.33 | 5 |
| Cross-Lingual Natural Language Understanding | Belebele CS-UK | Accuracy 54.22 | 5 |
| Cross-Lingual Natural Language Understanding | Belebele ZH-JP | Accuracy 46.56 | 5 |
| Cross-lingual Summarization | CrossSum | ROUGE-1 16.8 | 5 |
| Language Modeling | CulturaX EN (test) | Perplexity 25.5 | 5 |
| Language Modeling | CulturaX ZH (test) | Perplexity 20.6 | 5 |
| Language Modeling | CulturaX CS (test) | Perplexity 15.2 | 5 |
| Language Modeling | CulturaX UK (test) | Perplexity 16 | 5 |
Showing 10 of 19 rows
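For readers unfamiliar with the perplexity figures in the Language Modeling rows, the sketch below shows the standard definition: the exponential of the average negative log-likelihood per token. The `perplexity` function name and the toy input are illustrative only; the listing does not describe how the reported CulturaX values were computed or which tokenizer was used.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    Defined as exp(-(1/N) * sum(log p_i)); lower is better.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Toy input: a model that assigns probability 0.25 to each of four tokens.
toy_log_probs = [math.log(0.25)] * 4
print(perplexity(toy_log_probs))
```

Because perplexity is measured per token, values depend on the tokenizer and are not directly comparable across languages that tokenize differently.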
