
Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages

About

While large language models (LLMs) are pre-trained on multilingual corpora, their performance in most languages still lags behind that in a few resource-rich languages. A common way to mitigate this issue is to translate training data from resource-rich languages into other languages and then continue training. However, relying solely on translated data while ignoring the original capabilities of LLMs across languages is not always effective; we show that it limits the performance of cross-lingual knowledge transfer. In this work, we propose SDRRL, a method based on Self-Distillation from Resource-Rich Languages that effectively improves multilingual performance by leveraging the internal capabilities of LLMs on resource-rich languages. We evaluate different LLMs (LLaMA-2 and SeaLLM) and source languages across various comprehension and generation tasks. Experimental results demonstrate that SDRRL can significantly enhance multilingual capabilities while minimizing the impact on original performance in resource-rich languages.
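The core idea of self-distillation here is that the model's own behavior on a resource-rich language serves as the teacher signal when training on other languages, rather than training only on externally translated data. The toy sketch below is purely illustrative and not the paper's actual objective: it computes a per-token cross-entropy between a hypothetical "teacher" distribution (the model's output on an English prompt) and a "student" distribution (its output on the translated prompt); the function names and toy distributions are assumptions for demonstration.

```python
# Hedged sketch of a self-distillation loss (illustrative only; SDRRL's
# actual training objective may differ). The "teacher" distributions stand in
# for the model's own token predictions on a resource-rich (English) prompt,
# and the "student" for its predictions on the translated prompt.
import math

def cross_entropy(teacher_probs, student_probs):
    """Cross-entropy of one token: -sum_i t_i * log(s_i), skipping zero-mass entries."""
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs) if t > 0)

def self_distillation_loss(teacher_dists, student_dists):
    """Average per-token distillation loss over a generated response."""
    losses = [cross_entropy(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(losses) / len(losses)

# Toy example: a 2-token response over a 3-word vocabulary.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # distributions on the English prompt
student = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]  # distributions on the translated prompt
loss = self_distillation_loss(teacher, student)
print(round(loss, 4))
```

Minimizing this loss pulls the student's token distributions toward the teacher's, which is how distillation transfers the model's resource-rich-language behavior to other languages.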

Yuanchi Zhang, Yile Wang, Zijun Liu, Shuo Wang, Xiaolong Wang, Peng Li, Maosong Sun, Yang Liu • 2024

Related benchmarks

Task | Dataset | Result | Rank
Natural Language Inference | XNLI | Accuracy 52.36 | 111
Commonsense Reasoning | XStoryCloze | Average Score 80.67 | 32
General Knowledge Evaluation | MMMLU | General Knowledge Accuracy 47.28 | 29
Multilingual Safety | MultiJail In-Distribution Languages (test) | Safety Score (EN) 33.65 | 10
Multilingual Safety | MultiJail Out-of-Distribution Languages (test) | Safety Violation Rate (KO) 7.3 | 10
Safety Alignment | PKU-SafeRLHF in-distribution (test) | Accuracy (EN) 57.22 | 10
Machine Translation | FLORES X-to-English | BLEU (CES) 36.38 | 5
Machine Translation | FLORES English-to-X | BLEU (CES) 27.91 | 5
Multilingual Understanding | BELEBELE Target Language | CES Performance 52.11 | 5
Multilingual Understanding | BELEBELE English Language | CES Score 66.26 | 5
Showing 10 of 12 rows

Other info

Code
