Language Imbalance Driven Rewarding for Multilingual Self-improving
About
Large Language Models (LLMs) have achieved state-of-the-art performance across numerous tasks. However, these advancements have predominantly benefited "first-class" languages such as English and Chinese, leaving many other languages underrepresented. This imbalance, while limiting broader applications, generates a natural preference ranking between languages, offering an opportunity to bootstrap the multilingual capabilities of LLMs in a self-improving manner. Thus, we propose Language Imbalance Driven Rewarding, where the inherent imbalance between dominant and non-dominant languages within LLMs is leveraged as a reward signal. Iterative DPO training demonstrates that this approach not only enhances LLM performance in non-dominant languages but also improves capability in the dominant language, so the reward signal remains usable across iterations. Fine-tuning Meta-Llama-3-8B-Instruct over two iterations of this approach results in continuous improvements in multilingual performance across instruction-following and arithmetic reasoning tasks, evidenced by an average improvement of 7.46% win rate on the X-AlpacaEval leaderboard and 13.9% accuracy on the MGSM benchmark. This work serves as an initial exploration, paving the way for multilingual self-improvement of LLMs. The code is available at https://github.com/ZNLP/Language-Imbalance-Driven-Rewarding
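To make the idea of using language imbalance as a reward signal more concrete, here is a minimal Python sketch. The abstract does not spell out how preference pairs are constructed, so `build_pair` and its `translate` argument are hypothetical: they assume that a dominant-language response, translated into the prompt's language, is preferred over the response the model wrote directly in the non-dominant language. `dpo_loss` is the standard per-pair DPO objective and is not specific to this work; `beta=0.1` is an illustrative value.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response assumed to be stronger
    rejected: str  # response assumed to be weaker


def build_pair(prompt, dominant_response, non_dominant_response, translate):
    """Rank the translated dominant-language answer above the native one.

    Assumption: `translate` is a caller-supplied function that maps a
    dominant-language response into the prompt's (non-dominant) language.
    """
    return PreferencePair(
        prompt=prompt,
        chosen=translate(dominant_response),
        rejected=non_dominant_response,
    )


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, given summed sequence log-probabilities
    under the policy (logp_*) and the frozen reference model (ref_logp_*)."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In an iterative setup, the model fine-tuned in one round would generate the responses for the next round's pairs; the loss shrinks as the policy assigns relatively more probability to the chosen response than the reference model does.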
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH500 (test) | Accuracy | 73.2 | 895 |
| Mathematical Reasoning | MGSM (test) | Accuracy (ZH) | 86 | 80 |
| Mathematical Reasoning | MATH500 1.0 (test) | Accuracy | 62.46 | 57 |
| Mathematical Reasoning | MGSM | Accuracy (Bn) | 55.6 | 49 |
| Mathematical Reasoning | MMATH | Accuracy | 70.5 | 36 |
| Factual Knowledge | Include Lite | Seen Accuracy | 41.38 | 21 |
| Factual Knowledge | Global MMLU-Lite | Seen Accuracy | 58.27 | 21 |
| General performance assessment | Overall Combined Benchmarks | Performance (Seen Data) | 48.26 | 21 |
| Math Reasoning | mGSM v2 | Accuracy (Seen) | 76.23 | 21 |
| Open-ended generation | CARE-pro | Score (Seen) | 16.21 | 21 |