Learning to Route Languages for Multilingual Policy Optimization
About
Large language models (LLMs) are trained on heterogeneous multilingual corpora, yet existing policy optimization methods often implicitly restrict each training question to a single response language or rely on a fixed dominant language for supervision. We propose language-routed policy optimization (LRPO), an online policy optimization framework that treats language as a selectable variable. LRPO elicits multilingual rollouts for each training question and integrates their relative quality into preference-based policy updates, increasing the diversity and informativeness of training signals under a fixed rollout budget. To adaptively determine which languages to explore during reinforcement learning, we introduce a trainable language router formulated as a multi-armed bandit, balancing exploration of underutilized languages with exploitation of more informative ones. Extensive experiments show that LRPO consistently improves multilingual performance, demonstrating that adaptive language routing enables effective cross-lingual knowledge exploitation during training. We release all resources at https://github.com/Guochry/LRPO.
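To make the bandit formulation of the router concrete, here is a minimal sketch. It is not the released LRPO implementation: the abstract does not say which bandit algorithm is used, so the standard UCB1 rule stands in for it. The class name `LanguageRouter`, the exploration constant `c`, and the scalar per-language reward (for example, the mean reward of that language's rollouts) are all assumptions made for illustration.

```python
import math


class LanguageRouter:
    """Illustrative UCB1 bandit over candidate response languages.

    Assumption: each language is one arm, and the reward passed to
    `update` is a scalar summary of how useful that language's rollouts
    were. This is a stand-in, not the router described in the paper.
    """

    def __init__(self, languages, c=1.0):
        self.languages = list(languages)
        self.c = c  # exploration strength (hypothetical hyperparameter)
        self.counts = {lang: 0 for lang in self.languages}
        self.values = {lang: 0.0 for lang in self.languages}

    def _score(self, lang, total):
        n = self.counts[lang]
        if n == 0:
            # Untried languages are always explored first.
            return math.inf
        # Mean reward plus an exploration bonus that shrinks with use.
        return self.values[lang] + self.c * math.sqrt(math.log(total) / n)

    def select(self, k):
        """Return the k languages with the highest UCB score."""
        total = sum(self.counts.values())
        ranked = sorted(
            self.languages,
            key=lambda lang: self._score(lang, total),
            reverse=True,
        )
        return ranked[:k]

    def update(self, lang, reward):
        """Fold one observed reward into the running mean for `lang`."""
        self.counts[lang] += 1
        n = self.counts[lang]
        self.values[lang] += (reward - self.values[lang]) / n
```

In a training loop under these assumptions, one would call `router.select(k)` to choose the languages for a question's rollouts, score the rollouts, and call `router.update(lang, reward)` for each selected language. The exploration bonus is what keeps rarely used languages from being dropped before they have been tried often enough.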
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Regional knowledge and conversational settings | Care | Average Score | 51.26 | 21 |
| General performance assessment | Overall Combined Benchmarks | Performance (Seen Data) | 49.51 | 21 |
| Math Reasoning | mGSM v2 | Accuracy (Seen) | 77.49 | 21 |
| Open-ended generation | CARE-pro | Score (Seen) | 19.26 | 21 |
| Factual Knowledge | Global MMLU-Lite | Seen Accuracy | 58.45 | 21 |
| Factual Knowledge | Include Lite | Seen Accuracy | 41.1 | 21 |