Learning to Route Languages for Multilingual Policy Optimization

About

Large language models (LLMs) are trained on heterogeneous multilingual corpora, yet existing policy optimization methods often implicitly restrict each training question to a single response language or rely on a fixed dominant language for supervision. We propose language-routed policy optimization (LRPO), an online policy optimization framework that treats the response language as a selectable variable. LRPO elicits multilingual rollouts for each training question and integrates their relative quality into preference-based policy updates, increasing the diversity and informativeness of training signals under a fixed rollout budget. To adaptively determine which languages to explore during reinforcement learning, we introduce a trainable language router formulated as a multi-armed bandit, balancing exploration of underutilized languages with exploitation of more informative ones. Extensive experiments show that LRPO consistently improves multilingual performance, demonstrating that adaptive language routing enables effective cross-lingual knowledge exploitation during training. We release all the resources at https://github.com/Guochry/LRPO.
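To make the bandit formulation concrete, here is a minimal sketch of a language router. It is not the authors' implementation: the abstract does not say which bandit algorithm LRPO uses, so this sketch swaps in a standard UCB-style rule. The class name `UCBLanguageRouter`, the exploration constant `c`, and the scalar per-language reward are all hypothetical choices made for illustration.

```python
import math


class UCBLanguageRouter:
    """Hypothetical UCB-style bandit over response languages (an assumption,
    not the router described in the paper)."""

    def __init__(self, languages, c=1.0):
        self.languages = list(languages)
        self.c = c  # exploration strength (assumed hyperparameter)
        self.counts = {lang: 0 for lang in self.languages}
        self.values = {lang: 0.0 for lang in self.languages}  # running mean reward

    def select(self, k):
        """Return the k languages with the highest upper confidence bound."""
        total = sum(self.counts.values())

        def score(lang):
            n = self.counts[lang]
            if n == 0:
                return math.inf  # untried languages are explored first
            bonus = self.c * math.sqrt(math.log(total) / n)
            return self.values[lang] + bonus

        # sorted() is stable, so ties keep the original language order.
        return sorted(self.languages, key=score, reverse=True)[:k]

    def update(self, lang, reward):
        """Fold one observed reward into the running mean for `lang`."""
        self.counts[lang] += 1
        n = self.counts[lang]
        self.values[lang] += (reward - self.values[lang]) / n


router = UCBLanguageRouter(["en", "ja", "sw"])
first_pick = router.select(2)   # nothing tried yet, so order decides
router.update("en", 0.1)        # assumed reward: rollouts were uninformative
router.update("ja", 0.9)        # assumed reward: rollouts were informative
second_pick = router.select(1)  # "sw" is still untried
router.update("sw", 0.5)
third_pick = router.select(1)   # all tried once, highest mean reward wins
```

In a training loop, `select` would choose which languages receive rollouts for a question under the fixed budget, and `update` would feed back whatever signal measures how useful those rollouts were. How that signal is defined is left open here.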

Geyang Guo, Hiromi Wakaki, Yuki Mitsufuji, Alan Ritter, Wei Xu • 2026

Related benchmarks

Task | Dataset | Result | Rank
Regional knowledge and conversational settings | Care | Average Score: 51.26 | 21
General performance assessment | Overall Combined Benchmarks | Performance (Seen Data): 49.51 | 21
Math Reasoning | mGSM v2 | Accuracy (Seen): 77.49 | 21
Open-ended generation | CARE-pro | Score (Seen): 19.26 | 21
Factual Knowledge | Global MMLU-Lite | Seen Accuracy: 58.45 | 21
Factual Knowledge | Include Lite | Seen Accuracy: 41.1 | 21