Learning to Route Languages for Multilingual Policy Optimization

About

Large language models (LLMs) are trained on heterogeneous multilingual corpora, yet existing policy optimization methods often implicitly restrict each training question to a single response language or rely on a fixed dominant language for supervision. We propose language-routed policy optimization (LRPO), an online policy optimization framework that treats language as a selectable variable. LRPO elicits multilingual rollouts for each training question and integrates their relative quality into preference-based policy updates, increasing the diversity and informativeness of training signals under a fixed rollout budget. To adaptively determine which languages to explore during reinforcement learning, we introduce a trainable language router formulated as a multi-armed bandit, balancing exploration of underutilized languages with exploitation of more informative ones. Extensive experiments show that LRPO consistently improves multilingual performance, demonstrating that adaptive language routing enables effective cross-lingual knowledge exploitation for training. We release all resources at https://github.com/Guochry/LRPO.
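The abstract does not specify the bandit algorithm or the reward used by the router, so the following is only a minimal sketch of the general idea, assuming a standard UCB1 bandit as a stand-in. The class name `LanguageRouter`, the `select`/`update` methods, the exploration constant `c`, and the per-language "informativeness" values in `simulate` are all hypothetical and are not taken from the LRPO release.

```python
import math
import random


class LanguageRouter:
    """UCB1 bandit over candidate response languages (illustrative stand-in)."""

    def __init__(self, languages, c=1.0):
        self.languages = list(languages)
        self.c = c  # exploration strength (assumed hyperparameter)
        self.counts = {lang: 0 for lang in self.languages}
        self.means = {lang: 0.0 for lang in self.languages}

    def select(self, k):
        """Pick k distinct languages for one question's rollouts."""
        total = sum(self.counts.values())

        def score(lang):
            n = self.counts[lang]
            if n == 0:
                return float("inf")  # try every language at least once
            bonus = self.c * math.sqrt(2.0 * math.log(total) / n)
            return self.means[lang] + bonus

        return sorted(self.languages, key=score, reverse=True)[:k]

    def update(self, lang, reward):
        """Fold one observed reward into the running mean for a language."""
        self.counts[lang] += 1
        self.means[lang] += (reward - self.means[lang]) / self.counts[lang]


def simulate(steps=200, k=2, seed=0):
    """Toy loop: rewards are Bernoulli draws from made-up per-language rates."""
    rng = random.Random(seed)
    true_signal = {"en": 0.3, "ja": 0.6, "sw": 0.5}  # fabricated for the demo
    router = LanguageRouter(true_signal)
    for _ in range(steps):
        for lang in router.select(k):
            reward = 1.0 if rng.random() < true_signal[lang] else 0.0
            router.update(lang, reward)
    return router
```

In an actual training loop, the reward passed to `update` would presumably be derived from how useful a language's rollouts were for the policy update, and `k` would be set by the rollout budget; both choices are assumptions here.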

Geyang Guo, Hiromi Wakaki, Yuki Mitsufuji, Alan Ritter, Wei Xu • 2026

Related benchmarks

Task	Dataset	Metric	Result
Regional knowledge and conversational settings	Care	Average Score	51.26	21
General performance assessment	Overall Combined Benchmarks	Performance (Seen Data)	49.51	21
Math Reasoning	mGSM v2	Accuracy (Seen)	77.49	21
Open-ended generation	CARE-pro	Score (Seen)	19.26	21
Factual Knowledge	Global MMLU-Lite	Seen Accuracy	58.45	21
Factual Knowledge	Include Lite	Seen Accuracy	41.1	21

