TLPO: Token-Level Policy Optimization for Mitigating Language Confusion in Large Language Models

About

Large language models (LLMs) demonstrate strong multilingual capabilities, yet often fail to consistently generate responses in the intended language, exhibiting a phenomenon known as language confusion. Prior mitigation approaches based on sequence-level fine-tuning, such as DPO, ORPO, and GRPO, operate at the level of entire responses and can lead to unintended degradation of general model capabilities, motivating the need for more fine-grained alternatives. To address this, we introduce Token-Level Policy Optimization (TLPO), a fine-tuning framework designed to mitigate language confusion through localized, token-level updates. TLPO identifies error-prone positions, explores alternative candidate tokens, and updates the policy using a tailored objective to suppress error-inducing outputs at a granular level. This selective intervention enables effective mitigation of language confusion without compromising the model's general abilities. Experiments on multiple multilingual LLMs across diverse languages demonstrate that TLPO significantly outperforms baselines in improving language consistency while preserving downstream task accuracy.
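The first two steps described above (locating error-prone positions and screening alternative candidate tokens) can be illustrated with a small sketch. This is not the authors' implementation: the Unicode-script heuristic, the helper names `char_script`, `confused_positions`, and `filter_candidates`, and the Korean example are all assumptions made for illustration. The policy update itself is omitted because the abstract does not specify the objective.

```python
import unicodedata

# Coarse script prefixes matched against Unicode character names (assumed heuristic).
SCRIPTS = ("HANGUL", "CJK", "HIRAGANA", "KATAKANA", "CYRILLIC", "ARABIC", "LATIN")


def char_script(ch):
    """Return a coarse script label for a letter, or None for script-neutral characters."""
    if not ch.isalpha():
        return None  # digits, punctuation, and whitespace carry no script signal
    name = unicodedata.name(ch, "")
    for script in SCRIPTS:
        if name.startswith(script):
            return script
    return "OTHER"


def confused_positions(tokens, allowed_scripts):
    """Indices of tokens containing letters outside the allowed scripts."""
    positions = []
    for i, tok in enumerate(tokens):
        scripts = {char_script(c) for c in tok} - {None}
        if scripts and not scripts <= allowed_scripts:
            positions.append(i)
    return positions


def filter_candidates(candidates, allowed_scripts):
    """Keep only candidate tokens that stay within the allowed scripts."""
    return [c for c in candidates if not confused_positions([c], allowed_scripts)]


# Hypothetical Korean response with one English token slipped in.
tokens = ["오늘", " 날씨", "는", " very", " 좋", "습니다", "."]
allowed = {"HANGUL"}

bad = confused_positions(tokens, allowed)
kept = filter_candidates([" very", " 매우", " 아주", " とても"], allowed)
print(bad)   # [3]
print(kept)  # [' 매우', ' 아주']
```

A script check like this only catches confusion between languages with different writing systems; separating, say, English from French would need a language identifier instead.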

Jinho Choo, JunSeung Lee, Jimyeong Kim, Yeeho Song, S. K. Hong, Yeong-Dae Kwon • 2026

Related benchmarks

Task	Dataset	Metric	Result
Reasoning	BBH	Accuracy	50.9	770
Multitask Language Understanding	MMLU	Accuracy	54.24	568
Graduate-level Question Answering	GPQA	Accuracy	31.58	224
Math Word Problem Solving	GSM8K	Accuracy	80.99	158
Mathematical Problem Solving	MATH	Accuracy	47.73	114
Instruction Following	MIF (target)	Accuracy	49.71	10
Instruction Following	MIF en	Accuracy	69.22	10
Code Generation	LCB monolingual	RPR	96.44	5
Graduate-level Q&A	GPQA diamond en	Accuracy	33.87	5
Language Adherence	LCB cross-lingual	RPR	97.68	5

Showing 10 of 27 rows
