TLPO: Token-Level Policy Optimization for Mitigating Language Confusion in Large Language Models

About

Large language models (LLMs) demonstrate strong multilingual capabilities, yet often fail to consistently generate responses in the intended language, exhibiting a phenomenon known as language confusion. Prior mitigation approaches based on sequence-level fine-tuning, such as DPO, ORPO, and GRPO, operate at the level of entire responses and can lead to unintended degradation of general model capabilities, motivating the need for more fine-grained alternatives. To address this, we introduce Token-Level Policy Optimization (TLPO), a fine-tuning framework designed to mitigate language confusion through localized, token-level updates. TLPO identifies error-prone positions, explores alternative candidate tokens, and updates the policy using a tailored objective to suppress error-inducing outputs at a granular level. This selective intervention enables effective mitigation of language confusion without compromising the model's general abilities. Experiments on multiple multilingual LLMs across diverse languages demonstrate that TLPO significantly outperforms baselines in improving language consistency while preserving downstream task accuracy.

Jinho Choo, JunSeung Lee, Jimyeong Kim, Yeeho Song, S. K. Hong, Yeong-Dae Kwon • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Reasoning | BBH | Accuracy | 50.9 | 726
Multitask Language Understanding | MMLU | Accuracy | 54.24 | 520
Graduate-level Question Answering | GPQA | Accuracy | 31.58 | 215
Math Word Problem Solving | GSM8K | Accuracy | 80.99 | 158
Mathematical Problem Solving | MATH | Accuracy | 47.73 | 75
Instruction Following | MIF (target) | Accuracy | 49.71 | 10
Instruction Following | MIF en | Accuracy | 69.22 | 10
Code Generation | LCB monolingual | RPR | 96.44 | 5
Graduate-level Q&A | GPQA diamond en | Accuracy | 33.87 | 5
Language Adherence | LCB cross-lingual | RPR | 97.68 | 5
Showing 10 of 27 rows
