Distributionally Robust Token Optimization in RLHF

About

Large Language Models (LLMs) tend to respond correctly to prompts that resemble the data they were trained and fine-tuned on. Yet small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this, we propose Distributionally Robust Token Optimization (DRTO), which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO constructs f-divergence ambiguity sets over span-level actor losses, providing a principled way to emphasize difficult response segments during policy optimization. Empirically, DRTO improves consistency under distribution shift across multiple reasoning benchmarks and tasks, gaining $+4.4$ percentage points on MATH-500 and $+2.7$ percentage points on LiveCodeBench over standard RTO.
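
To make the idea of an ambiguity set over span-level losses concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a KL divergence (one member of the f-divergence family), a uniform reference distribution over spans, and a fixed dual temperature; the abstract does not say which divergence or formulation DRTO uses. The function names and the example losses are hypothetical. Under these assumptions, the worst-case reweighting is a softmax over span losses, and the robust objective is a log-mean-exp that sits between the average and the maximum span loss.

```python
import math


def kl_dro_weights(span_losses, temperature=1.0):
    """Worst-case span weights for a KL-penalized ambiguity set.

    Assumption: uniform reference distribution and a fixed temperature.
    The maximizer is proportional to exp(loss / temperature).
    """
    m = max(span_losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - m) / temperature) for loss in span_losses]
    total = sum(exps)
    return [e / total for e in exps]


def kl_dro_loss(span_losses, temperature=1.0):
    """Robust loss: temperature * log(mean(exp(loss / temperature)))."""
    m = max(span_losses)
    mean_exp = sum(
        math.exp((loss - m) / temperature) for loss in span_losses
    ) / len(span_losses)
    return m + temperature * math.log(mean_exp)


# Hypothetical per-span actor losses for one response; the last span is "hard".
span_losses = [0.2, 0.3, 1.5]
weights = kl_dro_weights(span_losses, temperature=0.5)
robust = kl_dro_loss(span_losses, temperature=0.5)
average = sum(span_losses) / len(span_losses)
```

In this sketch a smaller temperature concentrates more weight on the hardest span, while a large temperature recovers the plain average, which is one way to read "emphasizing difficult response segments".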

Yeping Jin, Jiaming Hu, Ioannis Ch. Paschalidis • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MathQA	Accuracy 45.2	354
Math Reasoning	GSM8K	Accuracy 79.9	254
Mathematical Reasoning	GSM-PLUS	Accuracy 57.2	90
Math Reasoning	GSM CoT	Accuracy (GSM CoT) 83.2	7
Math Reasoning	GSM DE	Accuracy 66	7
Mathematical Reasoning	GSM8K ZH (test)	Accuracy (ZH) 58	7
Mathematical Reasoning	GSM8K DE (test)	Accuracy 66	7
Mathematical Reasoning	GSM8K ES (test)	Accuracy 72	7
Mathematical Reasoning	GSM8K FR (test)	Accuracy 64	7
