Distributionally Robust Token Optimization in RLHF
About
Large Language Models (LLMs) tend to respond correctly to prompts that closely match the data they were trained and fine-tuned on. Yet small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this, we propose Distributionally Robust Token Optimization (DRTO), which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO constructs f-divergence ambiguity sets over span-level actor losses, providing a principled way to emphasize difficult response segments during policy optimization. Empirically, DRTO improves consistency under distribution shift across multiple reasoning benchmarks and tasks, with gains of $+4.4$ percentage points on MATH-500 and $+2.7$ percentage points on LiveCodeBench over standard RTO.
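To make the idea of reweighting span-level losses concrete, here is a minimal sketch rather than the paper's implementation. It assumes the KL divergence as the f-divergence, a penalized (not constrained) form of the ambiguity set, and a uniform reference distribution over spans. The function names and the temperature `tau` are hypothetical. Under these assumptions the worst-case span weights are a softmax of the losses, and the robust objective has the closed form `tau * log(mean(exp(loss / tau)))`.

```python
import math


def kl_dro_weights(span_losses, tau=1.0):
    """Worst-case span weights for a KL-penalized DRO objective.

    Assumption: uniform reference distribution over spans, so the
    maximizing weights are softmax(loss / tau).
    """
    m = max(span_losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - m) / tau) for loss in span_losses]
    z = sum(exps)
    return [e / z for e in exps]


def kl_dro_loss(span_losses, tau=1.0):
    """Closed-form value of max_q sum_i q_i * loss_i - tau * KL(q || uniform).

    Equals tau * log(mean(exp(loss / tau))), computed stably.
    """
    m = max(span_losses)
    n = len(span_losses)
    total = sum(math.exp((loss - m) / tau) for loss in span_losses)
    return m + tau * math.log(total / n)


# Hypothetical per-span actor losses for one response.
losses = [0.2, 0.5, 2.0]
weights = kl_dro_weights(losses, tau=0.5)
robust = kl_dro_loss(losses, tau=0.5)
```

The hardest span (loss 2.0) receives the largest weight, and the robust loss lies between the plain mean and the maximum of the span losses. A smaller `tau` moves it toward the maximum, while a larger `tau` moves it toward the mean.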
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MathQA | Accuracy | 45.2 | 354 |
| Math Reasoning | GSM8K | Accuracy | 79.9 | 254 |
| Mathematical Reasoning | GSM-PLUS | Accuracy | 57.2 | 90 |
| Math Reasoning | GSM CoT | Accuracy (GSM CoT) | 83.2 | 7 |
| Math Reasoning | GSM DE | Accuracy | 66 | 7 |
| Mathematical Reasoning | GSM8K ZH (test) | Accuracy (ZH) | 58 | 7 |
| Mathematical Reasoning | GSM8K DE (test) | Accuracy | 66 | 7 |
| Mathematical Reasoning | GSM8K ES (test) | Accuracy | 72 | 7 |
| Mathematical Reasoning | GSM8K FR (test) | Accuracy | 64 | 7 |