Distributionally Robust Token Optimization in RLHF

About

Large Language Models (LLMs) tend to respond correctly to prompts that align with the data they were trained and fine-tuned on. Yet small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this problem, we propose a Distributionally Robust Token Optimization (DRTO) approach, which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO bounds worst-case token-wise rewards by constructing an f-divergence ambiguity set over the losses in a minibatch, yielding theoretical robustness guarantees. Empirically, DRTO improves consistency under distribution shifts on mathematical reasoning benchmarks, achieving a 9.17% improvement on GSM8K and a 2.49% improvement on MathQA.

Yeping Jin, Jiaming Hu, Ioannis Ch. Paschalidis • 2026
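The abstract does not spell out the construction, but DRO over a minibatch has a standard dual form. Below is a minimal, hypothetical PyTorch sketch assuming a KL-divergence ball (one member of the f-divergence family) around the empirical distribution of per-token losses; the function name `kl_dro_token_loss`, the default radius, and the grid search over the dual variable are illustrative choices, not the paper's implementation.

```python
import torch

def kl_dro_token_loss(token_losses: torch.Tensor, radius: float = 0.1) -> torch.Tensor:
    """Worst-case expected loss over a KL ball of the given radius around
    the empirical distribution of per-token losses, via the dual form

        sup_{Q : KL(Q || P) <= radius} E_Q[loss]
          = inf_{lam > 0} lam * log E_P[exp(loss / lam)] + lam * radius,

    with the scalar dual variable lam minimized by a coarse grid search.
    Hypothetical sketch; radius and grid are illustrative assumptions.
    """
    flat = token_losses.reshape(-1)
    lams = torch.logspace(-2, 2, steps=64, device=flat.device)  # candidate dual variables
    # log E_P[exp(loss / lam)] for every candidate lam, computed stably via logsumexp
    log_mgf = torch.logsumexp(flat[None, :] / lams[:, None], dim=1) \
              - torch.log(torch.tensor(float(flat.numel()), device=flat.device))
    dual_values = lams * log_mgf + lams * radius
    return dual_values.min()  # tightest upper bound over the grid
```

In a training step, this scalar would stand in for the plain minibatch mean of per-token losses, tilting gradients toward the worst-weighted tokens; a larger radius tilts harder toward the worst case, while the objective roughly recovers the empirical mean as the radius shrinks.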

Related benchmarks

Task                   | Dataset         | Metric             | Result | Rank
Mathematical Reasoning | MathQA          | Accuracy           | 45.2   | 305
Math Reasoning         | GSM8K           | Accuracy           | 79.9   | 187
Mathematical Reasoning | GSM-PLUS        | Accuracy           | 57.2   | 66
Math Reasoning         | GSM CoT         | Accuracy (GSM CoT) | 83.2   | 7
Math Reasoning         | GSM DE          | Accuracy           | 66     | 7
Mathematical Reasoning | GSM8K ZH (test) | Accuracy (ZH)      | 58     | 7
Mathematical Reasoning | GSM8K DE (test) | Accuracy           | 66     | 7
Mathematical Reasoning | GSM8K ES (test) | Accuracy           | 72     | 7
Mathematical Reasoning | GSM8K FR (test) | Accuracy           | 64     | 7