
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

About

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in multi-reward settings without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards. This preserves their relative differences more faithfully and enables more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
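The collapse described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the equal-weight sum of rewards, the use of the population standard deviation, and the `EPS` constant are all assumptions made for illustration, and the full method may include steps the abstract does not mention. The sketch compares normalizing the summed reward within a group of rollouts (the coupled, GRPO-style scheme) with normalizing each reward separately before summing (the decoupled idea).

```python
from statistics import mean, pstdev

EPS = 1e-6  # assumed small constant to avoid division by zero


def group_normalize(values):
    """Standardize a list of values within one rollout group."""
    mu = mean(values)
    sigma = pstdev(values)  # population std; an assumption of this sketch
    return [(v - mu) / (sigma + EPS) for v in values]


def coupled_advantages(rewards):
    """Sum the rewards of each rollout first, then normalize the sums."""
    totals = [sum(r) for r in rewards]
    return group_normalize(totals)


def decoupled_advantages(rewards):
    """Normalize each reward across the group, then sum per rollout."""
    columns = list(zip(*rewards))  # one tuple of values per reward
    normalized = [group_normalize(col) for col in columns]
    return [sum(vals) for vals in zip(*normalized)]


# Two hypothetical groups of two rollouts, each scored by two binary rewards.
group_a = [(0, 0), (1, 0)]  # second rollout satisfies only the first reward
group_b = [(0, 0), (1, 1)]  # second rollout satisfies both rewards

print(coupled_advantages(group_a))    # approximately [-1, 1]
print(coupled_advantages(group_b))    # approximately [-1, 1]
print(decoupled_advantages(group_a))  # approximately [-1, 1]
print(decoupled_advantages(group_b))  # approximately [-2, 2]
```

Under the coupled scheme both groups yield the same advantages, even though the second rollout in `group_b` satisfies an extra reward. Under the decoupled scheme the two groups remain distinguishable.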

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Mathematical Reasoning | AIME 24 | Accuracy | 29.33 | 318
Mathematical Reasoning | OlympiadBench | Accuracy | 53.48 | 213
Mathematical Reasoning | AMC23 | PASS@1 Accuracy | 67 | 207
Role-play dialogue comprehension | SocialBench | Role Knowledge | 95.1 | 61
Role-playing | CharacterBench | MC | 4.425 | 50
Role-playing | CharacterBench latest (full) | Overall Score | 4.425 | 47
Science Question Answering | GPQA-Diamond Science N=198 | Accuracy | 39.4 | 36
Medical Question Answering | HealthBench Medicine N=5,000 (overall) | Rubric Score | 21.5 | 36
Deep Research Evaluation | Deep Research Bench first training epoch (step 600) | Readability | 43.22 | 17
Deep Research Evaluation | Deep Research Bench (step 1100) | Readability | 51.11 | 16
Showing 10 of 28 rows
