
Rethinking Reward Signals in Video GRPO: When Scores Become Targets

About

Group Relative Policy Optimization (GRPO) enables stable, preference-oriented post-training of video generation models via group-wise comparisons. However, GRPO directly optimizes reward-induced advantages, and under sustained optimization the reward score loses fidelity as a proxy for true video quality, consistent with Goodhart's Law. This leads to two recurring issues: (i) shortcut-driven optimization under composite objectives and (ii) reward saturation within prompt groups. To address these issues, we introduce TaRoS, a Target-Robust Reward Signaling framework for video-generation GRPO. TaRoS combines component-level performance assessment with intra-group sparsity to align multi-aspect rewards with the optimization objective, and it adaptively downweights components that exhibit saturation. This preserves meaningful optimization directions and within-group ranking separation, thereby preventing reward hacking and yielding more reliable policy updates. Extensive experiments show consistent improvements in visual fidelity, motion coherence, and text-video alignment over strong baselines.
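The abstract does not give TaRoS's exact formulation, but the core idea — group-relative advantages over a composite reward whose saturated components are adaptively downweighted — can be sketched as follows. The function name, the std-based saturation measure, and the `saturation_eps` parameter are illustrative assumptions, not the paper's actual method:

```python
import numpy as np

def group_relative_advantages(component_rewards, saturation_eps=0.05):
    """Hypothetical sketch of saturation-aware reward shaping for GRPO.

    component_rewards: shape (G, K) -- rewards for G rollouts of one prompt
    group across K reward components (e.g. fidelity, motion, alignment).
    Components with little within-group spread are treated as saturated
    and downweighted, so they stop dominating the composite objective.
    """
    r = np.asarray(component_rewards, dtype=float)   # (G, K)
    spread = r.std(axis=0)                           # per-component intra-group std
    # Adaptive weights: a saturated component (tiny spread) gets weight near 0.
    weights = spread / (spread + saturation_eps)
    combined = r @ weights                           # (G,) composite reward
    # Standard GRPO group-relative normalization: zero-mean, unit-scale advantages.
    adv = (combined - combined.mean()) / (combined.std() + 1e-8)
    return adv, weights
```

For example, if one component is identical across all rollouts in a group (fully saturated), its weight collapses toward zero and the remaining components determine the within-group ranking, which is the separation-preserving behavior the paper describes.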

Rui Li, Yuanzhi Liang, Ziqi Ni, Haibing Huang, Chi Zhang, Xuelong Li • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Video Generation | VBench | Quality Score: 83.66 | 155 |
| Video Generation | VBench (test) | -- | 48 |
| Text Alignment | User Study | -- | 12 |
| Video Generation | VBench | Image Quality: 68.91 | 10 |
| Realism & Quality | User Study | -- | 4 |
| Motion Quality | User Study | Preference Share (Significantly Better): 18.7 | 1 |
