Rethinking Reward Signals in Video GRPO: When Scores Become Targets
About
Group Relative Policy Optimization (GRPO) enables stable, preference-oriented updates for post-training video generation models via group-wise comparisons. However, GRPO directly optimizes reward-induced advantages, and under sustained optimization the reward score can lose fidelity as a proxy for true video quality, consistent with Goodhart's Law. This leads to two recurring issues: (i) shortcut-driven optimization under composite objectives and (ii) reward saturation within prompt groups. To address these issues, we introduce TaRoS, a Target-Robust Reward Signaling framework for video-generation GRPO. TaRoS combines component-level performance assessment with intra-group sparsity to organize multi-aspect rewards toward the optimization objectives, and it adaptively down-weights components that exhibit saturation. This preserves meaningful optimization directions and within-group ranking separation, preventing reward hacking and yielding more reliable policy updates. Extensive experiments show consistent improvements in visual fidelity, motion coherence, and text-video alignment over strong baselines.
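To make the two ingredients concrete, the snippet below sketches (a) the standard GRPO group-relative advantage (a within-group z-score of rewards) and (b) a saturation-aware aggregation of multi-aspect reward components. This is a minimal illustration under our own assumptions, not the paper's implementation: the `saturation_tau` threshold, the standard-deviation spread measure, and the component names are placeholders for whatever component assessment and sparsity scheme TaRoS actually uses.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """Standard GRPO advantage: z-score each sample's reward within its
    prompt group. rewards has shape (num_groups, group_size)."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + 1e-8)

def saturation_aware_rewards(component_rewards: np.ndarray,
                             saturation_tau: float = 0.05) -> np.ndarray:
    """Down-weight reward components whose within-group spread has collapsed.
    A saturated component no longer separates the group ranking, so it
    contributes noise rather than an optimization direction.

    component_rewards: shape (num_components, group_size).
    saturation_tau and the spread-proportional weighting are illustrative
    assumptions, not the paper's exact formulation.
    """
    spread = component_rewards.std(axis=1)            # per-component intra-group spread
    weights = spread / (spread.sum() + 1e-8)          # sparse weighting: low spread -> low weight
    weights = np.where(spread < saturation_tau, 0.0, weights)  # zero out saturated components
    weights = weights / (weights.sum() + 1e-8)        # renormalize surviving components
    return weights @ component_rewards                # aggregated per-sample reward, shape (group_size,)

# Example: 3 hypothetical reward components over one prompt group of 4 videos.
comp = np.array([
    [0.90, 0.91, 0.90, 0.91],   # fidelity: saturated, near-zero spread -> dropped
    [0.40, 0.70, 0.55, 0.20],   # motion: still discriminative
    [0.50, 0.60, 0.30, 0.80],   # alignment: still discriminative
])
agg = saturation_aware_rewards(comp)
adv = group_relative_advantages(agg[None, :])  # advantages for the single group
```

In this toy case the saturated fidelity component is removed before aggregation, so the resulting advantages are driven entirely by the components that still rank the group, which is the effect the abstract attributes to TaRoS.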
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Video Generation | VBench | Quality Score | 83.66 | 155 |
| Video Generation | VBench (test) | -- | -- | 48 |
| Text Alignment | User Study | -- | -- | 12 |
| Video Generation | VBench | Image Quality | 68.91 | 10 |
| Realism & Quality | User Study | -- | -- | 4 |
| Motion Quality | User Study | Preference Share (Significantly Better) | 18.7 | 1 |