GVPO: Group Variance Policy Optimization for Large Language Model Post-Training
About
Post-training plays a crucial role in refining and aligning large language models to meet specific tasks and human preferences. While recent post-training techniques, such as Group Relative Policy Optimization (GRPO), leverage increased sampling with relative reward scoring to achieve superior performance, these methods often suffer from training instability that limits their practical adoption. As a next step, we present Group Variance Policy Optimization (GVPO). GVPO incorporates the analytical solution to KL-constrained reward maximization directly into its gradient weights, ensuring alignment with the optimal policy. The method admits an intuitive physical interpretation: its gradient mirrors the mean squared error between the central distance of the implicit rewards and that of the actual rewards. GVPO offers two key advantages: (1) it guarantees a unique optimal solution, exactly the KL-constrained reward maximization objective, and (2) it supports flexible sampling distributions that avoid the limitations of on-policy and importance sampling. By unifying theoretical guarantees with practical adaptability, GVPO establishes a new paradigm for reliable and versatile LLM post-training.
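The MSE interpretation above can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation: it assumes per-response implicit rewards of the form `beta * (log pi_theta - log pi_ref)`, centers both the implicit and the actual rewards within a sampled group, and returns their mean squared difference; the function name and arguments are hypothetical.

```python
import numpy as np

def gvpo_loss(log_probs, ref_log_probs, rewards, beta=0.1):
    """Sketch of a GVPO-style group loss for one prompt's k sampled responses.

    Implicit reward per response (assumption): beta * (log pi_theta - log pi_ref).
    The loss is the mean squared error between the centered implicit rewards
    and the centered actual rewards, mirroring the gradient interpretation
    described in the abstract.
    """
    implicit = beta * (np.asarray(log_probs, float) - np.asarray(ref_log_probs, float))
    rewards = np.asarray(rewards, float)
    # Center each quantity within the group (distance from the group mean).
    implicit_centered = implicit - implicit.mean()
    rewards_centered = rewards - rewards.mean()
    return float(np.mean((implicit_centered - rewards_centered) ** 2))
```

When the policy equals the reference model, the implicit rewards vanish and the loss reduces to the variance of the actual rewards; it reaches zero exactly when the centered implicit rewards match the centered actual rewards.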
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Pass@1 | 92.9 | 102 |
| Mathematical Reasoning | AIME 2025 | Pass@1 | 52.1 | 96 |
| Mathematical Reasoning | AIME 2024 | Pass@1 | 57.5 | 86 |
| Mathematical Reasoning | Minerva Math | Pass@1 Accuracy | 44.2 | 82 |
| Mathematical Reasoning | Math Benchmarks Aggregate | Pass@1 | 70.3 | 44 |
| Mathematical Reasoning | AMC 2023 | Pass@1 | 86.3 | 30 |
| Code Generation | TACO Verified | During-task Accuracy | 76.9 | 29 |
| Language Understanding | MMLU | During-task Accuracy | 65.4 | 29 |
| Mathematical Reasoning | MATH 500 | During-task Accuracy | 75 | 29 |
| Mathematical Reasoning | MATH lighteval | During-task Accuracy | 72.7 | 29 |