
Group Sequence Policy Optimization

About

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential to simplify the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
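The core idea above can be sketched in a few lines. In this illustrative (non-official) snippet, the importance ratio is computed at the sequence level from summed per-token log-probabilities, with a length normalization so the ratio is the geometric mean of per-token ratios, and clipping is applied to that sequence-level ratio rather than per token. All names, the length-normalization choice, and the clipping range are assumptions for illustration, not the paper's reference implementation.

```python
import math

def gspo_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Sketch of a GSPO-style sequence-level clipped surrogate objective.

    logp_new / logp_old: per-sequence lists of per-token log-probabilities
    under the current and old policies. advantages: one scalar advantage
    per sequence (e.g. a group-normalized reward, as in GRPO).
    All names and defaults here are illustrative assumptions.
    """
    total = 0.0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        # Sequence-level importance ratio with length normalization:
        # s = (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|)
        ratio = math.exp((sum(new) - sum(old)) / len(new))
        # Clip the sequence-level ratio (PPO-style clipping, but applied
        # once per sequence instead of once per token).
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Example: a sequence whose likelihood rose under the new policy gets its
# ratio clipped at 1 + clip_eps before multiplying the advantage.
obj = gspo_objective(
    logp_new=[[-0.5, -0.5]],
    logp_old=[[-1.0, -1.0]],
    advantages=[1.0],
)
```

In the unclipped case the ratio here is exp((-1.0 - (-2.0)) / 2) ≈ 1.649, which the clip bounds to 1.2; this is the sense in which GSPO clips whole sequences rather than individual tokens.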

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | MATH | Accuracy 85.2 | 535 |
| Mathematical Reasoning | AIME | AIME Accuracy 26.7 | 283 |
| Mathematical Reasoning | AIME 2024 | Accuracy 37.7 | 251 |
| Mathematical Reasoning | AIME 25 | Accuracy 20.5 | 201 |
| Visual Mathematical Reasoning | MathVista | Accuracy 81 | 189 |
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy 70.8 | 167 |
| Mathematical Reasoning | CollegeMATH | -- | 161 |
| Mathematical Reasoning | MATH 500 | Pass@1 67.2 | 153 |
| Mathematical Reasoning | AMC | Accuracy 62.7 | 151 |
| Mathematical Reasoning | Minerva | Pass@1 33.5 | 138 |

Showing 10 of 91 rows.
