
Group Sequence Policy Optimization

About

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential to simplify the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
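The core idea above can be sketched in a few lines. In this illustrative (non-official) snippet, the importance ratio is computed at the sequence level from summed per-token log-probabilities, with a length normalization so the ratio is the geometric mean of per-token ratios, and clipping is applied to that sequence-level ratio rather than per token. All names, the length-normalization choice, and the clipping range are assumptions for illustration, not the paper's reference implementation.

```python
import math

def gspo_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Sketch of a GSPO-style sequence-level clipped surrogate objective.

    logp_new / logp_old: per-sequence lists of per-token log-probabilities
    under the current and old policies. advantages: one scalar advantage
    per sequence (e.g. a group-normalized reward, as in GRPO).
    All names and defaults here are illustrative assumptions.
    """
    total = 0.0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        # Sequence-level importance ratio with length normalization:
        # s = (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|)
        ratio = math.exp((sum(new) - sum(old)) / len(new))
        # Clip the sequence-level ratio (PPO-style clipping, but applied
        # once per sequence instead of once per token).
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Example: a sequence whose likelihood rose under the new policy gets its
# ratio clipped at 1 + clip_eps before multiplying the advantage.
obj = gspo_objective(
    logp_new=[[-0.5, -0.5]],
    logp_old=[[-1.0, -1.0]],
    advantages=[1.0],
)
```

In the unclipped case the ratio here is exp((-1.0 - (-2.0)) / 2) ≈ 1.649, which the clip bounds to 1.2; this is the sense in which GSPO clips whole sequences rather than individual tokens.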

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | MATH | Accuracy 85.2 | 535 |
| Mathematical Reasoning | AIME | AIME Accuracy 26.7 | 283 |
| Mathematical Reasoning | AIME 2024 | Accuracy 37.7 | 251 |
| Mathematical Reasoning | AIME 25 | Accuracy 20.5 | 201 |
| Visual Mathematical Reasoning | MathVista | Accuracy 81 | 189 |
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy 70.8 | 167 |
| Mathematical Reasoning | CollegeMATH | -- | 161 |
| Mathematical Reasoning | MATH 500 | Pass@1 67.2 | 153 |
| Mathematical Reasoning | AMC | Accuracy 62.7 | 151 |
| Mathematical Reasoning | Minerva | Pass@1 33.5 | 138 |

Showing 10 of 91 rows.
