Group Sequence Policy Optimization
About
This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin • 2025
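The abstract's core idea, an importance ratio defined at the sequence level with sequence-level clipping, can be sketched as follows. This is a minimal illustration, not the paper's reference implementation: the length-normalized ratio s = (pi_new(y|x) / pi_old(y|x))^(1/|y|) and the PPO-style pessimistic clipped objective are taken from the paper's description, while the function name, inputs, and the clipping range value are illustrative assumptions.

```python
import math

def gspo_loss(logp_new, logp_old, advantages, clip_eps=3e-4):
    """Sketch of a sequence-level clipped objective in the spirit of GSPO.

    logp_new / logp_old: per-sequence lists of token log-probabilities under
    the current and old policies; advantages: per-sequence (e.g. group-
    normalized) advantages. clip_eps is an illustrative value, not the
    paper's tuned setting.
    """
    losses = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        n_tokens = len(lp_new)
        # Sequence-level importance ratio with length normalization:
        # s = (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|)
        s = math.exp((sum(lp_new) - sum(lp_old)) / n_tokens)
        # Sequence-level clipping, then the PPO-style pessimistic bound
        s_clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
        losses.append(-min(s * adv, s_clipped * adv))
    return sum(losses) / len(losses)
```

Because the ratio and the clip are defined per sequence rather than per token, every token in a response shares one importance weight, which is the property the paper credits for stabilizing MoE RL training.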
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH | Accuracy | 85.2 | 535 |
| Mathematical Reasoning | AIME | Accuracy | 26.7 | 283 |
| Mathematical Reasoning | AIME 2024 | Accuracy | 37.7 | 251 |
| Mathematical Reasoning | AIME 25 | Accuracy | 20.5 | 201 |
| Visual Mathematical Reasoning | MathVista | Accuracy | 81.0 | 189 |
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy | 70.8 | 167 |
| Mathematical Reasoning | CollegeMATH | -- | -- | 161 |
| Mathematical Reasoning | MATH 500 | pass@1 | 67.2 | 153 |
| Mathematical Reasoning | AMC | Accuracy | 62.7 | 151 |
| Mathematical Reasoning | Minerva | pass@1 | 33.5 | 138 |
Showing 10 of 91 benchmark rows.