
CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models

About

This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO, while effective, incurs high training costs because it must sample multiple completions for each question. Our experimental and theoretical analyses reveal that the number of completions affects model accuracy yet increases training time multiplicatively, and that not all completions contribute equally to policy training: their contribution depends on their relative advantage. To address these issues, we propose CPPO, which prunes completions with low absolute advantages, significantly reducing the number needed for gradient computation and updates. Additionally, we introduce a dynamic completion allocation strategy that maximizes GPU utilization by incorporating additional questions, further enhancing training efficiency. Experiments show that CPPO achieves up to a $7.98\times$ speedup on GSM8K and $3.48\times$ on MATH while preserving or even improving accuracy compared to the original GRPO. We release our code at https://github.com/lzhxmu/CPPO.
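The core mechanism the abstract describes (normalize rewards within a group to get relative advantages, then discard completions whose absolute advantage is small before the gradient step) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the `keep_ratio` knob is a hypothetical stand-in for whatever pruning schedule CPPO actually uses.

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each completion's reward within
    its group, A_i = (r_i - mean) / std. A degenerate group (std == 0)
    yields all-zero advantages."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

def prune_completions(advantages, keep_ratio=0.5):
    """CPPO-style pruning (sketch): keep only the completions with the
    largest |advantage|; the rest are excluded from gradient computation.
    Returns the kept indices in ascending order."""
    k = max(1, int(len(advantages) * keep_ratio))
    ranked = sorted(range(len(advantages)), key=lambda i: -abs(advantages[i]))
    return sorted(ranked[:k])
```

Since pruned completions never enter the backward pass, the per-step cost scales with the kept fraction rather than the full group size, which is where the reported speedup comes from.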

Zhihang Lin, Mingbao Lin, Yuan Xie, Rongrong Ji • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Accuracy | 75.4 | 442 |
| Mathematical Reasoning | AMC | Accuracy | 41.87 | 221 |
| Mathematical Reasoning | Minerva Math | Accuracy | 27.57 | 209 |
| Mathematical Reasoning | Olympiad Bench | Accuracy | 18.85 | 123 |
| Mathematical Reasoning | OlympiadBench | Accuracy | 31.53 | 82 |
| Mathematical Reasoning | AIME 24/25 | Accuracy | 6.67 | 64 |
| Knowledge-intensive reasoning | GPQA | Score | 38.89 | 14 |
| Mathematical Reasoning | AIME | Accuracy | 23.33 | 14 |
| Long-horizon Mathematical Reasoning | MATH | Accuracy | 74.43 | 14 |
| Mathematical Reasoning | GSM8K | Accuracy | 90.52 | 14 |
