
XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation

About

Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. While scaling the number of rollouts stabilizes training, existing approaches suffer from limited exploration on challenging prompts and leave informative feedback signals underexploited. Both problems stem from context-independent rollout allocation across prompts (e.g., generating 16 rollouts per prompt) and heavy reliance on sparse rewards. This paper presents XRPO (eXplore-eXploit GRPO), a unified framework that recasts policy optimization through the principled lens of rollout exploration-exploitation. To enhance exploration, XRPO introduces a mathematically grounded rollout allocator that adaptively prioritizes prompts with higher potential for uncertainty reduction. It further addresses stagnation on zero-reward prompts through an in-context seeding strategy that injects curated exemplars, steering the model into more difficult reasoning trajectories. To strengthen exploitation, XRPO develops a group-relative, novelty-aware advantage sharpening mechanism that leverages sequence likelihoods to amplify low-probability yet correct responses, thereby extending the policy's reach beyond sparse rewards. Experiments across diverse math and coding benchmarks on both reasoning and non-reasoning models demonstrate that XRPO outperforms existing methods (e.g., GRPO and GSPO) by up to 4% pass@1 and 6% cons@32, while accelerating training convergence by up to 2.7x.
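The two mechanisms named in the abstract can be illustrated with a rough sketch. The abstract does not give formulas, so everything below is an assumption made for illustration: `group_relative_advantages` is the usual GRPO-style within-group z-score, while `sharpen_with_novelty` (with its weight `lam`) and `allocate_rollouts` (with its smoothed standard-error heuristic) are hypothetical stand-ins, not the paper's actual definitions.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: z-score each reward within its prompt group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sharpen_with_novelty(rewards, seq_logprobs, lam=0.5, eps=1e-6):
    """Hypothetical novelty bonus (an assumption, not the paper's formula).

    Correct responses whose sequence log-likelihood falls below the mean
    log-likelihood of the correct responses in the group get a larger advantage.
    """
    adv = group_relative_advantages(rewards, eps)
    correct_lps = [lp for r, lp in zip(rewards, seq_logprobs) if r > 0]
    if not correct_lps:
        return adv  # no correct response, so nothing to amplify
    ref = mean(correct_lps)
    sharpened = []
    for a, r, lp in zip(adv, rewards, seq_logprobs):
        if r > 0:
            a += lam * max(0.0, ref - lp)  # bonus only for less likely answers
        sharpened.append(a)
    return sharpened


def allocate_rollouts(successes, trials, budget):
    """Hypothetical greedy allocator (an assumption, not the paper's allocator).

    Each extra rollout goes to the prompt whose smoothed standard error of
    the success-rate estimate would shrink the most.
    """
    # Smoothed success rate with one pseudo-success and one pseudo-failure.
    p = [(s + 1) / (n + 2) for s, n in zip(successes, trials)]
    counts = [n + 2 for n in trials]
    extra = [0] * len(trials)

    def gain(i):
        var = p[i] * (1 - p[i])
        return math.sqrt(var / counts[i]) - math.sqrt(var / (counts[i] + 1))

    for _ in range(budget):
        best = max(range(len(trials)), key=gain)
        extra[best] += 1
        counts[best] += 1
    return extra
```

Under these assumptions, a prompt that is solved about half the time attracts more of the extra budget than prompts that are almost always or almost never solved, and a correct but unlikely response ends up with a larger advantage than a correct, high-likelihood one in the same group.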

Udbhav Bamba, Minghao Fang, Yifan Yu, Haizhong Zheng, Fan Lai • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | AIME 2024 (test) | -- | 209 |
| Math Reasoning | MATH | Accuracy 90.54 | 160 |
| Mathematical Reasoning | AIME 2025 (test) | Pass@1 Rate 7.71 | 148 |
| Code Generation | CodeForces | Accuracy 13.8 | 4 |
| Math Reasoning | AIME 25 | Pass@1 35.72 | 4 |
| Math Reasoning | HMMT 25 (Feb) | Pass@1 22.29 | 4 |
| Math Reasoning | Brumo 25 | Pass@1 47.39 | 4 |
| Mathematical Reasoning | AIME, HMMT, BRUMO average '25 | Pass@4 48.65 | 4 |
| Math Reasoning | AIME 2024 | Pass@1 9.27 | 2 |
| Math Reasoning | AIME 2025 | Pass@1 0.63 | 2 |
Showing 10 of 11 rows
