Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening

About

Reinforcement learning is emerging as a primary driver for improving language model reasoning capabilities. A fundamental question is whether current reinforcement learning algorithms -- such as Group Relative Policy Optimization (GRPO), the de facto standard algorithm used to improve language model reasoning -- merely sharpen the base model's distribution around problems it can already solve. We investigate this question in the context of formal theorem proving, which has access to a perfect verifier. We identify a degenerate rank bias in GRPO in which highly probable trajectories are reinforced and rare ones are neglected. This results in distribution sharpening: the model can solve some problems with fewer samples, but it underperforms the simpler strategy of sampling more solutions from the original model. To overcome GRPO's rank bias, we introduce the unlikeliness reward, a simple method for explicitly up-weighting rare but correct solutions. We show that the unlikeliness reward mitigates rank bias and improves pass@$N$ across a large range of $N$ in both synthetic and real theorem proving settings. We also uncover an unexpected link between rank bias and a seemingly mundane hyperparameter -- the number of updates per batch -- that leads to a second, complementary mitigation. We combine our insights into a revised GRPO training recipe for formal theorem proving, yielding an open pipeline that achieves performance competitive with DeepSeek-Prover-V1.5-RL on the miniF2F-test benchmark. We release our implementation at https://github.com/AndreHe02/rewarding-unlikely-release
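To make the idea of up-weighting rare but correct solutions concrete, here is a minimal sketch of how a rank-based penalty could be combined with GRPO-style group-relative advantages. The abstract does not give the formula, so the multiplier `1 - beta_rank * (G - rank) / G`, the function names, the default `beta_rank=0.25`, and the toy inputs are all assumptions made for illustration, not the released implementation.

```python
from statistics import mean, pstdev


def unlikeliness_rewards(correct, logprobs, beta_rank=0.25):
    """Shrink the reward of correct samples that the policy already finds likely.

    correct:   0/1 verifier outcomes for the G samples drawn for one problem
    logprobs:  sequence log-probabilities of those samples under the current policy
    beta_rank: hypothetical penalty strength (assumed value, not from the source)
    """
    G = len(correct)
    # Rank 0 is the most probable sample, rank G-1 the least probable one.
    order = sorted(range(G), key=lambda i: logprobs[i], reverse=True)
    rank = [0] * G
    for r, i in enumerate(order):
        rank[i] = r
    # Incorrect samples keep reward 0; correct ones are scaled by their rank.
    return [
        float(correct[i]) * (1.0 - beta_rank * (G - rank[i]) / G)
        for i in range(G)
    ]


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within the group, as in GRPO-style training."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of four samples: three verified proofs and one failure.
correct = [1, 1, 0, 1]
logprobs = [-2.0, -9.0, -4.0, -5.0]

rewards = unlikeliness_rewards(correct, logprobs)
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

In this toy group the least probable correct sample (index 1) receives the largest reward and therefore the largest advantage, whereas with a plain 0/1 reward all three correct samples would share the same advantage.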

Andre He, Daniel Fried, Sean Welleck • 2025

Related benchmarks

Task	Dataset	Result
Creative Writing	WritingBench	Score 6.28	18
Formal Theorem Proving	Lean (test)	Pass@167.4	14
Creative Writing	Creative Writing EQ-Bench v3	ELO 718.3	13
Creative Writing	ArenaHard creative writing v2.0	WR Score 12.1	13
Reasoning	GSM8K	Distinct-1 0.26	4
Instruction Following	Dolly	Distinct-126	4
