F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare

About

Reinforcement Learning with Verifiable Rewards (RLVR) commonly relies on group sampling to estimate advantages and stabilize policy updates. In practice, computational limits often rule out very large groups, so training proceeds with finite rollout sets that can reinforce only the correct behavior they expose. At practical group sizes, a group can miss rare-correct trajectories while still containing mixed rewards, so updates concentrate probability on the more common sampled solutions. We derive the probability of such prompt-local tail-miss events as a function of group size, showing that it is non-monotonic, and, in a categorical abstraction, characterize how unsampled-correct mass can shrink even as total correct mass grows. Motivated by this analysis, we propose a difficulty-aware scaling coefficient, inspired by Focal loss, that down-weights updates on high-success sampled groups. Empirically, a categorical simulation illustrates the effect, Maze provides a single-solution test, and LLM experiments include a representative GRPO group-size sweep together with fixed-$N$ transfer across GRPO, DAPO, and CISPO. On Qwen2.5-7B at $N{=}8$, our method improves average math pass@256 from 64.1 $\rightarrow$ 70.3 (GRPO), 69.3 $\rightarrow$ 72.5 (DAPO), and 73.2 $\rightarrow$ 76.8 (CISPO); OOD pass@256 also improves in all three cases, without increasing group size or computational cost.
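The two ideas in the abstract can be illustrated with a minimal sketch. It is not the paper's formulation: the three-way split of a prompt's probability mass into rare-correct, common-correct, and incorrect outcomes, the function names, and the focal-style weight `(1 - success_rate) ** gamma` are all assumptions made here for illustration. Under that split, a "tail miss" is a group of `n` i.i.d. rollouts that contains both correct and incorrect samples but no rare-correct one.

```python
import math


def tail_miss_probability(n, p_rare, p_common):
    """Probability that n i.i.d. rollouts have mixed rewards but no rare-correct sample.

    Assumed toy model (not from the paper): each rollout is rare-correct with
    probability p_rare, common-correct with p_common, and incorrect otherwise.
    """
    p_wrong = 1.0 - p_rare - p_common
    no_rare = (1.0 - p_rare) ** n   # no rare-correct sample in the group
    all_wrong = p_wrong ** n        # uniform reward 0: no learning signal
    all_common = p_common ** n      # uniform reward 1: no learning signal
    # The two uniform-reward events are disjoint subsets of the no-rare event.
    return no_rare - all_wrong - all_common


def focal_scaled_advantages(rewards, gamma=1.0, eps=1e-6):
    """Group-normalized advantages scaled by an assumed focal-style weight.

    The weight (1 - success_rate) ** gamma shrinks updates on high-success
    groups; gamma = 0 recovers plain mean/std normalization.
    """
    n = len(rewards)
    mean = sum(rewards) / n  # empirical success rate for 0/1 rewards
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    weight = (1.0 - mean) ** gamma
    return [weight * (r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # The tail-miss probability rises and then falls as the group grows.
    for n in (1, 2, 8, 32, 256):
        print(n, tail_miss_probability(n, p_rare=0.02, p_common=0.30))
    # An easy group (3 of 4 correct) gets smaller updates than a hard one (1 of 4).
    print(focal_scaled_advantages([1, 1, 1, 0]))
    print(focal_scaled_advantages([0, 0, 0, 1]))
```

In this toy model the probability is zero at `n = 1` (a single sample cannot have mixed rewards), peaks at moderate group sizes, and decays once the group is large enough to almost surely contain a rare-correct sample, which matches the non-monotonic behavior the abstract describes.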

Daniil Plyusov, Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daria Korotyshova, Daniil Gavrilov • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH500 (test)	--	922
Instruction Following	IFEval	--	854
Mathematical Reasoning	AIME 2024 (test)	--	294
Mathematical Reasoning	AIME 2025 (test)	Pass@1 Rate 21.25	191
Mathematical Reasoning	AMC (test)	--	65
Math Reasoning	AMC 2023 (test)	Pass@1 61.77	57
Mathematical Reasoning	MATH500	Pass@1 79.1	40
Mathematical Reasoning	Minerva	pass@1 35.7	32
Logical reasoning	SynLogic	pass@1 8.7	18
Mathematical Reasoning	AIME 25	Pass@1 13	18

