FlowRL: Matching Reward Distributions for LLM Reasoning
About
We propose FlowRL, which matches the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
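The distribution-matching idea in the abstract can be illustrated with a toy example. The sketch below is not the FlowRL implementation: it assumes a tiny, fully enumerable set of three candidate answers, a hypothetical temperature `beta`, and a target of the form exp(beta * r) / Z. Because the set is enumerable, the partition function Z is computed exactly here, whereas the abstract describes it as learnable. The names `target_distribution` and `reverse_kl` are illustrative only.

```python
import math

def target_distribution(rewards, beta=1.0):
    """Normalize scalar rewards into p(y) = exp(beta * r(y)) / Z.

    Z is summed exactly over the toy candidate set; this is an assumption
    of the sketch, standing in for a learned partition function.
    """
    shift = max(beta * r for r in rewards)  # subtract max for numerical stability
    weights = [math.exp(beta * r - shift) for r in rewards]
    z = sum(weights)
    return [w / z for w in weights]

def reverse_kl(policy, target):
    """KL(policy || target) = sum_y policy(y) * log(policy(y) / target(y))."""
    return sum(q * math.log(q / p) for q, p in zip(policy, target) if q > 0.0)

# Three candidate reasoning paths: two are nearly equally good, one is poor.
rewards = [1.0, 0.9, 0.1]
target = target_distribution(rewards, beta=5.0)

# A reward-maximizing policy concentrates on the single best path.
peaked = [0.98, 0.01, 0.01]

kl_peaked = reverse_kl(peaked, target)   # positive: the second valid path is neglected
kl_matched = reverse_kl(target, target)  # zero: the policy equals the target

print("target:", [round(p, 3) for p in target])
print("KL(peaked || target):", kl_peaked)
print("KL(matched || target):", kl_matched)
```

In this toy setting the target keeps substantial mass on the second-best path, so a policy that collapses onto the top answer pays a reverse-KL penalty that a matched policy does not.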
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval+ | Pass@1 | 75.9 | 393 |
| Mathematical Reasoning | AIME 24 | Pass@1 | 24.9 | 59 |
| Mathematical Reasoning | AMC 2023 | Avg@16 Score | 85.8 | 48 |
| Mathematical Reasoning | Olympiad | Avg@16 Accuracy | 68.5 | 47 |
| Mathematical Reasoning | Minerva | Pass@1 Rate | 32.3 | 21 |
| Scientific Reasoning | Science Domain In-Domain: SampleQA, GPQA(ALL), HLE | SampleQA Score | 3.26 | 18 |
| Mathematical Reasoning | Math MATH500, AIME24, Minerva-Math, AMC23 | MATH500 Score | 84 | 18 |
| Code Reasoning | LiveCodeBench | Avg@16 | 42.4 | 12 |
| Code Reasoning | HumanEval+ | Pass@16 | 95.7 | 12 |
| Code Reasoning | CodeForces | Rating | 1.36e+3 | 12 |