FlowRL: Matching Reward Distributions for LLM Reasoning
About
We propose FlowRL: matching the full reward distribution via flow balancing, instead of maximizing rewards, in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and this target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
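The core idea above can be illustrated with a minimal sketch of a flow-balance (trajectory-balance-style) residual, where a learnable log-partition estimate `log_z` normalizes the reward into a target distribution. This is an assumed simplification, not the paper's implementation: the function name, the `beta` temperature, and the optional reference-policy term are all hypothetical illustration choices.

```python
def flow_balance_loss(log_pi, reward, log_z, beta=1.0, log_pi_ref=0.0):
    """Squared flow-balance residual for one sampled trajectory (a sketch).

    log_pi     : log-probability of the trajectory under the current policy
    reward     : scalar reward r(x, y) for the trajectory
    log_z      : learnable estimate of the log partition function for prompt x
    beta       : reward temperature (hypothetical default)
    log_pi_ref : log-probability under a reference policy (0.0 drops the term)
    """
    # At the optimum the residual is zero, i.e.
    #   log_z + log_pi == beta * reward + log_pi_ref,
    # so pi(y|x) is proportional to exp(beta * r(x, y)) * pi_ref(y|x):
    # the policy matches the reward distribution instead of collapsing
    # onto a single highest-reward trajectory.
    residual = log_z + log_pi - beta * reward - log_pi_ref
    return residual ** 2

# Example: a trajectory with log_pi = -3.0 and reward 1.0, with the
# current partition estimate log_z = 2.0, still carries a nonzero loss.
loss = flow_balance_loss(log_pi=-3.0, reward=1.0, log_z=2.0)
```

Because every trajectory with the same reward incurs the same penalty when under- or over-weighted, gradients push probability mass toward all valid reasoning paths, which is the diversity-promoting behavior the abstract describes.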
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | AIME 24 | Pass@1: 24.9 | 59 |
| Scientific Reasoning | Science Domain (In-Domain: SampleQA, GPQA (All), HLE) | SampleQA Score: 3.26 | 18 |
| Mathematical Reasoning | MATH500, AIME24, Minerva-Math, AMC23 | MATH500 Score: 84 | 18 |
| Mathematical Reasoning | MATH 500 | Pass@1: 76.4 | 12 |
| Mathematical Reasoning | OlympiadBench | Pass@1: 39.6 | 12 |
| Mathematical Reasoning | Minerva | Pass@1 Rate: 32.3 | 12 |
| Mathematical Reasoning | AIME 25 | Pass@1: 9.4 | 12 |
| Mathematical Problem Solving | Math Domain (Out-of-Domain: MATH500, AIME24, Minerva-Math, AMC23) | MATH500 Score: 89.6 | 11 |
| Science and Question Answering | SampleQA, GPQA, HLE | SampleQA Score: 1.76 | 11 |
| Mathematical Reasoning | Math Domain (In-Domain) | MATH500: 90.2 | 11 |