FlowRL: Matching Reward Distributions for LLM Reasoning

About

We propose FlowRL, which matches the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
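The distribution-matching idea in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a tiny discrete set of candidate responses, a hypothetical temperature `beta`, and a partition function computed in closed form (in the paper it is learnable). The function names and the squared-residual view of flow balance are illustrative assumptions.

```python
import math

def target_distribution(rewards, beta=1.0):
    """Map scalar rewards to a normalized distribution p(y) = exp(beta * r(y)) / Z.

    Assumption: Z is computed exactly here; the abstract describes a learnable one.
    """
    scaled = [beta * r for r in rewards]
    m = max(scaled)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [math.exp(s - log_z) for s in scaled], log_z

def reverse_kl(policy, target):
    """KL(policy || target) = sum_y policy(y) * log(policy(y) / target(y))."""
    return sum(p * math.log(p / q) for p, q in zip(policy, target) if p > 0)

def balance_residuals(policy, rewards, log_z, beta=1.0):
    """Residual log Z + log pi(y) - beta * r(y); zero for every y when pi equals the target."""
    return [log_z + math.log(p) - beta * r for p, r in zip(policy, rewards)]

# Hypothetical rewards for three candidate reasoning paths.
rewards = [1.0, 0.9, 0.1]
target, log_z = target_distribution(rewards)

# A policy collapsed onto the top-reward path, as pure reward maximization would favor.
collapsed = [0.98, 0.01, 0.01]

print("target distribution:", target)
print("reverse KL (collapsed policy):", reverse_kl(collapsed, target))
print("reverse KL (matched policy):", reverse_kl(target, target))
print("balance residuals (matched policy):", balance_residuals(target, rewards, log_z))
```

In this toy setting the collapsed policy has a positive reverse KL because it ignores the second path, whose reward is nearly as high, while a policy equal to the target drives both the divergence and every balance residual to zero.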

Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo Xue, Yunchong Song, Zhenjie Yang, Ganqu Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei, Zhouhan Lin • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | AIME 24 | Pass@1 | 24.9 | 59 |
| Scientific Reasoning | Science Domain In-Domain: SampleQA, GPQA(ALL), HLE | SampleQA Score | 3.26 | 18 |
| Mathematical Reasoning | Math MATH500, AIME24, Minerva-Math, AMC23 | MATH500 Score | 84 | 18 |
| Mathematical Reasoning | MATH 500 | Pass@1 | 76.4 | 12 |
| Mathematical Reasoning | OlympiadBench | Pass@1 | 39.6 | 12 |
| Mathematical Reasoning | Minerva | Pass@1 Rate | 32.3 | 12 |
| Mathematical Reasoning | AIME 25 | Pass@1 | 9.4 | 12 |
| Mathematical Problem Solving | Math Domain (Out-of-Domain: MATH500, AIME24, Minerva-Math, AMC23) | MATH500 Score | 89.6 | 11 |
| Science and Question Answering | Science & QA SampleQA, GPQA, HLE | SampleQA Score | 1.76 | 11 |
| Mathematical Reasoning | Math Domain In-Domain | MATH500 | 90.2 | 11 |
Showing 10 of 31 rows
