
DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training

About

Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. This yields coarse-grained value representations that lack fine-grained conditioning on state information and struggle under complex and OOD conditions. We propose DFPO (Distributional Value Flow Policy Optimization with Conditional Risk and Consistency Control), a robust distributional RL framework that models values as continuous flows across time steps. By scaling value modeling through learning a value flow field instead of isolated quantile predictions, DFPO captures richer state information for more accurate advantage estimation. To stabilize training under noisy feedback, DFPO further integrates conditional risk control and consistency constraints along value flow trajectories. Experiments on dialogue, math reasoning, and scientific tasks show that DFPO outperforms PPO, FlowRL, and other robust baselines under noisy supervision, achieving improved training stability and generalization.
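For readers unfamiliar with the quantile-based value modeling that the abstract contrasts DFPO against, the following is a minimal sketch of the standard quantile (pinball) regression loss, in which each quantile is a separate scalar estimate. It is not DFPO's method and is not taken from the paper. The function names `pinball_loss` and `quantile_value_loss` and the choice of midpoint quantile levels are assumptions made for illustration only.

```python
from typing import Sequence


def pinball_loss(pred: float, target: float, tau: float) -> float:
    """Quantile (pinball) loss for one scalar quantile estimate at level tau."""
    diff = target - pred
    # Under-estimates are weighted by tau, over-estimates by (1 - tau).
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def quantile_value_loss(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean pinball loss over N independent quantile estimates.

    Assumption (illustrative): preds[i] estimates the quantile at the
    midpoint level (i + 0.5) / N, and targets are sampled returns.
    """
    n = len(preds)
    total = 0.0
    for i, q in enumerate(preds):
        tau = (i + 0.5) / n
        for g in targets:
            total += pinball_loss(q, g, tau)
    return total / (n * len(targets))


if __name__ == "__main__":
    returns = [0.0, 1.0, 2.0, 3.0, 4.0]
    # A single quantile head at tau = 0.5 is best placed at the median.
    print(quantile_value_loss([2.0], returns))
    print(quantile_value_loss([0.0], returns))
```

Each entry of `preds` is fitted as an isolated scalar here, which is the property the abstract describes as limiting; a flow-based value model would replace these independent estimates.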

Dingwei Zhu, Zhiheng Xi, Shihan Dou, Jiahan Li, Chenhao Huang, Junjie Ye, Sixian Li, Mingxu Chai, Yuhui Wang, Yajie Yang, Ming Zhang, Jiazheng Zhang, Shichun Liu, Caishuang Huang, Yunke Zhang, Yuran Wang, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang • 2026

Related benchmarks

Task | Dataset | Result | Rank
Scientific Reasoning | Science Domain In-Domain: SampleQA, GPQA(ALL), HLE | SampleQA Score: 3.17 | 18
Mathematical Reasoning | Math MATH500, AIME24, Minerva-Math, AMC23 | MATH500 Score: 85 | 18
Mathematical Problem Solving | Math Domain (Out-of-Domain: MATH500, AIME24, Minerva-Math, AMC23) | MATH500 Score: 91.8 | 11
Mathematical Reasoning | Math Domain In-Domain | MATH500: 91 | 11
Science and Question Answering | Science & QA SampleQA, GPQA, HLE | SampleQA Score: 1.76 | 11
Scientific Question Answering | Science & QA Domain Out-of-Domain | SampleQA Score: 2.84 | 11
Dialog | Transportation & Travel Out-of-Domain | Accuracy: 90.5 | 6
Dialog | Financial Services Out-of-Domain | Accuracy: 84.3 | 6
Dialog | Real-world Dialog Domains Aggregate | Average Accuracy: 87.31 | 6
Dialog | Life Services In-Domain | Accuracy: 85.67 | 6
Showing 10 of 13 rows
