DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training

About

Reinforcement learning (RL) has shown strong performance in LLM post-training, but real-world deployment often involves noisy or incomplete supervision. In such settings, complex and unreliable supervision signals can destabilize training and harm generalization. While existing approaches such as worst-case optimization (e.g., RFQI, CQL) and mean-based methods (e.g., PPO, GRPO) can improve stability, they often overlook generalization and may produce overly conservative policies, leading to uneven performance across diverse real-world scenarios. To this end, we introduce DVPO (Distributional Value Modeling with Risk-aware Policy Optimization), a new RL framework that combines conditional risk theory with distributional value modeling to better balance robustness and generalization. DVPO learns token-level value distributions to provide fine-grained supervision, and applies an asymmetric risk regularization to shape the distribution tails: it contracts the lower tail to dampen noisy negative deviations, while expanding the upper tail to preserve exploratory diversity. Across extensive experiments and analysis in multi-turn dialogue, math reasoning, and scientific QA, DVPO consistently outperforms PPO, GRPO, and robust Bellman-based PPO under noisy supervision, showing its potential for LLM post-training in real-world settings.

Dingwei Zhu, Zhiheng Xi, Shihan Dou, Yuhui Wang, Sixian Li, Junjie Ye, Honglin Guo, Shichun Liu, Chenhao Huang, Yajie Yang, Junlin Shang, Senjie Jin, Ming Zhang, Jiazheng Zhang, Caishuang Huang, Yunke Zhang, Demei Yan, Yuran Wang, Tao Gui • 2025
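
The abstract describes two ingredients: a per-token value distribution and an asymmetric regularizer that contracts the lower tail while expanding the upper tail. This page does not give DVPO's actual loss, so the following is only an illustrative sketch under stated assumptions: the value distribution is represented as a list of quantile estimates, it is fit with a standard quantile-regression (pinball) loss, and the tail shaping is a simple spread penalty around the median. The function names `pinball_loss` and `asymmetric_tail_penalty` and the parameters `lower_weight` and `upper_weight` are hypothetical and do not come from the paper.

```python
from statistics import median
from typing import Sequence


def pinball_loss(quantiles: Sequence[float], target: float) -> float:
    """Quantile-regression (pinball) loss of predicted quantiles against one return sample.

    Assumption: the i-th entry estimates the quantile at level (i + 0.5) / n.
    """
    n = len(quantiles)
    total = 0.0
    for i, q in enumerate(quantiles):
        tau = (i + 0.5) / n  # midpoint quantile level for the i-th estimate
        diff = target - q
        total += max(tau * diff, (tau - 1.0) * diff)
    return total / n


def asymmetric_tail_penalty(
    quantiles: Sequence[float],
    lower_weight: float = 1.0,  # hypothetical strength of lower-tail contraction
    upper_weight: float = 0.1,  # hypothetical strength of upper-tail expansion
) -> float:
    """Illustrative stand-in for an asymmetric risk regularizer (not the paper's formula).

    Spread below the median is penalized (minimizing contracts the lower tail);
    spread above the median is rewarded (minimizing expands the upper tail).
    """
    m = median(quantiles)
    lower_spread = sum(m - q for q in quantiles if q < m)
    upper_spread = sum(q - m for q in quantiles if q > m)
    return lower_weight * lower_spread - upper_weight * upper_spread


if __name__ == "__main__":
    token_value_quantiles = [-2.0, -1.0, 0.0, 1.0, 2.0]
    print(pinball_loss(token_value_quantiles, target=0.5))
    print(asymmetric_tail_penalty(token_value_quantiles))
```

In this sketch the penalty decreases when the lower quantiles move toward the median or the upper quantiles move away from it. On its own the upper-tail term is unbounded, so it would only make sense alongside a fitting term such as the pinball loss and with a small `upper_weight`.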

Related benchmarks

Task | Dataset | Result | Rank
Mathematical Reasoning | AIME24 | Accuracy 56.67 | 130
Scientific Reasoning | GPQA | Accuracy 5.25 | 50
Mathematical Reasoning | MATH500 | Accuracy 90 | 45
Mathematical Reasoning | Math MATH500, AIME24, Minerva-Math, AMC23 | MATH500 Score 90.6 | 18
Scientific Reasoning | Science Domain In-Domain: SampleQA, GPQA(ALL), HLE | SampleQA Score 3.21 | 18
Mathematical Reasoning | Minerva Math | Avg@1 Accuracy 31.62 | 18
Mathematical Reasoning | AMC23 | Accuracy 87.5 | 11
General Reasoning & QA | All Evaluated Datasets | Average Accuracy 39.7 | 7
Mathematical Reasoning | Math Domain | Avg Accuracy 66.45 | 7
Scientific Reasoning & QA | SampleQA | Accuracy 3.31 | 7
Showing 10 of 13 rows
