
Value Flows

About

While most reinforcement learning methods today flatten the distribution of future returns to a single scalar value, distributional RL methods exploit the return distribution to provide stronger learning signals and to enable applications in exploration and safe RL. The predominant approaches estimate the return distribution by modeling it as a categorical distribution over discrete bins or by estimating a finite number of quantiles, leaving open questions about the fine-grained structure of the return distribution and about how to identify states with high return uncertainty for decision-making. The key idea in this paper is to use modern, flexible flow-based models to estimate the full future return distribution and to identify states with high return variance. We do so by formulating a new flow-matching objective that generates probability density paths satisfying the distributional Bellman equation. Building on the learned flow models, we estimate the return uncertainty of distinct states using a new flow-derivative ODE. We additionally use this uncertainty information to prioritize learning more accurate return estimates on certain transitions. We compare our method (Value Flows) with prior methods in the offline and offline-to-online settings. Experiments on 37 state-based and 25 image-based benchmark tasks demonstrate that Value Flows achieves a 1.3× improvement in success rates on average.
Website: https://pd-perry.github.io/value-flows
Code: https://github.com/chongyi-zheng/value-flows
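
For context, the (standard) distributional Bellman equation that the abstract refers to equates, in distribution, the return random variable with its one-step bootstrapped counterpart:

$$
Z^{\pi}(s, a) \stackrel{d}{=} R(s, a) + \gamma\, Z^{\pi}(S', A'), \qquad S' \sim P(\cdot \mid s, a),\; A' \sim \pi(\cdot \mid S').
$$

The paper's contribution is a flow-matching objective whose probability density paths satisfy this equation; the exact construction is in the linked paper.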
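As background only, the sketch below shows a minimal generic conditional flow-matching loss in PyTorch: a velocity field is regressed onto the straight-line path between noise samples and data samples (here, scalar returns). The network architecture, the linear path, and the placeholder return samples are illustrative assumptions, not the authors' Bellman-coupled objective.

```python
import torch
import torch.nn as nn

# Minimal conditional flow-matching sketch (background illustration only,
# not the Value Flows objective). A velocity field v_theta(x, t) is trained
# to match the velocity of a straight-line path from noise x0 to data x1.

class VelocityNet(nn.Module):
    def __init__(self, dim=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        # Condition the velocity field on time by concatenation.
        return self.net(torch.cat([x, t], dim=-1))

def flow_matching_loss(model, x1):
    """x1: batch of scalar return samples, shape (B, 1)."""
    x0 = torch.randn_like(x1)        # base noise samples
    t = torch.rand(x1.shape[0], 1)   # interpolation time in [0, 1]
    xt = (1 - t) * x0 + t * x1       # point on the linear probability path
    target_v = x1 - x0               # path velocity d(xt)/dt
    return ((model(xt, t) - target_v) ** 2).mean()

model = VelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
returns = torch.randn(256, 1) * 2 + 5  # fake return samples (placeholder)
for _ in range(100):
    opt.zero_grad()
    loss = flow_matching_loss(model, returns)
    loss.backward()
    opt.step()
```

Sampling from the learned distribution then amounts to integrating the ODE dx/dt = v_theta(x, t) from t = 0 to t = 1 starting from noise.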

Perry Dong, Chongyi Zheng, Chelsea Finn, Dorsa Sadigh, Benjamin Eysenbach • 2025

Related benchmarks

Task | Dataset | Result | Rank
Offline Reinforcement Learning | puzzle-4x4-play OGBench 5 tasks v0 | Average Success Rate: 27 | 28
Offline Reinforcement Learning | scene-play OGBench 5 tasks v0 | Average Success Rate: 59 | 26
Offline Reinforcement Learning | cube-double-play OGBench 5 tasks v0 | Average Success Rate: 69 | 19
Offline Reinforcement Learning | puzzle-3x3-play OGBench 5 tasks v0 | Average Success Rate: 87 | 19
Continuous Control | Walker2D v5 | Average Return: 2.63e+3 | 17
Continuous Control | Hopper v5 | Average Return: 3.31e+3 | 15
Continuous Control | Humanoid v5 | Average Return: 4.95e+3 | 13
Offline Reinforcement Learning | OGBench cube-triple-play | Success Rate: 14 | 10
Singletask Offline Reinforcement Learning (State-based) | OGBench State-based Singletask Offline v0 | Success Rate: 97 | 10
Offline Reinforcement Learning | D4RL adroit (12 tasks) | Success Rate: 50 | 10
Showing 10 of 24 rows
