
Long-Horizon Model-Based Offline Reinforcement Learning Without Explicit Conservatism

About

Popular offline reinforcement learning (RL) methods rely on explicit conservatism, penalizing out-of-dataset actions or restricting rollout horizons. We question the universality of this principle and revisit a complementary Bayesian perspective for test-time adaptation. By modeling a posterior over world models and training a history-dependent agent to maximize expected return, the Bayesian approach directly addresses epistemic uncertainty without explicit conservatism. We first illustrate in a bandit setting that Bayesianism excels on low-quality datasets where conservatism fails. Scaling to realistic tasks, we find that long-horizon rollouts are essential to control value overestimation once conservatism is removed. We introduce design choices that enable learning from long-horizon rollouts while mitigating compounding model errors, yielding our algorithm, NEUBAY, grounded in the neutral Bayesian principle. On D4RL and NeoRL benchmarks, NEUBAY is competitive with leading conservative algorithms, achieving new state-of-the-art on 7 datasets with rollout horizons of several hundred steps. Finally, we characterize datasets by quality and coverage to identify when NEUBAY is preferable to conservative methods.
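The contrast between a neutral Bayesian choice and an explicitly conservative one can be illustrated with a toy two-armed Bernoulli bandit. This is a minimal sketch, not the paper's bandit experiment or the NEUBAY algorithm: the Beta(1, 1) prior, the dataset counts, the penalty coefficient, and the function names are all assumptions made for illustration. The "neutral" rule picks the arm with the highest posterior mean reward, while the "conservative" rule subtracts a multiple of the posterior standard deviation, which penalizes poorly covered arms.

```python
import math

def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    # Beta-Bernoulli conjugate update; the uniform Beta(1, 1) prior is an assumption.
    return prior_a + successes, prior_b + failures

def posterior_mean(a, b):
    return a / (a + b)

def posterior_std(a, b):
    # Standard deviation of a Beta(a, b) distribution.
    n = a + b
    return math.sqrt(a * b / (n * n * (n + 1)))

def choose_arm(counts, penalty):
    # penalty = 0 gives the neutral Bayesian choice (maximize posterior mean).
    # penalty > 0 gives a lower-confidence-bound style conservative choice.
    scores = []
    for successes, failures in counts:
        a, b = beta_posterior(successes, failures)
        scores.append(posterior_mean(a, b) - penalty * posterior_std(a, b))
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical low-quality dataset: arm 0 is well covered but poor,
# arm 1 has only two observations.
counts = [(30, 70), (1, 1)]

neutral_arm = choose_arm(counts, penalty=0.0)
conservative_arm = choose_arm(counts, penalty=2.0)
print(neutral_arm, conservative_arm)
```

With these made-up counts, the conservative rule stays with the well-covered but poor arm, while the neutral rule prefers the uncertain arm whose posterior mean is higher. The full Bayesian approach described in the abstract goes further by adapting to the observed history at test time, which this one-shot sketch does not model.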

Tianwei Ni, Esther Derman, Vineet Jain, Vincent Taboga, Siamak Ravanbakhsh, Pierre-Luc Bacon • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | Normalized Score | 109.5 | 169 |
| Offline Reinforcement Learning | D4RL hopper-medium-expert | Normalized Score | 114.8 | 161 |
| Offline Reinforcement Learning | D4RL Medium-Replay Hopper | Normalized Score | 110.6 | 109 |
| Offline Reinforcement Learning | D4RL Medium HalfCheetah | Normalized Score | 78.6 | 105 |
| Offline Reinforcement Learning | D4RL Medium Walker2d | Normalized Score | 106.4 | 104 |
| Offline Reinforcement Learning | D4RL walker2d-random | Normalized Score | 34.1 | 101 |
| Offline Reinforcement Learning | D4RL Medium-Replay HalfCheetah | Normalized Score | 72.1 | 97 |
| Offline Reinforcement Learning | D4RL halfcheetah-random | Normalized Score | 37 | 94 |
| Offline Reinforcement Learning | D4RL walker2d medium-replay | Normalized Score | 99.3 | 62 |
| Offline Reinforcement Learning | D4RL Adroit pen (cloned) | Normalized Return | 91.3 | 53 |
Showing 10 of 25 rows
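
D4RL conventionally reports a normalized score that rescales a policy's raw return so that a random policy maps to 0 and an expert policy maps to 100. Assuming the results listed here follow that convention, the computation can be sketched as below. The function name and the reference returns in the usage lines are hypothetical placeholders, not values from the benchmark.

```python
def normalized_score(raw_return, random_return, expert_return):
    # D4RL-style normalization: 0 corresponds to the random reference policy,
    # 100 to the expert reference policy; values above 100 exceed the expert.
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)

# Hypothetical reference returns, chosen only to illustrate the scaling.
random_ref = 0.0
expert_ref = 200.0

half_way = normalized_score(100.0, random_ref, expert_ref)
above_expert = normalized_score(220.0, random_ref, expert_ref)
print(half_way, above_expert)
```

Under this scaling, a result above 100 means the learned policy collected more return than the expert reference used for normalization.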
