Long-Horizon Model-Based Offline Reinforcement Learning Without Conservatism

About

Popular offline reinforcement learning (RL) methods rely on conservatism, either by penalizing out-of-dataset actions or by restricting rollout horizons. In this work, we question the universality of this principle and instead revisit a complementary one: a Bayesian perspective. Rather than enforcing conservatism, the Bayesian approach tackles epistemic uncertainty in offline data by modeling a posterior distribution over plausible world models and training a history-dependent agent to maximize expected rewards, enabling test-time generalization. We first illustrate, in a bandit setting, that Bayesianism excels on low-quality datasets where conservatism fails. We then scale this principle to realistic tasks and show that long-horizon planning is critical for reducing value overestimation once conservatism is removed. To make this feasible, we introduce key design choices for performing and learning from long-horizon rollouts while controlling compounding errors. These yield our algorithm, NEUBAY, grounded in the neutral Bayesian principle. On D4RL and NeoRL benchmarks, NEUBAY generally matches or surpasses leading conservative algorithms, achieving new state-of-the-art on 7 datasets. Notably, it succeeds with rollout horizons of several hundred steps, contrary to dominant practice. Finally, we characterize datasets by quality and coverage, showing when NEUBAY is preferable to conservative methods. Together, we argue NEUBAY lays the foundation for a new practical direction in offline and model-based RL.
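To make the abstract's recipe concrete, here is a minimal sketch of the neutral Bayesian principle it describes: approximate the posterior over world models, roll a history-dependent policy out for several hundred imagined steps in each sampled model, and maximize the expected discounted return averaged over posterior samples. Everything below is an illustrative assumption, not NEUBAY's actual implementation: the posterior is stood in for by a bootstrapped ensemble, and all names (EnsembleDynamics, rollout_return, bayesian_objective, the toy models and policy) are hypothetical.

    # Hypothetical sketch of the Bayesian offline-RL objective from the
    # abstract; the ensemble-as-posterior approximation is an assumption.
    import numpy as np

    class EnsembleDynamics:
        """Posterior over world models, approximated by bootstrapped models."""
        def __init__(self, models):
            self.models = models  # each: (state, action) -> (next_state, reward)

        def sample(self, rng):
            # Drawing one ensemble member stands in for sampling M ~ p(M | D).
            return self.models[rng.integers(len(self.models))]

    def rollout_return(policy, model, s0, horizon, gamma=0.99):
        """Long-horizon imagined rollout under one sampled world model.

        The policy conditions on the full history, so it can implicitly
        infer which model it is acting in (test-time generalization).
        """
        history, s, ret = [], s0, 0.0
        for t in range(horizon):
            a = policy.act(history, s)  # history-dependent action
            s, r = model(s, a)          # imagined transition and reward
            history.append((s, a, r))
            ret += (gamma ** t) * r
        return ret

    def bayesian_objective(policy, posterior, s0, horizon, n_samples, rng):
        """Monte Carlo estimate of E_{M ~ p(M|D)} E_pi[ sum_t gamma^t r_t | M ]."""
        return np.mean([
            rollout_return(policy, posterior.sample(rng), s0, horizon)
            for _ in range(n_samples)
        ])

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Two toy 1-D "world models" that disagree, standing in for posterior samples.
        models = [lambda s, a: (s + a, -abs(s)), lambda s, a: (s - a, -abs(s))]
        posterior = EnsembleDynamics(models)

        class RandomPolicy:
            def act(self, history, s):
                return rng.uniform(-1, 1)

        # Horizon of 300 echoes the paper's rollouts of several hundred steps.
        print(bayesian_objective(RandomPolicy(), posterior, s0=0.0,
                                 horizon=300, n_samples=8, rng=rng))

Note the contrast with conservatism: nothing here penalizes out-of-dataset actions or truncates the horizon; uncertainty enters only through the spread of sampled models that the history-dependent policy must perform well across.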

Tianwei Ni, Esther Derman, Vineet Jain, Vincent Taboga, Siamak Ravanbakhsh, Pierre-Luc Bacon • 2025

Related benchmarks

Task                            Dataset                          Metric             Score   Rank
Offline Reinforcement Learning  D4RL halfcheetah-medium-expert   Normalized Score   109.5   117
Offline Reinforcement Learning  D4RL hopper-medium-expert        Normalized Score   114.8   115
Offline Reinforcement Learning  D4RL walker2d-random             Normalized Score    34.1    77
Offline Reinforcement Learning  D4RL Medium-Replay Hopper        Normalized Score   110.6    72
Offline Reinforcement Learning  D4RL halfcheetah-random          Normalized Score    37.0    70
Offline Reinforcement Learning  D4RL Medium HalfCheetah          Normalized Score    78.6    59
Offline Reinforcement Learning  D4RL Medium-Replay HalfCheetah   Normalized Score    72.1    59
Offline Reinforcement Learning  D4RL Medium Walker2d             Normalized Score   106.4    58
Offline Reinforcement Learning  D4RL walker2d medium-replay      Normalized Score    99.3    45
Offline Reinforcement Learning  D4RL Adroit pen (cloned)         Normalized Return   91.3    32

Showing 10 of 25 rows.
