
Model-based Offline RL via Robust Value-Aware Model Learning with Implicitly Differentiable Adaptive Weighting

About

Model-based offline reinforcement learning (RL) aims to enhance offline RL with a dynamics model that facilitates policy exploration. However, model exploitation can occur due to inevitable model errors, degrading algorithm performance. Adversarial model learning offers a theoretical framework for mitigating model exploitation by solving a maximin formulation. Within this paradigm, RAMBO (Rigter et al., 2022) has emerged as a representative and widely used method that provides a practical implementation via model gradients. However, we empirically show that severe Q-value underestimation and gradient explosion can occur in RAMBO with only slight hyperparameter tuning, suggesting that it tends to be overly conservative and suffers from unstable model updates. To address these issues, we propose RObust value-aware Model learning with Implicitly differentiable adaptive weighting (ROMI). Instead of updating the dynamics model with model gradients, ROMI introduces a novel robust value-aware model learning approach: the dynamics model is trained to predict future states whose values are close to the minimum Q-value within a scale-adjustable state uncertainty set, enabling controllable conservatism and stable model updates. To further improve out-of-distribution (OOD) generalization during multi-step rollouts, we propose implicitly differentiable adaptive weighting, a bi-level optimization scheme that adaptively achieves dynamics- and value-aware model learning. Empirical results on the D4RL and NeoRL benchmarks show that ROMI significantly outperforms RAMBO and achieves competitive or superior performance compared with other state-of-the-art methods on datasets where RAMBO typically underperforms. Code is available at https://github.com/zq2r/ROMI.git.
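The core idea described above — finding the lowest-value state within a scale-adjustable uncertainty set around the model's predicted next state, which then serves as a conservative training target — can be sketched as follows. This is a minimal illustrative sketch only: the L∞-ball uncertainty set, the corner-enumeration search, and all function names are assumptions for illustration, not the paper's actual implementation.

```python
import itertools

def robust_min_q_state(q_fn, s_pred, radius):
    """Return the state with the lowest value among probes of an
    L-infinity ball of the given radius around the predicted next
    state s_pred (a list of floats). q_fn maps a state to a scalar
    value. The exhaustive corner probe is a crude stand-in for the
    inner minimization over the uncertainty set; it is exponential
    in the state dimension and only meant to illustrate the idea."""
    candidates = [list(s_pred)]
    # Probe the center plus every corner/face point of the ball.
    for signs in itertools.product((-1.0, 0.0, 1.0), repeat=len(s_pred)):
        candidates.append([s + g * radius for s, g in zip(s_pred, signs)])
    # The dynamics model would then be regressed toward a state whose
    # value matches this pessimistic minimum (controllable via radius).
    return min(candidates, key=q_fn)
```

Shrinking `radius` toward zero recovers ordinary (non-robust) model learning, which is the sense in which the conservatism is "scale-adjustable" in the abstract.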

Zhongjian Qiao, Jiafei Lyu, Boxiang Lyu, Yao Shu, Siyang Gao, Shuang Qiu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL MuJoCo Hopper-mr v2 (medium-replay) | Avg Normalized Score | 102 | 36 |
| Offline Reinforcement Learning | D4RL MuJoCo Hopper-m v2 (medium) | Avg Normalized Score | 105 | 31 |
| Offline Reinforcement Learning | D4RL MuJoCo Walker2d medium-expert v2 | Average Normalized Score | 113.3 | 31 |
| Offline Reinforcement Learning | D4RL MuJoCo halfcheetah-medium-expert | Normalized Score | 104.5 | 20 |
| Locomotion | neorl hopper-low | Mean Normalized Score | 22.4 | 19 |
| Locomotion | neorl walker2d-medium | Mean Normalized Score | 54.9 | 19 |
| Locomotion | neorl halfcheetah-medium | Mean Normalized Score | 57.7 | 19 |
| Offline Reinforcement Learning | D4RL AntMaze | Medium Diverse Success Rate | 30.1 | 19 |
| Locomotion | neorl-walker2d low | Mean Normalized Score | 36.4 | 19 |
| Locomotion | neorl-walker2d high | Mean Normalized Score | 75.1 | 18 |

Showing 10 of 20 rows.
