VIPO: Value Function Inconsistency Penalized Offline Reinforcement Learning

About

Offline reinforcement learning (RL) learns effective policies from pre-collected datasets, offering a practical solution for applications where online interactions are risky or costly. Model-based approaches are particularly advantageous for offline RL, owing to their data efficiency and generalizability. However, due to inherent model errors, model-based methods often artificially introduce conservatism guided by heuristic uncertainty estimation, which can be unreliable. In this paper, we introduce VIPO, a novel model-based offline RL algorithm that incorporates self-supervised feedback from value estimation to enhance model training. Specifically, the model is learned by additionally minimizing the inconsistency between the value learned directly from the offline data and the value estimated from the model. We perform comprehensive evaluations from multiple perspectives to show that VIPO can learn a highly accurate model efficiently and consistently outperform existing methods. In particular, it achieves state-of-the-art performance on almost all tasks in both D4RL and NeoRL benchmarks. Overall, VIPO offers a general framework that can be readily integrated into existing model-based offline RL algorithms to systematically enhance model accuracy.

Xuyang Chen, Keyu Yan, Guojian Wang, Lin Zhao • 2025
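
The abstract describes training the model with an extra term that penalizes disagreement between the value learned from the offline data and the value estimated through the model. The listing does not give the exact objective, so the following is only a minimal sketch of one way such a penalty could be written. It assumes a one-step Bellman target is used on both sides, a fixed value function, and a toy 1-D linear model. The names `LinearModel`, `vipo_style_loss`, `lam`, and `gamma` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# (state, action, reward, next_state); scalars keep the sketch dependency-free.
Transition = Tuple[float, float, float, float]


@dataclass
class LinearModel:
    """Hypothetical 1-D dynamics model: s' = w_s*s + w_a*a, r = w_r*s."""
    w_s: float
    w_a: float
    w_r: float

    def predict(self, s: float, a: float) -> Tuple[float, float]:
        # Returns (predicted next state, predicted reward).
        return self.w_s * s + self.w_a * a, self.w_r * s


def vipo_style_loss(
    model: LinearModel,
    value_fn: Callable[[float], float],
    batch: Sequence[Transition],
    gamma: float = 0.99,
    lam: float = 1.0,
) -> float:
    """Prediction error plus a value-inconsistency penalty (assumed form)."""
    pred_err = 0.0
    inconsistency = 0.0
    for s, a, r, s_next in batch:
        s_hat, r_hat = model.predict(s, a)
        # Standard supervised model loss on the observed transition.
        pred_err += (s_hat - s_next) ** 2 + (r_hat - r) ** 2
        # Bellman target computed from the data vs. from the model.
        target_data = r + gamma * value_fn(s_next)
        target_model = r_hat + gamma * value_fn(s_hat)
        inconsistency += (target_data - target_model) ** 2
    n = len(batch)
    return pred_err / n + lam * inconsistency / n


# Toy data generated by a "true" model, so that model incurs zero loss.
true_model = LinearModel(w_s=0.9, w_a=0.1, w_r=1.0)
value_fn = lambda s: 2.0 * s  # stand-in for a value learned from data
batch = []
for s, a in [(1.0, 0.5), (-2.0, 1.0), (0.5, -1.0)]:
    s_next, r = true_model.predict(s, a)
    batch.append((s, a, r, s_next))

bad_model = LinearModel(w_s=0.8, w_a=0.1, w_r=1.0)
loss_true = vipo_style_loss(true_model, value_fn, batch)
loss_bad = vipo_style_loss(bad_model, value_fn, batch)
loss_bad_no_penalty = vipo_style_loss(bad_model, value_fn, batch, lam=0.0)
```

In this sketch a model that reproduces the data incurs no loss, while an inaccurate model pays both the usual prediction error and the additional penalty weighted by `lam`. Setting `lam=0.0` recovers the plain prediction loss.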

Related benchmarks

Task	Dataset	Result
Offline Reinforcement Learning	D4RL halfcheetah-medium-expert	Normalized Score 110	169
Offline Reinforcement Learning	D4RL hopper-medium-expert	Normalized Score 113.2	161
Offline Reinforcement Learning	D4RL Medium-Replay Hopper	Normalized Score 109.6	109
Offline Reinforcement Learning	D4RL Medium HalfCheetah	Normalized Score 80	105
Offline Reinforcement Learning	D4RL Medium Walker2d	Normalized Score 93.1	104
Offline Reinforcement Learning	D4RL walker2d-random	Normalized Score 20	101
Offline Reinforcement Learning	D4RL Medium-Replay HalfCheetah	Normalized Score 77.2	97
Offline Reinforcement Learning	D4RL halfcheetah-random	Normalized Score 42.5	94
Offline Reinforcement Learning	D4RL walker2d medium-replay	Normalized Score 98.4	62
Offline Reinforcement Learning	D4RL Adroit pen (cloned)	Normalized Score 71.1	53

Showing 10 of 51 rows
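
The normalized scores in the table follow the D4RL convention, where 0 corresponds to a random policy and 100 to a reference expert policy, so values above 100 mean the learned policy exceeded the expert reference. A minimal sketch of that rescaling is shown below. The reference returns in the example are made-up round numbers chosen for illustration, not the actual D4RL constants, and `normalized_score` is a hypothetical helper name.

```python
def normalized_score(raw_return: float,
                     random_return: float,
                     expert_return: float) -> float:
    """Rescale a raw episode return so that random = 0 and expert = 100."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Hypothetical reference returns, for illustration only.
RANDOM_RETURN = 0.0
EXPERT_RETURN = 1000.0

score_above_expert = normalized_score(1100.0, RANDOM_RETURN, EXPERT_RETURN)
score_at_random = normalized_score(0.0, RANDOM_RETURN, EXPERT_RETURN)
```

With these made-up references, a raw return of 1100 maps to a normalized score of 110, which is how a result above 100 arises.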
