
Q-Distribution guided Q-learning for offline reinforcement learning: Uncertainty penalized Q-value via consistency model

About

"Distribution shift" is the main obstacle to the success of offline reinforcement learning. A learning policy may take actions beyond the behavior policy's knowledge, referred to as Out-of-Distribution (OOD) actions. The Q-values for these OOD actions can be easily overestimated. As a result, the learning policy is biased by using incorrect Q-value estimates. One common approach to avoid Q-value overestimation is to make a pessimistic adjustment. Our key idea is to penalize the Q-values of OOD actions associated with high uncertainty. In this work, we propose Q-Distribution Guided Q-Learning (QDQ), which applies a pessimistic adjustment to Q-values in OOD regions based on uncertainty estimation. This uncertainty measure relies on the conditional Q-value distribution, learned through a high-fidelity and efficient consistency model. Additionally, to prevent overly conservative estimates, we introduce an uncertainty-aware optimization objective for updating the Q-value function. The proposed QDQ demonstrates solid theoretical guarantees for the accuracy of Q-value distribution learning and uncertainty measurement, as well as the performance of the learning policy. QDQ consistently shows strong performance on the D4RL benchmark and achieves significant improvements across many tasks.
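
The idea of penalizing Q-values where uncertainty is high can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `penalized_q`, the `threshold` and `beta` parameters, and the choice of subtracting a scaled standard deviation are all assumptions made for illustration, and the hand-written `q_samples` array stands in for draws from a learned Q-value distribution (which the paper obtains from a consistency model).

```python
import numpy as np

def penalized_q(q_samples, q_estimate, threshold, beta=1.0):
    """Illustrative pessimistic adjustment (assumed form, not the paper's).

    q_samples:  array of shape (n_samples, batch), hypothetical draws from a
                learned Q-value distribution for each state-action pair.
    q_estimate: array of shape (batch,), the critic's point estimates.
    threshold:  uncertainty level above which a pair is treated as OOD.
    beta:       penalty strength (hypothetical hyperparameter).
    """
    # Use the spread of the sampled Q-values as the uncertainty measure.
    uncertainty = q_samples.std(axis=0)
    # Penalize only the entries whose uncertainty exceeds the threshold,
    # so low-uncertainty estimates are left untouched.
    adjusted = np.where(uncertainty > threshold,
                        q_estimate - beta * uncertainty,
                        q_estimate)
    return adjusted, uncertainty

# Column 0: tightly clustered samples (in-distribution-like).
# Column 1: widely spread samples (OOD-like).
q_samples = np.array([[10.0, 10.0],
                      [10.2, 14.0],
                      [9.8, 6.0]])
q_estimate = np.array([10.0, 10.0])

adjusted, uncertainty = penalized_q(q_samples, q_estimate, threshold=1.0)
print(adjusted, uncertainty)
```

Applying the penalty only above a threshold mirrors the abstract's stated aim of avoiding overly conservative estimates, though the paper's actual objective may differ from this sketch.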

Jing Zhang, Linjiajie Fang, Kexin Shi, Wenjia Wang, Bing-Yi Jing • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Locomotion | D4RL HalfCheetah Medium-Replay | Normalized Score: 0.637 | 68 |
| Locomotion | halfcheetah medium v2 | Average Normalized Score: 74.1 | 19 |
| Locomotion | Walker2d Medium-Expert v2 | Average Normalized Score: 115.9 | 19 |
| Locomotion | halfcheetah medium-expert v2 | Average Normalized Score: 99.3 | 19 |
| Locomotion | walker2d medium-replay v2 | Average Normalized Score: 93.2 | 19 |
| Locomotion | walker2d medium v2 | Average Normalized Score: 86.9 | 19 |
| Locomotion | D4RL hopper v2 (medium) | Normalized Return: 102.4 | 16 |
| Offline Reinforcement Learning | AntMaze Umaze v0 | Averaged Normalized Score: 98.6 | 14 |
| Offline Reinforcement Learning | AntMaze Medium-Play v0 | Avg Normalized Score: 81.5 | 14 |
| Offline Reinforcement Learning | antmaze umaze-diverse v0 | Avg Normalized Score: 67.8 | 14 |
Showing 10 of 15 rows
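
For readers unfamiliar with the metric, D4RL normalized scores rescale a raw episode return so that a random policy maps to 0 and an expert policy maps to 100. The sketch below shows that rescaling; the reference returns used in the example are hypothetical placeholders, not the benchmark's actual values. Results such as 74.1 appear to be on this 0 to 100 scale, while the 0.637 in the first row may be the same quantity without the factor of 100 (that reading is an assumption).

```python
def normalized_score(raw_return, random_return, expert_return):
    """Rescale a raw return so random -> 0 and expert -> 100."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)

# Hypothetical reference returns, chosen only to make the arithmetic easy to follow.
example = normalized_score(raw_return=741.0, random_return=0.0, expert_return=1000.0)
print(example)
```

Scores above 100, such as 115.9 for Walker2d Medium-Expert, are possible under this definition whenever the learned policy's return exceeds the expert reference.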
