
Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning

About

Offline Reinforcement Learning promises to learn effective policies from previously collected, static datasets without the need for exploration. However, existing Q-learning- and actor-critic-based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states. We hypothesize that a key missing ingredient from the existing methods is a proper treatment of uncertainty in the offline setting. We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly. Implementation-wise, we adopt a practical and effective dropout-based uncertainty estimation method that introduces very little overhead over existing RL algorithms. Empirically, we observe that UWAC substantially improves model stability during training. In addition, UWAC outperforms existing offline RL methods on a variety of competitive tasks, and achieves significant performance gains over the state-of-the-art baseline on datasets with sparse demonstrations collected from human experts.
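The core idea of the abstract can be illustrated with a small sketch: estimate the variance of Q(s', a') with Monte Carlo dropout, then scale each bootstrap target by a weight that shrinks as the variance grows. Everything below is a toy illustration, not the authors' implementation: the linear "Q-function", the feature vectors, and the inverse-variance weight with hyperparameter `beta` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_q(q_weights, features, n_samples=50, drop_p=0.1):
    """Sample Q-value estimates under random dropout masks applied to a
    (hypothetical) linear Q-function; return the mean and variance.
    The spread of the samples serves as an epistemic-uncertainty proxy."""
    samples = []
    for _ in range(n_samples):
        mask = rng.random(q_weights.shape) > drop_p          # Bernoulli keep-mask
        samples.append(features @ (q_weights * mask) / (1 - drop_p))
    samples = np.array(samples)
    return samples.mean(), samples.var()

def uncertainty_weight(var, beta=1.0):
    """Down-weight a target in proportion to its estimated variance,
    clipped at 1 so confident targets keep full weight.
    `beta` is an assumed temperature hyperparameter."""
    return min(beta / (var + 1e-8), 1.0)

# Toy comparison: an "in-distribution-like" input with small feature
# magnitude vs. an "OOD-like" input with large magnitude, which makes
# the dropout samples spread out far more.
q_w = rng.normal(size=8)
x_in = np.full(8, 0.1)
x_ood = np.full(8, 10.0)

_, var_in = mc_dropout_q(q_w, x_in)
_, var_ood = mc_dropout_q(q_w, x_ood)
w_in = uncertainty_weight(var_in)
w_ood = uncertainty_weight(var_ood)
# The high-variance (OOD-like) target receives a much smaller weight,
# so it contributes less to the training objective.
```

In the actual algorithm these weights would multiply the per-sample Bellman error inside the critic (and actor) losses; here they are computed in isolation only to show the down-weighting mechanism.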

Yue Wu, Shuangfei Zhai, Nitish Srivastava, Joshua Susskind, Jian Zhang, Ruslan Salakhutdinov, Hanlin Goh • 2021

Related benchmarks

Task | Dataset | Result | Rank
Offline Reinforcement Learning | D4RL walker2d-random | Normalized Score: 1.3 | 77
Offline Reinforcement Learning | D4RL halfcheetah-random | Normalized Score: 2.3 | 70
Offline Reinforcement Learning | D4RL Walker2d Medium v2 | Normalized Return: 75.4 | 67
Offline Reinforcement Learning | D4RL hopper-random | Normalized Score: 2.6 | 62
Offline Reinforcement Learning | D4RL halfcheetah v2 (medium-replay) | Normalized Score: 35.9 | 58
Offline Reinforcement Learning | D4RL hopper-expert v2 | Normalized Score: 110.5 | 56
Offline Reinforcement Learning | D4RL walker2d-expert v2 | Normalized Score: 108.4 | 56
Offline Reinforcement Learning | D4RL halfcheetah-expert v2 | Normalized Score: 92.9 | 56
Offline Reinforcement Learning | D4RL Hopper-medium-replay v2 | Normalized Return: 25.3 | 54
Offline Reinforcement Learning | D4RL Gym walker2d (medium-replay) | Normalized Return: 27.1 | 52

(Showing 10 of 65 rows.)
