# Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning

## About
Offline Reinforcement Learning promises to learn effective policies from previously collected, static datasets without the need for exploration. However, existing Q-learning- and actor-critic-based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states. We hypothesize that a key ingredient missing from existing methods is a proper treatment of uncertainty in the offline setting. We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly. Implementation-wise, we adopt a practical and effective dropout-based uncertainty estimation method that introduces very little overhead over existing RL algorithms. Empirically, we observe that UWAC substantially improves model stability during training. In addition, UWAC outperforms existing offline RL methods on a variety of competitive tasks, and achieves significant performance gains over the state-of-the-art baseline on datasets with sparse demonstrations collected from human experts.
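The core mechanism described above can be sketched in a few lines: run several dropout-enabled forward passes of the target critic (Monte Carlo dropout), treat the variance of the sampled target-Q values as an uncertainty estimate, and turn that into a per-transition weight that shrinks the loss contribution of high-variance (likely OOD) state-action pairs. This is a minimal NumPy sketch of the idea only; the function names, the `beta` smoothing constant, and the normalization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def mc_dropout_uncertainty(q_samples):
    # q_samples: array of shape (T, batch) holding target-Q values from
    # T stochastic (dropout-enabled) forward passes of the target critic.
    # The per-transition variance across passes serves as the uncertainty.
    return q_samples.var(axis=0)

def uncertainty_weights(variance, beta=1.0):
    # Map uncertainty to a down-weighting factor in (0, 1]: transitions
    # with high target-Q variance (likely OOD) get small weights, so they
    # contribute less to the critic/actor objectives. The inverse-variance
    # form and the max-normalization are illustrative choices (hypothetical).
    w = beta / (variance + beta)
    return w / w.max()
```

In training, the resulting weights would multiply the per-sample Bellman error (and, analogously, the actor loss) before averaging, so low-uncertainty in-distribution transitions dominate the update.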
## Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL walker2d-random | Normalized Score | 1.3 | 77 |
| Offline Reinforcement Learning | D4RL halfcheetah-random | Normalized Score | 2.3 | 70 |
| Offline Reinforcement Learning | D4RL Walker2d Medium v2 | Normalized Return | 75.4 | 67 |
| Offline Reinforcement Learning | D4RL hopper-random | Normalized Score | 2.6 | 62 |
| Offline Reinforcement Learning | D4RL halfcheetah v2 (medium-replay) | Normalized Score | 35.9 | 58 |
| Offline Reinforcement Learning | D4RL hopper-expert v2 | Normalized Score | 110.5 | 56 |
| Offline Reinforcement Learning | D4RL walker2d-expert v2 | Normalized Score | 108.4 | 56 |
| Offline Reinforcement Learning | D4RL halfcheetah-expert v2 | Normalized Score | 92.9 | 56 |
| Offline Reinforcement Learning | D4RL Hopper-medium-replay v2 | Normalized Return | 25.3 | 54 |
| Offline Reinforcement Learning | D4RL Gym walker2d (medium-replay) | Normalized Return | 27.1 | 52 |