BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning
About
There has recently been a surge of research in batch Deep Reinforcement Learning (DRL), which aims to learn a high-performing policy from a given dataset without additional interactions with the environment. We propose a new algorithm, Best-Action Imitation Learning (BAIL), which strives for both simplicity and performance. BAIL learns a V function, uses the V function to select actions it believes to be high-performing, and then uses those actions to train a policy network with imitation learning. For the MuJoCo benchmark, we provide a comprehensive experimental study of BAIL, comparing its performance to four other batch Q-learning and imitation-learning schemes across a large variety of batch datasets. Our experiments show that BAIL achieves substantially higher performance than the other schemes, and is also computationally much faster than the batch Q-learning schemes.
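The selection step described above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the authors' implementation): it assumes the Monte Carlo return `G_i` of each transition and a learned value estimate `V(s_i)` are already available as arrays, and it keeps the fraction of transitions whose returns come closest to the value envelope, which would then be passed to a behavior-cloning step. The function name and the `ratio` parameter are illustrative, not from the paper.

```python
import numpy as np

def select_best_actions(states, actions, returns, v_values, ratio=0.25):
    """Sketch of a BAIL-style best-action selection step.

    states, actions : arrays of shape (N, ...) from the batch dataset
    returns         : Monte Carlo returns G_i, shape (N,)
    v_values        : learned value estimates V(s_i), shape (N,)
    ratio           : fraction of the dataset to keep (illustrative parameter)
    """
    # Score each transition by how close its return is to the value estimate;
    # a higher ratio G_i / V(s_i) suggests a better-than-expected action.
    scores = returns / np.maximum(v_values, 1e-8)
    k = max(1, int(ratio * len(states)))
    # Keep the k transitions with the highest scores.
    idx = np.argsort(scores)[-k:]
    return states[idx], actions[idx]
```

The selected (state, action) pairs would then serve as the supervised training set for the imitation-learning policy network.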
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | Normalized Score | 92.6 | 117 |
| Offline Reinforcement Learning | D4RL hopper-medium-expert | Normalized Score | 106.4 | 115 |
| Offline Reinforcement Learning | D4RL walker2d-medium-expert | Normalized Score | 75.7 | 86 |
| Offline Reinforcement Learning | D4RL walker2d-random | Normalized Score | 2.4 | 77 |
| Offline Reinforcement Learning | D4RL halfcheetah-random | Normalized Score | 2.2 | 70 |
| Offline Reinforcement Learning | D4RL hopper-random | Normalized Score | 8 | 62 |
| Offline Reinforcement Learning | D4RL Gym walker2d (medium-replay) | Normalized Return | 51.4 | 52 |
| Offline Reinforcement Learning | D4RL Gym walker2d medium | Normalized Return | 68.8 | 42 |
| Offline Reinforcement Learning | D4RL antmaze-umaze (diverse) | Normalized Score | 52 | 40 |
| Offline Reinforcement Learning | D4RL antmaze-large (play) | Normalized Score | 2.2 | 26 |