BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning
About
There has recently been a surge of research in batch Deep Reinforcement Learning (DRL), which aims to learn a high-performing policy from a given dataset without additional interactions with the environment. We propose a new algorithm, Best-Action Imitation Learning (BAIL), which strives for both simplicity and performance. BAIL learns a V function, uses the V function to select actions it believes to be high-performing, and then uses those actions to train a policy network with imitation learning. For the MuJoCo benchmark, we provide a comprehensive experimental study of BAIL, comparing its performance to four other batch Q-learning and imitation-learning schemes across a large variety of batch datasets. Our experiments show that BAIL achieves substantially higher performance than the other schemes, and is also computationally much faster than the batch Q-learning schemes.
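The selection step described above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the authors' implementation): it assumes the Monte Carlo return `G_i` of each transition and a learned value estimate `V(s_i)` are already available as arrays, and it keeps the fraction of transitions whose returns come closest to the value envelope, which would then be passed to a behavior-cloning step. The function name and the `ratio` parameter are illustrative, not from the paper.

```python
import numpy as np

def select_best_actions(states, actions, returns, v_values, ratio=0.25):
    """Sketch of a BAIL-style best-action selection step.

    states, actions : arrays of shape (N, ...) from the batch dataset
    returns         : Monte Carlo returns G_i, shape (N,)
    v_values        : learned value estimates V(s_i), shape (N,)
    ratio           : fraction of the dataset to keep (illustrative parameter)
    """
    # Score each transition by how close its return is to the value estimate;
    # a higher ratio G_i / V(s_i) suggests a better-than-expected action.
    scores = returns / np.maximum(v_values, 1e-8)
    k = max(1, int(ratio * len(states)))
    # Keep the k transitions with the highest scores.
    idx = np.argsort(scores)[-k:]
    return states[idx], actions[idx]
```

The selected (state, action) pairs would then serve as the supervised training set for the imitation-learning policy network.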
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | Normalized Score | 92.6 | 117 |
| Offline Reinforcement Learning | D4RL hopper-medium-expert | Normalized Score | 106.4 | 115 |
| Offline Reinforcement Learning | D4RL walker2d-medium-expert | Normalized Score | 75.7 | 86 |
| Offline Reinforcement Learning | D4RL walker2d-random | Normalized Score | 2.4 | 77 |
| Offline Reinforcement Learning | D4RL halfcheetah-random | Normalized Score | 2.2 | 70 |
| Offline Reinforcement Learning | D4RL hopper-random | Normalized Score | 8 | 62 |
| Offline Reinforcement Learning | D4RL Gym walker2d (medium-replay) | Normalized Return | 51.4 | 52 |
| Offline Reinforcement Learning | D4RL Gym walker2d medium | Normalized Return | 68.8 | 42 |
| Offline Reinforcement Learning | D4RL antmaze-umaze (diverse) | Normalized Score | 52 | 40 |
| Offline Reinforcement Learning | D4RL antmaze-large (play) | Normalized Score | 2.2 | 26 |