Offline Learning from Demonstrations and Unlabeled Experience
About
Behavior cloning (BC) is often practical for robot learning because it allows a policy to be trained offline without rewards, by supervised learning on expert demonstrations. However, BC does not effectively leverage what we will refer to as unlabeled experience: data of mixed and unknown quality without reward annotations. This unlabeled data can be generated by a variety of sources such as human teleoperation, scripted policies and other agents on the same robot. Towards data-driven offline robot learning that can use this unlabeled experience, we introduce Offline Reinforced Imitation Learning (ORIL). ORIL first learns a reward function by contrasting observations from demonstrator and unlabeled trajectories, then annotates all data with the learned reward, and finally trains an agent via offline reinforcement learning. Across a diverse set of continuous control and simulated robotic manipulation tasks, we show that ORIL consistently outperforms comparable BC agents by effectively leveraging unlabeled experience.
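The reward-learning step described above can be sketched as a binary classifier that contrasts demonstrator observations against unlabeled ones, then serves as a reward model for relabeling. The sketch below is a minimal illustration with a hypothetical logistic-regression discriminator on synthetic stand-in data (the paper uses learned neural rewards and a full offline RL agent; the data, dimensions, and `learned_reward` helper here are assumptions for demonstration only).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: demonstrator observations are drawn from a
# different region of observation space than the unlabeled experience.
expert_obs = rng.normal(loc=1.0, scale=0.5, size=(256, 4))
unlabeled_obs = rng.normal(loc=-1.0, scale=0.5, size=(256, 4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: train a discriminator to contrast expert (label 1) vs. unlabeled
# (label 0) observations -- here a plain logistic regression fit by
# full-batch gradient descent.
X = np.vstack([expert_obs, unlabeled_obs])
y = np.concatenate([np.ones(len(expert_obs)), np.zeros(len(unlabeled_obs))])
w = np.zeros(X.shape[1])
b = 0.0
lr = 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

def learned_reward(obs):
    """Score observations by the discriminator's 'expert-likeness'."""
    return sigmoid(obs @ w + b)

# Step 2: annotate all data (including the unlabeled trajectories) with the
# learned reward; these relabeled transitions would then feed an offline RL
# algorithm in step 3.
rewards = learned_reward(unlabeled_obs)
```

In practice the discriminator can misclassify good unlabeled trajectories as non-expert, which is why positive-unlabeled style corrections are relevant; this sketch omits that refinement.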
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Offline Reinforcement Learning | D4RL halfcheetah-expert v2 | Normalized Score: 92.1 | 56 |
| Offline Reinforcement Learning | D4RL hopper-expert v2 | Normalized Score: 97.5 | 56 |
| Offline Reinforcement Learning | D4RL walker2d-expert v2 | Normalized Score: 29.3 | 56 |
| Offline Imitation Learning | D4RL Ant v2 (expert) | Normalized Score: 76.8 | 20 |
| Imitation Learning | Walker2d one-shot v2 | Normalized Score: 6.9 | 11 |
| Imitation Learning | Ant one-shot v2 | Normalized Score: 17.4 | 11 |
| Imitation Learning | Hopper one-shot v2 | Normalized Score: 14.7 | 11 |
| Imitation Learning | HalfCheetah one-shot v2 | Normalized Score: 0.2 | 11 |
| Cross-domain Offline Imitation Learning from Demonstrations (C-off-LfD) | D4RL MuJoCo reward-free v2 (medium, medium-replay, medium-expert) | Hopper-v2 Return (medium): 52.8 | 7 |
| Single-domain Offline Imitation Learning from Demonstrations (S-off-LfD) | D4RL MuJoCo reward-free v2 (medium, medium-replay, medium-expert) | Hopper-v2 (m) Score: 50.9 | 7 |