
Offline Learning from Demonstrations and Unlabeled Experience

About

Behavior cloning (BC) is often practical for robot learning because it allows a policy to be trained offline without rewards, by supervised learning on expert demonstrations. However, BC does not effectively leverage what we will refer to as unlabeled experience: data of mixed and unknown quality without reward annotations. This unlabeled data can be generated by a variety of sources such as human teleoperation, scripted policies and other agents on the same robot. Towards data-driven offline robot learning that can use this unlabeled experience, we introduce Offline Reinforced Imitation Learning (ORIL). ORIL first learns a reward function by contrasting observations from demonstrator and unlabeled trajectories, then annotates all data with the learned reward, and finally trains an agent via offline reinforcement learning. Across a diverse set of continuous control and simulated robotic manipulation tasks, we show that ORIL consistently outperforms comparable BC agents by effectively leveraging unlabeled experience.
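The two learning stages described above (contrast expert vs. unlabeled observations to learn a reward, then annotate all data with it) can be sketched in a minimal, self-contained way. This is an illustrative toy, not the paper's implementation: it assumes 2-D synthetic observations and a logistic-regression discriminator, whereas ORIL uses a neural-network reward model (with positive-unlabeled corrections) followed by offline RL.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for observations: "expert" states cluster near +1,
# unlabeled states of unknown quality cluster near 0.
expert_obs = rng.normal(loc=1.0, scale=0.3, size=(256, 2))
unlabeled_obs = rng.normal(loc=0.0, scale=0.3, size=(256, 2))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stage 1: learn a reward function by contrasting demonstrator (label 1)
# and unlabeled (label 0) observations with binary cross-entropy,
# optimized here by plain gradient descent on a linear model.
x = np.vstack([expert_obs, unlabeled_obs])
y = np.concatenate([np.ones(len(expert_obs)), np.zeros(len(unlabeled_obs))])
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = sigmoid(x @ w + b)
    w -= 0.5 * (x.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

def learned_reward(obs):
    """Stage 2: annotate any observation with the learned reward in [0, 1]."""
    return sigmoid(obs @ w + b)

# Expert-like states should now receive a higher learned reward than
# unlabeled-like states; the annotated data would then feed an offline
# RL agent (stage 3, omitted here).
print(learned_reward(np.array([1.0, 1.0])) > learned_reward(np.array([0.0, 0.0])))
```

In the full method, the learned reward replaces the missing reward annotations on every trajectory, so any off-the-shelf offline RL algorithm can consume the mixed-quality data.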

Konrad Zolna, Alexander Novikov, Ksenia Konyushkova, Caglar Gulcehre, Ziyu Wang, Yusuf Aytar, Misha Denil, Nando de Freitas, Scott Reed • 2020

Related benchmarks

Task | Dataset | Result | Rank
Offline Reinforcement Learning | D4RL halfcheetah-expert v2 | Normalized Score: 92.1 | 56
Offline Reinforcement Learning | D4RL hopper-expert v2 | Normalized Score: 97.5 | 56
Offline Reinforcement Learning | D4RL walker2d-expert v2 | Normalized Score: 29.3 | 56
Offline Imitation Learning | D4RL Ant v2 (expert) | Normalized Score: 76.8 | 20
Imitation Learning | Walker2d one-shot v2 | Normalized Score: 6.9 | 11
Imitation Learning | Ant one-shot v2 | Normalized Score: 17.4 | 11
Imitation Learning | Hopper one-shot v2 | Normalized Score: 14.7 | 11
Imitation Learning | HalfCheetah one-shot v2 | Normalized Score: 0.2 | 11
Cross-domain Offline Imitation Learning from Demonstrations (C-off-LfD) | D4RL MuJoCo reward-free v2 (medium, medium-replay, medium-expert) | Hopper-v2 Return (medium): 52.8 | 7
Single-domain Offline Imitation Learning from Demonstrations (S-off-LfD) | D4RL MuJoCo reward-free v2 (medium, medium-replay, medium-expert) | Hopper-v2 (medium) Score: 50.9 | 7

Showing 10 of 12 rows.
