Offline Learning from Demonstrations and Unlabeled Experience
About
Behavior cloning (BC) is often practical for robot learning because it allows a policy to be trained offline without rewards, by supervised learning on expert demonstrations. However, BC does not effectively leverage what we will refer to as unlabeled experience: data of mixed and unknown quality without reward annotations. This unlabeled data can be generated by a variety of sources such as human teleoperation, scripted policies and other agents on the same robot. Towards data-driven offline robot learning that can use this unlabeled experience, we introduce Offline Reinforced Imitation Learning (ORIL). ORIL first learns a reward function by contrasting observations from demonstrator and unlabeled trajectories, then annotates all data with the learned reward, and finally trains an agent via offline reinforcement learning. Across a diverse set of continuous control and simulated robotic manipulation tasks, we show that ORIL consistently outperforms comparable BC agents by effectively leveraging unlabeled experience.
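The reward-learning step described above can be sketched as a binary classifier that contrasts demonstrator observations against unlabeled ones, then serves as a reward model for relabeling. The sketch below is a minimal illustration with a hypothetical logistic-regression discriminator on synthetic stand-in data (the paper uses learned neural rewards and a full offline RL agent; the data, dimensions, and `learned_reward` helper here are assumptions for demonstration only).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: demonstrator observations are drawn from a
# different region of observation space than the unlabeled experience.
expert_obs = rng.normal(loc=1.0, scale=0.5, size=(256, 4))
unlabeled_obs = rng.normal(loc=-1.0, scale=0.5, size=(256, 4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: train a discriminator to contrast expert (label 1) vs. unlabeled
# (label 0) observations -- here a plain logistic regression fit by
# full-batch gradient descent.
X = np.vstack([expert_obs, unlabeled_obs])
y = np.concatenate([np.ones(len(expert_obs)), np.zeros(len(unlabeled_obs))])
w = np.zeros(X.shape[1])
b = 0.0
lr = 0.1
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

def learned_reward(obs):
    """Score observations by the discriminator's 'expert-likeness'."""
    return sigmoid(obs @ w + b)

# Step 2: annotate all data (including the unlabeled trajectories) with the
# learned reward; these relabeled transitions would then feed an offline RL
# algorithm in step 3.
rewards = learned_reward(unlabeled_obs)
```

In practice the discriminator can misclassify good unlabeled trajectories as non-expert, which is why positive-unlabeled style corrections are relevant; this sketch omits that refinement.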
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Offline Reinforcement Learning | D4RL halfcheetah-expert v2 | Normalized Score: 92.1 | 56 |
| Offline Reinforcement Learning | D4RL hopper-expert v2 | Normalized Score: 97.5 | 56 |
| Offline Reinforcement Learning | D4RL walker2d-expert v2 | Normalized Score: 29.3 | 56 |
| Offline Imitation Learning | D4RL Ant v2 (expert) | Normalized Score: 76.8 | 20 |
| Imitation Learning | Walker2d one-shot v2 | Normalized Score: 6.9 | 11 |
| Imitation Learning | Ant one-shot v2 | Normalized Score: 17.4 | 11 |
| Imitation Learning | Hopper one-shot v2 | Normalized Score: 14.7 | 11 |
| Imitation Learning | HalfCheetah one-shot v2 | Normalized Score: 0.2 | 11 |
| Cross-domain Offline Imitation Learning from Demonstrations (C-off-LfD) | D4RL MuJoCo reward-free v2 (medium, medium-replay, medium-expert) | Hopper-v2 Return (medium): 52.8 | 7 |
| Single-domain Offline Imitation Learning from Demonstrations (S-off-LfD) | D4RL MuJoCo reward-free v2 (medium, medium-replay, medium-expert) | Hopper-v2 (m) Score: 50.9 | 7 |