CEIL: Generalized Contextual Imitation Learning
About
In this paper, we present ContExtual Imitation Learning (CEIL), a general and broadly applicable algorithm for imitation learning (IL). Inspired by the formulation of hindsight information matching, we derive CEIL by explicitly learning a hindsight embedding function together with a contextual policy that is conditioned on the hindsight embeddings. To achieve the expert matching objective for IL, we advocate for optimizing a contextual variable such that it biases the contextual policy towards mimicking expert behaviors. Beyond the typical learning from demonstrations (LfD) setting, CEIL is a generalist that can be effectively applied to multiple settings, including: 1) learning from observations (LfO), 2) offline IL, 3) cross-domain IL (mismatched experts), and 4) one-shot IL. Empirically, we evaluate CEIL on the popular MuJoCo tasks (online) and the D4RL dataset (offline). Compared to prior state-of-the-art baselines, we show that CEIL is more sample-efficient in most online IL tasks and achieves better or competitive performance in offline tasks.
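
The second step described above, optimizing a contextual variable so that a contextual policy mimics the expert, can be illustrated with a toy sketch. This is not the authors' implementation: the learned policy is replaced by a fixed linear map, the hindsight embedding function is omitted entirely, and a plain squared action-matching error stands in for the paper's expert matching objective. All names and numbers (`W_S`, `W_Z`, `Z_EXPERT`, `fit_context`) are hypothetical.

```python
# Toy stand-in for an already-trained, frozen contextual policy:
#   action = W_S @ state + W_Z @ z
# (assumption: the real policy is a learned network, not a linear map).
W_S = [[0.3, -0.2], [0.1, 0.4]]   # hypothetical state weights
W_Z = [[1.0, 0.5], [-0.5, 1.0]]   # hypothetical context weights


def matvec(matrix, vector):
    """Multiply a small dense matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def policy(state, z):
    """Contextual policy: the action depends on the state and the context z."""
    return [a + b for a, b in zip(matvec(W_S, state), matvec(W_Z, z))]


# Synthetic "expert" data generated with a context the learner does not see.
Z_EXPERT = [0.5, -1.0]
expert_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
expert_actions = [policy(s, Z_EXPERT) for s in expert_states]


def matching_loss(z):
    """Mean squared action error between the policy under z and the expert."""
    total = 0.0
    for s, a_exp in zip(expert_states, expert_actions):
        total += sum((p - e) ** 2 for p, e in zip(policy(s, z), a_exp))
    return total / len(expert_states)


def matching_grad(z):
    """Gradient of matching_loss with respect to z (policy weights frozen)."""
    grad = [0.0] * len(z)
    for s, a_exp in zip(expert_states, expert_actions):
        residual = [p - e for p, e in zip(policy(s, z), a_exp)]
        for j in range(len(z)):
            grad[j] += 2.0 * sum(r * W_Z[i][j] for i, r in enumerate(residual))
    return [g / len(expert_states) for g in grad]


def fit_context(steps=200, lr=0.1):
    """Gradient descent on the context only; the policy is never updated."""
    z = [0.0, 0.0]
    for _ in range(steps):
        z = [zj - lr * gj for zj, gj in zip(z, matching_grad(z))]
    return z


z_star = fit_context()  # approaches Z_EXPERT in this toy setup
```

The point of the sketch is the division of labor: the policy weights stay fixed while only the low-dimensional context is adjusted, so imitation reduces to a search over the context variable.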
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL hopper-expert v2 | Normalized Score | 113 | 56 |
| Offline Reinforcement Learning | D4RL walker2d-expert v2 | Normalized Score | 115.6 | 56 |
| Offline Reinforcement Learning | D4RL halfcheetah-expert v2 | Normalized Score | 97.1 | 56 |
| Offline Imitation Learning | D4RL Ant v2 (expert) | Normalized Score | 126.4 | 20 |
| Imitation Learning | Hopper one-shot v2 | Normalized Score | 85.6 | 11 |
| Imitation Learning | HalfCheetah one-shot v2 | Normalized Score | 5.6 | 11 |
| Imitation Learning | Walker2d one-shot v2 | Normalized Score | 70 | 11 |
| Imitation Learning | Ant one-shot v2 | Normalized Score | 29.7 | 11 |
| Cross-domain Offline Imitation Learning from Demonstrations (C-off-LfD) | D4RL MuJoCo reward-free v2 (medium, medium-replay, medium-expert) | Hopper-v2 Return (medium) | 58.4 | 7 |
| Single-domain Offline Imitation Learning from Demonstrations (S-off-LfD) | D4RL MuJoCo reward-free v2 (medium, medium-replay, medium-expert) | Hopper-v2 (m) Score | 110.4 | 7 |
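
For readers unfamiliar with the Normalized Score metric, the sketch below shows the usual D4RL convention, which rescales a raw episode return so that a random policy maps to 0 and the reference expert maps to 100. It assumes the scores listed above follow that convention. The function name `normalized_score` and the reference returns in the usage lines are hypothetical placeholders, not the actual D4RL reference values for any environment.

```python
def normalized_score(raw_return, random_return, expert_return):
    """Rescale a raw return so that random -> 0 and expert -> 100."""
    if expert_return == random_return:
        raise ValueError("expert and random reference returns must differ")
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Hypothetical reference returns, chosen only to keep the arithmetic readable.
halfway = normalized_score(raw_return=2000.0, random_return=0.0, expert_return=4000.0)
above_expert = normalized_score(raw_return=5000.0, random_return=0.0, expert_return=4000.0)
print(halfway)       # 50.0
print(above_expert)  # 125.0
```

Under this convention, a score above 100 simply means the evaluated policy collected more return than the reference expert used for normalization.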