A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories
About
Offline imitation from observations aims to solve MDPs where only task-specific expert states and task-agnostic non-expert state-action pairs are available. Offline imitation is useful in real-world scenarios where arbitrary interactions are costly and expert actions are unavailable. The state-of-the-art "DIstribution Correction Estimation" (DICE) methods minimize the divergence between the state occupancies of the expert and learner policies, then retrieve a policy via weighted behavior cloning; however, their results are unstable when learning from incomplete trajectories, due to non-robust optimization in the dual domain. To address this issue, we propose Trajectory-Aware Imitation Learning from Observations (TAILO). TAILO uses a discounted sum along the future trajectory as the weight for weighted behavior cloning. The terms in the sum are scaled by the output of a discriminator trained to identify expert states. Despite its simplicity, TAILO works well as long as the task-agnostic data contains trajectories or segments of expert behavior, a common assumption in prior work. In experiments across multiple testbeds, we find TAILO to be more robust and effective, particularly with incomplete trajectories.
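The weighting scheme described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the exponential scaling of the discriminator output and the constants `gamma` and `alpha` are assumptions chosen for clarity.

```python
import numpy as np

def trajectory_weights(disc_scores, gamma=0.98, alpha=1.0):
    """Compute per-step weights for weighted behavior cloning as a
    discounted sum, along the future trajectory, of scaled discriminator
    outputs (a sketch of the idea in TAILO; exp scaling and the values
    of gamma/alpha are illustrative assumptions)."""
    scaled = np.exp(alpha * np.asarray(disc_scores, dtype=float))
    weights = np.zeros_like(scaled)
    running = 0.0
    # Backward pass: weights[t] = sum_{i >= t} gamma^(i - t) * scaled[i]
    for t in range(len(scaled) - 1, -1, -1):
        running = scaled[t] + gamma * running
        weights[t] = running
    return weights

# Example: higher discriminator scores later in the trajectory
# propagate back to earlier state-action pairs.
w = trajectory_weights([-1.0, 0.5, 2.0])
# The weighted BC objective would then be roughly:
#   maximize  sum_t  w[t] * log pi(a_t | s_t)
```

Because the weight at step `t` aggregates discriminator outputs over the entire remaining trajectory, a single noisy discriminator prediction has limited influence, which is one intuition for the method's robustness to incomplete trajectories.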
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL Walker2d Medium v2 | Normalized Return | 71.7 | 67 |
| Offline Reinforcement Learning | D4RL halfcheetah v2 (medium-replay) | Normalized Score | 42.8 | 58 |
| Offline Reinforcement Learning | D4RL Hopper-medium-replay v2 | Normalized Return | 83.4 | 54 |
| Offline Reinforcement Learning | D4RL walker2d-medium-expert v2 | Normalized Score | 108.2 | 44 |
| Offline Reinforcement Learning | D4RL Hopper Medium v2 | Normalized Return | 56.2 | 43 |
| Offline Reinforcement Learning | D4RL walker2d medium-replay v2 | Normalized Score | 61.2 | 36 |
| Offline Reinforcement Learning | D4RL Mujoco Hopper-Medium-Expert v2 | Normalized Score | 111.5 | 22 |
| Offline Reinforcement Learning | D4RL Mujoco Halfcheetah-Medium-Expert v2 | Normalized Score | 94.3 | 17 |
| Offline Reinforcement Learning | D4RL Mujoco Halfcheetah-Medium v2 | Normalized Score | 39.8 | 3 |