
A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories

About

Offline imitation from observations aims to solve MDPs where only task-specific expert states and task-agnostic non-expert state-action pairs are available. Offline imitation is useful in real-world scenarios where arbitrary interactions are costly and expert actions are unavailable. The state-of-the-art "DIstribution Correction Estimation" (DICE) methods minimize the divergence between the state occupancies of the expert and learner policies and retrieve a policy via weighted behavior cloning; however, their results are unstable when learning from incomplete trajectories, due to non-robust optimization in the dual domain. To address this issue, we propose Trajectory-Aware Imitation Learning from Observations (TAILO). TAILO uses a discounted sum along the future trajectory as the weight for weighted behavior cloning. The terms of the sum are scaled by the output of a discriminator trained to identify expert states. Despite its simplicity, TAILO works well whenever the task-agnostic data contains trajectories or segments of expert behavior, a common assumption in prior work. In experiments across multiple testbeds, we find TAILO to be more robust and effective than prior methods, particularly with incomplete trajectories.
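The core weighting scheme described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the discriminator scores, the discount factor, and the direct use of raw scores in the discounted sum are all illustrative assumptions (the paper may apply a different scaling to the discriminator output).

```python
import numpy as np

def discounted_future_weights(scores, gamma=0.98):
    """Weight for each state = discounted sum of future discriminator scores.

    weights[t] = sum_{k >= 0} gamma**k * scores[t + k]

    States that lead into expert-like segments later in the trajectory
    thus receive high weight, even if their own score is low.
    """
    weights = np.zeros(len(scores), dtype=float)
    running = 0.0
    # Accumulate from the end of the trajectory backwards.
    for t in range(len(scores) - 1, -1, -1):
        running = scores[t] + gamma * running
        weights[t] = running
    return weights

# Toy trajectory: the discriminator marks the last two states as expert-like.
scores = np.array([0.1, 0.1, 0.9, 1.0])
w = discounted_future_weights(scores, gamma=0.5)
# w = [0.5, 0.8, 1.4, 1.0]: early states inherit credit from the expert tail.
```

These weights would then multiply the per-sample log-likelihood in a behavior-cloning loss, e.g. `loss = -(w * log_pi).mean()`, so that state-action pairs on (or leading into) expert-like segments dominate the policy update.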

Kai Yan, Alexander G. Schwing, Yu-Xiong Wang • 2023

Related benchmarks

Task | Dataset | Result | Rank
--- | --- | --- | ---
Offline Reinforcement Learning | D4RL Walker2d Medium v2 | Normalized Return 71.7 | 67
Offline Reinforcement Learning | D4RL halfcheetah v2 (medium-replay) | Normalized Score 42.8 | 58
Offline Reinforcement Learning | D4RL Hopper-medium-replay v2 | Normalized Return 83.4 | 54
Offline Reinforcement Learning | D4RL walker2d-medium-expert v2 | Normalized Score 108.2 | 44
Offline Reinforcement Learning | D4RL Hopper Medium v2 | Normalized Return 56.2 | 43
Offline Reinforcement Learning | D4RL walker2d medium-replay v2 | Normalized Score 61.2 | 36
Offline Reinforcement Learning | D4RL Mujoco Hopper-Medium-Expert v2 | Normalized Score 111.5 | 22
Offline Reinforcement Learning | D4RL Mujoco Halfcheetah-Medium-Expert v2 | Normalized Score 94.3 | 17
Offline Reinforcement Learning | D4RL Mujoco Halfcheetah-Medium v2 | Normalized Score 39.8 | 3

Other info

Code
