Imitation Learning via Off-Policy Distribution Matching
About
When performing imitation learning from expert demonstrations, distribution matching is a popular approach, in which one alternates between estimating distribution ratios and using these ratios as rewards in a standard reinforcement learning (RL) algorithm. Traditionally, estimating the distribution ratio requires on-policy data, which has caused previous work either to be exorbitantly data-inefficient or to alter the original objective in a manner that can drastically change its optimum. In this work, we show how the original distribution ratio estimation objective may be transformed in a principled manner to yield a completely off-policy objective. In addition to the data-efficiency this provides, we show that this objective also renders a separate RL optimization unnecessary. Rather, an imitation policy may be learned directly from this objective without the use of explicit rewards. We call the resulting algorithm ValueDICE and evaluate it on a suite of popular imitation learning benchmarks, finding that it can achieve state-of-the-art sample efficiency and performance.
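To make the idea concrete, here is a minimal sketch of an off-policy distribution-matching loss in the spirit of ValueDICE: a value-like function ν is evaluated on expert transitions and on initial states (with next-state and initial actions drawn from the imitation policy), and the loss combines a log-expected-exponential term over expert Bellman residuals with a linear term over initial states. The function name `value_dice_loss` and its array arguments are illustrative assumptions, not the paper's actual implementation; in the full algorithm, ν and the policy are trained adversarially against this objective.

```python
import numpy as np

def value_dice_loss(nu_expert, nu_expert_next, nu_init, gamma=0.99):
    """Simplified off-policy distribution-matching loss (hypothetical sketch).

    nu_expert:      nu(s, a) evaluated on expert transitions
    nu_expert_next: nu(s', a') on expert next states, with a' from the policy
    nu_init:        nu(s0, a0) on initial states, with a0 from the policy
    """
    # Bellman-like residual of nu on expert transitions
    delta = nu_expert - gamma * nu_expert_next
    # log E[exp(delta)] term, computed with a max shift for numerical stability
    shift = delta.max()
    log_expected = np.log(np.mean(np.exp(delta - shift))) + shift
    # Linear term weighting the initial-state distribution
    linear = (1.0 - gamma) * np.mean(nu_init)
    return log_expected - linear
```

In the adversarial training loop, ν would be updated to minimize this quantity while the policy is updated in the opposite direction, so no explicit reward signal or separate RL optimization is needed.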
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL hopper-expert v2 | Normalized Score | 65.6 | 56 |
| Offline Reinforcement Learning | D4RL walker2d-expert v2 | Normalized Score | 28.2 | 56 |
| Offline Reinforcement Learning | D4RL halfcheetah-expert v2 | Normalized Score | 9.8 | 56 |
| Offline Imitation Learning | D4RL Ant v2 (expert) | Normalized Score | 90.5 | 20 |
| Continuous Control | MuJoCo Ant | Average Reward | 4.51e+3 | 12 |
| Continuous Control | MuJoCo HalfCheetah | Average Reward | 4.84e+3 | 12 |
| Robotic Manipulation | Robomimic Lift | Success Rate | 47.6 | 12 |
| Robotic Manipulation | Robomimic Can | Success Rate | 41.8 | 12 |
| Robotic Manipulation | Robomimic Square | Success Rate | 8.3 | 12 |
| Action-matching | MIMIC-III (test) | Accuracy | 79.4 | 9 |