RvS: What is Essential for Offline RL via Supervised Learning?

About

Recent work has shown that supervised learning alone, without temporal difference (TD) learning, can be remarkably effective for offline RL. When does this hold true, and which algorithmic components are necessary? Through extensive experiments, we boil supervised learning for offline RL down to its essential elements. In every environment suite we consider, simply maximizing likelihood with a two-layer feedforward MLP is competitive with state-of-the-art results of substantially more complex methods based on TD learning or sequence modeling with Transformers. Carefully choosing model capacity (e.g., via regularization or architecture) and choosing which information to condition on (e.g., goals or rewards) are critical for performance. These insights serve as a field guide for practitioners doing Reinforcement Learning via Supervised Learning (which we coin "RvS learning"). They also probe the limits of existing RvS methods, which are comparatively weak on random data, and suggest a number of open problems.

Scott Emmons, Benjamin Eysenbach, Ilya Kostrikov, Sergey Levine• 2021

Related benchmarks

Task	Dataset	Result
Locomotion	D4RL walker2d-medium-expert	Normalized Score106	90
walker2d locomotion	D4RL walker2d medium-replay	Normalized Score60.6	78
Offline Reinforcement Learning	D4RL antmaze-umaze (diverse)	Normalized Score66.2	74
hopper locomotion	D4RL hopper medium-replay	Normalized Score73.5	71
Locomotion	D4RL Halfcheetah medium	Normalized Score42.6	70
Locomotion	D4RL Walker2d medium	Normalized Score0.717	70
Offline Reinforcement Learning	Kitchen Partial	Normalized Score71.7	69
Locomotion	D4RL halfcheetah-medium-expert	Normalized Score92.2	53
hopper locomotion	D4RL hopper-medium-expert	Normalized Score101.7	53
Offline Reinforcement Learning	antmaze medium-play	Score71.8	44

Showing 10 of 56 rows

Other info

Follow for update

@wizwand_team Discord