
In-context Reinforcement Learning with Algorithm Distillation

About

We propose Algorithm Distillation (AD), a method for distilling reinforcement learning (RL) algorithms into neural networks by modeling their training histories with a causal sequence model. Algorithm Distillation treats learning to reinforcement learn as an across-episode sequential prediction problem. A dataset of learning histories is generated by a source RL algorithm, and then a causal transformer is trained by autoregressively predicting actions given their preceding learning histories as context. Unlike sequential policy prediction architectures that distill post-learning or expert sequences, AD is able to improve its policy entirely in-context without updating its network parameters. We demonstrate that AD can reinforcement learn in-context in a variety of environments with sparse rewards, combinatorial task structure, and pixel-based observations, and find that AD learns a more data-efficient RL algorithm than the one that generated the source data.
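The core idea above can be made concrete with a minimal, illustrative sketch of the data AD trains on. The helper names and the toy history below are our own assumptions, not the authors' code: a source RL algorithm produces an ordered learning history of episodes, which is flattened into one long (observation, action, reward) sequence, and the causal model is trained to predict each action from everything preceding it, including earlier episodes.

```python
# Minimal sketch of AD's across-episode prediction problem.
# Assumptions (not from the paper's code): episodes are lists of
# (obs, action, reward) steps, ordered by the source algorithm's
# training progress; tokens are tagged tuples for clarity.

def flatten_history(episodes):
    """Flatten an ordered list of episodes into one across-episode sequence."""
    tokens = []
    for episode in episodes:
        for obs, action, reward in episode:
            tokens.extend([("obs", obs), ("act", action), ("rew", reward)])
    return tokens

def action_prediction_pairs(tokens):
    """Yield (context, target_action) training pairs.

    Each action is predicted from the *full* preceding history, so the
    model can infer policy improvement across episode boundaries.
    """
    pairs = []
    for i, (kind, value) in enumerate(tokens):
        if kind == "act":
            pairs.append((tokens[:i], value))
    return pairs

# Toy learning history: two episodes from a 1-D task; the later episode
# earns higher reward, reflecting the source algorithm's improvement.
history = [
    [(0, 1, 0.0), (1, 0, 0.0)],  # early, exploratory episode
    [(0, 1, 0.0), (1, 1, 1.0)],  # later episode: better policy
]
tokens = flatten_history(history)
pairs = action_prediction_pairs(tokens)
```

Because contexts span episode boundaries, fitting these pairs with a causal transformer forces the network to model how the source algorithm's policy changes over training, which is what lets AD improve purely in-context at evaluation time.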

Michael Laskin, Luyu Wang, Junhyuk Oh, Emilio Parisotto, Stephen Spencer, Richie Steigerwald, DJ Strouse, Steven Hansen, Angelos Filos, Ethan Brooks, Maxime Gazeau, Himanshu Sahni, Satinder Singh, Volodymyr Mnih • 2022

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Darkroom | Grid World | Offline Training Time (hours): 0.21 | 6 |
| Dark Key-to-Door | Grid World | Offline Training Time (hours): 0.65 | 3 |
| Darkroom Hard | Grid World | Offline Training Time (hours): 0.22 | 3 |
| HalfCheetah | D4RL | Training Time (hours): 28.56 | 3 |
| Hopper | D4RL | Offline Training Time (hours): 18.15 | 3 |
| Large Dark Key-to-Door | Large Grid World | Offline Training Time (hours): 6.87 | 3 |
| Large Darkroom | Large Grid World | Offline Training Time (hours): 3.52 | 3 |
| Large Darkroom Dynamic | Large Grid World | Offline Training Time (hours): 2.71 | 3 |
| Large Darkroom Hard | Large Grid World | Offline Training Time (hours): 4.26 | 3 |
| Walker2d | D4RL | Offline Training Time (hours): 26.25 | 3 |
