
Does Zero-Shot Reinforcement Learning Exist?

About

A zero-shot RL agent is an agent that can solve any RL task in a given environment, instantly with no additional planning or learning, after an initial reward-free learning phase. This marks a shift from the reward-centric RL paradigm towards "controllable" agents that can follow arbitrary instructions in an environment. Current RL agents can solve families of related tasks at best, or require planning anew for each task. Strategies for approximate zero-shot RL have been suggested using successor features (SFs) [BBQ+18] or forward-backward (FB) representations [TO21], but testing has been limited. After clarifying the relationships between these schemes, we introduce improved losses and new SF models, and test the viability of zero-shot RL schemes systematically on tasks from the Unsupervised RL benchmark [LYL+21]. To disentangle universal representation learning from exploration, we work in an offline setting and repeat the tests on several existing replay buffers. SFs appear to suffer from the choice of the elementary state features. SFs with Laplacian eigenfunctions do well, while SFs based on auto-encoders, inverse curiosity, transition models, low-rank transition matrices, contrastive learning, or diversity (APS) perform inconsistently. In contrast, FB representations jointly learn the elementary and successor features from a single, principled criterion. They perform best and consistently across the board, reaching 85% of supervised RL performance with a good replay buffer, in a zero-shot manner.
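To make the zero-shot step concrete, the minimal sketch below illustrates the test-time recipe that SF- and FB-style methods share: after reward-free training, a new task's reward is summarized by a single task vector z_r estimated from a handful of reward-labelled states, and the agent then acts greedily on Q(s, a) = F(s, a, z_r) · z_r with no further learning. The networks F and B here are random linear placeholders standing in for the pretrained models, and the dimensions and action-sampling loop are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

state_dim, action_dim, z_dim = 24, 6, 50
rng = np.random.default_rng(0)

# Placeholder linear "networks"; in practice F and B are learned
# during the reward-free pretraining phase.
W_F = rng.normal(size=(z_dim, state_dim + action_dim + z_dim)) / np.sqrt(state_dim)
W_B = rng.normal(size=(z_dim, state_dim)) / np.sqrt(state_dim)

def F(s, a, z):
    # Forward embedding F(s, a, z) in R^d (stand-in for the learned model).
    return W_F @ np.concatenate([s, a, z])

def B(s):
    # Backward / elementary state features B(s) in R^d (stand-in).
    return W_B @ s

def infer_task_vector(states, rewards):
    # Zero-shot reward encoding: z_r ≈ E[ r(s) * B(s) ],
    # estimated from a small batch of reward-labelled states.
    return np.mean(rewards[:, None] * np.array([B(s) for s in states]), axis=0)

def q_value(s, a, z):
    # Zero-shot Q-value: Q(s, a) = F(s, a, z) . z  (no further training).
    return F(s, a, z) @ z

# Usage with dummy reward-labelled samples for a newly revealed task.
states = rng.normal(size=(128, state_dim))
rewards = rng.normal(size=128)
z_r = infer_task_vector(states, rewards)

# Greedy action selection over a few sampled candidate actions.
some_state = rng.normal(size=state_dim)
candidate_actions = rng.uniform(-1, 1, size=(16, action_dim))
best_action = max(candidate_actions, key=lambda a: q_value(some_state, a, z_r))
```

For SF-based variants, B plays the role of the fixed elementary state features and the same reward-regression step yields the task weights; the paper's point is that FB learns both F and B jointly from one criterion rather than fixing B a priori.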

Ahmed Touati, Jérémy Rapin, Yann Ollivier • 2022

Related benchmarks

Task | Dataset | Metric | Result | Rank
Zero-shot Reinforcement Learning | ExORL APS (Jaco environment) v1 (test) | Reach Bottom Left | 53 | 8
Zero-shot Reinforcement Learning | ExORL APS (Cheetah environment) v1 (test) | Run Backward | 250 | 8
Zero-shot Reinforcement Learning | ExORL APS (Quadruped environment) v1 (test) | Jump Score | 757 | 4
Zero-shot Reinforcement Learning | ExORL RND (Quadruped environment) v1 (test) | Jump Success | 758 | 4
Zero-shot Reinforcement Learning | ExORL APS (Walker environment) v1 (test) | Flip Count | 426 | 4
Zero-shot Reinforcement Learning | ExORL RND (Walker environment) v1 (test) | Flip | 548 | 4
Run | DMC Cheetah, average of APS, Proto, RND datasets | Mean Return | 248 | 3
Run | DMC Walker, average of APS, Proto, RND datasets | Mean Return | 356 | 3
Run-B | DMC Cheetah, average of APS, Proto, RND datasets | Mean Return | 229 | 3
Stand | DMC Walker, average of APS, Proto, RND datasets | Mean Return | 754 | 3

(Showing 10 of 22 rows)
