Distributional Successor Features Enable Zero-Shot Policy Optimization

About

Intelligent agents must be generalists, capable of quickly adapting to various tasks. In reinforcement learning (RL), model-based RL learns a dynamics model of the world, in principle enabling transfer to arbitrary reward functions through planning. However, autoregressive model rollouts suffer from compounding error, making model-based RL ineffective for long-horizon problems. Successor features offer an alternative by modeling a policy's long-term state occupancy, reducing policy evaluation under new rewards to linear regression. Yet, zero-shot policy optimization for new tasks with successor features can be challenging. This work proposes a novel class of models, Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs), that learn a distribution of successor features of a stationary dataset's behavior policy, along with a policy that acts to realize different successor features achievable within the dataset. By directly modeling long-term outcomes in the dataset, DiSPOs avoid compounding error while enabling a simple scheme for zero-shot policy optimization across reward functions. We present a practical instantiation of DiSPOs using diffusion models and show their efficacy as a new class of transferable models, both theoretically and empirically across various simulated robotics problems. Videos and code are available at https://weirdlabuw.github.io/dispo/.
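
The claim that successor features reduce policy evaluation under new rewards to linear regression can be illustrated with a small tabular sketch. This is not the DiSPOs method (which models a distribution over successor features with diffusion models); it is a minimal example under stated assumptions: a fixed policy inducing a known transition matrix `P`, a state-feature matrix `Phi`, and rewards that are exactly linear in the features. All names below (`successor_features`, `fit_reward_weights`, `w_true`) are hypothetical and chosen for illustration.

```python
import numpy as np


def successor_features(P, Phi, gamma):
    """Solve psi = Phi + gamma * P @ psi for a fixed policy's transition matrix P."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, Phi)


def fit_reward_weights(Phi, rewards):
    """Least-squares fit of w such that rewards ~= Phi @ w."""
    w, *_ = np.linalg.lstsq(Phi, rewards, rcond=None)
    return w


rng = np.random.default_rng(0)
n_states, n_features, gamma = 6, 3, 0.9

# Assumed toy setup: random row-stochastic transitions under one fixed policy.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)

# Assumed state features; each row is the feature vector of one state.
Phi = rng.normal(size=(n_states, n_features))

# Successor features are computed once and do not depend on the reward.
psi = successor_features(P, Phi, gamma)

# A "new task": rewards linear in the features with weights unknown to the agent.
w_true = np.array([1.0, -0.5, 2.0])
rewards = Phi @ w_true

# Policy evaluation for the new task: regress rewards on features, then
# take a dot product with the precomputed successor features.
w_hat = fit_reward_weights(Phi, rewards)
values_sf = psi @ w_hat

# Reference: solve the Bellman equation V = r + gamma * P @ V directly.
values_direct = np.linalg.solve(np.eye(n_states) - gamma * P, rewards)
```

In this sketch the two value estimates agree because the reward lies exactly in the span of the features; with approximate features or noisy rewards, the regression step would only approximate the true values.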

Chuning Zhu, Xinqi Wang, Tyler Han, Simon S. Du, Abhishek Gupta • 2024

Related benchmarks

| Task | Dataset | Result (Average Episodic Return) | Rank |
|---|---|---|---|
| Offline multitask Reinforcement Learning | Franka Kitchen kitchen-mixed | 46 | 23 |
| Offline multitask Reinforcement Learning | Franka Kitchen kitchen-partial | 43 | 13 |
| Reinforcement Learning | Hopper (forward) | 832 | 12 |
| Offline multitask Reinforcement Learning | Hopper backward | 367 | 12 |
| Reinforcement Learning | AntMaze umaze D4RL | 593 | 8 |
| Reinforcement Learning | AntMaze medium-diverse D4RL | 631 | 8 |
| Reinforcement Learning | AntMaze medium-play D4RL | 624 | 8 |
| Reinforcement Learning | AntMaze large-diverse D4RL | 359 | 8 |
| Reinforcement Learning | AntMaze large-play D4RL | 306 | 8 |
| Reinforcement Learning | AntMaze umaze-diverse D4RL | 568 | 8 |
Showing 10 of 20 rows
