Distributional Successor Features Enable Zero-Shot Policy Optimization

About

Intelligent agents must be generalists, capable of quickly adapting to various tasks. In reinforcement learning (RL), model-based RL learns a dynamics model of the world, in principle enabling transfer to arbitrary reward functions through planning. However, autoregressive model rollouts suffer from compounding error, making model-based RL ineffective for long-horizon problems. Successor features offer an alternative by modeling a policy's long-term state occupancy, reducing policy evaluation under new rewards to linear regression. Yet zero-shot policy optimization for new tasks with successor features can be challenging. This work proposes a novel class of models, Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs), that learn a distribution of successor features of a stationary dataset's behavior policy, along with a policy that acts to realize different successor features achievable within the dataset. By directly modeling long-term outcomes in the dataset, DiSPOs avoid compounding error while enabling a simple scheme for zero-shot policy optimization across reward functions. We present a practical instantiation of DiSPOs using diffusion models and show their efficacy as a new class of transferable models, both theoretically and empirically across various simulated robotics problems. Videos and code are available at https://weirdlabuw.github.io/dispo/.

Chuning Zhu, Xinqi Wang, Tyler Han, Simon S. Du, Abhishek Gupta • 2024
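
The abstract's claim that successor features reduce policy evaluation under a new reward to linear regression can be illustrated with a small tabular sketch. This is not the authors' code. It assumes a reward that is exactly linear in hand-picked features, a known transition matrix, and NumPy; the names `successor_features`, `fit_reward_weights`, and `best_outcome` are hypothetical. The final selection step is only a stand-in for the idea of choosing among successor features achievable within the dataset, not the diffusion-based procedure the paper describes.

```python
import numpy as np


def successor_features(P, Phi, gamma):
    """Closed-form successor features of a fixed policy in a tabular chain.

    P   : (S, S) state-transition matrix induced by the policy
    Phi : (S, d) feature matrix, row s is phi(s)
    Returns Psi of shape (S, d) with Psi[s] = E[sum_t gamma^t phi(s_t) | s_0 = s],
    i.e. the solution of Psi = Phi + gamma * P @ Psi.
    """
    n_states = P.shape[0]
    return np.linalg.solve(np.eye(n_states) - gamma * P, Phi)


def fit_reward_weights(Phi_samples, rewards):
    """Least-squares fit of w such that rewards ~= Phi_samples @ w."""
    w, *_ = np.linalg.lstsq(Phi_samples, rewards, rcond=None)
    return w


def best_outcome(candidates, w):
    """Index and value of the candidate successor feature maximizing psi @ w."""
    values = candidates @ w
    idx = int(np.argmax(values))
    return idx, float(values[idx])


# Toy problem (all quantities are illustrative assumptions).
rng = np.random.default_rng(0)
n_states, n_features, gamma = 5, 3, 0.9

P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)          # make each row a distribution
Phi = rng.normal(size=(n_states, n_features))

w_true = np.array([1.0, -0.5, 2.0])        # a "new task" revealed only via rewards
rewards = Phi @ w_true

# Psi depends on the policy and features only, so it is computed once.
Psi = successor_features(P, Phi, gamma)

# Adapting to the new reward is a regression from features to rewards.
w_hat = fit_reward_weights(Phi, rewards)

# Policy evaluation is then a single matrix-vector product.
V_sf = Psi @ w_hat

# Reference: solve the Bellman equation directly for this reward.
V_direct = np.linalg.solve(np.eye(n_states) - gamma * P, rewards)

# Treat each row of Psi as a candidate long-term outcome and pick the best.
best_idx, best_value = best_outcome(Psi, w_hat)
```

Under these assumptions, `V_sf` matches `V_direct` up to numerical precision, and switching to a different reward only requires refitting `w_hat`; `Psi` is reused unchanged.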

Related benchmarks

Task	Dataset	Result
Offline multitask Reinforcement Learning	Franka Kitchen kitchen-mixed	Average Episodic Return: 46	23
Offline multitask Reinforcement Learning	Franka Kitchen kitchen-partial	Average Episodic Return: 43	13
Reinforcement Learning	AntMaze umaze D4RL	Average Episodic Return: 593	12
Reinforcement Learning	AntMaze large-play D4RL	Average Episodic Return: 306	12
Reinforcement Learning	Hopper (forward)	Average Episodic Return: 832	12
Offline multitask Reinforcement Learning	Hopper backward	Average Episodic Return: 367	12
Reinforcement Learning	Hopper stand	Average Episodic Return: 800	9
Reinforcement Learning	AntMaze medium-diverse D4RL	Average Episodic Return: 631	8
Reinforcement Learning	AntMaze medium-play D4RL	Average Episodic Return: 624	8
Reinforcement Learning	AntMaze large-diverse D4RL	Average Episodic Return: 359	8

Showing 10 of 20 rows
