Distributional Successor Features Enable Zero-Shot Policy Optimization
About
Intelligent agents must be generalists, capable of quickly adapting to various tasks. In reinforcement learning (RL), model-based RL learns a dynamics model of the world, in principle enabling transfer to arbitrary reward functions through planning. However, autoregressive model rollouts suffer from compounding error, making model-based RL ineffective for long-horizon problems. Successor features offer an alternative by modeling a policy's long-term state occupancy, reducing policy evaluation under new rewards to linear regression. Yet, zero-shot policy optimization for new tasks with successor features can be challenging. This work proposes a novel class of models, Distributional Successor Features for Zero-Shot Policy Optimization (DiSPOs), that learn a distribution of successor features of a stationary dataset's behavior policy, along with a policy that acts to realize different successor features achievable within the dataset. By directly modeling long-term outcomes in the dataset, DiSPOs avoid compounding error while enabling a simple scheme for zero-shot policy optimization across reward functions. We present a practical instantiation of DiSPOs using diffusion models and show their efficacy as a new class of transferable models, both theoretically and empirically across various simulated robotics problems. Videos and code are available at https://weirdlabuw.github.io/dispo/.
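The claim that successor features reduce policy evaluation under a new reward to linear regression can be illustrated with a small tabular sketch. This is not the DiSPO method itself, which models a distribution over successor features with diffusion models. It only shows the classical identity the abstract relies on, under the assumption that rewards are linear in known state features. The toy MDP, the feature matrix `phi`, and the helper names `successor_features` and `fit_reward_weights` are all hypothetical and chosen for illustration.

```python
import numpy as np


def successor_features(P, phi, gamma):
    """Closed-form successor features of a fixed policy in a tabular MDP.

    P:   (S, S) state-transition matrix under the policy (rows sum to 1).
    phi: (S, d) feature matrix, one feature vector per state.
    Returns psi of shape (S, d), where
    psi[s] = E[sum_t gamma^t * phi(s_t) | s_0 = s].
    """
    n_states = P.shape[0]
    return np.linalg.solve(np.eye(n_states) - gamma * P, phi)


def fit_reward_weights(phi, rewards):
    """Least-squares fit of w such that rewards ~= phi @ w."""
    w, *_ = np.linalg.lstsq(phi, rewards, rcond=None)
    return w


# Hypothetical toy problem: random dynamics and random features.
rng = np.random.default_rng(0)
n_states, n_features, gamma = 6, 3, 0.9
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)  # normalize rows into probabilities
phi = rng.normal(size=(n_states, n_features))

# Computed once; independent of any reward function.
psi = successor_features(P, phi, gamma)

# A "new task": a reward that is linear in the features (an assumption).
true_w = np.array([1.0, -0.5, 2.0])
rewards = phi @ true_w

# Evaluating the policy on the new task needs only a regression
# and a matrix-vector product.
w = fit_reward_weights(phi, rewards)
values = psi @ w

# Reference: standard policy evaluation, V = (I - gamma * P)^{-1} r.
reference = np.linalg.solve(np.eye(n_states) - gamma * P, rewards)
```

Here `psi` is computed without reference to any reward, so swapping in a different reward vector only requires refitting `w`. The values agree with direct policy evaluation only to the extent that the reward is well approximated as linear in the chosen features.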
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Offline multitask Reinforcement Learning | Franka Kitchen kitchen-mixed | Average Episodic Return | 46 | 23 |
| Offline multitask Reinforcement Learning | Franka Kitchen kitchen-partial | Average Episodic Return | 43 | 13 |
| Reinforcement Learning | Hopper (forward) | Average Episodic Return | 832 | 12 |
| Offline multitask Reinforcement Learning | Hopper backward | Average Episodic Return | 367 | 12 |
| Reinforcement Learning | AntMaze umaze D4RL | Average Episodic Return | 593 | 8 |
| Reinforcement Learning | AntMaze medium-diverse D4RL | Average Episodic Return | 631 | 8 |
| Reinforcement Learning | AntMaze medium-play D4RL | Average Episodic Return | 624 | 8 |
| Reinforcement Learning | AntMaze large-diverse D4RL | Average Episodic Return | 359 | 8 |
| Reinforcement Learning | AntMaze large-play D4RL | Average Episodic Return | 306 | 8 |
| Reinforcement Learning | AntMaze umaze-diverse D4RL | Average Episodic Return | 568 | 8 |