Learning One Representation to Optimize All Rewards

About

We introduce the forward-backward (FB) representation of the dynamics of a reward-free Markov decision process. It provides explicit near-optimal policies for any reward specified a posteriori. During an unsupervised phase, we use reward-free interactions with the environment to learn two representations via off-the-shelf deep learning methods and temporal difference (TD) learning. In the test phase, a reward representation is estimated either from observations or an explicit reward description (e.g., a target state). The optimal policy for that reward is directly obtained from these representations, with no planning. We assume access to an exploration scheme or replay buffer for the first phase. The corresponding unsupervised loss is well-principled: if training is perfect, the policies obtained are provably optimal for any reward function. With imperfect training, the sub-optimality is proportional to the unsupervised approximation error. The FB representation learns long-range relationships between states and actions, via a predictive occupancy map, without having to synthesize states as in model-based approaches. This is a step towards learning controllable agents in arbitrary black-box stochastic environments. This approach compares well to goal-oriented RL algorithms on discrete and continuous mazes, pixel-based MsPacman, and the FetchReach virtual robot arm. We also illustrate how the agent can immediately adapt to new tasks beyond goal-oriented RL.
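The test phase described above can be illustrated with a minimal sketch. It assumes the two learned representations are a forward map F(s, a, z) and a backward map B(s), both returning d-dimensional vectors, that the reward representation is z = E[r(s) B(s)] over sampled states (or z = B(goal) for a target state), and that the policy acts greedily on F(s, a, z)·z. The abstract does not spell out these formulas, so treat them as assumptions. The function names, the tabular arrays, and the placeholder `F` that ignores z are hypothetical and only serve the illustration.

```python
import numpy as np


def reward_embedding(B, rewards):
    """Estimate z as the sample mean of r(s) * B(s).

    B: array of shape (n_samples, d), backward embeddings of sampled states.
    rewards: array of shape (n_samples,), observed rewards at those states.
    """
    return (rewards[:, None] * B).mean(axis=0)


def goal_embedding(B, goal_index):
    """For a goal-reaching task, use the backward embedding of the goal state."""
    return B[goal_index]


def greedy_actions(F, z):
    """Pick argmax_a F(s, a, z) . z for every state, with no planning step.

    F: callable mapping z to an array of shape (n_states, n_actions, d).
    """
    q_values = F(z) @ z  # shape (n_states, n_actions)
    return q_values.argmax(axis=1)


# Toy setup: 3 states, 2 actions, d = 3 (all values are made up).
B = np.eye(3)
rewards = np.array([0.0, 0.0, 3.0])

F_table = np.zeros((3, 2, 3))
F_table[:, 1, 2] = 1.0  # action 1 leads toward the rewarded state


def F(z):
    # Placeholder: a trained forward network would depend on z.
    return F_table


z = reward_embedding(B, rewards)
actions = greedy_actions(F, z)
```

In a trained agent, `F` and `B` would be neural networks learned during the unsupervised phase, and only `z` would change when a new reward is specified.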

Ahmed Touati, Yann Ollivier • 2021

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline multitask Reinforcement Learning | Franka Kitchen kitchen-mixed | Average Episodic Return | 5 | 23 |
| Offline multitask Reinforcement Learning | Franka Kitchen kitchen-partial | Average Episodic Return | 4 | 13 |
| Offline multitask Reinforcement Learning | Hopper backward | Average Episodic Return | 269 | 12 |
| Reinforcement Learning | Hopper (forward) | Average Episodic Return | 726 | 12 |
| Goal-conditioned Reinforcement Learning | OGBench scene play (5 tasks) zero-shot | Average Return | 13 | 10 |
| Reinforcement Learning | AntMaze umaze D4RL | Average Episodic Return | 469 | 8 |
| Reinforcement Learning | AntMaze umaze-diverse D4RL | Average Episodic Return | 474 | 8 |
| Reinforcement Learning | AntMaze medium-diverse D4RL | Avg Episodic Return | 294 | 8 |
| Reinforcement Learning | AntMaze large-diverse D4RL | Average Episodic Return | 181 | 8 |
| Reinforcement Learning | AntMaze large-play D4RL | Average Episodic Return | 165 | 8 |
Showing 10 of 30 rows
