
Foundation Policies with Hilbert Representations

About

Unsupervised and self-supervised objectives, such as next-token prediction, have enabled pre-training generalist models from large amounts of unlabeled data. In reinforcement learning (RL), however, finding a truly general and scalable unsupervised pre-training objective for generalist policies from offline data remains a major open question. While a number of methods have been proposed to enable generic self-supervised RL, based on principles such as goal-conditioned RL, behavioral cloning, and unsupervised skill learning, such methods remain limited in the diversity of the behaviors they discover, their need for high-quality demonstration data, or their lack of a clear adaptation mechanism for downstream tasks. In this work, we propose a novel unsupervised framework to pre-train generalist policies that capture diverse, optimal, long-horizon behaviors from unlabeled offline data, such that they can be quickly adapted to arbitrary new tasks in a zero-shot manner. Our key insight is to learn a structured representation that preserves the temporal structure of the underlying environment, and then to span this learned latent space with directional movements, which enables various zero-shot policy "prompting" schemes for downstream tasks. Through experiments on simulated robotic locomotion and manipulation benchmarks, we show that our unsupervised policies can solve goal-conditioned and general RL tasks in a zero-shot fashion, often even outperforming prior methods designed specifically for each setting. Our code and videos are available at https://seohong.me/projects/hilp/.

Seohong Park, Tobias Kreiman, Sergey Levine • 2024
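
As a rough illustration of the idea summarized above, here is a minimal PyTorch sketch, assuming a hypothetical learned state encoder phi whose latent distances approximate temporal distances between states. It shows how a latent-direction-conditioned policy could be rewarded for moving along a direction z, and how a goal observation can be turned into a zero-shot "prompt" by pointing z toward the goal's embedding. The names StateEncoder, directional_reward, and goal_prompt are illustrative and not taken from the paper's code.

```python
import torch
import torch.nn as nn


class StateEncoder(nn.Module):
    """Hypothetical encoder phi: S -> Z. In HILP-style training, phi would be
    learned so that distances in Z reflect temporal distances between states;
    here it is just an MLP stub for illustration."""

    def __init__(self, obs_dim: int, latent_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


def directional_reward(phi: nn.Module, obs: torch.Tensor,
                       next_obs: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Intrinsic reward for a latent-direction-conditioned policy: the
    projection of the latent displacement phi(s') - phi(s) onto direction z."""
    with torch.no_grad():
        delta = phi(next_obs) - phi(obs)
    return (delta * z).sum(dim=-1)


def goal_prompt(phi: nn.Module, obs: torch.Tensor,
                goal_obs: torch.Tensor) -> torch.Tensor:
    """Zero-shot goal prompting: point z from the current state toward the
    goal in latent space and normalize it to unit length."""
    with torch.no_grad():
        direction = phi(goal_obs) - phi(obs)
    return direction / (direction.norm(dim=-1, keepdim=True) + 1e-8)


if __name__ == "__main__":
    obs_dim, batch = 17, 4
    phi = StateEncoder(obs_dim)
    obs, next_obs, goal = (torch.randn(batch, obs_dim) for _ in range(3))
    z = goal_prompt(phi, obs, goal)                # prompt toward the goal
    r = directional_reward(phi, obs, next_obs, z)  # reward for moving along z
    print(r.shape)  # torch.Size([4])
```

In this sketch, adapting to a new goal only requires encoder forward passes to compute z; no policy or encoder weights are updated, which is what "zero-shot" refers to in the abstract.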

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Offline Reinforcement Learning | halfcheetah medium v2 | Average Score | 43.85 | 27 |
| Offline Reinforcement Learning | halfcheetah medium-expert v2 | Normalized Score | 68.47 | 18 |
| Offline Reinforcement Learning | walker2d medium v2 | Normalized Score | 56.34 | 18 |
| Offline Reinforcement Learning | hopper medium v2 | -- | -- | 14 |
| Zero-shot Reinforcement Learning | ExORL RND Walker environment v1 (test) | Flip | 563 | 12 |
| Zero-shot Reinforcement Learning | ExORL RND (Quadruped environment) v1 (test) | Jump Success | 556 | 12 |
| Goal-conditioned Reinforcement Learning | OGBench scene play (5 tasks) zero-shot | Average Return | 19 | 10 |
| Zero-shot Reinforcement Learning | ExORL APS (Jaco environment) v1 (test) | Reach Bottom Left | 88 | 8 |
| Visual Control | ExORL Cheetah Zero-shot RND | Walk Score | 690 | 8 |
| Zero-shot Reinforcement Learning | ExORL APS Cheetah environment v1 (test) | Run Backward | 373 | 8 |

Showing 10 of 38 rows.
