METRA: Scalable Unsupervised RL with Metric-Aware Abstraction
About
Unsupervised pre-training strategies have proven to be highly effective in natural language processing and computer vision. Likewise, unsupervised reinforcement learning (RL) holds the promise of discovering a variety of potentially useful behaviors that can accelerate the learning of a wide array of downstream tasks. Previous unsupervised RL approaches have mainly focused on pure exploration and mutual information skill learning. However, despite these attempts, making unsupervised RL truly scalable remains a major open challenge: pure exploration approaches might struggle in complex environments with large state spaces, where covering every possible transition is infeasible, and mutual information skill learning approaches might completely fail to explore the environment due to the lack of incentives. To make unsupervised RL scalable to complex, high-dimensional environments, we propose a novel unsupervised RL objective, which we call Metric-Aware Abstraction (METRA). Our main idea is, instead of directly covering the entire state space, to only cover a compact latent space $Z$ that is metrically connected to the state space $S$ by temporal distances. By learning to move in every direction in the latent space, METRA obtains a tractable set of diverse behaviors that approximately cover the state space, which makes it scalable to high-dimensional environments. Through our experiments in five locomotion and manipulation environments, we demonstrate that METRA can discover a variety of useful behaviors even in complex, pixel-based environments, making it the first unsupervised RL method that discovers diverse locomotion behaviors in pixel-based Quadruped and Humanoid. Our code and videos are available at https://seohong.me/projects/metra/
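To make the idea of "moving in every direction in the latent space" more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the commonly cited form of the METRA objective: a skill vector `z` is rewarded for the latent displacement it produces, `(phi(s') - phi(s)) · z`, subject to temporally adjacent states lying at most unit distance apart in the latent space. The function names and the two-dimensional toy latents are hypothetical and chosen only for illustration; in practice `phi` would be a learned network and the constraint would be enforced during training.

```python
import math

def latent_step(phi_s, phi_next):
    """Displacement in latent space between two consecutive states."""
    return [b - a for a, b in zip(phi_s, phi_next)]

def metra_reward(phi_s, phi_next, z):
    """Assumed reward form: projection of the latent displacement onto skill z."""
    return sum(d * zi for d, zi in zip(latent_step(phi_s, phi_next), z))

def satisfies_temporal_constraint(phi_s, phi_next, tol=1e-9):
    """Assumed constraint: states one step apart are at most unit distance apart in latent space."""
    step = latent_step(phi_s, phi_next)
    return math.sqrt(sum(d * d for d in step)) <= 1.0 + tol

# Toy latents (hypothetical values; a real phi would be a learned encoder).
phi_s = [0.0, 0.0]
phi_next = [0.6, 0.8]

# A skill aligned with the displacement, and one pointing the opposite way.
angle = math.atan2(0.8, 0.6)
z_aligned = [math.cos(angle), math.sin(angle)]
z_opposite = [-c for c in z_aligned]

reward_aligned = metra_reward(phi_s, phi_next, z_aligned)
reward_opposite = metra_reward(phi_s, phi_next, z_opposite)
print(reward_aligned, reward_opposite)
```

Under these assumptions, a transition earns a high reward only when it moves the latent state along the commanded skill direction, and the distance constraint ties latent movement to temporal distance in the state space, so different skill vectors are pushed toward temporally distant, distinct regions.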
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reinforcement Learning | Walker URLB (downstream) | Flip Success Score | 373 | 12 |
| Reinforcement Learning | Jaco URLB (downstream) | Reach Count BL | 26 | 12 |
| Reinforcement Learning | Quadruped URLB (downstream) | Jump Score | 218 | 12 |
| Downstream Task Performance | Ant Hole | Average Performance (Ant Hole) | 224.1 | 7 |
| Downstream Task Performance | Ant North | Average Performance | -1.90e+3 | 7 |
| Safe Locomotion | Ant North | Safe State Ratio | 20.4 | 7 |
| Safe Locomotion | Quadruped North | Safe State Ratio | 21.1 | 7 |
| Hierarchical Control | Halfcheetah | Performance Score | 21.58 | 7 |
| Safe Locomotion | Ant Hole | Safe State Ratio | 74.7 | 7 |
| Safe Locomotion | HalfCheetah Not-Flip | Safe State Ratio | 89.2 | 7 |