Flattening Hierarchies with Policy Bootstrapping
About
Offline goal-conditioned reinforcement learning (GCRL) is a promising approach for pretraining generalist policies on large datasets of reward-free trajectories, akin to the self-supervised objectives used to train foundation models for computer vision and natural language processing. However, scaling GCRL to longer horizons remains challenging due to the combination of sparse rewards and discounting, which obscures the comparative advantages of primitive actions with respect to distant goals. Hierarchical RL methods achieve strong empirical results on long-horizon goal-reaching tasks, but their reliance on modular, timescale-specific policies and subgoal generation introduces significant additional complexity and hinders scaling to high-dimensional goal spaces. In this work, we introduce an algorithm to train a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling. Our approach eliminates the need for a generative model over the (sub)goal space, which we find is key for scaling to high-dimensional control in large state spaces. We further show that existing hierarchical and bootstrapping-based approaches correspond to specific design choices within our derivation. Across a comprehensive suite of state- and pixel-based locomotion and manipulation benchmarks, our method matches or surpasses state-of-the-art offline GCRL algorithms and scales to complex, long-horizon tasks where prior approaches fail. Project page: https://johnlyzhou.github.io/saw/
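To illustrate why sparse rewards combined with discounting obscure the advantage of primitive actions at long range, here is a minimal sketch. It assumes a toy setting that is not taken from the paper: a reward of 1 on reaching the goal and 0 elsewhere, so a state `distance` steps from the goal has optimal value `gamma ** distance`. The function name `value_gap` and the default `gamma=0.99` are hypothetical choices made for illustration.

```python
def value_gap(distance, gamma=0.99):
    """Value difference between stepping toward and away from the goal.

    Toy assumption: reward 1 at the goal, 0 elsewhere, so the optimal
    value of a state `distance` steps away is gamma ** distance.
    """
    toward = gamma ** (distance - 1)  # value after moving one step closer
    away = gamma ** (distance + 1)    # value after moving one step farther
    return toward - away


if __name__ == "__main__":
    # The gap shrinks geometrically as the goal gets farther away.
    for d in (10, 100, 1000):
        print(f"distance={d:5d}  gap={value_gap(d):.3e}")
```

Under these assumptions the gap equals `gamma ** (distance - 1) * (1 - gamma ** 2)`, so it decays geometrically with distance and can fall below the approximation error of a learned value function for distant goals.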
Related benchmarks
| Task | Dataset | Success Rate | Rank |
|---|---|---|---|
| Goal-conditioned Reinforcement Learning | OGBench antmaze-medium-stitch v0 | 64 | 12 |
| Goal-conditioned Reinforcement Learning | OGBench humanoidmaze-medium-stitch v0 | 63.6 | 12 |
| Goal-conditioned Reinforcement Learning | OGBench humanoidmaze-large-stitch v0 | 11.6 | 12 |
| Goal-conditioned Reinforcement Learning | OGBench antmaze-large-explore v0 | 1.9 | 12 |
| Goal-conditioned Reinforcement Learning | OGBench antmaze-large-stitch v0 | 3.1 | 12 |
| Goal-conditioned Reinforcement Learning | OGBench antmaze-giant-stitch v0 | 0 | 12 |
| Goal-conditioned Reinforcement Learning | OGBench humanoidmaze-giant-stitch v0 | 0 | 12 |
| Goal-conditioned Reinforcement Learning | OGBench antmaze-medium-navigate v0 | 96.3 | 11 |
| Goal-conditioned Reinforcement Learning | manipulation cube-single-play | 73 | 11 |
| Goal-conditioned Reinforcement Learning | OGBench antmaze-giant-navigate v0 | 68.5 | 11 |