Latent Action World Models for Control with Unlabeled Trajectories
About
Inspired by how humans combine direct interaction with action-free experience (e.g., videos), we study world models that learn from heterogeneous data. Standard world models typically rely on action-conditioned trajectories, which limits their effectiveness when action labels are scarce. We introduce a family of latent-action world models that jointly use action-conditioned and action-free data by learning a shared latent action representation. This latent space aligns observed control signals with actions inferred from passive observations, enabling a single dynamics model to train on large-scale unlabeled trajectories while requiring only a small set of action-labeled ones. We use the latent-action world model to learn a latent-action policy through offline reinforcement learning (RL), thereby bridging two traditionally separate domains: offline RL, which typically relies on action-conditioned data, and action-free training, which is rarely followed by RL. On the DeepMind Control Suite, our approach achieves strong performance while using roughly an order of magnitude fewer action-labeled samples than purely action-conditioned baselines. These results show that latent actions enable training on both passive and interactive data, allowing world models to learn more efficiently.
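To make the training setup concrete, here is a minimal NumPy sketch of the two data paths into a shared latent action space. All names (`infer_latent`, `encode_action`, `predict_next`) and the linear maps are hypothetical stand-ins for learned networks; the paper's actual architecture and losses may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM, LATENT_DIM = 8, 2, 4

# Hypothetical linear maps standing in for learned networks.
W_idm = rng.normal(size=(2 * OBS_DIM, LATENT_DIM))        # inverse dynamics: (o_t, o_t+1) -> z
W_enc = rng.normal(size=(ACT_DIM, LATENT_DIM))            # action encoder: a_t -> z
W_dyn = rng.normal(size=(OBS_DIM + LATENT_DIM, OBS_DIM))  # latent-action dynamics

def infer_latent(o_t, o_next):
    """Latent action inferred from a passive (action-free) transition."""
    return np.concatenate([o_t, o_next]) @ W_idm

def encode_action(a_t):
    """Latent action encoded from a ground-truth control signal."""
    return a_t @ W_enc

def predict_next(o_t, z):
    """Single shared dynamics model, conditioned on a latent action."""
    return np.concatenate([o_t, z]) @ W_dyn

# Action-free trajectory: the dynamics model trains on inferred latents alone.
o_t, o_next = rng.normal(size=OBS_DIM), rng.normal(size=OBS_DIM)
z_passive = infer_latent(o_t, o_next)
loss_dyn = np.mean((predict_next(o_t, z_passive) - o_next) ** 2)

# Action-labeled trajectory: an alignment term pulls the encoded real action
# toward the latent inferred from observations, so both data sources shape
# one latent space.
a_t = rng.normal(size=ACT_DIM)
z_labeled = encode_action(a_t)
loss_align = np.mean((z_passive - z_labeled) ** 2)

total_loss = loss_dyn + loss_align
```

Because both losses act on the same `LATENT_DIM`-dimensional space, the large unlabeled corpus can train the dynamics while a small labeled set anchors the latents to executable controls; a policy learned over `z` can then be decoded back to real actions.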
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | walker2d medium | Normalized Score | 91.4 | 51 |
| Offline Reinforcement Learning | walker2d medium-replay | Normalized Score | 75.9 | 50 |
| Offline Reinforcement Learning | halfcheetah medium-replay | Normalized Score | 68.4 | 43 |
| Offline Reinforcement Learning | DMControl walker-walk (expert) | Normalized Score | 94.3 | 12 |
| Offline Reinforcement Learning | DMControl cheetah-run (expert) | Normalized Score | 52.4 | 12 |
| Offline Reinforcement Learning | DeepMind Control Suite hopper-stand medium | Mean Normalized Return | 65 | 6 |
| Offline Reinforcement Learning | DeepMind Control Suite hopper-stand plan2explore | Mean Normalized Return | 54.1 | 6 |
| Offline Reinforcement Learning | DeepMind Control Suite walker-walk plan2explore | Mean Normalized Return | 81.9 | 6 |
| Offline Reinforcement Learning | DeepMind Control Suite cheetah-run plan2explore | Mean Normalized Return | 26.5 | 6 |
| Offline Reinforcement Learning | DeepMind Control Suite hopper-stand medium-replay | Mean Normalized Return | 46.9 | 6 |