Factored Latent Action World Models
About
Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up controllable world model learning. Latent actions provide a natural interface for users to iteratively generate and manipulate videos. However, most existing approaches rely on monolithic inverse and forward dynamics models that learn a single latent action to control the entire scene, and therefore struggle in complex environments where multiple entities act simultaneously. This paper introduces the Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes the scene into independent factors, each inferring its own latent action and predicting its own next-step factor value. Compared to monolithic models, this factored structure models complex multi-entity dynamics more accurately and improves video generation quality in action-free video settings. In experiments on both simulated and real-world multi-entity datasets, FLAM outperforms prior work in prediction accuracy and representation quality and facilitates downstream policy learning, demonstrating the benefits of factored latent action models.
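To make the per-factor structure concrete, here is a minimal toy sketch of one factored step. It is an illustration only, not the paper's architecture: the 2-D factor values, the fixed `CODEBOOK` of displacements, and the functions `infer_latent_action`, `predict_next`, and `factored_step` are all hypothetical stand-ins for learned components.

```python
# Toy sketch only: hand-written stand-ins for learned networks.
# All names and the codebook below are assumptions, not taken from the paper.

# Hypothetical codebook of latent actions, here plain 2-D displacements.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]


def infer_latent_action(z_t, z_next):
    """Inverse dynamics for ONE factor: pick the code that best explains its change."""
    delta = (z_next[0] - z_t[0], z_next[1] - z_t[1])

    def sq_dist(code):
        return (code[0] - delta[0]) ** 2 + (code[1] - delta[1]) ** 2

    return min(range(len(CODEBOOK)), key=lambda i: sq_dist(CODEBOOK[i]))


def predict_next(z_t, action_index):
    """Forward dynamics for ONE factor: apply its latent action to its current value."""
    dx, dy = CODEBOOK[action_index]
    return (z_t[0] + dx, z_t[1] + dy)


def factored_step(factors_t, factors_next):
    """Run inverse and forward dynamics independently for every factor."""
    actions = [infer_latent_action(z, z_n) for z, z_n in zip(factors_t, factors_next)]
    predictions = [predict_next(z, a) for z, a in zip(factors_t, actions)]
    return actions, predictions


# Two entities moving in different directions within the same frame pair.
factors_t = [(0.0, 0.0), (5.0, 5.0)]
factors_next = [(1.0, 0.0), (5.0, 4.0)]
actions, predictions = factored_step(factors_t, factors_next)
```

In this toy setting each entity gets its own latent action from the same five-entry codebook, whereas a single scene-level latent action would have to encode every joint combination of entity motions.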
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| World Model Prediction | MultiGrid | PSNR | 56.5 | 7 |
| World Model Prediction | Bigfish | PSNR | 30.8 | 7 |
| World Model Prediction | Leaper | PSNR | 38.1 | 7 |
| World Model Prediction | Starpilot | PSNR | 29.3 | 7 |
| World Model Prediction | nuPlan | PSNR | 19.7 | 6 |
| Factor-agent correspondence | MultiGrid | Disentanglement | 0.91 | 4 |
| Behavior Cloning | Bigfish (1k) | Mean Episodic Return | 1.8 | 3 |
| Behavior Cloning | Bigfish (10k) | Mean Episodic Return | 8.8 | 3 |
| Behavior Cloning | Starpilot (10k) | Mean Episodic Return | 9.1 | 3 |
| Behavior Cloning | Starpilot (1k) | Mean Episodic Return | 1.8 | 3 |
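
For readers unfamiliar with the prediction metric, the sketch below computes PSNR (peak signal-to-noise ratio, in dB) using its standard definition, 10 · log10(MAX² / MSE). The function name `psnr`, the flat-list input format, and the default pixel range of [0, 1] are assumptions made for illustration; the page does not state the pixel range or averaging convention behind the reported numbers.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    max_value is the largest possible pixel value (assumed 1.0 here; use 255 for 8-bit images).
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01, i.e. about 20 dB.
example_db = psnr([0.0, 0.0, 0.0], [0.1, 0.1, 0.1])
```

Higher values indicate predictions closer to the ground-truth frames; because the scale is logarithmic, a 10 dB gain corresponds to a tenfold reduction in mean squared error.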