
Factored Latent Action World Models

About

Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up controllable world model learning. Latent actions provide a natural interface for users to iteratively generate and manipulate videos. However, most existing approaches rely on monolithic inverse and forward dynamics models that learn a single latent action to control the entire scene, and therefore struggle in complex environments where multiple entities act simultaneously. This paper introduces the Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes the scene into independent factors, each inferring its own latent action and predicting its own next-step factor value. Compared to monolithic models, this factorized structure enables more accurate modeling of complex multi-entity dynamics and improves video generation quality in action-free video settings. In experiments on both simulated and real-world multi-entity datasets, we find that FLAM outperforms prior work in prediction accuracy and representation quality, and facilitates downstream policy learning, demonstrating the benefits of factorized latent action models.

Zizhao Wang, Chang Shi, Jiaheng Hu, Kevin Rohling, Roberto Martín-Martín, Amy Zhang, Peter Stone • 2026
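
To make the factored idea in the abstract concrete, here is a minimal toy sketch in which each factor infers its own latent action and predicts its own next value. It is not the authors' implementation: the discrete `codebook`, the nearest-code inverse dynamics, the additive forward update, and all function names are assumptions chosen purely for illustration, whereas the paper's models are learned from video.

```python
from typing import List, Tuple

Vector = List[float]


def infer_latent_action(z_t: Vector, z_next: Vector, codebook: List[Vector]) -> int:
    """Toy inverse dynamics (assumed): pick the code closest to the factor's change."""
    delta = [b - a for a, b in zip(z_t, z_next)]

    def sq_dist(code: Vector) -> float:
        return sum((d - c) ** 2 for d, c in zip(delta, code))

    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def predict_next(z_t: Vector, action: int, codebook: List[Vector]) -> Vector:
    """Toy forward dynamics (assumed): apply the chosen code as an additive update."""
    return [a + c for a, c in zip(z_t, codebook[action])]


def factored_step(
    factors_t: List[Vector],
    factors_next: List[Vector],
    codebook: List[Vector],
) -> Tuple[List[int], List[Vector]]:
    """Infer one latent action per factor, then predict each factor's next value."""
    actions = [
        infer_latent_action(z, z_next, codebook)
        for z, z_next in zip(factors_t, factors_next)
    ]
    predictions = [
        predict_next(z, a, codebook) for z, a in zip(factors_t, actions)
    ]
    return actions, predictions


# Hypothetical codebook: stay, right, left, up, down.
codebook = [[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

# Two entities moving in different directions during the same step.
factors_t = [[0.0, 0.0], [5.0, 5.0]]
factors_next = [[1.0, 0.0], [5.0, 6.0]]

actions, predictions = factored_step(factors_t, factors_next, codebook)
print(actions, predictions)
```

In this toy setting the two entities receive different latent actions (one moves right, the other up), which a single scene-level action drawn from the same codebook could not express for both at once.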

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| World Model Prediction | MultiGrid | PSNR | 56.5 | 7 |
| World Model Prediction | Bigfish | PSNR | 30.8 | 7 |
| World Model Prediction | Leaper | PSNR | 38.1 | 7 |
| World Model Prediction | Starpilot | PSNR | 29.3 | 7 |
| World Model Prediction | nuPlan | PSNR | 19.7 | 6 |
| Factor-agent correspondence | MultiGrid | Disentanglement | 0.91 | 4 |
| Behavior Cloning | Bigfish (1k) | Mean Episodic Return | 1.8 | 3 |
| Behavior Cloning | Bigfish 10k | Mean Episodic Return | 8.8 | 3 |
| Behavior Cloning | Starpilot (10k) | Mean Episodic Return | 9.1 | 3 |
| Behavior Cloning | Starpilot 1k | Mean Episodic Return | 1.8 | 3 |
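
For readers unfamiliar with the PSNR metric reported above, the following sketch computes it with the standard definition, 10 · log10(MAX² / MSE), in decibels. The page does not state the pixel range or the evaluation protocol behind these numbers, so `max_value=1.0` and the flat-list input format are assumptions made for illustration only.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences.

    max_value is the largest possible pixel value (assumed 1.0 here).
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and the same length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] pixel scale gives an MSE of 0.01.
example = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
print(round(example, 2))
```

Higher PSNR means the predicted frame is closer to the ground-truth frame; because the scale is logarithmic, a 10 dB gain corresponds to a tenfold reduction in mean squared error.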
