Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

About

Imitation learning has emerged as a promising approach towards building generalist robots. However, scaling imitation learning for large robot foundation models remains challenging due to its reliance on high-quality expert demonstrations. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data provides a rich source of information about real-world dynamics and agent-environment interactions. Leveraging this data directly for imitation learning, however, has proven difficult due to the lack of action annotations. In this work, we present Unified World Models (UWM), a framework that leverages both video and action data for policy learning. Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. By controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics model, an inverse dynamics model, and a video generator. Through simulated and real-world experiments, we show that: (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action predictions, resulting in more generalizable and robust policies than imitation learning; and (2) UWM naturally facilitates learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification between the often disparate paradigms of imitation learning and world modeling. Videos and code are available at https://weirdlabuw.github.io/uwm/.
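The role of the two independent diffusion timesteps can be illustrated with a small sketch. It is not taken from the UWM codebase: the names `T_MAX`, `MODES`, and `timesteps`, the step count of 1000, and the exact mapping from timestep settings to modes are assumptions based on one natural reading of the abstract, in which timestep 0 means the modality is given clean (conditioned on), the maximum timestep means it is pure noise (marginalized out), and an intermediate timestep means it is currently being denoised.

```python
# Hypothetical sketch: how two independent diffusion timesteps could select
# which function a single joint action/video denoiser represents.
# All names and the value of T_MAX are illustrative assumptions.

T_MAX = 1000  # assumed number of diffusion steps

# For each mode: (action setting, video setting).
#   "clean"   -> timestep 0      (modality is observed and conditioned on)
#   "noise"   -> timestep T_MAX  (modality is pure noise, i.e. ignored)
#   "denoise" -> current sampling step (modality is being generated)
MODES = {
    "policy":           ("denoise", "noise"),    # actions, future video marginalized
    "video_generator":  ("noise",   "denoise"),  # future video, actions marginalized
    "forward_dynamics": ("clean",   "denoise"),  # future video given actions
    "inverse_dynamics": ("denoise", "clean"),    # actions given future video
}


def timesteps(mode: str, step: int) -> tuple[int, int]:
    """Return (action_timestep, video_timestep) for one sampling step."""
    if not 0 <= step <= T_MAX:
        raise ValueError(f"step must be in [0, {T_MAX}], got {step}")

    def resolve(setting: str) -> int:
        if setting == "clean":
            return 0
        if setting == "noise":
            return T_MAX
        return step  # "denoise"

    action_setting, video_setting = MODES[mode]
    return resolve(action_setting), resolve(video_setting)


if __name__ == "__main__":
    for name in MODES:
        print(name, timesteps(name, step=500))
```

Under the same assumptions, an action-free video clip would be handled like the `video_generator` row: the action timestep is fixed at `T_MAX`, so the missing action labels never need to be supplied.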

Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, Abhishek Gupta • 2025

Related benchmarks

Task	Dataset	Result
Robot Manipulation	LIBERO	Object Achievement 97.6	957
Robotic Manipulation	LIBERO	Spatial Success Rate 82.3	527
Robotic Manipulation	Calvin ABC->D	Task-1 Score 81.3	71
Robotic Manipulation	RoboCasa	Average Success Rate 60.8	39
Robot Manipulation	RoboTwin Clean 2.0	Average Success Rate 81.7	39
Robot Manipulation	RoboTwin Randomized 2.0	--	33
Robotic Tabletop Manipulation	RoboCasa GR1 Tabletop Tasks	Average Success Rate 20	28
Tabletop manipulation	LIBERO	Success Rate 79	17
Robot Manipulation	RoboCasa-GR1 24 tasks	Average Success Rate 60.8	16
Kitchen manipulation	RoboCasa 24 kitchen manipulation tasks	Average Success Rate 60.8	12

Showing 10 of 22 rows
