Ctrl-World: A Controllable Generative World Model for Robot Manipulation

About

Generalist robot policies can now perform a wide range of manipulation skills, but evaluating and improving their ability to handle unfamiliar objects and instructions remains a significant challenge. Rigorous evaluation requires a large number of real-world rollouts, while systematic improvement demands additional corrective data with expert labels. Both processes are slow, costly, and difficult to scale. World models offer a promising, scalable alternative by letting policies roll out in imagination. However, a key challenge is building a controllable world model that can handle multi-step interactions with generalist robot policies. To be compatible with modern generalist policies, such a world model must support multi-view prediction, fine-grained action control, and consistent long-horizon interactions, a combination that prior work has not achieved. In this paper, we take a step forward by introducing a controllable multi-view world model that can be used to evaluate and improve the instruction-following ability of generalist robot policies. Our model maintains long-horizon consistency with a pose-conditioned memory retrieval mechanism and achieves precise action control through frame-level action conditioning. Trained on the DROID dataset (95k trajectories, 564 scenes), our model generates spatially and temporally consistent trajectories under novel scenarios and new camera placements for over 20 seconds. We show that our method can accurately rank policy performance without real-world robot rollouts. Moreover, by synthesizing successful trajectories in imagination and using them for supervised fine-tuning, our approach can improve policy success by 44.7%.

Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, Chelsea Finn • 2025
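
The abstract names a pose-conditioned memory retrieval mechanism but does not spell out how frames are selected. As a rough illustration only, the sketch below assumes that each past frame is stored with a robot pose vector and that "similar pose" means small Euclidean distance; the function name `retrieve_memory_frames`, the parameter `k`, and the distance metric are all hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence


def retrieve_memory_frames(
    current_pose: Sequence[float],
    past_poses: List[Sequence[float]],
    k: int = 3,
) -> List[int]:
    """Return indices of the k past frames whose poses are closest to current_pose.

    Assumption (not from the paper): similarity is Euclidean distance between
    pose vectors, and ties are broken in favor of more recent frames.
    """
    if k <= 0 or not past_poses:
        return []
    # Rank frame indices by distance to the current pose; -i prefers newer frames on ties.
    ranked = sorted(
        range(len(past_poses)),
        key=lambda i: (math.dist(current_pose, past_poses[i]), -i),
    )
    # Return the selected frames in chronological order so they can be used as context.
    return sorted(ranked[:k])


# Usage: four stored poses (x, y, z); the current pose is near frames 0 and 2.
history = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0)]
selected = retrieve_memory_frames((0.0, 0.0, 0.05), history, k=2)
print(selected)
```

In a design like this, the retrieved frames would be supplied to the world model as extra context, so that a region the camera has already seen is rendered consistently when the robot returns to a similar pose.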

Related benchmarks

Task	Dataset	Result
Robotic Manipulation Video Synthesis	RoboTwin 2.0 (unseen trials)	PSNR 29.05	33
Long-horizon Video Generation	RoboArena	PSNR 16.66	19
Long-horizon Spatiotemporal Consistency	LIBERO, RoboTwin, and Real-Robot platforms (test)	Round-trip LPIPS 0.149	16
Video Generation	DROID Wrist	LPIPS 0.365	16
World Modeling	WorldArena (test)	Image Quality 42.44	15
World Model Evaluation	World Arena Benchmark	EWM Score 59.98	15
Dual-arm manipulation	WorldArena	EWM Score 59.98	14
Trajectory Generation	Task Data Wrist OOD (test)	LPIPS 0.292	8
Video Generation	Robo4D-200k (val)	PSNR 21.03	8
Policy Evaluation	In-distribution	r Score 87.8	8

Showing 10 of 48 rows
