Large Video Planner Enables Generalizable Robot Control

About

General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs' large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale, an open video model for generative robotics planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions. We evaluate task-level generalization through third-party selected tasks in the wild and real-robot experiments, demonstrating successful physical execution. Together, these results show robust instruction following, strong generalization, and real-world feasibility. We release both the model and dataset to support open, reproducible video-based robot learning. Our website is available at https://www.boyuan.space/large-video-planner/.

Boyuan Chen, Tianyuan Zhang, Haoran Geng, Caiyi Zhang, Peihao Li, Kiwhan Song, William T. Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, Yilun Du • 2025

Related benchmarks

Task	Dataset	Result
Robotics Video Generation	DreamGen Bench	GR1 Object Score (Qwen-IF): 82.9	15
Embodied World Modeling	EWMBench	Scene Composition Score: 87.95	11
Video Generation	PBench	Background Consistency (I2V-Bg): 0.979	11
Physical Reasoning and Instruction Following	WorldModelBench (test)	Instruction Adherence Score: 2.01	11
Language Control	3D Navigation Evaluation Suite	Visual Consistency: 100	5
Object Navigation	3D Navigation Evaluation Suite	Visual Consistency: 100	5
4D Robot Scene Generation	DROID and BridgeData V2 (300 unseen samples)	PSNR: 19.613	5
Precise Navigation	3D Navigation Evaluation Suite	Visual Consistency: 100	5
Scene Reasoning	3D Navigation Evaluation Suite	Visual Consistency: 93	5
Video Prediction	LIBERO	PSNR: 19.582	5

Showing 10 of 13 rows
