
VideoVLA: Video Generators Can Be Generalizable Robot Manipulators

About

Generalization in robot manipulation is essential for deploying robots in open-world environments and advancing toward artificial general intelligence. While recent Vision-Language-Action (VLA) models leverage large pre-trained understanding models for perception and instruction following, their ability to generalize to novel tasks, objects, and settings remains limited. In this work, we present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators. Given a language instruction and an image, VideoVLA predicts an action sequence as well as the future visual outcomes. Built on a multi-modal Diffusion Transformer, VideoVLA jointly models video, language, and action modalities, using pre-trained video generative models for joint visual and action forecasting. Our experiments show that high-quality imagined futures correlate with reliable action predictions and task success, highlighting the importance of visual imagination in manipulation. VideoVLA demonstrates strong generalization, including imitating other embodiments' skills and handling novel objects. This dual-prediction strategy - forecasting both actions and their visual consequences - explores a paradigm shift in robot learning and unlocks generalization capabilities in manipulation systems.

Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, Baining Guo • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Robot Manipulation | SimplerEnv WidowX Robot tasks (test) | Success Rate (Spoon) | 75 | 79
Robot Manipulation | SimplerEnv WidowX | Success Rate: Put Spoon on Towel | 75 | 58
Robotic Manipulation | SIMPLER Visual Matching WidowX robot | Put Spoon on Towel Score | 75 | 51
Robotic Manipulation | SIMPLER Google Robot VA | Pick Up Coke Can Success Rate | 0.898 | 35
Robotic Manipulation | SIMPLER Google Robot Visual Matching | PickCan Success Rate | 92.3 | 24
Robot Manipulation | SimplerEnv OOD | Put Spoon on Towel Success Rate | 75 | 19
Robotic Manipulation | SimplerEnv WidowX v1 (test) | Spoon Success Rate | 75 | 10
Robotic Manipulation | SIMPLER Overall | Success Rate (All) | 63 | 7
Cross-embodiment Skill Generalization | SIMPLER Google robot Cross-embodiment evaluation | Spoon Put on Towel Success Rate | 56.3 | 5
Cross-embodiment skill transfer | Realman Robot Real-world Skill Transfer | Move Block Success Rate | 81.3 | 5
Showing 10 of 16 rows
