VideoVLA: Video Generators Can Be Generalizable Robot Manipulators
About
Generalization in robot manipulation is essential for deploying robots in open-world environments and advancing toward artificial general intelligence. While recent Vision-Language-Action (VLA) models leverage large pre-trained understanding models for perception and instruction following, their ability to generalize to novel tasks, objects, and settings remains limited. In this work, we present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators. Given a language instruction and an image, VideoVLA predicts an action sequence as well as the future visual outcomes. Built on a multi-modal Diffusion Transformer, VideoVLA jointly models video, language, and action modalities, using pre-trained video generative models for joint visual and action forecasting. Our experiments show that high-quality imagined futures correlate with reliable action predictions and task success, highlighting the importance of visual imagination in manipulation. VideoVLA demonstrates strong generalization, including imitating other embodiments' skills and handling novel objects. This dual-prediction strategy - forecasting both actions and their visual consequences - explores a paradigm shift in robot learning and unlocks generalization capabilities in manipulation systems.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Robot Manipulation | SimplerEnv WidowX Robot tasks (test) | Success Rate (Spoon) | 75 | 79 |
| Robotic Manipulation | SIMPLER Google Robot Visual Matching | PickCan Success Rate | 92.3 | 24 |
| Robotic Manipulation | SIMPLER Visual Matching WidowX robot | Put Spoon on Towel Score | 75 | 24 |
| Robotic Manipulation | SIMPLER Google Robot VA | Pick Up Coke Can Success Rate | 0.898 | 20 |
| Robot Manipulation | SimplerEnv OOD | Put Spoon on Towel Success Rate | 75 | 19 |
| Robotic Manipulation | SIMPLER Overall | Success Rate (All) | 63 | 7 |
| Cross-embodiment Skill Generalization | SIMPLER Google robot Cross-embodiment evaluation | Spoon Put on Towel Success Rate | 56.3 | 5 |
| Cross-embodiment skill transfer | Realman Robot Real-world Skill Transfer | Move Block Success Rate | 81.3 | 5 |
| Place | Realman robot collected dataset Real-world In-Domain 1.0 (test) | Pick Up Success | 87.5 | 5 |
| Robotic Manipulation | SIMPLER novel YCB and GSO objects 1.0 (test) | Success Count: Green Cube | 96 | 5 |