Unified Vision-Language-Action Model

About

Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation. However, previous approaches predominantly rely on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. This formulation enables flexible multimodal task learning, particularly from large-scale video data. By incorporating world modeling during post-training, UniVLA captures causal dynamics from videos, facilitating effective transfer to downstream policy learning, especially for long-horizon tasks. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and SimplerEnv-Bridge, significantly surpassing previous methods. For example, UniVLA achieves a 95.5% average success rate on the LIBERO benchmark, surpassing pi0-FAST's 85.5%. We further demonstrate its broad applicability on real-world ALOHA manipulation and autonomous driving.
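
To make the "discrete token sequences" formulation concrete, here is a minimal sketch of how language, vision, and action tokens could be interleaved into one sequence for next-token prediction. It is an illustration under stated assumptions, not the authors' implementation: the vocabulary sizes, the offset scheme, the boundary tokens (`BOI`, `EOI`, `BOA`, `EOA`), and the helper names `build_sequence` and `shift_for_next_token` are all hypothetical, and the tokenizers that would produce the input IDs are left out.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical vocabulary sizes, chosen only for illustration.
TEXT_VOCAB = 1000
VISION_VOCAB = 512
ACTION_VOCAB = 256

# Offsets place each modality in a disjoint range of one shared vocabulary.
TEXT_OFFSET = 0
VISION_OFFSET = TEXT_OFFSET + TEXT_VOCAB
ACTION_OFFSET = VISION_OFFSET + VISION_VOCAB

# Hypothetical boundary markers for image and action spans.
BOI = ACTION_OFFSET + ACTION_VOCAB
EOI = BOI + 1
BOA = BOI + 2
EOA = BOI + 3
VOCAB_SIZE = BOI + 4


@dataclass
class Step:
    """One timestep: tokenized observation and tokenized action."""
    vision: List[int]  # IDs in [0, VISION_VOCAB)
    action: List[int]  # IDs in [0, ACTION_VOCAB)


def build_sequence(instruction: List[int], steps: List[Step]) -> List[int]:
    """Interleave instruction, observation, and action tokens in time order."""
    seq = [TEXT_OFFSET + t for t in instruction]
    for step in steps:
        seq.append(BOI)
        seq.extend(VISION_OFFSET + v for v in step.vision)
        seq.append(EOI)
        seq.append(BOA)
        seq.extend(ACTION_OFFSET + a for a in step.action)
        seq.append(EOA)
    return seq


def shift_for_next_token(seq: List[int]) -> Tuple[List[int], List[int]]:
    """Return (inputs, targets) for autoregressive training: predict token i+1 from tokens up to i."""
    return seq[:-1], seq[1:]


if __name__ == "__main__":
    seq = build_sequence([1, 2], [Step(vision=[0, 5], action=[3])])
    inputs, targets = shift_for_next_token(seq)
    print(seq)
    print(len(inputs), len(targets))
```

Because every modality shares one ID space, a single autoregressive model can be trained on the same objective whether the next token is a word, an image patch code, or an action code. Predicting the next observation's vision tokens is one way to read the world-modeling objective the abstract mentions.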

Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, Zhaoxiang Zhang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Robot Manipulation | LIBERO | Object Achievement | 98.8 | 957 |
| Robotic Manipulation | LIBERO | Spatial Success Rate | 97 | 527 |
| Robotic Manipulation | LIBERO-Plus | Language Understanding Score | 71.8 | 249 |
| Robot Manipulation | LIBERO (test) | Average Success Rate | 95.5 | 220 |
| Autonomous Driving | NAVSIM v1 (test) | NC | 96.9 | 147 |
| Robotic Manipulation | Calvin ABCD→D | Avg Length | 4.26 | 130 |
| Long-horizon robot manipulation | Calvin ABCD→D | Task 1 Completion Rate | 94.8 | 127 |
| Autonomous Driving Planning | NAVSIM v1 | NC | 96.9 | 126 |
| Robot Manipulation | SimplerEnv WidowX | Success Rate: Put Spoon on Towel | 83.3 | 98 |
| Robot Policy Learning | LIBERO | S (Spatial) Rate | 96.5 | 73 |
Showing 10 of 21 rows
