UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent

About

Recent advancements in Vision-Language-Action (VLA) models have leveraged pre-trained Vision-Language Models (VLMs) to improve generalization. VLMs, typically pre-trained on vision-language understanding tasks, provide rich semantic knowledge and reasoning abilities. However, prior research has shown that VLMs often focus on high-level semantic content and neglect low-level features, which limits their ability to capture detailed spatial information and understand physical dynamics. These aspects, which are crucial for embodied control tasks, remain underexplored in existing pre-training paradigms. In this paper, we investigate the training paradigm for VLAs and introduce UP-VLA, a Unified VLA model trained with both multi-modal Understanding and future Prediction objectives, enhancing both high-level semantic comprehension and low-level spatial understanding. Experimental results show that UP-VLA achieves a 33% improvement on the Calvin ABC-D benchmark compared to the previous state-of-the-art method. Additionally, UP-VLA demonstrates improved success rates in real-world manipulation tasks, particularly those requiring precise spatial information.

Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, Jianyu Chen • 2025

Related benchmarks

Task	Dataset	Result
Robotic Manipulation	Calvin ABCD→D	Avg Length 4.08	130
Long-horizon robot manipulation	Calvin ABCD→D	Task 1 Completion Rate 96.2	127
Robotic Manipulation	Calvin ABC->D	Task-1 Score 92.8	71
Long-horizon task completion	Calvin ABC->D	Success Rate (1) 92.8	67
Sequential Robotic Manipulation	CALVIN	Success Rate (1 task) 96.2	63
Robot Manipulation	Calvin ABC->D	Average Successful Length 4.08	62
Robotic Manipulation	RLBench (test)	Average Success Rate 42	49
Long-horizon robotic manipulation	Calvin ABC->D	Average Trajectory Length 2.74	40
Robot Manipulation	RoboTwin Clean 2.0	--	39
Robot Manipulation	RoboTwin Randomized 2.0	Overall Success Rate 15.16	33

Showing 10 of 23 rows
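
The "Avg Length" rows above report the average number of consecutive tasks completed in a CALVIN instruction chain. As a minimal sketch of how that number relates to the per-position success rates, assuming the usual convention that it equals the sum of the rates for completing at least 1, 2, ..., k tasks in a row: the function name and the rates below are hypothetical illustrations, not values from this page, and the rates are given as fractions rather than the percentages shown in the table.

```python
def average_length(chain_success_rates):
    """Average number of consecutive tasks completed.

    chain_success_rates[i] is the fraction of rollouts that complete
    at least i + 1 tasks in a row, so the list must be non-increasing.
    """
    for earlier, later in zip(chain_success_rates, chain_success_rates[1:]):
        if later > earlier:
            raise ValueError("success rates must be non-increasing along the chain")
    # The expected chain length is the sum of the "at least i tasks" rates.
    return sum(chain_success_rates)


# Hypothetical rates for chains of 1..5 tasks (illustration only).
hypothetical_rates = [0.9, 0.8, 0.7, 0.6, 0.5]
print(round(average_length(hypothetical_rates), 2))
```

Under this convention the maximum possible value equals the chain length, which is why the CALVIN averages in the table sit below 5.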
