UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent

About

Recent advancements in Vision-Language-Action (VLA) models have leveraged pre-trained Vision-Language Models (VLMs) to improve generalization. VLMs, typically pre-trained on vision-language understanding tasks, provide rich semantic knowledge and reasoning abilities. However, prior research has shown that VLMs often focus on high-level semantic content and neglect low-level features, which limits their ability to capture detailed spatial information and understand physical dynamics. These aspects are crucial for embodied control tasks, yet they remain underexplored in existing pre-training paradigms. In this paper, we investigate the training paradigm for VLAs and introduce UP-VLA, a Unified VLA model trained with both multi-modal Understanding and future Prediction objectives, enhancing both high-level semantic comprehension and low-level spatial understanding. Experimental results show that UP-VLA achieves a 33% improvement on the Calvin ABC-D benchmark compared to the previous state-of-the-art method. Additionally, UP-VLA demonstrates improved success rates in real-world manipulation tasks, particularly those requiring precise spatial information.

Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, Jianyu Chen • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Long-horizon robot manipulation | Calvin ABCD→D | Task 1 Completion Rate: 96.2 | 127 |
| Robotic Manipulation | Calvin ABCD→D | Avg Length: 4.08 | 89 |
| Long-horizon task completion | Calvin ABC->D | Success Rate (1): 92.8 | 67 |
| Robotic Manipulation | RLBench (test) | Average Success Rate: 42 | 49 |
| Robot Manipulation | Calvin ABC->D | Average Successful Length: 4.078 | 48 |
| Sequential Robotic Manipulation | CALVIN | Success Rate (1 task): 92.8 | 45 |
| Long-horizon robotic manipulation | Calvin ABC->D | Task 1 Success Rate: 92.8 | 34 |
| Instruction-following robotic manipulation | CALVIN ABC→D (unseen environment D) | Success Rate (Length 1): 92.8 | 29 |
| Robot Manipulation | RoboTwin Clean 2.0 | -- | 24 |
| Robot Manipulation | RoboTwin Randomized 2.0 | -- | 20 |
