DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning

About

Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning, because they inherit the limited spatial understanding of the Vision-Language Models (VLMs) they are built on. Existing VLAs rely on extensive action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and is still insufficient for accurate spatial understanding. In this work, we present DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attention, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in both real-world and simulated environments show that DepthVLA outperforms state-of-the-art approaches, achieving 78.5% vs. 65.0% progress in real-world tasks, 94.9% vs. 93.6% in the LIBERO simulator, and 74.8% vs. 58.8% in the Simpler simulator. Our code will be made publicly available.
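
The abstract describes the mixture-of-transformers design only at a high level. The following is a minimal sketch of one possible reading of it, not the authors' implementation: each of the three experts keeps its own query, key, and value projections, while a single attention operation runs over the concatenated tokens of all experts. The helper names (`make_expert`, `shared_attention`), the tiny dimensions, the random weights, and the absence of any attention mask are assumptions made for illustration.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def make_expert(dim, rng):
    """Hypothetical per-expert parameters: separate Q, K, V projections."""
    return {name: random_matrix(dim, dim, rng) for name in ("wq", "wk", "wv")}

def shared_attention(tokens_by_expert, experts):
    """Project tokens with expert-specific weights, then attend over all tokens jointly."""
    q, k, v, sizes = [], [], [], []
    for name, tokens in tokens_by_expert.items():
        params = experts[name]
        q += matmul(tokens, params["wq"])
        k += matmul(tokens, params["wk"])
        v += matmul(tokens, params["wv"])
        sizes.append((name, len(tokens)))
    dim = len(q[0])
    k_t = [list(col) for col in zip(*k)]          # transpose of K
    scores = matmul(q, k_t)                       # every token scores every token
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    mixed = matmul(weights, v)
    # Split the joint output back into per-expert token groups.
    out, start = {}, 0
    for name, n in sizes:
        out[name] = mixed[start:start + n]
        start += n
    return out, weights

rng = random.Random(0)
dim = 4  # assumed toy width
experts = {name: make_expert(dim, rng) for name in ("vlm", "depth", "action")}
tokens = {
    "vlm": random_matrix(3, dim, rng),     # stand-in for image and language tokens
    "depth": random_matrix(2, dim, rng),   # stand-in for depth tokens
    "action": random_matrix(1, dim, rng),  # stand-in for action tokens
}
out, weights = shared_attention(tokens, experts)
```

In this sketch the single action token produces a row of six attention weights, so it can draw on both the VLM tokens and the depth tokens in one pass. The actual model may restrict which tokens attend to which; the abstract does not say.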

Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Zhuoguang Chen, Tao Jiang, Hang Zhao • 2025

Related benchmarks

Task                   Dataset      Result                         Rank
Robot Manipulation     LIBERO       Object Achievement: 94.3       957
Robotic Manipulation   LIBERO       Spatial Success Rate: 96.4     527
Robot Manipulation     LIBERO       Spatial Success Rate: 96.4     116
Robot Manipulation     SimplerEnv   Pick-Can Success Rate: 70.1    5