QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

About

Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often cannot understand and reason over the 3D structures needed for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert predicts the quantized latent tokens of depth maps, obtained from a VQ-VAE encoder, so that the model learns depth-aware representations that capture critical geometric cues. Experimental results on simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.
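The auxiliary objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not state the codebook size, the loss form, or the loss weighting, so the nearest-neighbour quantizer, the cross-entropy loss, and the names `quantize`, `cross_entropy`, `total_loss`, and `depth_weight` are all assumptions made for illustration.

```python
import numpy as np


def quantize(latents, codebook):
    """Return, for each latent vector, the index of its nearest codebook entry.

    latents:  (N, D) continuous depth latents (assumed to come from a VQ-VAE encoder)
    codebook: (K, D) codebook vectors (K is a hypothetical size)
    """
    # Squared Euclidean distance between every latent and every code: (N, K)
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def cross_entropy(logits, targets):
    """Mean cross-entropy between (N, K) logits and (N,) integer targets."""
    # Subtract the row-wise max for numerical stability before the log-sum-exp
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


def total_loss(action_loss, depth_logits, depth_tokens, depth_weight=0.1):
    """Action loss plus a weighted auxiliary depth-token loss.

    depth_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    return action_loss + depth_weight * cross_entropy(depth_logits, depth_tokens)


# Toy example: three 2-D latents and a three-entry codebook
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
latents = np.array([[0.1, -0.1], [0.9, 1.2], [4.0, 6.0]])
tokens = quantize(latents, codebook)      # target depth tokens
logits = np.zeros((3, 3))                 # stand-in for depth-expert outputs
loss = total_loss(1.0, logits, tokens, depth_weight=0.5)
```

In this sketch the quantized indices act as classification targets for the depth expert, so the auxiliary task becomes token prediction rather than per-pixel depth regression.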

Yixuan Li, Yuhui Chen, Mingcai Zhou, Haoran Li, Zhengtao Zhang, Dongbin Zhao • 2025

Related benchmarks

Task	Dataset	Metric	Result
Robotic Manipulation	LIBERO	Spatial Success Rate	97.6	570
Robot Manipulation	LIBERO (test)	Average Success Rate	94.9	237
Robot Manipulation	LIBERO	Spatial Success	97.6	90
Robotic Manipulation	SimplerEnv WidowX (test)	Put Spoon Success Rate	82	20
Robot Manipulation	Simpler Google Robot tasks	Pick Coke Can Success Rate	98.3	7
Robot Manipulation	Simpler benchmark WidowX250 Robot tasks	Success Rate: Put Carrot on Plate	57.5	6
Pick-&-Place	Piper Arm Real-World Task 1 1.0 (test)	Success Rate	70	3
Pick-&-Place	Piper Arm Real-World Task 3 1.0 (test)	Success Rate	50	3
Stacking	Piper Arm Real-World Task 4 1.0 (test)	Success Rate	10	3
Pick-&-Place	Piper Arm Real-World Task 2 1.0 (test)	Success Rate	40	3

