
StereoVLA: Enhancing Vision-Language-Action Models with Stereo Vision

About

Stereo cameras closely mimic human binocular vision, providing rich spatial cues critical for precise robotic manipulation. Despite their advantage, the adoption of stereo vision in vision-language-action models (VLAs) remains underexplored. In this work, we present StereoVLA, a VLA model that leverages rich geometric cues from stereo vision. We propose a novel Geometric-Semantic Feature Extraction module that utilizes vision foundation models to extract and fuse two key features: 1) geometric features from subtle stereo-view differences for spatial perception; 2) semantic-rich features from the monocular view for instruction following. Additionally, we propose an auxiliary Interaction-Region Depth Estimation task to further enhance spatial perception and accelerate model convergence. Extensive experiments show that our approach outperforms baselines by a large margin in diverse tasks under the stereo setting and demonstrates strong robustness to camera pose variations.
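As background on why "subtle stereo-view differences" carry spatial information, the sketch below applies the standard rectified-stereo relation depth = focal length × baseline / disparity. It is a textbook illustration only, not StereoVLA's Geometric-Semantic Feature Extraction module; the function name and the camera values (600 px focal length, 6 cm baseline) are hypothetical.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth in metres of a point seen by a rectified stereo pair.

    disparity_px: horizontal pixel shift of the point between left and right views
    focal_px:     focal length in pixels (assumed equal for both cameras)
    baseline_m:   distance between the two camera centres in metres
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px


# Hypothetical camera parameters, chosen only for illustration.
FOCAL_PX = 600.0
BASELINE_M = 0.06

near = disparity_to_depth(60.0, FOCAL_PX, BASELINE_M)  # larger shift, closer point
far = disparity_to_depth(30.0, FOCAL_PX, BASELINE_M)   # smaller shift, farther point
print(f"near: {near:.2f} m, far: {far:.2f} m")
```

Halving the disparity doubles the estimated depth, so small pixel-level differences between the two views encode large differences in distance.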

Shengliang Deng, Mi Yan, Yixin Zheng, Jiayi Su, Wenhao Zhang, Xiaoguang Zhao, Heming Cui, Zhizheng Zhang, He Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Robot Manipulation | Simulation held-out environments | Pick Success Rate (MSProc): 6.6 | 14 |
| Robot Picking | Pick MSProc sim | Success Rate: 6.3 | 11 |
| Robotic Manipulation | Robotic Manipulation Dataset Medium Camera Pose Randomization 1.0 | Success Rate: 71.9 | 5 |
| Robotic Manipulation | Robotic Manipulation Dataset Large Camera Pose Randomization 1.0 | Success Rate: 61.3 | 5 |
| Robotic Manipulation | Robotic Manipulation Dataset Small Camera Pose Randomization 1.0 | Success Rate: 79.3 | 5 |
| Pick | MolmoSpace | Success Rate: 7 | 5 |
