StereoVLA: Enhancing Vision-Language-Action Models with Stereo Vision

About

While Vision-Language-Action (VLA) models excel in generalist manipulation, they often lack fine-grained spatial awareness and show limited viewpoint robustness. This limitation largely stems from the reliance on pretrained RGB encoders, which lack explicit geometric cues and prioritize semantic alignment over geometric representation. We argue that effective visual representations for VLA models must jointly encode both semantic and geometric information. In this paper, we introduce StereoVLA, the first VLA model to incorporate rich geometric cues from large-scale synthetic stereo data. StereoVLA employs a Geometric-and-Semantic (GeoSem) vision encoder that extracts geometric cues from subtle stereo-view disparities for precise spatial perception, while simultaneously capturing semantic features from pixel observations to support language-conditioned manipulation. Additionally, we introduce two synergistic co-training objectives: Interaction-Region Depth Estimation for precise spatial reasoning, and Camera Parameter Estimation to implicitly align perception and action coordinate systems. Compared with baselines that employ various input modalities, StereoVLA achieves a 33.4% absolute gain in success rate in real-world experiments and demonstrates robustness to near-hemispheric camera perspectives. Project page: https://shengliangd.github.io/StereoVLA-Webpage.
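The abstract attributes StereoVLA's spatial precision to geometric cues in stereo-view disparities. As a minimal illustration of why disparity carries depth, the sketch below uses the classical relation for a rectified pinhole stereo pair, Z = f * B / d. This is textbook geometry and not the paper's learned GeoSem encoder. The function names and the camera values (focal length, baseline, disparity) are hypothetical and chosen only for illustration.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Depth in meters for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px


def depth_error(depth_m, focal_length_px, baseline_m, disparity_error_px=1.0):
    """First-order depth error from a disparity error: dZ ~= Z^2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_length_px * baseline_m) * disparity_error_px


# Hypothetical camera: 600 px focal length, 6 cm baseline (not from the paper).
f_px, baseline = 600.0, 0.06
z = disparity_to_depth(72.0, f_px, baseline)   # depth of a point with 72 px disparity
dz = depth_error(z, f_px, baseline)            # depth shift caused by a 1 px disparity error
print(f"depth = {z:.3f} m, error per pixel of disparity = {dz * 1000:.1f} mm")
```

Because the depth error grows with the square of the distance, small disparity differences matter most at close range, which is the regime of tabletop manipulation.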

Shengliang Deng, Mi Yan, Yixin Zheng, Jiayi Su, Wenhao Zhang, Yitao Zeng, Xiaoguang Zhao, Heming Cui, Zhizheng Zhang, He Wang • 2025

Related benchmarks

Task	Dataset	Result
Robot Manipulation	Simulation held-out environments	Pick Success Rate (MSProc): 6.6	14
Robot Picking	Pick MSProc sim	Success Rate: 6.3	11
Robotic Manipulation	Robotic Manipulation Dataset Medium Camera Pose Randomization 1.0	Success Rate: 71.9	5
Robotic Manipulation	Robotic Manipulation Dataset Large Camera Pose Randomization 1.0	Success Rate: 61.3	5
Robotic Manipulation	Robotic Manipulation Dataset Small Camera Pose Randomization 1.0	Success Rate: 79.3	5
Pick	MolmoSpace	Success Rate: 7	5
