Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

About

While significant research has focused on developing embodied reasoning capabilities using Vision-Language Models (VLMs) or integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing Vlaser, a Vision-Language-Action model with synergistic embodied reasoning capability. Vlaser is a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks, including spatial reasoning, embodied grounding, embodied QA, and task planning. Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodied-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.

Ganlin Yang, Tianyi Zhang, Haoran Hao, Weiyun Wang, Yibin Liu, Dehui Wang, Guanzhou Chen, Zijian Cai, Junting Chen, Weijie Su, Wengang Zhou, Yu Qiao, Jifeng Dai, Jiangmiao Pang, Gen Luo, Wenhai Wang, Yao Mu, Zhi Hou • 2025

Related benchmarks

Task	Dataset	Result
3D Visual Grounding	ScanRefer	--	172
Spatial Reasoning	EmbSpatial	Overall Accuracy: 75.3	131
3D Dense Captioning	Scan2Cap	--	127
Visual Reasoning	BLINK	Accuracy: 84.9	116
Spatial Reasoning	MindCube	Accuracy: 34.6	91
Embodied AI Task Planning	EB-ALFRED	Average Score: 50	87
Spatial Reasoning	MMSI-Bench	Average Accuracy: 27.3	67
Embodied Task Completion	EB-Habitat	--	63
Spatial Reasoning	SPAR-Bench	Overall Score: 41.2	59
Embodied Reasoning and Question Answering	ERQA	Score: 41	53

Showing 10 of 53 rows
