Igniting VLMs toward the Embodied Space

About

While foundation models show remarkable progress in language and vision, existing vision-language models (VLMs) still have limited spatial and embodiment understanding. Transferring VLMs to embodied domains reveals fundamental mismatches between modalities, pretraining distributions, and training objectives, leaving action comprehension and generation as a central bottleneck on the path to AGI. We introduce WALL-OSS, an end-to-end embodied foundation model that leverages large-scale multimodal pretraining to achieve (1) embodiment-aware vision-language understanding, (2) strong language-action association, and (3) robust manipulation capability. Our approach employs a tightly coupled architecture and a multi-strategy training curriculum that enable Unified Cross-Level CoT, which seamlessly unifies instruction reasoning, subgoal decomposition, and fine-grained action synthesis within a single differentiable framework. Our results show that WALL-OSS attains high success on complex long-horizon manipulations, demonstrates strong instruction following as well as complex understanding and reasoning, and outperforms strong baselines, thereby providing a reliable and scalable path from VLMs to embodied foundation models.

Andy Zhai, Brae Liu, Bruno Fang, Chalse Cai, Ellie Ma, Ethan Yin, Hao Wang, Hugo Zhou, James Wang, Lights Shi, Lucy Liang, Make Wang, Qian Wang, Roy Gan, Ryan Yu, Shalfun Li, Starrick Liu, Sylas Chen, Vincent Chen, Zach Xu • 2025

Related benchmarks

Task	Dataset	Result
Diagram Question Answering	AI2D	AI2D Accuracy 58.6	509
Multi-discipline Multimodal Understanding	MMMU	Accuracy 37.11	422
Multimodal Model Evaluation	MME	MME Score 1.15e+3	80
Document Visual Question Answering	DocVQA	ANLS 63.62	49
Robotic Manipulation	WISER (train)	Grasp Success Rate 100	18
Robotic Strawberry Harvesting	Real-world strawberry harvesting environment	Score 78.8	18
Robotic Manipulation	WISER (test)	Grasp Success 68	18
Insertion	Real-world	Success Rate 25	16
Object Hallucination Evaluation	HallBench	Accuracy 36.57	12
Robotic Manipulation	RoboChallenge Table30	Arrange Fruits Success Rate 80	9

Showing 10 of 21 rows
