Igniting VLMs toward the Embodied Space
About
While foundation models show remarkable progress in language and vision, existing vision-language models (VLMs) still have limited spatial and embodiment understanding. Transferring VLMs to embodied domains reveals fundamental mismatches between modalities, pretraining distributions, and training objectives, leaving action comprehension and generation as a central bottleneck on the path to AGI. We introduce WALL-OSS, an end-to-end embodied foundation model that leverages large-scale multimodal pretraining to achieve (1) embodiment-aware vision-language understanding, (2) strong language-action association, and (3) robust manipulation capability. Our approach employs a tightly coupled architecture and a multi-strategy training curriculum that enables Unified Cross-Level CoT, which seamlessly unifies instruction reasoning, subgoal decomposition, and fine-grained action synthesis within a single differentiable framework. Our results show that WALL-OSS attains high success rates on complex long-horizon manipulation tasks, demonstrates strong instruction following, understanding, and reasoning, and outperforms strong baselines, thereby providing a reliable and scalable path from VLMs to embodied foundation models.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Robotic Manipulation | WISER (train) | Grasp Success Rate | 100 | 18 |
| Robotic Strawberry Harvesting | Real-world strawberry harvesting environment | Score | 78.8 | 18 |
| Robotic Manipulation | WISER (test) | Grasp Success | 68 | 18 |
| Insertion | Real-world | Success Rate | 25 | 11 |
| Robotic Manipulation | RoboChallenge Table30 | Arrange Fruits Success Rate | 80 | 9 |
| Embodied Aerial Tracking | CARLA Seen Maps (Town02, Town05, Town06, Town07, Town10HD) | Close ATF (Veh) | 25.67 | 5 |
| Embodied Aerial Tracking | CARLA Unseen Maps - Pedestrians | ATF (Close) | 39.3 | 5 |
| Bimanual Table-cleaning | ALOHA table-cleaning | Tape SR | 27.4 | 5 |
| Embodied Aerial Tracking | CARLA Unseen Maps - Vehicles | ATF (Close Range) | 29.63 | 5 |
| Embodied Aerial Tracking | CARLA | Average Latency (s) | 0.4524 | 4 |