Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
About
While significant research has focused on developing embodied reasoning capabilities using Vision-Language Models (VLMs) or on integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing Vlaser, a Vision-Language-Action model with synergistic embodied reasoning capability. Vlaser is a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks, including spatial reasoning, embodied grounding, embodied QA, and task planning. Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodied-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Reasoning | BLINK | Accuracy | 84.9 | 76 |
| Spatial Reasoning | MindCube | Accuracy | 34.6 | 69 |
| Spatial Reasoning | EmbSpatial | Overall Accuracy | 75.3 | 63 |
| Spatial Reasoning | OmniSpatial (test) | Dyn. Score | 39.1 | 53 |
| Spatial Reasoning | SITE | Accuracy | 47.5 | 39 |
| Embodied Task Completion | EB-Habitat | -- | -- | 32 |
| Embodied Reasoning and Question Answering | ERQA | Score | 41 | 30 |
| Spatial Reasoning | MindCube tiny (test) | Rot. Accuracy | 31.5 | 30 |
| Spatial Reasoning | MMSI-Bench (test) | PR Score | 29.8 | 29 |
| Embodied AI Task Planning | EB-ALFRED | Average Score | 50 | 28 |