DM0: An Embodied-Native Vision-Language-Action Model towards Physical AI
About
Moving beyond the traditional paradigm of adapting internet-pretrained models to physical tasks, we present DM0, an Embodied-Native Vision-Language-Action (VLA) framework designed for Physical AI. Unlike approaches that treat physical grounding as a fine-tuning afterthought, DM0 unifies embodied manipulation and navigation by learning from heterogeneous data sources from the outset. Our methodology follows a three-stage pipeline: Pretraining, Mid-Training, and Post-Training. First, we conduct large-scale unified pretraining of the Vision-Language Model (VLM) on diverse corpora (web text, autonomous driving scenarios, and embodied interaction logs) to jointly acquire semantic knowledge and physical priors. Subsequently, we build a flow-matching action expert atop the VLM. To reconcile high-level reasoning with low-level control, DM0 employs a hybrid training strategy: for embodied data, gradients from the action expert are not backpropagated to the VLM, which preserves its generalized representations, while the VLM remains trainable on non-embodied data. Furthermore, we introduce an Embodied Spatial Scaffolding strategy to construct spatial Chain-of-Thought (CoT) reasoning, which constrains the action solution space. Experiments on the RoboChallenge benchmark demonstrate that DM0 achieves state-of-the-art performance in both Specialist and Generalist settings on Table30.
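The hybrid training strategy described in the abstract can be illustrated with a toy sketch. This is not the DM0 implementation: the scalar "VLM" weight, the scalar "action expert" weight, the squared-error losses, and the names `hybrid_step`, `w_vlm`, and `v_expert` are all hypothetical and chosen only to show how gradients are routed. The abstract also does not say whether the VLM receives any other loss on embodied data, so this sketch simply applies none.

```python
# Toy illustration of gradient routing (hypothetical names, not DM0 code).
# "VLM": feature h = w_vlm * x.  "Action expert": prediction = v_expert * h.

def hybrid_step(w_vlm, v_expert, batch, lr=0.1):
    """Run one SGD step and return the updated (w_vlm, v_expert).

    batch is a dict with keys "source" ("embodied" or "non_embodied"),
    "x" (input) and "target" (action target or VLM target).
    """
    x, target = batch["x"], batch["target"]
    if batch["source"] == "embodied":
        # Action loss L = (v_expert * h - target)^2, with h treated as a
        # constant (stop-gradient), so no gradient reaches the VLM weight.
        h = w_vlm * x
        err = v_expert * h - target
        grad_v = 2.0 * err * h
        grad_w = 0.0
    else:
        # Non-embodied data: the VLM trains on its own loss
        # L = (w_vlm * x - target)^2; the action expert is not involved.
        err = w_vlm * x - target
        grad_w = 2.0 * err * x
        grad_v = 0.0
    return w_vlm - lr * grad_w, v_expert - lr * grad_v


if __name__ == "__main__":
    embodied = {"source": "embodied", "x": 2.0, "target": 3.0}
    web = {"source": "non_embodied", "x": 2.0, "target": 3.0}
    print(hybrid_step(1.0, 0.5, embodied))  # VLM weight unchanged, expert updated
    print(hybrid_step(1.0, 0.5, web))       # VLM weight updated, expert unchanged
```

In an autodiff framework the same routing is usually expressed by wrapping the VLM features in a stop-gradient operation (for example `jax.lax.stop_gradient` or `torch.Tensor.detach`) before they enter the action expert; whether DM0 does exactly this is an assumption.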
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Robotic Manipulation | Table30 RoboChallenge (test) | Overall Success Rate | 62 | 10 |
| arrange flowers | Table30 RoboChallenge ARX5 | Success Rate | 20 | 3 |
| arrange fruits in basket | Table30 RoboChallenge - UR5 | Success Rate | 70 | 3 |
| arrange paper cups | RoboChallenge Table30 ARX5 | Success Rate | 0.1 | 3 |
| fold dishcloth | Table30 RoboChallenge ARX5 | Success Rate | 10 | 3 |
| hang toothbrush cup | Table30 RoboChallenge - UR5 | Success Rate | 90 | 3 |
| make vegetarian sandwich | Table30 RoboChallenge ALOHA | Success Rate | 0 | 3 |
| move objects into box | RoboChallenge Table30 Franka | Success Rate | 50 | 3 |
| Open the drawer | Table30 RoboChallenge ARX5 | Success Rate | 0.9 | 3 |
| Overall Robotic Manipulation | Table30 RoboChallenge | Success Rate | 37.3 | 3 |