Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution
About
In this report, we introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and fast, smooth real-time execution. The key to our method lies in a carefully designed training recipe and deployment strategy. Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, endowing it with broad and generalizable action-generation capabilities while avoiding catastrophic forgetting of the visual-semantic knowledge of the underlying pre-trained VLM. During post-training, we propose several techniques for training the VLA model for asynchronous execution, mitigating inference latency during real-robot rollouts. During deployment, we carefully align the timesteps of consecutive predicted action chunks to ensure continuous and seamless real-time rollouts. We evaluate Xiaomi-Robotics-0 extensively on simulation benchmarks and on two challenging real-robot tasks that require precise and dexterous bimanual manipulation. Results show that our method achieves state-of-the-art performance across all simulation benchmarks. Moreover, Xiaomi-Robotics-0 rolls out fast and smoothly on real robots using a consumer-grade GPU, achieving high success rates and throughput on both real-robot tasks. To facilitate future research, code and model checkpoints are open-sourced at https://xiaomi-robotics-0.github.io.
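The abstract mentions aligning the timesteps of consecutive predicted action chunks for asynchronous execution. The report does not spell out the mechanism, but the general idea can be sketched as follows: while the current chunk executes, the next chunk is predicted asynchronously; when it arrives after some latency, the actions whose timesteps have already elapsed are dropped so execution continues without a gap. The names below (`predict_chunk`, `align_next_chunk`, horizon `H`, latency `L`) are hypothetical, a minimal single-threaded simulation of the scheme rather than the authors' implementation.

```python
H, L = 8, 3  # assumed chunk horizon and inference latency (in control steps)

def predict_chunk(start_step):
    # Stand-in for the VLA policy: each "action" is labeled with the
    # control timestep it is meant to be executed at.
    return list(range(start_step, start_step + H))

def align_next_chunk(next_chunk, trigger_step, current_step):
    # Drop the actions whose timesteps elapsed while inference was
    # running, so the new chunk picks up exactly where execution is now.
    return next_chunk[current_step - trigger_step:]

executed, step = [], 0
chunk = predict_chunk(0)  # bootstrap: robot waits for the first chunk
while len(executed) < 24:
    trigger_step = step                    # kick off async inference now
    pending = predict_chunk(trigger_step)  # (would run in a worker thread)
    for _ in range(L):                     # keep executing the old chunk
        executed.append(chunk.pop(0))      # while inference is in flight
        step += 1
    # The new chunk "arrives" L steps after it was triggered; align it.
    chunk = align_next_chunk(pending, trigger_step, step)

# executed is a gapless sequence of timesteps 0, 1, 2, ... with no
# stalls or repeats at chunk boundaries.
```

Note the sketch assumes the horizon comfortably exceeds the latency (`H - L >= L` here), so the remaining portion of each chunk always covers the inference window; the real scheduling and threading details are not described in the abstract.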
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 88.5 | 935 |
| Robot Manipulation | LIBERO | Goal Achievement | 98.8 | 494 |
| Visual Question Answering | AI2D | Accuracy | 78.7 | 174 |
| Robot Manipulation | CALVIN ABC->D | Average Successful Length | 4.75 | 36 |
| Robot Manipulation | SimplerEnv Google Robot (Visual Matching) | Pick Coke Can | 98.7 | 28 |
| Robot Manipulation | SimplerEnv Google Robot (Visual Aggregation) | Pick Coke Can | 88.2 | 28 |
| Vision-Language Understanding | MMBench | Accuracy | 84.4 | 14 |
| Scientific Question Answering | SciQA | Accuracy | 79.4 | 13 |
| Robot Manipulation | SimplerEnv WidowX | Success Rate (Put Spoon on Towel) | 95.8 | 12 |
| Embodied Reasoning | ERQA | Accuracy | 40.8 | 6 |