Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution
About
In this report, we introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and fast, smooth real-time execution. The key to our method lies in a carefully designed training recipe and deployment strategy. Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, endowing it with broad and generalizable action-generation capabilities while avoiding catastrophic forgetting of the visual-semantic knowledge of the underlying pre-trained VLM. During post-training, we propose several techniques for training the VLA model for asynchronous execution, mitigating inference latency during real-robot rollouts. During deployment, we carefully align the timesteps of consecutive predicted action chunks to ensure continuous and seamless real-time rollouts. We evaluate Xiaomi-Robotics-0 extensively on simulation benchmarks and on two challenging real-robot tasks that require precise and dexterous bimanual manipulation. Results show that our method achieves state-of-the-art performance across all simulation benchmarks. Moreover, Xiaomi-Robotics-0 rolls out fast and smoothly on real robots using a consumer-grade GPU, achieving high success rates and throughput on both real-robot tasks. To facilitate future research, code and model checkpoints are open-sourced at https://xiaomi-robotics-0.github.io.
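The abstract mentions aligning the timesteps of consecutive predicted action chunks for asynchronous execution. The report does not spell out the mechanism, but the general idea can be sketched as follows: while the current chunk executes, the next chunk is predicted asynchronously; when it arrives after some latency, the actions whose timesteps have already elapsed are dropped so execution continues without a gap. The names below (`predict_chunk`, `align_next_chunk`, horizon `H`, latency `L`) are hypothetical, a minimal single-threaded simulation of the scheme rather than the authors' implementation.

```python
H, L = 8, 3  # assumed chunk horizon and inference latency (in control steps)

def predict_chunk(start_step):
    # Stand-in for the VLA policy: each "action" is labeled with the
    # control timestep it is meant to be executed at.
    return list(range(start_step, start_step + H))

def align_next_chunk(next_chunk, trigger_step, current_step):
    # Drop the actions whose timesteps elapsed while inference was
    # running, so the new chunk picks up exactly where execution is now.
    return next_chunk[current_step - trigger_step:]

executed, step = [], 0
chunk = predict_chunk(0)  # bootstrap: robot waits for the first chunk
while len(executed) < 24:
    trigger_step = step                    # kick off async inference now
    pending = predict_chunk(trigger_step)  # (would run in a worker thread)
    for _ in range(L):                     # keep executing the old chunk
        executed.append(chunk.pop(0))      # while inference is in flight
        step += 1
    # The new chunk "arrives" L steps after it was triggered; align it.
    chunk = align_next_chunk(pending, trigger_step, step)

# executed is a gapless sequence of timesteps 0, 1, 2, ... with no
# stalls or repeats at chunk boundaries.
```

Note the sketch assumes the horizon comfortably exceeds the latency (`H - L >= L` here), so the remaining portion of each chunk always covers the inference window; the real scheduling and threading details are not described in the abstract.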
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 88.5 | 935 |
| Robot Manipulation | LIBERO | Goal Achievement | 98.8 | 494 |
| Visual Question Answering | AI2D | Accuracy | 78.7 | 174 |
| Robot Manipulation | CALVIN ABC->D | Average Successful Length | 4.75 | 36 |
| Robot Manipulation | SimplerEnv Google Robot (Visual Matching) | Pick Coke Can | 98.7 | 28 |
| Robot Manipulation | SimplerEnv Google Robot (Visual Aggregation) | Pick Coke Can | 88.2 | 28 |
| Vision-Language Understanding | MMBench | Accuracy | 84.4 | 14 |
| Scientific Question Answering | SciQA | Accuracy | 79.4 | 13 |
| Robot Manipulation | SimplerEnv WidowX | Success Rate (Put Spoon on Towel) | 95.8 | 12 |
| Embodied Reasoning | ERQA | Accuracy | 40.8 | 6 |