XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations

About

Recent progress in large-scale robotic datasets and vision-language models (VLMs) has advanced research on vision-language-action (VLA) models. However, existing VLA models still face two fundamental challenges: (i) producing precise low-level actions from high-dimensional observations, and (ii) bridging domain gaps across heterogeneous data sources, including diverse robot embodiments and human demonstrations. Existing methods often encode latent variables from either visual dynamics or robotic actions to guide policy learning, but they fail to fully exploit the complementary multi-modal knowledge present in large-scale, heterogeneous datasets. In this work, we present X Robotic Model 1 (XR-1), a novel framework for versatile and scalable VLA learning across diverse robots, tasks, and environments. XR-1 introduces Unified Vision-Motion Codes (UVMC), a discrete latent representation learned via a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion. UVMC addresses the two challenges above by (i) serving as an intermediate representation between observations and actions, and (ii) aligning multimodal dynamic information from heterogeneous data sources to capture complementary knowledge. To exploit UVMC effectively, we propose a three-stage training paradigm: (i) self-supervised UVMC learning, (ii) UVMC-guided pretraining on large-scale cross-embodiment robotic datasets, and (iii) task-specific post-training. We validate XR-1 through extensive real-world experiments with more than 14,000 rollouts on six different robot embodiments, spanning over 120 diverse manipulation tasks. XR-1 consistently outperforms state-of-the-art baselines such as $\pi_{0.5}$, $\pi_0$, RDT, UniVLA, and GR00T-N1.5 while demonstrating strong generalization to novel objects, background variations, distractors, and illumination changes. The project page is at https://xr-1-vla.github.io/.
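The abstract describes UVMC as a discrete latent representation learned with a dual-branch VQ-VAE. The core operation in any VQ-VAE is vector quantization: each continuous feature is replaced by the index of its nearest codebook entry. The sketch below illustrates only that lookup step and is not the authors' implementation. The toy codebook, the feature values, the function names, and the use of a single codebook shared by both branches are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(feature: Vector, codebook: List[Vector]) -> int:
    """Return the index of the codebook entry closest to `feature` (squared L2)."""
    best_index, best_dist = 0, math.inf
    for index, entry in enumerate(codebook):
        dist = sum((f - e) ** 2 for f, e in zip(feature, entry))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(features: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each continuous feature to a discrete code index."""
    return [nearest_code(feature, codebook) for feature in features]


# Hypothetical toy codebook with four 2-D entries (assumed shared by both branches).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical encoder outputs for the same two time steps:
# one set from a visual-dynamics branch, one from a motion branch.
vision_features = [(0.9, 0.1), (0.1, 0.8)]
motion_features = [(1.1, -0.1), (0.2, 0.7)]

vision_codes = quantize(vision_features, codebook)
motion_codes = quantize(motion_features, codebook)
print(vision_codes, motion_codes)
```

In this toy setup both branches map to the same code indices, which is one simple way to picture vision and motion information being aligned in a common discrete space. A trained VQ-VAE would additionally learn the encoders, decoders, and codebook entries, none of which is shown here.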

Shichao Fan, Kun Wu, Zhengping Che, Xinhua Wang, Di Wu, Fei Liao, Ning Liu, Yixue Zhang, Zhen Zhao, Zhiyuan Xu, Meng Li, Qingjie Liu, Shanghang Zhang, Min Wan, Jian Tang • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Robotic Manipulation	RoboCasa Panda GR00T (test)	Success Rate	45.2	16
Robotic Manipulation	RoboCasa Robotiq-85 GR00T (test)	Success Rate	43.8	8
Robotic Manipulation	RoboCasa Overall GR00T dataset (test)	Average Success Rate	41.3	8
Long-horizon robotic manipulation	CALVIN D->D	Success Rate (1 Task)	96.4	7
Robotic Manipulation	AgileX Cobot Magic V2.0	Success Rate (Open DrawerButton)	90	6
Robotic Manipulation	Tien Kung 2.0	TK2 Press Button Success Rate	90	6
Robotic Manipulation	Single-Arm UR-5e	SUR-Find Tape	95	6
Robotic Manipulation	Dual-Arm UR-5e	Find TapeBasket Success Rate	85	6
Robotic Manipulation	Tien Kung 1.0	TK1 Close Drawer Success Rate	80	6
Robotic Manipulation	Dual-Arm Franka	DFR Move CupMilk Success Rate	80	6

Showing 10 of 14 rows
