RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization

About

Vision-Language-Action (VLA) models hold promise for generalist robotics but currently struggle with data scarcity, architectural inefficiencies, and the inability to generalize across different hardware platforms. We introduce RDT2, a robotic foundation model built upon a 7B parameter VLM designed to enable zero-shot deployment on novel embodiments for open-vocabulary tasks. To achieve this, we collected one of the largest open-source robotic datasets--over 10,000 hours of demonstrations in diverse families--using an enhanced, embodiment-agnostic Universal Manipulation Interface (UMI). Our approach employs a novel three-stage training recipe that aligns discrete linguistic knowledge with continuous control via Residual Vector Quantization (RVQ), flow-matching, and distillation for real-time inference. Consequently, RDT2 becomes one of the first models that simultaneously zero-shot generalizes to unseen objects, scenes, instructions, and even robotic platforms. Besides, it outperforms state-of-the-art baselines in dexterous, long-horizon, and dynamic downstream tasks like playing table tennis. See https://rdt-robotics.github.io/rdt2/ for more information.
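The abstract names Residual Vector Quantization (RVQ) as the step that turns continuous control signals into discrete tokens. As a rough illustration of the general technique only, the sketch below quantizes a 2-D vector in two stages, with each stage encoding the residual left by the previous one. The codebooks, the function names (`rvq_encode`, `rvq_decode`), and the toy `action` vector are all hypothetical and hand-written for this example; RDT2's actual tokenizer is learned, and its details are not given here.

```python
import math

def nearest(codebook, vec):
    """Index of the codeword closest to vec (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(codebook[i], vec))

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the previous residual."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen codeword so the next stage sees what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical hand-written codebooks: a coarse stage, then a finer stage.
CODEBOOKS = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],
    [[0.25, 0.25], [0.25, -0.25], [-0.25, 0.25], [-0.25, -0.25]],
]

action = [0.8, 0.3]  # toy stand-in for a continuous action vector
tokens, residual = rvq_encode(action, CODEBOOKS)
reconstruction = rvq_decode(tokens, CODEBOOKS)
```

In this toy setup the vector becomes one token per stage, and the reconstruction plus the final residual recovers the input. Adding stages shrinks the residual only insofar as the later codebooks match what earlier stages leave behind.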

Songming Liu, Bangguo Li, Kai Ma, Lingxuan Wu, Hengkai Tan, Xiao Ouyang, Hang Su, Jun Zhu • 2026

Related benchmarks

Task	Dataset	Metric	Result
Screw sorting	Screw sorting	Per-Operation Success Rate	0.00e+0	6
Pick-&-Place	Single-arm pick-and-place 16 trials	Success Rate	40	4
Shoebox unpacking	Shoebox Unpacking	Success Rate	0.00e+0	4
Sorting items and trash	Sorting items and trash	Per-Operation Success Rate	18	4
Button Pressing	Real-world Button Pressing	Reaction Time (ms)	97	3
Cloth Folding	Real-world Cloth Folding	Success Rate	77	3
Table Bussing	Real-world Table Bussing	Progress Score	58	3
Unzipping	Real-world Unzipping	Success Rate	45	3
Table Tennis	Real-world Table Tennis	Hit Rate (1x)	0.88	2

