X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
About
Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To accommodate and exploit the heterogeneity of these rich, diverse robotic data sources, we propose a novel Soft Prompt approach that adds only a minimal number of parameters: it brings prompt-learning concepts into cross-embodiment robot learning by introducing a separate set of learnable embeddings for each distinct data source. These embeddings serve as embodiment-specific prompts, which together enable VLA models to exploit varying cross-embodiment features effectively. Our new X-VLA, a neat flow-matching-based VLA architecture, relies exclusively on soft-prompted standard Transformer encoders, offering both scalability and simplicity. Evaluated across 6 simulations and 3 real-world robots, our 0.9B instantiation, X-VLA-0.9B, simultaneously achieves SOTA performance over a sweep of benchmarks, demonstrating superior results across a wide range of capabilities, from flexible dexterity to quick adaptation across embodiments, environments, and tasks. Website: https://thu-air-dream.github.io/X-VLA/
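The soft-prompt idea described above can be illustrated with a minimal sketch. It is not the X-VLA implementation: the class name `SoftPromptBank`, the source names `robot_a` and `robot_b`, the prompt length, and the choice to prepend the prompts to the token sequence are all assumptions made for illustration. In a real model the prompt vectors would be trainable parameters updated by gradient descent along with the Transformer; here they are plain Python lists so the example runs with the standard library alone.

```python
import random


class SoftPromptBank:
    """Hypothetical table holding one set of prompt vectors per data source."""

    def __init__(self, source_names, num_prompt_tokens, dim, seed=0):
        rng = random.Random(seed)
        # One independent (num_prompt_tokens x dim) block per data source.
        # In a trained model these values would be learned, not fixed at init.
        self.prompts = {
            name: [
                [rng.gauss(0.0, 0.02) for _ in range(dim)]
                for _ in range(num_prompt_tokens)
            ]
            for name in source_names
        }

    def prepend(self, source_name, tokens):
        """Return the source's prompt tokens followed by the input tokens."""
        prompt = self.prompts[source_name]
        if tokens and len(tokens[0]) != len(prompt[0]):
            raise ValueError("token width must match prompt width")
        # Assumption: prompts are prepended; the abstract does not say
        # where in the sequence they are injected.
        return prompt + tokens


# Hypothetical usage: two data sources, 4 prompt tokens each, width 8.
bank = SoftPromptBank(["robot_a", "robot_b"], num_prompt_tokens=4, dim=8)
obs_tokens = [[0.0] * 8 for _ in range(10)]  # stand-in for encoded inputs
seq = bank.prepend("robot_a", obs_tokens)
print(len(seq))
```

The only source-specific parameters in this sketch are the prompt vectors themselves (here 2 sources x 4 tokens x 8 values), which is one way to read the abstract's "minimally added parameters": the shared Transformer stays identical across embodiments, and only the selected prompt block changes.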
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | Goal Achievement | 97.8 | 494 |
| Robot Manipulation | RoboTwin Randomized 2.0 | Success Rate: Place Dual Shoes | 88 | 20 |
| Robot Manipulation | RoboTwin Clean 2.0 | Place Dual Shoes Success | 79 | 20 |
| Robot Manipulation | LIBERO-Plus Zero-shot | Camera Score | 22.2 | 20 |
| Robotic Manipulation | WidowX | Spoon Success Rate | 100 | 17 |
| Robotic Manipulation | Google Robot Variant Aggregation | Pick Success Rate | 85.5 | 15 |
| Language-conditioned manipulation | LIBERO Long | Avg Success Score | 97.6 | 6 |
| Bimanual Robotic Manipulation | RoboTwin Hard 2.0 | Success Rate (H=1) | 82.5 | 5 |
| Bimanual Robotic Manipulation | RoboTwin Easy 2.0 | Success Rate (H=1) | 81.6 | 5 |
| Robotic Manipulation | GenieSim 2.2 | Success Rate: Clear Countertop Waste | 62.2 | 4 |