GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
About
Embodied foundation models are gaining increasing attention for their zero-shot generalization, scalability, and adaptability to new tasks through few-shot post-training. However, existing models rely heavily on real-world data, which is costly and labor-intensive to collect. Synthetic data offers a cost-effective alternative, yet its potential remains largely underexplored. To bridge this gap, we explore the feasibility of training Vision-Language-Action (VLA) models entirely with large-scale synthetic action data. We curate SynGrasp-1B, a billion-frame robotic grasping dataset generated in simulation with photorealistic rendering and extensive domain randomization. Building on this, we present GraspVLA, a VLA model pretrained on large-scale synthetic action data as a foundation model for grasping tasks. GraspVLA integrates autoregressive perception tasks and flow-matching-based action generation into a unified Chain-of-Thought process, enabling joint training on synthetic action data and Internet semantics data. This design helps mitigate sim-to-real gaps and facilitates the transfer of learned actions to a broader range of Internet-covered objects, achieving open-vocabulary generalization in grasping. Extensive evaluations across real-world and simulation benchmarks demonstrate GraspVLA's advanced zero-shot generalizability and few-shot adaptability to specific human preferences. We will release the SynGrasp-1B dataset and pre-trained weights to benefit the community.
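To make "flow-matching-based action generation" concrete, the following is a minimal sketch of generic flow matching with a straight-line path from noise to action. It is not GraspVLA's implementation: the abstract does not specify the path, the network, or the sampler, so the linear interpolation, the Euler integrator, and every name here (`interpolate`, `target_velocity`, `flow_matching_loss`, `sample_action`, `predict_velocity`) are assumptions chosen for illustration.

```python
import random

def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to action (t=1). Assumed path."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]

def target_velocity(noise, action):
    """Regression target for the velocity field along the straight path."""
    return [a - n for n, a in zip(noise, action)]

def flow_matching_loss(predict_velocity, action, rng):
    """Mean squared error between predicted and target velocity for one sample.

    predict_velocity(x_t, t) stands in for a learned network (hypothetical).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1)
    x_t = interpolate(noise, action, t)
    pred = predict_velocity(x_t, t)
    target = target_velocity(noise, action)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(action)

def sample_action(predict_velocity, dim, rng, steps=10):
    """Generate an action by Euler-integrating the velocity field from noise."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a VLA setting, `predict_velocity` would additionally be conditioned on image and language features; that conditioning is omitted here to keep the sketch self-contained.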
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | Goal Achievement | 91.2 | 494 |
| Robot Manipulation | LIBERO (test) | Average Success Rate | 54.6 | 142 |
| Long-horizon robot manipulation | CALVIN | Task Completion Rate (1) | 56.2 | 15 |
| Robot Manipulation | Real-world post-training dataset Task 2: Move condiment cup into slot 1.0 (test) | Success Rate | 53.3 | 7 |
| Robot Manipulation | Real-world post-training dataset Task 1: Move pink tulip to vase 1.0 (test) | Success Rate | 33.3 | 7 |
| Robotic Manipulation | Robotic Manipulation Dataset Small Camera Pose Randomization 1.0 | Success Rate | 82.5 | 5 |
| Robotic Manipulation | Robotic Manipulation Dataset Medium Camera Pose Randomization 1.0 | Success Rate | 63.4 | 5 |
| Robotic Manipulation | Robotic Manipulation Dataset Large Camera Pose Randomization 1.0 | Success Rate | 54.8 | 5 |