$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
About
In order for robots to be useful, they must perform practically relevant tasks in the real world, outside of the lab. While vision-language-action (VLA) models have demonstrated impressive results for end-to-end robot control, it remains an open question how far such models can generalize in the wild. We describe $\pi_{0.5}$, a new model based on $\pi_{0}$ that uses co-training on heterogeneous tasks to enable broad generalization. $\pi_{0.5}$ uses data from multiple robots, high-level semantic prediction, web data, and other sources to enable broadly generalizable real-world robotic manipulation. Our system uses a combination of co-training and hybrid multi-modal examples that combine image observations, language commands, object detections, semantic subtask prediction, and low-level actions. Our experiments show that this kind of knowledge transfer is essential for effective generalization, and we demonstrate for the first time that an end-to-end learning-enabled robotic system can perform long-horizon and dexterous manipulation skills, such as cleaning a kitchen or bedroom, in entirely new homes.
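To make the co-training setup concrete, the sketch below shows one way such hybrid multi-modal examples and a mixed-source batch sampler could be represented. This is a minimal illustration under assumptions, not the paper's actual data pipeline: the `HybridExample` container, its field names, and the `sample_cotraining_batch` mixing scheme are hypothetical, and real examples would carry only the target modalities their source provides (e.g. web data with text targets only, robot data with action chunks).

```python
from dataclasses import dataclass
from typing import Optional, Sequence

import numpy as np


@dataclass
class HybridExample:
    """One co-training example; any subset of the target fields may be present.

    Web/VQA sources supply only text targets, detection data supplies boxes,
    high-level data supplies subtask labels, and robot data supplies actions.
    """
    images: Sequence[np.ndarray]                # one array per camera view
    command: str                                # high-level language command
    detections: Optional[list[dict]] = None     # e.g. {"label": "sponge", "box": [x1, y1, x2, y2]}
    subtask: Optional[str] = None               # semantic subtask, e.g. "pick up the sponge"
    actions: Optional[np.ndarray] = None        # low-level action chunk, shape (horizon, action_dim)


def sample_cotraining_batch(sources: dict[str, list[HybridExample]],
                            weights: dict[str, float],
                            batch_size: int,
                            rng: np.random.Generator) -> list[HybridExample]:
    """Draw a batch that mixes heterogeneous data sources at fixed (assumed) ratios."""
    names = list(sources)
    probs = np.array([weights[n] for n in names], dtype=float)
    probs /= probs.sum()
    batch = []
    for _ in range(batch_size):
        src = names[rng.choice(len(names), p=probs)]
        batch.append(sources[src][rng.integers(len(sources[src]))])
    return batch
```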
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 0.00e+0 | 935 |
| Robot Manipulation | LIBERO | Goal Achievement | 98 | 494 |
| Visual Question Answering | AI2D | Accuracy | 14.4 | 174 |
| Robot Manipulation | LIBERO (test) | Average Success Rate | 96.9 | 142 |
| Long-horizon robot manipulation | Calvin ABCD→D | Task 1 Completion Rate | 71 | 96 |
| Robot Manipulation | SimplerEnv WidowX Robot tasks (test) | Success Rate (Spoon) | 49.3 | 79 |
| Robotic Manipulation | LIBERO 1.0 (test) | Long | 92.4 | 30 |
| Robotic Manipulation | LIBERO v1 (test) | Config 10 Score | 92.4 | 27 |
| Robotic Manipulation | Calvin ABCD→D | Success Rate (1 Inst) | 94.4 | 26 |
| Robotic Manipulation | SIMPLER Visual Matching WidowX robot | Put Spoon on Towel Score | 79.2 | 24 |