RT-1: Robotics Transformer for Real-World Control at Scale
About
By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at robotics-transformer1.github.io.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | -- | -- | 494 |
| Long-horizon robot manipulation | Calvin ABCD→D | Task 1 Completion Rate | 84.4 | 96 |
| Robot Manipulation | SimplerEnv WidowX Robot tasks (test) | Success Rate (Spoon) | 0.00 | 79 |
| Long-horizon task completion | Calvin ABC->D | Success Rate (1) | 84.4 | 67 |
| Robot Manipulation | SimplerEnv Google Robot tasks Visual Matching | Pick Coke Can Success Rate | 85.7 | 62 |
| Robot Manipulation | SimplerEnv Google Robot tasks Variant Aggregation | Pick Coke Can Success Rate | 89.8 | 44 |
| Robot Manipulation | Calvin ABC->D | Average Successful Length | 0.9 | 36 |
| Instruction-following robotic manipulation | CALVIN ABC→D (unseen environment D) | Success Rate (Length 1) | 53.3 | 29 |
| Robotic Manipulation | SimplerEnv Google Robot - Visual Aggregation | Pick Coke Can | 89.8 | 28 |
| Robot Manipulation | SimplerEnv Google Robot Visual Matching | Pick Coke Can | 85.7 | 28 |