PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation
About
The ability of robots to comprehend and execute manipulation tasks based on natural language instructions is a long-term goal in robotics. The dominant approaches to language-guided manipulation use 2D image representations, which have difficulty combining multi-view cameras and inferring precise 3D positions and relationships. To address these limitations, we propose PolarNet, a 3D point-cloud-based policy for language-guided manipulation. It leverages carefully designed point cloud inputs, efficient point cloud encoders, and multimodal transformers to learn 3D point cloud representations and integrate them with language instructions for action prediction. PolarNet is shown to be effective and data-efficient in a variety of experiments conducted on the RLBench benchmark. It outperforms state-of-the-art 2D and 3D approaches in both single-task and multi-task learning. It also achieves promising results on a real robot.
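To illustrate why point clouds make multi-view fusion straightforward, the following is a minimal sketch of back-projecting depth images from several cameras into one shared world-frame point cloud. It is not PolarNet's actual input pipeline: it assumes a pinhole camera model, depth in meters, and known camera-to-world poses, and the function names `depth_to_points` and `merge_views` are hypothetical.

```python
import numpy as np

def depth_to_points(depth, intrinsics, cam_to_world):
    """Back-project a depth image (H, W) into world-frame points (H*W, 3).

    Assumes a pinhole model: intrinsics is a 3x3 matrix K and
    cam_to_world is a 4x4 homogeneous camera-to-world transform.
    """
    h, w = depth.shape
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    # u indexes columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts_cam = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    # Homogeneous coordinates so that the pose applies rotation and translation.
    ones = np.ones((pts_cam.shape[0], 1))
    pts_h = np.concatenate([pts_cam, ones], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]

def merge_views(depths, intrinsics_list, poses):
    """Concatenate the world-frame points of all camera views."""
    clouds = [
        depth_to_points(d, k, t)
        for d, k, t in zip(depths, intrinsics_list, poses)
    ]
    return np.concatenate(clouds, axis=0)
```

Because every view lands in the same coordinate frame, combining cameras reduces to a concatenation, whereas a 2D policy has to learn cross-view correspondence from the images themselves.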
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Robotic Manipulation | RLBench | Avg Success Score | 46.4 | 56 |
| Robotic Manipulation | RLBench (test) | Average Success Rate | 46.4 | 34 |
| Multi-task Robotic Manipulation | RLBench | Avg Success Rate | 48.7 | 16 |
| Robotic Manipulation | RLBench 10 tasks | Pick & Lift Success Rate | 97.8 | 13 |
| Multi-task Robotic Manipulation | RLBench 100 demonstrations (test) | Average Success Rate | 89.8 | 11 |
| Robotic Manipulation | RLBench 18Task | Average Success Rate | 46.4 | 9 |
| Multi-task Robotic Manipulation | GemBench | Avg Success | 38.4 | 8 |
| Vision-based Robotic Manipulation | GemBench (test) | Average Score | 38.4 | 8 |
| Robot Manipulation | RLBench 10 Tasks single-variation | Success Rate | 92.1 | 6 |
| Robotic Manipulation | GemBench Level 3 (Articulated objects) | Success Rate | 38.5 | 6 |