MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting
About
Open-world generalization requires robotic systems to have a profound understanding of the physical world and the user command to solve diverse and complex tasks. While the recent advancement in vision-language models (VLMs) has offered unprecedented opportunities to solve open-world problems, how to leverage their capabilities to control robots remains a grand challenge. In this paper, we introduce Marking Open-world Keypoint Affordances (MOKA), an approach that employs VLMs to solve robotic manipulation tasks specified by free-form language instructions. Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world. By prompting the pre-trained VLM, our approach utilizes the VLM's commonsense knowledge and concept understanding acquired from broad data sources to predict affordances and generate motions. To facilitate the VLM's reasoning in zero-shot and few-shot manners, we propose a visual prompting technique that annotates marks on images, converting affordance reasoning into a series of visual question-answering problems that are solvable by the VLM. We further explore methods to enhance performance with robot experiences collected by MOKA through in-context learning and policy distillation. We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
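The mark-based prompting idea described above can be illustrated with a minimal sketch. This is not MOKA's implementation: the function names (`grid_marks`, `build_question`, `parse_answer`), the `P1`, `P2`, ... label scheme, the grid layout, and the question wording are all hypothetical, and the VLM call is left out. The sketch shows only how candidate points could be labelled, turned into a question, and how the answer could be mapped back to pixel coordinates.

```python
import re
from typing import Dict, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates


def grid_marks(box: Tuple[int, int, int, int], rows: int, cols: int) -> Dict[str, Point]:
    """Place rows*cols labelled candidate marks at cell centres inside a bounding box.

    Assumption: a regular grid is used purely for illustration.
    """
    x0, y0, x1, y1 = box
    marks: Dict[str, Point] = {}
    for r in range(rows):
        for c in range(cols):
            x = x0 + (2 * c + 1) * (x1 - x0) // (2 * cols)
            y = y0 + (2 * r + 1) * (y1 - y0) // (2 * rows)
            marks[f"P{r * cols + c + 1}"] = (x, y)
    return marks


def build_question(instruction: str, marks: Dict[str, Point]) -> str:
    """Phrase affordance selection as a visual question over the drawn marks."""
    labels = ", ".join(marks)
    return (
        f"Task: {instruction}\n"
        f"The image is annotated with marks {labels}. "
        "Answer with the single mark where the robot should grasp."
    )


def parse_answer(answer: str, marks: Dict[str, Point]) -> Point:
    """Map the mark label named in the model's free-text answer back to a pixel."""
    match = re.search(r"\bP\d+\b", answer)
    if match is None or match.group(0) not in marks:
        raise ValueError(f"no valid mark in answer: {answer!r}")
    return marks[match.group(0)]


marks = grid_marks((0, 0, 100, 100), rows=2, cols=2)
question = build_question("pick up the brush", marks)
grasp_point = parse_answer("I would grasp at P3.", marks)
```

In a full pipeline the labels would also be drawn onto the image before it is sent to the VLM together with the question. Selecting among discrete labelled marks lets the model answer with a short token instead of raw coordinates.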
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Robotic Manipulation | Task VII | Success Rate: 0.00e+0 | 5 |
| Robotic Manipulation | Task VIII | Success Rate: 50 | 5 |
| Shape rope | Robotic Manipulation Tasks (real-world) | Success Rate: 2.00e+3 | 4 |
| Stack bowls | Real-world Robotic Tasks | Success Rate: 10 | 4 |
| Non-toppling push | Robotic Manipulation Tasks (real-world) | Success Rate: 0.00e+0 | 4 |
| Pivoting | Robotic Manipulation Tasks (real-world) | Success Rate: 0.00e+0 | 4 |
| Shape dough | Robotic Manipulation Tasks (real-world) | Success Rate: 0.00e+0 | 4 |