MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting

About

Open-world generalization requires robotic systems to have a profound understanding of the physical world and the user command to solve diverse and complex tasks. While the recent advancement in vision-language models (VLMs) has offered unprecedented opportunities to solve open-world problems, how to leverage their capabilities to control robots remains a grand challenge. In this paper, we introduce Marking Open-world Keypoint Affordances (MOKA), an approach that employs VLMs to solve robotic manipulation tasks specified by free-form language instructions. Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world. By prompting the pre-trained VLM, our approach utilizes the VLM's commonsense knowledge and concept understanding acquired from broad data sources to predict affordances and generate motions. To facilitate the VLM's reasoning in zero-shot and few-shot manners, we propose a visual prompting technique that annotates marks on images, converting affordance reasoning into a series of visual question-answering problems that are solvable by the VLM. We further explore methods to enhance performance with robot experiences collected by MOKA through in-context learning and policy distillation. We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
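The mark-based prompting idea in the abstract can be illustrated with a minimal sketch. It assumes candidate points have already been drawn on the image with text labels, and that the VLM is asked to reply with the labels it picks as JSON. The role names (grasp, function, target), the label format, the prompt wording, and the helper names are all hypothetical and are not taken from the abstract; the VLM call itself is left out, and a hand-written answer string stands in for its reply.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Pixel = Tuple[int, int]  # (u, v) image coordinates of a mark


@dataclass
class KeypointAffordance:
    """Hypothetical point-based affordance: one optional pixel per role."""
    grasp: Optional[Pixel]
    function: Optional[Pixel]
    target: Optional[Pixel]


ROLES = ("grasp", "function", "target")  # assumed role names


def build_prompt(instruction: str, marks: Dict[str, Pixel]) -> str:
    """Turn affordance reasoning into a multiple-choice question over mark labels."""
    labels = ", ".join(sorted(marks))
    return (
        f"Task: {instruction}\n"
        f"The image is annotated with marks: {labels}.\n"
        f"Reply with JSON mapping each of {list(ROLES)} to one mark label, or null."
    )


def parse_answer(answer: str, marks: Dict[str, Pixel]) -> KeypointAffordance:
    """Map the labels chosen by the VLM back to pixel coordinates."""
    chosen = json.loads(answer)
    points = {}
    for role in ROLES:
        label = chosen.get(role)
        if label is None:
            points[role] = None  # role not needed for this task
        elif label in marks:
            points[role] = marks[label]
        else:
            raise ValueError(f"unknown mark label: {label!r}")
    return KeypointAffordance(**points)


# Usage with made-up marks and a made-up VLM reply.
marks = {"P1": (120, 340), "P2": (150, 310), "Q1": (400, 220)}
prompt = build_prompt("sweep the trash with the broom", marks)
reply = '{"grasp": "P1", "function": "P2", "target": "Q1"}'
affordance = parse_answer(reply, marks)
```

Restricting the VLM to a small set of labeled marks keeps its output discrete and easy to validate; turning the selected pixels into robot motions would still need depth and camera calibration, which this sketch does not cover.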

Fangchen Liu, Kuan Fang, Pieter Abbeel, Sergey Levine • 2024

Related benchmarks

Task	Dataset	Result
Robotic Manipulation	SimplerEnv WidowX Robot	Success Rate (Put Spoon on Towel): 45.8	15
Stack bowls	Real-world Robotic Tasks	Success Rate: 10	12
Robot Tool Use	GROW2Bench Simulation	Pound Success Rate: 36.7	8
Placement	100-task benchmark (test)	PE (cm): 2.8	8
Shape dough	Robotic Manipulation Tasks (real-world)	Success Rate: 0	8
Move the moka pot to the right of drawer	xArm 6 Real-world Tabletop	Grasp Success Rate: 16.7	5
Move the nearest object to the right side of the drawer	xArm 6 Real-world Tabletop	Object Correctness: 83.3	5
Pick the [x] toothbrush and place it to the bucket	xArm 6 Real-world Tabletop	Correct Object Pick: 16.7	5
Place the fork in the green bin	xArm 6 Real-world Tabletop	Grasp Success Rate: 16.7	5
Put the screwdriver between drawer and the vase	xArm 6 Real-world Tabletop	Grasp Success Rate: 83.3	5

Showing 10 of 22 rows
