
CoPa: General Robotic Manipulation through Spatial Constraints of Parts with Foundation Models

About

Foundation models pre-trained on web-scale data are shown to encapsulate extensive world knowledge beneficial for robotic manipulation in the form of task planning. However, the actual physical implementation of these plans often relies on task-specific learning methods, which require significant data collection and struggle with generalizability. In this work, we introduce Robotic Manipulation through Spatial Constraints of Parts (CoPa), a novel framework that leverages the common sense knowledge embedded within foundation models to generate a sequence of 6-DoF end-effector poses for open-world robotic manipulation. Specifically, we decompose the manipulation process into two phases: task-oriented grasping and task-aware motion planning. In the task-oriented grasping phase, we employ foundation vision-language models (VLMs) to select the object's grasping part through a novel coarse-to-fine grounding mechanism. During the task-aware motion planning phase, VLMs are utilized again to identify the spatial geometry constraints of task-relevant object parts, which are then used to derive post-grasp poses. We also demonstrate how CoPa can be seamlessly integrated with existing robotic planning algorithms to accomplish complex, long-horizon tasks. Our comprehensive real-world experiments show that CoPa possesses a fine-grained physical understanding of scenes, capable of handling open-set instructions and objects with minimal prompt engineering and without additional training. Project page: https://copa-2024.github.io/
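The abstract's two-phase decomposition can be sketched in code. This is a minimal, hypothetical outline only: every function, class, and value below is a stand-in (the paper's actual system queries foundation VLMs for part grounding and constraint identification, which are mocked here as stubs returning placeholder results).

```python
# Hedged sketch of CoPa's two-phase pipeline as described in the abstract.
# All names, poses, and return values are illustrative placeholders, not the
# paper's implementation: real VLM calls and a constraint solver are mocked.

from dataclasses import dataclass
from typing import List

@dataclass
class Pose:
    """6-DoF end-effector pose: position (x, y, z) and orientation (roll, pitch, yaw)."""
    position: tuple
    orientation: tuple

def select_grasp_part(image, instruction):
    # Phase 1 (task-oriented grasping): a VLM would select the object's
    # grasping part via coarse-to-fine grounding (object first, then part).
    # Mocked: return a fixed part label.
    return "hammer handle"

def identify_spatial_constraints(image, instruction):
    # Phase 2 (task-aware motion planning): a VLM would name the spatial
    # geometry constraints among task-relevant parts.
    # Mocked: return fixed textual constraints.
    return ["hammer head above nail", "hammer face parallel to nail head"]

def plan_poses(image, instruction) -> List[Pose]:
    """Produce a grasp pose followed by constraint-derived post-grasp poses."""
    part = select_grasp_part(image, instruction)
    constraints = identify_spatial_constraints(image, instruction)
    # Placeholder grasp pose on the selected part.
    grasp = Pose(position=(0.4, 0.0, 0.1), orientation=(0.0, 1.57, 0.0))
    # A solver would derive post-grasp poses satisfying `constraints`; mocked.
    post_grasp = Pose(position=(0.4, 0.2, 0.3), orientation=(0.0, 1.57, 0.0))
    return [grasp, post_grasp]
```

The sketch only illustrates the control flow the abstract describes: grasping and motion planning are separate queries, and the output is a short sequence of 6-DoF poses rather than a learned policy.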

Haoxu Huang, Fanqi Lin, Yingdong Hu, Shengjie Wang, Yang Gao · 2024

Related benchmarks

Task | Dataset | Result | Rank
Average Robotic Manipulation Success | Real-world Hardware | Success Rate: 60 | 5
Hammer Nail | Real-world Hardware | Success Rate: 30 | 5
Knock Tower | Real-world Hardware | Success Rate: 80 | 5
Reach Blocks | Real-world Hardware | Success Rate: 60 | 5
Sweep Toys | Real-world Hardware | Success Rate: 70 | 5
Robotic Manipulation | Real World (unseen environments and tasks) | Task 1 Success Rate: 40 | 4
Object-centric Manipulation | Real-world, 10 object-centric tasks | Egg Placing Success Rate: 20 | 4
Articulated Object Manipulation | Real-world, 3 articulated-object tasks | Drawer Opening Success Rate: 40 | 3
