Generalizable Coarse-to-Fine Robot Manipulation via Language-Aligned 3D Keypoints

About

Hierarchical coarse-to-fine policies, in which a coarse branch predicts a region of interest to guide a fine-grained action predictor, have demonstrated significant potential in robotic 3D manipulation tasks, notably by improving sample efficiency and enabling more precise manipulation. However, even when augmented with pre-trained models, these hierarchical policies still suffer from generalization issues. To enhance generalization to novel instructions and environment variations, we propose the Coarse-to-fine Language-Aligned manipulation Policy (CLAP), a framework that integrates three key components: 1) task decomposition, 2) VLM fine-tuning for 3D keypoint prediction, and 3) 3D-aware representation. Through comprehensive experiments in simulation and on a real robot, we demonstrate its superior generalization capability. Specifically, on GemBench, a benchmark designed for evaluating generalization, our approach achieves a 12% higher average success rate than the state-of-the-art method while using only 1/5 of the training trajectories. In real-world experiments, our policy, trained on only 10 demonstrations, successfully generalizes to novel instructions and environments.
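
To make the coarse-to-fine idea concrete, the following is a minimal sketch of how a coarse 3D keypoint could define a region of interest for a fine-grained predictor. It is not the CLAP implementation: the function name `crop_region_of_interest`, the spherical crop, the radius, and the toy point cloud are all assumptions made for illustration.

```python
from math import dist
from typing import List, Tuple

Point = Tuple[float, float, float]


def crop_region_of_interest(
    points: List[Point], keypoint: Point, radius: float
) -> List[Point]:
    """Keep points within `radius` of the coarse keypoint, re-centered on it.

    Hypothetical helper: stands in for whatever cropping a real
    coarse-to-fine policy applies before its fine-grained stage.
    """
    return [
        (p[0] - keypoint[0], p[1] - keypoint[1], p[2] - keypoint[2])
        for p in points
        if dist(p, keypoint) <= radius
    ]


# Toy scene point cloud (assumed values, in meters).
scene = [
    (0.0, 0.0, 0.0),
    (0.5, 0.5, 0.1),
    (0.52, 0.48, 0.12),
    (1.0, 1.0, 1.0),
]

# Assumed output of the coarse branch: a 3D keypoint near the target object.
coarse_keypoint = (0.5, 0.5, 0.1)

# The fine-grained stage would only see this local, keypoint-centered crop.
local_points = crop_region_of_interest(scene, coarse_keypoint, radius=0.05)
```

In this sketch the fine stage receives only the points near the predicted keypoint, expressed relative to it, which is one common way a smaller input region can support more precise action prediction.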

Jianshu Hu, Lidi Wang, Shujia Li, Yunpeng Jiang, Xiao Li, Paul Weng, Yutong Ban • 2025

Related benchmarks

Task	Dataset	Metric	Result
Multi-task Robotic Manipulation	GemBench	Avg Success	62	8
Robot Manipulation	Real-world Robot Manipulation Table Color Variation	Place Shape Sorter Success Rate	50	2
Robot Manipulation	Real-world Robot Manipulation Distracted Objects	Success Rate: Place Shape in Sorter	0.4	2
Robot Manipulation	Real-world Robot Manipulation Light Strength Variation	Place Shape in Shape Sorter	50	2
Robot Manipulation	Real-world Robot Manipulation Average across all variations	Success Rate: Place Shape (Sorter)	50	2
Robot Manipulation	Real-world Robot Manipulation No Variation	Place Shape in Sorter Success	60	2

