RISE: 3D Perception Makes Real-World Robot Imitation Simple and Effective

About

Precise robot manipulation requires rich spatial information in imitation learning. Image-based policies model object positions from fixed cameras, which makes them sensitive to changes in camera view. Policies that use 3D point clouds usually predict keyframes rather than continuous actions, which makes dynamic and contact-rich scenarios difficult. To use 3D perception efficiently, we present RISE, an end-to-end baseline for real-world imitation learning that predicts continuous actions directly from single-view point clouds. It compresses the point cloud into tokens with a sparse 3D encoder. After sparse positional encoding is added, the tokens are featurized by a transformer. Finally, a diffusion head decodes the features into robot actions. Trained with 50 demonstrations for each real-world task, RISE surpasses representative 2D and 3D policies by a large margin, with significant advantages in both accuracy and efficiency. Experiments also demonstrate that RISE is more general and more robust to environmental change than previous baselines. Project website: rise-policy.github.io.
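
To make the first two stages of the pipeline concrete, here is a minimal NumPy sketch of turning a single-view point cloud into sparse tokens and adding a positional encoding. It is an illustration under stated assumptions, not the RISE implementation: the function names, the 5 cm voxel size, the per-voxel mean used in place of a learned sparse 3D encoder, the random projection matrix, and the sinusoidal form of the encoding are all hypothetical choices. The transformer and the diffusion head are omitted.

```python
import numpy as np

def voxelize(points, voxel_size=0.05):
    """Group points (N, 3) into occupied voxels only (a sparse representation).

    Returns integer voxel coordinates (M, 3) and the mean point of each voxel
    (M, 3). The mean is a stand-in for a learned sparse encoder (assumption).
    """
    coords = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # flatten; shape differs across NumPy versions
    sums = np.zeros((len(uniq), 3))
    np.add.at(sums, inverse, points)  # accumulate the points of each voxel
    counts = np.bincount(inverse, minlength=len(uniq))
    return uniq, sums / counts[:, None]

def sparse_positional_encoding(coords, dim=12):
    """Sinusoidal encoding of integer voxel coordinates (assumed form).

    Each of the 3 axes gets dim // 6 sine and dim // 6 cosine channels.
    """
    assert dim % 6 == 0, "dim must be divisible by 6"
    n_freq = dim // 6
    freqs = 1.0 / (10000.0 ** (np.arange(n_freq) / n_freq))
    angles = coords[:, :, None].astype(np.float64) * freqs  # (M, 3, n_freq)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(len(coords), dim)

# Usage: a random cloud stands in for a depth-camera point cloud.
rng = np.random.default_rng(0)
points = rng.uniform(-0.5, 0.5, size=(2048, 3))
coords, feats = voxelize(points)
W = rng.normal(size=(3, 12))  # hypothetical projection, not a trained weight
tokens = feats @ W + sparse_positional_encoding(coords, dim=12)
print(tokens.shape)  # (number of occupied voxels, 12)
```

Only occupied voxels produce tokens, so the token count tracks the observed geometry rather than the size of a dense grid. The encoding depends on voxel coordinates in 3D space, not on pixel positions in a camera image.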

Chenxi Wang, Hongjie Fang, Hao-Shu Fang, Cewu Lu • 2024

Related benchmarks

Task	Dataset	Result
One-Move manipulation	One-Move	Success Rate: 65	12
Three-Scoop manipulation	Three-Scoop	Success Rate (SR): 15	12
Swap-Easy manipulation	Swap-Easy	SR: 15	12
Add-Salt manipulation	Add-Salt	SR: 45	12
Swap-Hard manipulation	Swap-Hard	SR: 10	9
Simulated Robotic Manipulation	RoboTwin 2.0	Hammer Success Rate: 90	6
Robotic Insertion	Cobot Mobile ALOHA In-distribution (train)	Task 1 Success Rate: 20	5
Open Oven	Real-world dexterous manipulation	Hook Success Rate: 100	4
Open jar	Real-world dexterous manipulation	Hook Success Rate: 80	4
Pull Tissue	Real-world dexterous manipulation	Grasp Success Rate: 75	4

Showing 10 of 23 rows
