# OCRA: Object-Centric Learning with 3D and Tactile Priors for Human-to-Robot Action Transfer

## About
We present OCRA, an Object-Centric framework for video-based human-to-Robot Action transfer that learns directly from human demonstration videos to enable robust manipulation. Object-centric learning emphasizes task-relevant objects and their interactions while filtering out irrelevant background, providing a natural and scalable way to teach robots. OCRA leverages multi-view RGB videos, the state-of-the-art 3D foundation model VGGT, and advanced detection and segmentation models to reconstruct object-centric 3D point clouds, capturing rich interactions between objects. To handle properties not easily perceived by vision alone, we incorporate tactile priors via a large-scale dataset of over one million tactile images. These 3D and tactile priors are fused through a multimodal module (ResFiLM) and fed into a Diffusion Policy to generate robust manipulation actions. Extensive experiments on both vision-only and visuo-tactile tasks show that OCRA significantly outperforms existing baselines and ablations, demonstrating its effectiveness for learning from human demonstration videos.
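To make the fusion step concrete, below is a minimal PyTorch sketch of a FiLM-style module with a residual connection, in the spirit of the ResFiLM fusion described above. The class name, layer sizes, and the exact way the tactile embedding conditions the point-cloud features are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class ResFiLMBlock(nn.Module):
    """Hypothetical FiLM-style fusion with a residual connection.

    Visual (object-centric point-cloud) features are modulated by per-channel
    scale/shift parameters predicted from a tactile embedding, then added back
    to the original visual features before being passed to the policy.
    """

    def __init__(self, feat_dim: int, cond_dim: int):
        super().__init__()
        # Predict per-channel scale (gamma) and shift (beta) from the tactile condition.
        self.film = nn.Linear(cond_dim, 2 * feat_dim)
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, visual_feat: torch.Tensor, tactile_emb: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.film(tactile_emb).chunk(2, dim=-1)
        modulated = gamma * self.proj(visual_feat) + beta
        return visual_feat + modulated  # residual fusion


# Usage sketch: fuse a 512-d point-cloud feature with a 128-d tactile embedding;
# the fused observation would then condition the Diffusion Policy.
fusion = ResFiLMBlock(feat_dim=512, cond_dim=128)
obs = fusion(torch.randn(8, 512), torch.randn(8, 128))
print(obs.shape)  # torch.Size([8, 512])
```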
## Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Pour | Real-world Pouring Water | Success Rate: 70 | 3 |
| Scoop | Real-world Scooping Ball | Success Rate: 70 | 3 |
| Stack | Real-world Stacking Cup | Success Rate: 1 | 3 |
| Sweep | Real-world Sweeping Objects | Success Rate: 100 | 3 |