Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction
About
Humans can learn to manipulate new objects by simply watching others; providing robots with the ability to learn from such demonstrations would enable a natural interface for specifying new behaviors. This work develops Robot See Robot Do (RSRD), a method for imitating articulated object manipulation from a single monocular RGB human demonstration given a single static multi-view object scan. We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video with differentiable rendering. This analysis-by-synthesis approach uses part-centric feature fields in an iterative optimization, which enables the use of geometric regularizers to recover 3D motions from only a single video. Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion. By representing demonstrations as part-centric trajectories, RSRD focuses on replicating the demonstration's intended behavior while considering the robot's own morphological limits, rather than attempting to reproduce the hand's motion. We evaluate 4D-DPM's 3D tracking accuracy on ground-truth annotated 3D part trajectories and RSRD's physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot. Each phase of RSRD achieves an average success rate of 87%, for a total end-to-end success rate of 60% across 90 trials. Notably, this is accomplished using only feature fields distilled from large pretrained vision models, without any task-specific training, fine-tuning, dataset collection, or annotation. Project page: https://robot-see-robot-do.github.io
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| 4D reconstruction of articulated object | ArtHOI RGBD | CD (mm) | 7.768 | 15 |
| 4D Reconstruction | RSRD | Chamfer Distance (mm) | 8.739 | 12 |
| Reconstruction | Video2Articulation-S | CD-w | 3.39 | 8 |
| Tracking | Articulat3D-Sim | EPE | 3.14 | 5 |
| Joint Estimation | Video2Articulation-S | Axis Error | 68.49 | 4 |
| Joint Estimation | Articulat3D-Sim | Axis Error | 55.75 | 4 |
| Joint Estimation (Prismatic) | Video2Articulation-S | Axis Angle (deg) | 69.91 | 4 |
| Joint Estimation (Revolute) | Video2Articulation-S | Axis Error (deg) | 67.06 | 4 |
| Reconstruction | Articulat3D-Sim | CD-w | 69.16 | 4 |
| View Synthesis | Video2Articulation-S | PSNR | 24.78 | 3 |
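
For readers unfamiliar with the Chamfer distance (CD) reported in the table above, the following is a minimal sketch of one common symmetric definition: the mean nearest-neighbor distance from each point cloud to the other, summed over both directions. This page does not state which convention each benchmark uses (squared vs. unsquared distances, sum vs. mean, one-directional or weighted variants such as CD-w), so the function name `chamfer_distance` and the exact formula here are illustrative assumptions rather than the benchmarks' evaluation code.

```python
import numpy as np


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point clouds a (N, 3) and b (M, 3).

    Assumed convention (not confirmed by the source): unsquared Euclidean
    distances, averaged per direction, then summed over both directions.
    The result is in the same units as the inputs (e.g. mm).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise distance matrix of shape (N, M) via broadcasting.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Mean distance from each point in a to its nearest neighbor in b,
    # plus the same quantity in the other direction.
    return d.min(axis=1).mean() + d.min(axis=0).mean()


if __name__ == "__main__":
    predicted = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
    reference = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])
    print(chamfer_distance(predicted, reference))
```

The brute-force pairwise matrix uses O(N·M) memory, which is fine for small clouds; for dense reconstructions a spatial index such as a k-d tree would be the usual substitute.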