Understanding 3D Object Interaction from a Single Image
About
Humans can easily understand a single image as depicting multiple potential objects that permit interaction. We use this skill to plan our interactions with the world and to accelerate our understanding of new objects without having to interact with them. In this paper, we aim to endow machines with a similar ability, so that intelligent agents can better explore 3D scenes or manipulate objects. Our approach is a transformer-based model that predicts the 3D location, physical properties, and affordance of objects. To power this model, we collect a dataset of Internet videos, egocentric videos, and indoor images to train and validate our approach. Our model yields strong performance on our data and generalizes well to robotics data. Project site: https://jasonqsy.github.io/3DOI/
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Affordance prediction | AGD20K unseen | KLD: 3.565 | 20 |
| Articulated Object Manipulation | Real-robot manipulation trials Textured Hinge | OSR: 60 | 9 |
| Articulated Object Manipulation | 50 tasks in campus environments | Right Hinge Time (s): 33.4 | 9 |
| Articulated Object Manipulation | Real-robot manipulation trials Right Hinge | OSR: 40 | 9 |
| Articulated Object Manipulation | Real-robot manipulation trials Mean across 50 tasks | Overall Success Rate (OSR): 52 | 9 |
| Articulated Object Manipulation | Real-robot manipulation trials Prismatic Hinge | OSR: 70 | 9 |
| Articulated Object Manipulation | Real-robot manipulation trials Left Hinge | OSR: 40 | 9 |
| Articulated Object Manipulation | Real-robot manipulation trials Bottom Hinge | OSR: 50 | 8 |
| Articulated Object Axis Estimation | Campus-scale 50 tasks (test) | Right Hinge Axis EA-Score: 71.8 | 4 |
| Articulated Object Segmentation | Campus-scale 50 tasks (test) | Right Hinge Mask IoU: 72 | 3 |