SAM 3D: 3Dfy Anything in Images
About
We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Simulator Stability Evaluation | MuJoCo Cluttered Tabletop Scenes (Scenarios 1-5) | Max Kinetic Energy (J) | 2.08 | 10 |
| 3D Reconstruction | SAM3D (test) | Uni3D | 36.9 | 7 |
| Visual Fidelity | Cluttered Tabletop Scenes Scenario 5 | PSNR | 20.32 | 3 |
| Visual Fidelity | Cluttered Tabletop Scenes Scenario 1 | PSNR | 18.11 | 3 |
| Pose Estimation | Picasso 1.0 (Overall) | ADD-S | 11.71 | 3 |
| Scene Reconstruction | GSO simulation | Stability | 51.4 | 3 |
| Scene Reconstruction | YCB simulation | Stability | 48.8 | 3 |
| Visual Fidelity | Cluttered Tabletop Scenes Scenario 2 | PSNR | 18.99 | 3 |
| Visual Fidelity | Cluttered Tabletop Scenes Scenario 3 | PSNR | 17.34 | 3 |
| Visual Fidelity | Cluttered Tabletop Scenes Scenario 4 | PSNR | 21.11 | 3 |