Zero-1-to-3: Zero-shot One Image to 3D Object
About
We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image. To perform novel view synthesis in this under-constrained setting, we capitalize on the geometric priors that large-scale diffusion models learn about natural images. Our conditional diffusion model is trained on a synthetic dataset to learn control over the relative camera viewpoint, allowing new images of the same object to be generated under a specified camera transformation. Despite being trained on synthetic data, our model retains strong zero-shot generalization to out-of-distribution datasets as well as in-the-wild images, including impressionist paintings. Our viewpoint-conditioned diffusion approach can further be used for 3D reconstruction from a single image. Qualitative and quantitative experiments show that our method significantly outperforms state-of-the-art single-view 3D reconstruction and novel view synthesis models by leveraging Internet-scale pre-training.
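As a minimal sketch of the viewpoint conditioning described above: the relative camera transformation between the input and target views can be expressed in spherical coordinates and embedded as a small vector that conditions the diffusion model. The four-dimensional encoding below (polar offset, sine/cosine of the azimuth offset, radius offset) follows the paper's spherical-coordinate parameterization, but the function name and exact layout here are illustrative assumptions, not the released implementation.

```python
import math

def relative_pose_embedding(theta_src, phi_src, r_src,
                            theta_tgt, phi_tgt, r_tgt):
    """Encode a relative camera transform in spherical coordinates.

    Inputs are (polar angle theta, azimuth phi, radius r) for the source
    and target cameras, in radians / scene units. Returns a 4-vector
    (d_theta, sin(d_phi), cos(d_phi), d_r); azimuth is wrapped through
    sin/cos so that the embedding is continuous at the 0/2*pi boundary.
    This vector would then be fed to the conditional diffusion model.
    """
    d_theta = theta_tgt - theta_src
    d_phi = phi_tgt - phi_src
    d_r = r_tgt - r_src
    return [d_theta, math.sin(d_phi), math.cos(d_phi), d_r]
```

In practice such an embedding is concatenated with an image embedding of the input view before conditioning the denoiser; the sketch covers only the pose-encoding step.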
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Novel View Synthesis | THuman 2.0 (test) | LPIPS | 0.1163 | 39 |
| 3D Reconstruction | Google Scanned Objects (GSO) (test) | LPIPS | 0.23 | 17 |
| Novel View Synthesis | Google Scanned Objects | PSNR | 18.51 | 15 |
| Novel View Synthesis | Google Scanned Objects (GSO) (test) | PSNR | 18.93 | 14 |
| Novel View Synthesis | Objaverse (test) | PSNR | 17.37 | 14 |
| Novel View Synthesis | InterHand2.6M (test) | LPIPS | 0.17 | 12 |
| Novel View Synthesis | GSO challenging | PSNR | 21.79 | 10 |
| 2D Multi-view Generation | Anime3D++ (test) | SSIM | 0.865 | 10 |
| Multi-view Generation | GSO | PSNR | 18.8219 | 9 |
| Multi-view Generation | 3D-FUTURE | PSNR | 17.0526 | 9 |