# Point-E: A System for Generating 3D Point Clouds from Complex Prompts
## About
While recent work on text-conditional 3D object generation has shown promising results, the state-of-the-art methods typically require multiple GPU-hours to produce a single sample. This is in stark contrast to state-of-the-art generative image models, which produce samples in seconds or minutes. In this paper, we explore an alternative method for 3D object generation which produces 3D models in only 1-2 minutes on a single GPU. Our method first generates a single synthetic view using a text-to-image diffusion model, and then produces a 3D point cloud using a second diffusion model conditioned on the generated image. While our method still falls short of the state of the art in terms of sample quality, it is one to two orders of magnitude faster to sample from, offering a practical trade-off for some use cases. We release our pre-trained point cloud diffusion models, as well as evaluation code and models, at https://github.com/openai/point-e.
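The second stage above is a diffusion model that denoises a set of points into a shape. As a rough illustration of that idea only (not Point-E's actual code: the real model is a trained transformer conditioned on the synthetic view, and the schedule values here are made up), a minimal DDPM-style ancestral sampler over a point cloud might look like:

```python
import numpy as np

def sample_point_cloud(denoise_fn, num_points=1024, dim=3, steps=64, seed=0):
    """Toy DDPM-style ancestral sampler over a (num_points, dim) array.

    `denoise_fn(x, t)` is assumed to predict the noise present at step t;
    in Point-E this role is played by a trained, image-conditioned model,
    while here any stand-in function will do.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)      # illustrative linear schedule
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)

    x = rng.standard_normal((num_points, dim))  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = denoise_fn(x, t)                  # predicted noise at step t
        coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:                               # add noise on all but the last step
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        else:
            x = mean
    return x

# Hypothetical stand-in denoiser, just to make the loop runnable.
toy_denoiser = lambda x, t: 0.1 * x

cloud = sample_point_cloud(toy_denoiser)
print(cloud.shape)  # (1024, 3)
```

In the released system, a small base model produces a coarse cloud (1K points) and an upsampler diffusion model refines it to 4K points; the sketch above corresponds to a single such stage.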
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Reconstruction | ShapeNet (test) | -- | -- | 74 |
| Text-to-3D Generation | GPTEval3D 110 prompts 1.0 | GPTEval3D Alignment | 725.2 | 20 |
| 2D-to-3D Reconstruction | ShapeNet 1 (test) | Chamfer Distance | 22.93 | 18 |
| 3D Shape Reconstruction | OmniObject3D | CD | 0.448 | 17 |
| Image-to-3D Generation | NeRF4 | CLIP-Similarity | 0.48 | 12 |
| Text-to-3D Generation | Objaverse | CLIP Score | 0.22 | 12 |
| 3D Shape Reconstruction | Pix3D | FS@1 | 0.1779 | 10 |
| 3D Reconstruction | GSO 13 (test) | Chamfer Distance | 0.0426 | 8 |
| 3D Reconstruction | Google Scanned Objects (GSO) 30 instances | Chamfer Distance | 0.043 | 8 |
| Single-view 3D Reconstruction | Google Scanned Objects (GSO) 13 | Chamfer Distance | 0.0426 | 8 |
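Most of the table above reports Chamfer Distance or F-score between a predicted and a ground-truth point cloud. Exact conventions differ between benchmarks (squared vs. unsquared distances, normalization, and the F-score threshold all vary), so the following is one common definition, not the one any particular leaderboard uses:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point clouds a and b:
    mean squared nearest-neighbor distance in both directions."""
    # Pairwise squared distances, shape (len(a), len(b)).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def f_score(a, b, tau=0.01):
    """F-score at threshold tau: harmonic mean of precision and recall,
    where a point is matched if its nearest neighbor lies within tau."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    precision = (np.sqrt(d2.min(axis=1)) < tau).mean()
    recall = (np.sqrt(d2.min(axis=0)) < tau).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

a = np.zeros((4, 3))
print(chamfer_distance(a, a))  # 0.0 for identical clouds
print(f_score(a, a))           # 1.0
```

Lower is better for Chamfer Distance, higher for F-score; the brute-force pairwise matrix here is fine for a few thousand points, while benchmark evaluators typically use a KD-tree for the nearest-neighbor queries.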