Shap-E: Generating Conditional 3D Implicit Functions
About
We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap-E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. When compared to Point-E, an explicit generative model over point clouds, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space. We release model weights, inference code, and samples at https://github.com/openai/shap-e.
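The two-stage recipe described above can be illustrated with a toy sketch. This is not the Shap-E architecture: `encode_asset`, `noise_latent`, and `make_training_example` are hypothetical names, the "encoder" is a hand-written summary (centroid plus per-axis extent) standing in for a learned network that outputs implicit-function parameters, and the noising step is the standard Gaussian forward-diffusion formula, assumed here for illustration. The sketch only shows the data flow: stage 1 maps an asset deterministically to a latent, and stage 2 builds conditional diffusion training examples from those latents.

```python
import math
import random


def encode_asset(points):
    """Stage 1 stand-in: deterministically map a point cloud to a flat latent.

    Hypothetical toy encoder (centroid plus per-axis extent). The real encoder
    is a learned network that outputs implicit-function parameters.
    """
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    extent = [max(p[i] for p in points) - min(p[i] for p in points)
              for i in range(3)]
    return centroid + extent


def noise_latent(latent, alpha_bar, rng):
    """Forward-diffuse a latent: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
             for x, e in zip(latent, eps)]
    return noisy, eps


def make_training_example(points, text, alpha_bar, rng):
    """Stage 2 input: a noised encoder output paired with its text condition."""
    latent = encode_asset(points)
    noisy, eps = noise_latent(latent, alpha_bar, rng)
    return {
        "noisy_latent": noisy,    # what a denoiser would receive
        "condition": text,        # conditioning signal (text in this toy)
        "alpha_bar": alpha_bar,   # noise level; 1.0 means no noise
        "clean_latent": latent,   # possible regression target
        "noise": eps,             # alternative regression target
    }


# Eight corners of a unit cube as a tiny stand-in for a 3D asset.
cube = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
example = make_training_example(cube, "a cube", alpha_bar=0.5,
                                rng=random.Random(0))
```

Because the encoder is deterministic, the diffusion model in stage 2 can be trained on encoder outputs computed once ahead of time, without re-running the encoder during diffusion training.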
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-3D Generation | GPTEval3D 110 prompts 1.0 | GPTEval3D Alignment | 842.8 | 20 |
| 3D Shape Reconstruction | OmniObject3D | CD | 0.434 | 17 |
| Text-to-3D | Toys4k | CLIP Score | 25.04 | 14 |
| Single-view 3D Reconstruction | GSO (test) | CD | 0.204 | 13 |
| Text-to-3D Generation | Objaverse | CLIP Score | 30.52 | 12 |
| 3D Asset Reconstruction | Toys4k | CD | 0.6724 | 11 |
| 3D Shape Reconstruction | Pix3D | FS@1 | 0.2016 | 10 |
| Image-conditioned 3D Generation | Objaverse (test) | FID | 138.5 | 10 |
| 3D Reconstruction | GSO 13 (test) | Chamfer Distance | 0.0436 | 8 |
| 3D Reconstruction | Google Scanned Objects (GSO) 30 instances | Chamfer Distance | 0.044 | 8 |