LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation
About
3D content creation has achieved significant progress in terms of both quality and speed. Although current feed-forward models can produce 3D objects in seconds, their resolution is constrained by the intensive computation required during training. In this paper, we introduce Large Multi-View Gaussian Model (LGM), a novel framework designed to generate high-resolution 3D models from text prompts or single-view images. Our key insights are two-fold: 1) 3D Representation: We propose multi-view Gaussian features as an efficient yet powerful representation, which can then be fused together for differentiable rendering. 2) 3D Backbone: We present an asymmetric U-Net as a high-throughput backbone operating on multi-view images, which can be produced from text or single-view image input by leveraging multi-view diffusion models. Extensive experiments demonstrate the high fidelity and efficiency of our approach. Notably, we maintain the fast speed to generate 3D objects within 5 seconds while boosting the training resolution to 512, thereby achieving high-resolution 3D content generation.
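As a rough illustration of the fusion step mentioned above, the sketch below treats each view's output as a feature map in which every pixel parameterizes one Gaussian, and fuses the views by simple concatenation. The 14-channel layout, the helper names `fuse_views` and `make_dummy_view`, and concatenation as the fusion rule are assumptions made for illustration; the abstract does not specify them.

```python
from typing import List

# Hypothetical per-Gaussian channel layout (an assumption, not given in the abstract):
# 3 position + 1 opacity + 3 scale + 4 rotation + 3 color = 14 channels.
CHANNELS = 14

# One view's output: a nested list indexed as [row][column][channel].
FeatureMap = List[List[List[float]]]


def make_dummy_view(height: int, width: int, fill: float) -> FeatureMap:
    """Build a placeholder feature map where every channel holds `fill`."""
    return [[[fill] * CHANNELS for _ in range(width)] for _ in range(height)]


def fuse_views(feature_maps: List[FeatureMap]) -> List[List[float]]:
    """Flatten the per-pixel features of every view into one list of Gaussians.

    Each pixel contributes exactly one Gaussian, so V views of size H x W
    yield V * H * W Gaussians in total.
    """
    gaussians: List[List[float]] = []
    for view_index, fmap in enumerate(feature_maps):
        for row in fmap:
            for pixel in row:
                if len(pixel) != CHANNELS:
                    raise ValueError(
                        f"view {view_index}: expected {CHANNELS} channels, "
                        f"got {len(pixel)}"
                    )
                # Copy so the fused set does not alias the input maps.
                gaussians.append(list(pixel))
    return gaussians
```

With four views of size 2 x 3, `fuse_views` returns 24 Gaussians of 14 channels each. A real pipeline would pass this fused set to a differentiable Gaussian renderer, which is outside the scope of this sketch.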
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-3D Generation | GPTEval3D 110 prompts 1.0 | GPTEval3D Alignment | 1.09e+3 | 20 |
| 3D Shape Reconstruction | OmniObject3D | CD | 0.114 | 17 |
| 3D Reconstruction | Google Scanned Objects (GSO) (test) | LPIPS | 0.063 | 17 |
| 3D Character Generation | Anime3D++ (test) | SSIM | 87.6 | 16 |
| Text-to-3D | Toys4k | CLIP Score | 24.83 | 14 |
| Single-view 3D Reconstruction | GSO (test) | CD | 0.196 | 13 |
| Text-to-3D Generation | Objaverse | CLIP Score | 30.06 | 12 |
| Image-to-3D Generation | NeRF4 | CLIP-Similarity | 0.48 | 12 |
| 3D Asset Reconstruction | Toys4k | CD | 0.566 | 11 |
| Image-conditioned 3D Generation | Objaverse (test) | FID | 19.93 | 10 |