Structured 3D Latents for Scalable and Versatile 3D Generation

About

We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.
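As a rough illustration of the idea of a sparsely populated 3D grid carrying per-voxel features, the sketch below stores latent vectors only at occupied voxels. It is not the authors' implementation: the names `SparseLatentGrid`, `put`, and `mean_features` are hypothetical, and averaging per-view features into a single voxel vector is an assumption made here for simplicity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]


def mean_features(per_view: List[List[float]]) -> List[float]:
    """Average feature vectors observed for one voxel across views.

    Assumption: simple averaging; the abstract does not specify the aggregation.
    """
    if not per_view:
        raise ValueError("need at least one view")
    n = len(per_view)
    return [sum(channel) / n for channel in zip(*per_view)]


@dataclass
class SparseLatentGrid:
    """Hypothetical container: latent vectors kept only at occupied voxels."""
    resolution: int                     # grid is resolution^3 voxels
    channels: int                       # length of each latent vector
    latents: Dict[Coord, List[float]] = field(default_factory=dict)

    def put(self, coord: Coord, z: List[float]) -> None:
        """Attach a latent vector to one voxel, validating bounds and size."""
        if not all(0 <= c < self.resolution for c in coord):
            raise ValueError(f"coordinate {coord} outside the grid")
        if len(z) != self.channels:
            raise ValueError("latent has the wrong number of channels")
        self.latents[coord] = list(z)

    def occupancy(self) -> float:
        """Fraction of voxels that actually hold a latent."""
        return len(self.latents) / self.resolution ** 3


# Usage: one occupied voxel whose latent is the mean of two view features.
grid = SparseLatentGrid(resolution=4, channels=2)
grid.put((0, 1, 2), mean_features([[1.0, 2.0], [3.0, 4.0]]))
```

Storing only the occupied voxels keeps memory proportional to the object's surface rather than to the full grid volume, which is the motivation for a sparse layout in this sketch.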

Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, Jiaolong Yang • 2024

Related benchmarks

Task	Dataset	Result
3D Building Reconstruction	NYC Urban Dataset	FID 170.6	50
4D Generation	Consistent4D	LPIPS 0.2479	40
3D Reconstruction	GSO	CD Mean 0.005	35
Text-to-3D	Toys4k	CLIP Score 30.8	25
Routing	GSO novel objects {c ∈ C0}	Regret 1	24
Text-to-3D Generation	GPTEval3D 110 prompts 1.0	GPTEval3D Alignment 1.09e+3	20
3D Asset Reconstruction	Toys4k	CD 0.0083	18
3D Mesh Generation	Objaverse	Chamfer Distance 0.361	18
Single-view 3D Reconstruction	GSO (test)	CD 4.57e+3	18
3D Generation	UniLat1K	CLIP Score 90.83	16

Showing 10 of 159 rows
