SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

About

3D content generation has recently attracted significant research interest, driven by its critical applications in VR/AR and embodied AI. In this work, we tackle the challenging task of synthesizing multiple 3D assets from a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input and simultaneously produces multiple 3D assets with geometry and texture. Notably, SceneGen operates without extra optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architecture yields improved generation performance when multiple images are provided; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robustness of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.

Yanxu Meng, Haoning Wu, Ya Zhang, Weidi Xie • 2025

Related benchmarks

Task	Dataset	Metric	Value	
3D Scene Generation	3D-Front (test)	CD (Surface)	0.1432	28
Scene Reconstruction	MIT-Indoor-67 (test)	Realism	3.028	8
3D Scene Reconstruction	OVOW-3D-Scene-Bench	Scene IoU (AABB)	9.8	7
4D Scene Reconstruction	OVOW-4D-Scene-Bench	Scene IoU (AABB)	9.6	7
Posed Object Generation	Google Scanned Objects (GSO) (test)	CD (Chamfer Distance)	57.06	7
Image-to-3D Scene Reconstruction	20 reconstructed scenes (test)	CLIP Similarity	0.6381	6
Scene-level generation	3D-FRONT	PSNR	18.32	6
3D Scene Reconstruction	3D-Front (last 1000 samples)	S. Chamfer Distance	0.1531	6
Single-image-to-3D scene reconstruction	3D-Front last 1000 samples (test)	Penetration Depth	0.2263	6
3D Scene Reconstruction	Replica	Failure Rate	0.00e+0	5

Showing 10 of 20 rows
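
Several rows above report Chamfer Distance (CD) and an IoU over axis-aligned bounding boxes (AABB). As a rough illustration of what these metrics measure, here is a minimal pure-Python sketch. The function names are hypothetical, and the exact definitions used by each benchmark (squared versus unsquared distances, normalization, point sampling, scaling of the reported values) are not stated on this page, so the choices below are assumptions rather than the benchmarks' actual evaluation code.

```python
from math import dist, prod

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty lists of (x, y, z) points.

    Assumption: unsquared Euclidean distance, averaged per direction and summed.
    """
    def one_way(src, dst):
        # Mean distance from each source point to its nearest destination point.
        return sum(min(dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)

def aabb(points):
    """Axis-aligned bounding box of a point list as (min_corner, max_corner)."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

def aabb_iou(box_a, box_b):
    """Intersection-over-union of two AABBs given as (min_corner, max_corner)."""
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    inter = 1.0
    for a_lo, a_hi, b_lo, b_hi in zip(a_min, a_max, b_min, b_max):
        overlap = min(a_hi, b_hi) - max(a_lo, b_lo)
        if overlap <= 0:
            return 0.0  # boxes are disjoint (or only touch) along this axis
        inter *= overlap
    vol_a = prod(hi - lo for lo, hi in zip(a_min, a_max))
    vol_b = prod(hi - lo for lo, hi in zip(b_min, b_max))
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0

# Usage: two unit cubes, the second shifted by 0.5 along x.
cube_a = [(0, 0, 0), (1, 1, 1)]
cube_b = [(0.5, 0, 0), (1.5, 1, 1)]
iou = aabb_iou(aabb(cube_a), aabb(cube_b))
cd = chamfer_distance(cube_a, cube_b)
```

Lower Chamfer distance means the predicted and reference geometry are closer, while higher AABB IoU means the predicted scene layout overlaps the reference more. This brute-force nearest-neighbour search is quadratic in the number of points, so it is only suitable for small point sets.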
