Self-Evolving 3D Scene Generation from a Single Image

About

Generating high-quality, textured 3D scenes from a single image remains a fundamental challenge in vision and graphics. Recent image-to-3D generators recover reasonable geometry from single views, but their object-centric training limits how well they generalize to complex, large-scale scenes with faithful structure and texture. We present EvoScene, a self-evolving, training-free framework that progressively reconstructs complete 3D scenes from single images. The key idea is to combine the complementary strengths of existing models: geometric reasoning from 3D generation models and visual knowledge from video generation models. Through three iterative stages (Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation), EvoScene alternates between the 2D and 3D domains, gradually improving both structure and appearance. Experiments on diverse scenes demonstrate that EvoScene achieves superior geometric stability, view-consistent textures, and unseen-region completion compared to strong baselines, producing ready-to-use 3D meshes for practical applications.
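
The alternation between 2D and 3D described above can be pictured as a simple control loop. The sketch below is only an illustration of that loop under stated assumptions, not the authors' implementation: the function names (`evolve_scene`, `init_spatial_prior`, `generate_mesh`, `generate_novel_views`), the `rounds` parameter, and the choice to rerun all three stages in every round are hypothetical. The three stage callables stand in for the 3D and video generation models, and plain strings stand in for images, priors, and meshes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SceneState:
    """Everything accumulated so far; strings are placeholders for real data."""
    views: List[str]              # 2D observations (input image + generated views)
    prior: Optional[str] = None   # latest spatial prior
    mesh: Optional[str] = None    # latest textured scene mesh


def evolve_scene(
    image: str,
    init_spatial_prior: Callable[[List[str]], str],
    generate_mesh: Callable[[str, List[str]], str],
    generate_novel_views: Callable[[str, List[str]], List[str]],
    rounds: int = 3,  # hypothetical: the abstract does not state an iteration count
) -> SceneState:
    """Alternate between 2D views and a 3D mesh for a fixed number of rounds."""
    state = SceneState(views=[image])
    for _ in range(rounds):
        # Stage 1 (assumed to rerun each round): estimate a spatial prior from the views.
        state.prior = init_spatial_prior(state.views)
        # Stage 2: build a scene mesh from the prior, guided by the 2D views.
        state.mesh = generate_mesh(state.prior, state.views)
        # Stage 3: render/generate new views guided by the mesh, then add them
        # to the view set so the next round sees more of the scene.
        state.views = state.views + generate_novel_views(state.mesh, state.views)
    return state
```

Passing the stages in as callables keeps the loop independent of any particular model, which mirrors the training-free framing: the loop only orchestrates existing generators.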

Kaizhi Zheng, Yue Fan, Jing Gu, Zishuo Xu, Xuehai He, Xin Eric Wang • 2025

Related benchmarks

Task	Dataset	Result
Image-to-3D Generation	100 diverse scene images GPT-4o & GPT-Image-1	Human Pref Win Rate - Geometry Quality: 85	6
3D Scene Generation	100 Diverse Wide-scene Images (test)	LPIPS: 48.2	5
3D Generation	Computational Cost Evaluation	VRAM (GB): 68	5
Image-to-3D scene generation	100 diverse scene images GPT-4o and GPT-Image-1 (test)	CLIP Similarity: 0.8643	4

