Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction
About
We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.
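To make the latent-alignment idea concrete, here is a minimal sketch of the kind of adapter the description implies: a small network maps frozen VGGT tokens to geometric latents, which an alignment loss regularizes toward the appearance latents of the video diffusion model's VAE. All names, dimensions, and the choice of MSE as the alignment loss are illustrative assumptions, not the released Gen3R implementation.

```python
import torch
import torch.nn as nn

class GeometricAdapter(nn.Module):
    """Hypothetical adapter: maps frozen VGGT tokens to geometric latents
    shaped like the appearance latents of a pre-trained video VAE."""
    def __init__(self, vggt_dim=1024, latent_dim=16):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(vggt_dim),
            nn.Linear(vggt_dim, 4 * latent_dim),
            nn.GELU(),
            nn.Linear(4 * latent_dim, latent_dim),
        )

    def forward(self, vggt_tokens):
        # vggt_tokens: (B, T, N, vggt_dim) tokens from a frozen VGGT encoder
        return self.proj(vggt_tokens)  # geometric latents: (B, T, N, latent_dim)

def alignment_loss(geo_latents, app_latents):
    # Regularize geometric latents toward the video VAE's appearance latents
    # so the two latent sets stay disentangled yet aligned for joint generation.
    # MSE is an assumed choice of regularizer here.
    return nn.functional.mse_loss(geo_latents, app_latents)

# Usage sketch with dummy tensors standing in for real encoder outputs.
B, T, N = 2, 8, 256
vggt_tokens = torch.randn(B, T, N, 1024)  # from a frozen VGGT
app_latents = torch.randn(B, T, N, 16)    # from the video diffusion VAE
adapter = GeometricAdapter()
geo_latents = adapter(vggt_tokens)
loss = alignment_loss(geo_latents, app_latents)
```

Keeping VGGT frozen and training only a light adapter, as described above, preserves the reconstruction model's geometric prior while the alignment loss places both latent sets in a space the diffusion model can generate jointly.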
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Novel View Synthesis | CO3D Teddybear | LPIPS | 0.368 | 20 |
| Novel View Synthesis | CO3D Hydrant | LPIPS | 0.341 | 12 |
| 1-view-based novel view generation | RealEstate10K | PSNR | 20.09 | 7 |
| 1-view-based novel view generation | DL3DV-10K | PSNR | 15.94 | 7 |
| Geometry Generation | TartanAir 1-view | Accuracy | 3.025 | 7 |
| Geometry Generation | TartanAir 2-views | Accuracy | 2.2825 | 7 |
| Single-image world generation | WorldScore Indoor | 3D Consistency | 82.12 | 7 |
| Single-image world generation | DL3DV | 3D Consistency | 75.29 | 7 |
| Novel View Synthesis | CO3D Hydrant | FID | 115.8 | 5 |