
Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction

About

We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.
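The core coupling described above, a trainable adapter that maps frozen reconstruction tokens into a latent space regularized to align with a diffusion model's appearance latents, can be illustrated with a toy sketch. All dimensions, the linear adapter, and the plain MSE alignment loss here are illustrative assumptions, not the paper's actual architecture or objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not the paper's actual sizes).
n_tokens, d_recon, d_latent = 16, 64, 32

# Stand-ins for frozen reconstruction tokens (e.g. from VGGT) and the
# target appearance latents of a pre-trained video diffusion model.
tokens = rng.normal(size=(n_tokens, d_recon))
appearance = rng.normal(size=(n_tokens, d_latent))

# Trainable linear adapter, fit by gradient descent on an MSE
# alignment loss between adapted geometric latents and appearance latents.
W = np.zeros((d_recon, d_latent))

def alignment_loss(W):
    geo = tokens @ W  # adapted "geometric" latents
    return np.mean((geo - appearance) ** 2)

lr = 0.01
for _ in range(500):
    geo = tokens @ W
    grad = 2.0 * tokens.T @ (geo - appearance) / geo.size
    W -= lr * grad
```

Once the two latent spaces are aligned, a generator can produce both jointly, which is what lets Gen3R emit RGB frames and geometry (poses, depth, point clouds) from the same sampling process.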

Jiaxin Huang, Yuanbo Yang, Bangbang Yang, Lin Ma, Yuewen Ma, Yiyi Liao • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Novel View Synthesis | CO3D Teddybear | LPIPS | 0.368 | 20 |
| Novel View Synthesis | CO3D Hydrant | LPIPS | 0.341 | 12 |
| Novel View Synthesis | CO3D Hydrant | FID | 115.8 | 5 |
| 1-view-based novel view generation | RealEstate10K | PSNR | 20.09 | 7 |
| 1-view-based novel view generation | DL3DV-10K | PSNR | 15.94 | 7 |
| Geometry Generation | TartanAir 1-view | Accuracy | 3.025 | 7 |
| Geometry Generation | TartanAir 2-views | Accuracy | 2.2825 | 7 |
| Single-image world generation | WorldScore Indoor | 3D Consistency | 82.12 | 7 |
| Single-image world generation | DL3DV | 3D Consistency | 75.29 | 7 |

Other info

GitHub
