RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations
About
Humans perceive the 3D world through 2D observations from limited viewpoints. While recent feed-forward generalizable 3D reconstruction models excel at recovering 3D structure from sparse images, their representations are often confined to the observed regions, leaving unseen geometry unmodeled. This raises a fundamental challenge: can we infer a complete 3D structure from partial 2D observations? We present RnG (Reconstruction and Generation), a novel feed-forward Transformer that unifies these two tasks by predicting an implicit, complete 3D representation. At the core of RnG, we propose a reconstruction-guided causal attention mechanism that separates reconstruction from generation at the attention level and treats the KV-cache as an implicit 3D representation. Arbitrary poses can then efficiently query this cache to render high-fidelity, novel-view RGBD outputs. As a result, RnG not only accurately reconstructs visible geometry but also generates plausible, coherent unseen geometry and appearance. Our method achieves state-of-the-art performance in both generalizable 3D reconstruction and novel view generation, while operating efficiently enough for real-time interactive applications. Project page: https://npucvr.github.io/RnG
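To make the "KV-cache as implicit 3D representation" idea concrete, here is a minimal single-layer sketch: observed-view tokens are encoded once into cached keys/values, and novel-view pose tokens later attend over that cache to produce renderable features. All names, sizes, and projection matrices below are illustrative assumptions, not the paper's actual architecture (which uses a full Transformer with reconstruction-guided causal attention and an RGBD decoder).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 32  # hypothetical token dimension

# Encode observed views once; their keys/values form the cached
# "implicit 3D representation" (stand-in for the model's KV-cache).
obs_tokens = rng.normal(size=(96, d))          # tokens from posed input images
Wq = rng.normal(size=(d, d)) / np.sqrt(d)      # toy projection matrices
Wk = rng.normal(size=(d, d)) / np.sqrt(d)
Wv = rng.normal(size=(d, d)) / np.sqrt(d)
kv_cache = (obs_tokens @ Wk, obs_tokens @ Wv)  # computed once, reused per query

def render_query(pose_tokens, kv_cache):
    """One cross-attention step: novel-view pose tokens query the cached K/V."""
    K, V = kv_cache
    Q = pose_tokens @ Wq
    attn = softmax(Q @ K.T / np.sqrt(d))  # (num_queries, num_cached)
    return attn @ V  # features a decoder would map to RGBD (decoder omitted)

pose_tokens = rng.normal(size=(16, d))  # tokens derived from a novel camera pose
feats = render_query(pose_tokens, kv_cache)
print(feats.shape)  # (16, 32)
```

The design point this illustrates is the efficiency claim: the expensive encoding of input views happens once, and each new viewpoint only pays for a lightweight query against the cache.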
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Novel View Synthesis | Google Scanned Objects (GSO) (test) | PSNR | 26.276 | 24 |
| 3D Reconstruction | GSO (test) | Chamfer Distance (CD) | 0.0067 | 8 |
| Source View Depth Estimation | GSO (test) | Relative Error (Rel) | 0.584 | 8 |
| Novel View Depth Estimation | GSO (test) | Relative Error (Rel) | 0.717 | 5 |
| Pose Estimation | GSO (test) | RA@5 | 85.146 | 5 |