NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction
About
We present NOVA3R, an effective feed-forward approach for non-pixel-aligned 3D reconstruction from a set of unposed images. Unlike pixel-aligned methods, which tie geometry to per-ray predictions, our formulation learns a global, view-agnostic scene representation that decouples reconstruction from pixel alignment. This addresses two key limitations of pixel-aligned 3D reconstruction: NOVA3R (1) recovers both visible and invisible points as a complete scene representation, and (2) produces physically plausible geometry with fewer duplicated structures in overlapping regions. To achieve this, we introduce a scene-token mechanism that aggregates information across unposed images and a diffusion-based 3D decoder that reconstructs complete, non-pixel-aligned point clouds. Extensive experiments on both scene-level and object-level datasets demonstrate that NOVA3R outperforms state-of-the-art methods in both reconstruction accuracy and completeness.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| 3D Scene Reconstruction | 7-Scenes (test) | -- | 27 |
| Scene Completion | SCRREAM Complete | CD 4.8 | 15 |
| 3D Reconstruction | SCRREAM Occluded | Chamfer Distance (CD) 3.56 | 10 |
| 3D Reconstruction | SCRREAM Complete | CD 3.43 | 10 |
| Scene Completion | SCRREAM Visible | CD 0.043 | 10 |
| 3D Reconstruction | SCRREAM Visible | CD 3.2 | 10 |
| 3D Reconstruction | NRGB-D (Occluded) | Chamfer Distance 8.31 | 8 |
| 3D Reconstruction | NRGB-D (Complete) | Chamfer Distance (CD) 8.19 | 8 |
| 3D Reconstruction | NRGB-D Visible | Chamfer Distance (CD) 5.05 | 8 |
| Object Completion | GSO K=1 (1030-object) | CD 0.02 | 6 |
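
Most results above are reported as Chamfer Distance (CD) between a predicted and a reference point cloud. The listing does not state which CD variant each benchmark uses (squared or unsquared distances, sum or mean, any scaling factor), so the following is only a minimal sketch of one common symmetric form, assuming unsquared Euclidean distances averaged in each direction. The function name `chamfer_distance` and the brute-force pairwise computation are illustrative choices, not the benchmarks' evaluation code.

```python
import numpy as np


def chamfer_distance(pred: np.ndarray, ref: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets of shape (N, 3) and (M, 3).

    Assumed variant: mean unsquared nearest-neighbour distance, summed over
    both directions. Benchmarks may use a different convention.
    """
    # Pairwise Euclidean distances via broadcasting, shape (N, M).
    diff = pred[:, None, :] - ref[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Accuracy term: each predicted point to its nearest reference point.
    pred_to_ref = dist.min(axis=1).mean()
    # Completeness term: each reference point to its nearest predicted point.
    ref_to_pred = dist.min(axis=0).mean()

    return float(pred_to_ref + ref_to_pred)


# Toy example: the prediction covers only one of two reference points.
ref = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
pred = np.array([[0.0, 0.0, 0.0]])
print(chamfer_distance(pred, ref))  # 0.5
```

In the toy example the accuracy term is zero, because the single predicted point lies exactly on the reference, and the whole error comes from the completeness term. The same effect makes CD on a complete or occluded ground truth sensitive to geometry a reconstruction leaves out. The brute-force distance matrix needs O(NM) memory; for large point clouds a spatial index such as a k-d tree is the usual substitute.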