PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
About
We introduce PixARMesh, a method to autoregressively reconstruct complete 3D indoor scene meshes directly from a single RGB image. Unlike prior methods that rely on implicit signed distance fields and post-hoc layout optimization, PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist-ready meshes in a single forward pass. Building on recent advances in mesh generative models, we augment a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image. Scenes are generated autoregressively from a unified token stream containing context, pose, and mesh, yielding compact meshes with high-fidelity geometry. Experiments on synthetic and real-world datasets show that PixARMesh achieves state-of-the-art reconstruction quality while producing lightweight, high-quality meshes ready for downstream applications.
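The paper does not release code here, but the core conditioning idea — projecting 3D points into the image to gather pixel-aligned features, then fusing them with global scene context via cross-attention — can be sketched in a minimal numpy form. All names (`project_points`, `sample_features`, `cross_attention`), the nearest-neighbor sampling, the single attention head, and the residual fusion are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def project_points(points, K):
    # points: (N, 3) camera-space coordinates; K: (3, 3) intrinsics.
    # Returns (N, 2) pixel coordinates via perspective projection.
    uv = (K @ points.T).T
    return uv[:, :2] / uv[:, 2:3]

def sample_features(feat_map, uv):
    # feat_map: (H, W, C) image feature grid. Nearest-neighbor lookup
    # (a real system would use bilinear sampling) -> (N, C).
    H, W, _ = feat_map.shape
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
    return feat_map[v, u]

def cross_attention(queries, keys, values):
    # Single-head scaled dot-product attention: each query (point token)
    # attends over all key/value tokens (global scene context).
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ values

rng = np.random.default_rng(0)
pts = rng.normal(size=(64, 3)) + np.array([0.0, 0.0, 4.0])  # points in front of camera
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 64.0],
              [0.0,   0.0,  1.0]])
feat = rng.normal(size=(128, 128, 32))   # stand-in for an image feature map
ctx = rng.normal(size=(8, 32))           # stand-in global scene-context tokens

pix_feats = sample_features(feat, project_points(pts, K))       # (64, 32) pixel-aligned
fused = pix_feats + cross_attention(pix_feats, ctx, ctx)        # residual fusion, (64, 32)
```

The fused per-point features would then condition the point-cloud encoder; the actual attention layout, head count, and fusion scheme in PixARMesh are unspecified here.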
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| 3D Scene Reconstruction | ScanNet, Matterport3D, Pix3D | Runtime (s) | 4.5 | 9 |
| 3D Scene Reconstruction | 3D-FRONT | F Value | 7.51e+3 | 9 |
| Scene Reconstruction | 3D-FRONT | CD | 0.0984 | 8 |
| Object Pose Accuracy | 3D-FRONT | Box IoU | 70.37 | 7 |
| Object Reconstruction | 3D-FRONT | Chamfer Distance (CD) | 0.004 | 7 |