Neural 3D Scene Reconstruction with the Manhattan-world Assumption
About
This paper addresses the challenge of reconstructing 3D indoor scenes from multi-view images. Many previous works have shown impressive reconstruction results on textured objects, but they still struggle with low-textured planar regions, which are common in indoor scenes. One approach to this issue is to incorporate planar constraints into the depth map estimation of multi-view stereo-based methods, but the per-view plane estimation and depth optimization lack both efficiency and multi-view consistency. In this work, we show that planar constraints can be conveniently integrated into recent implicit neural representation-based reconstruction methods. Specifically, we use an MLP network to represent the signed distance function as the scene geometry. Based on the Manhattan-world assumption, planar constraints are employed to regularize the geometry in floor and wall regions predicted by a 2D semantic segmentation network. To resolve inaccurate segmentation, we encode the semantics of 3D points with another MLP and design a novel loss that jointly optimizes the scene geometry and semantics in 3D space. Experiments on the ScanNet and 7-Scenes datasets show that the proposed method outperforms previous methods by a large margin in 3D reconstruction quality. The code is available at https://zju3dv.github.io/manhattan_sdf.
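To illustrate the idea of Manhattan-world planar regularization, the sketch below shows one plausible form of such a loss: surface normals (e.g., derived from the gradient of the SDF) at points labeled as floor are encouraged to align with the gravity-up axis, while normals at points labeled as wall are encouraged to be perpendicular to it. This is a minimal NumPy sketch for intuition only; the function name, label encoding, and exact loss form are assumptions, not the paper's actual implementation.

```python
import numpy as np

def manhattan_normal_loss(normals, labels):
    """Hedged sketch of a Manhattan-world normal regularizer.

    normals: (N, 3) array of unit surface normals (e.g., from the SDF gradient)
    labels:  (N,) integer semantic labels; here 1 = floor, 2 = wall (assumed encoding)
    """
    up = np.array([0.0, 0.0, 1.0])          # gravity-up axis (assumed z-up convention)
    cos = normals @ up                       # cosine between each normal and the up axis
    floor_loss = np.abs(1.0 - cos[labels == 1]).sum()  # floor normals should point up
    wall_loss = np.abs(cos[labels == 2]).sum()         # wall normals should be horizontal
    n = max(int((labels > 0).sum()), 1)                # average over planar-region points
    return (floor_loss + wall_loss) / n
```

In practice such a term would be added to the rendering and geometry losses, with the semantic labels themselves jointly optimized rather than fixed, as described in the abstract.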
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Geometry Reconstruction | ScanNet | Accuracy | 7.2 | 54 |
| 3D Reconstruction | 7 Scenes | -- | -- | 32 |
| 3D Scene Reconstruction | ScanNet v2 (test) | Accuracy | 0.072 | 26 |
| Scene-level 3D Reconstruction | ScanNet (test) | F-score | 68.8 | 20 |
| 3D Reconstruction | ScanNet | F-score | 60.2 | 13 |
| 3D Scene Reconstruction | ScanNet | Accuracy | 4.4 | 9 |
| Scene-level 3D Reconstruction | ScanNet | Accuracy | 7.2 | 8 |
| Scene-level reconstruction | ScanNet | Chamfer Distance (L1) | 0.07 | 8 |