
Atlas: End-to-End 3D Scene Reconstruction from Posed Images

About

We present an end-to-end 3D reconstruction method for a scene by directly regressing a truncated signed distance function (TSDF) from a set of posed RGB images. Traditional approaches to 3D reconstruction rely on an intermediate representation of depth maps prior to estimating a full 3D model of a scene. We hypothesize that a direct regression to 3D is more effective. A 2D CNN extracts features from each image independently, which are then back-projected and accumulated into a voxel volume using the camera intrinsics and extrinsics. After accumulation, a 3D CNN refines the accumulated features and predicts the TSDF values. Additionally, semantic segmentation of the 3D model is obtained without significant extra computation. This approach is evaluated on the ScanNet dataset, where we significantly outperform state-of-the-art baselines (deep multi-view stereo followed by traditional TSDF fusion) both quantitatively and qualitatively. We compare our 3D semantic segmentation to prior methods that use a depth sensor, since no previous work attempts the problem with only RGB input.
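The back-projection step described above can be sketched in a few lines of numpy. This is a simplified illustration, not the paper's implementation: it projects every voxel center through one camera's intrinsics and extrinsics and gathers the 2D feature at the resulting pixel (the function name, nearest-neighbor sampling, and single-view accumulation are assumptions for clarity; Atlas accumulates a running average over many views and uses learned features).

```python
import numpy as np

def backproject_features(feat2d, K, T_cam_world, origin, voxel_size, grid_dims):
    """Scatter a 2D feature map into a voxel volume (illustrative sketch).

    feat2d:       (C, H, W) feature map from a 2D CNN
    K:            (3, 3) camera intrinsics
    T_cam_world:  (4, 4) world-to-camera extrinsics
    origin:       (3,) world position of voxel (0, 0, 0)
    voxel_size:   voxel edge length in meters
    grid_dims:    (nx, ny, nz) voxel grid size
    """
    C, H, W = feat2d.shape
    nx, ny, nz = grid_dims

    # World coordinates of every voxel center
    ii, jj, kk = np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nz),
                             indexing="ij")
    pts = np.stack([ii, jj, kk], axis=-1).reshape(-1, 3) * voxel_size + origin

    # Transform to the camera frame, then project with the intrinsics
    pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)  # (N, 4)
    cam = (T_cam_world @ pts_h.T)[:3]        # (3, N) camera-frame points
    uvd = K @ cam                            # homogeneous pixel coordinates
    z = uvd[2]
    valid = z > 1e-6                         # keep voxels in front of the camera
    u = np.round(uvd[0] / np.where(valid, z, 1.0)).astype(int)
    v = np.round(uvd[1] / np.where(valid, z, 1.0)).astype(int)
    valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)

    # Gather features for visible voxels; voxels outside the frustum get zeros
    vol = np.zeros((C, pts.shape[0]), dtype=feat2d.dtype)
    vol[:, valid] = feat2d[:, v[valid], u[valid]]
    return vol.reshape(C, nx, ny, nz), valid.reshape(nx, ny, nz)
```

To fuse multiple posed images, one would call this per view, sum the volumes, and divide by the per-voxel visibility count before handing the result to the 3D CNN.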

Zak Murez, Tarrence van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, Andrew Rabinovich · 2020

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| 3D Semantic Segmentation | ScanNet v2 (test) | mIoU | 34 | 110 |
| 3D Semantic Segmentation | ScanNet (test) | mIoU | 34 | 105 |
| 3D Semantic Segmentation | ScanNet (val) | mIoU | 36.8 | 100 |
| 3D Geometry Reconstruction | ScanNet | Accuracy | 13 | 54 |
| 3D Semantic Occupancy Prediction | SurroundOcc-nuScenes (val) | IoU | 28.66 | 31 |
| 2D Depth Estimation | ScanNet | AbsRel | 0.061 | 26 |
| 3D Scene Reconstruction | ScanNet v2 (test) | Accuracy | 0.084 | 26 |
| 3D Semantic Occupancy Prediction | nuScenes 1.0 (val) | IoU | 28.66 | 21 |
| Depth Estimation | TUM-RGBD | Abs Rel Error | 0.163 | 16 |
| 3D Semantic Occupancy Prediction | SurroundOcc v1.0 (test) | IoU | 28.66 | 15 |

Showing 10 of 24 rows.
