MVS2D: Efficient Multi-view Stereo via Attention-Driven 2D Convolutions
About
Deep learning has made significant impacts on multi-view stereo systems. State-of-the-art approaches typically build a cost volume and then apply multiple 3D convolution operations to recover the input image's pixel-wise depth. While such end-to-end learning of plane-sweeping stereo advances accuracy on public benchmarks, these methods are typically very slow to compute. We present MVS2D, a highly efficient multi-view stereo algorithm that seamlessly integrates multi-view constraints into single-view networks via an attention mechanism. Since MVS2D builds only on 2D convolutions, it is at least 2× faster than all notable counterparts. Moreover, our algorithm produces precise depth estimations and 3D reconstructions, achieving state-of-the-art results on the challenging ScanNet, SUN3D, and RGBD benchmarks, as well as the classical DTU dataset. Our algorithm also outperforms all other algorithms in the setting of inexact camera poses. Our code is released at https://github.com/zhenpeiyang/MVS2D
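The core idea of the abstract — using attention to inject multi-view evidence into a 2D network instead of filtering a 3D cost volume — can be illustrated with a minimal per-pixel sketch. This is not the paper's implementation; the function name and the assumption that source-view features are already sampled (e.g. along epipolar lines) are illustrative.

```python
import math

def attention_fuse(ref_feat, src_feats):
    """Fuse features from source views into the reference view via
    dot-product attention. A hedged sketch of attention-based multi-view
    aggregation, not the actual MVS2D code; names are illustrative.

    ref_feat:  feature vector of one reference-image pixel
    src_feats: list of feature vectors sampled from the source views
    """
    # Attention scores: similarity between the reference feature and
    # each sampled source-view feature.
    scores = [sum(r * s for r, s in zip(ref_feat, f)) for f in src_feats]
    # Numerically stable softmax over the sampled candidates.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of source features: the multi-view evidence that a
    # plain 2D convolutional branch can then consume.
    fused = [sum(w * f[i] for w, f in zip(weights, src_feats))
             for i in range(len(ref_feat))]
    return fused, weights
```

Because the aggregation happens per pixel before any further convolution, the rest of the network can stay purely 2D, which is where the claimed speedup over 3D cost-volume filtering comes from.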
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Monocular Depth Estimation | DDAD (test) | RMSE | 9.82 | 122 |
| Monocular Depth Estimation | KITTI (test) | Abs Rel Error | 0.058 | 103 |
| Depth Estimation | ScanNet | AbsRel | 0.098 | 94 |
| Multi-view Stereo | DTU (test) | -- | -- | 61 |
| Multi-view Depth Estimation | DDAD (test) | AbsRel | 0.133 | 40 |
| Multi-view Stereo Reconstruction | DTU (evaluation) | Mean Distance (mm) - Acc. | 0.394 | 35 |
| Multi-view Depth Estimation | ScanNet (test) | Abs Rel | 0.059 | 23 |
| Depth Estimation | ScanNet v1 (test) | AbsRel | 0.059 | 11 |
| Video Depth Estimation | ScanNet++ | Absolute Relative Error | 27.2 | 10 |
| Depth Estimation | SUN3D (Real) | AbsRel | 0.099 | 7 |