MVS2D: Efficient Multi-view Stereo via Attention-Driven 2D Convolutions
About
Deep learning has made significant impacts on multi-view stereo systems. State-of-the-art approaches typically build a cost volume and then apply multiple 3D convolution operations to recover the input image's pixel-wise depth. While such end-to-end learning of plane-sweeping stereo advances accuracy on public benchmarks, these methods are typically very slow to compute. We present MVS2D, a highly efficient multi-view stereo algorithm that seamlessly integrates multi-view constraints into single-view networks via an attention mechanism. Since MVS2D builds only on 2D convolutions, it is at least 2× faster than all notable counterparts. Moreover, our algorithm produces precise depth estimations and 3D reconstructions, achieving state-of-the-art results on the challenging ScanNet, SUN3D, and RGBD benchmarks, as well as the classical DTU dataset. Our algorithm also outperforms all other algorithms in the setting of inexact camera poses. Our code is released at https://github.com/zhenpeiyang/MVS2D
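The core idea of the abstract — using attention to inject multi-view evidence into a 2D network instead of filtering a 3D cost volume — can be illustrated with a minimal per-pixel sketch. This is not the paper's implementation; the function name and the assumption that source-view features are already sampled (e.g. along epipolar lines) are illustrative.

```python
import math

def attention_fuse(ref_feat, src_feats):
    """Fuse features from source views into the reference view via
    dot-product attention. A hedged sketch of attention-based multi-view
    aggregation, not the actual MVS2D code; names are illustrative.

    ref_feat:  feature vector of one reference-image pixel
    src_feats: list of feature vectors sampled from the source views
    """
    # Attention scores: similarity between the reference feature and
    # each sampled source-view feature.
    scores = [sum(r * s for r, s in zip(ref_feat, f)) for f in src_feats]
    # Numerically stable softmax over the sampled candidates.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of source features: the multi-view evidence that a
    # plain 2D convolutional branch can then consume.
    fused = [sum(w * f[i] for w, f in zip(weights, src_feats))
             for i in range(len(ref_feat))]
    return fused, weights
```

Because the aggregation happens per pixel before any further convolution, the rest of the network can stay purely 2D, which is where the claimed speedup over 3D cost-volume filtering comes from.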
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Monocular Depth Estimation | DDAD (test) | RMSE | 9.82 | 122 |
| Monocular Depth Estimation | KITTI (test) | Abs Rel Error | 0.058 | 103 |
| Depth Estimation | ScanNet | AbsRel | 0.098 | 94 |
| Multi-view Stereo | DTU (test) | -- | -- | 61 |
| Multi-view Depth Estimation | DDAD (test) | AbsRel | 0.133 | 40 |
| Multi-view Stereo Reconstruction | DTU (evaluation) | Mean Distance (mm) - Acc. | 0.394 | 35 |
| Multi-view Depth Estimation | ScanNet (test) | Abs Rel | 0.059 | 23 |
| Depth Estimation | ScanNet v1 (test) | AbsRel | 0.059 | 11 |
| Video Depth Estimation | ScanNet++ | Absolute Relative Error | 27.2 | 10 |
| Depth Estimation | SUN3D (Real) | AbsRel | 0.099 | 7 |