
WT-MVSNet: Window-based Transformers for Multi-view Stereo

About

Recently, Transformers have been shown to enhance the performance of multi-view stereo by enabling long-range feature interaction. In this work, we propose Window-based Transformers (WT) for local feature matching and global feature aggregation in multi-view stereo. We introduce a Window-based Epipolar Transformer (WET), which reduces matching redundancy by using epipolar constraints. Since point-to-line matching is sensitive to erroneous camera pose and calibration, we match windows near the epipolar lines. A second Shifted WT is employed for aggregating global information within the cost volume. We present a novel Cost Transformer (CT) to replace 3D convolutions for cost volume regularization. To better constrain the estimated depth maps from multiple views, we further design a novel geometric consistency loss (Geo Loss), which penalizes unreliable areas where multi-view consistency is not satisfied. Our WT multi-view stereo method (WT-MVSNet) achieves state-of-the-art performance across multiple datasets and ranks $1^{st}$ on the Tanks and Temples benchmark.
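
The abstract does not say how windows near an epipolar line are selected, so the following is only an illustrative sketch of the underlying geometry, not the authors' implementation. It assumes a known fundamental matrix `F` between a reference and a source view, and the helper names (`epipolar_line`, `point_line_distance`, `windows_near_line`) and the `max_dist` threshold are hypothetical. It computes the epipolar line of a reference pixel and keeps the source-image windows whose centres lie close to that line, which tolerates small pose or calibration errors better than matching strictly on the line.

```python
import math

def epipolar_line(F, x):
    """Line (a, b, c) with a*u + b*v + c = 0 in the source image for
    reference pixel x = (u, v), using the standard relation l' = F x."""
    h = (x[0], x[1], 1.0)  # homogeneous pixel coordinates
    return tuple(sum(F[r][c] * h[c] for c in range(3)) for r in range(3))

def point_line_distance(line, p):
    """Perpendicular distance in pixels from point p = (u, v) to the line."""
    a, b, c = line
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("degenerate epipolar line")
    return abs(a * p[0] + b * p[1] + c) / norm

def windows_near_line(line, width, height, window, max_dist):
    """Top-left corners of non-overlapping window x window tiles whose
    centre lies within max_dist pixels of the line (hypothetical rule)."""
    selected = []
    for ty in range(0, height - window + 1, window):
        for tx in range(0, width - window + 1, window):
            centre = (tx + window / 2.0, ty + window / 2.0)
            if point_line_distance(line, centre) <= max_dist:
                selected.append((tx, ty))
    return selected

# Toy example: a pure horizontal camera translation, for which the
# epipolar line of pixel (u, v) is simply the row v' = v.
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]
line = epipolar_line(F, (3.0, 5.0))
tiles = windows_near_line(line, width=16, height=8, window=4, max_dist=2.0)
print(line, tiles)
```

In this toy setup only the four tiles in the lower row of the 16x8 image are kept, so matching would consider half of the windows instead of all of them. Widening `max_dist` trades more redundancy for more tolerance to pose error.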

Jinli Liao, Yikang Ding, Yoli Shavit, Dihe Huang, Shihao Ren, Jia Guo, Wensen Feng, Kai Zhang • 2022

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multi-view Stereo | Tanks and Temples Intermediate set | Mean F1 Score: 65.34 | 110 |
| Multi-view Stereo | Tanks & Temples Advanced | Mean F-score: 39.91 | 71 |
| Multi-view Stereo | DTU (test) | Accuracy: 30.9 | 61 |
| Multi-view Stereo | Tanks&Temples | Family: 81.87 | 46 |
