ETO: Efficient Transformer-based Local Feature Matching by Organizing Multiple Homography Hypotheses
About
We tackle the efficiency problem of learned local feature matching. Recent advances have produced both purely CNN-based and transformer-based approaches. CNN-based methods often excel in matching speed, while transformer-based methods tend to provide more accurate matches. We propose an efficient transformer-based network architecture for local feature matching. It is built on two ideas: constructing multiple homography hypotheses to approximate the continuous correspondence in the real world, and using uni-directional cross-attention to accelerate the refinement. On the YFCC100M dataset, our matching accuracy is competitive with LoFTR, a state-of-the-art transformer-based architecture, while inference is 4 times faster, even outperforming the CNN-based methods. Comprehensive evaluations on other open datasets such as MegaDepth, ScanNet, and HPatches demonstrate our method's efficacy and its potential to benefit a wide range of downstream applications.
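To make the idea of multiple homography hypotheses concrete, the following is a minimal sketch, not the ETO implementation. It assumes the source image is divided into square patches and that each patch carries its own 3x3 homography; the function name `piecewise_warp`, the `(rows, cols, 3, 3)` layout, and the patch size are hypothetical choices made for illustration.

```python
import numpy as np

def piecewise_warp(pts, homographies, patch_size):
    """Warp each point with the homography of the patch it falls in.

    pts:          (N, 2) array of (x, y) pixel coordinates in the source image.
    homographies: (rows, cols, 3, 3) array, one hypothesis per patch
                  (assumed layout, for illustration only).
    patch_size:   side length of a square patch in pixels.
    """
    pts = np.asarray(pts, dtype=float)
    # Index of the patch containing each point.
    cols = (pts[:, 0] // patch_size).astype(int)
    rows = (pts[:, 1] // patch_size).astype(int)
    H = homographies[rows, cols]                      # (N, 3, 3)
    # Lift to homogeneous coordinates and apply the per-point homography.
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
    out = np.einsum("nij,nj->ni", H, pts_h)
    # Project back to pixel coordinates.
    return out[:, :2] / out[:, 2:3]

# A 2x2 grid of identity hypotheses, with one patch shifted 5 px along x.
hyps = np.tile(np.eye(3), (2, 2, 1, 1))
hyps[0, 1, 0, 2] = 5.0
warped = piecewise_warp([[3.0, 3.0], [20.0, 3.0]], hyps, patch_size=16)
```

Each patch-level homography is a local planar approximation, so a grid of them can follow a correspondence field that no single global homography would fit.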
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Outdoor Pose Estimation | MegaDepth (test) | AUC @ 5° | 51.7 | 10 |
| Outdoor Pose Estimation | YFCC100M (test) | AUC @ 5° | 44.8 | 8 |
| Indoor Pose Estimation | ScanNet 32 (test) | AUC @ 5° | 20.1 | 6 |
| Homography Estimation | HPatches 52 sequences (illumination) + 56 sequences (viewpoint) v1.0 | Avg Corner Error (1px) | 42 | 5 |
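For readers unfamiliar with the AUC @ 5° metric, here is a sketch of one common way to compute it: the area under the cumulative recall curve of per-pair pose errors, normalised by the threshold. Whether the numbers listed here were produced with exactly this procedure is an assumption, and `pose_auc` is a hypothetical helper name. The input is taken to be one angular pose error in degrees per image pair.

```python
import numpy as np

def pose_auc(errors, thresholds):
    """Area under the recall-vs-error curve up to each threshold, in [0, 1]."""
    errors = np.sort(np.asarray(errors, dtype=float))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    # Prepend the origin so the curve starts at (0, 0).
    errors = np.r_[0.0, errors]
    recall = np.r_[0.0, recall]
    aucs = []
    for t in thresholds:
        # Keep the part of the curve below the threshold, then extend it flat to t.
        last = np.searchsorted(errors, t)
        e = np.r_[errors[:last], t]
        r = np.r_[recall[:last], recall[last - 1]]
        # Trapezoidal integration, normalised by the threshold.
        area = np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1]) / 2.0)
        aucs.append(area / t)
    return aucs

# Four image pairs with pose errors of 1, 2, 3 and 10 degrees.
auc_at_5 = pose_auc([1.0, 2.0, 3.0, 10.0], thresholds=[5.0])[0]
```

Multiplying the returned value by 100 gives a percentage on the same scale as the table entries.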