ETO: Efficient Transformer-based Local Feature Matching by Organizing Multiple Homography Hypotheses

About

We tackle the efficiency problem of learning local feature matching. Recent advancements have given rise to purely CNN-based and transformer-based approaches, each augmented with deep learning techniques. While CNN-based methods often excel in matching speed, transformer-based methods tend to provide more accurate matches. We propose an efficient transformer-based network architecture for local feature matching. The technique builds multiple homography hypotheses to approximate the continuous correspondence in the real world and uses uni-directional cross-attention to accelerate the refinement. On the YFCC100M dataset, our matching accuracy is competitive with LoFTR, a state-of-the-art transformer-based architecture, while inference is 4 times faster, even outperforming the CNN-based methods. Comprehensive evaluations on other open datasets such as MegaDepth, ScanNet, and HPatches demonstrate our method's efficacy, highlighting its potential to significantly enhance a wide array of downstream applications.
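To make the idea of a homography hypothesis concrete, here is a minimal NumPy sketch of how a single 3x3 homography maps pixel coordinates from a patch in one image to the other image. It is not the authors' implementation: the function name `apply_homography`, the example matrix `H_shift`, and the patch coordinates are all hypothetical and chosen only for illustration.

```python
import numpy as np

def apply_homography(H, pts):
    """Map (N, 2) pixel coordinates through a 3x3 homography H."""
    pts = np.asarray(pts, dtype=float)
    # Lift to homogeneous coordinates: (x, y) -> (x, y, 1).
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ np.asarray(H, dtype=float).T
    # Divide by the third coordinate to return to pixel coordinates.
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical hypothesis for one patch: a pure translation by (10, -5).
H_shift = np.array([[1.0, 0.0, 10.0],
                    [0.0, 1.0, -5.0],
                    [0.0, 0.0, 1.0]])

# Three corners of a hypothetical 8x8 patch.
patch_pts = np.array([[0.0, 0.0], [8.0, 0.0], [8.0, 8.0]])
warped = apply_homography(H_shift, patch_pts)
```

Assigning a separate matrix of this kind to each patch gives a piecewise-planar approximation of the correspondence field, which is the general intuition behind organizing several hypotheses rather than one global homography.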

Junjie Ni, Guofeng Zhang, Guanglin Li, Yijin Li, Xinyang Liu, Zhaoyang Huang, Hujun Bao • 2024

Related benchmarks

Task	Dataset	Result
Outdoor Pose Estimation	MegaDepth (test)	AUC @ 5°: 51.7	10
Outdoor Pose Estimation	YFCC100M (test)	AUC @ 5°: 44.8	8
Indoor Pose Estimation	ScanNet 32 (test)	AUC @ 5°: 20.1	6
Homography Estimation	HPatches 52 sequences (illumination) + 56 sequences (viewpoint) v1.0	Avg Corner Error (1px): 42	5
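
For readers unfamiliar with the AUC @ 5° metric, the sketch below shows one common way pose-error AUC is computed in the feature matching literature: the area under the cumulative recall curve of per-pair pose errors, up to the threshold, normalized by the threshold. Whether the numbers in the table were produced with exactly this procedure is an assumption; the function name `pose_auc` and the sample errors are hypothetical.

```python
import numpy as np

def pose_auc(errors, threshold):
    """Area under the recall-vs-error curve on [0, threshold], normalized to [0, 1].

    errors: per-image-pair pose errors in degrees.
    """
    errors = np.sort(np.asarray(errors, dtype=float))
    # Recall after each sorted error: fraction of pairs with error <= that value.
    recall = (np.arange(len(errors)) + 1) / len(errors)
    # Start the curve at (0, 0).
    errors = np.concatenate([[0.0], errors])
    recall = np.concatenate([[0.0], recall])
    # Keep the points below the threshold and extend the curve flat to it.
    last = np.searchsorted(errors, threshold)
    e = np.concatenate([errors[:last], [threshold]])
    r = np.concatenate([recall[:last], [recall[last - 1]]])
    # Trapezoidal integration, written out to avoid version-specific NumPy helpers.
    area = np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1]) / 2.0)
    return area / threshold

# Hypothetical errors for three image pairs, in degrees.
auc_at_5 = pose_auc([1.0, 2.0, 10.0], threshold=5.0)
```

Multiplying the returned value by 100 gives a percentage on the same scale as the table entries.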
