MatchFormer: Interleaving Attention in Transformers for Feature Matching

About

Local feature matching is a computationally intensive task at the subpixel level. Detector-based methods coupled with feature descriptors struggle in low-texture scenes, while CNN-based methods with a sequential extract-to-match pipeline fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a match-aware encoder relieves the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor scenes or when less outdoor training data is available. Thanks to this strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method in indoor pose estimation, our lite MatchFormer requires only 45% of the GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The large MatchFormer reaches state-of-the-art on four different benchmarks: indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatches), and visual localization (InLoc).
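The interleaved extract-and-match idea can be illustrated with a minimal NumPy sketch. It is not MatchFormer's implementation. All function names are hypothetical, attention is single-head with no learned projections, the channel width stays fixed across stages, and 2x2 average pooling stands in for whatever downsampling the real encoder uses. What it does show is the control flow: within every stage, each block first applies self-attention to each image's own tokens (extraction) and then cross-attention between the two images (matching), and later stages work on coarser token grids, so both images end up with multi-scale, match-aware features.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query, context):
    # Simplified single-head scaled dot-product attention without learned
    # projections: every query token takes a weighted average of context tokens.
    d = query.shape[-1]
    weights = softmax(query @ context.T / np.sqrt(d))
    return weights @ context


def interleaved_block(feat_a, feat_b):
    # Self-attention: each image attends to its own tokens (feature extraction).
    feat_a = feat_a + attention(feat_a, feat_a)
    feat_b = feat_b + attention(feat_b, feat_b)
    # Cross-attention: each image attends to the other one (feature matching).
    # Both updates read the pre-cross-attention features, keeping them symmetric.
    new_a = feat_a + attention(feat_a, feat_b)
    new_b = feat_b + attention(feat_b, feat_a)
    return new_a, new_b


def downsample(feat, h, w):
    # Hypothetical stand-in for a strided patch embedding: 2x2 average pooling
    # over an (h*w, c) row-major token grid; h and w must be even.
    c = feat.shape[-1]
    grid = feat.reshape(h // 2, 2, w // 2, 2, c)
    return grid.mean(axis=(1, 3)).reshape(-1, c), h // 2, w // 2


def hierarchical_encoder(feat_a, feat_b, h, w, num_stages=3, blocks_per_stage=2):
    # Returns (features of A, features of B, grid size) for every stage.
    outputs = []
    for stage in range(num_stages):
        if stage > 0:
            feat_a, _, _ = downsample(feat_a, h, w)
            feat_b, h, w = downsample(feat_b, h, w)
        for _ in range(blocks_per_stage):
            feat_a, feat_b = interleaved_block(feat_a, feat_b)
        outputs.append((feat_a, feat_b, (h, w)))
    return outputs


# Usage: two random 8x8 grids of 16-dimensional tokens.
rng = np.random.default_rng(0)
img_a = rng.standard_normal((64, 16))
img_b = rng.standard_normal((64, 16))
for fa, fb, size in hierarchical_encoder(img_a, img_b, 8, 8):
    print(fa.shape, size)
# (64, 16) (8, 8)
# (16, 16) (4, 4)
# (4, 16) (2, 2)
```

Because the cross-attention runs inside every encoder stage, features of each image already carry information about the other image before any decoder sees them, which is the intuition behind a lighter decoder.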

Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, Rainer Stiefelhagen • 2022

Related benchmarks

Task	Dataset	Result
Relative Pose Estimation	MegaDepth 1500	--	151
Relative Pose Estimation	MegaDepth (test)	Pose AUC@5°: 53.3	83
Homography Estimation	HPatches	--	55
Pose Estimation	MegaDepth 1500 (test)	AUC@5°: 52.9	38
Pose Estimation	ScanNet 1500 (test)	AUC@5°: 24.3	26
Relative Pose Estimation	MegaDepth-1800 (test)	Matches Count: 2.42e+3	16
Relative Pose Estimation	ScanNet Indoor (test)	AUC@5°: 15.8	16
Relative Pose Estimation	MegaDepth 19 (test)	Average Rank: 8.3	14
Indoor Localization	InLoc DUC2 v1.0	SR (0.25 m, 10°): 55.7	13
Two-view Pose Estimation	ScanNet (test)	Pose Error AUC (5°): 27.3	13

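Several rows report the AUC of the pose error at 5°. The following minimal sketch shows how such a score is commonly computed, assuming the protocol popularized by SuperGlue and reused in LoFTR-style evaluations; whether every number above follows exactly this protocol is an assumption. Each image pair's pose error is the larger of its rotation and translation-direction angular errors, and the score is the area under the cumulative error curve up to the threshold, divided by the threshold. The helper names are hypothetical, and the table values appear to be such fractions expressed as percentages.

```python
import bisect


def pose_errors(rot_err_deg, trans_err_deg):
    # Per-pair pose error: the larger of the rotation and
    # translation-direction angular errors, both in degrees.
    return [max(r, t) for r, t in zip(rot_err_deg, trans_err_deg)]


def pose_auc(errors, threshold):
    # Cumulative error curve: x = sorted errors (prefixed with 0),
    # y = fraction of pairs whose error is at most x.
    n = len(errors)
    xs = [0.0] + sorted(errors)
    ys = [i / n for i in range(n + 1)]
    # Keep points with error strictly below the threshold, then extend
    # the curve horizontally up to the threshold.
    last = bisect.bisect_left(xs, threshold)
    xs = xs[:last] + [threshold]
    ys = ys[:last] + [ys[last - 1]]
    # Trapezoidal area, normalized so that all-zero errors give 1.0.
    area = sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2 for i in range(len(xs) - 1)
    )
    return area / threshold


errs = pose_errors([1.0, 2.0, 10.0], [0.5, 3.0, 4.0])  # [1.0, 3.0, 10.0]
print(round(pose_auc(errs, 5.0), 3))  # 0.5
```

Under this definition a method whose pose errors all exceed 5° scores 0, and one with perfect poses scores 1, so the metric rewards both accuracy and the fraction of pairs solved.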