MatchFormer: Interleaving Attention in Transformers for Feature Matching
About
Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a match-aware encoder relieves the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor scenes or with less outdoor training data. Thanks to this strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method in indoor pose estimation, our lite MatchFormer requires only 45% of the GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The large MatchFormer reaches state-of-the-art results on four different benchmarks, including indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatches), and visual localization (InLoc).
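The interleaving idea described above can be illustrated with a small NumPy sketch. This is not the MatchFormer implementation: it assumes plain scaled dot-product attention with no learned projections, no multiple heads, and no positional embedding, and the names `attention`, `interleaved_stage`, and the `pattern` argument are hypothetical, chosen only for illustration. It shows how self-attention (each image attends to itself) and cross-attention (each image attends to the other) can alternate inside one encoder stage.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query_src, context_src):
    # Scaled dot-product attention. Assumption: queries come straight from
    # query_src, and keys/values straight from context_src, with no learned
    # projection matrices (a real model would have them).
    d = query_src.shape[-1]
    scores = query_src @ context_src.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context_src


def interleaved_stage(feat_a, feat_b, pattern=("self", "cross")):
    # Apply attention layers in the order given by `pattern` (hypothetical
    # parameter). "self": each image attends to its own tokens.
    # "cross": each image attends to the other image's tokens.
    for kind in pattern:
        if kind == "self":
            new_a = feat_a + attention(feat_a, feat_a)
            new_b = feat_b + attention(feat_b, feat_b)
        elif kind == "cross":
            new_a = feat_a + attention(feat_a, feat_b)
            new_b = feat_b + attention(feat_b, feat_a)
        else:
            raise ValueError(f"unknown attention kind: {kind}")
        # Update both feature sets together so cross-attention uses the
        # features from the same layer depth on both sides.
        feat_a, feat_b = new_a, new_b
    return feat_a, feat_b


# Toy inputs: 6 tokens for image A and 5 tokens for image B, 8 channels each.
rng = np.random.default_rng(0)
feat_a = rng.standard_normal((6, 8))
feat_b = rng.standard_normal((5, 8))

out_a, out_b = interleaved_stage(feat_a, feat_b)
print(out_a.shape, out_b.shape)
```

With `pattern=("self",)` the output for image A does not depend on image B at all; once a `"cross"` layer is included, each image's features are conditioned on the other image, which is the match-aware behaviour the abstract attributes to the encoder.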
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Relative Pose Estimation | MegaDepth (test) | Pose AUC @5° | 53.3 | 83 |
| Pose Estimation | MegaDepth 1500 (test) | AUC @ 5° | 52.9 | 27 |
| Pose Estimation | ScanNet 1500 (test) | AUC@5° | 24.3 | 26 |
| Relative Pose Estimation | MegaDepth-1800 (test) | Matches Count | 2.42e+3 | 16 |
| Relative Pose Estimation | ScanNet Indoor (test) | AUC@5° | 15.8 | 16 |
| Relative Pose Estimation | MegaDepth 19 (test) | Average Rank | 8.3 | 14 |
| Indoor Localization | InLoc DUC2 v1.0 | SR (0.25m, 10°) | 55.7 | 13 |
| Two-view Pose Estimation | ScanNet (test) | Pose Error AUC (5°) | 27.3 | 13 |
| Two-view relative pose estimation | MegaDepth | AUC @5° | 66.5 | 13 |
| Relative Pose Estimation | MegaDepth outdoor (test) | AUC@5° | 53.3 | 13 |