GESS: Multi-cue Guided Local Feature Learning via Geometric and Semantic Synergy
About
Robust local feature detection and description are foundational tasks in computer vision. Existing methods primarily rely on single appearance cues for modeling, leading to unstable keypoints and insufficient descriptor discriminability. In this paper, we propose a multi-cue guided local feature learning framework that leverages semantic and geometric cues to synergistically enhance detection robustness and descriptor discriminability. Specifically, we construct a joint semantic-normal prediction head and a depth stability prediction head atop a lightweight backbone. The former leverages a shared 3D vector field to deeply couple semantic and normal cues, thereby resolving optimization interference from heterogeneous inconsistencies. The latter quantifies the reliability of local regions from a geometric consistency perspective, providing deterministic guidance for robust keypoint selection. Based on these predictions, we introduce the Semantic-Depth Aware Keypoint (SDAK) mechanism for feature detection. By coupling semantic reliability with depth stability, SDAK reweights keypoint responses to suppress spurious features in unreliable regions. For descriptor construction, we design a Unified Triple-Cue Fusion (UTCF) module, which employs a semantic-scheduled gating mechanism to adaptively inject multi-attribute features, improving descriptor discriminability. Extensive experiments on four benchmarks validate the effectiveness of the proposed framework. The source code and pre-trained model will be available at: https://github.com/yiyscut/GESS.git.
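The abstract states that SDAK reweights keypoint responses by coupling semantic reliability with depth stability, but it does not give the exact formula. The sketch below is a minimal illustration of that idea, assuming a simple element-wise product of two reliability maps in [0, 1]; the function names (`reweight_scores`, `top_k_keypoints`), the product coupling, and the toy values are all hypothetical and are not taken from the GESS code.

```python
from typing import List, Tuple

Grid = List[List[float]]


def reweight_scores(scores: Grid, semantic_reliability: Grid, depth_stability: Grid) -> Grid:
    """Scale each raw keypoint response by two reliability maps in [0, 1].

    ASSUMPTION: the coupling is a plain element-wise product; the paper's
    actual SDAK formulation may differ.
    """
    return [
        [s * r * d for s, r, d in zip(s_row, r_row, d_row)]
        for s_row, r_row, d_row in zip(scores, semantic_reliability, depth_stability)
    ]


def top_k_keypoints(scores: Grid, k: int) -> List[Tuple[int, int]]:
    """Return the (row, col) positions of the k highest responses."""
    flat = [(v, (i, j)) for i, row in enumerate(scores) for j, v in enumerate(row)]
    flat.sort(key=lambda item: item[0], reverse=True)
    return [pos for _, pos in flat[:k]]


# Toy 2x2 maps (hypothetical values). The strongest raw response sits at (0, 0),
# but that pixel has low semantic reliability and low depth stability.
raw = [[0.9, 0.2], [0.8, 0.7]]
sem = [[0.1, 1.0], [0.9, 0.9]]
dep = [[0.5, 1.0], [1.0, 0.8]]

weighted = reweight_scores(raw, sem, dep)
keypoints = top_k_keypoints(weighted, k=2)
```

In this toy example the raw detector would pick (0, 0) first, while the reweighted scores push that location to the bottom and select (1, 0) and (1, 1) instead, which is the suppression behaviour the abstract describes for unreliable regions.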
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Localization | Aachen Day-Night v1.1 (Day) | SR (0.25m, 2°) | 89.9 | 70 |
| Pose Estimation | MegaDepth 1500 (test) | AUC @ 5° | 53.9 | 38 |
| 3D Reconstruction | ETH local feature benchmark Gendarmenmarkt | Track Length | 8.12 | 24 |
| 3D Reconstruction | ETH local feature benchmark Tower of London | Track Length | 8.56 | 24 |
| 3D Reconstruction | Madrid Metropolis | Track Length | 8.93 | 19 |
| Visual Localization | Aachen Day-Night v1.0 (Night) | Success Rate (0.25m, 2°) | 82.7 | 17 |
| 3D Reconstruction | ETH Herzjesu Small-Scale | Track Length | 5.24 | 16 |
| Visual Localization | Aachen Day-Night v1.0 (Day) | Success Rate (0.25m, 2°) | 86.4 | 14 |
| Visual Localization | Aachen Day-Night v1.1 | Success Rate (2°, 0.25m) | 76.4 | 12 |
| Pose Estimation | ScanNet (test) | AUC@5° | 15.8 | 11 |