Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching

About

Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This suboptimal generalization mainly stems from cross-domain shifts and from ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, three sparse attention designs, Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attention, preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves outstanding Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.
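
The gating idea behind GCGF can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the layer design, so the 1x1 channel projection, the sigmoid gate, the additive fusion, and every name below (`gated_fusion`, `w_gate`, `b_gate`) are assumptions chosen only to show how a learned gate might suppress image features before they are combined with normal-driven features.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(image_feat, normal_feat, w_gate, b_gate):
    """Hypothetical gated fusion of image and normal features.

    image_feat, normal_feat: arrays of shape (C, H, W)
    w_gate: (C, 2C) weights of an assumed 1x1 projection
    b_gate: (C,) bias of that projection
    Returns the fused features and the gate, both of shape (C, H, W).
    """
    # Stack both feature maps along the channel axis: (2C, H, W).
    stacked = np.concatenate([image_feat, normal_feat], axis=0)
    # A 1x1 convolution is a per-pixel linear map over channels.
    logits = np.einsum("oc,chw->ohw", w_gate, stacked) + b_gate[:, None, None]
    # The gate lies in (0, 1); values near 0 suppress the image feature.
    gate = sigmoid(logits)
    # Assumed fusion rule: filtered image features plus geometric features.
    fused = gate * image_feat + normal_feat
    return fused, gate


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_feat = rng.normal(size=(4, 3, 5))
    normal_feat = rng.normal(size=(4, 3, 5))
    w_gate = rng.normal(size=(4, 8)) * 0.1
    b_gate = np.zeros(4)
    fused, gate = gated_fusion(image_feat, normal_feat, w_gate, b_gate)
    print(fused.shape, float(gate.min()), float(gate.max()))
```

In this sketch, a gate close to zero leaves only the normal-driven features at that location, which is one plausible way to read "suppresses unreliable contextual cues"; the actual module may combine the two streams differently.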

Jiahao Li, Xinhong Chen, Zhengmin Jiang, Cheng Huang, Yung-Hui Li, Jianping Wang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Stereo Matching | KITTI 2015 | -- | 118 |
| Stereo Matching | KITTI 2012 | Error Rate (3px, All) 3.11 | 108 |
| Stereo Matching | Scene Flow (test) | EPE 0.39 | 77 |
| Stereo Matching | KITTI 2015 (all pixels) | D1 Error (Background) 1.28 | 48 |
| Stereo Matching | ETH3D (non-occluded) | Bad 1.0 Error 0.5 | 43 |
| Stereo Matching | KITTI Noc 2015 | D1 Error (Background) 1.19 | 42 |
| Stereo Matching | Booster Q | EPE 1.53 | 33 |
| Stereo Matching | Middlebury non-occluded Half resolution | D23.54 | 14 |
| Stereo Matching | ETH3D (All) | D1 Error 0.73 | 10 |