Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching
About
Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. The shortfall stems mainly from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, three sparse attention designs, Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attention, preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves strong Syn-to-Real performance: compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.
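The abstract gives no equations for GCGF, so the following is only a minimal illustrative sketch of the general idea of gated fusion, not the authors' implementation. It assumes a single scalar gate per pixel, predicted from the concatenated features, that scales the image (context) features before they are added to the normal-driven (geometry) features. The function name `gated_fusion`, the parameters `gate_weights` and `gate_bias`, and the additive fusion rule are all hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(context, geometry, gate_weights, gate_bias=0.0):
    """Fuse one pixel's context and geometry feature vectors.

    Hypothetical formulation (an assumption, not taken from the paper):
    a scalar gate in (0, 1) is predicted from the concatenated features,
    scales the context (image) features, and the gated context is then
    added to the geometry (normal-driven) features.
    """
    if len(context) != len(geometry):
        raise ValueError("context and geometry must have the same length")
    joint = list(context) + list(geometry)
    if len(gate_weights) != len(joint):
        raise ValueError("gate_weights must match the concatenated length")

    # Linear projection of the concatenated features to a single logit.
    logit = sum(w * f for w, f in zip(gate_weights, joint)) + gate_bias
    gate = sigmoid(logit)

    # A gate near 0 suppresses the context cues; near 1 keeps them.
    fused = [gate * c + g for c, g in zip(context, geometry)]
    return gate, fused


# Toy usage: zero weights and zero bias give a gate of exactly 0.5.
gate, fused = gated_fusion([2.0, 4.0], [1.0, 1.0], [0.0, 0.0, 0.0, 0.0])
```

In a real network the gate would typically be a learned, per-channel or per-pixel map produced by convolutional layers; the scalar gate here only shows how a low gate value lets the geometric features dominate where image cues are unreliable.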
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Stereo Matching | KITTI 2015 | -- | 118 |
| Stereo Matching | KITTI 2012 | Error Rate (3px, All) 3.11 | 108 |
| Stereo Matching | Scene Flow (test) | EPE 0.39 | 77 |
| Stereo Matching | KITTI 2015 (all pixels) | D1 Error (Background) 1.28 | 48 |
| Stereo Matching | ETH3D (non-occluded) | Bad 1.0 Error 0.5 | 43 |
| Stereo Matching | KITTI Noc 2015 | D1 Error (Background) 1.19 | 42 |
| Stereo Matching | Booster Q | EPE 1.53 | 33 |
| Stereo Matching | Middlebury non-occluded Half resolution | D23.54 | 14 |
| Stereo Matching | ETH3D (All) | D1 Error 0.73 | 10 |
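
Several metrics in the table (EPE, Bad 1.0, D1) follow standard definitions in the stereo matching literature. As a rough sketch of how they are commonly computed, assuming disparity maps flattened into plain lists and the convention that a ground-truth value of 0 marks an invalid pixel (both are assumptions for illustration, and official benchmark toolkits may differ in details such as masks and strict versus non-strict thresholds):

```python
def valid_errors(pred, gt):
    """Return (absolute error, ground truth) pairs for valid pixels.

    Assumption: a ground-truth disparity of 0 or less marks an invalid pixel.
    """
    pairs = [(abs(p - g), g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return pairs


def epe(pred, gt):
    """End-point error: mean absolute disparity error in pixels."""
    pairs = valid_errors(pred, gt)
    return sum(err for err, _ in pairs) / len(pairs)


def bad_x(pred, gt, threshold=1.0):
    """Percentage of valid pixels whose error exceeds `threshold` pixels."""
    pairs = valid_errors(pred, gt)
    bad = sum(1 for err, _ in pairs if err > threshold)
    return 100.0 * bad / len(pairs)


def d1_error(pred, gt):
    """Percentage of outliers: error above 3 px AND above 5% of ground truth."""
    pairs = valid_errors(pred, gt)
    outliers = sum(1 for err, g in pairs if err > 3.0 and err > 0.05 * g)
    return 100.0 * outliers / len(pairs)


# Toy data: the last pixel has no ground truth and is ignored.
pred = [10.0, 20.5, 33.5, 104.0, 50.0]
gt = [10.0, 20.0, 30.0, 100.0, 0.0]

print(epe(pred, gt))         # mean of errors 0, 0.5, 3.5, 4.0
print(bad_x(pred, gt, 1.0))  # share of pixels with error above 1 px
print(d1_error(pred, gt))    # 4 px error at disparity 100 is not an outlier
```

Note that the fourth pixel counts as "bad" under the 1 px threshold but not as a D1 outlier, because its 4 px error is below 5% of its ground-truth disparity.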