Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching

About

Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. The suboptimal generalization mainly stems from cross-domain shifts and from ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, three sparse attention designs, Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attention, preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves outstanding Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.
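The gating idea behind GCGF can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gated_fusion`, the 1x1-convolution-style gate parameters `w_gate` and `b_gate`, and the additive fusion rule are all assumptions chosen only to show how a learned gate could suppress image features before they are combined with normal-driven features.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(img_feat, normal_feat, w_gate, b_gate):
    """Hypothetical gated fusion of image and normal features.

    img_feat, normal_feat: arrays of shape (C, H, W)
    w_gate: (C, 2C) weights of an assumed 1x1 convolution
    b_gate: (C,) bias of that convolution
    """
    # The gate sees both feature maps, stacked along the channel axis.
    stacked = np.concatenate([img_feat, normal_feat], axis=0)  # (2C, H, W)
    # A 1x1 convolution is a per-pixel linear map over channels.
    logits = np.einsum("oc,chw->ohw", w_gate, stacked) + b_gate[:, None, None]
    gate = sigmoid(logits)  # values in (0, 1), one per channel and pixel
    # Assumed fusion rule: attenuate image features, then add geometric ones.
    return gate * img_feat + normal_feat
```

Where the gate approaches 0, the output falls back to the normal-driven features alone, which matches the intuition of discarding unreliable texture cues in, for example, specular or transparent regions.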

Jiahao Li, Xinhong Chen, Zhengmin Jiang, Cheng Huang, Yung-Hui Li, Jianping Wang • 2026

Related benchmarks

Task	Dataset	Result
Stereo Matching	KITTI 2015	--	118
Stereo Matching	KITTI 2012	Error Rate (3px, All) 3.11	108
Stereo Matching	Scene Flow (test)	EPE 0.39	84
Stereo Matching	KITTI 2015 (all pixels)	D1 Error (Background) 1.28	48
Stereo Matching	ETH3D (non-occluded)	Bad 1.0 Error 0.5	43
Stereo Matching	KITTI Noc 2015	D1 Error (Background) 1.19	42
Stereo Matching	Booster Q	EPE 1.53	33
Stereo Matching	Middlebury non-occluded Half resolution	D23.54	14
Stereo Matching	ETH3D (All)	D1 Error 0.73	10
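
For readers unfamiliar with the metric names in the table, the sketch below shows how such disparity errors are commonly computed. The function names and the `valid` mask argument are illustrative assumptions, and the thresholds follow the usual benchmark conventions (EPE as mean absolute disparity error, Bad-τ as the percentage of pixels with error above τ pixels, and D1 as the percentage of pixels whose error exceeds both 3 px and 5% of the ground-truth disparity); the exact evaluation protocol behind each row is not stated here.

```python
import numpy as np

def epe(pred, gt, valid):
    """End-point error: mean absolute disparity error over valid pixels."""
    err = np.abs(pred - gt)[valid]
    return float(err.mean())

def bad_tau(pred, gt, valid, tau):
    """Percentage of valid pixels whose error exceeds tau pixels."""
    err = np.abs(pred - gt)[valid]
    return float((err > tau).mean() * 100.0)

def d1(pred, gt, valid):
    """Percentage of valid pixels with error > 3 px and > 5% of ground truth."""
    err = np.abs(pred - gt)[valid]
    outlier = (err > 3.0) & (err > 0.05 * gt[valid])
    return float(outlier.mean() * 100.0)

# Toy 2x2 disparity maps (made-up numbers, for illustration only).
gt = np.array([[10.0, 20.0], [100.0, 50.0]])
pred = np.array([[10.5, 24.0], [104.0, 50.0]])
valid = np.ones_like(gt, dtype=bool)

print(epe(pred, gt, valid))           # 2.125
print(bad_tau(pred, gt, valid, 1.0))  # 50.0
print(d1(pred, gt, valid))            # 25.0
```

In the toy example, the pixel with ground truth 100 and error 4 px counts as bad under the 1 px threshold but not under D1, because 4 px is below 5% of its true disparity.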
