
Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation

About

Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). However, recent methods often overlook the fact that visual cues are susceptible, whereas the underlying geometry remains stable, rendering depth information more robust. In this paper, we investigate the potential of integrating depth information with features from VFMs, to improve the geometric consistency within an image and boost the generalization performance of VFMs. We propose a novel fine-tuning DGSS framework, named DepthForge, which integrates the visual cues from frozen DINOv2 or EVA02 and depth cues from frozen Depth Anything V2. In each layer of the VFMs, we incorporate depth-aware learnable tokens to continuously decouple domain-invariant visual and spatial information, thereby enhancing depth awareness and attention of the VFMs. Finally, we develop a depth refinement decoder and integrate it into the model architecture to adaptively refine multi-layer VFM features and depth-aware learnable tokens. Extensive experiments are conducted based on various DGSS settings and five different datasets as unseen target domains. The qualitative and quantitative results demonstrate that our method significantly outperforms alternative approaches with stronger performance, steadier visual-spatial attention, and superior generalization ability. In particular, DepthForge exhibits outstanding performance under extreme conditions (e.g., night and snow). Code is available at https://github.com/anonymouse-xzrptkvyqc/DepthForge.

Siyu Chen, Ting Han, Changshe Zhang, Xin Luo, Meiliu Wu, Guorong Cai, Jinhe Su • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Semantic Segmentation | Cityscapes-C (test) | mIoU 46.77 | 78 |
| Semantic segmentation | Mapillary (test) | mIoU 75.93 | 43 |
| Video Semantic Segmentation | CamVid | mIoU 51.28 | 41 |
| Semantic segmentation | GTAV to Cityscapes, BDD, Mapillary Synthetic-to-Real (test) | mIoU (Cityscapes) 69.04 | 22 |
| Semantic segmentation | BDD (test) | mIoU 66.19 | 18 |
| Semantic segmentation | FLAIR Cross-Regional | mIoU 61.56 | 16 |
| Semantic segmentation | LoveDA Cross-Style | mIoU 57.5 | 16 |
| Semantic segmentation | OpenEarthMap (Cross-Continent) | mIoU 66.85 | 16 |
| Semantic segmentation | Five-Billion-Pixels Cross-sensor | mIoU 58.79 | 16 |
| Semantic segmentation | Potsdam&Vaihingen Cross Spectral Band | mIoU 59.57 | 16 |
Showing 10 of 20 rows
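
Every result above is reported as mIoU (mean intersection-over-union, in percent). For readers unfamiliar with the metric, the following is a minimal sketch of how it is commonly computed from flattened label maps. The function name `mean_iou`, the `ignore_index=255` default (a common convention for unlabeled pixels), and the choice to average only over classes that actually appear are assumptions made for illustration; the exact evaluation protocol behind the numbers in the table may differ.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU in percent over flattened label sequences.

    pred and target are equal-length sequences of integer class ids.
    Pixels whose target equals ignore_index are skipped (assumed convention).
    Predictions are assumed to lie in [0, num_classes).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # Correct pixel: counts once toward intersection and union of class t.
            intersection[t] += 1
            union[t] += 1
        else:
            # Wrong pixel: enlarges the union of both the true and predicted class.
            union[t] += 1
            union[p] += 1
    # Average only over classes that occur in the prediction or the ground truth.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return 100.0 * sum(ious) / len(ious) if ious else float("nan")


score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In this toy call, class 0 has IoU 1/2 and class 1 has IoU 2/3, so `score` is their mean expressed in percent. For a whole dataset, the per-class counts are usually accumulated over all images before the ratio is taken, which is what passing the concatenated labels of every image to this function would do.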
