
Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks

About

Although large-scale visual foundation models (VFMs) achieve remarkable performance in semantic understanding, they still underperform in instance-aware dense prediction tasks. Different VFMs also exhibit different representational biases: for instance, promptable segmentation models (e.g., SAM2) focus on fine-grained region boundaries, while self-supervised models (e.g., DINOv3) emphasize object-level structure. This observation highlights the potential of combining complementary features from different VFMs to enhance downstream dense prediction tasks. However, naive multi-VFM fusion seldom leads to reliable gains, and interpretable principles for leveraging complementary features remain underexplored. In this work, we propose a metric-guided approach that selects and aggregates complementary features from different VFMs based on explicit assessment scores. Specifically, we design a suite of label-free metrics in feature space, covering two aspects, Structural Coherence and Edge Fidelity, to assess the features of VFM encoders. Guided by these scores, we identify complementary edge-strong and structure-strong encoder pairs and integrate them via a master-auxiliary fusion scheme. This feature fusion requires no complex architectural changes and is trained in a single stage. Our model shows consistent performance gains over the baselines across multiple dense prediction tasks, with better object-level semantics and more accurately localized boundaries. The code is available at https://github.com/gyc-code/metric-guided-fusion.

Yachan Guo, Jose Luis Gomez Zurita, Danna Xue, Yi Xiao, Antonio Manuel Lopez Pena • 2026
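
The selection-then-fusion idea in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper or its repository: `edge_score` and `structure_score` are hypothetical stand-ins for the Edge Fidelity and Structural Coherence metrics, `pick_pair` and `fuse` are hypothetical helpers, the feature maps are synthetic, and the projection matrix is fixed rather than learned. Treating the structure-strong encoder as the master is likewise an arbitrary choice here.

```python
import numpy as np

def _unit(feat):
    # L2-normalise the channel vector at every spatial location; feat is (C, H, W).
    return feat / (np.linalg.norm(feat, axis=0, keepdims=True) + 1e-8)

def edge_score(feat):
    """Hypothetical edge proxy: mean absolute change between neighbouring feature vectors."""
    u = _unit(feat)
    return float(np.abs(np.diff(u, axis=1)).mean() + np.abs(np.diff(u, axis=2)).mean())

def structure_score(feat):
    """Hypothetical structure proxy: mean cosine similarity between neighbouring feature vectors."""
    u = _unit(feat)
    sim_y = (u[:, 1:, :] * u[:, :-1, :]).sum(axis=0)  # vertical neighbours
    sim_x = (u[:, :, 1:] * u[:, :, :-1]).sum(axis=0)  # horizontal neighbours
    return float((sim_y.mean() + sim_x.mean()) / 2)

def pick_pair(features):
    """Return (edge_strong_name, structure_strong_name) from a dict of label-free scored features."""
    edge_name = max(features, key=lambda k: edge_score(features[k]))
    struct_name = max(features, key=lambda k: structure_score(features[k]))
    return edge_name, struct_name

def fuse(master, aux, proj, alpha=0.5):
    """Add the auxiliary features to the master features after a 1x1 channel projection.

    master: (C_m, H, W), aux: (C_a, H, W), proj: (C_m, C_a).
    In a trained model proj would be learned; here it is a fixed matrix.
    """
    aux_proj = np.einsum("oc,chw->ohw", proj, aux)
    return master + alpha * aux_proj

# Synthetic stand-ins for two encoders' feature maps (not real VFM outputs).
smooth = np.ones((4, 8, 8))      # identical vector everywhere: coherent, no edges
edgy = np.zeros((4, 8, 8))
edgy[0, :, :4] = 1.0             # left half points along channel 0
edgy[1, :, 4:] = 1.0             # right half points along channel 1: one sharp boundary
features = {"smooth": smooth, "edgy": edgy}

edge_name, struct_name = pick_pair(features)
master, aux = features[struct_name], features[edge_name]
proj = np.eye(4)                 # placeholder projection (identity)
fused = fuse(master, aux, proj)
```

The scores need no labels because they are computed from the feature maps alone, which is the property the abstract relies on; the specific formulas used in the paper may differ substantially from these proxies.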

Related benchmarks

Task                     Dataset                  Metric   Result   Rank
Semantic Segmentation    Cityscapes (val)         mIoU     82.8     527
Instance Segmentation    COCO (val)               APmk     47.3     485
Instance Segmentation    Cityscapes (val)         AP       39.5     247
Panoptic Segmentation    COCO                     PQ       56.9     31
Instance Segmentation    SYNTHIA to Cityscapes    --       --       8
Instance Segmentation    KITTI-360                AP       21.9     3
Instance Segmentation    Urbansyn CS              AP       32.5     3
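
For readers unfamiliar with the mIoU metric reported for semantic segmentation, the following is a minimal sketch of how mean intersection-over-union is commonly computed from a confusion matrix. The function name `mean_iou` and the `ignore_index=255` default are assumptions for illustration; benchmark implementations typically accumulate the confusion matrix over the whole validation set before averaging, whereas this sketch scores a single pair of label arrays.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that appear in the prediction or the ground truth."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    valid = target != ignore_index          # drop pixels marked as "ignore"
    pred, target = pred[valid], target[valid]
    # Confusion matrix: rows are ground-truth classes, columns are predicted classes.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    present = union > 0                     # skip classes absent from both arrays
    return float((inter[present] / union[present]).mean())

# Tiny example with two classes and one ignored pixel.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 255], num_classes=2)
```

In the tiny example each class has one correctly labelled pixel out of a union of two, so both per-class IoUs are 0.5.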
