Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks

About

Although large-scale visual foundation models (VFMs) achieve remarkable performance in semantic understanding, they still underperform on instance-aware dense prediction tasks. Different VFMs exhibit different representational biases: for instance, promptable segmentation models (e.g., SAM2) focus on fine-grained region boundaries, while self-supervised models (e.g., DINOv3) emphasize object-level structure. This observation suggests that combining complementary features from different VFMs could improve downstream dense prediction. However, naive multi-VFM fusion seldom yields reliable gains, and interpretable principles for exploiting complementary features remain underexplored. In this work, we propose a metric-guided approach that selects and aggregates complementary features from different VFMs based on explicit assessment scores. Specifically, we design a suite of label-free feature-space metrics covering two aspects, Structural Coherence and Edge Fidelity, to assess the features of VFM encoders. Guided by these scores, we identify complementary pairs of edge-strong and structure-strong encoders and integrate them via a master-auxiliary fusion scheme. The fusion requires no complex architectural changes and is trained in a single stage. Our model shows consistent gains over the baselines across multiple dense prediction tasks, with better object-level semantics and more accurately localized boundaries. The code is available at https://github.com/gyc-code/metric-guided-fusion.
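
The selection-then-fusion idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The class `EncoderScores`, the functions `select_pair` and `fuse`, the encoder names, the score values, and the "higher is better" convention are all hypothetical placeholders, and the weighted sum only stands in for the master-auxiliary fusion scheme, whose actual form the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderScores:
    """Hypothetical per-encoder scores; higher is assumed to be better."""
    name: str
    structural_coherence: float
    edge_fidelity: float


def select_pair(scores):
    """Return (structure_strong, edge_strong) as two distinct encoders.

    Assumption: the structure-strong encoder is the one with the highest
    Structural Coherence score, and the edge-strong encoder is the one with
    the highest Edge Fidelity score among the remaining encoders.
    """
    if len(scores) < 2:
        raise ValueError("need at least two encoders to form a pair")
    structure = max(scores, key=lambda s: s.structural_coherence)
    remaining = [s for s in scores if s.name != structure.name]
    edge = max(remaining, key=lambda s: s.edge_fidelity)
    return structure, edge


def fuse(master, auxiliary, weight=0.5):
    """Element-wise master + weight * auxiliary on flat feature vectors.

    A stand-in for master-auxiliary fusion: the master features pass through
    unchanged and the auxiliary features are added as a scaled correction.
    """
    if len(master) != len(auxiliary):
        raise ValueError("feature vectors must have the same length")
    return [m + weight * a for m, a in zip(master, auxiliary)]


# Made-up scores for three placeholder encoders (not results from the paper).
candidates = [
    EncoderScores("encoder_a", structural_coherence=0.9, edge_fidelity=0.3),
    EncoderScores("encoder_b", structural_coherence=0.4, edge_fidelity=0.8),
    EncoderScores("encoder_c", structural_coherence=0.6, edge_fidelity=0.5),
]
structure_strong, edge_strong = select_pair(candidates)
fused = fuse([1.0, 2.0], [2.0, 4.0], weight=0.5)
```

Which encoder of the pair plays the master role and which the auxiliary role is left open here, since the abstract does not state it. In practice the features would be spatial feature maps rather than flat lists, and the fusion weights would be learned.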

Yachan Guo, Jose Luis Gomez Zurita, Danna Xue, Yi Xiao, Antonio Manuel Lopez Pena • 2026

Related benchmarks

Task	Dataset	Result
Semantic Segmentation	Cityscapes (val)	mIoU 82.8	552
Instance Segmentation	COCO (val)	APmk 47.3	520
Instance Segmentation	Cityscapes (val)	AP 39.5	247
Panoptic Segmentation	COCO	PQ 56.9	46
Instance Segmentation	SYNTHIA to Cityscapes	--	8
Instance Segmentation	KITTI-360	AP 21.9	3
Instance Segmentation	Urbansyn CS	AP 32.5	3
