
Revealing the Semantic Selection Gap in DINOv3 through Training-Free Few-Shot Segmentation

About

Recent self-supervised Vision Transformers (ViTs), such as DINOv3, provide rich feature representations for dense vision tasks. This study investigates the intrinsic few-shot semantic segmentation (FSS) capabilities of frozen DINOv3 features through a training-free baseline, FSSDINO, which uses class-specific prototypes and Gram-matrix refinement. Our results across binary, multi-class, and cross-domain (CDFSS) benchmarks demonstrate that this minimal approach, applied to the final backbone layer, is highly competitive with specialized methods that rely on complex decoders or test-time adaptation. Crucially, we conduct an Oracle-guided layer analysis, identifying a significant performance gap between the standard last-layer features and globally optimal intermediate representations. We reveal a "Safest vs. Optimal" dilemma: while the Oracle shows that higher performance is attainable, matching the results of compute-intensive adaptation methods, current unsupervised and support-guided selection metrics consistently yield lower performance than the last-layer baseline. This characterizes a "Semantic Selection Gap" in Foundation Models: a disconnect where traditional heuristics fail to reliably identify high-fidelity features. Our work establishes the last layer as a deceptively strong baseline and provides a rigorous diagnostic of the latent semantic potential in DINOv3. The code is publicly available at https://github.com/hussni0997/fssdino.
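To make the pipeline in the abstract concrete, the PyTorch sketch below illustrates the two core ideas: a class-specific prototype obtained by masked average pooling over frozen support features, followed by a Gram-matrix refinement of the cosine-similarity map. This is an illustrative sketch only; the function name, the softmax temperature, the threshold, and this particular form of the refinement are assumptions, not the paper's actual implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def prototype_segment(supp_feats, supp_mask, query_feats, threshold=0.5):
    """Training-free few-shot segmentation from frozen backbone features.

    supp_feats:  (C, H, W) support-image features
    supp_mask:   (H, W) binary foreground mask for the support image
    query_feats: (C, H, W) query-image features
    Returns a (H, W) binary prediction and the refined similarity map.
    """
    C, H, W = supp_feats.shape
    feats = supp_feats.reshape(C, H * W)
    mask = supp_mask.float().reshape(1, H * W)

    # Masked average pooling: class-specific prototype from foreground features.
    proto = (feats * mask).sum(dim=1) / mask.sum().clamp(min=1.0)  # (C,)

    # Cosine similarity between the prototype and every query location.
    q = F.normalize(query_feats.reshape(C, H * W), dim=0)  # (C, H*W)
    sim = F.normalize(proto, dim=0) @ q                    # (H*W,), in [-1, 1]

    # One possible Gram-matrix refinement (an assumption): smooth the
    # similarity map with the query's own row-normalized feature affinities.
    gram = torch.softmax(q.T @ q / 0.1, dim=-1)            # (H*W, H*W)
    sim = gram @ sim

    pred = (sim > threshold).reshape(H, W).long()
    return pred, sim.reshape(H, W)
```

Because the affinity matrix is row-stochastic, the refinement replaces each location's score with a convex combination of scores at feature-similar locations, which tends to suppress isolated false positives.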

Hussni Mohd Zakir, Eric Tatt Wei Ho • 2026

Related benchmarks

Task                                 Dataset            mIoU   Rank
Few-shot Semantic Segmentation       COCO-20i (binary)  58.54  14
Cross-Domain Few-Shot Segmentation   DeepGlobe (test)   59.78  12
Cross-Domain Few-Shot Segmentation   ISIC (test)        61.67  12
Cross-Domain Few-Shot Segmentation   SUIM (test)        62.33  6
