Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning
About
Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where Vision-Language Models (VLMs) such as CLIP and SigLIP have shown promising results. Existing work on traditional visual models suggests that improving visual discriminability enhances performance. However, in VLM-based SF-CDFSL tasks, we find that strengthening visual-modal discriminability actually suppresses VLMs' performance. In this paper, we investigate this phenomenon to provide both an interpretation and a solution. Through both theoretical analysis and experiments, our study reveals that fine-tuning with the typical cross-entropy loss ($\mathcal{L}_{\mathrm{vlm}}$) inherently includes a visual learning part and a cross-modal learning part, where the cross-modal part is crucial for rectifying the heavily disrupted cross-modal alignment in SF-CDFSL. However, we find that the visual learning part essentially acts as a shortcut that lets the model reduce $\mathcal{L}_{\mathrm{vlm}}$ without attending to the cross-modal part, thereby hindering cross-modal alignment and harming performance. Based on this interpretation, we propose an approach to address the problem: first, we perturb the visual learning to guide the model to focus on cross-modal alignment; then, we use visual-text semantic relationships to gradually align the visual and textual modalities during fine-tuning. Extensive experiments across various settings, backbones (CLIP, SigLIP, PE-Core), and tasks (4 CDFSL datasets and 11 FSL datasets) show that our method consistently sets new state-of-the-art results. Code is available at https://github.com/zhenyuZ-HUST/CVPR26-Mind-the-Discriminability-Trap.
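The abstract does not spell out $\mathcal{L}_{\mathrm{vlm}}$, so the following is only a minimal sketch of the cross-entropy loss commonly used when fine-tuning CLIP-style models for classification, not the authors' implementation. It assumes the logits are temperature-scaled cosine similarities between one image feature and one text feature per class; the function names, the toy 2-D features, and the default `temperature=0.07` are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def vlm_cross_entropy(image_feat, text_feats, label, temperature=0.07):
    """Cross-entropy over image-to-text cosine similarities (assumed form).

    image_feat: feature vector of one image.
    text_feats: one text feature vector per class (e.g. from class prompts).
    label:      index of the correct class in text_feats.
    """
    img = l2_normalize(image_feat)
    logits = []
    for text_feat in text_feats:
        txt = l2_normalize(text_feat)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(cosine / temperature)

    # Numerically stable log-sum-exp, then the negative log-likelihood
    # of the correct class.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]


# Toy usage: two classes with orthogonal text features.
class_text_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = vlm_cross_entropy([1.0, 0.0], class_text_feats, label=0)
misaligned_loss = vlm_cross_entropy([0.0, 1.0], class_text_feats, label=0)
```

In this toy setup the loss is small when the image feature points toward the text feature of its own class and large when it points toward another class, which is the cross-modal alignment signal the abstract refers to. How the loss splits into a visual learning part and a cross-modal part is the paper's contribution and is not reproduced here.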
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| 5-way 1-shot Classification | CD-FSL ISIC, EuroSAT, CropDisease, ChestX (test) | Accuracy (ISIC) | 45.01 | 74 |
| 5-way 5-shot Classification | CD-FSL ISIC, EuroSAT, CropDisease, ChestX (test) | Accuracy (ISIC) | 61.41 | 60 |
| Image Classification | 11-Dataset Average | Average Accuracy | 84 | 42 |
| Image Classification | EuroSAT | Top-1 Accuracy | 92.5 | 8 |