
Fine-tuning Pre-trained Vision-Language Models in a Human-Annotation-Free Manner

About

Large-scale vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization, but adapting them to downstream tasks typically requires costly labeled data. Existing unsupervised self-training methods rely on pseudo-labeling, yet often suffer from unreliable confidence filtering, confirmation bias, and underutilization of low-confidence samples. We propose Collaborative Fine-Tuning (CoFT), an unsupervised adaptation framework that leverages unlabeled data through a dual-model, cross-modal collaboration mechanism. CoFT introduces a dual-prompt learning strategy with positive and negative textual prompts to explicitly model pseudo-label cleanliness in a sample-dependent manner, removing the need for hand-crafted thresholds or noise assumptions. The negative prompt also regularizes lightweight visual adaptation modules, improving robustness under noisy supervision. CoFT employs a two-phase training scheme, transitioning from parameter-efficient fine-tuning on high-confidence samples to full fine-tuning guided by collaboratively filtered pseudo-labels. Building on CoFT, CoFT+ further enhances adaptation via iterative fine-tuning, momentum contrastive learning, and LLM-generated prompts. Extensive experiments demonstrate consistent gains over existing unsupervised methods and even few-shot supervised baselines.
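The dual-prompt idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes precomputed image and text embeddings given as plain Python lists, and the function names (`pseudo_label`, `clean_probability`, `select_clean`), the temperature of 0.01, and the rule of keeping a sample when the positive prompt beats the negative prompt are all hypothetical choices made for illustration. The sketch assigns a zero-shot pseudo-label by cosine similarity, then scores how "clean" that label looks by comparing the image against a positive and a negative prompt embedding for the predicted class.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(image_feat, class_text_feats, temperature=0.01):
    """Zero-shot pseudo-label: the class whose text embedding is most similar."""
    logits = [cosine(image_feat, t) / temperature for t in class_text_feats]
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]


def clean_probability(image_feat, pos_feat, neg_feat, temperature=0.01):
    """Assumed cleanliness score: two-way softmax of positive vs. negative prompt."""
    logits = [
        cosine(image_feat, pos_feat) / temperature,
        cosine(image_feat, neg_feat) / temperature,
    ]
    return softmax(logits)[0]


def select_clean(image_feats, class_text_feats, pos_feats, neg_feats):
    """Keep (index, label) pairs whose positive prompt outscores the negative one."""
    kept = []
    for i, x in enumerate(image_feats):
        label, _ = pseudo_label(x, class_text_feats)
        # Sample-dependent decision: no tuned confidence threshold, only a
        # comparison between the two prompts for the predicted class.
        if clean_probability(x, pos_feats[label], neg_feats[label]) > 0.5:
            kept.append((i, label))
    return kept
```

In this sketch the keep/drop decision is a per-sample comparison between two prompts instead of a global confidence cutoff, which is one plausible reading of "sample-dependent" filtering; the paper's actual training objective and collaboration mechanism are not reproduced here.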

Qian-Wei Wang, Guanghao Meng, Ren Cai, Yaguang Song, Shu-Tao Xia • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | Flowers102 | Accuracy | 79.6 | 558 |
| Image Classification | UCF101 | Top-1 Acc | 80.23 | 527 |
| Image Classification | DTD | Accuracy | 52.3 | 487 |
| Image Classification | Food101 | Accuracy | 90.58 | 457 |
| Image Classification | StanfordCars | Accuracy | 79.13 | 384 |
| Image Classification | OxfordPets | Accuracy | 93.16 | 298 |
| Image Classification | FGVCAircraft | Accuracy | 31.25 | 289 |
| Image Classification | Caltech101 | Accuracy | 95.5 | 228 |
| Image Classification | EuroSAT | Accuracy | 90.2 | 226 |
| Image Classification | CIFAR-100-N | Accuracy | 80.89 | 62 |
