Bi-CoG: Bi-Consistency-Guided Self-Training for Vision-Language Models
About
Exploiting unlabeled data through semi-supervised learning (SSL) and leveraging pre-trained models via fine-tuning are the two prevailing paradigms for addressing label-scarce scenarios. Recently, growing attention has been given to combining fine-tuning of pre-trained vision-language models (VLMs) with SSL, forming the emerging paradigm of semi-supervised fine-tuning. However, existing methods often suffer from model bias and hyperparameter sensitivity because they rely on prediction consistency or pre-defined confidence thresholds. To address these limitations, we propose a simple yet effective plug-and-play methodology named Bi-Consistency-Guided Self-Training (Bi-CoG), which assigns high-quality, low-bias pseudo-labels by simultaneously exploiting inter-model and intra-model consistency, together with an error-aware dynamic pseudo-label assignment strategy. Both theoretical analysis and extensive experiments on 14 datasets demonstrate the effectiveness of Bi-CoG, which consistently and significantly improves the performance of existing methods.
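To make the two kinds of consistency concrete, the following is a minimal sketch and not the paper's algorithm. It assumes that "intra-model consistency" means one model predicting the same class on two augmented views of an image, and that "inter-model consistency" means two different models predicting the same class on the same view. The function name `select_pseudo_labels` and its arguments are hypothetical, and the error-aware dynamic assignment strategy is not reproduced because the abstract gives no details about it.

```python
import numpy as np

def select_pseudo_labels(probs_a_view1, probs_a_view2, probs_b_view1):
    """Return (indices, labels) of unlabeled samples passing both checks.

    Illustrative assumption only: each argument is an array of class
    probabilities with shape (n_samples, n_classes). Model A is evaluated
    on two augmented views of the same images, model B on the first view.
    """
    pred_a_view1 = probs_a_view1.argmax(axis=1)
    pred_a_view2 = probs_a_view2.argmax(axis=1)
    pred_b_view1 = probs_b_view1.argmax(axis=1)

    # Intra-model: the same model agrees with itself across views.
    intra = pred_a_view1 == pred_a_view2
    # Inter-model: two different models agree on the same view.
    inter = pred_a_view1 == pred_b_view1

    keep = intra & inter
    return np.flatnonzero(keep), pred_a_view1[keep]
```

Note that this sketch uses no confidence threshold: a sample is kept only when both agreement checks pass, which is one simple way to avoid the pre-defined thresholds the abstract identifies as a source of hyperparameter sensitivity.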
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | StanfordCars | Accuracy | 73.77 | 384 |
| Image Classification | FGVCAircraft | Accuracy | 31.98 | 289 |
| Image Classification | OxfordPets | H Score | 96.93 | 182 |
| Image Classification | ImageNet-100 | Accuracy | 90.15 | 163 |
| Image Classification | Food101 | Base Accuracy | 87.46 | 69 |
| Image Classification | Caltech101 | Base Accuracy | 95.09 | 68 |
| Skin lesion classification | ISIC 2018 (test) | -- | -- | 30 |
| Image Classification | CIFAR-10 low-resolution (test) | Accuracy | 91.07 | 14 |
| Image Classification | CIFAR-100 low-resolution (test) | Accuracy | 66.84 | 14 |
| Action Recognition | UCF101 | Harmonic Mean (HM) | 83.09 | 7 |