
Finetune like you pretrain: Improved finetuning of zero-shot vision models

About

Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety of benchmarks. However, recent works like WiSE-FT (Wortsman et al., 2021) and LP-FT (Kumar et al., 2022) have shown that even subtle differences in the finetuning process can lead to surprisingly large differences in final performance, both on in-distribution (ID) and out-of-distribution (OOD) data. In this work, we show that a natural and simple approach of mimicking contrastive pretraining consistently outperforms alternative finetuning approaches. Specifically, we cast downstream class labels as text prompts and continue optimizing the contrastive loss between image embeddings and class-descriptive prompt embeddings (contrastive finetuning). Our method consistently outperforms baselines across 7 distribution-shift, 6 transfer-learning, and 3 few-shot-learning benchmarks. On WILDS-iWildCam, our proposed approach FLYP outperforms the top of the leaderboard by $2.3\%$ ID and $2.7\%$ OOD, giving the highest reported accuracy. Averaged across 7 OOD datasets (2 WILDS and 5 ImageNet-associated shifts), FLYP gives gains of $4.2\%$ OOD over standard finetuning and outperforms the current state of the art (LP-FT) by more than $1\%$ both ID and OOD. Similarly, on 3 few-shot learning benchmarks, our approach gives gains of up to $4.6\%$ over standard finetuning and $4.4\%$ over the state of the art. In total, these benchmarks establish contrastive finetuning as a simple, intuitive, and state-of-the-art approach for supervised finetuning of image-text models like CLIP. Code is available at https://github.com/locuslab/FLYP.
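The core idea above is that finetuning keeps the pretraining objective: each class label is rendered as a text prompt, and the symmetric CLIP-style contrastive loss is optimized between image embeddings and prompt embeddings. Below is a minimal NumPy sketch of that symmetric contrastive loss; the function name, the prompt template in the comment, and the temperature value are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss (a sketch, not the FLYP code).

    image_emb: (batch, dim) image encoder outputs.
    text_emb:  (batch, dim) text encoder outputs for the matching prompts,
               e.g. "a photo of a {class name}" for each image's label.
    """
    # L2-normalize both sets of embeddings.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise cosine similarities, scaled by the temperature.
    logits = img @ txt.T / temperature  # shape (batch, batch)
    n = logits.shape[0]
    diag = np.arange(n)  # matching image/prompt pairs lie on the diagonal

    def cross_entropy(l):
        # Numerically stable log-softmax over each row, then pick the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[diag, diag].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

During contrastive finetuning, this loss would be minimized over minibatches of (image, prompt) pairs with both encoders unfrozen, exactly as in pretraining; only the text side changes, from web captions to class-descriptive prompts.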

Sachin Goyal, Ananya Kumar, Sankalp Garg, Zico Kolter, Aditi Raghunathan• 2022

Related benchmarks

| Task                 | Dataset          | Result          | Rank |
|----------------------|------------------|-----------------|------|
| Image Classification | ImageNet A       | Top-1 Acc 48.1  | 654  |
| Image Classification | Stanford Cars    | Accuracy 89.6   | 635  |
| Image Classification | ImageNet V2      | Top-1 Acc 73    | 611  |
| Image Classification | ImageNet-1K      | --              | 600  |
| Image Classification | Food-101         | --              | 542  |
| Image Classification | DTD              | Accuracy 76.74  | 485  |
| Action Recognition   | UCF101           | --              | 431  |
| Image Classification | ImageNet         | --              | 431  |
| Classification       | Cars             | Accuracy 83.19  | 395  |
| Image Classification | Oxford-IIIT Pets | Accuracy 77.6   | 306  |

Showing 10 of 98 rows
