
Finetune like you pretrain: Improved finetuning of zero-shot vision models

About

Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety of benchmarks. However, recent works like WiSE-FT (Wortsman et al., 2021) and LP-FT (Kumar et al., 2022) have shown that even subtle differences in the finetuning process can lead to surprisingly large differences in final performance, both on in-distribution (ID) and out-of-distribution (OOD) data. In this work, we show that a natural and simple approach of mimicking contrastive pretraining consistently outperforms alternative finetuning approaches. Specifically, we cast downstream class labels as text prompts and continue optimizing the contrastive loss between image embeddings and class-descriptive prompt embeddings (contrastive finetuning). Our method consistently outperforms baselines across 7 distribution-shift, 6 transfer-learning, and 3 few-shot-learning benchmarks. On WILDS-iWildCam, our proposed approach FLYP outperforms the top of the leaderboard by $2.3\%$ ID and $2.7\%$ OOD, giving the highest reported accuracy. Averaged across 7 OOD datasets (2 WILDS and 5 ImageNet-associated shifts), FLYP gives gains of $4.2\%$ OOD over standard finetuning and outperforms the current state of the art (LP-FT) by more than $1\%$ both ID and OOD. Similarly, on 3 few-shot learning benchmarks, our approach gives gains of up to $4.6\%$ over standard finetuning and $4.4\%$ over the state of the art. In total, these benchmarks establish contrastive finetuning as a simple, intuitive, and state-of-the-art approach for supervised finetuning of image-text models like CLIP. Code is available at https://github.com/locuslab/FLYP.
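The core idea above is that finetuning keeps the pretraining objective: each class label is rendered as a text prompt, and the symmetric CLIP-style contrastive loss is optimized between image embeddings and prompt embeddings. Below is a minimal NumPy sketch of that symmetric contrastive loss; the function name, the prompt template in the comment, and the temperature value are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss (a sketch, not the FLYP code).

    image_emb: (batch, dim) image encoder outputs.
    text_emb:  (batch, dim) text encoder outputs for the matching prompts,
               e.g. "a photo of a {class name}" for each image's label.
    """
    # L2-normalize both sets of embeddings.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise cosine similarities, scaled by the temperature.
    logits = img @ txt.T / temperature  # shape (batch, batch)
    n = logits.shape[0]
    diag = np.arange(n)  # matching image/prompt pairs lie on the diagonal

    def cross_entropy(l):
        # Numerically stable log-softmax over each row, then pick the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[diag, diag].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

During contrastive finetuning, this loss would be minimized over minibatches of (image, prompt) pairs with both encoders unfrozen, exactly as in pretraining; only the text side changes, from web captions to class-descriptive prompts.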

Sachin Goyal, Ananya Kumar, Sankalp Garg, Zico Kolter, Aditi Raghunathan• 2022

Related benchmarks

| Task                 | Dataset          | Result          | Rank |
|----------------------|------------------|-----------------|------|
| Image Classification | ImageNet A       | Top-1 Acc 48.1  | 654  |
| Image Classification | Stanford Cars    | Accuracy 89.6   | 635  |
| Image Classification | ImageNet V2      | Top-1 Acc 73    | 611  |
| Image Classification | ImageNet-1K      | --              | 600  |
| Image Classification | Food-101         | --              | 542  |
| Image Classification | DTD              | Accuracy 76.74  | 485  |
| Action Recognition   | UCF101           | --              | 431  |
| Image Classification | ImageNet         | --              | 431  |
| Classification       | Cars             | Accuracy 83.19  | 395  |
| Image Classification | Oxford-IIIT Pets | Accuracy 77.6   | 306  |

Showing 10 of 98 rows
