Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models
About
Advancements in vision-language models (VLMs) have propelled the field of computer vision, particularly in the zero-shot learning setting. Despite their promise, the effectiveness of these models often diminishes due to domain shifts in test environments. To address this, we introduce the Test-Time Prototype Shifting (TPS) framework, a pioneering approach designed to adapt VLMs to test datasets using unlabeled test inputs. Our method is based on the notion of modulating per-class prototypes in the shared embedding space. By pre-computing and caching prototypes generated with the pre-trained text encoder, TPS not only facilitates optimization-free prototype reuse for subsequent predictions but also enables seamless integration with current advancements in prompt engineering. At test time, TPS dynamically learns shift vectors for each prototype based solely on the given test sample, effectively bridging the domain gap and enhancing classification accuracy. A notable aspect of our framework is its significantly reduced memory and computational demands when compared to conventional text-prompt tuning methods. Extensive evaluations across 15 image classification datasets involving natural distribution shifts and cross-dataset generalization, as well as in context-dependent visual reasoning, demonstrate TPS's superior performance, achieving state-of-the-art results while reducing resource requirements.
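The prototype-shifting idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 3-dimensional vectors stand in for cached text-encoder prototypes and an image feature, the entropy-minimization objective and the single gradient step are assumptions (the abstract does not state the objective), and the gradient is taken by central finite differences so that the example needs no autograd library. The names `predict`, `learn_shifts`, `scale`, `lr`, and `eps` are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predict(feature, prototypes, shifts, scale=5.0):
    """Class probabilities from cosine similarity to the shifted prototypes."""
    logits = []
    for proto, shift in zip(prototypes, shifts):
        shifted = normalize([p + s for p, s in zip(proto, shift)])
        logits.append(scale * sum(f * q for f, q in zip(feature, shifted)))
    return softmax(logits)

def learn_shifts(feature, prototypes, lr=0.01, eps=1e-5):
    """One gradient step on per-class shift vectors, starting from zero.

    Assumption: the objective is the entropy of the prediction for this
    single test feature. Gradients use central finite differences.
    """
    shifts = [[0.0] * len(p) for p in prototypes]
    grads = [[0.0] * len(p) for p in prototypes]
    for c in range(len(prototypes)):
        for d in range(len(prototypes[c])):
            shifts[c][d] = eps
            h_plus = entropy(predict(feature, prototypes, shifts))
            shifts[c][d] = -eps
            h_minus = entropy(predict(feature, prototypes, shifts))
            shifts[c][d] = 0.0
            grads[c][d] = (h_plus - h_minus) / (2 * eps)
    # The cached prototypes are never modified; only the shifts are learned.
    return [[-lr * g for g in row] for row in grads]

# Toy stand-ins: the prototypes are computed once and reused for every sample.
cached_prototypes = [normalize(v) for v in
                     ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])]
test_feature = normalize([0.8, 0.5, 0.2])

no_shift = [[0.0] * 3 for _ in cached_prototypes]
probs_before = predict(test_feature, cached_prototypes, no_shift)
learned_shifts = learn_shifts(test_feature, cached_prototypes)
probs_after = predict(test_feature, cached_prototypes, learned_shifts)
print(entropy(probs_before), entropy(probs_after))
```

In this sketch the only learned parameters are one shift vector per class, which is consistent with the abstract's point that the text encoder does not need to be re-run or back-propagated through at test time. A real setup would replace the toy vectors with encoder outputs and the finite differences with automatic differentiation.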
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | ImageNet-R | Top-1 Acc 76.98 | 474 |
| Fine grained classification | Aircraft | Top-1 Acc 24.78 | 62 |
| Fine grained classification | EuroSAT | Accuracy 42.56 | 57 |
| Image Classification | ImageNet A | Accuracy 58.19 | 50 |
| Fine-grained Image Classification | UCF101 | Accuracy 67.46 | 34 |
| Image Classification | ImageNet V | -- | 31 |
| Fine grained classification | Food101 | -- | 30 |
| Fine grained classification | SUN397 | Top-1 Accuracy 64.68 | 25 |
| Fine grained classification | Pets | Accuracy 87.44 | 22 |
| Image Classification | ImageNet Natural Distribution Shifts suite (ImageNet, ImageNet-A, ImageNet-V2, ImageNet-R, ImageNet-Sketch) (test) | Top-1 Accuracy (ImageNet) 70.19 | 21 |