# Class Incremental Learning with Pre-trained Vision-Language Models

## About
With the advent of large-scale pre-trained models, interest in adapting and exploiting them for continual learning scenarios has grown. In this paper, we propose an approach to exploiting pre-trained vision-language models (e.g. CLIP) that enables further adaptation instead of relying only on zero-shot transfer to new tasks. We augment a pre-trained CLIP model with additional layers after the Image Encoder or before the Text Encoder. We investigate three strategies: a Linear Adapter and a Self-attention Adapter, each operating on the image embedding, and Prompt Tuning, which instead modifies the prompts input to the CLIP text encoder. We also propose a method for parameter retention in the adapter layers that uses a measure of parameter importance to better balance stability and plasticity during incremental learning. Our experiments demonstrate that the simplest solution -- a single Linear Adapter layer with parameter retention -- produces the best results. Experiments on several conventional benchmarks consistently show a significant margin of improvement over the current state-of-the-art.
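The two ingredients of the best-performing variant can be sketched compactly. The following is a minimal NumPy illustration, not the paper's implementation: `linear_adapter` stands in for the single linear layer applied to frozen CLIP image embeddings, and `retention_penalty` shows one common instantiation of importance-weighted parameter retention (an EWC-style quadratic anchor); the abstract does not specify the exact importance measure, so the function names and the penalty form here are assumptions.

```python
import numpy as np

def linear_adapter(image_emb, W, b):
    """Map frozen CLIP image embeddings through a learned linear layer.

    image_emb : (batch, d) array of image-encoder outputs.
    W, b      : the adapter's trainable weight (d, d) and bias (d,).
    A residual connection is a possible design choice; omitted here.
    """
    return image_emb @ W + b

def retention_penalty(params, old_params, importance, lam=1.0):
    """Importance-weighted retention loss on the adapter parameters.

    Penalizes movement of each parameter away from its value after the
    previous task, scaled by a per-parameter importance estimate, so
    parameters deemed important for earlier tasks stay stable while
    unimportant ones remain plastic.  The quadratic (EWC-style) form
    is an illustrative assumption, not the paper's exact measure.
    """
    return lam * sum(
        float(np.sum(imp * (p - p_old) ** 2))
        for p, p_old, imp in zip(params, old_params, importance)
    )
```

In training, the penalty would simply be added to the classification loss for the current task, with `lam` trading off stability against plasticity.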
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Class-incremental learning | CIFAR-100 | Average Accuracy | 83.46 | 60 |
| Class-incremental learning | ImageNet-R 10-task | -- | -- | 44 |
| Class-incremental learning | ImageNet-R 20-task | Average Accuracy | 82.81 | 33 |
| Class-incremental learning | CIFAR-100 10-task | Accuracy | 84.32 | 29 |
| Class-incremental learning | ImageNet-R 5-task | Average Accuracy (A_bar) | 81.52 | 27 |
| Class-incremental learning | CIFAR-100 20-task | Average Accuracy | 84.52 | 26 |
| Class-incremental learning | Mini-ImageNet100 5-task | Last-task Accuracy | 93.57 | 12 |
| Class-incremental learning | Mini-ImageNet100 10-task | Last-task Accuracy | 93.07 | 12 |