
Class Incremental Learning with Pre-trained Vision-Language Models

About

With the advent of large-scale pre-trained models, interest in adapting and exploiting them for continual learning scenarios has grown. In this paper, we propose an approach to exploiting pre-trained vision-language models (e.g. CLIP) that enables further adaptation instead of relying only on zero-shot learning of new tasks. We augment a pre-trained CLIP model with additional layers after the Image Encoder or before the Text Encoder. We investigate three strategies: a Linear Adapter and a Self-attention Adapter, both operating on the image embedding, and Prompt Tuning, which instead modifies the prompts input to the CLIP text encoder. We also propose a method for parameter retention in the adapter layers that uses a measure of parameter importance to better maintain stability and plasticity during incremental learning. Our experiments demonstrate that the simplest solution -- a single Linear Adapter layer with parameter retention -- produces the best results. Experiments on several conventional benchmarks consistently show a significant margin of improvement over the current state-of-the-art.
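The winning recipe in the abstract -- a single linear layer on top of the frozen CLIP image embedding, with an importance-weighted penalty that discourages drift of parameters that mattered for earlier tasks -- can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding dimension, the importance estimate (gradient magnitudes here, EWC-style), and the quadratic penalty are all assumptions.

```python
import torch
import torch.nn as nn


class LinearAdapter(nn.Module):
    """Single linear layer applied to a frozen CLIP image embedding,
    with importance-based parameter retention (hypothetical sketch)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        # Accumulated importance per weight, and a snapshot of the weights
        # after the previous task; both drive the retention penalty.
        self.register_buffer("importance", torch.zeros_like(self.fc.weight))
        self.register_buffer("old_weight", self.fc.weight.detach().clone())

    def forward(self, image_embed: torch.Tensor) -> torch.Tensor:
        # Adapt the (frozen) CLIP image embedding; same shape in and out.
        return self.fc(image_embed)

    def retention_loss(self) -> torch.Tensor:
        # Quadratic penalty on drift of important parameters: keeps
        # stability on old tasks while unimportant weights stay plastic.
        return (self.importance * (self.fc.weight - self.old_weight) ** 2).sum()

    def end_task(self, grad_estimate: torch.Tensor) -> None:
        # After finishing a task: accumulate an importance estimate
        # (e.g. gradient magnitudes) and snapshot the current weights.
        self.importance += grad_estimate.abs()
        self.old_weight = self.fc.weight.detach().clone()


adapter = LinearAdapter(dim=512)
img = torch.randn(4, 512)                  # batch of CLIP image embeddings
out = adapter(img)                         # adapted embeddings, shape (4, 512)
loss = out.pow(2).mean() + 0.1 * adapter.retention_loss()
```

Before the first task the importance buffer is all zeros, so the retention term vanishes and the adapter trains freely; the penalty only bites once `end_task` has recorded which weights earlier tasks relied on.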

Xialei Liu, Xusheng Cao, Haori Lu, Jia-wen Xiao, Andrew D. Bagdanov, Ming-Ming Cheng• 2023

Related benchmarks

Task | Dataset | Result | Rank
Class-incremental learning | CIFAR-100 | Average Accuracy: 83.46 | 60
Class-incremental learning | ImageNet-R 10-task | -- | 44
Class-incremental learning | ImageNet-R 20-task | Average Accuracy: 82.81 | 33
Class-incremental learning | CIFAR100 10 Tasks | Accuracy: 84.32 | 29
Class-incremental learning | ImageNet-R 5-task | Avg Accuracy (A_bar): 81.52 | 27
Class-incremental learning | CIFAR-100 20 tasks | Avg Acc: 84.52 | 26
Class-incremental learning | Mini-ImageNet100 5-task setting | Accuracy (Last Task): 93.57 | 12
Class-incremental learning | Mini-ImageNet100 (10-task setting) | Last Accuracy: 93.07 | 12
