Continual Learning with Vision-Language Models via Semantic-Geometry Preservation
About
Continual learning of pretrained vision-language models (VLMs) is prone to catastrophic forgetting. Current approaches adapt to new tasks without explicitly preserving the cross-modal semantic geometry inherited from pretraining and previous stages, which allows new-task supervision to induce geometric distortion. We observe that the most pronounced drift tends to concentrate in vulnerable neighborhoods near the old-new semantic interface, where shared visual patterns are easily re-explained by new textual semantics. To address this under an exemplar-free constraint, we propose Semantic Geometry Preservation for Continual Learning (SeGP-CL). SeGP-CL first probes the drift-prone region by constructing a compact set of adversarial anchors with dual-targeted projected gradient descent (DPGD), which drives selected new-task seeds toward old-class semantics while remaining faithful in raw visual space. During training, we preserve cross-modal structure through anchor-guided cross-modal geometry distillation (ACGD), and stabilize the textual reference frame across tasks via a lightweight text semantic-geometry regularization (TSGR). After training, we estimate anchor-induced raw-space drift to transfer old visual prototypes and perform dual-path inference by fusing cross-modal and visual cues. Extensive experiments on five continual learning benchmarks demonstrate that SeGP-CL consistently improves stability and forward transfer, achieving state-of-the-art performance while better preserving the semantic geometry of VLMs.
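The anchor-probing step can be illustrated with a toy sketch. The abstract does not give the DPGD objective, so everything below is an assumption made for illustration: the image encoder is a fixed linear map `W`, the two targets are an old-class text embedding and an old-class visual prototype, faithfulness in raw visual space is enforced with an L-infinity ball of radius `eps`, and the function name `dual_targeted_pgd` and all parameter values are hypothetical. It is not the paper's implementation.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def cosine_grad(z, target):
    """Gradient of cosine(z, target) with respect to z."""
    z_norm = np.linalg.norm(z)
    t_unit = target / np.linalg.norm(target)
    z_unit = z / z_norm
    return (t_unit - (z_unit @ t_unit) * z_unit) / z_norm


def dual_targeted_pgd(x_seed, W, text_target, proto_target,
                      eps=0.05, step=0.0125, n_steps=10):
    """Toy PGD: move W @ x toward two assumed old-class targets.

    Returns the best-scoring iterate and its score. The perturbation is
    kept inside an L-infinity ball of radius eps around x_seed.
    """
    def score(x):
        z = W @ x
        return cosine(z, text_target) + cosine(z, proto_target)

    x = x_seed.copy()
    best_x, best_score = x.copy(), score(x)
    for _ in range(n_steps):
        z = W @ x
        grad_z = cosine_grad(z, text_target) + cosine_grad(z, proto_target)
        grad_x = W.T @ grad_z                        # chain rule through the linear encoder
        x = x + step * np.sign(grad_x)               # signed gradient ascent step
        x = np.clip(x, x_seed - eps, x_seed + eps)   # project back onto the eps-ball
        x = np.clip(x, 0.0, 1.0)                     # stay in a valid pixel range
        current = score(x)
        if current > best_score:
            best_x, best_score = x.copy(), current
    return best_x, best_score


# Toy data: 16 "pixels", 8-dimensional embedding space (all values are made up).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
x_seed = rng.uniform(0.2, 0.8, size=16)
text_target = rng.normal(size=8)
proto_target = rng.normal(size=8)

z_seed = W @ x_seed
score_before = cosine(z_seed, text_target) + cosine(z_seed, proto_target)
anchor, score_after = dual_targeted_pgd(x_seed, W, text_target, proto_target)
print(score_before, score_after, np.max(np.abs(anchor - x_seed)))
```

In a real VLM the gradient would come from automatic differentiation through the image encoder rather than from the closed-form expression used here. The bounded perturbation is what keeps the resulting anchor close to the seed image while its embedding shifts toward old-class semantics.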
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Classification | Food101 | Accuracy | 84.5 | 457 |
| Class-incremental learning | CUB200 10 Tasks | FN (Final Acc) | 80.1 | 59 |
| Class-incremental learning | ImageNet-R 10-task | -- | -- | 54 |
| Image Classification | ImageNet 1k (full) | Top-1 Acc | 67.6 | 53 |
| Class-incremental classification | CIFAR100 10 Tasks | Average Accuracy | 89.8 | 16 |
| Class-incremental classification | ImageNet Sub 10 tasks | Average Accuracy | 89.9 | 16 |
| Image Classification | Oxford Pets | Accuracy | 87.7 | 15 |
| Class-incremental classification | UCF101 10 Tasks | Average Accuracy | 95.9 | 9 |
| Image Classification | CIFAR-Last | Accuracy | 84.6 | 8 |
| Global Visual-Text Matching | CIFAR100 (test) | Forward Transfer (FWT) | 72.3 | 5 |