Soft-TransFormers for Continual Learning

About

Inspired by the Well-initialized Lottery Ticket Hypothesis (WLTH), we introduce Soft-TransFormers (Soft-TF), a continual learning framework that adapts a frozen pre-trained Transformer through task-specific soft subnetworks: real-valued multiplicative masks over the query, key, value, and output projections of selected self-attention layers. The masks are initialized at one, so optimization starts exactly at the pre-trained solution, and mask-space gradient descent is intrinsically biased toward modulating the backbone's dominant pathways; we prove that, under standard convex-Lipschitz assumptions, both the convergence rate and the parameter drift of mask-only fine-tuning are controlled by the distance from the pre-trained weights to a task-optimal configuration. This bounded drift yields two properties. Since the backbone and per-task masks are never overwritten, forgetting is structurally eliminated. And since every task subnetwork stays near the shared pre-trained solution, a wrong mask still evaluates a near-generalist function, so task-inference errors are largely harmless and class-incremental accuracy is decoupled from task-inference reliability. As a plug-in, Soft-TF couples with L2P, DualPrompt, HiDe-Prompt, and NoRGa, selecting masks by task-key matching, an entropy-gradient criterion, or a learned task-identity classifier. Across class-incremental learning benchmarks -- Split-CIFAR100, Split-ImageNet-R, CUB-200, and 5-Datasets -- Soft-TF consistently outperforms prompt-based, adapter-based, and LoRA-style baselines at comparable trainable-parameter budgets, while keeping inference cost identical to the unmodified backbone.
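The masking idea in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes a single frozen projection matrix instead of the full query, key, value, and output projections, and the names `SoftMaskedLinear`, `add_task`, `forward`, and `sgd_step` are hypothetical. It shows only three things: a mask initialized at one reproduces the pre-trained output, updates touch only the current task's mask, and other tasks' masks are left intact.

```python
# Illustrative sketch only. All names here are hypothetical and are not taken
# from the Soft-TF code base.

class SoftMaskedLinear:
    """A frozen weight matrix modulated by per-task real-valued masks."""

    def __init__(self, weight):
        # Frozen pre-trained weights, stored as tuples so they cannot be mutated.
        self.weight = tuple(tuple(row) for row in weight)
        self.masks = {}  # task id -> mask with the same shape as the weight

    def add_task(self, task_id):
        # Masks start at one, so the masked layer equals the pre-trained layer.
        self.masks[task_id] = [[1.0 for _ in row] for row in self.weight]

    def forward(self, x, task_id=None):
        # Computes (W * M) x elementwise-masked; no mask means the plain backbone.
        mask = self.masks.get(task_id)
        out = []
        for i, row in enumerate(self.weight):
            total = 0.0
            for j, w in enumerate(row):
                m = 1.0 if mask is None else mask[i][j]
                total += w * m * x[j]
            out.append(total)
        return out

    def sgd_step(self, task_id, x, grad_out, lr=0.1):
        # d out_i / d m_ij = w_ij * x_j, so only the task's mask is updated
        # and the step size on each entry scales with the frozen weight.
        mask = self.masks[task_id]
        for i, row in enumerate(self.weight):
            for j, w in enumerate(row):
                mask[i][j] -= lr * grad_out[i] * w * x[j]


layer = SoftMaskedLinear([[0.5, -1.0], [2.0, 0.0]])
layer.add_task(0)
layer.add_task(1)

x = [1.0, 2.0]
base = layer.forward(x)                      # output of the unmodified backbone
same_at_init = layer.forward(x, task_id=0)   # identical, because the mask is all ones

layer.sgd_step(1, x, grad_out=[1.0, 1.0])    # train task 1 only
after_task0 = layer.forward(x, task_id=0)    # unchanged by task 1's update
after_task1 = layer.forward(x, task_id=1)    # now differs from the backbone
```

In this toy setting the gradient on each mask entry is proportional to the corresponding frozen weight, which is one way to read the abstract's remark about mask-space descent favoring the backbone's dominant pathways. The mask for the zero weight never moves.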

Haeyong Kang, Chang D. Yoo • 2024

Related benchmarks

Task	Dataset	Result
Class-incremental learning	CIFAR-100 (10-split)	Accuracy 97.87	87
Continual Learning	CIFAR-100 (10-split)	ACC 92.35	54
Class-incremental learning	5-Datasets	FAA 95.68	49
Class-incremental learning	CUB-200 Split	FAA 97.9	45
Class-incremental learning	Split ImageNet-R 10 incremental tasks	Class Accuracy 82.38	40
Class-incremental learning	CIFAR100 20-Split	Accuracy 99.05	38
Class-incremental learning	CIFAR100 10-Split	Accuracy (ACC) 98.25	22
Class-incremental learning	ImageNet-R (10-Split)	Accuracy 91.94	22
