
Learning to Remember, Learn, and Forget in Attention-Based Models

About

In-Context Learning (ICL) in transformers acts as an online associative memory and is believed to underpin their high performance on complex sequence processing tasks. However, in gated linear attention models, this memory has a fixed capacity and is prone to interference, especially for long sequences. We propose Palimpsa, a self-attention model that views ICL as a continual learning problem that must address a stability-plasticity dilemma. Palimpsa uses Bayesian metaplasticity, where the plasticity of each attention state is tied to an importance state grounded by a prior distribution that captures accumulated knowledge. We demonstrate that various gated linear attention models emerge as specific architecture choices and posterior approximations, and that Mamba2 is a special case of Palimpsa where forgetting dominates. This theoretical link enables the transformation of any non-metaplastic model into a metaplastic one, significantly expanding its memory capacity. Our experiments show that Palimpsa consistently outperforms baselines on the Multi-Query Associative Recall (MQAR) benchmark and on Commonsense Reasoning tasks.
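The abstract describes gated linear attention as an online associative memory whose state is updated with a forgetting gate at each step. A minimal sketch of that recurrence is below; the update rule `S_t = g_t * S_{t-1} + v_t k_t^T` is the standard gated linear attention form, not Palimpsa's exact formulation, and all function and variable names here are illustrative assumptions.

```python
import numpy as np

def gated_linear_attention(keys, values, queries, gates):
    """Run a gated linear attention recurrence over a sequence.

    State update (standard gated linear attention form):
        S_t = g_t * S_{t-1} + v_t k_t^T     # g_t in (0, 1] controls forgetting
    Readout:
        y_t = S_t q_t

    keys, queries: (T, d_k); values: (T, d_v); gates: (T,) scalars.
    With g_t = 1 the state is a pure additive associative memory;
    g_t < 1 trades stability for plasticity by decaying old associations.
    """
    d_k, d_v = keys.shape[1], values.shape[1]
    S = np.zeros((d_v, d_k))          # associative memory state
    outputs = []
    for k, v, q, g in zip(keys, values, queries, gates):
        S = g * S + np.outer(v, k)    # decay old content, write new pair
        outputs.append(S @ q)         # recall by inner product with the query
    return np.stack(outputs)
```

With orthogonal keys and no forgetting (`g_t = 1`), querying with a stored key recovers the associated value exactly, which is the fixed-capacity associative-recall behavior the abstract refers to; interference appears once keys overlap or the gate decays earlier writes.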

Djohan Bonnet, Jamie Lohoff, Jan Finkbeiner, Elidona Skhikerujah, Emre Neftci • 2026

Related benchmarks

| Task                  | Dataset       | Metric   | Result | Rank |
|-----------------------|---------------|----------|--------|------|
| Commonsense Reasoning | HellaSwag     | Accuracy | 51.63  | 1891 |
| Commonsense Reasoning | WinoGrande    | Accuracy | 57.06  | 1085 |
| Commonsense Reasoning | PIQA          | Accuracy | 71.06  | 751  |
| Language Modeling     | WikiText      | PPL      | 19.02  | 732  |
| Language Modeling     | LAMBADA       | Accuracy | 43.55  | 268  |
| Commonsense Reasoning | ARC Challenge | Accuracy | 34.64  | 190  |
| Commonsense Reasoning | SocialIQA     | Accuracy | 41.61  | 116  |
| Commonsense Reasoning | ARC Easy      | Accuracy | 67.97  | 72   |
