Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory

About

Scaling conditional memory offers a promising way to increase language-model capacity, but existing methods such as Engram learn large memory tables from scratch during pre-training, making memory scaling expensive and sometimes ineffective. We propose Memory Grafting, a conditional memory scaling method that utilizes frozen hidden states from a grafting model as conditional n-gram memory. Given frequent local n-grams, we run the grafting model offline, store final-token hidden representations as memory values, and let the recipient model retrieve them through exact longest-match suffix lookup. Retrieved memories are adapted by lightweight projections and gates, while a hash-based Engram fallback preserves coverage for unmatched contexts. Since the grafting model is only run offline and exact lookup has expected O(1) complexity with respect to memory-bank size, Memory Grafting expands external latent capacity with limited training and inference overhead. Experiments under matched recipient architectures and pre-training budgets show that Memory Grafting improves over both MoE and vanilla Engram baselines. In the 2.8B-scale setting, it improves the average benchmark score from 51.95 for MoE and 52.43 for vanilla Engram to 53.86. In the 0.92B-scale setting, all grafting-model variants improve over the baselines, with Qwen3.5-35B-A3B giving the strongest gains. These results suggest that pretrained models can serve as reusable constructors of external latent memory, providing a practical step toward scaling future language models beyond trainable parameters alone.
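
The retrieval step described above (exact longest-match suffix lookup, with a hashed fallback for unmatched contexts) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `longest_suffix_lookup`, `hashed_fallback`, and `retrieve`, the use of a Python dict as the memory bank, the fallback n-gram order of 2, and the modulo hashing scheme are all assumptions made for illustration. The projection and gating layers that adapt retrieved memories are omitted.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]
NGram = Tuple[int, ...]


def longest_suffix_lookup(
    context: Sequence[int],
    memory_bank: Dict[NGram, Vector],
    max_n: int,
) -> Optional[Tuple[int, Vector]]:
    """Return (n, value) for the longest suffix of `context` stored in the bank.

    Tries suffix lengths max_n, max_n - 1, ..., 1, so the cost is at most
    max_n hash probes regardless of how many entries the bank holds.
    """
    for n in range(min(max_n, len(context)), 0, -1):
        key = tuple(context[-n:])
        value = memory_bank.get(key)
        if value is not None:
            return n, value
    return None


def hashed_fallback(
    context: Sequence[int], table: List[Vector], n: int = 2
) -> Vector:
    """Hypothetical hash-based fallback: map the last n tokens to a table slot.

    Unlike the exact lookup, different n-grams may collide on the same slot.
    """
    key = tuple(context[-n:])
    return table[hash(key) % len(table)]


def retrieve(
    context: Sequence[int],
    memory_bank: Dict[NGram, Vector],
    fallback_table: List[Vector],
    max_n: int,
) -> Tuple[str, Vector]:
    """Prefer an exact grafted memory; otherwise use the hashed table."""
    hit = longest_suffix_lookup(context, memory_bank, max_n)
    if hit is not None:
        return "grafted", hit[1]
    return "fallback", hashed_fallback(context, fallback_table)


# Toy memory bank: keys are token-id n-grams, values stand in for the
# offline hidden states computed by the grafting model.
bank = {(5, 6, 7): [1.0], (6, 7): [2.0], (7,): [3.0]}
table = [[0.1], [0.2], [0.3], [0.4]]

print(longest_suffix_lookup([1, 5, 6, 7], bank, max_n=3))
print(longest_suffix_lookup([9, 6, 7], bank, max_n=3))
print(retrieve([1, 2, 3], bank, table, max_n=3)[0])
```

In this sketch the number of probes is bounded by `max_n` rather than by the size of the bank, which is the property the abstract refers to as expected O(1) lookup with respect to memory-bank size.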

Runxi Cheng, Yuchen Guan, Yongxian Wei, Qianpu Sun, Qixiu Li, Sinan Du, Feng Xiong, Chun Yuan, Yan Lu, Yeyun Gong • 2026

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	WinoGrande	Accuracy 60.93	1581
Sentence Completion	HellaSwag	Accuracy 47.54	440
Multiple-choice Question Answering	ARC Easy	Accuracy 73.4	269
Word Prediction	LAMBADA	Accuracy 48.19	222
Commonsense Reasoning	SocialIQA	Accuracy 42.94	164
Physical Commonsense Reasoning	PIQA	Accuracy (PIQA) 76.17	99
Reading Comprehension	RACE	Accuracy 35.98	86
Boolean Question Answering	BoolQ	Accuracy 62.54	57
