
Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory

About

Scaling conditional memory offers a promising way to increase language-model capacity, but existing methods such as Engram learn large memory tables from scratch during pre-training, making memory scaling expensive and sometimes ineffective. We propose Memory Grafting, a conditional memory scaling method that utilizes frozen hidden states from a grafting model as conditional n-gram memory. Given frequent local n-grams, we run the grafting model offline, store final-token hidden representations as memory values, and let the recipient model retrieve them through exact longest-match suffix lookup. Retrieved memories are adapted by lightweight projections and gates, while a hash-based Engram fallback preserves coverage for unmatched contexts. Since the grafting model is only run offline and exact lookup has expected O(1) complexity with respect to memory-bank size, Memory Grafting expands external latent capacity with limited training and inference overhead. Experiments under matched recipient architectures and pre-training budgets show that Memory Grafting improves over both MoE and vanilla Engram baselines. In the 2.8B-scale setting, it improves the average benchmark score from 51.95 for MoE and 52.43 for vanilla Engram to 53.86. In the 0.92B-scale setting, all grafting-model variants improve over the baselines, with Qwen3.5-35B-A3B giving the strongest gains. These results suggest that pretrained models can serve as reusable constructors of external latent memory, providing a practical step toward scaling future language models beyond trainable parameters alone.
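The retrieval step described above (exact longest-match suffix lookup, with a hash-based fallback for unmatched contexts) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `longest_suffix_match`, `retrieve`, `bank`, `fallback_table`, and `max_n` are hypothetical, the memory values are toy lists rather than real hidden states, and the fallback is a plain modulo hash standing in for the Engram table. The projections and gates that adapt retrieved memories are omitted.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]
NGram = Tuple[int, ...]


def longest_suffix_match(
    tokens: Sequence[int],
    bank: Dict[NGram, Vector],
    max_n: int,
) -> Optional[NGram]:
    """Return the longest suffix of `tokens` (at most max_n tokens) stored in `bank`.

    Tries the longest candidate first and shortens it one token at a time,
    so the cost is at most max_n dictionary lookups regardless of bank size.
    """
    for n in range(min(max_n, len(tokens)), 0, -1):
        key = tuple(tokens[-n:])
        if key in bank:
            return key
    return None


def retrieve(
    tokens: Sequence[int],
    bank: Dict[NGram, Vector],
    fallback_table: List[Vector],
    max_n: int = 4,
) -> Tuple[str, Vector]:
    """Look up a memory value for the current context.

    Returns ("graft", value) on an exact suffix match, otherwise
    ("fallback", value) from a hashed slot (assumed stand-in for Engram).
    """
    key = longest_suffix_match(tokens, bank, max_n)
    if key is not None:
        return "graft", bank[key]
    # No stored n-gram matches: hash the trailing tokens into a fixed-size table.
    suffix = tuple(tokens[-max_n:])
    slot = hash(suffix) % len(fallback_table)
    return "fallback", fallback_table[slot]


# Toy memory bank: n-gram of token ids -> stored "hidden state".
bank = {(5, 6): [1.0], (4, 5, 6): [2.0]}
fallback_table = [[0.0], [0.5]]
```

Because each query performs at most `max_n` dictionary lookups, the expected cost does not grow with the number of stored n-grams, which matches the O(1) claim with respect to memory-bank size.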

Runxi Cheng, Yuchen Guan, Yongxian Wei, Qianpu Sun, Qixiu Li, Sinan Du, Feng Xiong, Chun Yuan, Yan Lu, Yeyun Gong • 2026

Related benchmarks

Task                                Dataset     Result                 Rank
Commonsense Reasoning               WinoGrande  Accuracy 60.93         1442
Sentence Completion                 HellaSwag   Accuracy 47.54         364
Multiple-choice Question Answering  ARC Easy    Accuracy 73.4          257
Word Prediction                     LAMBADA     Accuracy 48.19         192
Commonsense Reasoning               SocialIQA   Accuracy 42.94         158
Physical Commonsense Reasoning      PIQA        Accuracy (PIQA) 76.17  99
Reading Comprehension               RACE        Accuracy 35.98         59
Boolean Question Answering          BoolQ       Accuracy 62.54         57
