MIDUS: Memory-Infused Depth Up-Scaling
About
Scaling large language models (LLMs) demands approaches that increase capacity without incurring excessive parameter growth or inference cost. Depth Up-Scaling (DUS) has emerged as a promising strategy that duplicates layers and applies Continual Pre-training (CPT), but its reliance on feed-forward networks (FFNs) limits efficiency and attainable gains. We introduce Memory-Infused Depth Up-Scaling (MIDUS), which replaces the FFNs in duplicated blocks with a head-wise memory layer (HML). Motivated by the observation that attention heads play distinct roles both across and within layers, MIDUS assigns an independent memory bank to each head, enabling head-wise retrieval that injects information into subsequent layers while preserving head-wise functional structure. This design combines sparse memory access with head-wise representations and an efficient per-head value factorization module, relaxing the usual efficiency-performance trade-off. Across our CPT experiments, MIDUS delivers robust performance improvements over strong DUS baselines while maintaining a highly efficient parameter footprint. These findings establish MIDUS, with its head-wise memory design, as a compelling and resource-efficient alternative to conventional FFN replication for depth up-scaling.
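To make the architecture concrete, below is a minimal PyTorch sketch of a head-wise memory layer with per-head memory banks, sparse top-k retrieval, and a per-head low-rank value factorization. All names (`HeadwiseMemoryLayer`, `mem_keys`, `val_factor`), dimensions, and the top-k routing scheme are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadwiseMemoryLayer(nn.Module):
    """Illustrative sketch of a head-wise memory (HML) layer.

    Each attention head owns an independent memory bank of keys and
    low-rank value codes. For every token, each head retrieves its
    top-k memory slots (sparse access), mixes the retrieved value
    codes by softmax-normalized scores, and expands them back to the
    head dimension with a per-head factorization matrix.
    """

    def __init__(self, n_heads=8, head_dim=64, n_slots=4096, rank=16, top_k=32):
        super().__init__()
        self.n_heads, self.head_dim, self.top_k = n_heads, head_dim, top_k
        # Per-head memory keys: (heads, slots, head_dim)
        self.mem_keys = nn.Parameter(torch.randn(n_heads, n_slots, head_dim) * 0.02)
        # Per-head value codes stored in a low-rank space: (heads, slots, rank)
        self.mem_vals = nn.Parameter(torch.randn(n_heads, n_slots, rank) * 0.02)
        # Per-head factorization that expands rank -> head_dim
        self.val_factor = nn.Parameter(torch.randn(n_heads, rank, head_dim) * 0.02)
        self.out_proj = nn.Linear(n_heads * head_dim, n_heads * head_dim)

    def forward(self, x):
        # x: (batch, seq, n_heads * head_dim), e.g. the block's hidden state
        b, t, _ = x.shape
        q = x.view(b, t, self.n_heads, self.head_dim)             # per-head queries
        # Scores against each head's own memory bank: (b, t, heads, slots)
        scores = torch.einsum("bthd,hsd->bths", q, self.mem_keys)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)   # sparse access
        weights = F.softmax(topk_scores, dim=-1)                  # (b, t, h, k)
        # Gather the selected low-rank value codes: (b, t, h, k, rank)
        idx = topk_idx.permute(2, 0, 1, 3)                        # (h, b, t, k)
        gathered = torch.stack(
            [self.mem_vals[h][idx[h]] for h in range(self.n_heads)], dim=0
        )                                                         # (h, b, t, k, r)
        gathered = gathered.permute(1, 2, 0, 3, 4)                # (b, t, h, k, r)
        mixed = torch.einsum("bthk,bthkr->bthr", weights, gathered)
        # Expand back to head_dim with the per-head factor, then merge heads
        out = torch.einsum("bthr,hrd->bthd", mixed, self.val_factor)
        return self.out_proj(out.reshape(b, t, -1))


if __name__ == "__main__":
    layer = HeadwiseMemoryLayer()
    h = torch.randn(2, 10, 8 * 64)   # (batch, seq, model_dim)
    print(layer(h).shape)            # torch.Size([2, 10, 512])
```

Storing value codes at a small rank and expanding them per head keeps the memory parameter count low, which is the kind of efficiency-performance balance the per-head value factorization is meant to provide.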
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-task Language Understanding | MMLU | Accuracy | 37.82 | 842 |
| Language Modeling | WikiText-103 (test) | Perplexity | 7.4 | 524 |
| Boolean Question Answering | BoolQ | Accuracy | 66.21 | 307 |
| Commonsense Reasoning | WinoGrande | Accuracy | 61.56 | 231 |
| Question Answering | ARC | Accuracy | 66.33 | 154 |
| Logical Reasoning | LogiQA | Accuracy | 23.2 | 84 |
| Physical Reasoning | PIQA | Accuracy | 75.9 | 44 |
| Commonsense Question Answering | CSQA | Accuracy | 50.04 | 44 |
| Zero-shot Question Answering and Reasoning | Zero-shot evaluation suite (ARC, LogiQA, WinoGrande, CSQA, BoolQ, PIQA, MMLU) | Accuracy (ARC) | 83.5 | 21 |
| Language Modeling | Wikipedia | Perplexity | 11.64 | 14 |