MIDUS: Memory-Infused Depth Up-Scaling
About
Expanding pre-trained language models offers a practical way to increase capacity without training larger models from scratch. Depth Up-Scaling (DUS) does so by duplicating Transformer blocks and inserting them into a pre-trained backbone. This process also duplicates FFN-heavy blocks, increasing parameter and compute cost while adding capacity through a block-level dense residual branch. Yet prior work suggests that added capacity need not remain tied to dense FFN branches, while attention heads often play heterogeneous roles, motivating more efficient head-level residual corrections. We propose Memory-Infused Depth Up-Scaling (MIDUS), which replaces the duplicated FFN branches with memory layers and turns added depth into lightweight retrieval-based residual capacity. We introduce a Head-wise Memory Layer (HML), which combines multi-head product-key memory with Head-wise Implicit Value Expansion (HIVE). HML assigns each head a distinct key space, while HIVE realizes head-specific values from a shared latent bank through compact projections. Alongside empirical improvements in performance and efficiency, our head-importance and fixed-retrieval structural analyses characterize HML with HIVE as a structurally distinct, head-conditioned alternative to FFN-based residual expansion.
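To make the retrieval idea in the abstract concrete, here is a minimal sketch of a head-wise product-key memory read in plain Python. It is not the paper's implementation. All function names, dimensions, and the random weights are hypothetical, and the exact way HIVE forms head-specific values is an assumption based only on the abstract's description: each head has its own sub-keys, all heads read from one shared latent bank, and a small per-head projection turns the retrieved latent vector into a head-specific value.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(scores, k):
    """Return (index, score) pairs for the k largest scores."""
    return sorted(enumerate(scores), key=lambda p: p[1], reverse=True)[:k]

def product_key_lookup(query, sub_keys_a, sub_keys_b, k):
    """Pick k of the len(sub_keys_a) * len(sub_keys_b) slots without scoring them all."""
    half = len(query) // 2
    q_a, q_b = query[:half], query[half:]
    best_a = top_k([dot(q_a, key) for key in sub_keys_a], k)
    best_b = top_k([dot(q_b, key) for key in sub_keys_b], k)
    n_b = len(sub_keys_b)
    # Slot (i, j) has flat index i * n_b + j and score = sum of its two half scores.
    candidates = [(i * n_b + j, s_a + s_b) for i, s_a in best_a for j, s_b in best_b]
    candidates.sort(key=lambda p: p[1], reverse=True)
    return candidates[:k]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_memory_read(query, sub_keys_a, sub_keys_b, latent_bank, head_proj, k):
    """One head: retrieve k shared latent rows, mix them, apply this head's projection."""
    selected = product_key_lookup(query, sub_keys_a, sub_keys_b, k)
    weights = softmax([score for _, score in selected])
    r = len(latent_bank[0])
    mixed = [
        sum(w * latent_bank[idx][c] for w, (idx, _) in zip(weights, selected))
        for c in range(r)
    ]
    # head_proj has shape (r, d_out); assumed stand-in for the head-specific expansion.
    d_out = len(head_proj[0])
    return [sum(mixed[c] * head_proj[c][o] for c in range(r)) for o in range(d_out)]

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Arbitrary toy sizes: 2 heads, 8 x 8 = 64 memory slots, latent width 4, output width 6.
N_SUB, D_QUERY, R_LATENT, D_OUT, TOP_K = 8, 6, 4, 6, 3
latent_bank = rand_matrix(N_SUB * N_SUB, R_LATENT)   # shared by all heads
heads = [
    {
        "keys_a": rand_matrix(N_SUB, D_QUERY // 2),   # distinct key space per head
        "keys_b": rand_matrix(N_SUB, D_QUERY // 2),
        "proj": rand_matrix(R_LATENT, D_OUT),         # compact per-head projection
    }
    for _ in range(2)
]
query = [rng.gauss(0.0, 1.0) for _ in range(D_QUERY)]
outputs = [
    head_memory_read(query, h["keys_a"], h["keys_b"], latent_bank, h["proj"], TOP_K)
    for h in heads
]
```

The point of the product-key split is cost: each head scores 2 × 8 sub-keys rather than all 64 slots, yet (absent ties) still recovers the same top-k slots as an exhaustive search. Storing one latent bank and only a small projection per head keeps the per-head parameter count low compared with giving every head its own full value table.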
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Multi-task Language Understanding | MMLU | Accuracy | 37.82 | 881 |
| Language Modeling | WikiText-103 (test) | Perplexity | 7.4 | 703 |
| Commonsense Reasoning | WinoGrande | Accuracy | 61.56 | 453 |
| Boolean Question Answering | BoolQ | Accuracy | 66.21 | 350 |
| Question Answering | ARC | Accuracy | 66.33 | 230 |
| Logical Reasoning | LogiQA | Accuracy | 23.2 | 100 |
| Physical Reasoning | PIQA | Accuracy | 75.9 | 90 |
| Commonsense Question Answering | CSQA | Accuracy | 50.04 | 61 |
| Language Modeling | Wikipedia | Perplexity | 11.64 | 43 |
| Zero-shot Question Answering and Reasoning | Evaluation Suite Zero-shot (ARC, LogiQA, Wino, CSQA, BoolQ, PIQA, MMLU) | ARC | 83.5 | 21 |