Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves
About
Depth-recurrence facilitates latent reasoning by sharing parameters across depths. However, prior work lacks combined FLOP-, parameter-, and memory-matched baselines, underutilizes depth-recurrence due to partially fixed layer stacks, and ignores the bottleneck of a constant hidden size that restricts many-step latent reasoning. To address this, we introduce a modular framework of depth-recurrent attention mixtures (Dreamer), combining sequence attention, depth attention, and sparse expert attention. It alleviates the hidden-size bottleneck through attention along the depth dimension, decouples scaling dimensions, and allows depth-recurrent models to scale efficiently and effectively. Across language reasoning benchmarks, our models require 2 to 8x fewer training tokens to reach the same accuracy as FLOP-, parameter-, and memory-matched SOTA models, and outperform roughly 2x larger SOTA models trained on the same number of tokens. We further present insights into knowledge usage across depths, e.g., showing 2 to 11x greater expert selection diversity than SOTA MoEs.
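The paper's implementation is not reproduced here, but the core idea of attention along depth can be sketched in a few lines: a weight-shared block is applied repeatedly, and at each recurrence step the current hidden state attends over the hidden states produced at all earlier depths, so the effective state grows with depth instead of being squeezed through a constant hidden size. All names below (`depth_recurrent_step`, the weight matrices, the cache) are hypothetical illustrations, not the paper's API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product attention: (n_q, d) x (n_k, d) -> (n_q, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def depth_recurrent_step(h, depth_cache, Wq, Wk, Wv):
    """One recurrence step for a single token vector h: h attends over
    the hidden states from all depths so far (depth attention), then is
    updated residually. The same weights are reused at every depth."""
    depth_cache.append(h)
    ks = np.stack([c @ Wk for c in depth_cache])  # (depth, d)
    vs = np.stack([c @ Wv for c in depth_cache])  # (depth, d)
    q = (h @ Wq)[None, :]                         # (1, d)
    return h + attend(q, ks, vs)[0]

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
h = rng.standard_normal(d)
cache = []          # grows with depth: no fixed-size latent bottleneck
for _ in range(4):  # four recurrence steps with shared parameters
    h = depth_recurrent_step(h, cache, Wq, Wk, Wv)
print(h.shape)  # (8,)
```

The cache of per-depth states plays the same role for depth that the KV cache plays for sequence position; in the full model this would be combined with ordinary sequence attention and sparse expert attention.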
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MathQA (test) | Accuracy | 50.8 | 33 |
| Mathematical Reasoning | MMLU Mathematics (test) | Average Accuracy | 50 | 18 |
| Mathematical Reasoning | MATH (test) | Accuracy | 54.5 | 14 |
| Language Modeling | Maths-College ajibawa-2023 (val) | PPL | 5.9 | 6 |
| Mathematical Reasoning | GSM8K (test) | Accuracy | 56.3 | 6 |
| Mathematical Reasoning | Average (GSM8K, MATH, MMLU-MATH, MathQA) (test) | Accuracy | 51.4 | 6 |