Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves
About
Depth-recurrence facilitates latent reasoning by sharing parameters across depths. However, prior work lacks combined FLOP-, parameter-, and memory-matched baselines, underutilizes depth-recurrence due to partially fixed layer stacks, and ignores the bottleneck of a constant hidden size that restricts many-step latent reasoning. To address this, we introduce a modular framework of depth-recurrent attention mixtures (Dreamer), combining sequence attention, depth attention, and sparse expert attention. It alleviates the hidden-size bottleneck through attention along the depth dimension, decouples scaling dimensions, and allows depth-recurrent models to scale efficiently and effectively. Across language reasoning benchmarks, our models require 2 to 8x fewer training tokens to reach the same accuracy as FLOP-, parameter-, and memory-matched SOTA models, and outperform roughly 2x larger SOTA models trained on the same number of tokens. We further present insights into knowledge usage across depths, e.g., showing 2 to 11x greater expert selection diversity than SOTA MoEs.
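The paper's implementation is not reproduced here, but the core idea of attention along depth can be sketched in a few lines: a weight-shared block is applied repeatedly, and at each recurrence step the current hidden state attends over the hidden states produced at all earlier depths, so the effective state grows with depth instead of being squeezed through a constant hidden size. All names below (`depth_recurrent_step`, the weight matrices, the cache) are hypothetical illustrations, not the paper's API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product attention: (n_q, d) x (n_k, d) -> (n_q, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def depth_recurrent_step(h, depth_cache, Wq, Wk, Wv):
    """One recurrence step for a single token vector h: h attends over
    the hidden states from all depths so far (depth attention), then is
    updated residually. The same weights are reused at every depth."""
    depth_cache.append(h)
    ks = np.stack([c @ Wk for c in depth_cache])  # (depth, d)
    vs = np.stack([c @ Wv for c in depth_cache])  # (depth, d)
    q = (h @ Wq)[None, :]                         # (1, d)
    return h + attend(q, ks, vs)[0]

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
h = rng.standard_normal(d)
cache = []          # grows with depth: no fixed-size latent bottleneck
for _ in range(4):  # four recurrence steps with shared parameters
    h = depth_recurrent_step(h, cache, Wq, Wk, Wv)
print(h.shape)  # (8,)
```

The cache of per-depth states plays the same role for depth that the KV cache plays for sequence position; in the full model this would be combined with ordinary sequence attention and sparse expert attention.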
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MathQA (test) | Accuracy | 50.8 | 33 |
| Mathematical Reasoning | MMLU Mathematics (test) | Average Accuracy | 50 | 18 |
| Mathematical Reasoning | MATH (test) | Accuracy | 54.5 | 14 |
| Language Modeling | Maths-College ajibawa-2023 (val) | PPL | 5.9 | 6 |
| Mathematical Reasoning | GSM8K (test) | Accuracy | 56.3 | 6 |
| Mathematical Reasoning | Average (GSM8K, MATH, MMLU-MATH, MathQA) (test) | Accuracy | 51.4 | 6 |