
RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

About

Structured dilated attention offers an appealing inference-time efficiency knob: it reduces attention FLOPs and KV-cache size by a factor of the dilation size D while preserving long-range connectivity. However, we find a persistent failure mode: sparsifying a densely pretrained attention model to a dilated pattern causes severe accuracy degradation. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once and can then be flexibly switched at inference time to dilated attention (optionally with local windows) or to hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at dilation 16 and drops by only about 2-3 points at dilation 64 on commonsense-reasoning and LongBench tasks. Moreover, RAT+ outperforms standard attention when sparsified to top-k block attention. We further scale to 2.6B parameters and 200B tokens and observe the same trend.
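To make the efficiency knob concrete, the following is a minimal NumPy sketch of the dilated pattern the abstract describes: each query attends only to keys at positions that are multiples of the dilation, so the KV cache and attention score computation shrink by roughly that factor. This is an illustrative sketch under our own assumptions (single head, no local window, no recurrence), not the paper's implementation.

```python
import numpy as np

def dilated_attention(q, k, v, dilation):
    """Causal attention sketch with a structured dilated pattern.

    Only keys/values at positions 0, dilation, 2*dilation, ... are kept,
    so the KV cache and attention FLOPs drop by ~dilation. Hypothetical
    helper for illustration, not the RAT+ implementation.
    """
    T, d = q.shape
    keep = np.arange(0, T, dilation)       # positions retained in the KV cache
    k_s, v_s = k[keep], v[keep]            # sparse KV cache: ~T/dilation entries
    scores = q @ k_s.T / np.sqrt(d)
    # causal mask: query t may only attend to kept positions <= t
    mask = keep[None, :] <= np.arange(T)[:, None]
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v_s
```

With dilation 1 this reduces to ordinary dense causal attention; the paper's point is that a model pretrained densely usually cannot tolerate switching to dilation 16 or 64 at inference time, which is the gap RAT+ is designed to close.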

Xiuying Wei, Caglar Gulcehre • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Long-context Language Understanding | LongBench (test) | Average Score: 20.52 | 133 |
| Needle-In-A-Haystack Retrieval | RULER | S-NIAH-1 (Pass-Key Retrieval): 100 | 42 |
| Common Sense Reasoning | Common-sense reasoning tasks (ARC-C, ARC-E, HellaSwag, Lambada, PIQA, WinoGrande) (test) | ARC-C Accuracy: 44.88 | 16 |
| Commonsense Reasoning | Commonsense Reasoning Tasks (ARC-C, ARC-E, HellaSwag, LAMBADA, PIQA, WinoGrande) | ARC-C Accuracy: 41.47 | 13 |
