RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

About

Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. While prior work studies it by training each configuration from scratch, directly sparsifying a pretrained attention model into a dilated pattern leads to severe accuracy degradation, preventing flexible reuse across inference scenarios. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once and can then be flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at D = 16, and drops by about 2-3 points at D = 64 on commonsense reasoning and LongBench tasks. We further scale to 2.6B and 7.6B parameters and observe even more promising performance (e.g., a 1-point average accuracy loss with a 64x reduction in attention FLOPs and KV cache size). Code is available at https://github.com/wimh966/rat-plus.

Xiuying Wei, Caglar Gulcehre• 2026

Related benchmarks

Task	Dataset	Result
Long-context Language Understanding	LongBench (test)	Average Score20.52	147
Needle-In-A-Haystack Retrieval	RULER	S-NIAH-1 (Pass-Key Retrieval)100	94
Commonsense Reasoning	Commonsense Reasoning Tasks (ARC-C, ARC-E, HellaSwag, LAMBADA, PIQA, WinoGrande)	ARC-C Accuracy41.47	25
Common Sense Reasoning	Common-sense reasoning tasks (ARC-C, ARC-E, HellaSwag, Lambada, PIQA, WinoGrande) (test)	ARC-C Accuracy44.88	16

Showing 4 of 4 rows

Other info

Follow for update

@wizwand_team Discord