OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

About

Linear attention and state-space models offer constant-memory alternatives to softmax attention, but often struggle with in-context associative recall. The Delta Rule mitigates this by writing each token via one step of online gradient descent. However, its step size relies on a single scalar gate that ignores the feature-wise curvature of the inner objective. We propose Online Scaled DeltaNet (OSDN), which augments the scalar gate with a diagonal preconditioner updated online via hypergradient feedback. Crucially, this right-preconditioning is algebraically equivalent to a per-feature scaling of the write-side key. This equivalence allows OSDN to strictly preserve the hardware-friendly chunkwise parallel pipeline of DeltaNet without incurring high-dimensional state overhead. Theoretically, by exploiting the exact-quadratic structure of the inner regression loss, we establish super-geometric convergence against a right-Newton comparator and prove an algorithm-aligned token-local residual contraction bound. To handle non-stationary contexts, we further introduce Adaptive Preconditioner Forgetting (APF) to dynamically refresh stale calibration. Empirically, OSDN demonstrates strong performance across scales. At the 340M-parameter scale, OSDN improves JRT-style in-context recall by 32% over DeltaNet. Scaling to 1.3B parameters, it achieves a 39% reduction in the recall residual ratio while maintaining parity on general downstream tasks (e.g., perplexity and LongBench) -- demonstrating that our online-preconditioning mechanism effectively transfers and amplifies at the billion-parameter scale.
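The equivalence between right-preconditioning and write-side key scaling can be checked numerically. The sketch below assumes the standard DeltaNet form of the update, S ← S − β (S k − v) kᵀ, which the abstract does not spell out. The function names, the fixed vector `d`, and the random test data are hypothetical and only illustrative. The hypergradient update of the preconditioner and APF are not modelled here, because the abstract gives no formula for them.

```python
import random


def matvec(S, k):
    """Return S @ k for a list-of-rows matrix S and a vector k."""
    return [sum(s * x for s, x in zip(row, k)) for row in S]


def step_right_preconditioned(S, k, v, beta, d):
    """One step S - beta * G @ diag(d), where G = (S k - v) k^T.

    G is the gradient of the inner loss 0.5 * ||S k - v||^2 (assumed form).
    """
    r = [p - t for p, t in zip(matvec(S, k), v)]   # residual S k - v
    G = [[r_i * k_j for k_j in k] for r_i in r]    # outer product r k^T
    # Right-multiplying by diag(d) scales column j of G by d[j].
    return [[S[i][j] - beta * G[i][j] * d[j] for j in range(len(k))]
            for i in range(len(v))]


def step_scaled_write_key(S, k, v, beta, d):
    """Same step written as a plain delta rule with write key d * k."""
    r = [p - t for p, t in zip(matvec(S, k), v)]     # residual uses unscaled k
    k_write = [d_j * k_j for d_j, k_j in zip(d, k)]  # per-feature scaled key
    return [[S[i][j] - beta * r[i] * k_write[j] for j in range(len(k))]
            for i in range(len(v))]


def max_abs_diff(A, B):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


# Hypothetical toy sizes and random data, chosen only for the check.
random.seed(0)
d_v, d_k = 3, 4
S = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(d_v)]
k = [random.gauss(0, 1) for _ in range(d_k)]
v = [random.gauss(0, 1) for _ in range(d_v)]
d = [random.uniform(0.5, 2.0) for _ in range(d_k)]  # stand-in diagonal preconditioner
beta = 0.5

A = step_right_preconditioned(S, k, v, beta, d)
B = step_scaled_write_key(S, k, v, beta, d)
print("max |A - B| =", max_abs_diff(A, B))
```

The two functions agree up to floating-point rounding because G · diag(d) = (S k − v)(d ⊙ k)ᵀ. Only the key used for the write is rescaled, while the residual is still computed with the original key. Under the assumed update form, this is why a diagonal preconditioner adds no state beyond the existing d_v × d_k matrix.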

Chenyu Zhou, Hongpei Li, Yuerou Liu, Jianghao Lin, Dongdong Ge, Yinyu Ye • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Language Modeling | FineWeb-Edu (val) | Perplexity: 9.1 | 51 |
| Commonsense Reasoning and Short-Context Language Understanding | Commonsense Reasoning and Short-Context Language Understanding Suite (PIQA, HellaSwag, WinoGrande, ARC-Easy, ARC-Challenge, SIQA, BoolQ, LAMBADA), zero-shot | PIQA Accuracy (zero-shot): 73.3 | 2 |
| In-context recall | JRT-style cloze (FDA and SWDE datasets), 1.3B / 100B checkpoints | FDA Accuracy: 24.1 | 2 |
| Language Modeling | LAMBADA | Perplexity (PPL): 10.98 | 2 |
| Long-context Language Understanding | LongBench English (14-task average) | LongBench Average Score: 11.6 | 2 |
| Language Modeling | WikiText | Perplexity (PPL): 18.42 | 2 |