
Gated Delta Networks: Improving Mamba2 with Delta Rule

About

Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited. To address these limitations, recent work has explored two distinct mechanisms: gating for adaptive memory control and the delta update rule for precise memory modifications. We observe that these mechanisms are complementary: gating enables rapid memory erasure while the delta rule facilitates targeted updates. Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. We further enhance performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
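The combination described above can be illustrated with a minimal per-token sketch. Assuming the gated delta update takes the form S_t = α_t · S_{t−1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ (where α_t is the decay gate and β_t the writing strength; variable names here are illustrative, not the paper's implementation):

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One recurrent step of a gated delta rule (illustrative sketch).

    S:     (d_v, d_k) memory matrix
    k, v:  key (d_k,) and value (d_v,) for the current token
    alpha: scalar gate in [0, 1] -- decays the whole memory
    beta:  scalar in [0, 1] -- strength of the targeted overwrite
    """
    d_k = len(k)
    # Gating erases memory globally; the (I - beta k k^T) term removes
    # the old value stored under key k before writing the new one.
    S = alpha * S @ (np.eye(d_k) - beta * np.outer(k, k))
    # Delta-rule write: insert the new value along direction k.
    return S + beta * np.outer(v, k)

# With alpha = beta = 1 and a unit-norm key, a read S @ k recovers
# exactly the most recently written value (pure delta-rule behavior).
S = np.zeros((3, 2))
k = np.array([1.0, 0.0])
S = gated_delta_step(S, k, np.array([1.0, 2.0, 3.0]), alpha=1.0, beta=1.0)
S = gated_delta_step(S, k, np.array([4.0, 5.0, 6.0]), alpha=1.0, beta=1.0)
print(S @ k)  # old value fully replaced by the second write
```

Setting beta = 0 recovers a purely gated (Mamba2-style) decay, while alpha = 1 recovers plain DeltaNet, which is the complementarity the abstract points to.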

Songlin Yang, Jan Kautz, Ali Hatamizadeh • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multi-task Language Understanding | MMLU | Accuracy | 23 | 842 |
| Commonsense Reasoning | PIQA | Accuracy | 71.33 | 647 |
| Language Modeling | WikiText | PPL | 19.06 | 479 |
| Language Modeling | LAMBADA | Accuracy | 41.74 | 183 |
| Language Modeling | LAMBADA | Perplexity | 60.16 | 99 |
| Commonsense Reasoning | SocialIQA | Accuracy | 39.2 | 97 |
| Long-context Understanding | LongBench (test) | Avg Score | 6.86 | 80 |
| Long-context Question Answering | LongBench (test) | HotpotQA | 4.5 | 59 |
| Needle-In-A-Haystack Retrieval | RULER | S-NIAH-1 (Pass-Key Retrieval) | 100 | 42 |
| Single-Doc Question Answering | LongBench | MultifieldQA Score | 0.114 | 36 |

Showing 10 of 22 rows
