Forget Attention: Importance-Aware Attention Is All You Need

About

Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid language modeling. Transformers see everywhere but cannot prioritize; SSMs know what matters but cannot revisit. Existing hybrids -- Jamba (block level) and Hymba (head level) -- place the two in separate compartments, so neither informs the other during the attention computation itself. We propose SISA (SSM-Informed Softmax Attention), which adds an SSM-derived importance term directly inside the attention score and realizes the full operation as a single SDPA call on augmented query/key vectors -- no recurrent state, no custom kernel. At 152M parameters / 5B tokens, SISA reaches 17.3% on LAMBADA-greedy (vs. 13.9% for Transformer and 15.5% for Mamba-3) and attains 100% on NIAH from step 1K, 7x faster than the Transformer's retrieval convergence; at 369M, Mamba-3 leads on LAMBADA while SISA preserves perfect NIAH and stock-SDPA execution. SISA thus defines a third design axis for SSM-attention hybrids -- score-level fusion -- beyond the block-level and head-level paradigms that have dominated the field.
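The abstract does not give SISA's exact formula, so the following is only a minimal sketch of how a score-level term can be folded into an unmodified SDPA call. It assumes the importance term is a single additive scalar per key (the hypothetical `importance` list, which in SISA would come from an SSM; here it is just made-up numbers), and it omits causal masking and multi-head structure. The helper names `sdpa` and `augment` are illustrative, not the authors' API. The idea: append one extra dimension to each query and key so that the ordinary dot product already contains the bias.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sdpa(queries, keys, values, bias=None):
    """Non-causal scaled dot-product attention: softmax(q.k / sqrt(dim) [+ bias]) @ V.

    `bias` (one scalar per key) is only used to build the explicit reference result.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        if bias is not None:
            scores = [s + b for s, b in zip(scores, bias)]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs

def augment(queries, keys, importance):
    """Fold a per-key additive bias into one extra query/key dimension.

    Stock SDPA divides by sqrt(d + 1) after augmentation, so the original
    query entries are rescaled by sqrt((d + 1) / d) and the extra query entry
    is sqrt(d + 1). The resulting score is q.k / sqrt(d) + importance[j].
    """
    d = len(queries[0])
    rescale = math.sqrt((d + 1) / d)
    aug_q = [[x * rescale for x in q] + [math.sqrt(d + 1)] for q in queries]
    aug_k = [list(k) + [b] for k, b in zip(keys, importance)]
    return aug_q, aug_k

# Toy data (assumed, not from the paper): 2 queries, 3 keys, dim 4.
queries = [[0.2, -0.5, 1.0, 0.3], [1.5, 0.1, -0.7, 0.0]]
keys = [[0.4, 0.9, -0.2, 0.1], [-1.0, 0.3, 0.8, 0.5], [0.0, -0.6, 0.2, 1.1]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
importance = [0.0, 1.2, -0.4]  # stand-in for an SSM-derived signal

reference = sdpa(queries, keys, values, bias=importance)  # bias added explicitly
aug_q, aug_k = augment(queries, keys, importance)
fused = sdpa(aug_q, aug_k, values)                        # plain SDPA, no bias argument
```

Under these assumptions `fused` and `reference` agree up to floating-point error, which is why no custom kernel is needed for an additive term of this form: the values are untouched and only the query/key width grows by one.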

Soohyeong Shin, Yeongwook Yang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | HellaSwag Accuracy | 26.9 | 711 |
| Question Answering | ARC Easy | -- | -- | 597 |
| Sentence Completion | HellaSwag | Accuracy | 26.9 | 364 |
| Coreference Resolution | WinoGrande | Accuracy | 52.5 | 61 |
| Pronoun Resolution | WinoGrande | Accuracy | 52.5 | 58 |
| Long-context retrieval | Needle-in-a-Haystack | Retrieval Accuracy | 100 | 29 |
| Science QA | ARC Easy | Accuracy | 35.8 | 17 |
| Needle-In-A-Haystack Retrieval | NIAH | NIAH Score | 100 | 14 |
| Last-word prediction | LAMBADA | Accuracy | 17.3 | 7 |
