
The Hidden Attention of Mamba Models

About

The Mamba layer offers an efficient selective state space model (SSM) that is highly effective at modeling multiple domains, including NLP, long-range sequence processing, and computer vision. Selective SSMs are viewed as dual models: they are trained in parallel over the entire sequence via an IO-aware parallel scan and deployed autoregressively. We add a third view and show that such models can also be viewed as attention-driven models. This new perspective enables us to empirically and theoretically compare the underlying mechanisms to those of the self-attention layers in transformers, and allows us to peer inside the inner workings of the Mamba model with explainability methods. Our code is publicly available.
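The attention view rests on unrolling the selective SSM recurrence: since each output is a weighted sum of all earlier inputs, the weights form a causal, attention-like matrix. The following sketch illustrates this equivalence on a toy scalar-state model with invented random values (the per-step scalars `A`, `B`, `C` stand in for the input-dependent discretized SSM parameters; real Mamba layers use multi-dimensional states and channels):

```python
import numpy as np

# Toy selective SSM with a scalar state, illustrating the "hidden attention"
# view. The recurrence
#   h_t = A_t * h_{t-1} + B_t * x_t,    y_t = C_t * h_t
# unrolls into
#   y_t = sum_{s<=t} alpha_{t,s} * x_s,
#   alpha_{t,s} = C_t * (prod_{k=s+1..t} A_k) * B_s,
# i.e. a lower-triangular (causal) attention-like matrix alpha.
rng = np.random.default_rng(0)
T = 5
A = rng.uniform(0.5, 1.0, T)  # input-dependent decay per step (toy scalars)
B = rng.normal(size=T)        # toy input projections
C = rng.normal(size=T)        # toy output projections
x = rng.normal(size=T)        # toy input sequence

# 1) Run the recurrence directly (the autoregressive deployment view).
h, y_rec = 0.0, []
for t in range(T):
    h = A[t] * h + B[t] * x[t]
    y_rec.append(C[t] * h)
y_rec = np.array(y_rec)

# 2) Build the equivalent hidden attention matrix and apply it in one shot.
alpha = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        # np.prod over an empty slice is 1, so alpha[t, t] = C[t] * B[t].
        alpha[t, s] = C[t] * np.prod(A[s + 1 : t + 1]) * B[s]
y_att = alpha @ x

assert np.allclose(y_rec, y_att)  # both views produce the same outputs
```

Because `alpha` is strictly causal and input-dependent, it can be inspected and attributed much like a transformer's attention matrix, which is what enables the explainability comparison above.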

Ameen Ali, Itamar Zimerman, Lior Wolf • 2024

Related benchmarks

Task                      Dataset                           Metric          Result  Rank
Word Alignment            RWTH Gold Alignment de-en (test)  AER             0.7     31
Explanation Faithfulness  Med-BIOS                          Delta AF        5.326   24
Explanation Faithfulness  Emotion                           Delta AF Score  4.706   24
Explanation Faithfulness  SNLI                              Delta AF        0.554   24
Explanation Faithfulness  SST-2                             Delta AF        0.341   24
Token Alignment           IWSLT Fr-En 2017 (test)           AER             66      22
Token Alignment           IWSLT DE→EN 2017 (test)           AER             0.72    22
Copying                   Copying task                      AUC             84      11
Explanation Faithfulness  ImageNet                          Delta AF        2.427   8

Other info

Code
