Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

About

While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is an a refinement of Mamba's selective SSM that is 2-8X faster, while continuing to be competitive with Transformers on language modeling.

Tri Dao, Albert Gu• 2024

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	HellaSwag	Accuracy53.9	1896
Commonsense Reasoning	WinoGrande	Accuracy72.3	1442
Question Answering	ARC Challenge	Accuracy46.8	906
Multi-task Language Understanding	MMLU	Accuracy23	881
Instruction Following	IFEval	IFEval Accuracy43.7	836
Commonsense Reasoning	PIQA	Accuracy73.6	757
Language Modeling	WikiText	PPL15.9	740
Commonsense Reasoning	HellaSwag	HellaSwag Accuracy61.4	711
Physical Commonsense Reasoning	PIQA	Accuracy81	696
Question Answering	ARC Challenge	Accuracy (ARC)42.7	598

Showing 10 of 280 rows

...

Other info

Follow for update

@wizwand_team Discord