Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

About

Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.
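As a rough illustration of the two ingredients named in the abstract, the sketch below shows a generic elementwise gated linear recurrence and a causal sliding-window (local) attention mask. It is a minimal sketch under stated assumptions, not the Hawk or Griffin layers themselves: the per-channel gate parameters `w_gate` and `b_gate`, the update rule `h_t = a_t * h_{t-1} + (1 - a_t) * x_t`, and the function names are hypothetical choices made for illustration. The point it tries to convey is that the recurrent state has a fixed size regardless of sequence length, and that local attention only looks back over a bounded window.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_linear_recurrence(xs, w_gate, b_gate):
    """Generic elementwise gated linear recurrence (illustrative only).

    xs: list of input vectors (lists of floats), one per time step.
    w_gate, b_gate: hypothetical per-channel gate parameters (assumption).
    The state h is updated linearly: no nonlinearity is applied to h itself.
    """
    dim = len(xs[0])
    h = [0.0] * dim  # fixed-size state, independent of sequence length
    outputs = []
    for x in xs:
        new_h = []
        for j in range(dim):
            # Input-dependent gate in (0, 1) controls how much state is kept.
            a = sigmoid(w_gate[j] * x[j] + b_gate[j])
            new_h.append(a * h[j] + (1.0 - a) * x[j])
        h = new_h
        outputs.append(h)
    return outputs


def local_attention_mask(seq_len, window):
    """Causal sliding-window mask: position i may attend to position j
    only if i - window < j <= i."""
    return [
        [(i - window < j <= i) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    states = gated_linear_recurrence(
        [[1.0, 2.0], [0.0, 0.0]], w_gate=[0.0, 0.0], b_gate=[0.0, 0.0]
    )
    print(states)
    for row in local_attention_mask(4, 2):
        print(row)
```

In this sketch each recurrence step touches only the current input and the previous state, so per-token inference cost does not grow with context length, while the mask limits each attention row to at most `window` positions.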

Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando De Freitas, Caglar Gulcehre • 2024

Related benchmarks

Task	Dataset	Result
Language Modeling	PG-19	--	206
Long-range sequence modeling	Long Range Arena (LRA)	Text Accuracy 71.75	177
Physical Commonsense Reasoning	PIQA (val)	Accuracy 66.1	118
Question Answering	ARC Challenge (test)	Accuracy 25.4	103
Commonsense Reasoning	WinoGrande (val)	Accuracy 52.6	87
Multiple-choice Question Answering	ARC Easy (test)	Accuracy 48.4	68
Commonsense Reasoning	PIQA 1.0 (test)	Accuracy 81	64
Commonsense Reasoning	HellaSwag (val)	Accuracy 38.8	54
Word Prediction	LAMBADA (test)	Accuracy 37.6	53
Commonsense Reasoning	WinoGrande 1.0 (test)	Accuracy 72.6	31

Showing 10 of 27 rows
