Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

About

Efficiently modeling sequences with infinite context length has long been a challenging problem. Previous approaches have either suffered from quadratic computational complexity or limited extrapolation ability in length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall recent memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and demonstrate that it significantly outperforms state-of-the-art models across a variety of benchmarks. Pretrained on sequences of 4K length, Samba shows improved perplexity at context lengths of up to 1M in a zero-shot setting. When finetuned on 4K-length sequences, Samba efficiently extrapolates to a 256K context length with perfect memory recall on the Passkey Retrieval task, and exhibits superior retrieval extrapolation on the challenging Phonebook task compared to full-attention models. As a linear-time sequence model, Samba achieves 3.73x higher throughput than Transformers with grouped-query attention for user prompts of 128K length, and a 3.64x speedup when generating 64K tokens with unlimited streaming. Our code for training on open-source data is publicly available at https://github.com/microsoft/Samba.
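The layer-wise combination described above can be illustrated with a toy sketch. The snippet below is a minimal, hedged illustration rather than the paper's implementation: it stands in for Mamba's selective SSM with a simple elementwise gated recurrence, pairs it with causal sliding-window attention, and composes the two with residual connections in a Samba-style block. The function names, the sigmoid gating, and the single-head attention are all simplifying assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def selective_scan(x, a, b):
    # Toy elementwise recurrence h_t = a_t * h_{t-1} + b_t * x_t.
    # A stand-in for Mamba's selective SSM: the gates a, b are
    # input-dependent, so the state decides what to remember.
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(len(x)):
        h = a[t] * h + b[t] * x[t]
        out[t] = h
    return out

def sliding_window_attention(x, window):
    # Causal attention where each token attends only to itself and
    # the previous (window - 1) tokens (single head, no projections).
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    keep = np.tril(np.ones((T, T))) - np.tril(np.ones((T, T)), -window)
    scores = np.where(keep > 0, scores, -np.inf)
    return softmax(scores, axis=-1) @ x

def samba_block(x, window):
    # One hybrid block: a Mamba-style layer for long-range recurrent
    # compression, then an SWA layer for precise recall of recent
    # tokens, each with a residual connection.
    a = 1.0 / (1.0 + np.exp(-x))   # assumed input-dependent decay gate
    b = 1.0 - a
    x = x + selective_scan(x, a, b)
    x = x + sliding_window_attention(x, window)
    return x
```

Because both sublayers are causal, output positions before `t` are unaffected by changes to the input at position `t`, while the recurrence carries compressed state past the attention window.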

Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, Weizhu Chen• 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Modeling | WikiText | PPL 16.13 | 732 |
| Language Modeling | LAMBADA | Accuracy 44.94 | 268 |
| Zero-shot Reasoning | PIQA | Zero-shot Accuracy 70.94 | 62 |
| Zero-shot Reasoning | WinoGrande | Accuracy 55.56 | 54 |
| Common Sense Reasoning | HellaSwag (0-shot) | Accuracy 53.42 | 34 |
| Common Sense Reasoning | ARC-Challenge (0-shot) | Accuracy 36.17 | 31 |
| Recall-intensive Retrieval | SWDE, SQuAD, FDA, TriviaQA, NQ, DROP | Performance on SWDE 33 | 24 |
| Common Sense Reasoning | ARC-Easy (0-shot) | Accuracy 68.81 | 24 |
| Sudoku Solving | Sudoku, 10k 9x9 boards (val) | Board Accuracy 90.4 | 12 |
| Zero-shot Common Sense Reasoning | BoolQ | Accuracy 62.11 | 12 |

Showing 10 of 14 rows.
