Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

About

Efficiently modeling sequences with infinite context length has long been a challenging problem. Previous approaches have either suffered from quadratic computational complexity or limited extrapolation ability in length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall recent memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and demonstrate that it significantly outperforms state-of-the-art models across a variety of benchmarks. Pretrained on sequences of 4K length, Samba shows improved perplexity at context lengths of up to 1M in a zero-shot setting. When finetuned on 4K-length sequences, Samba efficiently extrapolates to a 256K context length with perfect memory recall on the Passkey Retrieval task, and exhibits superior retrieval extrapolation on the challenging Phonebook task compared to full-attention models. As a linear-time sequence model, Samba achieves 3.73x higher throughput than Transformers with grouped-query attention for user prompts of 128K length, and a 3.64x speedup when generating 64K tokens with unlimited streaming. Our code for training on open-source data is publicly available at https://github.com/microsoft/Samba.
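The layer-wise combination described above can be illustrated with a toy sketch. The snippet below is a minimal, hedged illustration rather than the paper's implementation: it stands in for Mamba's selective SSM with a simple elementwise gated recurrence, pairs it with causal sliding-window attention, and composes the two with residual connections in a Samba-style block. The function names, the sigmoid gating, and the single-head attention are all simplifying assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def selective_scan(x, a, b):
    # Toy elementwise recurrence h_t = a_t * h_{t-1} + b_t * x_t.
    # A stand-in for Mamba's selective SSM: the gates a, b are
    # input-dependent, so the state decides what to remember.
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(len(x)):
        h = a[t] * h + b[t] * x[t]
        out[t] = h
    return out

def sliding_window_attention(x, window):
    # Causal attention where each token attends only to itself and
    # the previous (window - 1) tokens (single head, no projections).
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    keep = np.tril(np.ones((T, T))) - np.tril(np.ones((T, T)), -window)
    scores = np.where(keep > 0, scores, -np.inf)
    return softmax(scores, axis=-1) @ x

def samba_block(x, window):
    # One hybrid block: a Mamba-style layer for long-range recurrent
    # compression, then an SWA layer for precise recall of recent
    # tokens, each with a residual connection.
    a = 1.0 / (1.0 + np.exp(-x))   # assumed input-dependent decay gate
    b = 1.0 - a
    x = x + selective_scan(x, a, b)
    x = x + sliding_window_attention(x, window)
    return x
```

Because both sublayers are causal, output positions before `t` are unaffected by changes to the input at position `t`, while the recurrence carries compressed state past the attention window.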

Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, Weizhu Chen• 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Modeling | WikiText | PPL 16.13 | 732 |
| Language Modeling | LAMBADA | Accuracy 44.94 | 268 |
| Zero-shot Reasoning | PIQA | Zero-shot Accuracy 70.94 | 62 |
| Zero-shot Reasoning | WinoGrande | Accuracy 55.56 | 54 |
| Common Sense Reasoning | HellaSwag (0-shot) | Accuracy 53.42 | 34 |
| Common Sense Reasoning | ARC-Challenge (0-shot) | Accuracy 36.17 | 31 |
| Recall-intensive Retrieval | SWDE, SQuAD, FDA, TriviaQA, NQ, DROP | Performance on SWDE 33 | 24 |
| Common Sense Reasoning | ARC-Easy (0-shot) | Accuracy 68.81 | 24 |
| Sudoku Solving | Sudoku, 10k 9x9 boards (val) | Board Accuracy 90.4 | 12 |
| Zero-shot Common Sense Reasoning | BoolQ | Accuracy 62.11 | 12 |

Showing 10 of 14 rows.
