Block-State Transformers

About

State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies, and they scale efficiently to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous signals, SSMs have shown superior performance on a plethora of tasks in vision and audio; however, SSMs still lag behind Transformers in language modeling tasks. In this work, we propose a hybrid layer named the Block-State Transformer (BST) that internally combines an SSM sublayer for long-range contextualization and a Block Transformer sublayer for short-term representation of sequences. We study three different, fully parallelizable variants that integrate SSMs and block-wise attention. We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences. In addition, the Block-State Transformer demonstrates a more than tenfold increase in speed at the layer level compared to the Block-Recurrent Transformer when model parallelization is employed.

Mahan Fathi, Jonathan Pilault, Orhan Firat, Christopher Pal, Pierre-Luc Bacon, Ross Goroshin • 2023
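
To make the hybrid layer described in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a diagonal linear SSM evaluated with a plain sequential scan (the paper's variants are parallelizable; the loop is only for readability), identity query/key/value projections, and a context made of the SSM outputs at the last `num_ctx` positions before each block. All function and parameter names (`ssm_scan`, `block_state_layer`, `block_len`, `num_ctx`) are hypothetical.

```python
import numpy as np

def ssm_scan(x, a, B, C):
    """Diagonal linear SSM: h_t = a * h_{t-1} + B @ x_t, y_t = C @ h_t."""
    h = np.zeros(a.shape[0])
    ys = []
    for x_t in x:                      # sequential scan, for clarity only
        h = a * h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)                # (seq_len, d)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def block_state_layer(x, a, B, C, block_len, num_ctx):
    """Block-wise attention over [SSM context states; tokens of the block]."""
    seq_len, d = x.shape
    assert seq_len % block_len == 0 and num_ctx <= block_len
    ctx_all = ssm_scan(x, a, B, C)     # long-range context from the SSM sublayer
    # Causal mask inside a block; context states are visible to every token.
    causal = np.triu(np.ones((block_len, block_len), dtype=bool), k=1)
    mask = np.concatenate([np.zeros((block_len, num_ctx), dtype=bool), causal], axis=1)
    out = np.zeros_like(x)
    for start in range(0, seq_len, block_len):
        block = x[start:start + block_len]
        # Assumption: context = SSM outputs just before the block (zeros for block 0).
        ctx = ctx_all[start - num_ctx:start] if start > 0 else np.zeros((num_ctx, d))
        keys = np.concatenate([ctx, block], axis=0)
        scores = block @ keys.T / np.sqrt(d)
        scores = np.where(mask, -np.inf, scores)
        out[start:start + block_len] = block + softmax(scores) @ keys  # residual
    return out
```

In this sketch each token attends only to earlier tokens in its own block and to SSM states computed from earlier blocks, so the layer stays causal while the attention cost grows with the block length rather than the full sequence length.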

Related benchmarks

Task	Dataset	Result
Long-range sequence modeling	Long Range Arena (LRA) (test)	Accuracy (Avg) 86.96	163
Language Modeling	arXiv (test)	PPL 2.41	145
Language Modeling	PG-19 (test)	Perplexity 10.37	112
Language Modeling	GitHub (val)	Perplexity 1.83	13
