Block-State Transformers
About
State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies, and they scale efficiently to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous signals, SSMs have shown superior performance on a plethora of tasks in vision and audio; however, SSMs still lag behind Transformers on language modeling tasks. In this work, we propose a hybrid layer named Block-State Transformer (BST) that internally combines an SSM sublayer for long-range contextualization and a Block Transformer sublayer for short-term representation of sequences. We study three different, completely parallelizable variants that integrate SSMs and block-wise attention. We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences. In addition, the Block-State Transformer demonstrates a more than tenfold increase in speed at the layer level compared to the Block-Recurrent Transformer when model parallelization is employed.
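To make the division of labor in the abstract concrete, here is a minimal NumPy sketch of the general idea: a linear state-space recurrence summarizes everything seen so far, and causal attention inside each fixed-length block additionally attends to one context vector taken from that summary. It is an illustration under simplifying assumptions, not any of the paper's three variants: the names `ssm_scan` and `block_state_layer`, the scalar SSM parameters `a` and `b`, the single context vector per block, and the absence of learned projections and multiple heads are all hypothetical choices made for brevity.

```python
import numpy as np


def ssm_scan(x, a, b):
    """Diagonal linear recurrence h_t = a * h_{t-1} + b * x_t (assumed toy SSM)."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out[t] = h
    return out


def softmax(z, axis=-1):
    """Numerically stable softmax; masked entries (-inf) get weight 0."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def block_state_layer(x, a, b, block_len):
    """Block-wise causal attention with one SSM-derived context vector per block.

    x: array of shape (seq_len, d); seq_len must be a multiple of block_len.
    """
    seq_len, d = x.shape
    assert seq_len % block_len == 0
    context = ssm_scan(x, a, b)  # long-range summary at every position
    causal = np.tril(np.ones((block_len, block_len), dtype=bool))
    # Every query may see the context slot (column 0) plus its causal prefix.
    mask = np.concatenate([np.ones((block_len, 1), dtype=bool), causal], axis=1)
    out = np.empty_like(x)
    for start in range(0, seq_len, block_len):
        blk = x[start:start + block_len]
        # Context for this block: SSM state just before the block begins
        # (zeros for the first block, which has no history).
        ctx = context[start - 1:start] if start > 0 else np.zeros((1, d))
        keys = np.concatenate([ctx, blk], axis=0)  # (1 + block_len, d)
        scores = blk @ keys.T / np.sqrt(d)
        scores = np.where(mask, scores, -np.inf)
        # Residual connection around the attention output.
        out[start:start + block_len] = blk + softmax(scores) @ keys
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
y = block_state_layer(x, a=0.9, b=0.5, block_len=4)
print(y.shape)
```

Because each block only needs the SSM output from before its first token, the per-block attention calls are independent of one another. In this sketch the Python loop over blocks could therefore be batched, and only the toy sequential scan would need replacing with a parallel formulation.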
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long-range sequence modeling | Long Range Arena (LRA) (test) | Accuracy (Avg) | 86.96 | 158 |
| Language Modeling | arXiv (test) | PPL | 2.41 | 137 |
| Language Modeling | PG-19 (test) | Perplexity | 10.37 | 106 |
| Language Modeling | GitHub (val) | Perplexity | 1.83 | 13 |