MambaByte: Token-free Selective State Space Model

About

Token-free language models learn directly from raw bytes and remove the inductive bias of subword tokenization. Operating on bytes, however, results in significantly longer sequences. In this setting, standard autoregressive Transformers scale poorly as the effective memory required grows with sequence length. The recent development of the Mamba state space model (SSM) offers an appealing alternative approach with a fixed-sized memory state and efficient decoding. We propose MambaByte, a token-free adaptation of the Mamba SSM trained autoregressively on byte sequences. In terms of modeling, we show MambaByte to be competitive with, and even to outperform, state-of-the-art subword Transformers on language modeling tasks while maintaining the benefits of token-free language models, such as robustness to noise. In terms of efficiency, we develop an adaptation of speculative decoding with tokenized drafting and byte-level verification. This results in a $2.6\times$ inference speedup over the standard MambaByte implementation, showing decoding efficiency similar to that of the subword Mamba. These findings establish the viability of SSMs in enabling token-free language modeling.

Junxiong Wang, Tushaar Gangavarapu, Jing Nathan Yan, Alexander M. Rush • 2024
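To make the efficiency claim in the abstract more concrete, below is a minimal, hypothetical sketch of speculative decoding with a fast drafter and byte-level verification. The function `speculative_decode_bytes`, its greedy acceptance rule, and the toy drafter/verifier are assumptions for illustration only; in MambaByte the draft comes from a subword Mamba model and is verified (and corrected where needed) by the byte-level model in a single parallel pass rather than byte by byte.

```python
# Hypothetical sketch of speculative decoding with a fast drafter and
# byte-level verification. Names, signatures, and the greedy acceptance
# rule are illustrative assumptions, not the paper's implementation.
from typing import Callable


def speculative_decode_bytes(
    draft_bytes: Callable[[bytes, int], bytes],   # fast drafter: (context, k) -> candidate bytes
    verify_next_byte: Callable[[bytes], int],     # byte-level verifier: context -> greedy next byte
    prompt: bytes,
    max_new_bytes: int = 64,
    draft_len: int = 16,
) -> bytes:
    """Accept the longest draft prefix the verifier agrees with, then
    append the verifier's correction at the first disagreement.

    The real method verifies a whole draft in one parallel pass of the
    byte-level model; the per-byte loop here is only for clarity.
    """
    out = bytearray(prompt)
    while len(out) - len(prompt) < max_new_bytes:
        context = bytes(out)
        draft = draft_bytes(context, draft_len)
        n_accept, correction = 0, None
        for i in range(len(draft)):
            expected = verify_next_byte(context + draft[:i])
            if expected == draft[i]:
                n_accept += 1
            else:
                correction = expected
                break
        out.extend(draft[:n_accept])
        if correction is not None:
            out.append(correction)
        elif not draft:
            # Drafter produced nothing; fall back to one verifier step.
            out.append(verify_next_byte(context))
    return bytes(out[: len(prompt) + max_new_bytes])


if __name__ == "__main__":
    # Toy demonstration with stand-in models for a fixed target string.
    target = b"MambaByte models raw bytes without a subword tokenizer."
    prompt = target[:10]

    def toy_verifier(ctx: bytes) -> int:
        # A "perfect" byte-level model for this toy target.
        return target[len(ctx)] if len(ctx) < len(target) else 0

    def toy_drafter(ctx: bytes, k: int) -> bytes:
        # A fast but imperfect drafter: copies the target, corrupting every 7th byte.
        chunk = bytearray(target[len(ctx):len(ctx) + k])
        for j in range(len(chunk)):
            if (len(ctx) + j) % 7 == 0:
                chunk[j] = ord("?")
        return bytes(chunk)

    print(speculative_decode_bytes(toy_drafter, toy_verifier, prompt,
                                   max_new_bytes=len(target) - len(prompt)))
```

The intended benefit, as described in the abstract, is that the expensive byte-level model scores a whole drafted chunk at once instead of being invoked once per generated byte, recovering subword-like decoding speed while keeping the token-free, fixed-state model.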

Related benchmarks

Task                           | Dataset         | Metric        | Result | Rank
Commonsense Reasoning          | HellaSwag       | Accuracy      | 49.21  | 1891
Commonsense Reasoning          | WinoGrande      | Accuracy      | 52.97  | 1085
Question Answering             | ARC-E           | Accuracy      | 71.53  | 416
Question Answering             | BoolQ           | Accuracy      | 72.48  | 317
Language Modeling              | PG-19 (test)    | --            | --     | 110
Question Answering             | ARC-C           | Accuracy      | 36.42  | 87
Physical Commonsense Reasoning | PIQA            | Accuracy      | 69.67  | 78
Language Modeling              | STORIES (test)  | Bits Per Byte | 0.908  | 6
