
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

About

Autoregressive transformers are spectacular models for short sequences but scale poorly to long sequences such as high-resolution images, podcasts, code, or books. We propose MEGABYTE, a multiscale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes. MEGABYTE segments sequences into patches and uses a local submodel within patches and a global model between patches. This enables sub-quadratic self-attention, much larger feedforward layers for the same compute, and improved parallelism during decoding, unlocking better performance at reduced cost for both training and generation. Extensive experiments show that MEGABYTE allows byte-level models to perform competitively with subword models on long-context language modeling, achieve state-of-the-art density estimation on ImageNet, and model audio from raw files. Together, these results establish the viability of tokenization-free autoregressive sequence modeling at scale.
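The patch-based decomposition above can be sketched in a few lines. The following is a minimal toy forward pass, not the paper's implementation: random projections stand in for the global and local transformers, and all function names (`patchify`, `megabyte_sketch`) are hypothetical. It only illustrates the data flow: bytes are grouped into fixed-size patches, a causal "global" stage mixes information across patch embeddings, and a "local" stage produces per-byte logits within each patch.

```python
import numpy as np

def patchify(byte_seq, patch_size, pad_byte=0):
    """Segment a byte sequence into fixed-size patches,
    padding the tail so every patch has exactly patch_size bytes."""
    seq = np.asarray(byte_seq, dtype=np.int64)
    pad = (-len(seq)) % patch_size
    seq = np.pad(seq, (0, pad), constant_values=pad_byte)
    return seq.reshape(-1, patch_size)  # (num_patches, patch_size)

def megabyte_sketch(byte_seq, patch_size, d_model=8, seed=0):
    """Toy forward pass mimicking MEGABYTE's two-stage structure.
    Random linear maps replace the actual transformer blocks."""
    rng = np.random.default_rng(seed)
    patches = patchify(byte_seq, patch_size)           # (K, P)
    K, P = patches.shape

    emb = rng.standard_normal((256, d_model))          # byte embedding table
    byte_emb = emb[patches]                            # (K, P, d_model)

    # "Global model": one representation per patch, causally mixed so that
    # patch k only sees patches 0..k-1 (stub for cross-patch attention).
    W_g = rng.standard_normal((P * d_model, d_model))
    g = byte_emb.reshape(K, P * d_model) @ W_g         # (K, d_model)
    global_ctx = np.vstack([np.zeros((1, d_model)),
                            np.cumsum(g, axis=0)[:-1]])

    # "Local model": per-byte logits within each patch, conditioned on
    # the global context for that patch (stub for the local transformer).
    W_l = rng.standard_normal((d_model, 256))
    logits = byte_emb @ W_l + (global_ctx @ W_l)[:, None, :]
    return logits                                      # (K, P, 256)
```

Because the local stage is applied independently per patch, the K patch computations can run in parallel, which is the source of the decoding speedup described above.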

Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, Mike Lewis • 2023

Related benchmarks

Task                 Dataset                 Metric               Result   Rank
Language Modeling    PG-19 (test)            Perplexity           36.4     106
Density Estimation   ImageNet 64x64 (test)   Bits Per Sub-Pixel   3.4      62
Language Modeling    PG-19 (val)             Perplexity           42.8     19
Language Modeling    STORIES (test)          Bits Per Byte        0.978    6
