Language Modeling Is Compression

About

It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.

Grégoire Delétang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, Marcus Hutter, Joel Veness • 2023
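
The abstract's last claim, that any compressor can be turned into a conditional generative model, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the byte-level alphabet, and the greedy decoding rule are all assumptions made for illustration. The idea is to score each candidate next byte by how short gzip makes the context plus that byte, then turn the code lengths into probabilities proportional to 2^(-length in bits).

```python
import gzip


def compressed_size(data: bytes) -> int:
    """Size in bytes of the gzip-compressed form of `data`."""
    # mtime=0 keeps the gzip header deterministic across calls.
    return len(gzip.compress(data, compresslevel=9, mtime=0))


def compression_rate(data: bytes) -> float:
    """Compressed size divided by raw size (lower is better)."""
    return compressed_size(data) / len(data)


def next_byte_distribution(context: bytes, alphabet: bytes) -> dict:
    """Hypothetical helper: probability of each candidate next byte.

    Each candidate is scored by the compressed length (in bits) of
    context + candidate; shorter code lengths get higher probability.
    """
    bits = {b: 8 * compressed_size(context + bytes([b])) for b in alphabet}
    shortest = min(bits.values())
    # Subtract the shortest length before exponentiating to avoid underflow.
    weights = {b: 2.0 ** -(n - shortest) for b, n in bits.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


def greedy_continue(context: bytes, alphabet: bytes, steps: int) -> bytes:
    """Extend `context` by `steps` bytes, picking the most probable byte each time."""
    out = context
    for _ in range(steps):
        dist = next_byte_distribution(out, alphabet)
        best = max(dist, key=dist.get)  # ties resolve to the first byte in `alphabet`
        out += bytes([best])
    return out


if __name__ == "__main__":
    text = b"the quick brown fox jumps over the lazy dog. " * 20
    print("compression rate:", compression_rate(text))
    print(greedy_continue(text, b"abcdefghijklmnopqrstuvwxyz .", steps=10))
```

Because gzip reports sizes in whole bytes, many candidates tie and the resulting distribution is coarse; a compressor with finer-grained code lengths would give a sharper model. The `compression_rate` helper uses the same convention as the percentages quoted in the abstract (compressed size as a fraction of raw size).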

Related benchmarks

Task	Dataset	Metric	Result	
Lossless Compression	ObjectFolder	Bits/Byte	3.465	33
Lossless Compression	TouchandGo	Bits/Byte	2.055	33
Lossless Compression	Kodak	Bits per Byte	4.862	31
Lossless Image Compression	CLIC m	bpp	0.5292	29
Lossless Image Compression	DIV2K	BPD	4.378	25
Lossless Compression	ObjectFolder cross-dataset 2.0	Bits/Byte	3.659	18
Lossless Compression	ActiveCloth (cross-dataset)	Bits/Byte	2.62	18
Lossless Image Compression	CLIC p	Bits per Byte	4.29	18
Lossless Compression	ObjTac	Bits per Byte	0.54	17
Lossless Compression	SSVTP	Bits per Byte	1.975	17

Showing 10 of 38 rows
